hezar.preprocessors.tokenizers.sentencepiece_bpe module
- class hezar.preprocessors.tokenizers.sentencepiece_bpe.SentencePieceBPEConfig(max_length: int = 'deprecated', truncation: str = 'deprecated', truncation_side: str = 'right', padding: str = 'deprecated', padding_side: str = 'right', stride: int = 0, pad_to_multiple_of: int = 0, pad_token_type_id: int = 0, bos_token: str = '<s>', eos_token: str = '</s>', unk_token: str = '<unk>', sep_token: str = '<sep>', pad_token: str = '<pad>', cls_token: str = '<cls>', mask_token: str = '<mask>', additional_special_tokens: List[str] = None, dropout: float = None, continuing_subword_prefix: str = '', replacement: str = '▁', add_prefix_space: bool = True, end_of_word_suffix: str = '', fuse_unk: bool = False, vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: list = <factory>, show_progress: bool = True)[source]
Bases: TokenizerConfig
- add_prefix_space: bool = True
- additional_special_tokens: List[str] = None
- bos_token: str = '<s>'
- cls_token: str = '<cls>'
- continuing_subword_prefix: str = ''
- dropout: float = None
- end_of_word_suffix: str = ''
- eos_token: str = '</s>'
- fuse_unk: bool = False
- initial_alphabet: list
- limit_alphabet: int = 1000
- mask_token: str = '<mask>'
- min_frequency: int = 2
- name: str = 'sentencepiece_bpe_tokenizer'
- pad_to_multiple_of: int = 0
- pad_token: str = '<pad>'
- padding_side: str = 'right'
- replacement: str = '▁'
- sep_token: str = '<sep>'
- show_progress: bool = True
- stride: int = 0
- truncation_side: str = 'right'
- unk_token: str = '<unk>'
- vocab_size: int = 30000
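Since the config is a regular dataclass, any of the fields above can be overridden at construction time. A minimal sketch; the field values chosen here are illustrative assumptions, not recommendations:

```python
from hezar.preprocessors.tokenizers.sentencepiece_bpe import SentencePieceBPEConfig

# Override a few training-related fields; all other fields keep the
# defaults documented above.
config = SentencePieceBPEConfig(
    vocab_size=16000,                     # smaller than the 30000 default
    min_frequency=3,                      # keep pairs seen at least 3 times
    additional_special_tokens=["<url>"],  # added on top of <s>, </s>, <unk>, ...
)
print(config.name)  # 'sentencepiece_bpe_tokenizer'
```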
- class hezar.preprocessors.tokenizers.sentencepiece_bpe.SentencePieceBPETokenizer(config, tokenizer_file=None, **kwargs)[source]
Bases: Tokenizer
A standard SentencePiece BPE tokenizer built on the 🤗 HuggingFace Tokenizers library.
- Parameters:
config – Preprocessor config for the tokenizer
**kwargs – Extra/manual config parameters
- token_ids_name = 'token_ids'
- tokenizer_config_filename = 'tokenizer_config.yaml'
- tokenizer_filename = 'tokenizer.json'
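A minimal usage sketch, assuming the usual hezar tokenizer flow: the `tokenizer_file` path below is hypothetical, and the `token_ids` output key comes from `token_ids_name` above.

```python
from hezar.preprocessors.tokenizers.sentencepiece_bpe import (
    SentencePieceBPEConfig,
    SentencePieceBPETokenizer,
)

# Build a tokenizer from a local `tokenizer.json` produced by a previous
# training run (the path is hypothetical).
tokenizer = SentencePieceBPETokenizer(
    SentencePieceBPEConfig(),
    tokenizer_file="path/to/tokenizer.json",
)

# hezar tokenizers are callable; the encoded output is keyed by
# `token_ids_name`, i.e. "token_ids".
encoded = tokenizer(["This is a test."])
print(encoded["token_ids"])
```

Pretrained tokenizers hosted on the Hub are normally loaded through hezar's generic `Preprocessor.load` entry point rather than constructed by hand.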