hezar.preprocessors.tokenizers.bpe module

class hezar.preprocessors.tokenizers.bpe.BPEConfig(max_length: int = 512, truncation_strategy: str = 'longest_first', truncation_direction: str = 'right', stride: int = 0, padding_strategy: str = 'longest', padding_direction: str = 'right', pad_to_multiple_of: int = 0, pad_token_type_id: 'int' = 0, bos_token: str = '<s>', eos_token: str = '</s>', unk_token: str = '<unk>', sep_token: str = '<sep>', pad_token: str = '<pad>', cls_token: str = '<cls>', mask_token: str = '<mask>', additional_special_tokens: List[str] = None, dropout: float = None, continuing_subword_prefix: str = '', end_of_word_suffix: str = '', fuse_unk: bool = False, vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: list = <factory>, show_progress: bool = True)[source]

Bases: TokenizerConfig

additional_special_tokens: List[str] = None
bos_token: str = '<s>'
cls_token: str = '<cls>'
continuing_subword_prefix: str = ''
dropout: float = None
end_of_word_suffix: str = ''
eos_token: str = '</s>'
fuse_unk: bool = False
initial_alphabet: list
limit_alphabet: int = 1000
mask_token: str = '<mask>'
max_length: int = 512
min_frequency: int = 2
name: str = 'bpe_tokenizer'
pad_to_multiple_of: int = 0
pad_token: str = '<pad>'
padding_direction: str = 'right'
padding_strategy: str = 'longest'
sep_token: str = '<sep>'
show_progress: bool = True
stride: int = 0
truncation_direction: str = 'right'
truncation_strategy: str = 'longest_first'
unk_token: str = '<unk>'
vocab_size: int = 30000
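
A minimal configuration sketch (the import path follows this page's module path; the override values are illustrative only, and all other fields keep the defaults listed above):

    from hezar.preprocessors.tokenizers.bpe import BPEConfig

    # Override a few fields; everything else keeps the documented defaults.
    config = BPEConfig(
        vocab_size=30000,
        min_frequency=2,
        unk_token="<unk>",
        pad_token="<pad>",
    )
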
class hezar.preprocessors.tokenizers.bpe.BPETokenizer(config, tokenizer_file=None, **kwargs)[source]

Bases: Tokenizer

A standard Byte-level BPE tokenizer built on the 🤗 HuggingFace Tokenizers library.

Parameters:
  • config – Preprocessor config for the tokenizer

  • **kwargs – Extra/manual config parameters

build()[source]

Build the tokenizer.

Returns:

The built tokenizer.

Return type:

HFTokenizer
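
A hedged sketch of constructing a fresh tokenizer from a config and building the backend object (the exact training/loading workflow may differ in practice):

    from hezar.preprocessors.tokenizers.bpe import BPEConfig, BPETokenizer

    # Instantiate an (untrained) BPE tokenizer from a config.
    tokenizer = BPETokenizer(BPEConfig(vocab_size=30000))

    # build() assembles and returns the backend HuggingFace Tokenizers object.
    hf_tokenizer = tokenizer.build()
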

required_backends: List[str | Backends] = [Backends.TOKENIZERS]
token_ids_name = 'token_ids'
tokenizer_config_filename = 'tokenizer_config.yaml'
tokenizer_filename = 'tokenizer.json'
train(files: List[str], **train_kwargs)[source]

Train the tokenizer model on the given text files.
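
For example, training on a small set of plain-text corpus files might look like this (file paths are placeholders):

    from hezar.preprocessors.tokenizers.bpe import BPEConfig, BPETokenizer

    tokenizer = BPETokenizer(BPEConfig(vocab_size=30000, min_frequency=2))
    # Train the BPE model on raw text files (placeholder paths).
    tokenizer.train(files=["corpus_part1.txt", "corpus_part2.txt"])
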

train_from_iterator(dataset: List[str], **train_kwargs)[source]

Train the tokenizer model on the given dataset (a list of raw text samples) instead of files.
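
Similarly, a sketch of training from an in-memory list of text samples rather than files on disk (the sample strings are illustrative):

    from hezar.preprocessors.tokenizers.bpe import BPEConfig, BPETokenizer

    dataset = [
        "BPE merges frequent byte pairs into subword tokens.",
        "Each sample here is a raw text string.",
    ]
    tokenizer = BPETokenizer(BPEConfig(vocab_size=30000))
    # Train directly from the iterable of strings.
    tokenizer.train_from_iterator(dataset)
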