hezar.preprocessors.tokenizers.bpe module
- class hezar.preprocessors.tokenizers.bpe.BPEConfig(max_length: 'int' = 'deprecated', truncation: 'str' = 'deprecated', truncation_side: str = 'right', padding: 'str' = 'deprecated', padding_side: str = 'right', stride: int = 0, pad_to_multiple_of: int = 0, pad_token_type_id: 'int' = 0, bos_token: str = '<s>', eos_token: str = '</s>', unk_token: str = '<unk>', sep_token: str = '<sep>', pad_token: str = '<pad>', cls_token: str = '<cls>', mask_token: str = '<mask>', additional_special_tokens: List[str] = None, dropout: float = None, continuing_subword_prefix: str = '', end_of_word_suffix: str = '', fuse_unk: bool = False, vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: list = <factory>, show_progress: bool = True)
- Bases: TokenizerConfig
 - additional_special_tokens: List[str] = None
 - bos_token: str = '<s>'
 - cls_token: str = '<cls>'
 - continuing_subword_prefix: str = ''
 - dropout: float = None
 - end_of_word_suffix: str = ''
 - eos_token: str = '</s>'
 - fuse_unk: bool = False
 - initial_alphabet: list
 - limit_alphabet: int = 1000
 - mask_token: str = '<mask>'
 - min_frequency: int = 2
 - name: str = 'bpe_tokenizer'
 - pad_to_multiple_of: int = 0
 - pad_token: str = '<pad>'
 - padding_side: str = 'right'
 - sep_token: str = '<sep>'
 - show_progress: bool = True
 - stride: int = 0
 - truncation_side: str = 'right'
 - unk_token: str = '<unk>'
 - vocab_size: int = 30000
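
The fields `truncation_side`, `padding_side` and `pad_to_multiple_of` listed above control how a sequence of token ids is cut or extended to a target length. The following is a minimal pure-Python sketch of one plausible reading of those fields, following the conventions commonly used by tokenizer libraries. It is not Hezar's implementation: `pad_and_truncate`, `length` and `pad_id` are hypothetical names, and treating `pad_to_multiple_of = 0` as "disabled" is an assumption based on the default shown above.

```python
def pad_and_truncate(ids, length, pad_id=0, truncation_side="right",
                     padding_side="right", pad_to_multiple_of=0):
    """Return (ids, attention_mask) cut or padded to `length`.

    Hypothetical helper for illustration only; not part of Hezar.
    """
    ids = list(ids)

    # Truncation: drop tokens from the chosen side when the input is too long.
    if len(ids) > length:
        if truncation_side == "right":
            ids = ids[:length]
        else:
            ids = ids[len(ids) - length:]

    # Assumption: 0 disables rounding; otherwise round the target length up
    # to the next multiple of `pad_to_multiple_of`.
    target = length
    if pad_to_multiple_of > 0:
        target = -(-length // pad_to_multiple_of) * pad_to_multiple_of

    # Padding: add pad ids on the chosen side; the mask marks real tokens with 1.
    n_pad = target - len(ids)
    mask = [1] * len(ids)
    if padding_side == "right":
        return ids + [pad_id] * n_pad, mask + [0] * n_pad
    return [pad_id] * n_pad + ids, [0] * n_pad + mask
```

With the defaults from the config (`'right'` for both sides), surplus tokens are dropped from the end of the sequence and padding is appended after the real tokens.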
 
- class hezar.preprocessors.tokenizers.bpe.BPETokenizer(config, tokenizer_file=None, **kwargs)
- Bases: Tokenizer
 A standard byte-level BPE tokenizer using 🤗 Hugging Face Tokenizers.
 Parameters:
 - config – Preprocessor config for the tokenizer
 - **kwargs – Extra/manual config parameters
 
 - token_ids_name = 'token_ids'
 - tokenizer_config_filename = 'tokenizer_config.yaml'
 - tokenizer_filename = 'tokenizer.json'
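
To give a feel for what the `vocab_size` and `min_frequency` fields of `BPEConfig` mean during training, here is a toy sketch of the BPE merge-learning loop in plain Python. It works on characters rather than bytes and is only an illustration of the general algorithm; the real tokenizer delegates training to Hugging Face Tokenizers, and `learn_bpe_merges` is a hypothetical name introduced for this example.

```python
from collections import Counter


def learn_bpe_merges(words, vocab_size, min_frequency=2):
    """Learn BPE merges from a list of words (toy, character-level sketch)."""
    # Represent every word as a tuple of symbols, keeping its corpus count.
    word_counts = Counter(tuple(w) for w in words)
    vocab = {symbol for word in word_counts for symbol in word}
    merges = []

    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break

        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best, freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break  # no remaining pair is frequent enough to merge

        merges.append(best)
        vocab.add(best[0] + best[1])

        # Rewrite every word, replacing occurrences of the pair with one symbol.
        new_counts = Counter()
        for symbols, count in word_counts.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_counts[tuple(merged)] += count
        word_counts = new_counts

    return merges, vocab
```

In this sketch, training stops as soon as the vocabulary reaches `vocab_size` or the best remaining pair occurs fewer than `min_frequency` times, which is why both values bound the size of the learned vocabulary.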