hezar.preprocessors.tokenizers.wordpiece module
- class hezar.preprocessors.tokenizers.wordpiece.WordPieceConfig(max_length: 'int' = 'deprecated', truncation: 'str' = 'deprecated', truncation_side: str = 'right', padding: 'str' = 'deprecated', padding_side: str = 'right', stride: int = 0, pad_to_multiple_of: int = 0, pad_token_type_id: int = 0, bos_token: 'str' = None, eos_token: 'str' = None, unk_token: str = '[UNK]', sep_token: str = '[SEP]', pad_token: str = '[PAD]', cls_token: str = '[CLS]', mask_token: str = '[MASK]', additional_special_tokens: List[str] = None, wordpieces_prefix: str = '##', vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: list = <factory>, show_progress: bool = True)
Bases:
TokenizerConfig
- additional_special_tokens: List[str] = None
- cls_token: str = '[CLS]'
- initial_alphabet: list
- limit_alphabet: int = 1000
- mask_token: str = '[MASK]'
- min_frequency: int = 2
- name: str = 'wordpiece_tokenizer'
- pad_to_multiple_of: int = 0
- pad_token: str = '[PAD]'
- pad_token_type_id: int = 0
- padding_side: str = 'right'
- sep_token: str = '[SEP]'
- show_progress: bool = True
- stride: int = 0
- truncation_side: str = 'right'
- unk_token: str = '[UNK]'
- vocab_size: int = 30000
- wordpieces_prefix: str = '##'
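The `unk_token` and `wordpieces_prefix` fields above drive the core WordPiece segmentation: a word is split greedily into the longest vocabulary matches, with non-initial pieces carrying the `##` prefix, and falls back to `[UNK]` when no segmentation exists. A minimal, self-contained sketch of that algorithm (using a toy vocabulary, not hezar's actual implementation, which delegates to 🤗 Tokenizers):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = prefix + sub  # non-initial pieces carry the '##' prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            # No segmentation exists: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary for illustration only
vocab = {"play", "##ing", "##ed", "un"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", vocab))      # ['[UNK]']
```

Training-side fields such as `vocab_size`, `min_frequency`, and `limit_alphabet` control how that vocabulary is built rather than how it is applied.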
- class hezar.preprocessors.tokenizers.wordpiece.WordPieceTokenizer(config, tokenizer_file=None, **kwargs)
Bases:
Tokenizer
A standard WordPiece tokenizer built on the 🤗 HuggingFace Tokenizers library.
- Parameters:
config – Preprocessor config for the tokenizer
**kwargs – Extra/manual config parameters
- token_ids_name = 'token_ids'
- tokenizer_config_filename = 'tokenizer_config.yaml'
- tokenizer_filename = 'tokenizer.json'