hezar.preprocessors.tokenizers.tokenizer module¶
- class hezar.preprocessors.tokenizers.tokenizer.Tokenizer(config: TokenizerConfig, tokenizer_file=None, **kwargs)[source]¶
- Bases: Preprocessor

Base tokenizer class. Mostly copied from BaseTokenizer.

- Parameters:
- config – A TokenizerConfig instance. 
- tokenizer_file (str) – A tokenizer.json file to load the whole tokenizer from. 
- **kwargs – Extra config parameters that merge into the main config. 
 
 - property bos_token¶
 - property bos_token_id¶
 - property cls_token¶
 - property cls_token_id¶
 - decode(ids: List[int], skip_special_tokens: bool = True, **kwargs)[source]¶
- Decode a list of token IDs.

- Parameters:
- ids (List[int]) – List of token IDs. 
- skip_special_tokens (bool) – Whether to skip special tokens during decoding. 
- **kwargs – Additional keyword arguments. 
 
- Returns:
- List of decoded strings. 
- Return type:
- List[str] 
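
For instance, a round trip through encode and decode might look like the sketch below. The Hub path is a placeholder, and the indexing assumes the documented return shape of encode (a list of dicts keyed by token_ids, per the token_ids_name attribute further down):

```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # placeholder Hub repo ID

# `encode` is documented to return a list of dicts keyed by `token_ids`
encoded = tokenizer.encode(["This is a test."])
ids = encoded[0]["token_ids"]

# Decode back to text, dropping special tokens such as [CLS]/[SEP]
decoded = tokenizer.decode(ids, skip_special_tokens=True)
```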
 
 - property decoder: Decoder¶
 - enable_padding(direction: str = 'right', pad_to_multiple_of: int | None = None, pad_id: int = 0, pad_type_id: int = 0, pad_token: str | None = None, length: int | None = None)[source]¶
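
Only the signature is documented here; a hedged sketch of its intent, enabling fixed-length right padding on the underlying tokenizer (the pad ID and pad token below are assumptions, not values read from a real config):

```python
# Pad every encoded sequence on the right up to 128 tokens
tokenizer.enable_padding(
    direction="right",
    pad_id=0,           # assumed pad token ID
    pad_token="[PAD]",  # assumed pad token string
    length=128,
)
```
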
 - encode(inputs, is_pretokenized: bool = False, add_special_tokens: bool = True, **kwargs)[source]¶
- Tokenize a list of inputs (raw text or already pre-tokenized).

- Parameters:
- inputs – List of inputs. 
- is_pretokenized – Whether the inputs are already tokenized. 
- add_special_tokens – Whether to add special tokens to the inputs. Defaults to True. 
- **kwargs – Additional keyword arguments. 
 
- Returns:
- List of dictionaries containing tokenized inputs. 
- Return type:
- List[Dict] 
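
A sketch of both input modes, raw strings and pre-split words (the repo ID is a placeholder):

```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # placeholder Hub repo ID

# Raw strings: the tokenizer handles splitting itself
batch = tokenizer.encode(["Hello world!", "A second sentence."])

# Already-split words: skip pre-tokenization and omit special tokens
pre_split = tokenizer.encode(
    [["Hello", "world", "!"]],
    is_pretokenized=True,
    add_special_tokens=False,
)
```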
 
 - property eos_token¶
 - property eos_token_id¶
 - static from_file(path)[source]¶
- Create a tokenizer from a file.

- Parameters:
- path (str) – Path to the tokenizer file. 
- Returns:
- The created tokenizer. 
- Return type:
- HFTokenizer 
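
Note that the return type is the backend HFTokenizer object rather than a full hezar Tokenizer, so this is mainly useful for direct access to the backend (the path below is illustrative):

```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

# Build only the backend object from a saved tokenizer.json
backend = Tokenizer.from_file("saved/tokenizer/tokenizer.json")
```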
 
- get_added_vocab() → Dict[str, int][source]¶
- Returns the added tokens in the vocabulary as a dictionary of token to index.

- Returns:
- The added tokens. 
- Return type:
- Dict[str, int] 
 
 - get_tokens_from_offsets(text: str | List[str], ids: List[int], offsets_mapping: List[Tuple[int, int]])[source]¶
- Extract human-readable tokens using the original text and the offsets mapping.

- Parameters:
- text (str | List[str]) – Raw string text.
- ids (List[int]) – Token IDs.
- offsets_mapping (List[Tuple[int, int]]) – A list of tuples representing character offsets.

- Returns:
- A list of tokens.
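
A hedged sketch, assuming the encoding carries both token_ids and offsets_mapping (the latter appears in the uncastable_keys list below, which suggests it can be present in encoded outputs):

```python
text = "Hello world!"
encoded = tokenizer.encode([text])[0]

# Map each (start, end) character span back onto the original string
tokens = tokenizer.get_tokens_from_offsets(
    text=text,
    ids=encoded["token_ids"],
    offsets_mapping=encoded["offsets_mapping"],
)
```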
 
- classmethod load(hub_or_local_path, subfolder=None, config_filename=None, tokenizer_filename=None, cache_dir=None, **kwargs) → Tokenizer[source]¶
- Load a tokenizer from a specified path or Hub repository.

- Parameters:
- cls – Class reference. 
- hub_or_local_path – Path or Hub repository ID. 
- subfolder – Subfolder containing tokenizer files. 
- config_filename – Tokenizer config filename. 
- tokenizer_filename – Tokenizer filename. 
- cache_dir – Path to the cache directory.
- **kwargs – Additional arguments. 
 
- Returns:
- Loaded tokenizer. 
- Return type:
- Tokenizer
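
Loading works from either a Hub repo ID or a local folder; both paths below are placeholders:

```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

# From the Hub, optionally with a custom cache directory
tokenizer = Tokenizer.load("hezarai/bert-base-fa", cache_dir="~/.cache/hezar")

# From a local directory containing tokenizer.json / tokenizer_config.yaml
tokenizer = Tokenizer.load("saved/tokenizer")
```
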
 - property mask_token¶
 - property mask_token_id¶
 - property model: Model¶
 - pad_encoded_batch(inputs, padding: str | PaddingType | None = None, max_length: int | None = None, truncation: bool = True, return_tensors: str | None = None, include_keys: List[str] | None = None, exclude_keys: List | None = None)[source]¶
- Pad a batch of encoded inputs.

- Parameters:
- inputs – Input batch of encoded tokens. 
- padding (str | PaddingType) – Padding type. 
- max_length (Optional[int]) – Max input length (only if padding is set to “max_length”). 
- truncation (bool) – Whether to allow truncation. 
- return_tensors (Optional[str]) – The type of tensors to return. 
- include_keys (Optional[List[str]]) – Only pad the given set of keys.
- exclude_keys (List) – A list of keys to exclude when padding. 
 
- Returns:
- Padded inputs. 
- Return type:
- Dict 
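
A sketch of padding an encoded batch and casting it to tensors. The string values for padding and return_tensors are assumptions modeled on common conventions, not confirmed by this page:

```python
encoded = tokenizer.encode(["short text", "a considerably longer piece of text"])

padded = tokenizer.pad_encoded_batch(
    encoded,
    padding="longest",    # assumed PaddingType value
    return_tensors="pt",  # assumed tensor-type string for PyTorch
    exclude_keys=["offsets_mapping"],  # keep uncastable keys out of the tensors
)
```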
 
 - property pad_token¶
 - property pad_token_id¶
 - property padding¶
 - push_to_hub(repo_id, commit_message=None, subfolder=None, tokenizer_filename=None, config_filename=None, private=False)[source]¶
- Push the tokenizer and its config to the Hub.

- Parameters:
- repo_id – The repo ID or name on the Hub.
- commit_message – Commit message for this push.
- subfolder – Subfolder in which to save the files.
- tokenizer_filename – Tokenizer filename.
- config_filename – Tokenizer config filename.
- private – Whether the repo should be private (ignored if the repo already exists).
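
For example (the repo ID is a placeholder, and Hub authentication is assumed to be configured already):

```python
tokenizer.push_to_hub(
    "my-username/my-tokenizer",  # placeholder repo ID
    commit_message="Add tokenizer",
    private=True,
)
```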
 
 
 - save(path, save_config=True, pretty=True)[source]¶
- Save the tokenizer and its configuration.

- Parameters:
- path (str) – Path to save the tokenizer. 
- save_config (bool) – Whether to save the configuration. 
- pretty (bool) – Whether to format the saved JSON file with indentation. 
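
Given the filename attributes documented below, saving writes tokenizer.json plus tokenizer_config.yaml into the target directory (the path is illustrative):

```python
tokenizer.save("saved/tokenizer", save_config=True, pretty=True)
```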
 
 
 - property sep_token¶
 - property sep_token_id¶
 - set_truncation_and_padding(padding=None, truncation=None, padding_side=None, truncation_side=None, max_length: int | None = None, stride: int | None = None, pad_to_multiple_of: int | None = None)[source]¶
 - property special_ids¶
 - token_ids_name = 'token_ids'¶
 - tokenizer_config_filename = 'tokenizer_config.yaml'¶
 - tokenizer_filename = 'tokenizer.json'¶
 - property truncation: dict¶
 - uncastable_keys = ['word_ids', 'tokens', 'offsets_mapping']¶
 - property unk_token¶
 - property unk_token_id¶
 - property vocab¶
 - property vocab_size: int¶
- Size of the base vocabulary (without the added tokens).

- Type:
- int 
 
 
- class hezar.preprocessors.tokenizers.tokenizer.TokenizerConfig(max_length: int = 'deprecated', truncation: str = 'deprecated', truncation_side: str | None = None, padding: str = 'deprecated', padding_side: str | None = None, stride: int | None = None, pad_to_multiple_of: int = 'deprecated', pad_token_type_id: int = 0, bos_token: str | None = None, eos_token: str | None = None, unk_token: str | None = None, sep_token: str | None = None, pad_token: str | None = None, cls_token: str | None = None, mask_token: str | None = None, additional_special_tokens: List[str] | None = None)[source]¶
- Bases: PreprocessorConfig

Configuration for the Tokenizer.

- Parameters:
- truncation_side (str) – Truncation direction for tokenization. 
- stride (int) – Stride for tokenization. 
- padding_side (str) – Padding direction for tokenization. 
- pad_to_multiple_of (int) – Pad to a multiple of this value. 
- pad_token_type_id (int) – ID of the padding token type. 
- bos_token (str) – Beginning of sequence token. 
- eos_token (str) – End of sequence token. 
- unk_token (str) – Unknown token. 
- sep_token (str) – Separator token. 
- pad_token (str) – Padding token. 
- cls_token (str) – Classification token. 
- mask_token (str) – Mask token. 
- additional_special_tokens (List[str]) – Additional special tokens. 
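
A sketch of constructing a config and handing it to the class above. Since Tokenizer is described as a base class, a concrete subclass would typically be instantiated in practice, and every token string below is illustrative:

```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer, TokenizerConfig

config = TokenizerConfig(
    truncation_side="right",
    padding_side="right",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)
# Base-class construction shown for illustration only; path is a placeholder
tokenizer = Tokenizer(config, tokenizer_file="saved/tokenizer/tokenizer.json")
```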
 
 - additional_special_tokens: List[str] = None¶
 - bos_token: str = None¶
 - cls_token: str = None¶
 - eos_token: str = None¶
 - mask_token: str = None¶
 - max_length: int = 'deprecated'¶
 - name: str = 'tokenizer'¶
 - pad_to_multiple_of: int = 'deprecated'¶
 - pad_token: str = None¶
 - pad_token_type_id: int = 0¶
 - padding: str = 'deprecated'¶
 - padding_side: str = None¶
 - sep_token: str = None¶
 - stride: int = None¶
 - truncation: str = 'deprecated'¶
 - truncation_side: str = None¶
 - unk_token: str = None¶