hezar.preprocessors.tokenizers.tokenizer module¶
- class hezar.preprocessors.tokenizers.tokenizer.Tokenizer(config: TokenizerConfig, tokenizer_file=None, **kwargs)[source]¶
Bases: Preprocessor
Base tokenizer class. Mostly copied from BaseTokenizer.
- Parameters:
config – A TokenizerConfig instance.
tokenizer_file (str) – A tokenizer.json file to load the whole tokenizer from.
**kwargs – Extra config parameters that merge into the main config.
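A minimal usage sketch (the Hub repo ID below is illustrative; any Hub repository or local directory containing the tokenizer files works):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

# The repo ID is an assumption for illustration; a local path also works.
tokenizer = Tokenizer.load("hezarai/bert-base-fa")
encoded = tokenizer.encode(["Hello world!"])  # a list of dicts, one per input
print(encoded)
```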
- property bos_token¶
- property bos_token_id¶
- property cls_token¶
- property cls_token_id¶
- decode(ids: List[int], skip_special_tokens: bool = True, **kwargs)[source]¶
Decode a list of token IDs.
- Parameters:
ids (List[int]) – List of token IDs.
skip_special_tokens (bool) – Whether to skip special tokens during decoding.
**kwargs – Additional keyword arguments.
- Returns:
List of decoded strings.
- Return type:
List[str]
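For example (the token IDs are illustrative; real values depend on the loaded vocabulary):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # illustrative repo ID
ids = [2, 1996, 3231, 3]  # illustrative token IDs, not from a real vocab
decoded = tokenizer.decode(ids, skip_special_tokens=True)  # drops special tokens
print(decoded)  # a list of decoded strings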
- property decoder: Decoder¶
- enable_padding(direction: str = 'right', pad_to_multiple_of: int | None = None, pad_id: int = 0, pad_type_id: int = 0, pad_token: str | None = None, length: int | None = None)[source]¶
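A sketch of turning on fixed-length padding (all argument values are illustrative assumptions):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # illustrative repo ID
# Pad every encoding to a fixed length of 32 on the right (values illustrative).
tokenizer.enable_padding(direction="right", pad_id=0, pad_token="[PAD]", length=32)
```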
- encode(inputs, is_pretokenized: bool = False, add_special_tokens: bool = True, **kwargs)[source]¶
Tokenize a list of inputs (either raw text or pre-tokenized inputs).
- Parameters:
inputs – List of inputs.
is_pretokenized – Whether the inputs are already tokenized.
add_special_tokens – Whether to add special tokens to the inputs. Defaults to True.
**kwargs – Additional keyword arguments.
- Returns:
List of dictionaries containing tokenized inputs.
- Return type:
List[Dict]
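For example (the repo ID is illustrative):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # illustrative repo ID

# Raw text inputs
batch = tokenizer.encode(["A first sentence.", "A second sentence."])

# Inputs that are already split into words
pretok = tokenizer.encode(
    [["A", "first", "sentence", "."]],
    is_pretokenized=True,
    add_special_tokens=True,
)
```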
- property eos_token¶
- property eos_token_id¶
- static from_file(path)[source]¶
Create a tokenizer from a file.
- Parameters:
path (str) – Path to the tokenizer file.
- Returns:
The created tokenizer.
- Return type:
HFTokenizer
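For example (note that this returns the underlying HFTokenizer from the tokenizers library, not a hezar Tokenizer; the path is illustrative):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

# Point this at a saved tokenizer.json file (path is illustrative).
hf_tokenizer = Tokenizer.from_file("path/to/tokenizer.json")
```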
- get_added_vocab() Dict[str, int] [source]¶
Returns the added tokens in the vocabulary as a dictionary of token to index.
- Returns:
The added tokens.
- Return type:
Dict[str, int]
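For example (the repo ID and printed mapping are illustrative):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # illustrative repo ID
added = tokenizer.get_added_vocab()
print(added)  # e.g. {"[MASK]": 4}; maps added token to index (values illustrative)
```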
- get_tokens_from_offsets(text: str | List[str], ids: List[int], offsets_mapping: List[Tuple[int, int]])[source]¶
Extract human-readable tokens using the original text and offsets mapping.
- Parameters:
text (str | List[str]) – Raw string text.
ids (List[int]) – Token IDs.
offsets_mapping (List[Tuple[int, int]]) – A list of tuples representing offsets.
- Returns:
A list of tokens
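A sketch, assuming the per-input dicts returned by encode() expose token_ids and offsets_mapping entries (key names taken from token_ids_name and uncastable_keys below; whether offsets are returned by default is an assumption):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # illustrative repo ID
text = "A sample sentence."
encoded = tokenizer.encode([text])[0]
tokens = tokenizer.get_tokens_from_offsets(
    text=text,
    ids=encoded["token_ids"],
    offsets_mapping=encoded["offsets_mapping"],  # assumes offsets are present
)
```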
- classmethod load(hub_or_local_path, subfolder=None, config_filename=None, tokenizer_filename=None, cache_dir=None, **kwargs) Tokenizer [source]¶
Load a tokenizer from a specified path or Hub repository.
- Parameters:
cls – Class reference.
hub_or_local_path – Path or Hub repository ID.
subfolder – Subfolder containing tokenizer files.
config_filename – Tokenizer config filename.
tokenizer_filename – Tokenizer filename.
cache_dir – Path to the cache directory.
**kwargs – Additional arguments.
- Returns:
Loaded tokenizer.
- Return type:
Tokenizer
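For example (the repo ID is illustrative; the filenames shown are the class defaults documented below):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load(
    "hezarai/bert-base-fa",                    # illustrative Hub repo ID or local path
    tokenizer_filename="tokenizer.json",       # class default
    config_filename="tokenizer_config.yaml",   # class default
)
```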
- property mask_token¶
- property mask_token_id¶
- property model: Model¶
- pad_encoded_batch(inputs, padding: str | PaddingType | None = None, max_length: int | None = None, truncation: bool = True, return_tensors: str | None = None, include_keys: List[str] | None = None, exclude_keys: List | None = None)[source]¶
Pad a batch of encoded inputs.
- Parameters:
inputs – Input batch of encoded tokens.
padding (str | PaddingType) – Padding type.
max_length (Optional[int]) – Max input length (only if padding is set to “max_length”).
truncation (bool) – Whether to allow truncation.
return_tensors (Optional[str]) – The type of tensors to return.
include_keys (Optional[List[str]]) – Only pad the given set of keys.
exclude_keys (List) – A list of keys to exclude when padding.
- Returns:
Padded inputs.
- Return type:
Dict
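For example (assuming the output of encode() can be fed directly; the "max_length" value is taken from the parameter docs above):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # illustrative repo ID
encoded = tokenizer.encode(["A short input.", "A somewhat longer second input."])
padded = tokenizer.pad_encoded_batch(
    encoded,
    padding="max_length",  # max_length is required for this padding mode
    max_length=32,
    truncation=True,
)
```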
- property pad_token¶
- property pad_token_id¶
- property padding¶
- push_to_hub(repo_id, commit_message=None, subfolder=None, tokenizer_filename=None, config_filename=None, private=False)[source]¶
Push the tokenizer and its config to the Hub.
- Parameters:
repo_id – The repo ID (path or repo name) on the Hub.
commit_message – Commit message for this push.
subfolder – Subfolder to save the files in.
tokenizer_filename – Tokenizer filename.
config_filename – Tokenizer config filename.
private – Whether the repo should be private (ignored if the repo already exists).
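For example (both repo IDs are hypothetical; pushing requires write access to the target repository):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # illustrative repo ID
tokenizer.push_to_hub(
    "my-username/my-tokenizer",    # hypothetical target repo
    commit_message="Add tokenizer",
    private=True,
)
```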
- save(path, save_config=True, pretty=True)[source]¶
Save the tokenizer and its configuration.
- Parameters:
path (str) – Path to save the tokenizer.
save_config (bool) – Whether to save the configuration.
pretty (bool) – Whether to format the saved JSON file with indentation.
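For example (the directory name is illustrative; the written filenames follow the class attributes below):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # illustrative repo ID
# Writes tokenizer.json and, with save_config=True, tokenizer_config.yaml
# under the given directory.
tokenizer.save("saved_tokenizer", save_config=True, pretty=True)
```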
- property sep_token¶
- property sep_token_id¶
- set_truncation_and_padding(padding=None, truncation=None, padding_side=None, truncation_side=None, max_length: int | None = None, stride: int | None = None, pad_to_multiple_of: int | None = None)[source]¶
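A sketch of configuring both behaviors in one call (the argument values are assumptions based on the parameter names; the accepted strategy strings are not documented here):
```python
from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # illustrative repo ID
tokenizer.set_truncation_and_padding(
    padding="max_length",      # illustrative strategy value
    max_length=128,
    padding_side="right",
    truncation_side="right",
)
```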
- property special_ids¶
- token_ids_name = 'token_ids'¶
- tokenizer_config_filename = 'tokenizer_config.yaml'¶
- tokenizer_filename = 'tokenizer.json'¶
- property truncation: dict¶
- uncastable_keys = ['word_ids', 'tokens', 'offsets_mapping']¶
- property unk_token¶
- property unk_token_id¶
- property vocab¶
- property vocab_size: int¶
Size of the base vocabulary (without the added tokens).
- Type:
int
- class hezar.preprocessors.tokenizers.tokenizer.TokenizerConfig(max_length: int = 'deprecated', truncation: str = 'deprecated', truncation_side: str | None = None, padding: str = 'deprecated', padding_side: str | None = None, stride: int | None = None, pad_to_multiple_of: int = 'deprecated', pad_token_type_id: int = 0, bos_token: str | None = None, eos_token: str | None = None, unk_token: str | None = None, sep_token: str | None = None, pad_token: str | None = None, cls_token: str | None = None, mask_token: str | None = None, additional_special_tokens: List[str] | None = None)[source]¶
Bases: PreprocessorConfig
Configuration for the Tokenizer.
- Parameters:
truncation_side (str) – Truncation direction for tokenization.
stride (int) – Stride for tokenization.
padding_side (str) – Padding direction for tokenization.
pad_to_multiple_of (int) – Pad to a multiple of this value.
pad_token_type_id (int) – ID of the padding token type.
bos_token (str) – Beginning of sequence token.
eos_token (str) – End of sequence token.
unk_token (str) – Unknown token.
sep_token (str) – Separator token.
pad_token (str) – Padding token.
cls_token (str) – Classification token.
mask_token (str) – Mask token.
additional_special_tokens (List[str]) – Additional special tokens.
- additional_special_tokens: List[str] = None¶
- bos_token: str = None¶
- cls_token: str = None¶
- eos_token: str = None¶
- mask_token: str = None¶
- max_length: int = 'deprecated'¶
- name: str = 'tokenizer'¶
- pad_to_multiple_of: int = 'deprecated'¶
- pad_token: str = None¶
- pad_token_type_id: int = 0¶
- padding: str = 'deprecated'¶
- padding_side: str = None¶
- sep_token: str = None¶
- stride: int = None¶
- truncation: str = 'deprecated'¶
- truncation_side: str = None¶
- unk_token: str = None¶
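A construction sketch with BERT-style special tokens (the token strings are illustrative defaults, not required values):
```python
from hezar.preprocessors.tokenizers.tokenizer import TokenizerConfig

config = TokenizerConfig(
    truncation_side="right",
    padding_side="right",
    pad_token="[PAD]",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
```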