hezar.preprocessors.tokenizers.tokenizer module

class hezar.preprocessors.tokenizers.tokenizer.Tokenizer(config: TokenizerConfig, tokenizer_file=None, **kwargs)[source]

Bases: Preprocessor

Base tokenizer class. Mostly copied from the HuggingFace tokenizers BaseTokenizer.

Parameters:
  • config – A TokenizerConfig instance.

  • tokenizer_file (str) – A tokenizer.json file to load the whole tokenizer from.

  • **kwargs – Extra config parameters that merge into the main config.
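
A minimal construction sketch. The base class is usually not instantiated directly; concrete subclasses or Tokenizer.load() (documented below) are the typical entry points. The tokenizer.json path here is a placeholder, and the extra keyword argument only illustrates how **kwargs merge into the config:

from hezar.preprocessors.tokenizers.tokenizer import Tokenizer, TokenizerConfig

config = TokenizerConfig(unk_token="[UNK]", pad_token="[PAD]")
# Extra kwargs (here padding_side) are merged into the config.
tokenizer = Tokenizer(config, tokenizer_file="tokenizer.json", padding_side="right")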

add_special_tokens(special_tokens) int[source]
add_tokens(tokens) int[source]
property bos_token
property bos_token_id
build()[source]

Build the tokenizer.

Returns:

The built tokenizer.

Return type:

HFTokenizer

property cls_token
property cls_token_id
convert_ids_to_tokens(ids: int | List[int], skip_special_tokens: bool = False)[source]
convert_tokens_to_ids(tokens: str | List[str]) int | List[int][source]
decode(ids: List[int], skip_special_tokens: bool = True, **kwargs)[source]

Decode a list of token IDs.

Parameters:
  • ids (List[int]) – List of token IDs.

  • skip_special_tokens (bool) – Whether to skip special tokens during decoding.

  • **kwargs – Additional keyword arguments.

Returns:

List of decoded strings.

Return type:

List[str]
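
A short decoding sketch, assuming `tokenizer` is an already loaded instance (see load() below); the tokens are arbitrary examples:

# Round-trip: token strings -> IDs -> decoded text.
ids = tokenizer.convert_tokens_to_ids(["hello", "world"])
decoded = tokenizer.decode(ids, skip_special_tokens=True)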

property decoder: Decoder
enable_padding(direction: str = 'right', pad_to_multiple_of: int | None = None, pad_id: int = 0, pad_type_id: int = 0, pad_token: str | None = None, length: int | None = None)[source]
enable_truncation(max_length, stride=0, strategy='longest_first', direction='right')[source]
encode(inputs, is_pretokenized: bool = False, add_special_tokens: bool = True, **kwargs)[source]

Tokenize a list of inputs (either raw text or already tokenized inputs).

Parameters:
  • inputs – List of inputs.

  • is_pretokenized – Whether the inputs are already tokenized.

  • add_special_tokens – Whether to add special tokens to the inputs. Defaults to True.

  • **kwargs – Additional keyword arguments.

Returns:

List of dictionaries containing tokenized inputs.

Return type:

List[Dict]
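
An encoding sketch, assuming `tokenizer` is a loaded instance; the sample sentences are arbitrary:

# Raw text inputs; set is_pretokenized=True if the inputs are already split into tokens.
encoded = tokenizer.encode(
    ["Hello world!", "Another sentence."],
    add_special_tokens=True,
)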

property eos_token
property eos_token_id
static from_file(path)[source]

Create a backend tokenizer (a HuggingFace tokenizers object) from a tokenizer file.

Parameters:

path (str) – Path to the tokenizer file.

Returns:

The created tokenizer.

Return type:

HFTokenizer
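
A sketch assuming a tokenizer.json produced by the HuggingFace tokenizers library exists at the (placeholder) path; note that this returns the backend HFTokenizer, not a hezar Tokenizer:

from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

backend = Tokenizer.from_file("path/to/tokenizer.json")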

get_added_vocab() Dict[str, int][source]

Returns the added tokens in the vocabulary as a dictionary mapping tokens to indices.

Returns:

The added tokens.

Return type:

Dict[str, int]

get_tokens_from_offsets(text: str | List[str], ids: List[int], offsets_mapping: List[Tuple[int, int]])[source]

Extract human-readable tokens using the original text and the offsets mapping.

Parameters:
  • text (str | List[str]) – Raw string text.

  • ids (List[int]) – Token IDs.

  • offsets_mapping (List[Tuple[int, int]]) – A list of tuples representing character offsets.

Returns:

A list of tokens
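
A sketch assuming `tokenizer` is a loaded instance and that each dictionary returned by encode() carries "token_ids" and "offsets_mapping" entries (key names taken from token_ids_name and uncastable_keys documented below); whether offsets are returned by default is an assumption:

text = "Hello world!"
encoded = tokenizer.encode([text], add_special_tokens=True)[0]
tokens = tokenizer.get_tokens_from_offsets(
    text=text,
    ids=encoded["token_ids"],
    offsets_mapping=encoded["offsets_mapping"],
)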

get_vocab(with_added_tokens: bool = True) Dict[str, int][source]
get_vocab_size(with_added_tokens: bool = True) int[source]
id_to_token(id: int) str[source]
classmethod load(hub_or_local_path, subfolder=None, config_filename=None, tokenizer_filename=None, cache_dir=None, **kwargs) Tokenizer[source]

Load a tokenizer from a specified path or Hub repository.

Parameters:
  • cls – Class reference.

  • hub_or_local_path – Path or Hub repository ID.

  • subfolder – Subfolder containing tokenizer files.

  • config_filename – Tokenizer config filename.

  • tokenizer_filename – Tokenizer filename.

  • cache_dir – Path to the cache directory.

  • **kwargs – Additional arguments.

Returns:

Loaded tokenizer.

Return type:

Tokenizer
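
A loading sketch; the Hub repo ID is a placeholder, and a local directory containing tokenizer.json and tokenizer_config.yaml (the filenames documented below) should work the same way:

from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")  # placeholder Hub repo ID
# tokenizer = Tokenizer.load("path/to/saved/tokenizer")  # or a local path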

property mask_token
property mask_token_id
property model: Model
no_padding()[source]
no_truncation()[source]
num_special_tokens_to_add(is_pair: bool) int[source]
pad_encoded_batch(inputs, padding: str | PaddingType | None = None, max_length: int | None = None, truncation: bool = True, return_tensors: str | None = None, include_keys: List[str] | None = None, exclude_keys: List | None = None)[source]

Pad a batch of encoded inputs.

Parameters:
  • inputs – Input batch of encoded tokens.

  • padding (str | PaddingType) – Padding type.

  • max_length (Optional[int]) – Max input length (only if padding is set to “max_length”).

  • truncation (bool) – Whether to allow truncation.

  • return_tensors (Optional[str]) – The type of tensors to return.

  • include_keys (Optional[List[str]]) – Only pad the given keys.

  • exclude_keys (List) – A list of keys to exclude when padding.

Returns:

Padded inputs.

Return type:

Dict
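
A padding sketch, assuming `tokenizer` is a loaded instance and `encoded` is the output of encode(); padding="max_length" follows the max_length note above:

encoded = tokenizer.encode(["Hello world!", "Another sentence."])
padded = tokenizer.pad_encoded_batch(
    encoded,
    padding="max_length",
    max_length=32,
    truncation=True,
)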

property pad_token
property pad_token_id
property padding
push_to_hub(repo_id, commit_message=None, subfolder=None, tokenizer_filename=None, config_filename=None, private=False)[source]

Push the tokenizer and its config to the Hub.

Parameters:
  • repo_id – The repo path (ID or repo name) on the Hub.

  • commit_message – Commit message for this push.

  • subfolder – Subfolder in which to save the files.

  • tokenizer_filename – Tokenizer filename.

  • config_filename – Tokenizer config filename.

  • private – Whether the repo should be private (ignored if the repo already exists).
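
An upload sketch; the repo ID is a placeholder and a Hugging Face Hub token with write access is assumed to be configured:

tokenizer.push_to_hub(
    "my-username/my-tokenizer",  # placeholder repo ID
    commit_message="Upload tokenizer",
    private=True,
)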

required_backends: List[str | Backends] = []
save(path, save_config=True, pretty=True)[source]

Save the tokenizer and its configuration.

Parameters:
  • path (str) – Path to save the tokenizer.

  • save_config (bool) – Whether to save the configuration.

  • pretty (bool) – Whether to format the saved JSON file with indentation.
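
A saving sketch, assuming `tokenizer` is a loaded instance; per the filenames documented below, this is expected to write tokenizer.json and (with save_config=True) tokenizer_config.yaml under the given path:

tokenizer.save("saved/my_tokenizer", save_config=True, pretty=True)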

property sep_token
property sep_token_id
set_truncation_and_padding(padding=None, truncation=None, padding_side=None, truncation_side=None, max_length: int | None = None, stride: int | None = None, pad_to_multiple_of: int | None = None)[source]
property special_ids
token_ids_name = 'token_ids'
token_to_id(token: str) int[source]
tokenizer_config_filename = 'tokenizer_config.yaml'
tokenizer_filename = 'tokenizer.json'
property truncation: dict
uncastable_keys = ['word_ids', 'tokens', 'offsets_mapping']
property unk_token
property unk_token_id
property vocab
property vocab_size: int

Size of the base vocabulary (without the added tokens).

Type:

int

class hezar.preprocessors.tokenizers.tokenizer.TokenizerConfig(max_length: int = 'deprecated', truncation: str = 'deprecated', truncation_side: str | None = None, padding: str = 'deprecated', padding_side: str | None = None, stride: int | None = None, pad_to_multiple_of: int = 'deprecated', pad_token_type_id: int = 0, bos_token: str | None = None, eos_token: str | None = None, unk_token: str | None = None, sep_token: str | None = None, pad_token: str | None = None, cls_token: str | None = None, mask_token: str | None = None, additional_special_tokens: List[str] | None = None)[source]

Bases: PreprocessorConfig

Configuration for the Tokenizer.

Parameters:
  • truncation_side (str) – Truncation direction for tokenization.

  • stride (int) – Stride (overlap) used when truncating sequences into overflowing segments.

  • padding_side (str) – Padding direction for tokenization.

  • pad_to_multiple_of (int) – Pad to a multiple of this value.

  • pad_token_type_id (int) – ID of the padding token type.

  • bos_token (str) – Beginning of sequence token.

  • eos_token (str) – End of sequence token.

  • unk_token (str) – Unknown token.

  • sep_token (str) – Separator token.

  • pad_token (str) – Padding token.

  • cls_token (str) – Classification token.

  • mask_token (str) – Mask token.

  • additional_special_tokens (List[str]) – Additional special tokens.
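
A configuration sketch using only the documented, non-deprecated fields; the token strings are arbitrary examples:

from hezar.preprocessors.tokenizers.tokenizer import TokenizerConfig

config = TokenizerConfig(
    truncation_side="right",
    padding_side="right",
    stride=0,
    pad_token_type_id=0,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="</s>",
    pad_token="<pad>",
    cls_token="<s>",
    mask_token="<mask>",
)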

additional_special_tokens: List[str] = None
bos_token: str = None
cls_token: str = None
eos_token: str = None
mask_token: str = None
max_length: int = 'deprecated'
name: str = 'tokenizer'
pad_to_multiple_of: int = 'deprecated'
pad_token: str = None
pad_token_type_id: int = 0
padding: str = 'deprecated'
padding_side: str = None
sep_token: str = None
stride: int = None
truncation: str = 'deprecated'
truncation_side: str = None
unk_token: str = None