hezar.preprocessors.tokenizers.tokenizer module

class hezar.preprocessors.tokenizers.tokenizer.Tokenizer(config: TokenizerConfig, tokenizer_file=None, **kwargs)[source]

Bases: Preprocessor

Base tokenizer class. Mostly copied from BaseTokenizer.

Parameters:
  • config – A TokenizerConfig instance.

  • tokenizer_file (str) – A tokenizer.json file to load the whole tokenizer from.

  • **kwargs – Extra config parameters that merge into the main config.
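
Example (a minimal sketch; the Hub repo ID is a placeholder and can be any Hezar repo or local folder that contains the tokenizer files):

from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

# Load a pretrained tokenizer from the Hub or from a local path (placeholder repo ID)
tokenizer = Tokenizer.load("hezarai/bert-base-fa")

# Encode a batch of raw texts; see encode() below for the available options
encoded = tokenizer.encode(["This is a test sentence."], add_special_tokens=True)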

add_special_tokens(special_tokens) int[source]
add_tokens(tokens) int[source]
property bos_token
property bos_token_id
build()[source]

Build the tokenizer.

Returns:

The built tokenizer.

Return type:

HFTokenizer

property cls_token
property cls_token_id
convert_ids_to_tokens(ids: int | List[int], skip_special_tokens: bool = False)[source]
convert_tokens_to_ids(tokens: str | List[str]) int | List[int][source]
decode(ids: List[int], skip_special_tokens: bool = True, **kwargs)[source]

Decode a list of token IDs.

Parameters:
  • ids (List[int]) – List of token IDs.

  • skip_special_tokens (bool) – Whether to skip special tokens during decoding.

  • **kwargs – Additional keyword arguments.

Returns:

List of decoded strings.

Return type:

List[str]
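
Example (a sketch; indexing into the encode() output and the token_ids key are assumptions based on the documented List[Dict] return type and the token_ids_name attribute below):

# tokenizer: a loaded Tokenizer instance (see the class-level example above)
encoded = tokenizer.encode(["Tokenizers split text into IDs."], add_special_tokens=True)
ids = encoded[0]["token_ids"]  # assumed output key, per token_ids_name

decoded = tokenizer.decode(ids, skip_special_tokens=True)       # special tokens removed
decoded_raw = tokenizer.decode(ids, skip_special_tokens=False)  # special tokens kept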

property decoder: Decoder
enable_padding(direction: str = 'right', pad_to_multiple_of: int | None = None, pad_id: int = 0, pad_type_id: int = 0, pad_token: str | None = None, length: int | None = None)[source]
enable_truncation(max_length, stride=0, strategy='longest_first', direction='right')[source]
encode(inputs, is_pretokenized: bool = False, add_special_tokens: bool = True, **kwargs)[source]

Tokenize a list of inputs (could be raw or tokenized inputs).

Parameters:
  • inputs – List of inputs.

  • is_pretokenized – Whether the inputs are already tokenized.

  • add_special_tokens – Whether to add special tokens to the inputs. Defaults to True.

  • **kwargs – Additional keyword arguments.

Returns:

List of dictionaries containing tokenized inputs.

Return type:

List[Dict]
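
Example (a sketch contrasting raw and pre-tokenized inputs):

# tokenizer: a loaded Tokenizer instance (see the class-level example above)

# Raw text inputs
encoded = tokenizer.encode(["Hello from Hezar!"], add_special_tokens=True)

# Inputs that are already split into words
encoded_pretok = tokenizer.encode([["Hello", "from", "Hezar", "!"]], is_pretokenized=True)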

property eos_token
property eos_token_id
static from_file(path)[source]

Create a tokenizer from a file.

Parameters:

path (str) – Path to the tokenizer file.

Returns:

The created tokenizer.

Return type:

HFTokenizer
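
Example (a sketch; the path is a placeholder pointing at a serialized tokenizer.json file):

from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

# Build the backend tokenizer directly from a tokenizer.json file
backend_tokenizer = Tokenizer.from_file("/path/to/tokenizer.json")  # returns an HFTokenizer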

get_added_vocab() Dict[str, int][source]

Returns the added tokens in the vocabulary as a dictionary of token to index.

Returns:

The added tokens.

Return type:

Dict[str, int]

get_tokens_from_offsets(text: str | List[str], ids: List[int], offsets_mapping: List[Tuple[int, int]])[source]

Extract human-readable tokens from the original text using the offsets mapping.

Parameters:
  • text (str | List[str]) – Raw string text.

  • ids (List[int]) – Token IDs.

  • offsets_mapping (List[Tuple[int, int]]) – A list of (start, end) tuples representing character offsets.

Returns:

A list of tokens.
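
Example (a sketch; the token_ids and offsets_mapping keys are assumptions based on token_ids_name and uncastable_keys below, and their availability may depend on the backend tokenizer):

# tokenizer: a loaded Tokenizer instance (see the class-level example above)
text = "Offsets map tokens back to the original text."
sample = tokenizer.encode([text], add_special_tokens=True)[0]

tokens = tokenizer.get_tokens_from_offsets(
    text=text,
    ids=sample["token_ids"],                    # assumed output key
    offsets_mapping=sample["offsets_mapping"],  # assumed output key
)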

get_vocab(with_added_tokens: bool = True) Dict[str, int][source]
get_vocab_size(with_added_tokens: bool = True) int[source]
id_to_token(id: int) str[source]
classmethod load(hub_or_local_path, subfolder=None, config_filename=None, tokenizer_filename=None, cache_dir=None, **kwargs) Tokenizer[source]

Load a tokenizer from a specified path or Hub repository.

Parameters:
  • cls – Class reference.

  • hub_or_local_path – Path or Hub repository ID.

  • subfolder – Subfolder containing tokenizer files.

  • config_filename – Tokenizer config filename.

  • tokenizer_filename – Tokenizer filename.

  • cache_dir – Path to the cache directory.

  • **kwargs – Additional arguments.

Returns:

Loaded tokenizer.

Return type:

Tokenizer
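
Example (a sketch; the repo ID and the local path are placeholders):

from hezar.preprocessors.tokenizers.tokenizer import Tokenizer

# From a Hub repository (placeholder repo ID)
tokenizer = Tokenizer.load("hezarai/bert-base-fa")

# From a local directory previously written by save(), with explicit filenames
tokenizer = Tokenizer.load(
    "path/to/saved/tokenizer",
    config_filename="tokenizer_config.yaml",
    tokenizer_filename="tokenizer.json",
)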

property mask_token
property mask_token_id
property model: Model
no_padding()[source]
no_truncation()[source]
num_special_tokens_to_add(is_pair: bool) int[source]
pad_encoded_batch(inputs, padding: str | PaddingType = None, max_length: int | None = None, truncation: bool = True, return_tensors: str | None = None, include_keys: List[str] | None = None, exclude_keys: List = None)[source]

Pad a batch of encoded inputs.

Parameters:
  • inputs – Input batch of encoded tokens.

  • padding (str | PaddingType) – Padding type.

  • max_length (Optional[int]) – Max input length (only if padding is set to “max_length”).

  • truncation (bool) – Whether to allow truncation.

  • return_tensors (Optional[str]) – The type of tensors to return.

  • include_keys (Optional[List[str]]) – Only pad the given set of keys.

  • exclude_keys (List) – A list of keys to exclude when padding.

Returns:

Padded inputs.

Return type:

Dict
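
Example (a sketch; the input layout and ID values are made up for illustration):

# tokenizer: a loaded Tokenizer instance (see the class-level example above)
batch = {"token_ids": [[2, 51, 52, 3], [2, 51, 52, 53, 54, 3]]}  # hypothetical IDs

padded = tokenizer.pad_encoded_batch(
    batch,
    padding="max_length",  # pad every sample to max_length
    max_length=8,
    truncation=True,
)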

property pad_token
property pad_token_id
property padding
push_to_hub(repo_id, commit_message=None, subfolder=None, tokenizer_filename=None, config_filename=None, private=False)[source]

Push the tokenizer and its config to the Hub.

Parameters:
  • repo_id – The repo ID or path on the Hub.

  • commit_message – Commit message for this push.

  • subfolder – Subfolder in which to save the files.

  • tokenizer_filename – Tokenizer filename.

  • config_filename – Tokenizer config filename.

  • private – Whether the repo should be private (ignored if the repo already exists).
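
Example (a sketch; the repo ID is a placeholder):

# tokenizer: a loaded Tokenizer instance (see the class-level example above)
tokenizer.push_to_hub(
    "my-username/my-tokenizer",      # placeholder repo ID
    commit_message="Add tokenizer",
    private=True,
)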

required_backends: List[str | Backends] = []
save(path, save_config=True, pretty=True)[source]

Save the tokenizer and its configuration.

Parameters:
  • path (str) – Path to save the tokenizer.

  • save_config (bool) – Whether to save the configuration.

  • pretty (bool) – Whether to format the saved JSON file with indentation.
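
Example (a sketch; the directory is a placeholder, and the output filenames presumably follow the tokenizer_filename and tokenizer_config_filename attributes below):

# tokenizer: a loaded Tokenizer instance (see the class-level example above)
tokenizer.save("path/to/saved/tokenizer", save_config=True, pretty=True)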

property sep_token
property sep_token_id
set_truncation_and_padding(padding_strategy=None, truncation_strategy=None, padding_side=None, truncation_side=None, max_length: int | None = None, stride: int | None = None, pad_to_multiple_of: int | None = None)[source]
property special_ids
token_ids_name = 'token_ids'
token_to_id(token: str) int[source]
tokenizer_config_filename = 'tokenizer_config.yaml'
tokenizer_filename = 'tokenizer.json'
property truncation: dict
uncastable_keys = ['word_ids', 'tokens', 'offsets_mapping']
property unk_token
property unk_token_id
property vocab
property vocab_size: int

Size of the base vocabulary (without the added tokens).

Type:

int

class hezar.preprocessors.tokenizers.tokenizer.TokenizerConfig(max_length: int | None = None, truncation_strategy: str | None = None, truncation_direction: str | None = None, stride: int | None = None, padding_strategy: str | None = None, padding_direction: str | None = None, pad_to_multiple_of: int | None = None, pad_token_type_id: int = 0, bos_token: str | None = None, eos_token: str | None = None, unk_token: str | None = None, sep_token: str | None = None, pad_token: str | None = None, cls_token: str | None = None, mask_token: str | None = None, additional_special_tokens: List[str] | None = None)[source]

Bases: PreprocessorConfig

Configuration for the Tokenizer.

Parameters:
  • max_length (int) – Maximum length of the tokenized sequences.

  • truncation_strategy (str) – Truncation strategy for tokenization.

  • truncation_direction (str) – Truncation direction for tokenization.

  • stride (int) – Stride for tokenization.

  • padding_strategy (str) – Padding strategy for tokenization.

  • padding_direction (str) – Padding direction for tokenization.

  • pad_to_multiple_of (int) – Pad to a multiple of this value.

  • pad_token_type_id (int) – ID of the padding token type.

  • bos_token (str) – Beginning of sequence token.

  • eos_token (str) – End of sequence token.

  • unk_token (str) – Unknown token.

  • sep_token (str) – Separator token.

  • pad_token (str) – Padding token.

  • cls_token (str) – Classification token.

  • mask_token (str) – Mask token.

  • additional_special_tokens (List[str]) – Additional special tokens.
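
Example (a sketch; the strategy names and special token strings are illustrative values):

from hezar.preprocessors.tokenizers.tokenizer import TokenizerConfig

config = TokenizerConfig(
    max_length=128,
    truncation_strategy="longest_first",  # illustrative value
    truncation_direction="right",
    padding_strategy="longest",           # illustrative value
    padding_direction="right",
    unk_token="[UNK]",
    pad_token="[PAD]",
)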

additional_special_tokens: List[str] = None
bos_token: str = None
cls_token: str = None
eos_token: str = None
mask_token: str = None
max_length: int = None
name: str = 'tokenizer'
pad_to_multiple_of: int = None
pad_token: str = None
pad_token_type_id: int = 0
padding_direction: str = None
padding_strategy: str = None
sep_token: str = None
stride: int = None
truncation_direction: str = None
truncation_strategy: str = None
unk_token: str = None