hezar.preprocessors.tokenizers package
Submodules
- hezar.preprocessors.tokenizers.bpe module
  - BPEConfig
    - BPEConfig.additional_special_tokens
    - BPEConfig.bos_token
    - BPEConfig.cls_token
    - BPEConfig.continuing_subword_prefix
    - BPEConfig.dropout
    - BPEConfig.end_of_word_suffix
    - BPEConfig.eos_token
    - BPEConfig.fuse_unk
    - BPEConfig.initial_alphabet
    - BPEConfig.limit_alphabet
    - BPEConfig.mask_token
    - BPEConfig.min_frequency
    - BPEConfig.name
    - BPEConfig.pad_to_multiple_of
    - BPEConfig.pad_token
    - BPEConfig.padding_side
    - BPEConfig.sep_token
    - BPEConfig.show_progress
    - BPEConfig.stride
    - BPEConfig.truncation_side
    - BPEConfig.unk_token
    - BPEConfig.vocab_size
  - BPETokenizer
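The `min_frequency` and `vocab_size` fields above control the BPE merge loop: the trainer repeatedly fuses the most frequent adjacent symbol pair until the vocabulary budget is spent or no pair clears the frequency floor. A toy pure-Python sketch of that loop (illustrative only — hezar's `BPETokenizer` delegates training to a fast backend, and `train_bpe` here is a hypothetical helper, not part of the API):

```python
from collections import Counter

def train_bpe(words, num_merges, min_frequency=2):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    symbol pair. Sketch only, not hezar's implementation."""
    # Start from character-level symbol sequences, weighted by word frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < min_frequency:  # mirrors the min_frequency config field
            break
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, f in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += f
        corpus = new_corpus
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
```

Each learned merge becomes one new vocabulary entry, which is why `vocab_size` bounds the number of merges a real trainer performs.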
- hezar.preprocessors.tokenizers.sentencepiece_bpe module
  - SentencePieceBPEConfig
    - SentencePieceBPEConfig.add_prefix_space
    - SentencePieceBPEConfig.additional_special_tokens
    - SentencePieceBPEConfig.bos_token
    - SentencePieceBPEConfig.cls_token
    - SentencePieceBPEConfig.continuing_subword_prefix
    - SentencePieceBPEConfig.dropout
    - SentencePieceBPEConfig.end_of_word_suffix
    - SentencePieceBPEConfig.eos_token
    - SentencePieceBPEConfig.fuse_unk
    - SentencePieceBPEConfig.initial_alphabet
    - SentencePieceBPEConfig.limit_alphabet
    - SentencePieceBPEConfig.mask_token
    - SentencePieceBPEConfig.min_frequency
    - SentencePieceBPEConfig.name
    - SentencePieceBPEConfig.pad_to_multiple_of
    - SentencePieceBPEConfig.pad_token
    - SentencePieceBPEConfig.padding_side
    - SentencePieceBPEConfig.replacement
    - SentencePieceBPEConfig.sep_token
    - SentencePieceBPEConfig.show_progress
    - SentencePieceBPEConfig.stride
    - SentencePieceBPEConfig.truncation_side
    - SentencePieceBPEConfig.unk_token
    - SentencePieceBPEConfig.vocab_size
  - SentencePieceBPETokenizer
- hezar.preprocessors.tokenizers.sentencepiece_unigram module
  - SentencePieceUnigramConfig
    - SentencePieceUnigramConfig.add_prefix_space
    - SentencePieceUnigramConfig.additional_special_tokens
    - SentencePieceUnigramConfig.bos_token
    - SentencePieceUnigramConfig.cls_token
    - SentencePieceUnigramConfig.continuing_subword_prefix
    - SentencePieceUnigramConfig.dropout
    - SentencePieceUnigramConfig.end_of_word_suffix
    - SentencePieceUnigramConfig.eos_token
    - SentencePieceUnigramConfig.fuse_unk
    - SentencePieceUnigramConfig.initial_alphabet
    - SentencePieceUnigramConfig.limit_alphabet
    - SentencePieceUnigramConfig.mask_token
    - SentencePieceUnigramConfig.min_frequency
    - SentencePieceUnigramConfig.name
    - SentencePieceUnigramConfig.pad_to_multiple_of
    - SentencePieceUnigramConfig.pad_token
    - SentencePieceUnigramConfig.padding_side
    - SentencePieceUnigramConfig.replacement
    - SentencePieceUnigramConfig.sep_token
    - SentencePieceUnigramConfig.show_progress
    - SentencePieceUnigramConfig.stride
    - SentencePieceUnigramConfig.truncation_side
    - SentencePieceUnigramConfig.unk_token
    - SentencePieceUnigramConfig.vocab_size
  - SentencePieceUnigramTokenizer
    - SentencePieceUnigramTokenizer.build()
    - SentencePieceUnigramTokenizer.required_backends
    - SentencePieceUnigramTokenizer.token_ids_name
    - SentencePieceUnigramTokenizer.tokenizer_config_filename
    - SentencePieceUnigramTokenizer.tokenizer_filename
    - SentencePieceUnigramTokenizer.train()
    - SentencePieceUnigramTokenizer.train_from_iterator()
  
- hezar.preprocessors.tokenizers.tokenizer module
  - Tokenizer
    - Tokenizer.add_special_tokens()
    - Tokenizer.add_tokens()
    - Tokenizer.bos_token
    - Tokenizer.bos_token_id
    - Tokenizer.build()
    - Tokenizer.cls_token
    - Tokenizer.cls_token_id
    - Tokenizer.convert_ids_to_tokens()
    - Tokenizer.convert_tokens_to_ids()
    - Tokenizer.decode()
    - Tokenizer.decoder
    - Tokenizer.enable_padding()
    - Tokenizer.enable_truncation()
    - Tokenizer.encode()
    - Tokenizer.eos_token
    - Tokenizer.eos_token_id
    - Tokenizer.from_file()
    - Tokenizer.get_added_vocab()
    - Tokenizer.get_tokens_from_offsets()
    - Tokenizer.get_vocab()
    - Tokenizer.get_vocab_size()
    - Tokenizer.id_to_token()
    - Tokenizer.load()
    - Tokenizer.mask_token
    - Tokenizer.mask_token_id
    - Tokenizer.model
    - Tokenizer.no_padding()
    - Tokenizer.no_truncation()
    - Tokenizer.num_special_tokens_to_add()
    - Tokenizer.pad_encoded_batch()
    - Tokenizer.pad_token
    - Tokenizer.pad_token_id
    - Tokenizer.padding
    - Tokenizer.push_to_hub()
    - Tokenizer.required_backends
    - Tokenizer.save()
    - Tokenizer.sep_token
    - Tokenizer.sep_token_id
    - Tokenizer.set_truncation_and_padding()
    - Tokenizer.special_ids
    - Tokenizer.token_ids_name
    - Tokenizer.token_to_id()
    - Tokenizer.tokenizer_config_filename
    - Tokenizer.tokenizer_filename
    - Tokenizer.truncation
    - Tokenizer.uncastable_keys
    - Tokenizer.unk_token
    - Tokenizer.unk_token_id
    - Tokenizer.vocab
    - Tokenizer.vocab_size
  - TokenizerConfig
    - TokenizerConfig.additional_special_tokens
    - TokenizerConfig.bos_token
    - TokenizerConfig.cls_token
    - TokenizerConfig.eos_token
    - TokenizerConfig.mask_token
    - TokenizerConfig.max_length
    - TokenizerConfig.name
    - TokenizerConfig.pad_to_multiple_of
    - TokenizerConfig.pad_token
    - TokenizerConfig.pad_token_type_id
    - TokenizerConfig.padding
    - TokenizerConfig.padding_side
    - TokenizerConfig.sep_token
    - TokenizerConfig.stride
    - TokenizerConfig.truncation
    - TokenizerConfig.truncation_side
    - TokenizerConfig.unk_token
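The base `Tokenizer` pairs two lookup directions: `token_to_id()`/`convert_tokens_to_ids()` map strings to vocabulary indices, and `id_to_token()`/`convert_ids_to_tokens()` map back, with unknown strings falling through to `unk_token`. A minimal pure-Python sketch of that contract (`ToyTokenizer` is a hypothetical stand-in, not hezar's class, which wraps a full backend tokenizer):

```python
class ToyTokenizer:
    """Minimal illustration of the token/id mapping contract exposed by
    hezar's base Tokenizer. Hypothetical sketch, not the real class."""

    def __init__(self, tokens, unk_token="[UNK]"):
        # Reserve id 0 for the unknown token, then number the rest.
        self.vocab = {unk_token: 0, **{t: i + 1 for i, t in enumerate(tokens)}}
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}
        self.unk_token = unk_token

    def token_to_id(self, token):
        # Unknown tokens resolve to the unk_token id.
        return self.vocab.get(token, self.vocab[self.unk_token])

    def id_to_token(self, index):
        return self.ids_to_tokens[index]

    def convert_tokens_to_ids(self, tokens):
        return [self.token_to_id(t) for t in tokens]

    def convert_ids_to_tokens(self, ids):
        return [self.id_to_token(i) for i in ids]

tok = ToyTokenizer(["hello", "world"])
ids = tok.convert_tokens_to_ids(["hello", "world", "!"])
```

Round-tripping ids through `convert_ids_to_tokens` recovers the tokens, except that anything out of vocabulary comes back as `unk_token` — the same asymmetry the real class has.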
- hezar.preprocessors.tokenizers.wordpiece module
  - WordPieceConfig
    - WordPieceConfig.additional_special_tokens
    - WordPieceConfig.cls_token
    - WordPieceConfig.initial_alphabet
    - WordPieceConfig.limit_alphabet
    - WordPieceConfig.mask_token
    - WordPieceConfig.min_frequency
    - WordPieceConfig.name
    - WordPieceConfig.pad_to_multiple_of
    - WordPieceConfig.pad_token
    - WordPieceConfig.pad_token_type_id
    - WordPieceConfig.padding_side
    - WordPieceConfig.sep_token
    - WordPieceConfig.show_progress
    - WordPieceConfig.stride
    - WordPieceConfig.truncation_side
    - WordPieceConfig.unk_token
    - WordPieceConfig.vocab_size
    - WordPieceConfig.wordpieces_prefix
  - WordPieceTokenizer
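The `wordpieces_prefix` field (conventionally `##`) marks subword pieces that continue a word during WordPiece's greedy longest-match-first split. A toy sketch of that lookup scheme (illustrative only — `wordpiece_tokenize` is a hypothetical helper, not hezar's backend implementation):

```python
def wordpiece_tokenize(word, vocab, wordpieces_prefix="##", unk_token="[UNK]"):
    """Greedy longest-match-first subword split, the scheme the
    wordpieces_prefix option configures. Toy sketch only."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining span first, shrinking until a match.
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Continuation pieces carry the prefix marker.
                candidate = wordpieces_prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # If any span has no match, the whole word maps to unk_token.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "##a", "##ff"}
pieces = wordpiece_tokenize("unaffable", vocab)
```

Decoding simply strips `wordpieces_prefix` from continuation pieces and concatenates, which is why the prefix must be a string that never starts a real word piece.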