Preprocessors

A really important group of modules in Hezar is the preprocessors. Preprocessors are responsible for every single processing of inputs from their rawest form to the point that they’re ready to be fed to the model.

Preprocessors include all the tokenizers, feature extractors, normalizers, etc. and all of them are considered as a preprocessor type.

Loading preprocessors

Following the common pattern among all modules in Hezar, preprocessors also can be loaded in the same way.

Loading with the corresponding module
You can load any preprocessor of any type with its base class like Tokenizer, AudioFeatureExtractor, etc.

from hezar.preprocessors import Tokenizer, AudioFeatureExtractor, TextNormalizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")
normalizer = TextNormalizer.load("hezarai/roberta-base-fa")
feature_extractor = AudioFeatureExtractor.load("hezarai/whisper-small-fa")
...

Loading with the Preprocessor module
Some models might need multiple types of preprocessors. For example encoder-decoder multimodal models like image captioning models or even audio models need both feature extractor and text tokenizer or even a text normalizer. In order to load all preprocessors in a path, you can use the Preprocessor.load. The output of this method depends on whether the path contains single or multiple preprocessors.

  • If path contains only one preprocessor the output is a preprocessor object of the right type.

  • If path contains multiple preprocessors, the output is a PreprocessorContainer which is a dict-like object that holds each preprocessor by its registry name.

from hezar.preprocessors import Tokenizer

tokenizer = Tokenizer.load("hezarai/bert-base-fa")
print(tokenizer)
<hezar.preprocessors.tokenizers.wordpiece.WordPieceTokenizer object at 0x7f636d951e50>
from hezar.preprocessors import Preprocessor

whisper_preprocessors = Preprocessor.load("hezarai/whisper-small-fa")
print(whisper_preprocessors)
PreprocessorsContainer(
    [
        ('whisper_feature_extractor',
         < hezar.preprocessors.feature_extractors.audio.whisper_feature_extractor.WhisperFeatureExtractor at 0x7f6316fdcbb0 >),
        ('whisper_bpe_tokenizer',
         < hezar.preprocessors.tokenizers.whisper_bpe.WhisperBPETokenizer at 0x7f643cb13f40 >)
    ]
)

In order to access specific preprocessor objects within a PreprocessorContainer instance, you can use the following properties:

  • tokenizer: Returns the tokenizer of the container if exists.

  • text_normalizer: Returns the text normalizer of the container if exists.

  • image_processor: Returns the image processor of the container if exists.

  • audio_feature_extractor: Returns the audio feature extractor of the container if exists.

Saving & Pushing to the Hub

Although preprocessor have their own type, they all implement the load, save and push_to_hub methods.

from hezar.preprocessors import TextNormalizer, TextNormalizerConfig

normalizer = TextNormalizer(TextNormalizerConfig(nfkc=False))
normalizer.save("my-normalizer")
normalizer.push_to_hub("arxyzan/my-normalizer")

Folder structure of the preprocessors

All preprocessors are saved under the preprocessor subfolder by default. Changing this behaviour is possible from all three methods:

  • load(..., subfolder="SUBFOLDER")

  • save(..., subfolder="SUBFOLDER")

  • push_to_hub(..., subfolder="SUBFOLDER")

The folder structure of the preprocessors for any save model (locally or in a repo) is something like below:

hezarai/whisper-small-fa
├── model_config.yaml
├── model.pt
└── preprocessor
    ├── feature_extractor_config.yaml
    ├── tokenizer_config.yaml
    └── tokenizer.json