hezar.preprocessors.preprocessor module

class hezar.preprocessors.preprocessor.Preprocessor(config: PreprocessorConfig, **kwargs)[source]

Bases: object

Base class for all data preprocessors.

Parameters:

config – Preprocessor properties

classmethod load(hub_or_local_path, subfolder: str | None = None, force_return_dict: bool = False, cache_dir: str | None = None, **kwargs)[source]

Load a preprocessor or a pipeline of preprocessors from a local or Hub path. This method automatically detects every preprocessor in the path. If there is only one, it returns that preprocessor; if there are several, it returns a dictionary of preprocessors.

Subclasses must also override this method, because the base implementation calls the subclass's load for every preprocessor it finds in the repo.

Parameters:
  • hub_or_local_path – Path to a Hub or local repo

  • subfolder – Subfolder for the preprocessor.

  • force_return_dict – Whether to return a dict even if there’s only one preprocessor available in the repo

  • cache_dir – Path to cache directory

  • **kwargs – Extra kwargs

Returns:

An instance of a Preprocessor subclass, or a dict of Preprocessor subclass instances
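
The return rule described above (one object when a single preprocessor is found, a dict otherwise, and always a dict when force_return_dict is set) can be sketched in isolation. This is a minimal illustration, not hezar's implementation: the helper name collect_preprocessors and the placeholder string values are hypothetical.

```python
from collections import OrderedDict


def collect_preprocessors(found, force_return_dict=False):
    """Hypothetical helper: return the lone item, or an ordered mapping.

    `found` is a list of (name, preprocessor) pairs, standing in for
    whatever was detected in the repo.
    """
    container = OrderedDict(found)
    if len(container) == 1 and not force_return_dict:
        # Only one preprocessor: hand it back directly.
        return next(iter(container.values()))
    # Several preprocessors, or a dict was explicitly requested.
    return container


# Placeholder strings stand in for real preprocessor objects.
single = collect_preprocessors([("tokenizer", "tok")])
forced = collect_preprocessors([("tokenizer", "tok")], force_return_dict=True)
many = collect_preprocessors([("tokenizer", "tok"), ("text_normalizer", "norm")])
```

Callers that need a uniform return type can pass force_return_dict=True and always index by name.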

preprocessor_subfolder = 'preprocessor'

push_to_hub(repo_id, subfolder=None, commit_message=None, private=None, **kwargs)[source]

required_backends: List[str | Backends] = []

save(path, **kwargs)[source]

class hezar.preprocessors.preprocessor.PreprocessorsContainer[source]

Bases: OrderedDict

A class to hold the preprocessors by their name

property audio_feature_extractor

Return audio feature extractor if available

property image_processor

Return image processor if available

push_to_hub(repo_id, subfolder=None, commit_message=None, private=None)[source]

Push every preprocessor item in the container

save(path)[source]

Save every preprocessor item in the container

property text_normalizer

Return text normalizer if available

property tokenizer

Return tokenizer if available
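
The container pattern described above (an OrderedDict keyed by preprocessor name, with convenience properties and a save that fans out to every item) might look roughly like the following sketch. The class names NamedContainer and DummyPreprocessor are hypothetical, and two details are assumptions rather than documented behavior: that the items are stored under the keys "tokenizer" and "text_normalizer", and that a missing item yields None.

```python
from collections import OrderedDict


class DummyPreprocessor:
    """Hypothetical item that records where it was asked to save."""

    def __init__(self):
        self.saved_to = None

    def save(self, path):
        self.saved_to = path


class NamedContainer(OrderedDict):
    """Hypothetical stand-in: maps a preprocessor name to its object."""

    @property
    def tokenizer(self):
        # Assumption: a missing entry yields None rather than raising.
        return self.get("tokenizer")

    @property
    def text_normalizer(self):
        return self.get("text_normalizer")

    def save(self, path):
        # Delegate to every item held in the container.
        for item in self.values():
            item.save(path)


container = NamedContainer()
container["tokenizer"] = DummyPreprocessor()
container.save("my_model")
```

Subclassing OrderedDict keeps insertion order and ordinary dict access, so container["tokenizer"] and container.tokenizer refer to the same object in this sketch.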