hezar.preprocessors.text_normalizer module

class hezar.preprocessors.text_normalizer.TextNormalizer(config: TextNormalizerConfig, **kwargs)[source]

Bases: Preprocessor

A simple, configurable text normalizer.

classmethod load(hub_or_local_path, subfolder=None, config_filename=None, cache_dir=None, **kwargs) TextNormalizer[source]

Load a preprocessor or a pipeline of preprocessors from a local or Hub path. This method automatically detects every preprocessor in the path. If there is only one preprocessor, it is returned directly; if there are several, a dictionary of preprocessors is returned.

Subclasses must override this method, because the base implementation calls each subclass's own load for every preprocessor found in the repo.

Parameters:
  • hub_or_local_path – Path to hub or local repo

  • subfolder – Subfolder for the preprocessor.

  • config_filename – Optional normalizer config filename

  • cache_dir – Path to cache directory

  • **kwargs – Extra kwargs

Returns:

A Preprocessor subclass instance, or a dict of Preprocessor subclass instances
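
The return convention described above (a single preprocessor is returned directly, several come back as a dictionary) can be sketched as follows. This is an illustrative stand-in rather than hezar's own code: the helper name `unwrap_single` and the sample dictionary keys and values are hypothetical.

```python
from typing import Any, Dict, Union


def unwrap_single(found: Dict[str, Any]) -> Union[Any, Dict[str, Any]]:
    """Return the only item when exactly one was found, else the whole dict.

    Hypothetical helper: `found` maps a preprocessor name to a loaded instance.
    """
    if len(found) == 1:
        # Exactly one preprocessor: hand it back directly.
        return next(iter(found.values()))
    # Zero or several preprocessors: return the mapping unchanged.
    return found


# Placeholder strings stand in for real preprocessor instances.
single = unwrap_single({"text_normalizer": "normalizer-instance"})
several = unwrap_single(
    {"text_normalizer": "normalizer-instance", "tokenizer": "tokenizer-instance"}
)
```

Callers that accept either shape typically check `isinstance(result, dict)` before use.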

normalizer_config_file = 'normalizer_config.yaml'
preprocessor_subfolder = 'preprocessor'
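
These two attributes read like the fallback values for `config_filename` and `subfolder` when those arguments are left as None. Assuming that is how they are resolved (an assumption, not something this page confirms), the config file location could be computed as in the sketch below. The function `resolve_config_path` and the repo name `my-repo` are hypothetical.

```python
import posixpath

# Values copied from the class attributes listed above.
NORMALIZER_CONFIG_FILE = "normalizer_config.yaml"
PREPROCESSOR_SUBFOLDER = "preprocessor"


def resolve_config_path(root, subfolder=None, config_filename=None):
    """Join root, subfolder and filename, falling back to the defaults.

    Hypothetical helper; the fallback behaviour is an assumption.
    """
    return posixpath.join(
        root,
        subfolder or PREPROCESSOR_SUBFOLDER,
        config_filename or NORMALIZER_CONFIG_FILE,
    )


default_path = resolve_config_path("my-repo")
custom_path = resolve_config_path(
    "my-repo", subfolder="normalizer", config_filename="custom.yaml"
)
```

Under that assumption, `default_path` is `my-repo/preprocessor/normalizer_config.yaml`.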
push_to_hub(repo_id, commit_message: str | None = None, subfolder: str | None = None, config_filename: str | None = None, private: bool | None = None)[source]

Push normalizer config and other optional files to the Hub.

Parameters:
  • repo_id – Repo id on the Hub

  • commit_message – Commit message

  • subfolder – Optional subfolder for the normalizer

  • config_filename – Optional normalizer config filename

  • private – Whether to create a private repo if it does not exist already

required_backends: List[str | Backends] = [Backends.TOKENIZERS]

save(path, subfolder=None, config_filename=None)[source]

class hezar.preprocessors.text_normalizer.TextNormalizerConfig(replace_patterns: List[Tuple[str, str]] | List[List[str]] | List[Dict[str, List]] = None, nfkd: bool = True, nfkc: bool = True)[source]

Bases: PreprocessorConfig

name: str = 'text_normalizer'
nfkc: bool = True
nfkd: bool = True
replace_patterns: List[Tuple[str, str]] | List[List[str]] | List[Dict[str, List]] = None
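
To make the three config fields concrete, here is a minimal standard-library sketch of what a normalizer driven by them might do. It is not hezar's implementation (which lists the tokenizers backend as a requirement). The class `SketchNormalizerConfig`, the function `normalize`, the order of operations, and the handling of `replace_patterns` as (regex, replacement) pairs only are all assumptions made for illustration.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SketchNormalizerConfig:
    """Hypothetical stand-in mirroring the three fields listed above."""

    replace_patterns: List[Tuple[str, str]] = field(default_factory=list)
    nfkd: bool = True
    nfkc: bool = True


def normalize(text: str, config: SketchNormalizerConfig) -> str:
    # Assumed order: Unicode normalization first, then pattern replacement.
    if config.nfkd:
        text = unicodedata.normalize("NFKD", text)
    if config.nfkc:
        text = unicodedata.normalize("NFKC", text)
    # Only the (pattern, replacement) pair form is handled in this sketch.
    for pattern, replacement in config.replace_patterns:
        text = re.sub(pattern, replacement, text)
    return text


# Collapse runs of whitespace into a single space.
config = SketchNormalizerConfig(replace_patterns=[(r"\s+", " ")])
# \ufb01 is the "fi" ligature and \u00a0 is a no-break space.
result = normalize("\ufb01ne   text\u00a0here", config)
print(result)  # fine text here
```

The compatibility forms expand the ligature and the no-break space to plain ASCII before the whitespace pattern runs, which is why the replacement sees an ordinary space.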