hezar.preprocessors.text_normalizer module
- class hezar.preprocessors.text_normalizer.TextNormalizer(config: TextNormalizerConfig, **kwargs)
Bases: Preprocessor
A simple configurable text normalizer.
- classmethod load(hub_or_local_path, subfolder=None, config_filename=None, cache_dir=None, **kwargs) TextNormalizer
Load a preprocessor or a pipeline of preprocessors from a local or Hub path. This method automatically detects every preprocessor in the path. If only one is found, it is returned directly; if several are found, a dictionary of preprocessors is returned.
Subclasses must also override this method, because it is called internally for every preprocessor found in the repo.
- Parameters:
hub_or_local_path – Path to a Hub repo or a local directory
subfolder – Subfolder for the preprocessor.
config_filename – Optional normalizer config filename
force_return_dict – Whether to return a dict even if there’s only one preprocessor available on the repo
cache_dir – Path to cache directory
**kwargs – Extra kwargs
- Returns:
A Preprocessor subclass instance, or a dict of Preprocessor subclass instances
- normalizer_config_file = 'normalizer_config.yaml'
- preprocessor_subfolder = 'preprocessor'
- push_to_hub(repo_id, commit_message: str | None = None, subfolder: str | None = None, config_filename: str | None = None, private: bool | None = None)
Push normalizer config and other optional files to the Hub.
- Parameters:
repo_id – Repo id on the Hub
commit_message – Commit message
subfolder – Optional subfolder for the normalizer
config_filename – Optional normalizer config filename
private – Whether to create a private repo if it does not exist already
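Both load and push_to_hub accept optional subfolder and config_filename arguments, and the class exposes normalizer_config_file and preprocessor_subfolder. One plausible reading is that these class attributes act as fallbacks when the arguments are left as None. The sketch below illustrates that reading only; resolve_config_path is a hypothetical helper, not part of hezar, and the fallback behaviour is an assumption rather than something this page confirms.

```python
import posixpath
from typing import Optional

# Values copied from the class attributes documented above.
NORMALIZER_CONFIG_FILE = "normalizer_config.yaml"
PREPROCESSOR_SUBFOLDER = "preprocessor"


def resolve_config_path(
    subfolder: Optional[str] = None,
    config_filename: Optional[str] = None,
) -> str:
    """Hypothetical helper: build the repo-relative config path.

    Assumption: a None argument falls back to the class-level default.
    """
    subfolder = subfolder or PREPROCESSOR_SUBFOLDER
    config_filename = config_filename or NORMALIZER_CONFIG_FILE
    # Hub paths use forward slashes regardless of the local OS.
    return posixpath.join(subfolder, config_filename)


print(resolve_config_path())
print(resolve_config_path("custom", "my_normalizer.yaml"))
```

Under this assumption, calling load or push_to_hub without the two optional arguments would read from or write to preprocessor/normalizer_config.yaml inside the repo.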
- class hezar.preprocessors.text_normalizer.TextNormalizerConfig(replace_patterns: List[Tuple[str, str]] | List[List[str]] | List[Dict[str, List]] = None, nfkd: bool = True, nfkc: bool = True)
Bases: PreprocessorConfig
- name: str = 'text_normalizer'
- nfkc: bool = True
- nfkd: bool = True
- replace_patterns: List[Tuple[str, str]] | List[List[str]] | List[Dict[str, List]] = None
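To make the three config fields concrete, here is a minimal standard-library sketch of what Unicode NFKD/NFKC normalization and pattern replacement do to a string. It is not hezar's implementation: normalize_text is a hypothetical function, the order of the steps is an assumption, and only the list-of-pairs form of replace_patterns is handled.

```python
import re
import unicodedata
from typing import List, Optional, Tuple


def normalize_text(
    text: str,
    replace_patterns: Optional[List[Tuple[str, str]]] = None,
    nfkd: bool = True,
    nfkc: bool = True,
) -> str:
    """Hypothetical stand-in mirroring the TextNormalizerConfig fields."""
    if nfkd:
        # Compatibility decomposition: e.g. the ligature U+FB01 becomes "fi".
        text = unicodedata.normalize("NFKD", text)
    if nfkc:
        # Compatibility decomposition followed by canonical composition.
        text = unicodedata.normalize("NFKC", text)
    # Each entry is treated as a (regex pattern, replacement) pair.
    for pattern, replacement in replace_patterns or []:
        text = re.sub(pattern, replacement, text)
    return text


# Ligature is expanded and runs of whitespace are collapsed.
print(normalize_text("\ufb01le   name", replace_patterns=[(r"\s+", " ")]))
```

Note that in this sketch, enabling both flags gives the same result as NFKC alone, since NFKC recomposes whatever NFKD decomposed.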