hezar.preprocessors.text_normalizer module
- class hezar.preprocessors.text_normalizer.TextNormalizer(config: TextNormalizerConfig, **kwargs)
- Bases: Preprocessor
- A simple configurable text normalizer.
- classmethod load(hub_or_local_path, subfolder=None, config_filename=None, cache_dir=None, **kwargs) → TextNormalizer
- Load a preprocessor or a pipeline of preprocessors from a local or Hub path. This method automatically detects any preprocessor in the path. If there is only one preprocessor, it is returned directly; if there are more, a dictionary of preprocessors is returned.
- This method must also be overridden by subclasses, since the base implementation calls it for every preprocessor found in the repo.
- Parameters:
- hub_or_local_path – Path to a Hub or local repo
- subfolder – Subfolder for the preprocessor
- force_return_dict – Whether to return a dict even if there is only one preprocessor available in the repo
- cache_dir – Path to the cache directory
- **kwargs – Extra kwargs
 
- Returns:
- A Preprocessor subclass or a dict of Preprocessor subclass instances 
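
The single-or-dictionary return convention described for `load` can be sketched in plain Python. This is only an illustration under stated assumptions, not Hezar's implementation: `collapse_single` and the placeholder string values are hypothetical, and `force_return_dict` is taken from the parameter list above.

```python
from typing import Any, Dict


def collapse_single(found: Dict[str, Any], force_return_dict: bool = False) -> Any:
    """Return the lone value if exactly one item was found, else the whole dict.

    Hypothetical helper: it only mirrors the documented return convention.
    """
    if len(found) == 1 and not force_return_dict:
        # Exactly one preprocessor was detected: hand it back directly.
        return next(iter(found.values()))
    # Several preprocessors were detected, or a dict was explicitly requested.
    return found


# Placeholder strings stand in for real preprocessor instances.
single = collapse_single({"text_normalizer": "normalizer"})
forced = collapse_single({"text_normalizer": "normalizer"}, force_return_dict=True)
several = collapse_single({"text_normalizer": "normalizer", "tokenizer": "tok"})
```

Callers that may receive either shape would typically check `isinstance(result, dict)` before use.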
 
 - normalizer_config_file = 'normalizer_config.yaml'
 - preprocessor_subfolder = 'preprocessor'
 - push_to_hub(repo_id, commit_message: str | None = None, subfolder: str | None = None, config_filename: str | None = None, private: bool | None = None)
- Push normalizer config and other optional files to the Hub.
- Parameters:
- repo_id – Repo id on the Hub 
- commit_message – Commit message 
- subfolder – Optional subfolder for the normalizer 
- config_filename – Optional normalizer config filename 
- private – Whether to create a private repo if it does not exist already 
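
As a rough sketch of how the optional `subfolder` and `config_filename` arguments might resolve to a path inside the repo: the snippet below assumes that omitted values fall back to the class attributes `preprocessor_subfolder` and `normalizer_config_file` listed above. The documentation does not state this fallback explicitly, and `resolve_config_path` is a hypothetical helper, not part of Hezar.

```python
import posixpath
from typing import Optional

# Values copied from the class attributes documented above.
NORMALIZER_CONFIG_FILE = "normalizer_config.yaml"
PREPROCESSOR_SUBFOLDER = "preprocessor"


def resolve_config_path(
    subfolder: Optional[str] = None,
    config_filename: Optional[str] = None,
) -> str:
    """Build the repo-relative path of the normalizer config file.

    Assumption: None falls back to the documented class-level defaults.
    """
    subfolder = subfolder or PREPROCESSOR_SUBFOLDER
    config_filename = config_filename or NORMALIZER_CONFIG_FILE
    # Hub paths use forward slashes regardless of the local OS.
    return posixpath.join(subfolder, config_filename)


default_path = resolve_config_path()
custom_path = resolve_config_path(subfolder="normalizer", config_filename="config.yaml")
```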
 
 
 
- class hezar.preprocessors.text_normalizer.TextNormalizerConfig(replace_patterns: List[Tuple[str, str]] | List[List[str]] | List[Dict[str, List]] = None, nfkd: bool = True, nfkc: bool = True)
- Bases: PreprocessorConfig
- name: str = 'text_normalizer'
 - nfkc: bool = True
 - nfkd: bool = True
 - replace_patterns: List[Tuple[str, str]] | List[List[str]] | List[Dict[str, List]] = None
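
To make the three config fields more concrete, here is a minimal standard-library sketch of what such options could do to a string. It is not Hezar's implementation: `normalize_text` is a hypothetical function, the order of the steps is an assumption, and each entry of `replace_patterns` is assumed to be a `(regex pattern, replacement)` pair (only the list-of-tuples form is shown).

```python
import re
import unicodedata
from typing import List, Optional, Tuple


def normalize_text(
    text: str,
    replace_patterns: Optional[List[Tuple[str, str]]] = None,
    nfkd: bool = True,
    nfkc: bool = True,
) -> str:
    """Apply Unicode normalization, then regex replacements (assumed order)."""
    if nfkd:
        # Compatibility decomposition, e.g. the ligature U+FB01 becomes "fi".
        text = unicodedata.normalize("NFKD", text)
    if nfkc:
        # Compatibility composition, e.g. "e" + U+0301 becomes U+00E9.
        text = unicodedata.normalize("NFKC", text)
    for pattern, replacement in replace_patterns or []:
        # Each pair is treated as (regex pattern, replacement string).
        text = re.sub(pattern, replacement, text)
    return text


# The ligature is expanded and runs of whitespace are collapsed.
result = normalize_text("\ufb01le   name", replace_patterns=[(r"\s+", " ")])
print(result)
```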