hezar.data.datasets.ocr_dataset module
- class hezar.data.datasets.ocr_dataset.OCRDataset(config: OCRDatasetConfig, split=None, preprocessor=None, **kwargs)
- Bases: Dataset
- General OCR dataset class.
- The OCR dataset supports two types of image-to-text datasets. One is for tokenizer-based models, in which the labels are tokens. The other is for character-level models, in which the labels are split into characters and then converted to ids. This behavior is set by text_split_type in the config, which can be either tokenize or char_split.
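To make the char_split case concrete, here is a minimal sketch of character-level label encoding in plain Python. It does not use hezar itself. The vocabulary and the helper names `char_split_encode` and `decode` are hypothetical and only illustrate the idea of splitting a label into characters and mapping each one to an id through an id2label-style mapping.

```python
from typing import Dict, List

# Hypothetical vocabulary; a real dataset would supply its own id2label.
id2label: Dict[int, str] = {0: "a", 1: "b", 2: "c", 3: " "}

# Invert the mapping so characters can be looked up by value.
label2id: Dict[str, int] = {char: idx for idx, char in id2label.items()}


def char_split_encode(text: str, mapping: Dict[str, int]) -> List[int]:
    """Split the text into characters and map each character to its id."""
    return [mapping[char] for char in text]


def decode(ids: List[int], mapping: Dict[int, str]) -> str:
    """Map ids back to characters and join them into a string."""
    return "".join(mapping[i] for i in ids)


ids = char_split_encode("cab a", label2id)
text = decode(ids, id2label)
```

In the tokenize case the same role would be played by a tokenizer rather than a per-character lookup, so the labels are token ids instead of character ids.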
- class hezar.data.datasets.ocr_dataset.OCRDatasetConfig(path: str | None = None, task: hezar.constants.TaskType = TaskType.IMAGE2TEXT, max_size: int | float | None = None, hf_load_kwargs: dict | None = None, text_split_type: str | hezar.data.datasets.ocr_dataset.TextSplitType = TextSplitType.CHAR_SPLIT, id2label: typing.Dict[int, str] = <factory>, text_column: str = 'label', images_paths_column: str = 'image_path', max_length: int | None = None, invalid_characters: list | None = None, reverse_text: bool | None = None, reverse_digits: bool | None = None)
- Bases: DatasetConfig
- Configuration class for OCR datasets.
- Parameters:
- path (str) – Path to the dataset. 
- text_split_type (TextSplitType) – Type of text splitting (CHAR_SPLIT or TOKENIZE). 
- id2label (Dict[int, str]) – Mapping of label IDs to characters. 
- text_column (str) – Column name for text in the dataset. 
- images_paths_column (str) – Column name for image paths in the dataset. 
- max_length (int) – Maximum length of text. 
- invalid_characters (list) – List of invalid characters. 
- reverse_text (bool) – Whether to reverse the text. 
- reverse_digits (bool) – Whether to reverse the digits in text. 
 
 - id2label: Dict[int, str]
 - images_paths_column: str = 'image_path'
 - invalid_characters: list = None
 - max_length: int = None
 - name: str = 'ocr'
 - path: str = None
 - reverse_digits: bool = None
 - reverse_text: bool = None
 - text_column: str = 'label'
 - text_split_type: str | TextSplitType = 'char_split'
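
The text-related options above (invalid_characters, reverse_digits, reverse_text, max_length) can be pictured with a small plain-Python sketch. This is not hezar's implementation. The function `clean_label` is hypothetical, and its behavior rests on assumptions: invalid characters are dropped, each run of digits is reversed in place, the whole string is reversed, and max_length truncates the result. The library may instead filter out over-long samples or apply the steps in a different order.

```python
import re
from typing import List, Optional


def clean_label(
    text: str,
    invalid_characters: Optional[List[str]] = None,
    reverse_text: bool = False,
    reverse_digits: bool = False,
    max_length: Optional[int] = None,
) -> str:
    """Hypothetical label clean-up; the semantics here are assumptions."""
    if invalid_characters:
        # Assumption: invalid characters are simply removed from the label.
        text = "".join(ch for ch in text if ch not in invalid_characters)
    if reverse_digits:
        # Assumption: each contiguous run of digits is reversed in place.
        text = re.sub(r"\d+", lambda m: m.group(0)[::-1], text)
    if reverse_text:
        # Assumption: the entire string is reversed.
        text = text[::-1]
    if max_length is not None:
        # Assumption: longer labels are truncated rather than discarded.
        text = text[:max_length]
    return text


cleaned = clean_label("ab#12", invalid_characters=["#"], reverse_digits=True)
```

Options like these tend to matter for right-to-left scripts, where digit runs and full strings can be stored in a different order than they are rendered.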