hezar.data.datasets.ocr_dataset module

class hezar.data.datasets.ocr_dataset.OCRDataset(config: OCRDatasetConfig, split=None, **kwargs)[source]

Bases: Dataset

General OCR dataset class.

OCRDataset supports two kinds of image-to-text datasets. The first is for tokenizer-based models, where the labels are tokens. The second is for character-level models, where the text is split into characters and each character is then converted to an ID. The behavior is selected by text_split_type in the config, which can be either tokenize or char_split.
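The character-level mode described above can be sketched as follows. This is a minimal illustration, not Hezar's implementation: the vocabulary, the helper name char_split_encode, the padding ID, and the choice to skip unknown characters are all assumptions made for the example. In tokenize mode the labels would instead come from the tokenizer loaded from tokenizer_path, which is not shown here.

```python
from typing import Dict, List

# Hypothetical character vocabulary; in practice this would come from the
# id2label field of the dataset config.
id2label: Dict[int, str] = {0: "<pad>", 1: "a", 2: "b", 3: "c"}
label2id: Dict[str, int] = {char: idx for idx, char in id2label.items()}


def char_split_encode(text: str, max_length: int, pad_id: int = 0) -> List[int]:
    """Split text into characters, map each one to its ID, then truncate or pad.

    Skipping unknown characters and right-padding are assumptions of this sketch.
    """
    ids = [label2id[ch] for ch in text if ch in label2id]
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


labels = char_split_encode("cab", max_length=5)
print(labels)  # [3, 1, 2, 0, 0]
```

Each character becomes one label ID, so the label sequence length tracks the text length up to max_length.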

required_backends: List[str | Backends] = [Backends.SCIKIT]

class hezar.data.datasets.ocr_dataset.OCRDatasetConfig(task: TaskType = TaskType.IMAGE2TEXT, path: str = None, text_split_type: str | TextSplitType = TextSplitType.TOKENIZE, tokenizer_path: str = None, id2label: Dict[int, str] = <factory>, text_column: str = 'label', images_paths_column: str = 'image_path', max_length: int = None, invalid_characters: list = None, reverse_text: bool = None, reverse_digits: bool = None, image_processor_config: ImageProcessorConfig = None)[source]

Bases: DatasetConfig

Configuration class for OCR datasets.

Parameters:
  • path (str) – Path to the dataset.

  • text_split_type (TextSplitType) – Type of text splitting (CHAR_SPLIT or TOKENIZE).

  • tokenizer_path (str) – Path to the tokenizer file.

  • id2label (Dict[int, str]) – Mapping of label IDs to characters.

  • text_column (str) – Column name for text in the dataset.

  • images_paths_column (str) – Column name for image paths in the dataset.

  • max_length (int) – Maximum length of text.

  • invalid_characters (list) – List of invalid characters.

  • reverse_text (bool) – Whether to reverse the text.

  • reverse_digits (bool) – Whether to reverse the digits in text.

  • image_processor_config (ImageProcessorConfig) – Configuration for image processing.

id2label: Dict[int, str]
image_processor_config: ImageProcessorConfig = None
images_paths_column: str = 'image_path'
invalid_characters: list = None
max_length: int = None
name: str = 'ocr'
path: str = None
reverse_digits: bool = None
reverse_text: bool = None
task: TaskType = 'image2text'
text_column: str = 'label'
text_split_type: str | TextSplitType = 'tokenize'
tokenizer_path: str = None

class hezar.data.datasets.ocr_dataset.TextSplitType(value)[source]

Bases: str, Enum

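Enumeration of the supported text splitting types for OCR labels: CHAR_SPLIT for character-level labels and TOKENIZE for tokenizer-based labels.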

CHAR_SPLIT = 'char_split'
TOKENIZE = 'tokenize'
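Because the enum derives from both str and Enum, its members compare equal to their string values, which is consistent with text_split_type being typed as str | TextSplitType. The sketch below uses a local stand-in class that mirrors the documented members rather than importing from Hezar, and normalize_split_type is a hypothetical helper introduced only for illustration.

```python
from enum import Enum


class TextSplitType(str, Enum):
    """Local stand-in mirroring the documented members; not imported from Hezar."""

    CHAR_SPLIT = "char_split"
    TOKENIZE = "tokenize"


def normalize_split_type(value):
    """Hypothetical helper: accept a plain string or a member, return the member.

    Raises ValueError for a value that is not a defined member.
    """
    return TextSplitType(value)


from_string = normalize_split_type("char_split")
from_member = normalize_split_type(TextSplitType.TOKENIZE)

# Members of a str-based enum compare equal to their underlying string.
same_as_string = TextSplitType.TOKENIZE == "tokenize"
```

Normalizing once at the boundary lets the rest of the code compare against enum members instead of raw strings.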