hezar.data.dataset_processors module

Dataset processors are callable classes passed as map functions to any dataset on the Hub. Note that the main dataset classes already perform their processing inside the __getitem__ method; these classes are only needed when the dataset has been loaded with the HuggingFace datasets library and you want to take advantage of the multiprocessing, batch processing, and caching features of HF datasets.

Example:
>>> from datasets import load_dataset
>>> from hezar.data import SpeechRecognitionDatasetProcessor

>>> data_processor = SpeechRecognitionDatasetProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
>>> dataset = load_dataset("hezarai/common-voice-13-fa")
>>> dataset = dataset.map(data_processor, batched=True, batch_size=1000)
class hezar.data.dataset_processors.DatasetProcessor(*args, **kwargs)[source]

Bases: object

The base callable dataset processor class that can handle both single and batched mode dataset mapping.

process_batch(data: LazyBatch, return_tensors=None, **kwargs)[source]

Process a batch of data examples.

Parameters:
  • data – A batch of data examples (a LazyBatch dict of columns)

  • return_tensors – The type of the returned tensors (list, torch, numpy)

  • **kwargs – Additional arguments

Returns:

The updated data dict

process_single(data: LazyRow, return_tensors=None, **kwargs)[source]

Process a single data example.

Parameters:
  • data – A data sample dict

  • return_tensors – The type of the returned tensors (list, torch, numpy)

  • **kwargs – Additional arguments

Returns:

The updated data dict

required_backends = [Backends.DATASETS]
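The single/batched dispatch of the base class can be illustrated with a toy processor. This is a minimal sketch of the pattern, not hezar's actual implementation: the real class distinguishes LazyRow from LazyBatch inputs, while the sketch below uses a plain list check as a stand-in, and the `text`/`length` columns are invented for illustration.

```python
class ToyProcessor:
    """Minimal stand-in for DatasetProcessor's single/batched dispatch."""

    def __call__(self, data, **kwargs):
        # With `datasets.map(..., batched=True)` each column arrives as a list.
        if isinstance(data["text"], list):
            return self.process_batch(data, **kwargs)
        return self.process_single(data, **kwargs)

    def process_single(self, data, **kwargs):
        data["length"] = len(data["text"])
        return data

    def process_batch(self, data, **kwargs):
        data["length"] = [len(text) for text in data["text"]]
        return data


processor = ToyProcessor()
print(processor({"text": "hello"}))          # {'text': 'hello', 'length': 5}
print(processor({"text": ["hello", "hi"]}))  # {'text': ['hello', 'hi'], 'length': [5, 2]}
```

Because a single __call__ handles both modes, the same processor instance works with `dataset.map(processor)` and `dataset.map(processor, batched=True)`.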
class hezar.data.dataset_processors.ImageCaptioningDatasetProcessor(image_processor, tokenizer, max_length=None, padding=None)[source]

Bases: DatasetProcessor

Dataset processor for image captioning datasets. This class handles tokenization and image processing.

process_batch(data, return_tensors=None, padding=None, max_length=None)[source]

Process image and tokenize captions for a batch of data samples.

Parameters:
  • data – A batch of data examples containing the images and their captions

  • padding – Padding type, e.g. max_length or longest.

  • max_length – Max length used when padding is set to max_length or when the labels must be truncated.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

Returns:

A dict containing the pixel values tensor of the processed images along with the labels' token ids and attention masks.

process_single(data, return_tensors=None, padding=None, max_length=None)[source]

Process image and tokenize captions for a single data sample.

Parameters:
  • data – A data example containing the image and its caption

  • padding – Padding type, e.g. max_length or longest.

  • max_length – Max length used when padding is set to max_length or when the labels must be truncated.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

Returns:

A dict containing the pixel values tensor of the processed image along with the labels' token ids and attention mask.
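A common step in caption label preparation (here assumed, not taken from hezar's source) is to replace padded label positions with -100 so that the loss function ignores them. The token ids and pad id below are made up for illustration.

```python
PAD_ID = 0  # hypothetical pad token id


def pad_and_mask(token_ids, max_length, pad_id=PAD_ID, ignore_index=-100):
    """Pad `token_ids` to `max_length`; return (labels, attention_mask).

    Padded positions get attention_mask 0 and label `ignore_index` so the
    training loss skips them.
    """
    pad_count = max_length - len(token_ids)
    attention_mask = [1] * len(token_ids) + [0] * pad_count
    padded = token_ids + [pad_id] * pad_count
    labels = [t if m == 1 else ignore_index for t, m in zip(padded, attention_mask)]
    return labels, attention_mask


labels, mask = pad_and_mask([7, 12, 9], max_length=5)
print(labels)  # [7, 12, 9, -100, -100]
print(mask)    # [1, 1, 1, 0, 0]
```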

class hezar.data.dataset_processors.OCRDatasetProcessor(image_processor, tokenizer=None, text_split_type='char_split', max_length=None, reverse_digits=False, id2label=None, image_field='image_path', text_field='text')[source]

Bases: DatasetProcessor

Dataset processor class for OCR which can handle both tokenizer-based or character-split-based datasets.

process_batch(data, return_tensors=None)[source]

Process a batch of image-to-text OCR examples.

Parameters:
  • data – A batch of data examples containing image paths and corresponding texts.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

Returns:

Batch of processed inputs with pixel values and text labels.

Return type:

dict

process_single(data, return_tensors=None)[source]

Process a single image-to-text OCR example.

Parameters:
  • data – A data example containing an image path and corresponding text.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

Returns:

Processed inputs with pixel values and text labels.

Return type:

dict
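The character-split labeling mode (text_split_type="char_split") can be sketched as mapping each character of the target text to a label id through an id2label mapping. The mapping below is invented for illustration; real OCR datasets ship their own character inventory.

```python
# Hypothetical id2label mapping for illustration only.
id2label = {1: "س", 2: "ل", 3: "ا", 4: "م"}
label2id = {char: idx for idx, char in id2label.items()}


def char_split_labels(text):
    """Turn a transcript string into per-character label ids."""
    return [label2id[char] for char in text]


print(char_split_labels("سلام"))  # [1, 2, 3, 4]
```

In tokenizer-based mode the same class would instead delegate label encoding to the given tokenizer.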

class hezar.data.dataset_processors.SequenceLabelingDatasetProcessor(tokenizer, label_all_tokens=True, ignore_index=-100, max_length=None, padding=None)[source]

Bases: DatasetProcessor

Dataset processor class for sequence labeling datasets. Handles tokenization and label alignment.

process_batch(data, return_tensors=None, padding=None, max_length=None)[source]

Process a batch of sequence labeling examples.

Parameters:
  • data – A batch of examples, containing tokens and labels.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

  • padding – Padding strategy.

  • max_length – Maximum sequence length.

Returns:

Tokenized and aligned batch data.

Return type:

dict

process_single(data, return_tensors=None, padding=None, max_length=None)[source]

Process a single example of sequence labeling data.

Parameters:
  • data – A single data example containing tokens and labels.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

  • padding – Padding strategy.

  • max_length – Maximum sequence length.

Returns:

Tokenized and aligned input data.

Return type:

dict
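The label-alignment step can be sketched as follows: when a tokenizer splits words into subwords, word-level labels must be expanded to token level. With label_all_tokens=True every subword inherits its word's label; otherwise only the first subword keeps it and the rest get ignore_index (-100). This is a sketch of the standard alignment recipe, not hezar's exact code; the word_ids list mimics what HF fast tokenizers return.

```python
def align_labels(word_labels, word_ids, label_all_tokens=True, ignore_index=-100):
    """Expand word-level labels to token level following `word_ids`."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:                 # special token ([CLS], [SEP], pad)
            aligned.append(ignore_index)
        elif word_id != previous:           # first subword of a word
            aligned.append(word_labels[word_id])
        else:                               # subsequent subword of the same word
            aligned.append(word_labels[word_id] if label_all_tokens else ignore_index)
        previous = word_id
    return aligned


# Two words with labels [3, 7]; the second word splits into two subwords.
print(align_labels([3, 7], [None, 0, 1, 1, None]))
# [-100, 3, 7, 7, -100]
print(align_labels([3, 7], [None, 0, 1, 1, None], label_all_tokens=False))
# [-100, 3, 7, -100, -100]
```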

class hezar.data.dataset_processors.SpeechRecognitionDatasetProcessor(feature_extractor, tokenizer, sampling_rate=16000, audio_array_padding=None, max_audio_array_length=None, labels_padding=None, labels_max_length=None, audio_column='audio', transcript_column='transcript')[source]

Bases: DatasetProcessor

Processor class for speech recognition datasets. Handles audio feature extraction and labels tokenization.

process_batch(data, return_tensors=None)[source]

Process a batch of speech recognition examples.

Parameters:
  • data – A batch of data examples containing audio arrays and their corresponding transcripts.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

Returns:

Batch of processed input features and labels.

Return type:

dict

process_single(data, return_tensors=None)[source]

Process a single speech recognition example.

Parameters:
  • data – A data example containing audio and its transcript.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

Returns:

Processed input features and labels.

Return type:

dict
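The audio_array_padding step can be sketched in plain Python: raw waveforms in a batch have different lengths, so under the "longest" strategy they are zero-padded to the longest array in the batch (the real class delegates this to the feature extractor; the float values below are toy data, not real audio).

```python
def pad_audio_batch(arrays, padding="longest", max_length=None):
    """Zero-pad a batch of 1-D audio arrays to a common length."""
    target = max(len(a) for a in arrays) if padding == "longest" else max_length
    return [a + [0.0] * (target - len(a)) for a in arrays]


batch = [[0.1, 0.2, 0.3], [0.5]]
print(pad_audio_batch(batch))  # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0]]
```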

class hezar.data.dataset_processors.TextClassificationDatasetProcessor(tokenizer, max_length=None, padding=None)[source]

Bases: DatasetProcessor

Processor class for text classification datasets. Handles tokenization of the texts.

process_batch(data, return_tensors=None, padding=None, max_length=None)[source]

Process a batch of examples for text classification.

Parameters:
  • data – A batch of data example dicts

  • return_tensors – The type of the returned tensors (list, torch, numpy)

  • padding – Token ids padding type

  • max_length – Max input length

Returns:

The updated data dictionary

process_single(data, return_tensors=None, padding=None, max_length=None)[source]

Process a single example for text classification.

Parameters:
  • data – A single data example dict

  • return_tensors – The type of the returned tensors (list, torch, numpy)

  • padding – Token ids padding type

  • max_length – Max input length

Returns:

The updated data dictionary
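What process_batch produces can be sketched as follows: the incoming batch is a dict of columns, and tokenized columns (token ids and attention mask) are added to it. The word-to-id vocabulary below is invented for illustration; the real class uses a hezar Tokenizer.

```python
# Hypothetical vocabulary for illustration only.
vocab = {"<pad>": 0, "this": 1, "is": 2, "great": 3, "bad": 4}


def process_batch(data, max_length=4, padding="max_length"):
    """Toy whitespace tokenization: add token_ids/attention_mask columns."""
    token_ids, attention_mask = [], []
    for text in data["text"]:
        ids = [vocab[word] for word in text.split()][:max_length]
        mask = [1] * len(ids)
        if padding == "max_length":
            ids += [0] * (max_length - len(ids))
            mask += [0] * (max_length - len(mask))
        token_ids.append(ids)
        attention_mask.append(mask)
    data["token_ids"] = token_ids
    data["attention_mask"] = attention_mask
    return data


batch = {"text": ["this is great", "bad"], "label": [1, 0]}
out = process_batch(batch)
print(out["token_ids"])       # [[1, 2, 3, 0], [4, 0, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 0], [1, 0, 0, 0]]
```

Untouched columns (here `label`) pass through unchanged, which is what allows the processor to be dropped straight into `dataset.map`.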

class hezar.data.dataset_processors.TextSummarizationDatasetProcessor(tokenizer, prefix=None, max_length=None, labels_max_length=None, text_field='text', summary_field='summary', padding=None)[source]

Bases: DatasetProcessor

Processor class for text summarization datasets. Handles tokenization of the inputs and labels.

process_batch(data, return_tensors=None, padding=None, max_length=None, labels_max_length=None)[source]

Process a batch of examples for text summarization.

Parameters:
  • data – A batch of examples containing texts and summaries.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

  • padding – Padding strategy.

  • max_length – Max length for input texts.

  • labels_max_length – Max length for summary labels.

Returns:

Tokenized inputs and labels for summarization task.

Return type:

dict

process_single(data, return_tensors=None, padding=None, max_length=None, labels_max_length=None)[source]

Process a single example for text summarization.

Parameters:
  • data – A data example containing text and summary.

  • return_tensors – The type of the returned tensors (list, torch, numpy)

  • padding – Padding strategy.

  • max_length – Max length for input text.

  • labels_max_length – Max length for summary labels.

Returns:

Tokenized inputs and labels for summarization task.

Return type:

dict
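The summarization-specific step can be sketched as follows: an optional task prefix (as used by T5-style models) is prepended to the input text, and inputs and labels are truncated to separate max lengths. The whitespace "tokenizer" below is a stand-in for illustration; the real class uses a hezar Tokenizer and the configured text_field/summary_field.

```python
def process_single(data, prefix=None, max_length=None, labels_max_length=None,
                   text_field="text", summary_field="summary"):
    """Toy sketch: prefix the input, split on whitespace, truncate both sides."""
    text = (prefix or "") + data[text_field]
    inputs = text.split()[:max_length]                      # input "tokens"
    labels = data[summary_field].split()[:labels_max_length]  # label "tokens"
    return {"token_ids": inputs, "labels": labels}


sample = {"text": "a very long article body", "summary": "short summary"}
out = process_single(sample, prefix="summarize: ", max_length=4, labels_max_length=2)
print(out["token_ids"])  # ['summarize:', 'a', 'very', 'long']
print(out["labels"])     # ['short', 'summary']
```

Keeping max_length and labels_max_length separate matters because inputs (articles) are typically much longer than their summaries.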