hezar.data.dataset_processors module¶
Dataset processors are callable classes that can be passed as map functions to any dataset on the Hub. Note that the main dataset classes already implement this processing in their __getitem__ method; these classes are only needed when the dataset has been loaded with the HuggingFace datasets library and you want to take advantage of its multiprocessing, batch processing, and caching functionality.
Example:
>>> from datasets import load_dataset
>>> from hezar.data import SpeechRecognitionDatasetProcessor
>>> data_processor = SpeechRecognitionDatasetProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
>>> dataset = load_dataset("hezarai/common-voice-13-fa")
>>> dataset = dataset.map(data_processor, batched=True, batch_size=1000)
- class hezar.data.dataset_processors.DatasetProcessor(*args, **kwargs)[source]¶
Bases:
object
The base callable dataset processor class that can handle dataset mapping in both single and batched mode.
- process_batch(data: LazyBatch, return_tensors=None, **kwargs)[source]¶
Process a batch of data examples.
- Parameters:
data – A batch of data examples (a dict of lists)
return_tensors – The type of tensors to return (list, torch, numpy)
**kwargs – Additional arguments
- Returns:
The updated data dict
- process_single(data: LazyRow, return_tensors=None, **kwargs)[source]¶
Process a single data example.
- Parameters:
data – A data sample dict
return_tensors – The type of tensors to return (list, torch, numpy)
**kwargs – Additional arguments
- Returns:
The updated data dict
- required_backends = [Backends.DATASETS]¶
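The single-vs-batch dispatch above can be illustrated with a minimal pure-Python sketch. The class below is a hypothetical stand-in, not the actual implementation: the real class distinguishes the datasets library's LazyRow and LazyBatch types, while this sketch simply checks whether the mapped dict holds lists (batched mode) or scalars (single mode):

```python
# Minimal sketch of the single-vs-batch dispatch pattern behind
# DatasetProcessor. Hypothetical stand-in: the real class inspects
# the datasets library's LazyRow/LazyBatch types rather than values.

class ToyDatasetProcessor:
    def __call__(self, data, **kwargs):
        # A batched map passes a dict of lists; a single-example map
        # passes a dict of scalar fields.
        if all(isinstance(v, list) for v in data.values()):
            return self.process_batch(data, **kwargs)
        return self.process_single(data, **kwargs)

    def process_single(self, data, **kwargs):
        return {"text": data["text"].lower()}

    def process_batch(self, data, **kwargs):
        return {"text": [t.lower() for t in data["text"]]}


processor = ToyDatasetProcessor()
print(processor({"text": "Hello"}))             # single example
print(processor({"text": ["Hello", "WORLD"]}))  # batch of examples
```

This is why a single processor instance can be passed to dataset.map either with or without batched=True.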
- class hezar.data.dataset_processors.ImageCaptioningDatasetProcessor(image_processor, tokenizer, max_length=None, padding=None)[source]¶
Bases:
DatasetProcessor
Dataset processor for image captioning datasets. This class handles tokenization and image processing.
- process_batch(data, return_tensors=None, padding=None, max_length=None)[source]¶
Process image and tokenize captions for a batch of data samples.
- Parameters:
data – A batch of data examples containing the images and their captions
padding – Padding strategy, e.g. max_length, longest.
max_length – Max length to use when padding is set to max_length or when the labels must be truncated.
return_tensors – The type of tensors to return (list, torch, numpy)
- Returns:
A dict containing the pixel values tensor of the processed images and the labels' token ids and attention masks.
- process_single(data, return_tensors=None, padding=None, max_length=None)[source]¶
Process image and tokenize captions for a single data sample.
- Parameters:
data – A data example containing the image and its caption
padding – Padding strategy, e.g. max_length, longest.
max_length – Max length to use when padding is set to max_length or when the labels must be truncated.
return_tensors – The type of tensors to return (list, torch, numpy)
- Returns:
A dict containing the pixel values tensor of the processed image and the labels' token ids and attention mask.
- class hezar.data.dataset_processors.OCRDatasetProcessor(image_processor, tokenizer=None, text_split_type='char_split', max_length=None, reverse_digits=False, id2label=None, image_field='image_path', text_field='text')[source]¶
Bases:
DatasetProcessor
Dataset processor class for OCR which can handle both tokenizer-based or character-split-based datasets.
- process_batch(data, return_tensors=None)[source]¶
Process a batch of image-to-text OCR examples.
- Parameters:
data – A batch of data examples containing image paths and corresponding texts.
return_tensors – The type of tensors to return (list, torch, numpy)
- Returns:
Batch of processed inputs with pixel values and text labels.
- Return type:
dict
- process_single(data, return_tensors=None)[source]¶
Process a single image-to-text OCR example.
- Parameters:
data – A data example containing an image path and corresponding text.
return_tensors – The type of tensors to return (list, torch, numpy)
- Returns:
Processed inputs with pixel values and text labels.
- Return type:
dict
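When text_split_type is "char_split" and no tokenizer is given, labels are produced by splitting the text into characters and mapping each one through the id2label vocabulary. A minimal sketch of that encoding path, with an illustrative id2label mapping (the helper name and padding behavior here are assumptions, not the class's actual internals):

```python
# Sketch of character-split label encoding for OCR, as used when
# text_split_type="char_split" and no tokenizer is provided.
# The id2label mapping below is illustrative only.

id2label = {0: "<pad>", 1: "a", 2: "b", 3: "c"}
label2id = {char: idx for idx, char in id2label.items()}

def encode_text(text, max_length=None, pad_id=0):
    """Split the text into characters and map each to its label id,
    optionally truncating/padding to a fixed length."""
    ids = [label2id[ch] for ch in text]
    if max_length is not None:
        ids = ids[:max_length] + [pad_id] * (max_length - len(ids))
    return ids

print(encode_text("cab"))                # [3, 1, 2]
print(encode_text("ab", max_length=4))   # [1, 2, 0, 0]
```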
- class hezar.data.dataset_processors.SequenceLabelingDatasetProcessor(tokenizer, label_all_tokens=True, ignore_index=-100, max_length=None, padding=None)[source]¶
Bases:
DatasetProcessor
Dataset processor class for sequence labeling datasets. Handles tokenization and label alignment.
- process_batch(data, return_tensors=None, padding=None, max_length=None)[source]¶
Process a batch of sequence labeling examples.
- Parameters:
data – A batch of examples, containing tokens and labels.
return_tensors – The type of tensors to return (list, torch, numpy)
padding – Padding strategy.
max_length – Maximum sequence length.
- Returns:
Tokenized and aligned batch data.
- Return type:
dict
- process_single(data, return_tensors=None, padding=None, max_length=None)[source]¶
Process a single example of sequence labeling data.
- Parameters:
data – A single data example containing tokens and labels.
return_tensors – The type of tensors to return (list, torch, numpy)
padding – Padding strategy.
max_length – Maximum sequence length.
- Returns:
Tokenized and aligned input data.
- Return type:
dict
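Label alignment here follows the common HuggingFace pattern for token classification: word-level labels are mapped onto sub-word tokens via the tokenizer's word ids, special tokens receive ignore_index, and label_all_tokens controls whether non-first sub-words inherit the word's label or are masked out. A minimal sketch under those assumptions (word_ids stands in for the tokenizer's word_ids() output; the function name is illustrative):

```python
def align_labels(word_ids, word_labels, label_all_tokens=True, ignore_index=-100):
    """Align word-level labels with sub-word tokens.

    word_ids: one entry per token; None marks special tokens
    such as [CLS], [SEP], or padding.
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens are ignored by the loss function.
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # The first sub-word of a word gets the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Remaining sub-words: label them too, or mask them out.
            aligned.append(word_labels[word_id] if label_all_tokens else ignore_index)
        previous_word_id = word_id
    return aligned

# Two words, the first split into two sub-words; labels: B-LOC=1, O=0
word_ids = [None, 0, 0, 1, None]
print(align_labels(word_ids, [1, 0]))                          # [-100, 1, 1, 0, -100]
print(align_labels(word_ids, [1, 0], label_all_tokens=False))  # [-100, 1, -100, 0, -100]
```

Masking with -100 works because PyTorch's cross-entropy loss ignores that index by default, which is why it is the class's default ignore_index.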
- class hezar.data.dataset_processors.SpeechRecognitionDatasetProcessor(feature_extractor, tokenizer, sampling_rate=16000, audio_array_padding=None, max_audio_array_length=None, labels_padding=None, labels_max_length=None, audio_column='audio', transcript_column='transcript')[source]¶
Bases:
DatasetProcessor
Processor class for speech recognition datasets. Handles audio feature extraction and labels tokenization.
- process_batch(data, return_tensors=None)[source]¶
Process a batch of speech recognition examples.
- Parameters:
data – A batch of data examples containing audio arrays and their corresponding transcripts.
return_tensors – The type of tensors to return (list, torch, numpy)
- Returns:
Batch of processed input features and labels.
- Return type:
dict
- process_single(data, return_tensors=None)[source]¶
Process a single speech recognition example.
- Parameters:
data – A data example containing audio and its transcript.
return_tensors – The type of tensors to return (list, torch, numpy)
- Returns:
Processed input features and labels.
- Return type:
dict
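Before batching, the variable-length audio arrays must be brought to a common length, which is what the audio_array_padding and max_audio_array_length parameters govern. A minimal pure-Python sketch of that step (the function name and signature are illustrative, not hezar's API; the real work is delegated to the feature extractor):

```python
def pad_audio_batch(arrays, padding="longest", max_length=None, pad_value=0.0):
    """Pad a batch of 1-D audio arrays to a common length.

    Mirrors, in spirit, what a feature extractor does before stacking
    audio arrays into a batch tensor. Illustrative stand-in only.
    """
    if padding == "longest":
        target = max(len(a) for a in arrays)
    elif padding == "max_length":
        target = max_length
    else:
        raise ValueError(f"Unknown padding strategy: {padding}")
    return [list(a) + [pad_value] * (target - len(a)) for a in arrays]

batch = [[0.1, 0.2, 0.3], [0.5]]
print(pad_audio_batch(batch))  # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0]]
```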
- class hezar.data.dataset_processors.TextClassificationDatasetProcessor(tokenizer, max_length=None, padding=None)[source]¶
Bases:
DatasetProcessor
Processor class for text classification datasets. Handles tokenization of the texts.
- process_batch(data, return_tensors=None, padding=None, max_length=None)[source]¶
Process a batch of examples for text classification.
- Parameters:
data – A batch of data examples (a dict of lists)
return_tensors – The type of tensors to return (list, torch, numpy)
padding – Padding strategy for the token ids
max_length – Max input length
- Returns:
The updated data dictionary
- process_single(data, return_tensors=None, padding=None, max_length=None)[source]¶
Process a single example for text classification.
- Parameters:
data – A single data example dict
return_tensors – The type of tensors to return (list, torch, numpy)
padding – Token ids padding type
max_length – Max input length
- Returns:
The updated data dictionary
- class hezar.data.dataset_processors.TextSummarizationDatasetProcessor(tokenizer, prefix=None, max_length=None, labels_max_length=None, text_field='text', summary_field='summary', padding=None)[source]¶
Bases:
DatasetProcessor
Processor class for text summarization datasets. Handles tokenization of the inputs and labels.
- process_batch(data, return_tensors=None, padding=None, max_length=None, labels_max_length=None)[source]¶
Process a batch of examples for text summarization.
- Parameters:
data – A batch of examples containing texts and summaries.
return_tensors – The type of tensors to return (list, torch, numpy)
padding – Padding strategy.
max_length – Max length for input texts.
labels_max_length – Max length for summary labels.
- Returns:
Tokenized inputs and labels for summarization task.
- Return type:
dict
- process_single(data, return_tensors=None, padding=None, max_length=None, labels_max_length=None)[source]¶
Process a single example for text summarization.
- Parameters:
data – A data example containing text and summary.
return_tensors – The type of tensors to return (list, torch, numpy)
padding – Padding strategy.
max_length – Max length for input text.
labels_max_length – Max length for summary labels.
- Returns:
Tokenized inputs and labels for summarization task.
- Return type:
dict
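The prefix parameter follows the seq2seq convention (familiar from T5-style models) of prepending a task prefix such as "summarize: " to each input text before tokenization. A minimal sketch of that preparation step, assuming the function name and the "summarize: " prefix are illustrative rather than hezar's actual internals:

```python
def build_inputs(texts, prefix=None):
    """Prepend an optional task prefix to each input text, as
    prefix-conditioned seq2seq models expect. Illustrative stand-in
    for the text-preparation step before tokenization."""
    prefix = prefix or ""
    return [prefix + text for text in texts]

print(build_inputs(["A long article ..."], prefix="summarize: "))
# ['summarize: A long article ...']
print(build_inputs(["A long article ..."]))  # no prefix configured
```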