hezar.data.data_collators module
- class hezar.data.data_collators.CharLevelOCRDataCollator(pad_token_id: int = 0)
Bases: object
A data collator for character-level OCR.
- Parameters:
pad_token_id (int) – Token ID for padding characters.
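As a rough illustration of what pad_token_id is for, the sketch below pads variable-length character label sequences to the longest one in the batch. It is plain Python and does not call Hezar; the function name `pad_char_labels`, the right-side padding, and the batch layout are assumptions made only for this example.

```python
def pad_char_labels(label_batch, pad_token_id=0):
    """Right-pad every label sequence to the length of the longest one.

    Hypothetical helper for illustration; not part of Hezar.
    """
    longest = max(len(labels) for labels in label_batch)
    return [
        labels + [pad_token_id] * (longest - len(labels))
        for labels in label_batch
    ]


# Three OCR targets of different lengths, already mapped to character IDs.
batch = [[5, 9, 2], [7], [3, 4]]
padded = pad_char_labels(batch)
print(padded)  # [[5, 9, 2], [7, 0, 0], [3, 4, 0]]
```

Once every sequence has the same length, the batch can be stacked into a single rectangular tensor.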
- class hezar.data.data_collators.ImageCaptioningDataCollator(tokenizer: Tokenizer, padding: str = 'longest', padding_side: str = 'right', max_length: int | None = None)
Bases: object
Data collator for image captioning.
- Parameters:
tokenizer (Tokenizer) – A Hezar tokenizer instance.
padding (str) – Specifies padding strategy, either longest or max_length.
padding_side (str) – Specifies the side of each tensor on which padding is added, either left or right.
max_length (int) – Must be specified if padding is set to max_length. Forces all tensors to have this value as length.
- class hezar.data.data_collators.SequenceLabelingDataCollator(tokenizer: Tokenizer, padding: str = 'longest', padding_side: str = 'right', label_pad_token_id: int = -100, max_length: int | None = None)
Bases: object
A data collator for sequence labeling.
- Parameters:
tokenizer (Tokenizer) – A Hezar tokenizer instance.
padding (str) – Specifies padding strategy, either longest or max_length.
padding_side (str) – Specifies the side of each tensor on which padding is added, either left or right.
label_pad_token_id (int) – Token ID for padding labels.
max_length (int) – Must be specified if padding is set to max_length. Forces all tensors to have this value as length.
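The following sketch shows why labels get their own padding value. It is plain Python, not Hezar's implementation; the function `pad_tokens_and_labels`, the feature keys `token_ids` and `labels`, and the tokenizer pad ID of 0 are all assumptions for illustration. The default label_pad_token_id of -100 matches the default `ignore_index` of PyTorch's cross-entropy loss, so padded label positions are typically excluded from the loss.

```python
def pad_tokens_and_labels(features, pad_token_id=0, label_pad_token_id=-100,
                          padding_side="right"):
    """Pad token IDs and labels to the longest sequence in the batch.

    Hypothetical helper for illustration; not part of Hezar.
    """
    if padding_side not in ("left", "right"):
        raise ValueError(f"unknown padding side: {padding_side!r}")

    longest = max(len(f["token_ids"]) for f in features)
    batch = {"token_ids": [], "labels": []}
    for f in features:
        gap = longest - len(f["token_ids"])
        token_fill = [pad_token_id] * gap        # filler for the inputs
        label_fill = [label_pad_token_id] * gap  # filler for the labels
        if padding_side == "right":
            batch["token_ids"].append(f["token_ids"] + token_fill)
            batch["labels"].append(f["labels"] + label_fill)
        else:
            batch["token_ids"].append(token_fill + f["token_ids"])
            batch["labels"].append(label_fill + f["labels"])
    return batch


features = [
    {"token_ids": [101, 7, 8, 102], "labels": [0, 1, 2, 0]},
    {"token_ids": [101, 9, 102], "labels": [0, 3, 0]},
]
batch = pad_tokens_and_labels(features)
print(batch["token_ids"])  # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["labels"])     # [[0, 1, 2, 0], [0, 3, 0, -100]]
```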
- class hezar.data.data_collators.SpeechRecognitionDataCollator(feature_extractor: AudioFeatureExtractor, tokenizer: Tokenizer, inputs_padding: str = 'longest', inputs_max_length: int | None = None, labels_padding: str = 'longest', labels_max_length: int | None = None)
Bases: object
- class hezar.data.data_collators.TextGenerationDataCollator(tokenizer: Tokenizer, padding: str = 'longest', padding_side: str = 'right', max_length: int | None = None, labels_max_length: int | None = None)
Bases: object
A data collator for text-to-text generation.
- Parameters:
tokenizer (Tokenizer) – A Hezar tokenizer instance.
padding (str) – Specifies padding strategy, either longest or max_length.
padding_side (str) – Specifies the side of each tensor on which padding is added, either left or right.
max_length (int) – Must be specified if padding is set to max_length. Forces all tensors to have this value as length.
labels_max_length (int) – Maximum target length for text generation.
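To show how max_length and labels_max_length can differ, here is a minimal sketch that pads inputs and targets to separate fixed lengths. It does not use Hezar; `pad_to`, `collate_text_generation`, the feature keys, and the choice to pad labels with the same pad ID as the inputs are assumptions, since the reference above does not state how labels are padded.

```python
def pad_to(sequence, length, pad_id):
    """Right-pad one sequence to `length`.

    Assumes the sequence is not longer than `length`; longer sequences
    are returned unchanged (this sketch does not truncate).
    """
    return sequence + [pad_id] * (length - len(sequence))


def collate_text_generation(features, pad_token_id=0, max_length=6,
                            labels_max_length=4):
    """Pad inputs and targets to independent lengths (illustrative only)."""
    return {
        "token_ids": [pad_to(f["token_ids"], max_length, pad_token_id)
                      for f in features],
        "labels": [pad_to(f["labels"], labels_max_length, pad_token_id)
                   for f in features],
    }


features = [
    {"token_ids": [4, 5, 6, 7, 8], "labels": [9, 10]},
    {"token_ids": [4, 5], "labels": [9, 10, 11]},
]
batch = collate_text_generation(features)
print(batch["token_ids"])  # [[4, 5, 6, 7, 8, 0], [4, 5, 0, 0, 0, 0]]
print(batch["labels"])     # [[9, 10, 0, 0], [9, 10, 11, 0]]
```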
- class hezar.data.data_collators.TextPaddingDataCollator(tokenizer: Tokenizer, padding: str = 'longest', padding_side: str = 'right', max_length: int | None = None, return_tensors: str = 'torch')
Bases: object
A data collator that pads a batch of tokenized inputs.
- Parameters:
tokenizer (Tokenizer) – A Hezar tokenizer instance.
padding (str) – Specifies padding strategy, either longest or max_length.
padding_side (str) – Specifies the side of each tensor on which padding is added, either left or right.
max_length (int) – Must be specified if padding is set to max_length. Forces all tensors to have this value as length.
return_tensors (str) – Specifies the type of the returned tensors in the batch: numpy, list, or torch.
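The interaction between padding, padding_side, and max_length can be sketched as follows. This is plain Python working on lists, not Hezar's code; the function `pad_batch` and the pad ID of 0 are hypothetical. The reference above does not say whether sequences longer than max_length are truncated, so this sketch simply raises an error in that case.

```python
def pad_batch(sequences, pad_token_id=0, padding="longest",
              padding_side="right", max_length=None):
    """Pad a list of token-ID lists to a common length (illustrative only)."""
    if padding == "longest":
        target = max(len(seq) for seq in sequences)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required when padding='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    if padding_side not in ("left", "right"):
        raise ValueError(f"unknown padding side: {padding_side!r}")

    padded = []
    for seq in sequences:
        gap = target - len(seq)
        if gap < 0:
            # Truncation behavior is not documented, so fail loudly instead.
            raise ValueError("sequence is longer than the target length")
        filler = [pad_token_id] * gap
        padded.append(seq + filler if padding_side == "right" else filler + seq)
    return padded


batch = [[11, 12, 13], [21]]
longest_right = pad_batch(batch)
fixed_left = pad_batch(batch, padding="max_length", padding_side="left",
                       max_length=4)
print(longest_right)  # [[11, 12, 13], [21, 0, 0]]
print(fixed_left)     # [[0, 11, 12, 13], [0, 0, 0, 21]]
```

With `longest`, the target length changes from batch to batch; with `max_length`, every batch has the same shape, at the cost of extra padding on short batches.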