hezar.data.data_collators module
- class hezar.data.data_collators.CharLevelOCRDataCollator(pad_token_id: int = 0)[source]
Bases: object
A data collator for character-level OCR.
- Parameters:
pad_token_id (int) – Token ID for padding characters.
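The padding step can be sketched in plain Python. The helper below is illustrative only (the function name and the list-based representation are assumptions, not hezar's actual implementation): each character-ID sequence is right-padded with pad_token_id to the longest sequence in the batch.

```python
def pad_char_labels(label_batch, pad_token_id=0):
    """Right-pad each character-ID sequence to the longest length in the batch.

    Illustrative sketch; the real collator returns tensors, not lists.
    """
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [pad_token_id] * (max_len - len(labels)) for labels in label_batch]

padded = pad_char_labels([[12, 7, 3], [5, 9], [1, 2, 3, 4]])
print(padded)  # [[12, 7, 3, 0], [5, 9, 0, 0], [1, 2, 3, 4]]
```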
- class hezar.data.data_collators.ImageCaptioningDataCollator(tokenizer: Tokenizer, padding_type: str = 'longest', padding_side: str = 'right', max_length: int | None = None, return_tensors: str = 'pt')[source]
Bases: object
A data collator for image captioning.
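In an image-captioning batch the image tensors are fixed-size while the caption token sequences vary in length, so only the captions need padding. A minimal sketch of that collation step (hypothetical names and a dict-of-lists layout; not hezar's actual code):

```python
def collate_captions(batch, pad_token_id=0):
    """Gather fixed-size image tensors and right-pad caption IDs to the longest caption.

    Illustrative sketch; keys and padding value are assumptions.
    """
    max_len = max(len(example["labels"]) for example in batch)
    return {
        "pixel_values": [example["pixel_values"] for example in batch],
        "labels": [
            example["labels"] + [pad_token_id] * (max_len - len(example["labels"]))
            for example in batch
        ],
    }

batch = [
    {"pixel_values": [0.1, 0.2], "labels": [4, 5, 6]},
    {"pixel_values": [0.3, 0.4], "labels": [7]},
]
print(collate_captions(batch)["labels"])  # [[4, 5, 6], [7, 0, 0]]
```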
- class hezar.data.data_collators.SequenceLabelingDataCollator(tokenizer: Tokenizer, padding_type: str = 'longest', padding_side: str = 'right', label_pad_token_id: int = -100, max_length: int | None = None, return_tensors: str = 'pt')[source]
Bases: object
A data collator for sequence labeling.
- Parameters:
tokenizer (Tokenizer) – A Hezar tokenizer instance (only its config is used).
padding_type (str) – The padding strategy; either longest (default) or max_length (in which case max_length must not be None).
padding_side (str) – The side of each tensor on which padding is added; either right (default) or left.
label_pad_token_id (int) – Token ID used to pad labels.
max_length (int) – Required when padding_type is max_length; forces all tensors to this length.
return_tensors (str) – The type of tensors in the returned batch; pt (torch.Tensor, default), np, or list.
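The distinguishing feature of sequence-labeling collation is that inputs and labels are padded with different values: inputs with the tokenizer's pad token, labels with label_pad_token_id (-100 by default, the value PyTorch's cross-entropy loss ignores). A sketch with assumed key names, supporting both padding sides:

```python
def collate_sequence_labeling(batch, pad_token_id=0, label_pad_token_id=-100, padding_side="right"):
    """Pad token IDs with pad_token_id and labels with label_pad_token_id.

    Illustrative sketch; the -100 default lets the loss skip padded positions.
    """
    max_len = max(len(example["token_ids"]) for example in batch)

    def pad(seq, pad_value):
        padding = [pad_value] * (max_len - len(seq))
        return seq + padding if padding_side == "right" else padding + seq

    return {
        "token_ids": [pad(example["token_ids"], pad_token_id) for example in batch],
        "labels": [pad(example["labels"], label_pad_token_id) for example in batch],
    }

batch = [
    {"token_ids": [101, 7, 102], "labels": [0, 1, 0]},
    {"token_ids": [101, 102], "labels": [0, 0]},
]
collated = collate_sequence_labeling(batch)
print(collated["labels"])  # [[0, 1, 0], [0, 0, -100]]
```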
- class hezar.data.data_collators.SpeechRecognitionDataCollator(feature_extractor: AudioFeatureExtractor, tokenizer: Tokenizer, inputs_padding_type: str = 'longest', inputs_max_length: int | None = None, labels_padding_type: str = 'longest', labels_max_length: int | None = None)[source]
Bases: object
A data collator for speech recognition.
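As the signature suggests, audio inputs and text labels carry separate padding settings, so the two streams are padded independently: float feature sequences with a neutral value, integer label sequences with a label pad ID. A sketch under those assumptions (key names and pad values are illustrative, not hezar's actual choices):

```python
def collate_speech(batch, feature_pad_value=0.0, label_pad_token_id=-100):
    """Pad float feature sequences and integer label sequences independently.

    Illustrative sketch; real collators delegate to the feature extractor
    and tokenizer, and each stream can use its own padding strategy.
    """
    max_feat = max(len(example["input_features"]) for example in batch)
    max_lab = max(len(example["labels"]) for example in batch)
    return {
        "input_features": [
            example["input_features"]
            + [feature_pad_value] * (max_feat - len(example["input_features"]))
            for example in batch
        ],
        "labels": [
            example["labels"] + [label_pad_token_id] * (max_lab - len(example["labels"]))
            for example in batch
        ],
    }

batch = [
    {"input_features": [0.5, 0.6], "labels": [3]},
    {"input_features": [0.7], "labels": [4, 5]},
]
collated = collate_speech(batch)
print(collated["labels"])  # [[3, -100], [4, 5]]
```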
- class hezar.data.data_collators.TextGenerationDataCollator(tokenizer: Tokenizer, padding_type: str = 'longest', padding_side: str = 'right', max_length: int | None = None, max_target_length: int | None = None, return_tensors: str = 'pt')[source]
Bases: object
A data collator for text-to-text generation.
- Parameters:
tokenizer (Tokenizer) – A Hezar tokenizer instance (only its config is used).
padding_type (str) – The padding strategy; either longest (default) or max_length (in which case max_length must not be None).
padding_side (str) – The side of each tensor on which padding is added; either right (default) or left.
max_length (int) – Required when padding_type is max_length; forces all tensors to this length.
max_target_length (int) – Maximum length of the target (label) sequences.
return_tensors (str) – The type of tensors in the returned batch; pt (torch.Tensor, default), np, or list.
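What sets this collator apart is max_target_length: inputs and targets are padded separately, with targets capped at that length. A sketch of the idea (key names, the truncation behavior, and padding targets with pad_token_id are assumptions for illustration):

```python
def collate_text_generation(batch, pad_token_id=0, max_target_length=None):
    """Pad inputs to the longest input; cap and pad targets to max_target_length.

    Illustrative sketch; without max_target_length, targets pad to the
    longest target in the batch.
    """
    max_input = max(len(ex["token_ids"]) for ex in batch)
    if max_target_length is not None:
        max_target = max_target_length
    else:
        max_target = max(len(ex["labels"]) for ex in batch)
    return {
        "token_ids": [
            ex["token_ids"] + [pad_token_id] * (max_input - len(ex["token_ids"]))
            for ex in batch
        ],
        "labels": [
            ex["labels"][:max_target]
            + [pad_token_id] * (max_target - len(ex["labels"][:max_target]))
            for ex in batch
        ],
    }

batch = [
    {"token_ids": [1, 2, 3], "labels": [9, 8]},
    {"token_ids": [4], "labels": [7, 6, 5]},
]
collated = collate_text_generation(batch, max_target_length=2)
print(collated["labels"])  # [[9, 8], [7, 6]]
```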
- class hezar.data.data_collators.TextPaddingDataCollator(tokenizer: Tokenizer, padding_type: str = 'longest', padding_side: str = 'right', max_length: int | None = None, return_tensors: str = 'pt')[source]
Bases: object
A data collator that pads a batch of tokenized inputs.
- Parameters:
tokenizer (Tokenizer) – A Hezar tokenizer instance (only its config is used).
padding_type (str) – The padding strategy; either longest (default) or max_length (in which case max_length must not be None).
padding_side (str) – The side of each tensor on which padding is added; either right (default) or left.
max_length (int) – Required when padding_type is max_length; forces all tensors to this length.
return_tensors (str) – The type of tensors in the returned batch; pt (torch.Tensor, default), np, or list.
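The interaction of padding_type, max_length, and padding_side described above can be sketched as follows; the function and key names are illustrative (not hezar's actual code), and the sketch also builds the attention mask that usually accompanies padded inputs:

```python
def pad_batch(input_ids_batch, pad_token_id=0, padding_type="longest",
              max_length=None, padding_side="right"):
    """Pad a batch of token-ID lists and build matching attention masks.

    Illustrative sketch: "longest" pads to the longest sequence in the
    batch; "max_length" pads every sequence to max_length.
    """
    if padding_type == "max_length":
        if max_length is None:
            raise ValueError("max_length must be set when padding_type='max_length'")
        target = max_length
    else:
        target = max(len(ids) for ids in input_ids_batch)

    out = {"token_ids": [], "attention_mask": []}
    for ids in input_ids_batch:
        pad = [pad_token_id] * (target - len(ids))
        mask = [1] * len(ids)
        if padding_side == "right":
            out["token_ids"].append(ids + pad)
            out["attention_mask"].append(mask + [0] * len(pad))
        else:
            out["token_ids"].append(pad + ids)
            out["attention_mask"].append([0] * len(pad) + mask)
    return out

collated = pad_batch([[5, 6], [7]], padding_type="max_length", max_length=4)
print(collated["token_ids"])       # [[5, 6, 0, 0], [7, 0, 0, 0]]
print(collated["attention_mask"])  # [[1, 1, 0, 0], [1, 0, 0, 0]]
```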