hezar.data.data_collators module

class hezar.data.data_collators.CharLevelOCRDataCollator(pad_token_id: int = 0)[source]

Bases: object

A data collator for character-level OCR.

Parameters:

pad_token_id (int) – Token ID for padding characters.

class hezar.data.data_collators.ImageCaptioningDataCollator(tokenizer: Tokenizer, padding: str = 'longest', padding_side: str = 'right', max_length: int | None = None)[source]

Bases: object

A data collator for image captioning.

Parameters:
  • tokenizer (Tokenizer) – A Hezar tokenizer instance.

  • padding (str) – Specifies padding strategy, either longest or max_length.

  • padding_side (str) – Specifies from which side of each tensor to add paddings, either left or right.

  • max_length (int) – If padding is set to max_length this must be specified. Forces all tensors to have this value as length.
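
The way padding, padding_side, and max_length interact can be sketched in plain Python. This illustrates the general idea only and is not Hezar's implementation: the helper name pad_batch and its pad_id argument are hypothetical, and truncating sequences longer than max_length is an assumption made so that every row ends up with the same length.

```python
def pad_batch(sequences, pad_id=0, padding="longest", padding_side="right", max_length=None):
    """Pad lists of token IDs to a common length (hypothetical helper, not part of Hezar)."""
    if padding == "longest":
        # Use the longest sequence in this batch as the target length.
        target = max(len(seq) for seq in sequences)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length must be given when padding='max_length'")
        target = max_length
    else:
        raise ValueError(f"Unknown padding strategy: {padding}")

    padded = []
    for seq in sequences:
        seq = seq[:target]  # assumption: longer sequences are cut to the target length
        filler = [pad_id] * (target - len(seq))
        padded.append(seq + filler if padding_side == "right" else filler + seq)
    return padded


batch = [[5, 6, 7], [8, 9]]
longest_right = pad_batch(batch)
fixed_left = pad_batch(batch, padding="max_length", padding_side="left", max_length=4)
```

With padding set to longest, the target length changes from batch to batch, while max_length gives every batch the same width at the cost of extra padding.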

class hezar.data.data_collators.SequenceLabelingDataCollator(tokenizer: Tokenizer, padding: str = 'longest', padding_side: str = 'right', label_pad_token_id: int = -100, max_length: int | None = None)[source]

Bases: object

A data collator for sequence labeling.

Parameters:
  • tokenizer (Tokenizer) – A Hezar tokenizer instance.

  • padding (str) – Specifies padding strategy, either longest or max_length.

  • padding_side (str) – Specifies from which side of each tensor to add paddings, either left or right.

  • label_pad_token_id (int) – Token ID for padding labels.

  • max_length (int) – If padding is set to max_length this must be specified. Forces all tensors to have this value as length.
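
The default label_pad_token_id of -100 matches the default ignore_index of PyTorch's torch.nn.CrossEntropyLoss, so padded label positions are typically left out of the loss. A minimal sketch of padding token IDs and labels with different fill values follows. The helper pad_tokens_and_labels, the token pad ID of 0, and the output key names are assumptions for illustration, not Hezar's code.

```python
def pad_tokens_and_labels(token_ids, labels, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad token IDs and their per-token labels to the longest sequence.

    Hypothetical helper for illustration; not part of Hezar.
    """
    target = max(len(ids) for ids in token_ids)
    padded_ids, padded_labels, attention_mask = [], [], []
    for ids, tags in zip(token_ids, labels):
        gap = target - len(ids)
        padded_ids.append(ids + [pad_token_id] * gap)
        # Labels get their own fill value so a loss function can skip these positions.
        padded_labels.append(tags + [label_pad_token_id] * gap)
        attention_mask.append([1] * len(ids) + [0] * gap)
    return {"token_ids": padded_ids, "labels": padded_labels, "attention_mask": attention_mask}


batch = pad_tokens_and_labels(
    token_ids=[[11, 12, 13], [21]],
    labels=[[1, 0, 2], [3]],
)
```

Keeping the label fill value distinct from every real class index is what lets the padded positions be told apart from genuine labels.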

class hezar.data.data_collators.SpeechRecognitionDataCollator(feature_extractor: AudioFeatureExtractor, tokenizer: Tokenizer, inputs_padding: str = 'longest', inputs_max_length: int | None = None, labels_padding: str = 'longest', labels_max_length: int | None = None)[source]

Bases: object

class hezar.data.data_collators.TextGenerationDataCollator(tokenizer: Tokenizer, padding: str = 'longest', padding_side: str = 'right', max_length: int | None = None, labels_max_length: int | None = None)[source]

Bases: object

A data collator for text-to-text generation.

Parameters:
  • tokenizer (Tokenizer) – A Hezar tokenizer instance.

  • padding (str) – Specifies padding strategy, either longest or max_length.

  • padding_side (str) – Specifies from which side of each tensor to add paddings, either left or right.

  • max_length (int) – If padding is set to max_length this must be specified. Forces all tensors to have this value as length.

  • labels_max_length (int) – Maximum target length for text generation.

class hezar.data.data_collators.TextPaddingDataCollator(tokenizer: Tokenizer, padding: str = 'longest', padding_side: str = 'right', max_length: int | None = None, return_tensors: str = 'torch')[source]

Bases: object

A data collator that pads a batch of tokenized inputs.

Parameters:
  • tokenizer (Tokenizer) – A Hezar tokenizer instance.

  • padding (str) – Specifies padding strategy, either longest or max_length.

  • padding_side (str) – Specifies from which side of each tensor to add paddings, either left or right.

  • max_length (int) – If padding is set to max_length this must be specified. Forces all tensors to have this value as length.

  • return_tensors (str) – Specifies the type of the returned tensors in the batch: numpy, list, or torch.
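
As a rough illustration of what a return_tensors switch usually does, the sketch below converts an already padded batch into the requested container type. The helper convert_batch and the batch keys are hypothetical and do not reflect Hezar's internals. NumPy and PyTorch are imported lazily, so only the branch that is used needs its dependency installed.

```python
def convert_batch(batch, return_tensors="list"):
    """Convert a dict of equally sized nested lists to the requested container type.

    Hypothetical helper; the accepted values mirror the ones listed for return_tensors.
    """
    if return_tensors == "list":
        return {key: list(value) for key, value in batch.items()}
    if return_tensors == "numpy":
        import numpy as np  # imported lazily so the "list" path has no dependencies
        return {key: np.asarray(value) for key, value in batch.items()}
    if return_tensors == "torch":
        import torch  # requires PyTorch to be installed
        return {key: torch.tensor(value) for key, value in batch.items()}
    raise ValueError(f"Unsupported return_tensors value: {return_tensors}")


padded = {"token_ids": [[5, 6, 7], [8, 9, 0]], "attention_mask": [[1, 1, 1], [1, 1, 0]]}
as_lists = convert_batch(padded, return_tensors="list")
```

Conversion to an array or tensor only works once every row has the same length, which is why padding happens first.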