hezar.data.data_collators module

class hezar.data.data_collators.CharLevelOCRDataCollator(pad_token_id: int = 0)[source]

Bases: object

A data collator for character-level OCR.

Parameters:

pad_token_id (int) – Token ID for padding characters.

class hezar.data.data_collators.ImageCaptioningDataCollator(tokenizer: Tokenizer, padding_type: str = 'longest', padding_side: str = 'right', max_length: int | None = None, return_tensors: str = 'pt')[source]

Bases: object

class hezar.data.data_collators.SequenceLabelingDataCollator(tokenizer: Tokenizer, padding_type: str = 'longest', padding_side: str = 'right', label_pad_token_id: int = -100, max_length: int | None = None, return_tensors: str = 'pt')[source]

Bases: object

A data collator for sequence labeling.

Parameters:
  • tokenizer (Tokenizer) – A Hezar tokenizer instance (only its config is used).

  • padding_type (str) – The padding strategy. Defaults to longest, but can also be max_length (in which case max_length must not be None).

  • padding_side (str) – Which side of each tensor to pad. Defaults to right, but can also be left.

  • label_pad_token_id (int) – Token ID used to pad labels. Defaults to -100, the default ignore index of PyTorch's cross-entropy loss.

  • max_length (int) – Required when padding_type is max_length; forces all tensors to this length.

  • return_tensors (str) – The type of the tensors in the returned batch. Defaults to pt (torch.Tensor), but can also be np or list.
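The key detail for sequence labeling is that inputs and labels are padded with different IDs: inputs with the tokenizer's pad token, labels with label_pad_token_id (-100) so padded positions are ignored by the loss. A minimal, library-free sketch of that behavior (function name and plain-list inputs are assumptions, not Hezar API):

```python
# Hypothetical sketch of sequence-labeling collation (not Hezar's actual code)

def pad_sequence_labeling_batch(input_ids, labels, pad_token_id=0,
                                label_pad_token_id=-100, padding_side="right"):
    """Pad token IDs and per-token labels to the longest sequence in the batch."""
    max_len = max(len(ids) for ids in input_ids)

    def _pad(seq, pad_id):
        padding = [pad_id] * (max_len - len(seq))
        return seq + padding if padding_side == "right" else padding + seq

    return (
        [_pad(ids, pad_token_id) for ids in input_ids],
        [_pad(tags, label_pad_token_id) for tags in labels],
    )

inputs = [[101, 7, 8, 102], [101, 9, 102]]
tags = [[0, 1, 2, 0], [0, 3, 0]]
padded_inputs, padded_tags = pad_sequence_labeling_batch(inputs, tags)
```

Padded label positions carry -100 rather than a real class ID, so they contribute nothing to a cross-entropy loss configured with its default ignore index.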

class hezar.data.data_collators.SpeechRecognitionDataCollator(feature_extractor: AudioFeatureExtractor, tokenizer: Tokenizer, inputs_padding_type: str = 'longest', inputs_max_length: int | None = None, labels_padding_type: str = 'longest', labels_max_length: int | None = None)[source]

Bases: object

class hezar.data.data_collators.TextGenerationDataCollator(tokenizer: Tokenizer, padding_type: str = 'longest', padding_side: str = 'right', max_length: int | None = None, max_target_length: int | None = None, return_tensors: str = 'pt')[source]

Bases: object

A data collator for text-to-text generation.

Parameters:
  • tokenizer (Tokenizer) – A Hezar tokenizer instance (only its config is used).

  • padding_type (str) – The padding strategy. Defaults to longest, but can also be max_length (in which case max_length must not be None).

  • padding_side (str) – Which side of each tensor to pad. Defaults to right, but can also be left.

  • max_length (int) – Required when padding_type is max_length; forces all tensors to this length.

  • max_target_length (int) – Maximum length of the target (label) sequences.

  • return_tensors (str) – The type of the tensors in the returned batch. Defaults to pt (torch.Tensor), but can also be np or list.
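For text-to-text generation, sources and targets are padded to independent lengths, with max_target_length bounding the label side. The sketch below is a hypothetical, library-free illustration of that idea; the function names and plain-list inputs are assumptions, not Hezar API.

```python
# Hypothetical sketch of source/target collation (not Hezar's actual code)

def pad_to(seq, length, pad_id):
    # Right-pad (or truncate) a sequence to exactly `length` tokens
    return (seq + [pad_id] * max(0, length - len(seq)))[:length]

def collate_generation_batch(input_ids, labels, pad_token_id=0,
                             max_length=None, max_target_length=None):
    """Pad sources and targets separately, each to its own target length."""
    src_len = max_length or max(len(s) for s in input_ids)
    tgt_len = max_target_length or max(len(t) for t in labels)
    return (
        [pad_to(s, src_len, pad_token_id) for s in input_ids],
        [pad_to(t, tgt_len, pad_token_id) for t in labels],
    )

sources = [[4, 5, 6], [7, 8]]
targets = [[9, 10], [11]]
src, tgt = collate_generation_batch(sources, targets, max_target_length=3)
```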

class hezar.data.data_collators.TextPaddingDataCollator(tokenizer: Tokenizer, padding_type: str = 'longest', padding_side: str = 'right', max_length: int | None = None, return_tensors: str = 'pt')[source]

Bases: object

A data collator that pads a batch of tokenized inputs.

Parameters:
  • tokenizer (Tokenizer) – A Hezar tokenizer instance (only its config is used).

  • padding_type (str) – The padding strategy. Defaults to longest, but can also be max_length (in which case max_length must not be None).

  • padding_side (str) – Which side of each tensor to pad. Defaults to right, but can also be left.

  • max_length (int) – Required when padding_type is max_length; forces all tensors to this length.

  • return_tensors (str) – The type of the tensors in the returned batch. Defaults to pt (torch.Tensor), but can also be np or list.
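The interplay of padding_type, max_length, and padding_side can be sketched without the library; a padding collator also typically extends the attention mask alongside the input IDs. The function name, plain-list inputs, and returned dict keys below are illustrative assumptions, not Hezar API.

```python
# Hypothetical sketch of batch text padding (not Hezar's actual code)

def pad_text_batch(input_ids, pad_token_id=0, padding_type="longest",
                   max_length=None, padding_side="right"):
    """Pad a batch of token ID lists, producing a matching attention mask."""
    if padding_type == "max_length":
        if max_length is None:
            raise ValueError("max_length is required when padding_type='max_length'")
        target = max_length
    else:  # "longest": pad to the longest sequence in the batch
        target = max(len(ids) for ids in input_ids)
    padded, masks = [], []
    for ids in input_ids:
        pad = [pad_token_id] * (target - len(ids))
        mask = [1] * len(ids)
        if padding_side == "right":
            padded.append(ids + pad)
            masks.append(mask + [0] * len(pad))
        else:
            padded.append(pad + ids)
            masks.append([0] * len(pad) + mask)
    return {"input_ids": padded, "attention_mask": masks}

batch = pad_text_batch([[1, 2, 3], [4]], padding_type="max_length", max_length=5)
```

With padding_type="longest" the same call would pad both rows to length 3 instead of 5, which is why max_length may be None in that mode.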