hezar.data.datasets.text_summarization_dataset module

class hezar.data.datasets.text_summarization_dataset.TextSummarizationDataset(config: TextSummarizationDatasetConfig, split=None, **kwargs)[source]

Bases: Dataset

A dataset class for text summarization. It is currently intended only for datasets hosted on the Hub.

Parameters:
  • config (TextSummarizationDatasetConfig) – Dataset config object.

  • split – Name of the dataset split to load.

  • **kwargs – Extra config parameters to assign to the original config.
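The override behavior described for **kwargs can be illustrated with a minimal, self-contained sketch. It does not import hezar: `SketchConfig` and `apply_overrides` are hypothetical stand-ins, and the choice to return a copy rather than update the config in place is an assumption made for illustration, not a statement about how the library implements it.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a dataset config (not the hezar class)."""

    path: Optional[str] = None
    text_field: Optional[str] = None
    summary_field: Optional[str] = None
    max_length: Optional[int] = None


def apply_overrides(config: SketchConfig, **kwargs) -> SketchConfig:
    """Return a copy of `config` with the given fields replaced.

    Assumption: unknown keys are rejected. `dataclasses.replace` raises
    TypeError for names that are not fields of the dataclass.
    """
    return replace(config, **kwargs)


# Start from a base config, then override one field via keyword arguments.
base = SketchConfig(path="some/dataset", text_field="text", summary_field="summary")
updated = apply_overrides(base, max_length=256)

print(updated.max_length)  # value supplied through kwargs
print(base.max_length)     # the base config is left untouched in this sketch
```

In this sketch, extra keyword arguments take precedence over the values already stored in the config, which is the general idea behind passing config parameters alongside a config object.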

class hezar.data.datasets.text_summarization_dataset.TextSummarizationDatasetConfig(task: TaskType = TaskType.TEXT_GENERATION, path: str | None = None, tokenizer_path: str | None = None, prefix: str | None = None, text_field: str | None = None, summary_field: str | None = None, title_field: str | None = None, max_length: int | None = None, max_target_length: int | None = None)[source]

Bases: DatasetConfig

Configuration class for text summarization datasets.

Parameters:
  • path (str) – Path to the dataset.

  • tokenizer_path (str) – Path to the tokenizer file.

  • prefix (str) – Prefix for conditional generation.

  • text_field (str) – Field name for text in the dataset.

  • summary_field (str) – Field name for summary in the dataset.

  • title_field (str) – Field name for title in the dataset.

  • max_length (int) – Maximum length of the input text.

  • max_target_length (int) – Maximum length of the target summary.
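As a rough illustration of how these fields might fit together, the sketch below builds one input/target pair from a dataset row. It is not the library's preprocessing code: `build_example` is a hypothetical helper, whitespace splitting stands in for a real tokenizer, and treating the two length limits as token counts is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def build_example(
    row: Dict[str, str],
    text_field: str,
    summary_field: str,
    prefix: Optional[str] = None,
    max_length: Optional[int] = None,
    max_target_length: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: turn one row into (input tokens, target tokens)."""
    # Prepend the conditional-generation prefix, if one is configured.
    source = (prefix or "") + row[text_field]
    # Whitespace splitting is a stand-in for a real tokenizer (assumption).
    # Slicing with None keeps the whole list, so unset limits mean no truncation.
    source_tokens = source.split()[:max_length]
    target_tokens = row[summary_field].split()[:max_target_length]
    return source_tokens, target_tokens


row = {"text": "a b c d e", "summary": "x y z"}
inputs, labels = build_example(
    row,
    text_field="text",
    summary_field="summary",
    prefix="summarize: ",
    max_length=4,
    max_target_length=2,
)
print(inputs)  # ['summarize:', 'a', 'b', 'c']
print(labels)  # ['x', 'y']
```

Here text_field and summary_field select which columns of the row are read, prefix is prepended to the source text, and the two length limits truncate the input and the target independently.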

max_length: int = None
max_target_length: int = None
name: str = 'text_summarization'
path: str = None
prefix: str = None
summary_field: str = None
task: TaskType = 'text_generation'
text_field: str = None
title_field: str = None
tokenizer_path: str = None