hezar.data.datasets.sequence_labeling_dataset module

class hezar.data.datasets.sequence_labeling_dataset.SequenceLabelingDataset(config: SequenceLabelingDatasetConfig, split=None, **kwargs)

Bases: Dataset

A dataset class for sequence labeling tasks. Currently, this class is intended only for datasets hosted on the Hub.

Parameters:
  • config (SequenceLabelingDatasetConfig) – Dataset config object.

  • split – Name of the dataset split to load (for example, train or test).

  • **kwargs – Extra config parameters to assign to the original config.

class hezar.data.datasets.sequence_labeling_dataset.SequenceLabelingDatasetConfig(task: TaskType = TaskType.SEQUENCE_LABELING, path: str | None = None, tokenizer_path: str | None = None, tags_field: str | None = None, tokens_field: str | None = None, max_length: int | None = None, ignore_index: int = -100, label_all_tokens: bool = True, is_iob_schema: bool = False)

Bases: DatasetConfig

Configuration class for sequence labeling datasets.

Parameters:
  • path (str) – Path to the dataset.

  • tokenizer_path (str) – Path to the tokenizer file.

  • tags_field (str) – Field name for tags in the dataset.

  • tokens_field (str) – Field name for tokens in the dataset.

  • max_length (int) – Maximum sequence length, in tokens.

  • ignore_index (int) – Index to ignore in the loss function.

  • label_all_tokens (bool) – Whether to assign a word's label to every subword token of that word, or only to its first token.

  • is_iob_schema (bool) – Whether the dataset follows the IOB schema.
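
The interaction between ignore_index and label_all_tokens can be illustrated with a minimal, self-contained sketch of word-to-subword label alignment. This is not Hezar's actual implementation: the function name align_labels, the word_ids list (one entry per subword token, None for special tokens, as many tokenizers report it), and the example label IDs are all assumptions made for illustration.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    ignore_index: int = -100,
    label_all_tokens: bool = True,
) -> List[int]:
    """Hypothetical helper: spread word-level labels over subword tokens."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens (e.g. padding) are masked out of the loss.
            aligned.append(ignore_index)
        elif word_id != previous:
            # The first subword of a word always receives the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords get the label only if label_all_tokens is set.
            aligned.append(word_labels[word_id] if label_all_tokens else ignore_index)
        previous = word_id
    return aligned


# Three words; the second word is split into two subword tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 7, 0]  # made-up label IDs

all_tokens = align_labels(word_ids, word_labels)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)  # [-100, 3, 7, 7, 0, -100]
print(first_only)  # [-100, 3, 7, -100, 0, -100]
```

Positions set to ignore_index are the ones a loss function configured with the same value would skip, which is why the default of -100 matches the common default in cross-entropy implementations.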

ignore_index: int = -100
is_iob_schema: bool = False
label_all_tokens: bool = True
max_length: int = None
name: str = 'sequence_labeling'
path: str = None
tags_field: str = None
task: TaskType = 'sequence_labeling'
tokenizer_path: str = None
tokens_field: str = None