hezar.data.datasets.sequence_labeling_dataset module

class hezar.data.datasets.sequence_labeling_dataset.SequenceLabelingDataset(config: SequenceLabelingDatasetConfig, split=None, preprocessor=None, **kwargs)

Bases: Dataset

A sequence labeling dataset class. Currently, this class is intended only for datasets that exist on the Hub.

Parameters:

config (SequenceLabelingDatasetConfig) – Dataset config object.
split – Which split to use.
**kwargs – Extra config parameters to assign to the original config.

class hezar.data.datasets.sequence_labeling_dataset.SequenceLabelingDatasetConfig(path: str | None = None, task: TaskType = TaskType.SEQUENCE_LABELING, max_size: int | float | None = None, hf_load_kwargs: dict | None = None, tags_field: str | None = None, tokens_field: str | None = None, max_length: int | None = None, ignore_index: int = -100, label_all_tokens: bool = True, is_iob_schema: bool = False)

Bases: DatasetConfig

Configuration class for sequence labeling datasets.

Parameters:

path (str) – Path to the dataset.
tags_field (str) – Field name for tags in the dataset.
tokens_field (str) – Field name for tokens in the dataset.
max_length (int) – Maximum sequence length, in tokens.
ignore_index (int) – Index to ignore in the loss function.
label_all_tokens (bool) – Whether to label every token of a word or only the first token of each word.
is_iob_schema (bool) – Whether the dataset follows the IOB schema.
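
The interaction between ignore_index and label_all_tokens can be illustrated with a minimal sketch of the label alignment commonly used for subword tokenizers. This is not Hezar's actual implementation: the helper name align_labels, the word_ids input (one word index per token, None for special tokens), and the sample values are all assumptions made for illustration, and any handling specific to is_iob_schema is left out.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    ignore_index: int = -100,
    label_all_tokens: bool = True,
) -> List[int]:
    """Hypothetical helper: expand word-level labels to token-level labels."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens (e.g. padding) get ignore_index so the loss skips them.
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # The first token of a word always carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later tokens of the same word are labeled only if label_all_tokens is set.
            aligned.append(word_labels[word_id] if label_all_tokens else ignore_index)
        previous_word_id = word_id
    return aligned


# Assumed example: three words, where the second word is split into two tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 5]

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)  # [-100, 3, 0, 0, 5, -100]
print(first_only)  # [-100, 3, 0, -100, 5, -100]
```

With label_all_tokens=False, only the first token of each word contributes to the loss; positions filled with ignore_index are skipped by loss functions that honor that value.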

ignore_index: int = -100

is_iob_schema: bool = False

label_all_tokens: bool = True

max_length: int = None

name: str = 'sequence_labeling'

path: str = None

tags_field: str = None

task: TaskType = 'sequence_labeling'

tokens_field: str = None