hezar.data.datasets.sequence_labeling_dataset module
- class hezar.data.datasets.sequence_labeling_dataset.SequenceLabelingDataset(config: SequenceLabelingDatasetConfig, split=None, preprocessor=None, **kwargs)
Bases: Dataset
A sequence labeling dataset class. Currently, this class is intended only for datasets that already exist on the Hub.
- Parameters:
config (SequenceLabelingDatasetConfig) – Dataset config object.
split – Which split to use.
**kwargs – Extra config parameters to assign to the original config.
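The `**kwargs` behaviour described above (extra parameters assigned onto the original config) can be pictured with a small stand-in. This is only a sketch and does not import hezar: `ToyConfig` and `apply_overrides` are hypothetical names, and the real class may merge keyword arguments differently, for example by mutating the config in place or by accepting unknown keys.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class ToyConfig:
    # Hypothetical stand-in; the defaults are copied from the
    # SequenceLabelingDatasetConfig signature on this page.
    path: Optional[str] = None
    max_length: Optional[int] = None
    ignore_index: int = -100
    label_all_tokens: bool = True


def apply_overrides(config: ToyConfig, **kwargs) -> ToyConfig:
    """Return a copy of `config` with the given fields replaced.

    Assumption: unknown keys are rejected. The real library may behave
    differently.
    """
    known = {f.name for f in fields(config)}
    unknown = set(kwargs) - known
    if unknown:
        raise TypeError(f"Unknown config parameters: {sorted(unknown)}")
    return replace(config, **kwargs)


base = ToyConfig(path="some/dataset")
updated = apply_overrides(base, max_length=128, label_all_tokens=False)
```

Here `updated` carries the overridden values while `base` is left unchanged, which keeps the original config reusable across splits.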
- class hezar.data.datasets.sequence_labeling_dataset.SequenceLabelingDatasetConfig(path: str | None = None, task: TaskType = TaskType.SEQUENCE_LABELING, max_size: int | float | None = None, hf_load_kwargs: dict | None = None, tags_field: str | None = None, tokens_field: str | None = None, max_length: int | None = None, ignore_index: int = -100, label_all_tokens: bool = True, is_iob_schema: bool = False)
Bases: DatasetConfig
Configuration class for sequence labeling datasets.
- Parameters:
path (str) – Path to the dataset.
tags_field (str) – Field name for tags in the dataset.
tokens_field (str) – Field name for tokens in the dataset.
max_length (int) – Maximum sequence length, in tokens.
ignore_index (int) – Index to ignore in the loss function.
label_all_tokens (bool) – Whether to label every token of a word or only its first token.
is_iob_schema (bool) – Whether the dataset follows the IOB schema.
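To make the interaction between `ignore_index` and `label_all_tokens` concrete, the following is a minimal sketch of the usual label-alignment step for sub-word tokenizers. It is not hezar's implementation: `align_labels` is a hypothetical helper, and `word_ids` is assumed to map each token to the index of the word it came from, with `None` for special tokens.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    ignore_index: int = -100,
    label_all_tokens: bool = True,
) -> List[int]:
    """Expand one label per word into one label per sub-word token."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens never contribute to the loss.
            aligned.append(ignore_index)
        elif word_id != previous:
            # First token of a word always receives the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation token: labeled only if label_all_tokens is set.
            aligned.append(word_labels[word_id] if label_all_tokens else ignore_index)
        previous = word_id
    return aligned


# Two words; the second one is split into two sub-word tokens.
word_ids = [None, 0, 1, 1, None]
word_labels = [3, 7]

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
```

With `label_all_tokens=True` the continuation token repeats its word's label; with `False` it is masked with `ignore_index`, so only one position per word is scored.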
- ignore_index: int = -100
- is_iob_schema: bool = False
- label_all_tokens: bool = True
- max_length: int = None
- name: str = 'sequence_labeling'
- path: str = None
- tags_field: str = None
- tokens_field: str = None
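One reason a dataset may need an `is_iob_schema` flag: when a word's tag is copied to its continuation tokens, a repeated `B-` tag would read as several entities starting in a row. A common convention in token-classification pipelines is to rewrite `B-X` as `I-X` on continuation tokens. This page does not say whether hezar applies this exact rule, so treat the sketch below as an illustration only; `continuation_tags` is a hypothetical helper.

```python
from typing import List


def continuation_tags(tags: List[str], is_first_subtoken: List[bool]) -> List[str]:
    """Rewrite B-X as I-X on tokens that continue a word."""
    fixed = []
    for tag, first in zip(tags, is_first_subtoken):
        if not first and tag.startswith("B-"):
            # A continuation token cannot begin a new entity.
            fixed.append("I-" + tag[2:])
        else:
            fixed.append(tag)
    return fixed


# The first word (tagged B-PER) is split into two sub-word tokens.
tags = ["B-PER", "B-PER", "O", "B-LOC"]
is_first = [True, False, True, True]
result = continuation_tags(tags, is_first)
```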