hezar.data.datasets.dataset module

class hezar.data.datasets.dataset.Dataset(config: DatasetConfig, split: str = 'train', preprocessor: str | Preprocessor | PreprocessorsContainer | None = None, **kwargs)

Bases: Dataset

Base class for all datasets in Hezar.

Parameters:

config – The configuration object for the dataset.
split – Dataset split name, e.g., train, test, or validation.
preprocessor – Preprocessor object or path to one (note that Hezar dataset classes require this argument).
**kwargs – Additional keyword arguments.
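As a rough illustration of the constructor signature above, here is a minimal sketch of a map-style dataset that takes a config, a split name, and a preprocessor. It does not import Hezar: ToyDatasetConfig and ToyDataset are hypothetical stand-ins, the in-memory samples are made up, and a plain callable plays the role of the preprocessor.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToyDatasetConfig:
    """Hypothetical stand-in for a DatasetConfig subclass (not a Hezar class)."""
    name: str = "toy"
    max_length: int = 8


class ToyDataset:
    """Hypothetical dataset mirroring the (config, split, preprocessor, **kwargs) signature."""

    def __init__(
        self,
        config: ToyDatasetConfig,
        split: str = "train",
        preprocessor: Optional[Callable[[str], Any]] = None,
        **kwargs,
    ):
        self.config = config
        self.split = split
        self.preprocessor = preprocessor
        # Extra keyword arguments are only stored here, for illustration.
        self.extra = kwargs
        # In-memory samples keyed by split; a real dataset would read files instead.
        samples = {"train": ["hello world", "foo bar baz"], "test": ["held out"]}
        self.data = samples[split]

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int):
        # Truncate according to the config, then apply the preprocessor if one was given.
        text = self.data[index][: self.config.max_length]
        return self.preprocessor(text) if self.preprocessor else text


train = ToyDataset(ToyDatasetConfig(), split="train", preprocessor=str.split)
print(len(train), train[0])
```

The split name only selects which samples are exposed, while the config carries the settings applied to each sample before it reaches the preprocessor.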

required_backends

List of required backends for the dataset.

config_filename

Default dataset config file name.

cache_dir

Default cache directory for the dataset.

static create_preprocessor(preprocessor: str | Preprocessor | PreprocessorsContainer)

Create the preprocessor for the dataset.

Parameters: preprocessor (str | Preprocessor | PreprocessorsContainer) – Preprocessor for the dataset.
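The three accepted argument types suggest a dispatch on type. The sketch below shows one way such a method could normalize its input. It is an assumption about the behavior, not Hezar's implementation: the Preprocessor and PreprocessorsContainer classes here are hypothetical stand-ins defined locally, load_from_path is a placeholder that performs no I/O, and returning a container in every case is a guess.

```python
from typing import Union


class Preprocessor:
    """Hypothetical stand-in for a single preprocessor (not the Hezar class)."""

    def __init__(self, name: str):
        self.name = name


class PreprocessorsContainer(dict):
    """Hypothetical stand-in: a mapping from name to Preprocessor."""


def load_from_path(path: str) -> PreprocessorsContainer:
    # Placeholder for loading from a Hub or local path; nothing is read here.
    return PreprocessorsContainer({"loaded": Preprocessor(f"loaded:{path}")})


def create_preprocessor(
    preprocessor: Union[str, Preprocessor, PreprocessorsContainer],
) -> PreprocessorsContainer:
    """Normalize any accepted input into a container (assumed return type)."""
    if isinstance(preprocessor, str):
        # A string is treated as a path to load from.
        return load_from_path(preprocessor)
    if isinstance(preprocessor, PreprocessorsContainer):
        # A container is already in the target form.
        return preprocessor
    if isinstance(preprocessor, Preprocessor):
        # A single preprocessor is wrapped under its own name.
        return PreprocessorsContainer({preprocessor.name: preprocessor})
    raise ValueError(f"Unsupported preprocessor type: {type(preprocessor).__name__}")


container = create_preprocessor(Preprocessor("tokenizer"))
print(list(container))
```

Normalizing once at construction time means the rest of the dataset code only has to handle a single preprocessor shape.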

Load the dataset from a Hub path or a local path.

Parameters:

hub_path (str | os.PathLike) – Path to the dataset on the Hub or on the local filesystem.
split (Optional[str | SplitType]) – Dataset split, defaults to “train”.
preprocessor (str | Preprocessor | PreprocessorsContainer) – Preprocessor object for the dataset.
config (DatasetConfig) – A config object used instead of the config in the repo, or when the repo has no dataset_config.yaml file.
config_filename (Optional[str]) – Dataset config file name. Falls back to dataset_config.yaml if not given.
cache_dir (str) – Path to the cache directory, defaults to Hezar’s cache directory.
**kwargs – Config parameters as keyword arguments.

Returns:

An instance of the loaded dataset.

Return type:

Dataset
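The interplay between config, config_filename, and **kwargs described above can be sketched as a small resolution function. This is an illustration only and not Hezar code: ToyConfig and resolve_config are hypothetical, the repo is modeled as a plain dict of file name to config, and letting keyword arguments override config fields is an assumption about precedence.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

# Fallback file name stated in the parameter description.
DEFAULT_CONFIG_FILENAME = "dataset_config.yaml"


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for a DatasetConfig."""
    path: str = ""
    max_length: int = 128


def resolve_config(
    repo_files: Dict[str, ToyConfig],
    config: Optional[ToyConfig] = None,
    config_filename: Optional[str] = None,
    **kwargs,
) -> ToyConfig:
    """Pick the config to use, then apply keyword overrides."""
    filename = config_filename or DEFAULT_CONFIG_FILENAME
    if config is None:
        # No explicit config: the repo must provide one under the chosen name.
        if filename not in repo_files:
            raise FileNotFoundError(f"{filename} not found in repo")
        config = repo_files[filename]
    # Assumption: keyword arguments override matching config fields.
    return replace(config, **kwargs)


repo = {"dataset_config.yaml": ToyConfig(path="repo", max_length=64)}
print(resolve_config(repo, max_length=32))
```

Passing an explicit config bypasses the repo lookup entirely, which covers the case of a repo without a dataset_config.yaml file.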