hezar.data.datasets.dataset module

class hezar.data.datasets.dataset.Dataset(config: DatasetConfig, split=None, **kwargs)

Bases: Dataset

Base class for all datasets in Hezar.

Parameters:
  • config – The configuration object for the dataset.

  • split – The dataset split to use.

  • **kwargs – Additional keyword arguments.

required_backends

List of required backends for the dataset.

Type:

List[str | Backends]

config_filename

Default dataset config file name.

Type:

str

cache_dir

Default cache directory for the dataset.

Type:

str

cache_dir = '/home/runner/.cache/hezar/datasets'
config_filename = 'dataset_config.yaml'
classmethod load(hub_path: str | os.PathLike, config: DatasetConfig = None, config_filename: str | None = None, split: str | SplitType | None = None, cache_dir: str = None, **kwargs) → Dataset

Load the dataset from a hub path.

Parameters:
  • hub_path (str | os.PathLike) – Path to the dataset, either on the Hub or on local disk.

  • config (DatasetConfig) – A config object that overrides the config in the repo, or that is used when the repo has no dataset_config.yaml file.

  • config_filename (Optional[str]) – Dataset config file name. Falls back to dataset_config.yaml if not given.

  • split (Optional[str | SplitType]) – Dataset split, defaults to “train”.

  • cache_dir (str) – Path to the cache directory. Defaults to Hezar’s cache directory.

  • **kwargs – Config parameters as keyword arguments.

Returns:

An instance of the loaded dataset.

Return type:

Dataset
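
The fallbacks described for load can be illustrated with a small, library-free sketch. This is not Hezar's implementation: ToyDatasetConfig, its fields, and resolve_load_args are hypothetical stand-ins, and the rule that keyword arguments override config fields is an assumption based on the description of **kwargs above.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Fallback file name, as stated in the parameter description above.
DEFAULT_CONFIG_FILENAME = "dataset_config.yaml"


@dataclass
class ToyDatasetConfig:
    """Hypothetical stand-in for DatasetConfig; the fields are illustrative only."""
    name: str = "toy"
    max_length: int = 128


def resolve_load_args(
    repo_config: Optional[ToyDatasetConfig] = None,
    config: Optional[ToyDatasetConfig] = None,
    config_filename: Optional[str] = None,
    split: Optional[str] = None,
    **kwargs,
) -> Tuple[ToyDatasetConfig, str, str]:
    """Mimic the documented fallbacks of load (a sketch, not Hezar's code)."""
    # Fall back to the default config file name when none is given.
    config_filename = config_filename or DEFAULT_CONFIG_FILENAME
    # The split defaults to "train".
    split = split or "train"
    # An explicit config takes the place of the one found in the repo.
    base = config if config is not None else repo_config
    if base is None:
        raise ValueError("No config in the repo and no config object was passed.")
    # Assumption: keyword arguments override individual config fields.
    final_config = replace(base, **kwargs)
    return final_config, config_filename, split


# The repo has a config; one field is overridden through a keyword argument.
cfg, fname, split = resolve_load_args(repo_config=ToyDatasetConfig(), max_length=64)
print(cfg, fname, split)
```

With Hezar installed, the corresponding call would take the form Dataset.load(hub_path, split="train"), where hub_path points to a dataset repo on the Hub or to a local directory.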

required_backends: List[str | Backends] = None