hezar.data.datasets.dataset module
- class hezar.data.datasets.dataset.Dataset(config: DatasetConfig, split: str = 'train', preprocessor: str | Preprocessor | PreprocessorsContainer | None = None, **kwargs)
Bases:
Dataset
Base class for all datasets in Hezar.
- Parameters:
config – The configuration object for the dataset.
split – Dataset split name, e.g., train, test, or validation.
preprocessor – Preprocessor object or path (note that Hezar dataset classes require this argument).
**kwargs – Additional keyword arguments.
- config_filename
Default dataset config file name.
- Type:
str
- cache_dir
Default cache directory for the dataset.
- Type:
str
- cache_dir = '/home/runner/.cache/hezar/datasets'
- config_filename = 'dataset_config.yaml'
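The two class attributes above act as defaults. As a minimal sketch of how such class-level defaults are commonly used as fallbacks, the example below uses a stand-in class, not the real Hezar `Dataset`. The name `SketchDataset`, the helper `resolve_locations`, and the idea that the cache path is derived from the user's home directory are all assumptions made for illustration; the `/home/runner/...` value shown above presumably reflects the machine on which the documentation was built.

```python
import os


class SketchDataset:
    """Hypothetical stand-in for illustration; not the real hezar Dataset."""

    # Class-level defaults mirroring the two documented attributes.
    # Assumption: the cache path is built from the current user's home directory.
    cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "hezar", "datasets")
    config_filename = "dataset_config.yaml"

    @classmethod
    def resolve_locations(cls, config_filename=None, cache_dir=None):
        """Return (config_filename, cache_dir), using class defaults for missing values."""
        return (
            config_filename or cls.config_filename,
            cache_dir or cls.cache_dir,
        )


# No arguments: both values come from the class attributes.
default_name, default_cache = SketchDataset.resolve_locations()

# Explicit arguments take precedence over the class attributes.
custom_name, custom_cache = SketchDataset.resolve_locations("my_config.yaml", "/tmp/hezar")
```

Keeping the defaults on the class rather than hard-coding them inside a method lets a subclass change them in one place.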
- static create_preprocessor(preprocessor: str | Preprocessor | PreprocessorsContainer)
Create the preprocessor for the dataset.
- Parameters:
preprocessor (str | Preprocessor | PreprocessorsContainer) – Preprocessor for the dataset, given as a path, a preprocessor object, or a container of preprocessors.
- classmethod load(hub_path: str | PathLike, split: str | SplitType | None = None, preprocessor: str | Preprocessor | PreprocessorsContainer | None = None, config: DatasetConfig | None = None, config_filename: str | None = None, cache_dir: str | None = None, **kwargs) → Dataset
Load the dataset from a hub path.
- Parameters:
hub_path (str | os.PathLike) – Path to the dataset, either on the Hub or on the local filesystem.
split (Optional[str | SplitType]) – Dataset split, defaults to “train”.
preprocessor (str | Preprocessor | PreprocessorsContainer) – Preprocessor object or path for the dataset.
config (DatasetConfig) – A config object to use instead of the config in the repo, or when the repo has no dataset_config.yaml file.
config_filename (Optional[str]) – Dataset config file name. Falls back to dataset_config.yaml if not given.
cache_dir (Optional[str]) – Path to the cache directory. Defaults to Hezar’s cache directory.
**kwargs – Config parameters as keyword arguments.
- Returns:
An instance of the loaded dataset.
- Return type: