# Datasets
Hezar provides both dataset class implementations and ready-to-use data files for the community.
## Hub Datasets
Hezar datasets are all hosted on the Hugging Face Hub and can be loaded just like any dataset on the Hub.
### Load using Hugging Face datasets
```python
from datasets import load_dataset

sentiment_dataset = load_dataset("hezarai/sentiment-dksf")
lscp_dataset = load_dataset("hezarai/lscp-pos-500k")
xlsum_dataset = load_dataset("hezarai/xlsum-fa")
...
```
### Load using Hezar Dataset
```python
from hezar.data import Dataset

sentiment_dataset = Dataset.load("hezarai/sentiment-dksf", preprocessor="hezarai/bert-base-fa")  # A TextClassificationDataset instance
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k", preprocessor="hezarai/roberta-base-fa")  # A SequenceLabelingDataset instance
xlsum_dataset = Dataset.load("hezarai/xlsum-fa", preprocessor="hezarai/t5-base-fa")  # A TextSummarizationDataset instance
...
```
**Note:** The `preprocessor` can also be a `Preprocessor` instance and does not have to be a path.
```python
from hezar.data import Dataset
from hezar.preprocessors import Preprocessor

tokenizer = Preprocessor.load("hezarai/bert-base-fa")
dataset = Dataset.load("hezarai/sentiment-dksf", preprocessor=tokenizer)
```
The difference between loading datasets with Hezar vs. Hugging Face `datasets` is the output class. When you load a dataset with Hezar's `Dataset` class, it automatically finds the proper class for that dataset and creates a PyTorch `Dataset` instance, so it can be passed directly to a PyTorch `DataLoader`. That is also why the `preprocessor` argument is required: iterating over the dataset needs the preprocessor to process the samples.
```python
from torch.utils.data import DataLoader

from hezar.data.datasets import Dataset

dataset = Dataset.load("hezarai/lscp-pos-500k", preprocessor="hezarai/distilbert-base-fa")
loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=dataset.data_collator)
itr = iter(loader)
print(next(itr))
```
When loading with Hugging Face `datasets`, on the other hand, the output is an HF `Dataset` instance.
**Note:** Any Hezar dataset can be loaded using HF `datasets`, but not vice versa! (Hezar looks for a `dataset_config.yaml` file in every dataset repo, so non-Hezar datasets cannot be loaded with Hezar's `Dataset` class in most cases.)
## Dataset classes
Hezar categorizes datasets based on their target task. The dataset classes all inherit from the base `Dataset` class, which is a PyTorch Dataset subclass (hence having `__getitem__` and `__len__` methods). Some examples of the dataset classes are `TextClassificationDataset`, `TextSummarizationDataset`, `SequenceLabelingDataset`, etc.
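For a quick illustration (a minimal sketch; the exact contents of a sample depend on the dataset class and preprocessor), both of those methods can be used directly on a loaded dataset:

```python
from hezar.data import Dataset

# Load a Hub dataset; this returns a TextClassificationDataset instance
dataset = Dataset.load("hezarai/sentiment-dksf", preprocessor="hezarai/bert-base-fa", split="train")

print(len(dataset))  # Number of samples in the split
print(dataset[0])    # A single processed sample, e.g. token ids and the label
```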
## Dataset Configs
All dataset classes must have a dataset config (a dataclass of type `DatasetConfig`) implemented. Any field in such a dataclass is acceptable, but the base dataset config takes the following for all configs:

- `task`: A mandatory field representing the task of the dataset, of type `hezar.constants.TaskType`.
- `path`: Path to the files of the dataset to be loaded. Can be a Hub dataset or a local path, but remember that the loading procedure must be implemented in the `_load` method of the dataset.
- `max_size`: Maximum number of data samples. Overrides the length of the dataset when calling `len(dataset)`. If set to a float value between 0 and 1, it is interpreted as a fraction, e.g., 0.3 means 30% of the whole length (see the sketch after this list).
- `hf_load_kwargs`: Keyword arguments to pass to the HF `datasets.load_dataset()` call.
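To make the `max_size` semantics concrete, here is a small illustrative sketch (not Hezar's actual implementation) of how an integer vs. a fractional value would affect the reported length:

```python
def effective_length(total_samples: int, max_size=None) -> int:
    """Illustrative only: how `max_size` could cap the dataset length."""
    if max_size is None:
        return total_samples
    if isinstance(max_size, float) and 0 < max_size < 1:
        return int(total_samples * max_size)  # fraction of the whole dataset, e.g. 0.3 -> 30%
    return min(total_samples, int(max_size))  # absolute cap

print(effective_length(10_000))        # 10000
print(effective_length(10_000, 500))   # 500
print(effective_length(10_000, 0.3))   # 3000
```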
## Custom Datasets
As mentioned, the `Dataset` class is a wrapper around `torch.utils.data.Dataset` that makes it compatible with Hezar's structure and standards (most importantly, working with configurations), so writing your own custom dataset is pretty easy. The base class for all dataset classes in Hezar is `hezar.data.Dataset`, which subclasses `torch.utils.data.Dataset` and adds its own functionality on top.
Every dataset class must follow this signature:
```python
from dataclasses import dataclass

from hezar.data import Dataset, DatasetConfig
from hezar.registry import register_dataset


@dataclass
class ImageCaptioningDatasetConfig(DatasetConfig):
    name = "image_captioning"  # Must be unique among all dataset names
    attr1: int = ...
    attr2: str = ...
    attr3: bool = ...


@register_dataset("image_captioning", config_class=ImageCaptioningDatasetConfig)  # Registering is optional
class ImageCaptioningDataset(Dataset):
    def __init__(self, config: ImageCaptioningDatasetConfig, split=None, preprocessor=None, **kwargs):
        super().__init__(config=config, split=split, preprocessor=preprocessor, **kwargs)
        # Custom initializations here

    def _load(self, split):
        """
        Load the data, which can be a CSV file, an HF dataset, etc. The result is assigned to `self.data`.
        """
        pass

    def __len__(self):
        """
        Typically returns the length of `self.data`, or `config.max_size` if set.
        """
        pass

    def __getitem__(self, index):
        pass
```
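For illustration, a dataset defined this way could then be instantiated like any other Hezar dataset; the attribute values and the preprocessor path below are placeholders, not real ones:

```python
# Hypothetical usage of the custom dataset defined above (placeholder values).
config = ImageCaptioningDatasetConfig(attr1=1, attr2="captions.csv", attr3=True)
dataset = ImageCaptioningDataset(config, split="train", preprocessor="path/to/a/preprocessor")

print(len(dataset))   # Relies on your `__len__` implementation
print(dataset[0])     # Relies on your `__getitem__` implementation
```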
## Loading with 🤗Datasets
You can load all Hezar datasets using the 🤗Datasets library too. Doing so has the following pros and cons:
Pros:

- You can work with any dataset on the Hub and use it easily with Hezar.
- You can leverage the multiprocessing and batch processing features of such datasets (not available with torch datasets).
- You can leverage the mapped dataset caching provided by 🤗Datasets.
- No integration is needed to make your old data pipeline code work with Hezar.
Cons:

- You have to take care of the data processing yourself, unless one of the dataset processors at `hezar.data.dataset_processors` suits your needs.
### Using Hezar's Dataset Processors
In order to replicate the behavior of the `hezar.data.Dataset` classes for 🤗-loaded datasets, Hezar also implements a group of dataset processor classes. You can use them to map the loaded datasets and get the same processed instances when iterating over your loaded 🤗 datasets.
Below is a comparison of both methods, using Hezar's torch-compatible datasets vs. 🤗 Datasets:
**Loading and Processing with Hezar**
```python
from torch.utils.data import DataLoader

from hezar.data import SpeechRecognitionDataset, SpeechRecognitionDatasetConfig

# You can also use the regular `Dataset.load("hezarai/common-voice-13-fa")`; the explicit form below is for better understanding.
dataset = SpeechRecognitionDataset(
    SpeechRecognitionDatasetConfig(
        path="hezarai/common-voice-13-fa",
        sampling_rate=16000,
        audio_file_path_column="path",
        audio_column="audio",
        audio_array_column="array",
        transcript_column="sentence",
    ),
    split="train",
    preprocessor="hezarai/whisper-small-fa",
)

loader = DataLoader(dataset, batch_size=16, collate_fn=dataset.data_collator)
itr = iter(loader)
print(next(itr))
```
**Loading and Processing with 🤗Datasets and Hezar Dataset Processors**
```python
from datasets import Audio, load_dataset
from torch.utils.data import DataLoader

from hezar.data import SpeechRecognitionDataCollator, SpeechRecognitionDatasetProcessor
from hezar.preprocessors import Preprocessor

dataset = load_dataset("hezarai/common-voice-13-fa", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
dataset = dataset.select_columns(["sentence", "audio"])

preprocessor = Preprocessor.load("hezarai/whisper-small-fa")

dataset_processor = SpeechRecognitionDatasetProcessor(
    tokenizer=preprocessor.tokenizer,
    feature_extractor=preprocessor.audio_feature_extractor,
    transcript_column="sentence",
    audio_array_padding="max_length",
)
data_collator = SpeechRecognitionDataCollator(
    feature_extractor=preprocessor.audio_feature_extractor,
    tokenizer=preprocessor.tokenizer,
    labels_padding="max_length",
    labels_max_length=256,
)

processed_dataset = dataset.map(
    dataset_processor,
    batched=True,
    batch_size=100,
    desc="Processing dataset...",
)
processed_dataset = processed_dataset.select_columns(["input_features", "labels", "attention_mask"])

data_loader = DataLoader(processed_dataset, batch_size=16, collate_fn=data_collator)
x = next(iter(data_loader))
print(x)
```
Both snippets above give you the same kind of results. Using dataset processors is more involved, but it gives you more control and better integration with the data pipelines typically used nowadays.
**Note:** You don't necessarily need to use Hezar's dataset processor classes. They exist only to implement the same procedures and reproduce the same results, which means that any code using 🤗 Datasets will work with Hezar's `Trainer`.
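As a rough sketch of that point, a 🤗-processed dataset like `processed_dataset` above could be passed to the Trainer. The module paths and `TrainerConfig` fields here are assumptions based on typical Hezar usage, so check the Trainer documentation for the exact arguments:

```python
from hezar.models import Model
from hezar.trainer import Trainer, TrainerConfig

# Assumed setup; the config field values below are illustrative, not exhaustive.
model = Model.load("hezarai/whisper-small-fa")
trainer = Trainer(
    config=TrainerConfig(
        task="speech_recognition",  # assumed task name
        output_dir="whisper-small-fa-commonvoice",
        num_epochs=1,
        batch_size=8,
    ),
    model=model,
    train_dataset=processed_dataset,  # the 🤗 dataset processed above
    data_collator=data_collator,
    preprocessor=preprocessor,
)
trainer.train()
```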
## Loading Regular HF Datasets
All the datasets currently provided on Hezar's Hugging Face organization have a `dataset_config.yaml` file in their repos, which regular HF datasets do not. If you need to load such datasets (ones that have the correct structure and fields) in Hezar using the `Dataset.load()` method, you have to provide the dataset config manually.
**Note:** If the dataset needs a configuration name to be specified (as in `datasets.load_dataset(path, name=<configuration name>)`), you can pass it either in `DatasetConfig.hf_load_kwargs` or as `Dataset.load("hezarai/dataset:config_name")`. See the example below.
```python
from hezar.data import Dataset, SpeechRecognitionDatasetConfig

dataset_path = "mozilla-foundation/common_voice_17_0:fa"  # `fa` is the config name of the dataset
dataset_config = SpeechRecognitionDatasetConfig(
    path=dataset_path,
    labels_max_length=64,
)  # You can modify other fields too
dataset = Dataset.load(dataset_path, split="train", config=dataset_config)
```
## Dataset Templates (Deprecated)
We try to have a simple yet practical pattern for all datasets on the Hub. Every dataset on the Hub needs to have a dataset loading script. Some ready-to-use templates are located in the `templates/dataset_scripts` folder. To add a new Hezar-compatible dataset to the Hub, you can follow the guide provided there.
**Note:** Dataset scripts are no longer recommended on the Hugging Face Hub. Instead, datasets must be uploaded in Parquet format, which already carries all the required metadata. For more info on how to create or convert your datasets to Parquet format, see this tutorial.
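For example, a dataset loaded (or created) with 🤗 Datasets can be re-uploaded as Parquet with `push_to_hub`; the repo IDs below are placeholders:

```python
from datasets import load_dataset

# Placeholder repo IDs; replace them with your own dataset paths.
dataset = load_dataset("your-username/old-script-based-dataset")
dataset.push_to_hub("your-username/your-parquet-dataset")  # Uploads each split as Parquet files along with metadata
```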