Datasets¶
Hezar provides both dataset class implementations and ready-to-use data files for the community.
Hub Datasets¶
Hezar datasets are all hosted on the Hugging Face Hub and can be loaded just like any dataset on the Hub.
Load using Hugging Face datasets¶
from datasets import load_dataset
sentiment_dataset = load_dataset("hezarai/sentiment-dksf")
lscp_dataset = load_dataset("hezarai/lscp-pos-500k")
xlsum_dataset = load_dataset("hezarai/xlsum-fa")
...
Load using Hezar Dataset¶
from hezar.data import Dataset
sentiment_dataset = Dataset.load("hezarai/sentiment-dksf") # A TextClassificationDataset instance
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k") # A SequenceLabelingDataset instance
xlsum_dataset = Dataset.load("hezarai/xlsum-fa") # A TextSummarizationDataset instance
...
The difference between using Hezar and Hugging Face datasets is the output class. When you load a dataset with
Hezar's Dataset class, it automatically finds the proper class for that dataset and creates a PyTorch Dataset
instance that can be passed directly to a PyTorch DataLoader.
from torch.utils.data import DataLoader
from hezar.data import Dataset

dataset = Dataset.load(
    "hezarai/lscp-pos-500k",
    tokenizer_path="hezarai/distilbert-base-fa",  # tokenizer_path is necessary for data collator
)
# Batch the samples using the collator that belongs to the dataset
loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=dataset.data_collator)
itr = iter(loader)
print(next(itr))  # print the first batch
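The collate_fn argument in the example above matters because tokenized samples usually differ in length, so they have to be padded before they can be stacked into a batch. The following is only a rough, dependency-free sketch of that idea and not Hezar's actual collator: the pad_collate function, the field names and the padding values are all assumptions made for illustration.

```python
from typing import Dict, List


def pad_collate(
    samples: List[Dict[str, List[int]]],
    pad_values: Dict[str, int],
) -> Dict[str, List[List[int]]]:
    """Pad each field to the longest sequence in the batch (hypothetical helper)."""
    batch: Dict[str, List[List[int]]] = {}
    for key in samples[0]:
        max_len = max(len(sample[key]) for sample in samples)
        batch[key] = [
            sample[key] + [pad_values[key]] * (max_len - len(sample[key]))
            for sample in samples
        ]
    return batch


# Two toy samples of different lengths (made-up token ids and tag ids)
samples = [
    {"token_ids": [5, 8, 2], "labels": [1, 0, 1]},
    {"token_ids": [7, 3], "labels": [0, 1]},
]
# Assumed padding values: 0 for tokens, -100 for labels
batch = pad_collate(samples, pad_values={"token_ids": 0, "labels": -100})
print(batch)
```

A real collator would typically also return tensors and an attention mask; those details are left out here to keep the sketch short.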
When loading with Hugging Face datasets, however, the output is an HF Dataset instance.
In a nutshell, any Hezar dataset can be loaded using HF datasets, but not vice versa: Hezar looks for a
dataset_config.yaml file in the dataset repo, so non-Hezar datasets cannot be loaded using the Hezar Dataset class.
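The rule described here can be pictured as a two-step lookup: check that the repo contains dataset_config.yaml, then map the config to a dataset class. The snippet below is a simplified sketch under stated assumptions, not Hezar's implementation: the registry, the "name" config key, its values, and the resolve_dataset_class helper are hypothetical, and the repo is modelled as a plain dict of already-parsed files.

```python
from typing import Any, Dict

CONFIG_FILENAME = "dataset_config.yaml"  # file name taken from the text above

# Hypothetical registry: value of an assumed "name" key -> dataset class name
DATASET_REGISTRY: Dict[str, str] = {
    "text_classification": "TextClassificationDataset",
    "sequence_labeling": "SequenceLabelingDataset",
    "text_summarization": "TextSummarizationDataset",
}


def resolve_dataset_class(repo: Dict[str, Any]) -> str:
    """Return the dataset class name for a repo, or fail if it has no config."""
    if CONFIG_FILENAME not in repo:
        raise FileNotFoundError(f"{CONFIG_FILENAME} not found: not a Hezar dataset repo")
    name = repo[CONFIG_FILENAME].get("name")
    if name not in DATASET_REGISTRY:
        raise KeyError(f"Unknown dataset type: {name!r}")
    return DATASET_REGISTRY[name]


# A repo with a config resolves to a class name
hezar_repo = {CONFIG_FILENAME: {"name": "sequence_labeling"}, "README.md": {}}
print(resolve_dataset_class(hezar_repo))

# A repo without the config file cannot be resolved
plain_repo = {"README.md": {}, "train.csv": {}}
try:
    resolve_dataset_class(plain_repo)
except FileNotFoundError as error:
    print(error)
```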
Dataset classes¶
Hezar categorizes datasets based on their target task. The dataset classes all inherit from the base Dataset class,
which is a PyTorch Dataset subclass (and therefore implements the __getitem__ and __len__ methods).
Some examples of the dataset classes are TextClassificationDataset, TextSummarizationDataset,
SequenceLabelingDataset, etc.
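To make the __getitem__ and __len__ contract concrete, here is a minimal sketch of a map-style dataset for a classification task. It is a plain-Python stand-in, so it runs without PyTorch; a real implementation would subclass torch.utils.data.Dataset (or Hezar's base Dataset). The ToyTextClassificationDataset class and its fields are hypothetical and are not part of Hezar.

```python
from typing import Dict, List, Tuple, Union


class ToyTextClassificationDataset:
    """Hypothetical map-style dataset: indexable samples plus a known length."""

    def __init__(self, rows: List[Tuple[str, str]], label2id: Dict[str, int]):
        self.rows = rows
        self.label2id = label2id

    def __len__(self) -> int:
        # Number of samples; a DataLoader uses this to know how many indices exist
        return len(self.rows)

    def __getitem__(self, index: int) -> Dict[str, Union[str, int]]:
        # Return one sample as a dict, converting the label to its integer id
        text, label = self.rows[index]
        return {"text": text, "label": self.label2id[label]}


dataset = ToyTextClassificationDataset(
    rows=[("great movie", "positive"), ("boring plot", "negative")],
    label2id={"negative": 0, "positive": 1},
)
print(len(dataset))
print(dataset[1])
```

Because a map-style dataset only needs these two methods, each task-specific class can decide what a single sample looks like while the batching code stays the same.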
Dataset Templates¶
We try to keep a simple yet practical pattern for all datasets on the Hub. Every dataset on the Hub needs a dataset loading script. Some ready-to-use templates are located in the templates/dataset_scripts folder. To add a new Hezar-compatible dataset to the Hub, you can follow the guide provided there.