Hezar’s Architecture
Right from the first lines of code, Hezar was built with simplicity, modularity and extensibility in mind. Hezar has a simple yet flexible design pattern that can be seen across most of its main modules. In this guide, we demonstrate the main ideas behind the design.
Going forward, by the term module, we mean any main class like Model, Dataset, Metric, Trainer, Preprocessor, etc.
Concept 1: Configurable Modules
Every single module object in Hezar can be constructed from a key-value container. This container is the module’s config, which contains everything needed to build an object from that module. To keep configs portable, serializable and convertible to Python code, Hezar provides its most important class, Config. The Config class is a simple Python dataclass equipped with extra methods for importing, exporting, pushing to the Hub, etc. It is defined in hezar/configs.py along with its config derivatives.
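As a rough illustration of the idea (not Hezar’s actual implementation), a dataclass-based config can be exported to a plain dictionary and rebuilt from one. The class name `SketchConfig` and the methods `to_dict` and `from_dict` are hypothetical names chosen for this sketch.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a config class; not Hezar's real `Config`."""

    name: str = "sketch_module"
    hidden_size: int = 768
    dropout: float = 0.1

    def to_dict(self) -> dict:
        # Export every field as a plain, serializable key-value container
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "SketchConfig":
        # Keep only the keys that are declared as fields and ignore the rest
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


config = SketchConfig(hidden_size=1024)
exported = config.to_dict()
restored = SketchConfig.from_dict({**exported, "unknown_key": 1})
```

Because the exported form is a plain dictionary, it can be written to any text format and turned back into an equivalent config object later.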
Right now the config derivatives are:
ModelConfig
DatasetConfig
PreprocessorConfig
TrainerConfig
EmbeddingConfig
MetricConfig
So every module must have its own config inherited from Config. When defining a new config dataclass, you must define a unique name (as the parameter name), which identifies the module type that uses that config class. We’ll discuss why this name parameter is necessary in the registry section.
To give some examples:
Let’s assume you want to write a new model class called MyAwesomeModel. The first step is to provide a config dataclass:
from dataclasses import dataclass

from hezar.models import ModelConfig, Model


@dataclass
class MyAwesomeModelConfig(ModelConfig):
    name = "my_awesome_model"  # this has to be a unique name among all models configs
    my_param: str = "awesome"
    other_param: str = "more_awesome"


class MyAwesomeModel(Model):
    def __init__(self, config, **kwargs):
        super().__init__(config, **kwargs)
        # Define the layers or any other stuff here
        ...

    def forward(self, inputs, **kwargs):
        # Casual PyTorch forward method
        ...
Wait, what’s that name for? Why would you need to define a name for everything? The short answer is Hezar’s registry system. So let’s dive into it!
Concept 2: Modules’ Registries
There are lots of base modules in Hezar, many of which might have dozens of subclasses, but as you might have seen by now, almost every module can be loaded through its base class in a single line. Take a look at the snippets below:
# Load a model
from hezar.models import Model
roberta_tc = Model.load("hezarai/roberta-fa-sentiment-dksf") # roberta_tc is a RobertaTextClassification instance
bert_pos = Model.load("hezarai/bert-fa-pos-lscp-500k") # bert_pos is a BertSequenceLabeling instance
whisper_speech = Model.load("hezarai/whisper-small-fa") # whisper_speech is a WhisperSpeechRecognition instance
...
# Load a dataset
from hezar.data import Dataset
sentiment_dataset = Dataset.load("hezarai/sentiment-dksf") # A TextClassificationDataset instance
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k") # A SequenceLabelingDataset instance
xlsum_dataset = Dataset.load("hezarai/xlsum-fa") # A TextSummarizationDataset instance
...
# Load preprocessors
from hezar.preprocessors import Preprocessor
wordpiece = Preprocessor.load("hezarai/bert-base-fa") # A WordPieceTokenizer instance
whisper_bpe = Preprocessor.load("hezarai/whisper-small-fa") # A WhisperBPETokenizer instance
sp_unigram_bpe = Preprocessor.load("hezarai/t5-base-fa") # A SentencePieceUnigramTokenizer instance
...
# Load embedding
from hezar.embeddings import Embedding
fasttext = Embedding.load("hezarai/fasttext-fa-300") # A FastText instance
word2vec = Embedding.load("hezarai/word2vec-skipgram-fa-wikipedia") # A Word2Vec instance
...
So, what’s going on under the hood that handles module loading and initialization?
Registry System
Well, there are ways to tackle this challenge, but Hezar manages this by using a global registry for every module type. These registries are simple Python dictionaries that hold the properties for every module class, module config, etc. The general structure is like below:
# Models registry for example
models_registry = {
    "bert_text_classification": Registry(
        module_class=hezar.models.text_classification.bert.bert_text_classification.BertTextClassification,
        config_class=hezar.models.text_classification.bert.bert_text_classification_config.BertTextClassificationConfig,
        description="SOME MODEL DESCRIPTION ..."
    ),
    "AND SO ON...": Registry(...)
}
Each registry value is a Registry (data)class that has 3 properties: config_class, module_class and description.
module_class: Holds the class object for the module. Using this property you can actually create the module object.
config_class: Holds the config class and can be passed to the module class so that the module can be created.
description: Holds the description of the model if given.
But how are the modules inserted into the registries? The answer is registry class decorators.
register_*() Class Decorators
In the file hezar/registry.py, there are a bunch of decorator functions that register any module into the right registry automagically!
These decorators take three parameters:
name: A string name that has to be the same as the one in the config
config_class: The config class
description: Optional description for the module
The example below demonstrates registering a model:
...
from dataclasses import dataclass

from hezar.models import Model, ModelConfig
from hezar.registry import register_model


@dataclass
class MyBertConfig(ModelConfig):
    name = "my_bert"
    vocab_size: int = 1000
    hidden_size: int = 768


# Below line is all you need to add `my_bert` to `models_registry`
@register_model("my_bert", config_class=MyBertConfig)
class MyBert(Model):
    def __init__(self, config: MyBertConfig, **kwargs):
        super().__init__(config, **kwargs)

    def forward(self, inputs, **kwargs):
        ...
Registry decorators currently include:
register_model
register_preprocessor
register_dataset
register_embedding
register_metric
register_trainer
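To make the mechanism more concrete, here is a minimal sketch of how a registering class decorator could work. It is not the code in hezar/registry.py: the names `RegistryEntry`, `sketch_models_registry` and `register_sketch_model` are hypothetical, and the name-matching and duplicate checks are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RegistryEntry:
    """Hypothetical registry value holding the three properties described above."""

    module_class: type
    config_class: type
    description: Optional[str] = None


# A plain dictionary acting as the registry for one module type
sketch_models_registry: Dict[str, RegistryEntry] = {}


def register_sketch_model(
    name: str, config_class: type, description: Optional[str] = None
) -> Callable[[type], type]:
    def decorator(cls: type) -> type:
        # Assumed check: the registry key must match the name declared in the config class
        if getattr(config_class, "name", None) != name:
            raise ValueError(f"`{name}` does not match the name in `{config_class.__name__}`")
        # Assumed check: a name can be registered only once
        if name in sketch_models_registry:
            raise ValueError(f"`{name}` is already registered")
        sketch_models_registry[name] = RegistryEntry(cls, config_class, description)
        return cls  # the decorated class itself is returned unchanged

    return decorator


@dataclass
class TinyConfig:
    name = "tiny"  # plain class attribute, not a dataclass field
    hidden_size: int = 8


@register_sketch_model("tiny", config_class=TinyConfig, description="A toy model")
class TinyModel:
    def __init__(self, config: TinyConfig):
        self.config = config


# Look the module up by name and build it from its default config
entry = sketch_models_registry["tiny"]
model = entry.module_class(entry.config_class())
```

The decorator runs once, when the class is defined, so importing the file that defines a module is enough to make it available by name.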
Getting Available Modules
To figure out what modules are available in a registry, there are also utils for that:
from hezar import utils
print(utils.list_available_models())
print(utils.list_available_preprocessors())
print(utils.list_available_datasets())
print(utils.list_available_metrics())
print(utils.list_available_embeddings())
...
Creating Modules from Registry Names
So now it’s pretty easy to create module objects using their name! Let’s say you want to create a BPE tokenizer. You can do it this way:
from hezar.registry import preprocessors_registry
module_cls = preprocessors_registry["bpe_tokenizer"].module_class
config_cls = preprocessors_registry["bpe_tokenizer"].config_class
bpe = module_cls(config_cls())
However, this is not how it’s actually done in Hezar because it’s long and ugly! To handle this properly, we use another internal feature of Hezar called builders.
Builders
Using builders, you can build modules from their registry names in a single line of code. This family of functions takes 3 main parameters:
name: A registry key name representing that module. This name has to be present in the corresponding registry!
config: Optionally you can pass a config object to control how the module is built. The config has to be of a type that the module accepts.
**kwargs: Optionally you can pass config parameters as keyword arguments to override the default config. (The override priority is kwargs > config > default config)
from hezar import builders
bert = builders.build_model("bert_mask_filling", hidden_size=768, vocab_size=50000)
sp_bpe = builders.build_preprocessor("sentencepiece_bpe_tokenizer")
tc_dataset = builders.build_dataset("text_classification", path="hezarai/sentiment-dksf", tokenizer_path="hezarai/bert-base-fa")
...
Available builders include:
build_model
build_dataset
build_preprocessor
build_embedding
build_metric
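The override priority described for builders (kwargs > config > default config) can be sketched in a few lines. This is only an illustration under assumed names (`ToyConfig`, `ToyModel`, `toy_registry`, `build_toy`), not the code behind Hezar’s builders.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class ToyConfig:
    hidden_size: int = 768
    vocab_size: int = 30000


class ToyModel:
    def __init__(self, config: ToyConfig):
        self.config = config


# Hypothetical registry: name -> (module class, config class)
toy_registry = {"toy_model": (ToyModel, ToyConfig)}


def build_toy(name: str, config: Optional[ToyConfig] = None, **kwargs):
    if name not in toy_registry:
        raise KeyError(f"`{name}` is not registered; available: {sorted(toy_registry)}")
    module_cls, config_cls = toy_registry[name]
    # Start from the given config, or from the default config if none is passed
    config = config if config is not None else config_cls()
    # Keyword arguments that name config fields win over the config values
    field_names = {f.name for f in fields(config)}
    overrides = {k: v for k, v in kwargs.items() if k in field_names}
    return module_cls(replace(config, **overrides))


default = build_toy("toy_model")
from_config = build_toy("toy_model", config=ToyConfig(hidden_size=256))
overridden = build_toy("toy_model", config=ToyConfig(hidden_size=256), hidden_size=512)
```

In this sketch, keyword arguments that do not match a config field are silently dropped; a real builder might forward them elsewhere or raise an error instead.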
So why would you need to use builders or registries when you can import everything normally, like below?
from hezar.models import WhisperSpeechRecognition, WhisperSpeechRecognitionConfig
whisper = WhisperSpeechRecognition(WhisperSpeechRecognitionConfig(max_new_tokens=400))
The answer is that if you want to do it in a straightforward way, you can always use the classes directly. But the fact is that everything works with configs, and a config must have at least some identifiers so that a module can be initialized from it. The main usage of the registries is to be able to create everything from the configs! So let’s slide into the next section, the Hub!
Concept 3: Hugging Face Hub Integration
In Hezar, EVERY module can be uploaded to or downloaded from the Hugging Face Hub with ease! Modules have 3 main methods to do so:
load: A method implemented in any type of base class that loads the module from the Hub or local disk automagically!
save: A method to save all the necessary files and configurations to a path on the local disk.
push_to_hub: A method implemented in any type of base class that pushes all the necessary files and configurations to the Hub so that the module can be loaded from the Hub again.
Loading
All base modules implement their own load method based on their characteristics. But the first step in every load process is loading the configuration, as all the info lies there; any other files are loaded afterwards.
For example, the class Model first loads its config and builds the model using build_model and the config parameters. Then the state dict is loaded into the model. If the path contains preprocessor files and configs, it loads them too.
On the other hand, some simple modules like metrics might just load the config to create a metric instance.
One important feature of any load method is that, like builders, it accepts config parameters as keyword arguments so that you can override config properties.
Saving
Almost every module implements the save method, which is responsible for saving the config and other related files to disk. This method takes a path parameter, which is just the base folder path; any necessary subfolder will be created automatically based on the module type. For example, if you save a tokenizer at path my_tokenizer/, the Tokenizer’s save method will create a preprocessor folder and save tokenizer.json and tokenizer_config.yaml in that folder. You can control the subfolder parameter and other file/path names if the base class gives you the option.
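The saving and loading flow described above can be sketched with the standard library alone. Everything here is hypothetical (`ToyTokenizer`, `ToyTokenizerConfig`, the `subfolder` and `config_filename` attributes), and the config is written as JSON only to avoid a third-party dependency, whereas the example above mentions a YAML config file.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyTokenizerConfig:
    name: str = "toy_tokenizer"
    lowercase: bool = True
    max_length: int = 128


class ToyTokenizer:
    # Hypothetical subfolder and file names used only by this sketch
    subfolder = "preprocessor"
    config_filename = "tokenizer_config.json"

    def __init__(self, config: ToyTokenizerConfig):
        self.config = config

    def save(self, path: str) -> Path:
        # `path` is only the base folder; the subfolder is created on demand
        target = Path(path) / self.subfolder
        target.mkdir(parents=True, exist_ok=True)
        config_file = target / self.config_filename
        config_file.write_text(json.dumps(asdict(self.config), indent=2))
        return config_file

    @classmethod
    def load(cls, path: str, **kwargs) -> "ToyTokenizer":
        # Read the config first, apply keyword overrides, then build the object
        config_file = Path(path) / cls.subfolder / cls.config_filename
        data = json.loads(config_file.read_text())
        data.update(kwargs)
        return cls(ToyTokenizerConfig(**data))


with tempfile.TemporaryDirectory() as base_dir:
    saved_file = ToyTokenizer(ToyTokenizerConfig(max_length=64)).save(base_dir)
    saved_relative = saved_file.relative_to(base_dir).as_posix()
    reloaded = ToyTokenizer.load(base_dir)
    overridden = ToyTokenizer.load(base_dir, lowercase=False)
```

Since the config is the first thing read back, a keyword override such as `lowercase=False` takes effect before the object is constructed.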
Pushing to the Hub
Pushing to the Hugging Face Hub is very similar to the save method. The only difference is that the files are uploaded to the Hub after saving.
Concept 4: Task-based Modeling & Training
Hezar is a practical library, not a framework (it can be, though!). That’s why we decided to categorize models, trainers, datasets, etc. under task names, e.g., speech_recognition, mask_filling, etc. If you’ve worked with other libraries, this might seem somewhat irrational, but trust us! For most users and use cases, this fits better!
Currently, all models, trainers and datasets are categorized by task name, but this does not mean that for every task, there exists a model, trainer, dataset, etc.
Concept 5: Integration with Other Tools
Re-inventing the wheel has no place in Hezar. If something we need already exists somewhere, it’s strongly recommended to just copy and paste it into the code!
In terms of backbone frameworks and libraries, we carefully research the existing tools and choose the one that is the simplest yet popular.
More specifically, here’s a simple summary of the core modules in Hezar:
Models: Every model is a hezar.models.Model instance which is, in fact, a PyTorch nn.Module wrapper with extra features for saving, loading, exporting, etc.
Datasets: Every dataset is a hezar.data.Dataset instance which is a PyTorch Dataset implemented specifically for each task that can load the data files from the Hugging Face Hub.
Preprocessors: All preprocessors are preferably backed by a robust library like Tokenizers, pillow, etc.
Embeddings: All embeddings are developed on top of Gensim and can be easily loaded from the Hub and used in just 2 lines of code!
Trainer: Trainer is the base class for training almost any model in Hezar or even your own custom models backed by Hezar. The Trainer comes with a lot of features and is also exportable to the Hub!
Metrics: Metrics are also configurable and portable modules backed by Scikit-learn, seqeval, etc. and can be easily used in the trainers!
Concept 6: Our Inspirations
Hezar was built using the best practices we’ve learned from working with dozens of industry-leading open-source projects in the AI world. Our biggest inspirations are:
Transformers by Hugging Face
Fairseq by Meta AI
Flair by FlairAI
Ludwig by Ludwig AI
UniLM by Microsoft
PyTorch Ignite by PyTorch
Lightning by Lightning AI
Hazm by Roshan