hezar.configs module

Configs are at the core of Hezar. All core modules like Model, Preprocessor, Trainer, etc. take their parameters as a config container, which is an instance of Config or one of its derivatives. A Config is a Python dataclass with auxiliary methods for loading, saving, pushing to the Hub, and more.

Examples

>>> from hezar.configs import ModelConfig
>>> config = ModelConfig.load("hezarai/bert-base-fa")
>>> from hezar.models import BertMaskFillingConfig
>>> bert_config = BertMaskFillingConfig(vocab_size=50000, hidden_size=768)
>>> bert_config.save("saved/bert", filename="model_config.yaml")
>>> bert_config.push_to_hub("hezarai/bert-custom", filename="model_config.yaml")
class hezar.configs.Config[source]

Bases: object

Base class for all configs in Hezar.

All configs are simple dataclasses with some customized functionality for managing their attributes. There are also some Hezar-specific methods: load, save, and push_to_hub.
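
Since every config is a plain dataclass, a custom config can be declared by subclassing Config (or one of its derivatives) and decorating it with @dataclass. The sketch below is illustrative only; the class and field names are not part of Hezar:

>>> from dataclasses import dataclass
>>> from hezar.configs import Config
>>> @dataclass
... class MyCustomConfig(Config):  # hypothetical config, for illustration only
...     name: str = "my_custom_config"
...     hidden_size: int = 768
...     dropout: float = 0.1
>>> my_config = MyCustomConfig(hidden_size=1024)
>>> my_config.save("saved/my_custom", filename="my_custom_config.yaml")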

config_type: str = 'base'
dict()[source]

Returns the config object as a dictionary (works on nested dataclasses too)

Returns:

The config object as a dictionary
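
For example, reusing the BERT config from the module-level examples (the exact keys depend on the config class):

>>> from hezar.models import BertMaskFillingConfig
>>> bert_config = BertMaskFillingConfig(vocab_size=50000, hidden_size=768)
>>> config_dict = bert_config.dict()
>>> config_dict["hidden_size"]
768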

classmethod fields()[source]
classmethod from_dict(dict_config: Dict | DictConfig, **kwargs)[source]

Load config from a dict-like object. Nested configs are also recursively converted to their classes if possible.
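
A minimal sketch, reusing the BertMaskFillingConfig from the module-level examples:

>>> from hezar.models import BertMaskFillingConfig
>>> bert_config = BertMaskFillingConfig.from_dict({"vocab_size": 50000, "hidden_size": 768})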

get(key, default=None)[source]
keys()[source]
classmethod load(hub_or_local_path: str | PathLike, filename: str | None = None, subfolder: str | None = None, repo_type: str | None = None, cache_dir: str | None = None, **kwargs) Config[source]

Load a config from the Hub, or locally if it already exists on disk (handled by HfApi)

Parameters:
  • hub_or_local_path – Local or Hub path for the config

  • filename – Configuration filename

  • subfolder – Optional subfolder path where the config is in

  • repo_type – Repo type, e.g., model, dataset, etc.

  • cache_dir – Path to cache directory

  • **kwargs – Manual config parameters to override

Returns:

A Config instance
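
For example, using the repo id and filename from the module-level examples (a local directory that contains the config file works the same way):

>>> from hezar.configs import ModelConfig
>>> config = ModelConfig.load("hezarai/bert-base-fa", filename="model_config.yaml")
>>> local_config = ModelConfig.load("saved/bert", filename="model_config.yaml")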

name: str = None
push_to_hub(repo_id: str, filename: str, subfolder: str | None = None, repo_type: str | None = 'model', skip_none_fields: bool | None = True, private: bool | None = False, commit_message: str | None = None)[source]

Push the config file to the Hub

Parameters:
  • repo_id (str) – Repo name or id on the Hub

  • filename (str) – Config file name

  • subfolder (str) – Subfolder to save the config under

  • repo_type (str) – Type of the repo, e.g., model, dataset, space

  • skip_none_fields (bool) – Whether to skip saving None values or not

  • private (bool) – Whether the repo should be private or not (ignored if the repo already exists)

  • commit_message (str) – Push commit message

save(save_dir: str | PathLike, filename: str, subfolder: str | None = None, skip_none_fields: bool | None = True)[source]

Save the config YAML file (e.g., model_config.yaml) to a local path

Parameters:
  • save_dir – Save directory path

  • filename – Config file name

  • subfolder – Subfolder to save the config file

  • skip_none_fields (bool) – Whether to skip saving None values or not

update(d: dict, **kwargs)[source]

Update the config with a given dictionary or keyword arguments. If a key does not exist among the attributes, a warning is printed but the value is set anyway.

Parameters:
  • d – A dictionary

  • **kwargs – Key/value pairs in the form of keyword arguments

Returns:

The config object itself (the update is applied in-place)
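
For example (the values here are arbitrary placeholders):

>>> from hezar.models import BertMaskFillingConfig
>>> bert_config = BertMaskFillingConfig(vocab_size=50000, hidden_size=768)
>>> bert_config = bert_config.update({"hidden_size": 1024}, vocab_size=60000)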

class hezar.configs.DatasetConfig(path: str | None = None, task: TaskType | List[TaskType] | None = None, max_size: int | float | None = None, hf_load_kwargs: dict | None = None)[source]

Bases: Config

Base dataclass for all dataset configs

Parameters:
  • path (str) – Path to the dataset, either on the Hub or local. The supported syntax is either <path> or <path>:<name>, where <name> is the name parameter passed to load_dataset()

  • task (str) – A supported task for the dataset

  • max_size (int | float) – Maximum number of data samples. Overwrites the main length of the dataset when calling len(dataset). If set to a float value between 0 and 1, it will be interpreted as a fraction, e.g., 0.3 means 30% of the whole length.

  • hf_load_kwargs (dict) – Keyword arguments to pass to the HF datasets.load_dataset()

config_type: str = 'dataset'
hf_load_kwargs: dict = None
max_size: int | float = None
name: str = None
path: str = None
task: TaskType | List[TaskType] = None
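
A minimal sketch; the dataset path below is a placeholder, and concrete dataset configs are normally subclasses of DatasetConfig:

>>> from hezar.configs import DatasetConfig
>>> dataset_config = DatasetConfig(
...     path="hezarai/some-dataset:fa",  # hypothetical "<path>:<name>" value
...     max_size=0.3,  # use 30% of the dataset
... )
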
class hezar.configs.EmbeddingConfig(bypass_version_check: bool = False)[source]

Bases: Config

Base dataclass for all embedding configs

bypass_version_check: bool = False
config_type: str = 'embedding'
name: str = None
class hezar.configs.MetricConfig(objective: Literal['maximize', 'minimize'] | None = None, output_keys: List | Tuple | None = None, n_decimals: int = 4)[source]

Bases: Config

Base dataclass for all metric configs

config_type: str = 'metric'
n_decimals: int = 4
name: str = None
objective: Literal['maximize', 'minimize'] = None
output_keys: List | Tuple = None
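
A minimal sketch; the output key below is a placeholder:

>>> from hezar.configs import MetricConfig
>>> metric_config = MetricConfig(objective="maximize", output_keys=("f1",), n_decimals=3)
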
class hezar.configs.ModelConfig[source]

Bases: Config

Base dataclass for all model configs

config_type: str = 'model'
name: str = None
class hezar.configs.PreprocessorConfig[source]

Bases: Config

Base dataclass for all preprocessor configs

config_type: str = 'preprocessor'
name: str = None
class hezar.configs.TrainerConfig(output_dir: str, task: str | TaskType, device: str = 'cuda', num_epochs: int | None = None, init_weights_from: str | None = None, resume_from_checkpoint: bool | str | PathLike | None = None, max_steps: int | None = None, num_dataloader_workers: int = 0, dataloader_shuffle: bool = True, seed: int = 42, optimizer: str | OptimizerType | None = None, learning_rate: float = 2e-05, weight_decay: float = 0.0, lr_scheduler: str | LRSchedulerType | None = None, lr_scheduler_kwargs: Dict[str, Any] | None = None, lr_scheduling_steps: int | None = None, batch_size: int | None = None, eval_batch_size: int | None = None, gradient_accumulation_steps: int = 1, distributed: bool = False, mixed_precision: PrecisionType | str | None = None, use_cpu: bool = False, do_evaluate: bool = True, evaluate_with_generate: bool = True, metrics: List[str | MetricConfig] | None = None, metric_for_best_model: str = 'loss', save_enabled: bool = True, save_freq: int = 'deprecated', save_steps: int | None = None, log_steps: int | None = None, checkpoints_dir: str = 'checkpoints', logs_dir: str = 'logs')[source]

Bases: Config

Base dataclass for all trainer configs

Parameters:
  • task (str, TaskType) – The training task. Must be a valid name from TaskType.

  • output_dir (str) – Path to the directory to save trainer properties.

  • device (str) – Hardware device, e.g., cuda:0, cpu, etc.

  • num_epochs (int) – Number of total epochs to train the model.

  • init_weights_from (str) – Path to a model on disk or the Hub to load the initial weights from. Note that this only loads the model weights and ignores other checkpoint-related states if the path is a checkpoint. To resume training from a checkpoint, use the resume_from_checkpoint parameter.

  • resume_from_checkpoint (bool, str, os.PathLike) – Resume training from a checkpoint. If set to True, the trainer will load the latest checkpoint, otherwise if a path to a checkpoint is given, it will load that checkpoint and all the other states corresponding to that checkpoint.

  • max_steps (int) – Maximum number of iterations to train. This helps to limit how many batches you want to train in total.

  • num_dataloader_workers (int) – Number of dataloader workers, defaults to 0.

  • dataloader_shuffle (bool) – Controls the dataloaders' shuffle argument.

  • seed (int) – Control determinism of the run by setting a seed value. Defaults to 42.

  • optimizer (OptimizerType) – Name of the optimizer; available values are listed in the OptimizerType enum.

  • learning_rate (float) – Initial learning rate for the optimizer.

  • weight_decay (float) – Optimizer weight decay value.

  • lr_scheduler (LRSchedulerType) – Optional learning rate scheduler; available values are listed in the LRSchedulerType enum.

  • lr_scheduler_kwargs (Dict[str, Any]) – LR scheduler constructor kwargs, depending on the scheduler type.

  • lr_scheduling_steps (int) – Number of steps to perform scheduler stepping. If left as None, will default to the steps in one full epoch.

  • batch_size (int) – Training batch size.

  • eval_batch_size (int) – Evaluation batch size, defaults to batch_size if None.

  • gradient_accumulation_steps (int) – Number of update steps to accumulate before performing a backward/update pass, defaults to 1.

  • distributed (bool) – Whether to use distributed training (via the accelerate package)

  • mixed_precision (PrecisionType | str) – Mixed precision type, e.g., fp16, bf16, etc. (disabled by default)

  • use_cpu (bool) – Whether to train using the CPU only even if CUDA is available.

  • do_evaluate (bool) – Whether to run evaluation when calling Trainer.train

  • evaluate_with_generate (bool) – Whether to use generate() in the evaluation step or not. (only applicable for generative models).

  • metrics (List[str | MetricConfig]) – A list of metrics. Valid values depend on the valid_metrics of the Trainer's specific MetricsHandler.

  • metric_for_best_model (str) – Reference metric key to watch for determining the best model. It is recommended to use a {train. | evaluation.} prefix (e.g., evaluation.f1, train.accuracy, etc.); if omitted, it defaults to evaluation.{metric_for_best_model}.

  • save_freq (int) (DEPRECATED) – Deprecated and renamed to save_steps.

  • save_enabled (bool) – Whether to save checkpoints at all. Setting this to False disables saving even between epochs.

  • save_steps (int) – Save the trainer outputs every save_steps steps. Leave as None to ignore saving in-between training steps. If set to a float value between 0 and 1, it will be interpreted as a fraction of the total steps.

  • log_steps (int) – Log training metrics every log_steps steps. If set to a float value between 0 and 1, it will be interpreted as a fraction of the total steps.

  • checkpoints_dir (str) – Path to the checkpoints’ folder. The actual files will be saved under {output_dir}/{checkpoints_dir}.

  • logs_dir (str) – Path to the logs’ folder. The actual log files will be saved under {output_dir}/{logs_dir}.
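
A minimal sketch of a trainer config; the output directory is a placeholder, and text_classification and f1 are assumed to be valid task and metric names:

>>> from hezar.configs import TrainerConfig
>>> train_config = TrainerConfig(
...     output_dir="bert-fa-sentiment",  # hypothetical output directory
...     task="text_classification",  # assumed to be a valid TaskType name
...     num_epochs=3,
...     batch_size=8,
...     learning_rate=2e-5,
...     metrics=["f1"],  # assumed to be a valid metric for the task
...     metric_for_best_model="evaluation.f1",
... )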

batch_size: int = None
checkpoints_dir: str = 'checkpoints'
config_type: str = 'trainer'
dataloader_shuffle: bool = True
device: str = 'cuda'
distributed: bool = False
do_evaluate: bool = True
eval_batch_size: int = None
evaluate_with_generate: bool = True
gradient_accumulation_steps: int = 1
init_weights_from: str = None
learning_rate: float = 2e-05
log_steps: int = None
logs_dir: str = 'logs'
lr_scheduler: str | LRSchedulerType = None
lr_scheduler_kwargs: Dict[str, Any] = None
lr_scheduling_steps: int = None
max_steps: int = None
metric_for_best_model: str = 'loss'
metrics: List[str | MetricConfig] = None
mixed_precision: PrecisionType | str | None = None
name: str = 'trainer'
num_dataloader_workers: int = 0
num_epochs: int = None
optimizer: str | OptimizerType = None
output_dir: str
resume_from_checkpoint: bool | str | PathLike = None
save_enabled: bool = True
save_freq: int = 'deprecated'
save_steps: int = None
seed: int = 42
task: str | TaskType
use_cpu: bool = False
weight_decay: float = 0.0