hezar.configs module¶
Configs are at the core of Hezar. All core modules like Model, Preprocessor, Trainer, etc. take their parameters as a config container, which is an instance of Config or one of its derivatives. A Config is a Python dataclass with auxiliary methods for loading, saving, uploading to the Hub, etc.
Examples
>>> from hezar.configs import ModelConfig
>>> config = ModelConfig.load("hezarai/bert-base-fa")
>>> from hezar.models import BertMaskFillingConfig
>>> bert_config = BertMaskFillingConfig(vocab_size=50000, hidden_size=768)
>>> bert_config.save("saved/bert", filename="model_config.yaml")
>>> bert_config.push_to_hub("hezarai/bert-custom", filename="model_config.yaml")
- class hezar.configs.Config[source]¶
Bases:
object
Base class for all configs in Hezar.
All configs are simple dataclasses with some customized functionality to manage their attributes, plus a few Hezar-specific methods: load, save, and push_to_hub.
- config_type: str = 'base'¶
- dict()[source]¶
Returns the config object as a dictionary (works on nested dataclasses too)
- Returns:
The config object as a dictionary
- classmethod from_dict(dict_config: Dict | DictConfig, **kwargs)[source]¶
Load config from a dict-like object. Nested configs are also recursively converted to their classes if possible.
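As a quick illustration, a config can be round-tripped through a plain dictionary using dict() and from_dict() (a minimal sketch reusing the example config class from above; the field values are illustrative):

>>> from hezar.models import BertMaskFillingConfig
>>> bert_config = BertMaskFillingConfig(vocab_size=50000, hidden_size=768)
>>> config_dict = bert_config.dict()  # nested configs are converted to dicts recursively
>>> restored = BertMaskFillingConfig.from_dict(config_dict)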
- classmethod load(hub_or_local_path: str | PathLike, filename: str | None = None, subfolder: str | None = None, repo_type: str | None = None, cache_dir: str | None = None, **kwargs) Config [source]¶
Load config from Hub or locally if it already exists on disk (handled by HfApi)
- Parameters:
hub_or_local_path – Local or Hub path for the config
filename – Configuration filename
subfolder – Optional subfolder path where the config is in
repo_type – Repo type, e.g., model, dataset, etc.
cache_dir – Path to cache directory
**kwargs – Manual config parameters to override
- Returns:
A Config instance
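For example, loading a config from the Hub or from a previously saved local folder (a minimal sketch; the local path and filename follow the save example at the top of this page):

>>> from hezar.configs import ModelConfig
>>> config = ModelConfig.load("hezarai/bert-base-fa")
>>> local_config = ModelConfig.load("saved/bert", filename="model_config.yaml")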
- name: str = None¶
- push_to_hub(repo_id: str, filename: str, subfolder: str | None = None, repo_type: str | None = 'model', skip_none_fields: bool | None = True, private: bool | None = False, commit_message: str | None = None)[source]¶
Push the config file to the hub
- Parameters:
repo_id (str) – Repo name or id on the Hub
filename (str) – config file name
subfolder (str) – subfolder to save the config
repo_type (str) – Type of the repo, e.g., model, dataset, space
skip_none_fields (bool) – Whether to skip saving None values or not
private (bool) – Whether the repo should be private or not (ignored if the repo already exists)
commit_message (str) – Push commit message
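A minimal usage sketch, reusing bert_config from the examples above (the repo id, subfolder, and commit message are illustrative and assume you have write access to that repo):

>>> bert_config.push_to_hub(
...     "hezarai/bert-custom",
...     filename="model_config.yaml",
...     subfolder="configs",
...     commit_message="Upload custom BERT config",
... )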
- save(save_dir: str | PathLike, filename: str, subfolder: str | None = None, skip_none_fields: bool | None = True)[source]¶
Save the *config.yaml file to a local path
- Parameters:
save_dir – Save directory path
filename – Config file name
subfolder – Subfolder to save the config file
skip_none_fields (bool) – Whether to skip saving None values or not
- update(d: dict, **kwargs)[source]¶
Update config with a given dictionary or keyword arguments. If a key does not exist in the attributes, prints a warning but sets it anyway.
- Parameters:
d – A dictionary
**kwargs – Key/value pairs in the form of keyword arguments
- Returns:
The config object itself (the update is also applied in-place)
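A minimal usage sketch (field names follow the example config above; an unknown key would only trigger a warning before being set):

>>> from hezar.models import BertMaskFillingConfig
>>> config = BertMaskFillingConfig(vocab_size=50000, hidden_size=768)
>>> config.update({"hidden_size": 1024}, vocab_size=42000)  # applied in-place, also returns the config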
- class hezar.configs.DatasetConfig(path: str | None = None, task: TaskType | List[TaskType] | None = None, max_size: int | float | None = None, hf_load_kwargs: dict | None = None)[source]¶
Bases:
Config
Base dataclass for all dataset configs
- Parameters:
path (str) – Path to the dataset, either on the Hub or local. Supported syntax is either <path> or <path>:<name>, where <name> is passed as the name argument to load_dataset()
task (str) – A supported task for the dataset
max_size (int | float) – Maximum number of data samples. Overrides the dataset's full length when calling len(dataset). If set to a float value between 0 and 1, it is interpreted as a fraction, e.g., 0.3 means 30% of the whole length.
hf_load_kwargs (dict) – Keyword arguments to pass to the HF datasets.load_dataset()
- config_type: str = 'dataset'¶
- hf_load_kwargs: dict = None¶
- max_size: int | float = None¶
- name: str = None¶
- path: str = None¶
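Concrete dataset configs subclass this base; as a minimal sketch, the shared fields can be set like below (the dataset path and values are illustrative, and TaskType is assumed to be importable from hezar.constants):

>>> from hezar.configs import DatasetConfig
>>> from hezar.constants import TaskType
>>> dataset_config = DatasetConfig(
...     path="hezarai/sentiment-dksf",
...     task=TaskType.TEXT_CLASSIFICATION,
...     max_size=0.3,  # use 30% of the samples
... )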
- class hezar.configs.EmbeddingConfig(bypass_version_check: bool = False)[source]¶
Bases:
Config
Base dataclass for all embedding configs
- bypass_version_check: bool = False¶
- config_type: str = 'embedding'¶
- name: str = None¶
- class hezar.configs.MetricConfig(objective: Literal['maximize', 'minimize'] | None = None, output_keys: List | Tuple | None = None, n_decimals: int = 4)[source]¶
Bases:
Config
Base dataclass config for all metric configs
- config_type: str = 'metric'¶
- n_decimals: int = 4¶
- name: str = None¶
- objective: Literal['maximize', 'minimize'] = None¶
- output_keys: List | Tuple = None¶
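A minimal sketch with illustrative values (concrete metric configs subclass this base and usually fill these fields themselves):

>>> from hezar.configs import MetricConfig
>>> metric_config = MetricConfig(objective="maximize", output_keys=("f1",), n_decimals=3)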
- class hezar.configs.ModelConfig[source]¶
Bases:
Config
Base dataclass for all model configs
- config_type: str = 'model'¶
- name: str = None¶
- class hezar.configs.PreprocessorConfig[source]¶
Bases:
Config
Base dataclass for all preprocessor configs
- config_type: str = 'preprocessor'¶
- name: str = None¶
- class hezar.configs.TrainerConfig(output_dir: str, task: str | TaskType, device: str = 'cuda', num_epochs: int | None = None, init_weights_from: str | None = None, resume_from_checkpoint: bool | str | PathLike | None = None, max_steps: int | None = None, num_dataloader_workers: int = 0, dataloader_shuffle: bool = True, seed: int = 42, optimizer: str | OptimizerType | None = None, learning_rate: float = 2e-05, weight_decay: float = 0.0, lr_scheduler: str | LRSchedulerType | None = None, lr_scheduler_kwargs: Dict[str, Any] | None = None, lr_scheduling_steps: int | None = None, batch_size: int | None = None, eval_batch_size: int | None = None, gradient_accumulation_steps: int = 1, distributed: bool = False, mixed_precision: PrecisionType | str | None = None, use_cpu: bool = False, do_evaluate: bool = True, evaluate_with_generate: bool = True, metrics: List[str | MetricConfig] | None = None, metric_for_best_model: str = 'loss', save_enabled: bool = True, save_freq: int = 'deprecated', save_steps: int | None = None, log_steps: int | None = None, checkpoints_dir: str = 'checkpoints', logs_dir: str = 'logs')[source]¶
Bases:
Config
Base dataclass for all trainer configs
- Parameters:
task (str, TaskType) – The training task. Must be a valid name from TaskType.
output_dir (str) – Path to the directory to save trainer properties.
device (str) – Hardware device, e.g., cuda:0, cpu, etc.
num_epochs (int) – Number of total epochs to train the model.
init_weights_from (str) – Path to a model on disk or on the Hub to load the initial weights from. Note that this only loads the model weights and ignores other checkpoint-related states if the path is a checkpoint. To resume training from a checkpoint, use the resume_from_checkpoint parameter.
resume_from_checkpoint (bool, str, os.PathLike) – Resume training from a checkpoint. If set to True, the trainer will load the latest checkpoint, otherwise if a path to a checkpoint is given, it will load that checkpoint and all the other states corresponding to that checkpoint.
max_steps (int) – Maximum number of iterations to train. This helps to limit how many batches you want to train in total.
num_dataloader_workers (int) – Number of dataloader workers, defaults to 0.
dataloader_shuffle (bool) – Controls the dataloaders' shuffle argument.
seed (int) – Controls determinism of the run by setting a seed value. Defaults to 42.
optimizer (OptimizerType) – Name of the optimizer; available values are the members of the OptimizerType enum.
learning_rate (float) – Initial learning rate for the optimizer.
weight_decay (float) – Optimizer weight decay value.
lr_scheduler (LRSchedulerType) – Optional learning rate scheduler; available values are the members of the LRSchedulerType enum.
lr_scheduler_kwargs (Dict[str, Any]) – LR scheduler constructor kwargs, depending on the scheduler type
lr_scheduling_steps (int) – Number of steps to perform scheduler stepping. If left as None, will default to the steps in one full epoch.
batch_size (int) – Training batch size.
eval_batch_size (int) – Evaluation batch size, defaults to batch_size if None.
gradient_accumulation_steps (int) – Number of update steps to accumulate before performing a backward/update pass, defaults to 1.
distributed (bool) – Whether to use distributed training (via the accelerate package)
mixed_precision (PrecisionType | str) – Mixed precision type, e.g., fp16, bf16, etc. (disabled by default)
use_cpu (bool) – Whether to train using the CPU only even if CUDA is available.
do_evaluate (bool) – Whether to run evaluation when calling Trainer.train
evaluate_with_generate (bool) – Whether to use generate() in the evaluation step (only applicable to generative models).
metrics (List[str | MetricConfig]) – A list of metrics. Valid values depend on the valid_metrics of the Trainer's specific MetricsHandler.
metric_for_best_model (str) – Reference metric key to watch for determining the best model. It is recommended to use a train. or evaluation. prefix (e.g., evaluation.f1, train.accuracy, etc.); if no prefix is given, it defaults to evaluation.{metric_for_best_model}.
save_freq (int) (DEPRECATED) – Deprecated and renamed to save_steps.
save_enabled (bool) – Whether to save checkpoints at all. Setting this to False also disables the in-between-epoch saves.
save_steps (int) – Save the trainer outputs every save_steps steps. Leave as None to ignore saving in-between training steps. If set to a float value between 0 and 1, it will be interpreted as a fraction of the total steps.
log_steps (int) – Save training metrics every log_steps steps. If set to a float value between 0 and 1, it will be interpreted as a fraction of the total steps.
checkpoints_dir (str) – Path to the checkpoints’ folder. The actual files will be saved under {output_dir}/{checkpoints_dir}.
logs_dir (str) – Path to the logs’ folder. The actual log files will be saved under {output_dir}/{logs_dir}.
- batch_size: int = None¶
- checkpoints_dir: str = 'checkpoints'¶
- config_type: str = 'trainer'¶
- dataloader_shuffle: bool = True¶
- device: str = 'cuda'¶
- distributed: bool = False¶
- do_evaluate: bool = True¶
- eval_batch_size: int = None¶
- evaluate_with_generate: bool = True¶
- gradient_accumulation_steps: int = 1¶
- init_weights_from: str = None¶
- learning_rate: float = 2e-05¶
- log_steps: int = None¶
- logs_dir: str = 'logs'¶
- lr_scheduler: str | LRSchedulerType = None¶
- lr_scheduler_kwargs: Dict[str, Any] = None¶
- lr_scheduling_steps: int = None¶
- max_steps: int = None¶
- metric_for_best_model: str = 'loss'¶
- metrics: List[str | MetricConfig] = None¶
- mixed_precision: PrecisionType | str | None = None¶
- name: str = 'trainer'¶
- num_dataloader_workers: int = 0¶
- num_epochs: int = None¶
- optimizer: str | OptimizerType = None¶
- output_dir: str¶
- resume_from_checkpoint: bool | str | PathLike = None¶
- save_enabled: bool = True¶
- save_freq: int = 'deprecated'¶
- save_steps: int = None¶
- seed: int = 42¶
- use_cpu: bool = False¶
- weight_decay: float = 0.0¶
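Putting it together, a typical trainer config might look like the following (a minimal sketch; the output directory, task string, metric names, and hyperparameter values are illustrative and must match what your dataset, model, and metrics handler support):

>>> from hezar.configs import TrainerConfig
>>> trainer_config = TrainerConfig(
...     output_dir="bert-fa-sentiment",
...     task="text_classification",
...     device="cuda",
...     num_epochs=3,
...     batch_size=8,
...     learning_rate=2e-5,
...     metrics=["f1"],
...     metric_for_best_model="evaluation.f1",
... )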