hezar.embeddings.embedding module

class hezar.embeddings.embedding.Embedding(config: EmbeddingConfig, embedding_file: str | None = None, vectors_file: str | None = None, **kwargs)[source]

Bases: object

Base class for all embeddings.

Parameters:
  • config – An EmbeddingConfig object to construct the embedding.

  • embedding_file (str) – Path to the embedding file.

  • vectors_file (str) – Path to the vectors file.

  • **kwargs – Extra embedding config parameters passed as keyword arguments.

build()[source]

Build the embedding model.

config_filename = 'embedding_config.yaml'

doesnt_match(words: List[str])[source]

Get the word that doesn’t match the others in a list.

Parameters:

words (List[str]) – List of words.
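As a rough illustration of what an odd-one-out query computes, the sketch below follows the approach commonly used for word vectors: normalize each vector, average them, and return the word least similar to that mean. Whether Hezar's implementation behaves exactly this way is an assumption; the toy vectors and the `odd_one_out` helper are hypothetical and use only the standard library.

```python
import math


def unit(vector):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean vector."""
    normed = [unit(vectors[word]) for word in words]
    dim = len(normed[0])
    mean = [sum(vec[i] for vec in normed) / len(normed) for i in range(dim)]
    # Dot product of each unit vector with the mean of all unit vectors.
    scores = [sum(a * b for a, b in zip(vec, mean)) for vec in normed]
    return words[scores.index(min(scores))]


# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "lion": [0.85, 0.15],
    "car": [0.1, 0.9],
}
result = odd_one_out(["cat", "dog", "lion", "car"], toy_vectors)
```

Here `result` is `"car"`, the only vector pointing away from the other three.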

filename = 'embedding.bin'

from_file(embedding_path, vectors_path)[source]

Load the embedding model from file.

Parameters:
  • embedding_path (str) – Path to the embedding file.

  • vectors_path (str) – Path to the vectors file.

get_normed_vectors()[source]

Get normalized word vectors.
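The documentation does not say which norm is used. Assuming the usual convention of scaling each row to unit L2 length, the operation can be sketched as follows; `l2_normalize` is a hypothetical helper, not part of the library.

```python
import math


def l2_normalize(rows):
    """Scale each row to unit L2 length; all-zero rows are left unchanged."""
    normed = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normed.append([x / norm for x in row] if norm else list(row))
    return normed


normed = l2_normalize([[3.0, 4.0], [0.0, 2.0], [0.0, 0.0]])
```

With unit-length rows, a plain dot product between two rows equals their cosine similarity, which is why normalized vectors are convenient for similarity queries.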

index2word(index)[source]

Get the word corresponding to a given index.

Parameters:

index (int) – Input index.

Returns:

Word corresponding to the index.

Return type:

str

classmethod load(hub_or_local_path, config_filename=None, embedding_file=None, vectors_file=None, subfolder=None, cache_dir=None, **kwargs) → Embedding[source]

Load an embedding model from a local or Hugging Face Hub path.

Parameters:
  • hub_or_local_path – Path to the local directory or the Hugging Face Hub repository.

  • config_filename (str) – Configuration file name.

  • embedding_file (str) – Embedding file name.

  • vectors_file (str) – Vectors file name.

  • subfolder (str) – Subfolder within the repository.

  • cache_dir (str) – Path to the cache directory.

  • **kwargs – Additional keyword arguments.

Returns:

Loaded Embedding object.

Return type:

Embedding

most_similar(word: str, top_n: int = 5)[source]

Get the most similar words to a given word.

Parameters:
  • word (str) – Input word.

  • top_n (int) – Number of similar words to retrieve.
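A nearest-neighbour query of this kind can be sketched as a ranking by cosine similarity. The code below is a minimal illustration under that assumption; the `top_similar` helper, its return shape (a list of `(word, score)` pairs), and the toy vectors are hypothetical and may differ from what the library returns.

```python
import heapq
import math


def top_similar(word, vectors, top_n=5):
    """Return the top_n (word, score) pairs closest to `word`, excluding itself."""

    def unit(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    query = unit(vectors[word])
    # Dot product of unit vectors is the cosine similarity.
    scores = {
        other: sum(q * x for q, x in zip(query, unit(vec)))
        for other, vec in vectors.items()
        if other != word
    }
    return heapq.nlargest(top_n, scores.items(), key=lambda item: item[1])


# Hypothetical 3-dimensional vectors, for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.9],
    "banana": [0.05, 0.1, 0.95],
}
neighbours = top_similar("king", toy_vectors, top_n=2)
```

The query word itself is excluded from the candidates, so `top_n` counts only other vocabulary entries.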

push_to_hub(repo_id, commit_message=None, subfolder=None, filename=None, vectors_filename=None, config_filename=None, private=False)[source]

Push the embedding model to the Hugging Face Hub.

Parameters:
  • repo_id – ID of the Hugging Face Hub repository.

  • commit_message (str) – Commit message.

  • subfolder (str) – Subfolder within the repository.

  • filename (str) – Name of the embedding file.

  • vectors_filename (str) – Name of the vectors file.

  • config_filename (str) – Configuration file name.

  • private (bool) – Whether the repository is private.

required_backends: List[str | Backends] = []

save(path: str | PathLike, filename: str | None = None, subfolder: str | None = None, save_config: bool = True, config_filename: str | None = None)[source]

Save the embedding model to a specified path.

Parameters:
  • path (str | os.PathLike) – Path to save the embedding model.

  • filename (str) – Name of the embedding file.

  • subfolder (str) – Subfolder within the path.

  • save_config (bool) – Whether to save the configuration.

  • config_filename (str) – Configuration file name.
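To make the interaction between these arguments concrete, the sketch below computes where files might end up. It assumes that arguments left as None fall back to the class attributes documented on this page (`subfolder`, `filename`, `config_filename`) and that files are written under `path/subfolder`; neither assumption is confirmed by this page, and `expected_paths` is a hypothetical helper that does not touch the file system.

```python
import os

# Defaults copied from the class attributes listed on this page.
SUBFOLDER = "embedding"
FILENAME = "embedding.bin"
CONFIG_FILENAME = "embedding_config.yaml"


def expected_paths(path, filename=None, subfolder=None,
                   save_config=True, config_filename=None):
    """Return the file paths a save call would plausibly produce (assumed layout)."""
    folder = os.path.join(path, subfolder or SUBFOLDER)
    paths = [os.path.join(folder, filename or FILENAME)]
    if save_config:
        paths.append(os.path.join(folder, config_filename or CONFIG_FILENAME))
    return paths


default_paths = expected_paths("my_model")
custom_paths = expected_paths("my_model", filename="w2v.bin", save_config=False)
```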

similarity(word1: str, word2: str)[source]

Get the similarity between two words.

Parameters:
  • word1 (str) – First word.

  • word2 (str) – Second word.
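The metric is not named here. For word embeddings it is conventionally cosine similarity, and under that assumption the score for two vectors can be sketched as below; `cosine_similarity` is a hypothetical helper operating on raw vectors rather than words.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])       # opposite directions
```

Cosine similarity ranges from -1 to 1 and depends only on direction, so scaling a vector does not change the score.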

subfolder = 'embedding'

torch_embedding()[source]

Convert the embedding model to a PyTorch Embedding layer.

Returns:

PyTorch Embedding layer.

Return type:

torch.nn.Embedding

train(dataset, epochs)[source]

Train the embedding model on a dataset.

Parameters:
  • dataset – The training dataset.

  • epochs – Number of training epochs.

property vectors

Get the array/tensor of all vectors.

vectors_filename = 'embedding.bin.wv.vectors.npy'

property vocab: Dict[str, int]

Get the vocabulary.

word2index(word)[source]

Get the index of a word in the vocabulary.

Parameters:

word (str) – Input word.

Returns:

Index of the word.

Return type:

int
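Since `vocab` is documented as a `Dict[str, int]`, the two lookups `word2index` and `index2word` can be pictured as inverse mappings over that dictionary. The sketch below uses a hypothetical three-word vocabulary and standalone functions; it illustrates the relationship only and is not the library's code.

```python
# Hypothetical vocabulary in the documented Dict[str, int] shape.
vocab = {"hello": 0, "world": 1, "embedding": 2}

# Reverse mapping built once so index lookups are O(1).
index_to_word = {index: word for word, index in vocab.items()}


def word2index(word):
    """Look up the index of a word in the vocabulary."""
    return vocab[word]


def index2word(index):
    """Look up the word stored at a given index."""
    return index_to_word[index]


round_trip = index2word(word2index("world"))
```

In this sketch an out-of-vocabulary word raises `KeyError`; how the library handles unknown words is not stated on this page.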

property word_vectors

Get key:value pairs of word:vector.