hezar.embeddings.embedding module¶

class hezar.embeddings.embedding.Embedding(config: EmbeddingConfig, embedding_file: str | None = None, vectors_file: str | None = None, **kwargs)[source]¶

Bases: object

Base class for all embeddings.

Parameters:

config (EmbeddingConfig) – An EmbeddingConfig object used to construct the embedding.
embedding_file (str) – Path to the embedding file.
vectors_file (str) – Path to the vectors file.
**kwargs – Extra embedding config parameters passed as keyword arguments.

build()[source]¶: Build the embedding model.

config_filename = 'embedding_config.yaml'¶

doesnt_match(words: List[str])[source]¶

Get the word that doesn’t match the others in a list.

Parameters:: words (List[str]) – List of words.
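
The source does not say how the odd word is chosen. As a minimal sketch of one common heuristic (an assumption, not necessarily what this class does internally), the snippet below picks the word whose unit vector is least aligned with the mean of all the unit vectors. The names `unit`, `odd_one_out`, and `toy_vectors` are hypothetical and exist only for this illustration.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(words, vectors):
    """Return the word whose unit vector has the lowest dot product with the mean."""
    units = [unit(vectors[word]) for word in words]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


# Toy 2-d vectors (made up for the example, not real embeddings).
toy_vectors = {
    "apple": [1.0, 0.1],
    "pear": [0.9, 0.2],
    "plum": [0.95, 0.15],
    "car": [0.1, 1.0],
}
result = odd_one_out(["apple", "pear", "plum", "car"], toy_vectors)
```

With these toy vectors the three fruit words point in nearly the same direction, so `car` ends up with the lowest score.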

filename = 'embedding.bin'¶

from_file(embedding_path, vectors_path)[source]¶

Load the embedding model from file.

Parameters:

embedding_path (str) – Path to the embedding file.
vectors_path (str) – Path to the vectors file.

get_normed_vectors()[source]¶: Get normalized word vectors.

index2word(index)[source]¶

Get the word corresponding to a given index.

Parameters:: index (int) – Input index.
Returns:: Word corresponding to the index.
Return type:: str

classmethod load(hub_or_local_path, config_filename=None, embedding_file=None, vectors_file=None, subfolder=None, cache_dir=None, **kwargs) → Embedding[source]¶

Load an embedding model from a local or Hugging Face Hub path.

Parameters:

hub_or_local_path – Path to the local directory or the Hugging Face Hub repository.
config_filename (str) – Configuration file name.
embedding_file (str) – Embedding file name.
vectors_file (str) – Vectors file name.
subfolder (str) – Subfolder within the repository.
cache_dir (str) – Path to the cache directory.
**kwargs – Additional keyword arguments.

Returns:

Loaded Embedding object.

Return type:

Embedding
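
A hedged usage sketch based only on the signatures documented on this page: it loads an embedding and queries it. The helper name `load_and_query` is hypothetical, and the path you pass in must point to a real local directory or Hub repository. The import sits inside the function so the helper can be defined even where hezar is not installed.

```python
def load_and_query(hub_or_local_path, word, top_n=5):
    """Load an embedding and return the words most similar to `word`."""
    # Lazy import: hezar is only required when the function is called.
    from hezar.embeddings.embedding import Embedding

    # `load` is a classmethod, so it is called on the class itself.
    embedding = Embedding.load(hub_or_local_path)
    return embedding.most_similar(word, top_n=top_n)
```

The return value is whatever `most_similar` returns; its exact structure is not specified on this page.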

most_similar(word: str, top_n: int = 5)[source]¶

Get the most similar words to a given word.

Parameters:

word (str) – Input word.
top_n (int) – Number of similar words to retrieve.
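
To illustrate what a top-n similarity query involves, here is a minimal pure-Python sketch. It assumes cosine similarity as the ranking measure, which is typical for word vectors but is not confirmed by this page. `normalize`, `top_n_similar`, and `toy_vectors` are hypothetical names, and the `(word, score)` tuple format is a choice made for this example only.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_n_similar(word, vectors, top_n=5):
    """Rank every other word by its dot product with the normalized query vector."""
    normed = {w: normalize(v) for w, v in vectors.items()}
    query = normed[word]
    scored = (
        (sum(q * x for q, x in zip(query, vec)), other)
        for other, vec in normed.items()
        if other != word  # the query word itself is excluded
    )
    return [(other, score) for score, other in heapq.nlargest(top_n, scored)]


# Toy 2-d vectors (made up for the example, not real embeddings).
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.8, 0.9],
    "prince": [0.5, 0.1],
    "apple": [0.1, -0.7],
}
neighbors = top_n_similar("king", toy_vectors, top_n=2)
```

Normalizing once and then using plain dot products avoids recomputing vector norms for every comparison.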

push_to_hub(repo_id, commit_message=None, subfolder=None, filename=None, vectors_filename=None, config_filename=None, private=False)[source]¶

Push the embedding model to the Hugging Face Hub.

Parameters:

repo_id – ID of the Hugging Face Hub repository.
commit_message (str) – Commit message.
subfolder (str) – Subfolder within the repository.
filename (str) – Name of the embedding file.
vectors_filename (str) – Name of the vectors file.
config_filename (str) – Configuration file name.
private (bool) – Whether the repository is private.

required_backends: List[str | Backends] = []¶

save(path: str | PathLike, filename: str | None = None, subfolder: str | None = None, save_config: bool = True, config_filename: str | None = None)[source]¶

Save the embedding model to a specified path.

Parameters:

path (str | os.PathLike) – Path to save the embedding model.
filename (str) – Name of the embedding file.
subfolder (str) – Subfolder within the path.
save_config (bool) – Whether to save the configuration.
config_filename (str) – Configuration file name.

similarity(word1: str, word2: str)[source]¶

Get the similarity between two words.

Parameters:

word1 (str) – First word.
word2 (str) – Second word.
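
The page does not name the similarity measure. Assuming it is cosine similarity, the usual choice for word vectors, the computation on two raw vectors can be sketched as follows; `cosine_similarity` is a hypothetical helper, not part of the documented API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Parallel vectors score close to 1; perpendicular vectors score 0.
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```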

subfolder = 'embedding'¶

torch_embedding()[source]¶

Convert the embedding model to a PyTorch Embedding layer.

Returns:: PyTorch Embedding layer.
Return type:: torch.nn.Embedding
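
For orientation, this is one way a 2-d array of vectors can be turned into a `torch.nn.Embedding` using PyTorch's own `from_pretrained` constructor. It is a sketch of the general technique, not a description of how `torch_embedding()` is implemented, and `to_torch_embedding` is a hypothetical helper name.

```python
def to_torch_embedding(vectors, freeze=True):
    """Build a torch.nn.Embedding whose weights are the given 2-d vectors."""
    # Lazy import: torch is only required when the function is called.
    import torch

    weights = torch.as_tensor(vectors, dtype=torch.float32)
    # freeze=True keeps the pretrained weights fixed during training.
    return torch.nn.Embedding.from_pretrained(weights, freeze=freeze)
```

Row `i` of the resulting layer corresponds to row `i` of the input, so the indices fed to the layer must follow the same ordering as the vocabulary.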

train(dataset, epochs)[source]¶

Train the embedding model on a dataset.

Parameters:

dataset – The training dataset.
epochs – Number of training epochs.

property vectors¶: Get the array/tensor of all vectors.

vectors_filename = 'embedding.bin.wv.vectors.npy'¶

property vocab: Dict[str, int]¶: Get the vocabulary.

word2index(word)[source]¶

Get the index of a word in the vocabulary.

Parameters:: word (str) – Input word.
Returns:: Index of the word.
Return type:: int
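
`word2index` and `index2word` are inverse lookups over the vocabulary. The toy sketch below mirrors that relationship with plain dictionaries; the word list and the module-level functions are hypothetical stand-ins, not the class's actual storage.

```python
# Toy vocabulary shaped like the `vocab` property (Dict[str, int]).
words = ["hello", "world", "embedding"]
vocab = {word: index for index, word in enumerate(words)}
index_to_word = {index: word for word, index in vocab.items()}


def word2index(word):
    """Look up the integer index of a word."""
    return vocab[word]


def index2word(index):
    """Look up the word stored at an integer index."""
    return index_to_word[index]


# Converting a word to its index and back yields the original word.
round_trip = index2word(word2index("world"))
```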

property word_vectors¶: Get key:value pairs of word:vector.