hezar.embeddings.embedding module
- class hezar.embeddings.embedding.Embedding(config: EmbeddingConfig, embedding_file: str | None = None, vectors_file: str | None = None, **kwargs)
- Bases: object - Base class for all embeddings. - Parameters:
- config – An EmbeddingConfig object to construct the embedding. 
- embedding_file (str) – Path to the embedding file. 
- vectors_file (str) – Path to the vectors file. 
- **kwargs – Extra embedding config parameters passed as keyword arguments. 
 
 - config_filename = 'embedding_config.yaml'
 - doesnt_match(words: List[str])
- Get the word that doesn’t match the others in a list. - Parameters:
- words (List[str]) – List of words. 
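
To illustrate what "the word that doesn't match" can mean, here is a minimal pure-Python sketch of one common heuristic: average the vectors of all the words and pick the word whose vector is least similar (by cosine) to that average. The helper names and the toy two-dimensional vectors are hypothetical, and this is not necessarily how `doesnt_match` is implemented in this class.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def doesnt_match(words, word_vectors):
    """Return the word whose vector is least similar to the group mean.

    `word_vectors` is a hypothetical word -> vector mapping used only
    for this illustration.
    """
    vectors = [word_vectors[w] for w in words]
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return min(words, key=lambda w: cosine(word_vectors[w], mean))


# Toy vectors (made up for the example): three fruits and one vehicle.
toy_vectors = {
    "apple": [0.9, 0.1],
    "pear": [0.8, 0.2],
    "plum": [0.85, 0.15],
    "car": [0.1, 0.9],
}
odd_one = doesnt_match(["apple", "pear", "plum", "car"], toy_vectors)
print(odd_one)
```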
 
 - filename = 'embedding.bin'
 - from_file(embedding_path, vectors_path)
- Load the embedding model from file. - Parameters:
- embedding_path (str) – Path to the embedding file. 
- vectors_path (str) – Path to the vectors file. 
 
 
 - index2word(index)
- Get the word corresponding to a given index. - Parameters:
- index (int) – Input index. 
- Returns:
- Word corresponding to the index. 
- Return type:
- str 
 
 - classmethod load(hub_or_local_path, config_filename=None, embedding_file=None, vectors_file=None, subfolder=None, cache_dir=None, **kwargs) → Embedding
- Load an embedding model from a local or Hugging Face Hub path. - Parameters:
- hub_or_local_path – Path to the local directory or the Hugging Face Hub repository. 
- config_filename (str) – Configuration file name. 
- embedding_file (str) – Embedding file name. 
- vectors_file (str) – Vectors file name. 
- subfolder (str) – Subfolder within the repository. 
- cache_dir (str) – Path to the cache directory. 
- **kwargs – Additional keyword arguments. 
 
- Returns:
- Loaded Embedding object. 
- Return type:
- Embedding 
 
 - most_similar(word: str, top_n: int = 5)
- Get the most similar words to a given word. - Parameters:
- word (str) – Input word. 
- top_n (int) – Number of similar words to retrieve. 
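
The role of `top_n` can be sketched without the library: score every other word in the vocabulary against the query word, sort by score, and keep the first `top_n` entries. The code below is a minimal illustration that assumes cosine similarity as the score and returns `(word, score)` tuples; both the scoring function and the return shape are assumptions, not the documented behaviour of `most_similar`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(word, word_vectors, top_n=5):
    """Rank all other words by similarity to `word` and keep the best `top_n`.

    `word_vectors` is a hypothetical word -> vector mapping.
    """
    query = word_vectors[word]
    scored = [
        (other, cosine(query, vector))
        for other, vector in word_vectors.items()
        if other != word  # the query word itself is excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors invented for the example.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [0.9, 1.0],
    "prince": [1.0, 0.5],
    "banana": [-1.0, 0.2],
}
neighbours = most_similar("king", toy_vectors, top_n=2)
for neighbour, score in neighbours:
    print(neighbour, round(score, 3))
```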
 
 
 - push_to_hub(repo_id, commit_message=None, subfolder=None, filename=None, vectors_filename=None, config_filename=None, private=False)
- Push the embedding model to the Hugging Face Hub. - Parameters:
- repo_id – ID of the Hugging Face Hub repository. 
- commit_message (str) – Commit message. 
- subfolder (str) – Subfolder within the repository. 
- filename (str) – Name of the embedding file. 
- vectors_filename (str) – Name of the vectors file. 
- config_filename (str) – Configuration file name. 
- private (bool) – Whether the repository is private. 
 
 
 - save(path: str | PathLike, filename: str | None = None, subfolder: str | None = None, save_config: bool = True, config_filename: str | None = None)
- Save the embedding model to a specified path. - Parameters:
- path (str | os.PathLike) – Path to save the embedding model. 
- filename (str) – Name of the embedding file. 
- subfolder (str) – Subfolder within the path. 
- save_config (bool) – Whether to save the configuration. 
- config_filename (str) – Configuration file name. 
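
How the optional arguments of `save` might combine with the class attributes `subfolder`, `filename`, and `config_filename` can be sketched as plain path arithmetic. The layout below (files placed under `path/subfolder/`, with class-level defaults used when an argument is `None`) is an assumption made for illustration; the function `resolve_save_paths` is hypothetical and nothing is written to disk.

```python
from pathlib import Path

# Defaults copied from the class attributes documented on this page.
DEFAULT_SUBFOLDER = "embedding"
DEFAULT_FILENAME = "embedding.bin"
DEFAULT_CONFIG_FILENAME = "embedding_config.yaml"


def resolve_save_paths(path, filename=None, subfolder=None, config_filename=None):
    """Compute where the files would land under the ASSUMED layout.

    Assumption: everything is stored in `path/subfolder/`, and a `None`
    argument falls back to the corresponding class-level default.
    """
    target_dir = Path(path) / (subfolder or DEFAULT_SUBFOLDER)
    return {
        "embedding": target_dir / (filename or DEFAULT_FILENAME),
        "config": target_dir / (config_filename or DEFAULT_CONFIG_FILENAME),
    }


paths = resolve_save_paths("my_model")
print(paths["embedding"].as_posix())
print(paths["config"].as_posix())
```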
 
 
 - similarity(word1: str, word2: str)
- Get the similarity between two words. - Parameters:
- word1 (str) – First word. 
- word2 (str) – Second word. 
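
Word similarity in embedding models is usually measured as the cosine of the angle between the two word vectors, which ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). The sketch below assumes that metric; the page does not state which measure `similarity` uses, and the toy vectors and the extra `word_vectors` argument are hypothetical.

```python
import math


def similarity(word1, word2, word_vectors):
    """Cosine similarity between the vectors of two words.

    `word_vectors` is a hypothetical word -> vector mapping.
    """
    a, b = word_vectors[word1], word_vectors[word2]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy vectors invented for the example.
toy_vectors = {
    "north": [0.0, 2.0],
    "up": [0.0, 3.0],      # same direction as "north", different length
    "east": [1.0, 0.0],    # orthogonal to "north"
    "south": [0.0, -2.0],  # opposite to "north"
}
same_direction = similarity("north", "up", toy_vectors)
orthogonal = similarity("north", "east", toy_vectors)
opposite = similarity("north", "south", toy_vectors)
print(same_direction, orthogonal, opposite)
```

Vector length does not affect the score: "north" and "up" differ in magnitude but point the same way, so they score as maximally similar.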
 
 
 - subfolder = 'embedding'
 - torch_embedding()
- Convert the embedding model to a PyTorch Embedding layer. - Returns:
- PyTorch Embedding layer. 
- Return type:
- torch.nn.Embedding 
 
 - train(dataset, epochs)
- Train the embedding model on a dataset. - Parameters:
- dataset – The training dataset. 
- epochs – Number of training epochs. 
 
 
 - property vectors
- Get the array/tensor of all vectors. 
 - vectors_filename = 'embedding.bin.wv.vectors.npy'
 - property vocab: Dict[str, int]
- Get the vocabulary. 
 - word2index(word)
- Get the index of a word in the vocabulary. - Parameters:
- word (str) – Input word. 
- Returns:
- Index of the word. 
- Return type:
- int 
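
`word2index` and `index2word` are inverse lookups over the vocabulary, which `vocab` exposes as a `Dict[str, int]`. The following sketch shows that round trip with a plain dictionary; the three-word vocabulary and the module-level functions are hypothetical stand-ins, not the class's own code.

```python
# A toy vocabulary in the same shape as the `vocab` property: word -> index.
vocab = {"hello": 0, "world": 1, "embedding": 2}

# Invert the mapping once so that index lookups are also constant time.
index_to_word = {index: word for word, index in vocab.items()}


def word2index(word):
    """Return the integer index of `word` in the toy vocabulary."""
    return vocab[word]


def index2word(index):
    """Return the word stored at `index` in the toy vocabulary."""
    return index_to_word[index]


round_trip = index2word(word2index("world"))
print(word2index("embedding"), round_trip)
```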
 
 - property word_vectors
- Get the key-value pairs mapping each word to its vector. 