hezar.embeddings.embedding module
- class hezar.embeddings.embedding.Embedding(config: EmbeddingConfig, embedding_file: str | None = None, vectors_file: str | None = None, **kwargs)
Bases:
object
Base class for all embeddings.
- Parameters:
config – An EmbeddingConfig object to construct the embedding.
embedding_file (str) – Path to the embedding file.
vectors_file (str) – Path to the vectors file.
**kwargs – Extra embedding config parameters passed as keyword arguments.
- config_filename = 'embedding_config.yaml'
- doesnt_match(words: List[str])
Get the word that doesn’t match the others in a list.
- Parameters:
words (List[str]) – List of words.
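The idea behind an odd-one-out query can be sketched in plain Python. This is an illustrative sketch only, not hezar's implementation: it assumes the outlier is the word whose unit-normalized vector is least aligned with the mean of the group, and `toy_vectors` is a hypothetical hand-made vocabulary rather than a trained model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def doesnt_match(words, vectors):
    """Return the word least similar to the mean of the group.

    `vectors` is a hypothetical dict mapping each word to a list of floats.
    """
    unit = [normalize(vectors[w]) for w in words]
    dim = len(unit[0])
    # Mean of the unit vectors, one component at a time.
    mean = [sum(v[d] for v in unit) / len(unit) for d in range(dim)]
    # For unit vectors, the dot product with the mean is proportional to
    # the cosine similarity, so the smallest score marks the outlier.
    scores = [sum(a * b for a, b in zip(v, mean)) for v in unit]
    return words[scores.index(min(scores))]


# Hypothetical 3-dimensional toy vectors: three animals and one vehicle.
toy_vectors = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.2, 0.0],
    "fox": [0.8, 0.3, 0.1],
    "car": [0.0, 0.1, 1.0],
}
outlier = doesnt_match(["cat", "dog", "fox", "car"], toy_vectors)
```

With these toy values the vehicle is the word that points away from the rest of the group.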
- filename = 'embedding.bin'
- from_file(embedding_path, vectors_path)
Load the embedding model from file.
- Parameters:
embedding_path (str) – Path to the embedding file.
vectors_path (str) – Path to the vectors file.
- index2word(index)
Get the word corresponding to a given index.
- Parameters:
index (int) – Input index.
- Returns:
Word corresponding to the index.
- Return type:
str
- classmethod load(hub_or_local_path, config_filename=None, embedding_file=None, vectors_file=None, subfolder=None, cache_dir=None, **kwargs) → Embedding
Load an embedding model from a local or Hugging Face Hub path.
- Parameters:
hub_or_local_path – Path to the local directory or the Hugging Face Hub repository.
config_filename (str) – Configuration file name.
embedding_file (str) – Embedding file name.
vectors_file (str) – Vectors file name.
subfolder (str) – Subfolder within the repository.
cache_dir (str) – Path to the cache directory.
**kwargs – Additional keyword arguments.
- Returns:
Loaded Embedding object.
- Return type:
Embedding
- most_similar(word: str, top_n: int = 5)
Get the most similar words to a given word.
- Parameters:
word (str) – Input word.
top_n (int) – Number of similar words to retrieve.
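A nearest-neighbour lookup of this kind can be sketched as a ranking by cosine similarity. The sketch below is an assumption about the general technique, not a description of hezar's internals or return format; `toy_vectors` and the returned `(word, score)` pairs are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, top_n=5):
    """Return up to `top_n` (word, score) pairs, best match first.

    `vectors` is a hypothetical dict mapping each word to a list of floats.
    """
    query = vectors[word]
    # Score every other word against the query, skipping the query itself.
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy vocabulary.
toy_vectors = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.2, 0.0],
    "fox": [0.8, 0.3, 0.1],
    "car": [0.0, 0.1, 1.0],
}
neighbours = most_similar("cat", toy_vectors, top_n=2)
```

Here `top_n` only truncates the ranked list, so asking for more neighbours than the vocabulary holds simply returns every other word.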
- push_to_hub(repo_id, commit_message=None, subfolder=None, filename=None, vectors_filename=None, config_filename=None, private=False)
Push the embedding model to the Hugging Face Hub.
- Parameters:
repo_id – ID of the Hugging Face Hub repository.
commit_message (str) – Commit message.
subfolder (str) – Subfolder within the repository.
filename (str) – Name of the embedding file.
vectors_filename (str) – Name of the vectors file.
config_filename (str) – Configuration file name.
private (bool) – Whether the repository is private.
- save(path: str | PathLike, filename: str | None = None, subfolder: str | None = None, save_config: bool = True, config_filename: str | None = None)
Save the embedding model to a specified path.
- Parameters:
path (str | os.PathLike) – Path to save the embedding model.
filename (str) – Name of the embedding file.
subfolder (str) – Subfolder within the path.
save_config (bool) – Whether to save the configuration.
config_filename (str) – Configuration file name.
- similarity(word1: str, word2: str)
Get the similarity between two words.
- Parameters:
word1 (str) – First word.
word2 (str) – Second word.
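The reference does not state which metric is used. Word embeddings are commonly compared with cosine similarity, so the following sketch assumes that metric; `cosine_similarity` is a hypothetical helper, not part of the hezar API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1 regardless of their length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions score -1.
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])
```

Because the score depends only on direction, two words can be rated as highly similar even when their vectors differ greatly in magnitude.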
- subfolder = 'embedding'
- torch_embedding()
Convert the embedding model to a PyTorch Embedding layer.
- Returns:
PyTorch Embedding layer.
- Return type:
torch.nn.Embedding
- train(dataset, epochs)
Train the embedding model on a dataset.
- Parameters:
dataset – The training dataset.
epochs – Number of training epochs.
- property vectors
Get the array/tensor of all vectors.
- vectors_filename = 'embedding.bin.wv.vectors.npy'
- property vocab: Dict[str, int]
Get the vocabulary.
- word2index(word)
Get the index of a word in the vocabulary.
- Parameters:
word (str) – Input word.
- Returns:
Index of the word.
- Return type:
int
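The relationship between a `Dict[str, int]` vocabulary and the two lookup directions can be illustrated with a small stand-alone sketch. The `words` list and both helper functions are hypothetical and only mirror the documented names; they are not hezar's code.

```python
# Hypothetical vocabulary: position in the list is the word's index.
words = ["cat", "dog", "fox"]

# Word-to-index mapping, the same shape as a Dict[str, int] vocabulary.
vocab = {word: index for index, word in enumerate(words)}


def word2index(word):
    """Look up the integer index of a word."""
    return vocab[word]


def index2word(index):
    """Look up the word stored at an integer index."""
    return words[index]


# The two lookups are inverses of each other.
round_trip = index2word(word2index("dog"))
```

Keeping both directions consistent matters because the index is typically what selects a row in the vectors array.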
- property word_vectors
Get key:value pairs of word:vector.