hezar.embeddings.word2vec module¶
- class hezar.embeddings.word2vec.Word2Vec(config: Word2VecConfig, embedding_file: str | None = None, vectors_file: str | None = None, **kwargs)[source]¶
Bases: Embedding
Word2Vec embedding class.
- Parameters:
config (Word2VecConfig) – Configuration object.
embedding_file (str) – Path to the embedding file.
vectors_file (str) – Path to the vectors file.
**kwargs – Additional config parameters given as keyword arguments.
- build()[source]¶
Build the Word2Vec embedding model.
- Returns:
Word2Vec embedding model.
- Return type:
gensim.models.Word2Vec
- doesnt_match(words: List[str])[source]¶
Get the word that doesn’t match the others in a list.
- Parameters:
words (List[str]) – List of words.
- Returns:
Word that doesn’t match.
- Return type:
str
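A common way to implement this (and roughly what gensim does under the hood) is to pick the word whose vector is least similar to the mean vector of the group. A plain-Python sketch of that idea with toy 2-d vectors (illustrative only, not the hezar or gensim API):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def doesnt_match(vectors):
    # vectors: dict mapping word -> vector.
    # The outlier is the word least similar to the mean of all vectors.
    dim = len(next(iter(vectors.values())))
    mean = [sum(v[i] for v in vectors.values()) / len(vectors) for i in range(dim)]
    return min(vectors, key=lambda w: cosine(vectors[w], mean))

toy = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
}
print(doesnt_match(toy))  # prints "car"
```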
- from_file(embedding_path, vectors_path)[source]¶
Load the Word2Vec embedding model from file.
- Parameters:
embedding_path (str) – Path to the embedding file.
vectors_path (str) – Path to the vectors file.
- Raises:
ValueError – If vectors file is not found.

- Returns:
Loaded Word2Vec embedding model.
- Return type:
gensim.models.Word2Vec
- get_normed_vectors()[source]¶
Get normalized word vectors.
- Returns:
Normed word vectors.
- Return type:
Any
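Normalized vectors have unit L2 length, which reduces cosine similarity to a plain dot product. A minimal sketch of the normalization involved (toy data, not the hezar API):

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (L2 norm of 1).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

print(l2_normalize([3.0, 4.0]))  # -> [0.6, 0.8]
```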
- most_similar(word: str, top_n: int = 5)[source]¶
Get the most similar words to a given word.
- Parameters:
word (str) – Input word.
top_n (int) – Number of similar words to retrieve.
- Returns:
List of dictionaries, each with ‘word’ and ‘score’ keys for a similar word and its similarity score.
- Return type:
List[Dict[str, str | float]]
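Conceptually, this ranks the vocabulary by cosine similarity to the query word's vector and keeps the top entries. A toy sketch of that ranking and of the returned shape (illustrative only, not the hezar API):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, top_n=5):
    # Rank every other word by cosine similarity to the query word.
    query = vectors[word]
    scored = [
        {"word": w, "score": cosine(query, v)}
        for w, v in vectors.items()
        if w != word
    ]
    return sorted(scored, key=lambda d: d["score"], reverse=True)[:top_n]

toy = {
    "king": [0.9, 0.8],
    "queen": [0.85, 0.82],
    "apple": [0.1, 0.9],
}
print(most_similar("king", toy, top_n=2))
```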
- save(path: str | PathLike, filename: str | None = None, subfolder: str | None = None, save_config: bool = True, config_filename: str | None = None)[source]¶
Save the Word2Vec embedding model to a specified path.
- Parameters:
path (str | os.PathLike) – Path to save the embedding model.
filename (str) – Name of the embedding file.
subfolder (str) – Subfolder within the path.
save_config (bool) – Whether to save the configuration.
config_filename (str) – Configuration file name.
- similarity(word1: str, word2: str)[source]¶
Get the similarity between two words.
- Parameters:
word1 (str) – First word.
word2 (str) – Second word.
- Returns:
Similarity score.
- Return type:
float
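The score is the cosine of the angle between the two word vectors, so it lies in [-1, 1]. A self-contained sketch of that computation (not the hezar API):

```python
import math

def similarity(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(similarity([1.0, 0.0], [1.0, 0.0]))  # -> 1.0 (identical direction)
print(similarity([1.0, 0.0], [0.0, 1.0]))  # -> 0.0 (orthogonal)
```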
- train(dataset: List[str], epochs: int = 5)[source]¶
Train the Word2Vec embedding model.
- Parameters:
dataset (List[str]) – List of sentences for training.
epochs (int) – Number of training epochs.
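The skip-gram objective (the default train_algorithm) learns from (center, context) word pairs drawn within the configured window around each token. A toy sketch of how such pairs are generated from sentences (whitespace tokenization assumed for illustration; not the hezar API):

```python
def skipgram_pairs(sentences, window=2):
    # Yield (center, context) pairs for every token within `window`
    # positions of the center token, per sentence.
    pairs = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the cat sat"], window=1))
# -> [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```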
- property vectors¶
Get all vectors.
- Returns:
All vectors.
- Return type:
numpy.ndarray
- property vocab¶
Get vocabulary.
- Returns:
Vocabulary.
- Return type:
Dict[str, int]
- property word_vectors¶
Get word vectors.
- Returns:
Word vectors.
- Return type:
gensim.models.keyedvectors.KeyedVectors
- class hezar.embeddings.word2vec.Word2VecConfig(bypass_version_check: bool = False, dataset_path: str | None = None, vector_size: int = 300, window: int = 5, alpha: float = 0.025, min_count: int = 1, seed: int = 1, workers: int = 3, min_alpha: float = 0.0001, cbow_mean: int = 1, epochs: int = 5, train_algorithm: Literal['skipgram', 'cbow'] = 'skipgram', save_format: Literal['binary', 'text'] = 'binary')[source]¶
Bases: EmbeddingConfig
Configuration class for Word2Vec embeddings.
- name¶
Name of the embedding.
- Type:
str
- dataset_path¶
Path to the dataset.
- Type:
str
- vector_size¶
Size of the word vectors.
- Type:
int
- window¶
Window size for context words.
- Type:
int
- alpha¶
Learning rate.
- Type:
float
- min_count¶
Ignores all words with a total frequency lower than this.
- Type:
int
- seed¶
Seed for random number generation.
- Type:
int
- workers¶
Number of worker threads used for training.
- Type:
int
- min_alpha¶
Minimum learning rate.
- Type:
float
- cbow_mean¶
If 1, use the mean of the context word vectors; if 0, use the sum. Only applies when training with CBOW. Default is 1.
- Type:
int
- epochs¶
Number of training epochs. Default is 5.
- Type:
int
- train_algorithm¶
Training algorithm, either ‘skipgram’ or ‘cbow’.
- Type:
Literal[“skipgram”, “cbow”]
- save_format¶
Format for saving the model, either ‘binary’ or ‘text’.
- Type:
Literal[“binary”, “text”]
- alpha: float = 0.025¶
- cbow_mean: int = 1¶
- dataset_path: str = None¶
- epochs: int = 5¶
- min_alpha: float = 0.0001¶
- min_count: int = 1¶
- name: str = 'word2vec'¶
- save_format: Literal['binary', 'text'] = 'binary'¶
- seed: int = 1¶
- train_algorithm: Literal['skipgram', 'cbow'] = 'skipgram'¶
- vector_size: int = 300¶
- window: int = 5¶
- workers: int = 3¶
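As a usage sketch tying these fields together (the import path is taken from the module name on this page; the toy dataset and hyperparameter values are illustrative assumptions, not verified against a hezar installation):

```python
from hezar.embeddings.word2vec import Word2Vec, Word2VecConfig

# Illustrative hyperparameters; every field below is documented above.
config = Word2VecConfig(
    vector_size=200,
    window=5,
    min_count=2,
    epochs=10,
    train_algorithm="cbow",   # default is "skipgram"
    save_format="binary",
)
model = Word2Vec(config)
# `train` expects a list of sentences (see the train() docs above).
model.train(dataset=["a toy sentence", "another toy sentence"], epochs=config.epochs)
```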