hezar.embeddings.fasttext module

class hezar.embeddings.fasttext.FastText(config: FastTextConfig, embedding_file: str | None = None, vectors_file: str | None = None, **kwargs)[source]

Bases: Embedding

FastText embedding class.

Parameters:
  • config (FastTextConfig) – Configuration object.

  • embedding_file (str, optional) – Path to the embedding file.

  • vectors_file (str, optional) – Path to the vectors file.

  • **kwargs – Additional config parameters given as keyword arguments.
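
A minimal usage sketch; the Hub path below is illustrative, and Embedding.load is assumed to be inherited from the Embedding base class:

>>> from hezar.embeddings import Embedding
>>> fasttext = Embedding.load("hezarai/fasttext-fa-300")  # illustrative Hub path
>>> fasttext.most_similar("هزار", top_n=3)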

build()[source]

Build the FastText embedding model.

Returns:

FastText embedding model.

Return type:

gensim.models.fasttext.FastText

doesnt_match(words: List[str])[source]

Get the word that doesn’t match the others in a list.

Parameters:

words (List[str]) – List of words.

Returns:

Word that doesn’t match.

Return type:

str
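
For example, on a hypothetical English-trained model (the output shown is illustrative):

>>> fasttext.doesnt_match(["breakfast", "lunch", "dinner", "car"])
'car'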

from_file(embedding_path, vectors_path)[source]

Load the FastText embedding model from file.

Parameters:
  • embedding_path (str) – Path to the embedding file.

  • vectors_path (str) – Path to the vectors file.

Returns:

Loaded FastText embedding model.

Return type:

gensim.models.fasttext.FastText
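
A loading sketch; the file names are illustrative and would normally come from an earlier save() call (gensim stores large arrays in sidecar .npy files):

>>> model = fasttext.from_file("saved/fasttext.bin", "saved/fasttext.bin.wv.vectors_ngrams.npy")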

get_normed_vectors()[source]

Get the word vectors normalized to unit (L2) length.

most_similar(word: str, top_n: int = 5)[source]

Get the most similar words to a given word.

Parameters:
  • word (str) – Input word.

  • top_n (int) – Number of similar words to retrieve.

Returns:

List of dictionaries, each containing a ‘word’ entry and a ‘score’ entry.

Return type:

List[Dict[str, str | float]]
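
For example (the words and scores shown are placeholders):

>>> fasttext.most_similar("هزار", top_n=2)
[{'word': '...', 'score': 0.91}, {'word': '...', 'score': 0.88}]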

required_backends: List[str | Backends] = [Backends.GENSIM]

save(path: str | os.PathLike, filename: str = None, subfolder: str = None, save_config: bool = True, config_filename: str = None)[source]

Save the FastText embedding model to a specified path.

Parameters:
  • path (str | os.PathLike) – Path to save the embedding model.

  • filename (str) – Name of the embedding file.

  • subfolder (str) – Subfolder within the path.

  • save_config (bool) – Whether to save the configuration.

  • config_filename (str) – Configuration file name.
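
A save sketch (the directory and file names are illustrative):

>>> fasttext.save("saved/my-fasttext", filename="fasttext.bin", save_config=True)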

similarity(word1: str, word2: str)[source]

Get the similarity between two words.

Parameters:
  • word1 (str) – First word.

  • word2 (str) – Second word.

Returns:

Similarity score.

Return type:

float
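
For example (the score is illustrative; cosine-style similarities typically fall in [-1, 1]):

>>> fasttext.similarity("سلام", "درود")
0.72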

train(dataset: List[str], epochs: int = 5)[source]

Train the FastText embedding model.

Parameters:
  • dataset (List[str]) – List of sentences for training.

  • epochs (int) – Number of training epochs.
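
A from-scratch training sketch with a toy corpus (a real corpus would be far larger; the import path assumes both classes are re-exported from hezar.embeddings):

>>> from hezar.embeddings import FastText, FastTextConfig
>>> config = FastTextConfig(vector_size=100, window=5, train_algorithm="skipgram")
>>> fasttext = FastText(config)
>>> fasttext.train(["hezar is a persian nlp library", "fasttext models subword information"], epochs=5)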

property vectors

Get all vectors.

property vocab

Get vocabulary.

property word_vectors

Get word vectors.
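
A quick sketch of the three properties (exact return types depend on the gensim backend):

>>> len(fasttext.vocab)       # vocabulary size
>>> fasttext.vectors.shape    # e.g. (vocab_size, vector_size), assuming a numpy-backed array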

class hezar.embeddings.fasttext.FastTextConfig(bypass_version_check: bool = False, dataset_path: str | None = None, vector_size: int = 300, window: int = 5, alpha: float = 0.025, min_count: int = 1, seed: int = 1, workers: int = 3, min_alpha: float = 0.0001, train_algorithm: Literal['skipgram', 'cbow'] = 'skipgram', cbow_mean: int = 1, epochs: int = 5)[source]

Bases: EmbeddingConfig

Configuration class for FastText embeddings.

name

Name of the embedding.

Type:

str

dataset_path

Path to the dataset.

Type:

str

vector_size

Size of the word vectors.

Type:

int

window

Window size for context words.

Type:

int

alpha

Learning rate.

Type:

float

min_count

Ignores all words with a total frequency lower than this.

Type:

int

seed

Seed for random number generation.

Type:

int

workers

Number of workers for training.

Type:

int

min_alpha

Minimum learning rate.

Type:

float

train_algorithm

Training algorithm, either ‘skipgram’ or ‘cbow’.

Type:

Literal[“skipgram”, “cbow”]

cbow_mean

If 1, use the mean of the context word vectors; if 0, use the sum. Only applies when train_algorithm is ‘cbow’. Default is 1.

Type:

int

epochs

Number of training epochs. Default is 5.

Type:

int

alpha: float = 0.025
cbow_mean: int = 1
dataset_path: str = None
epochs: int = 5
min_alpha: float = 0.0001
min_count: int = 1
name: str = 'fasttext'
seed: int = 1
train_algorithm: Literal['skipgram', 'cbow'] = 'skipgram'
vector_size: int = 300
window: int = 5
workers: int = 3
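
Because extra keyword arguments to FastText are treated as config overrides (see **kwargs above), individual fields can also be set at construction time; a sketch:

>>> config = FastTextConfig(train_algorithm="cbow", cbow_mean=1)
>>> fasttext = FastText(config, epochs=10)  # keyword argument overrides the config's epochs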