hezar.embeddings.fasttext module

class hezar.embeddings.fasttext.FastText(config: FastTextConfig, embedding_file: str | None = None, vectors_file: str | None = None, **kwargs)[source]

Bases: Embedding

FastText embedding class.

Parameters:
  • config (FastTextConfig) – Configuration object.

  • embedding_file (str | None) – Path to the embedding file. Defaults to None.

  • vectors_file (str | None) – Path to the vectors file. Defaults to None.

  • **kwargs – Additional config parameters given as keyword arguments.

build()[source]

Build the FastText embedding model.

Returns:

FastText embedding model.

Return type:

fasttext.FastText

doesnt_match(words: List[str])[source]

Get the word that doesn’t match the others in a list.

Parameters:

words (List[str]) – List of words.

Returns:

Word that doesn’t match.

Return type:

str
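The page does not say how the odd word is chosen. Since the class requires the Gensim backend, a plausible reading is Gensim's approach: normalize each word vector, average them, and return the word least aligned with that average. The sketch below illustrates that idea on hypothetical 2-dimensional vectors; `doesnt_match_sketch` and `toy_vectors` are illustrative names, not part of hezar.

```python
import math
from typing import Dict, List, Sequence


def _unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match_sketch(vectors: Dict[str, Sequence[float]], words: List[str]) -> str:
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: _unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    # Dot product with the mean vector; the smallest value marks the outlier.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical toy vectors: three meal words point one way, "car" another.
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "car": [0.1, 0.9],
}
odd_one = doesnt_match_sketch(toy_vectors, ["breakfast", "lunch", "dinner", "car"])
```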

from_file(embedding_path, vectors_path)[source]

Load the FastText embedding model from file.

Parameters:
  • embedding_path (str) – Path to the embedding file.

  • vectors_path (str) – Path to the vectors file.

Returns:

Loaded FastText embedding model.

Return type:

fasttext.FastText

get_normed_vectors()[source]

Get normalized word vectors.
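"Normalized" is not defined on this page. Assuming it means unit Euclidean (L2) length per vector, as in Gensim, the operation can be sketched as follows. `l2_normalize` is a hypothetical helper, and the sketch assumes no all-zero rows.

```python
import math
from typing import List, Sequence


def l2_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each row to unit Euclidean length (rows must be non-zero)."""
    normed = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normed.append([x / norm for x in row])
    return normed


# Two made-up vectors with lengths 5 and 2.
normed = l2_normalize([[3.0, 4.0], [0.0, 2.0]])
```

After normalization, the dot product of two rows equals their cosine similarity.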

most_similar(word: str, top_n: int = 5)[source]

Get the most similar words to a given word.

Parameters:
  • word (str) – Input word.

  • top_n (int) – Number of similar words to retrieve.

Returns:

List of dictionaries, each with the keys ‘word’ and ‘score’.

Return type:

List[Dict[str, str | float]]
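To make the return shape concrete, here is a small stand-alone sketch that ranks words by cosine similarity and returns the documented list of `{'word', 'score'}` dictionaries. Ranking by cosine similarity is an assumption based on the Gensim backend; `most_similar_sketch` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Union


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_sketch(
    vectors: Dict[str, Sequence[float]], word: str, top_n: int = 5
) -> List[Dict[str, Union[str, float]]]:
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [
        {"word": other, "score": _cosine(query, vec)}
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:top_n]


# Hypothetical 2-dimensional vectors for illustration only.
toy_vectors = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.3],
    "apple": [0.1, 0.9],
}
neighbours = most_similar_sketch(toy_vectors, "king", top_n=2)
```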

required_backends: List[str | Backends] = [Backends.GENSIM]

save(path: str | PathLike, filename: str | None = None, subfolder: str | None = None, save_config: bool = True, config_filename: str | None = None)[source]

Save the FastText embedding model to a specified path.

Parameters:
  • path (str | os.PathLike) – Path to save the embedding model.

  • filename (str) – Name of the embedding file.

  • subfolder (str) – Subfolder within the path.

  • save_config (bool) – Whether to save the configuration.

  • config_filename (str) – Configuration file name.

similarity(word1: str, word2: str)[source]

Get the similarity between two words.

Parameters:
  • word1 (str) – First word.

  • word2 (str) – Second word.

Returns:

Similarity score.

Return type:

float
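The page does not name the similarity measure. Given the Gensim backend, cosine similarity is the likely candidate, though that is an assumption. A minimal sketch of the computation on made-up 3-dimensional vectors:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real vectors have `vector_size` dimensions.
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [1.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
```

Scores fall between -1 and 1; orthogonal vectors such as "cat" and "car" here score 0.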

train(dataset: List[str], epochs: int = 5)[source]

Train the FastText embedding model.

Parameters:
  • dataset (List[str]) – List of sentences for training.

  • epochs (int) – Number of training epochs.

property vectors

Get all vectors.

property vocab

Get vocabulary.

property word_vectors

Get word vectors.

class hezar.embeddings.fasttext.FastTextConfig(bypass_version_check: bool = False, dataset_path: str | None = None, vector_size: int = 300, window: int = 5, alpha: float = 0.025, min_count: int = 1, seed: int = 1, workers: int = 3, min_alpha: float = 0.0001, train_algorithm: Literal['skipgram', 'cbow'] = 'skipgram', cbow_mean: int = 1, epochs: int = 5)[source]

Bases: EmbeddingConfig

Configuration class for FastText embeddings.

name

Name of the embedding.

Type:

str

dataset_path

Path to the dataset.

Type:

str

vector_size

Size of the word vectors.

Type:

int

window

Window size for context words.

Type:

int
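As an illustration of what a context window means, the sketch below lists the (center, context) pairs for words at most `window` positions apart. `context_pairs` is a hypothetical helper; actual implementations may sample a smaller effective window per position, so treat `window` as a maximum distance.

```python
from typing import List, Tuple


def context_pairs(tokens: List[str], window: int = 5) -> List[Tuple[str, str]]:
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = context_pairs(["a", "b", "c", "d"], window=1)
```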

alpha

Initial learning rate.

Type:

float

min_count

Ignores all words with a total frequency lower than this.

Type:

int
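The effect of this threshold can be sketched with a simple frequency count. Whitespace tokenization is an assumption made for the example, and `build_vocab` is a hypothetical helper rather than a hezar function.

```python
from collections import Counter
from typing import Dict, List


def build_vocab(sentences: List[str], min_count: int = 1) -> Dict[str, int]:
    """Count whitespace-separated tokens and keep those seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return {word: freq for word, freq in counts.items() if freq >= min_count}


corpus = ["the cat sat", "the cat ran", "a dog ran"]
vocab_all = build_vocab(corpus, min_count=1)       # default: every word is kept
vocab_frequent = build_vocab(corpus, min_count=2)  # drops words seen only once
```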

seed

Seed for random number generation.

Type:

int

workers

Number of workers for training.

Type:

int

min_alpha

Minimum learning rate that alpha decays to during training.

Type:

float
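In Gensim, the backend this module requires, the learning rate drops linearly from `alpha` towards `min_alpha` as training progresses. The sketch below simplifies this to one value per epoch; `linear_lr_schedule` is a hypothetical helper, and the real update happens at a finer granularity.

```python
from typing import List


def linear_lr_schedule(
    alpha: float = 0.025, min_alpha: float = 0.0001, epochs: int = 5
) -> List[float]:
    """Learning rate at the start of each epoch, decaying linearly towards min_alpha."""
    step = (alpha - min_alpha) / epochs
    return [alpha - step * epoch for epoch in range(epochs)]


# Uses the documented defaults for alpha, min_alpha and epochs.
rates = linear_lr_schedule()
```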

train_algorithm

Training algorithm, either ‘skipgram’ or ‘cbow’.

Type:

Literal['skipgram', 'cbow']

cbow_mean

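Whether CBOW averages or sums the context word vectors: in Gensim, 1 uses the mean and 0 uses the sum. Only relevant when train_algorithm is ‘cbow’. Default is 1.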

Type:

int
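Assuming this value is passed through to Gensim, where `cbow_mean=1` averages the context word vectors and `cbow_mean=0` sums them, the difference can be sketched as follows. `combine_context` is a hypothetical helper for illustration.

```python
from typing import List, Sequence


def combine_context(
    context_vectors: Sequence[Sequence[float]], cbow_mean: int = 1
) -> List[float]:
    """Combine context word vectors into a single CBOW input vector."""
    summed = [sum(column) for column in zip(*context_vectors)]
    if cbow_mean:
        # Mean of the context vectors.
        return [value / len(context_vectors) for value in summed]
    # Plain sum of the context vectors.
    return summed


context = [[1.0, 2.0], [3.0, 4.0]]
as_mean = combine_context(context, cbow_mean=1)
as_sum = combine_context(context, cbow_mean=0)
```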

epochs

Number of training epochs. Default is 5.

Type:

int

alpha: float = 0.025
cbow_mean: int = 1
dataset_path: str | None = None
epochs: int = 5
min_alpha: float = 0.0001
min_count: int = 1
name: str = 'fasttext'
seed: int = 1
train_algorithm: Literal['skipgram', 'cbow'] = 'skipgram'
vector_size: int = 300
window: int = 5
workers: int = 3
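The documented fields and defaults can be summarized with a plain dataclass stand-in. `FastTextConfigSketch` is hypothetical and only mirrors the attribute list above; it is not the hezar class and does not inherit from EmbeddingConfig.

```python
from dataclasses import dataclass, replace
from typing import Literal, Optional


@dataclass
class FastTextConfigSketch:
    """Stand-in mirroring the documented fields; not the hezar class itself."""

    name: str = "fasttext"
    dataset_path: Optional[str] = None
    vector_size: int = 300
    window: int = 5
    alpha: float = 0.025
    min_count: int = 1
    seed: int = 1
    workers: int = 3
    min_alpha: float = 0.0001
    train_algorithm: Literal["skipgram", "cbow"] = "skipgram"
    cbow_mean: int = 1
    epochs: int = 5


default_config = FastTextConfigSketch()
# Override a few fields while keeping the remaining defaults.
small_cbow = replace(default_config, vector_size=100, train_algorithm="cbow", epochs=10)
```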