Embeddings¶
In Hezar, embeddings serve as fundamental components for various natural language processing tasks. The Embedding class provides a flexible and extensible foundation for working with word embeddings. Currently, Hezar has two embedding models, both backed by Gensim. This tutorial walks you through the essential aspects of using and customizing embeddings in Hezar.
Load an Embedding from Hub¶
Loading a pretrained embedding, either from the Hub or from a local path, is as straightforward as loading any other module in Hezar. You can choose your desired model from our Hub and load it like below:
from hezar.embeddings import Embedding
word2vec = Embedding.load("hezarai/word2vec-cbow-fa-wikipedia")
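The same method also accepts a local path, so a previously saved embedding can be loaded without the Hub (the path below is a hypothetical example):
word2vec = Embedding.load("path/to/saved/embedding")  # a local directory instead of a Hub ID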
Now let’s just run a simple similarity test between two given words:
word2vec.similarity("هزار", "میلیون")
0.7400991
Embedding Methods¶
Similarity¶
To get the similarity score between two words, use the following:
similarity_score = word2vec.similarity("سلام", "درود")
print(similarity_score)
0.6196184
Get Top-n Similar Words¶
Find the top-n most similar words to a given word like so:
from pprint import pprint
most_similar = word2vec.most_similar("هزار", topn=5)
pprint(most_similar)
[{'score': '0.7407', 'word': 'دویست'},
{'score': '0.7401', 'word': 'میلیون'},
{'score': '0.7326', 'word': 'صد'},
{'score': '0.7277', 'word': 'پانصد'},
{'score': '0.7011', 'word': 'سیصد'}]
Least Similar in a List¶
To get the word in a list that is least similar to the others (the odd one out), use the following:
least_similar = word2vec.doesnt_match(["خانه", "اتاق", "ماشین"])
print(least_similar)
'ماشین'
Get Word’s Vector¶
Get the vector for a word by calling the model directly:
vector = word2vec("سلام")
You can also give the model a list of words to get vectors for each of them:
vectors = word2vec(["هوش", "مصنوعی"])
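Each returned vector has the model's dimensionality (200 for this model). A minimal sanity check, assuming the Gensim backend's NumPy arrays are passed through as-is:
# One 200-dimensional vector per word, assuming NumPy arrays from the Gensim backend
print(len(vector))   # 200
print(len(vectors))  # 2, one vector per input word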
Get the Vocabulary¶
Get the whole vocabulary of the embedding model as a dictionary:
vocab = word2vec.vocab
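A quick way to inspect it; a minimal sketch, assuming vocab is a plain dict mapping words to indices:
# Vocabulary size should match the first dimension of the torch embedding shown below (240547)
print(len(vocab))
print("هوش" in vocab)  # check whether a word is in the vocabulary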
Vocabulary Words and Indexes¶
You can also get the index of a word in the vocabulary, or vice versa:
index = word2vec.word2index("هوش")
word = word2vec.index2word(index)
print(word)
'هوش'
Converting to a PyTorch nn.Embedding¶
You can also get a PyTorch embedding layer from the embedding model:
embedding_layer = word2vec.torch_embedding()
print(embedding_layer)
Embedding(240547, 200)
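To see how this layer relates to the original model, you can look up a word's index and fetch its row from the layer. This is a minimal sketch, assuming the layer's rows follow the same indices that word2index returns:
import torch

# Fetch a word's row from the PyTorch layer (assumes row order matches word2index)
idx = word2vec.word2index("هوش")
with torch.no_grad():
    vec = embedding_layer(torch.tensor([idx]))
print(vec.shape)  # torch.Size([1, 200])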
Training an Embedding Model¶
To train an embedding model, first choose and build your embedding. For this example, we’ll train a Word2Vec model using the CBOW algorithm with a vector dimension of 200.
from hezar.embeddings import Word2Vec, Word2VecConfig
model = Word2Vec(
    Word2VecConfig(
        vector_size=200,
        window=5,
        train_algorithm="cbow",
        alpha=0.025,
        min_count=1,
        seed=1,
        workers=4,
        min_alpha=0.0001,
    )
)
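Here, window sets the context size around each target word, min_count drops words rarer than the threshold, and alpha/min_alpha are the initial and final learning rates, following the usual Gensim Word2Vec semantics. To train with skip-gram instead of CBOW, train_algorithm would presumably be set to "skipgram"; check Word2VecConfig for the exact accepted values.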
Now, given a list of sentences as the dataset, run the training process:
with open("data.txt") as f:
    sentences = f.readlines()
sentences = [s.replace("\n", "") for s in sentences]
model.train(sentences, epochs=5)
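After training, the inference methods from earlier in this tutorial work on the freshly trained model. For example (the query word is a hypothetical placeholder and must actually occur in your data):
# Sanity-check the trained embeddings with a word from your own corpus
print(model.most_similar("کتاب", topn=3))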
Saving and Pushing to the Hub¶
Now you can save and push your model to the Hub:
model.save("word2vec-cbow-200")
model.push_to_hub("<your-hf-username>/word2vec-cbow-200-fa")
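Once pushed, the model can be loaded back from the Hub exactly as shown at the beginning of this tutorial:
from hezar.embeddings import Embedding

word2vec = Embedding.load("<your-hf-username>/word2vec-cbow-200-fa")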