General description and implementation of Word2Vec using PyTorch

Abstract

This article provides a general description of the vector representation of words known as word embeddings, focusing on the word2vec model. An example implementation of word2vec using the PyTorch library is also presented, covering both architectures: Skip-Gram and CBOW.

Introduction

Word2Vec is a popular word embedding model proposed by researchers at Google in 2013 (Tomas Mikolov et al.). It transforms the words of a text corpus into numeric vectors in such a way that words with similar semantic meanings have similar vector representations in a multidimensional space. This makes Word2Vec a powerful tool for natural language processing (NLP) tasks such as sentiment analysis, machine translation, automatic summarization and many others.

Main characteristics of Word2Vec:

  • Distributed Representation: Each word is represented as a vector in a multidimensional space, where the relationships between words are reflected through the cosine similarity between their vectors.

  • Unsupervised learning: Word2Vec learns from large unlabeled text corpora without the need for external annotations or markup.

  • Contextual learning: Word vectors are obtained based on the context in which the words occur, thereby capturing their semantic and syntactic relationships.

Two main architectures of the Word2Vec model:

CBOW (Continuous Bag of Words): This approach predicts the current word based on the context around it. For example, for the phrase “blue sky above the head”, the CBOW model will try to predict the word “sky” from the context words “blue”, “above” and “head”. CBOW processes large amounts of data quickly, but is less effective for rare words.

Skip-Gram: In contrast, this approach uses the current word to predict the words in its context. For the same example, the Skip-Gram model will try to predict the words “blue”, “above” and “head” from the word “sky”. Skip-Gram is slower to train, but works better with rare words and less frequent contexts.

CBOW (Continuous Bag of Words)

The purpose of CBOW is to predict the target word based on the context around it. The context is defined as the set of words surrounding the target word within a given window. The model architecture can be simplified to a three-layer neural network: an input layer, a hidden layer and an output layer.

Input layer: The context words are fed into the model. These words are represented as vectors using one-hot encoding, where each vector has a dimension equal to the size of the vocabulary and contains a 1 at the position corresponding to the word's index in the vocabulary and 0 everywhere else.

Hidden layer: The input word vectors are multiplied by the weight matrix between the input and hidden layers, producing the hidden-layer vector. In CBOW, the context word vectors are typically averaged before being passed on to the next layer.

Output layer: The hidden-layer vector is multiplied by the weight matrix between the hidden and output layers, and the result is passed through a softmax function to obtain the probability of each word in the vocabulary being the target word. The goal of training is to maximize the probability of the correct target word.
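
To make the three-layer description concrete, here is a minimal NumPy sketch of the CBOW forward pass for a toy vocabulary. The sizes and weights are purely illustrative and are not part of the implementation later in the article; multiplying one-hot vectors by the input weight matrix is equivalent to simply looking up its rows.

import numpy as np

# Toy sizes, chosen only for illustration
vocab_size, embedding_dim = 5, 3
W_in = np.random.randn(vocab_size, embedding_dim)    # weights between input and hidden layer
W_out = np.random.randn(embedding_dim, vocab_size)   # weights between hidden and output layer

context_ids = [1, 3]                                 # indices of the one-hot context words
hidden = W_in[context_ids].mean(axis=0)              # average of the context word vectors
logits = hidden @ W_out                              # output-layer scores
probs = np.exp(logits) / np.exp(logits).sum()        # softmax: probability of each vocabulary word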

Skip-Gram

Unlike CBOW, the goal of Skip-Gram is to predict the context words for a given target word. The target word is fed into the model and used to predict the words in its context within a given range (called a window).

Input layer: The input is the target word represented as a one-hot vector.
Hidden layer: Same as in CBOW: the target word vector is multiplied by the weight matrix leading to the hidden layer.

Output layer: Unlike CBOW, where the output layer computes a single softmax, in Skip-Gram a separate softmax is used for each word in the context, which means the model tries to predict each context word separately. The goal of training is to maximize the probability of the real context words given the target word.

Python implementation

We will implement the Word2Vec model with the Skip-Gram architecture using the PyTorch library. It is not the most efficient implementation of Word2Vec, but in my opinion it is simple enough to follow.

We import all the necessary libraries:

import re
from collections import Counter

import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

Skip-Gram:

Data preparation

First, we need to prepare our data. In this example, we keep preprocessing to a minimum and focus on the model itself. The function below builds a word-to-index dictionary and the training data in the form of (center word, context word) pairs.

def prepare_data(text, window_size=2):
	# Convert the text to lowercase
	text = text.lower()
	# Remove all characters except a-z, @, # and spaces
	text = re.sub(r'[^a-z@# ]', '', text)
	# Split the text into words
	tokens = text.split()
	# Build the vocabulary of unique words
	vocab = set(tokens)
	# Map each word to its index in the vocabulary
	word_to_ix = {word: i for i, word in enumerate(vocab)}
	# Build (center word, context word) pairs within the window
	data = []
	for i in range(len(tokens)):
		for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
			if i != j:
				data.append((tokens[i], tokens[j]))
	return data, word_to_ix, len(vocab)
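
For example, calling prepare_data on a short phrase produces pairs like the ones below (the exact indices in word_to_ix depend on Python's set ordering, so the comments are only indicative):

data, word_to_ix, vocab_size = prepare_data("blue sky above the head", window_size=1)
# data contains pairs such as ('blue', 'sky'), ('sky', 'blue'), ('sky', 'above'), ...
# word_to_ix maps each unique word to an index; vocab_size == 5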

Skip-Gram Model Definition

Let's set up a dataset class for the dataloader; we could do without it, but it allows us to scale the project and may be useful in the future. Details can be found in the official documentation.

The Dataset class requires the following three methods:

  • __init__: Executed when an instance of a class is created. This is usually where attributes are defined.

  • __len__: Should return the length of the dataset. The DataLoader uses this to know how many samples are available.

  • __getitem__: Given an index, returns the data item corresponding to that index; the DataLoader groups these items into batches (sets of items of a fixed size).

class SkipGramModelDataset(Dataset):
	def __init__(self, data, word_to_ix):
		# Convert (center word, context word) pairs into index pairs
		self.data = [(word_to_ix[center], word_to_ix[context]) for center, context in data]

	def __len__(self):
		return len(self.data)

	def __getitem__(self, idx):
		center, context = self.data[idx]
		return torch.tensor(center, dtype=torch.long), torch.tensor(context, dtype=torch.long)
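
As a quick illustration (the batch size here is arbitrary, and data and word_to_ix are assumed to come from prepare_data above), the dataset can be wrapped in a standard DataLoader:

dataset = SkipGramModelDataset(data, word_to_ix)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
center_batch, context_batch = next(iter(loader))  # each is a LongTensor of shape (4,)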
		

Let's define the structure of our model in PyTorch.

Let's use a simple model structure: the input layer is nn.Embedding, a standard layer for NLP tasks that stores the vector representations (embeddings) of the words. Next comes a linear layer. Finally, we apply a logarithmic softmax function.

LogSoftmax is usually applied to the last layer of a neural network before computing a loss function such as NLLLoss (Negative Log Likelihood Loss). LogSoftmax converts logits (the outputs of the linear layer) into log probabilities, which can then be used directly with NLLLoss. It is important to note that NLLLoss expects its input to be in log-probability format.
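
As a standalone check of this relationship (the sizes below are arbitrary and the snippet is not part of the article's pipeline), LogSoftmax followed by NLLLoss gives the same value as CrossEntropyLoss applied to the raw logits:

logits = torch.randn(4, 10)                # a batch of 4 examples over a vocabulary of 10
targets = torch.randint(0, 10, (4,))
loss_a = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits), targets)
loss_b = nn.CrossEntropyLoss()(logits, targets)
assert torch.allclose(loss_a, loss_b)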

class Word2VecSkipGramModel(nn.Module):
	def __init__(self, vocab_size, embedding_dim):
		super(Word2VecSkipGramModel, self).__init__()
		self.embeddings = nn.Embedding(vocab_size, embedding_dim)
		self.out_layer = nn.Linear(embedding_dim, vocab_size)
		self.activation_function = nn.LogSoftmax(dim=-1)

	def forward(self, center_word_idx):
		hidden_layer = self.embeddings(center_word_idx)
		out_layer = self.out_layer(hidden_layer)
		log_probs = self.activation_function(out_layer)
		return log_probs
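
A quick sanity check of the shapes (the sizes are arbitrary and the variable names are used only in this standalone example):

demo_model = Word2VecSkipGramModel(vocab_size=100, embedding_dim=10)
demo_log_probs = demo_model(torch.tensor([3, 7]))  # a batch of two center-word indices
print(demo_log_probs.shape)                        # torch.Size([2, 100])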

Model training

General approach:

  1. Initialization – first, the word vectors are initialized with random values.

  2. Context prediction – for each word in the training corpus, the model uses its vector representation (the output of the first layer of the network) to predict the words in its context (via the output of the second layer and the softmax function).

  3. Optimization – The loss function is optimized to improve context predictions. This updates the word vectors during the training process.

  4. Iteration – the process is repeated over several training epochs.

def train_model(data, word_to_ix, vocab_size, embedding_dim=50, epochs=10, batch_size=1):
	# Build the dataset and dataloader
	dataset = SkipGramModelDataset(data, word_to_ix)
	dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
	# Model
	model = Word2VecSkipGramModel(vocab_size, embedding_dim)
	# Loss function
	loss_function = nn.NLLLoss()
	# Optimizer
	optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

	for epoch in range(epochs):
		total_loss = 0
		for center_word, context_word in dataloader:
			model.zero_grad()
			log_probs = model(center_word)
			loss = loss_function(log_probs, context_word)
			loss.backward()
			optimizer.step()
			total_loss += loss.item()
		print(f'Epoch {epoch + 1}, Loss: {total_loss}')
	return model
# Main entry point
def train(data: str):
	# Hyperparameters:
	# window size
	window_size = 2
	# embedding dimension
	embedding_dim = 10
	# number of training epochs
	epochs = 5
	# batch size
	batch_size = 1

	# Data preprocessing
	ngramm_data, word_to_ix, vocab_size = prepare_data(data, window_size)
	# Build and train the model
	model = train_model(ngramm_data, word_to_ix, vocab_size, embedding_dim, epochs, batch_size)

	# Extract the word vectors from the model
	embeddings = model.embeddings.weight.data.numpy()
	# Build a dictionary mapping each word to its vector representation
	ix_to_word = {i: word for word, i in word_to_ix.items()}
	w2v_dict = {ix_to_word[ix]: embeddings[ix] for ix in range(vocab_size)}
	return w2v_dict

The hyperparameters are selected for educational purposes only.

# Test data
test_text="Captures Semantic Relationships: The skip-gram model effectively captures semantic relationships between words. It learns word embeddings that encode similar meanings and associations, allowing for tasks like word analogies and similarity calculations. Handles Rare Words: The skip-gram model performs well even with rare words or words with limited occurrences in the training data. It can generate meaningful representations for such words by leveraging the context in which they appear. Contextual Flexibility: The skip-gram model allows for flexible context definitions by using a window around each target word. This flexibility captures local and global word associations, resulting in richer semantic representations. Scalability: The skip-gram model can be trained efficiently on large-scale datasets due to its simplicity and parallelization potential. It can process vast amounts of text data to generate high-quality word embeddings."

w2v_dict = train(test_text)

We create a dataset of center words and their context words, forming the model inputs and training targets.
The Word2VecSkipGramModel takes the index of a center word and returns the log probabilities for every word in the vocabulary.
In the training function train_model we use NLLLoss to compute the loss between the predicted log probabilities and the index of the true context word.
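
Once training finishes, the returned dictionary can be used directly. For example, here is a small sketch of measuring the similarity between two words (both occur in the test text; the actual value depends on the random initialization and the tiny corpus, so it is only illustrative):

def cosine_similarity(a, b):
	# Cosine of the angle between two embedding vectors
	return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(w2v_dict["word"], w2v_dict["model"]))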

CBOW:

In this approach, the model predicts the current word based on the context around it. This means that several words from the context of the current word are fed to the model’s input, and the model learns to predict this current word.

Major changes

  1. Data preparation

We change the data preparation function so that it creates training examples consisting of context words as input and a central word as target:

def prepare_data_cbow(text: str, window_size=2):
	text = re.sub(r'[^a-z@# ]', '', text.lower())    
	tokens = text.split()    
	vocab = set(tokens)
	word_to_ix = {word: i for i, word in enumerate(vocab)}
	data = []
	for i in range(window_size, len(tokens) - window_size):
		context = [tokens[i - j - 1] for j in range(window_size)] + [tokens[i + j + 1] for j in range(window_size)]
		target = tokens[i]
		data.append((context, target))
	return data, word_to_ix, len(vocab)	
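
For example, with a window size of 2, prepare_data_cbow produces examples of the following form (indicative only):

cbow_examples, word_to_ix, vocab_size = prepare_data_cbow("blue sky above the head", window_size=2)
# cbow_examples contains entries such as (['sky', 'blue', 'the', 'head'], 'above')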

class CBOWDataset(Dataset):
	def __init__(self, data, word_to_ix):
		# Convert (context words, target word) examples into index form
		self.data = [([word_to_ix[w] for w in context], word_to_ix[target]) for context, target in data]

	def __len__(self):
		return len(self.data)

	def __getitem__(self, idx):
		context, target = self.data[idx]
		return torch.tensor(context, dtype=torch.long), torch.tensor(target, dtype=torch.long)

  2. Changing the model architecture

Modify the model so that it takes context words and predicts the center word:

class Word2VecCBOWModel(nn.Module):
	def __init__(self, vocab_size, embedding_dim):
		super(Word2VecCBOWModel, self).__init__()
		self.embeddings = nn.Embedding(vocab_size, embedding_dim)
		self.out_layer = nn.Linear(embedding_dim, vocab_size)
		self.activation_function = nn.LogSoftmax(dim=1)

	def forward(self, context_word_idxs):
		# Average the embeddings of the context words to get the hidden-layer vector
		hidden_layer = torch.mean(self.embeddings(context_word_idxs), dim=1)
		out_layer = self.out_layer(hidden_layer)
		log_probs = self.activation_function(out_layer)
		return log_probs

  3. Updating the training function

The training function will need to be adapted to work with the new data format and CBOW model:

def train_model_cbow(data, word_to_ix, vocab_size, embedding_dim=50, epochs=10, batch_size=1):
	dataset = CBOWDataset(data, word_to_ix)
	dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
	model = Word2VecCBOWModel(vocab_size, embedding_dim)
	loss_function = nn.NLLLoss()
	optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
	for epoch in range(epochs):
		total_loss = 0
		for context_words, target_word in dataloader:
			model.zero_grad()
			log_probs = model(context_words)
			loss = loss_function(log_probs, target_word)
			loss.backward()
			optimizer.step()
			total_loss += loss.item()
		print(f'Epoch {epoch + 1}, Loss: {total_loss}')
	return model
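
A minimal driver for the CBOW variant, mirroring the train() function above (the hyperparameters are illustrative only):

cbow_data, word_to_ix, vocab_size = prepare_data_cbow(test_text, window_size=2)
cbow_model = train_model_cbow(cbow_data, word_to_ix, vocab_size, embedding_dim=10, epochs=5, batch_size=1)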

Performance Improvements

To improve the quality of the Word2Vec model, you can apply a number of methods and techniques:

  1. Increasing the volume of training data

    • More text data: A larger and more diverse training corpus can help the model better understand different contexts of word usage and improve the quality of vector representations.

  2. Data preprocessing

    • Tokenization: Efficiently breaking text into words, sentences, and other meaningful units.

    • Removing stop words: Excluding frequently occurring words that may not carry a significant semantic load (for example, prepositions, conjunctions).

    • Lemmatization and Stemming: Reducing words to their base form can help reduce vocabulary size and noise.

    • Using n-grams: Training a model on phrases or combinations of words (for example, “New York” instead of “New” and “York” separately) can improve the quality of embeddings for compound terms.

  3. Tuning Hyperparameters

    • Vector size: Increasing the size of the vector representation of words can improve quality by capturing more semantic nuances, but also increases the computational and memory requirements.

    • Context window size: Experimenting with window size can help fine-tune the balance between learning about immediate context and broader contextual relationships.

    • Update frequency (subsampling): Ignoring overly frequent words during training can improve the overall quality of the model.

    • Number of epochs: Increasing the number of passes through the dataset can help the model learn better, but at the risk of overfitting.

  4. Negative Sampling and Hierarchical Softmax

    • Number of negative samples: The number of negative samples drawn for each positive example affects both the speed and the quality of training (a minimal sketch of a negative-sampling loss is shown after this list).

    • Hierarchical softmax: Can speed up training for very large vocabularies by computing probabilities more efficiently.

  5. Using Ensembles and Multimodal Data

    • Model ensembles: Combining the predictions of multiple models can improve the overall quality of the embeddings.

    • Multimodal learning: Integrating information from multiple sources (text, images, audio) can help create richer and more varied representations of words.

  6. Feedback and iterative improvement

    • Quality assessment: Regular testing of the model on tasks close to the target application will help identify weaknesses and areas for improvement.

    • Iterative Improvement: Continuously adding new data and retraining the model based on feedback can continually improve its quality.

The application of these methods and techniques requires experimentation and may vary depending on the specific task and the available data.
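
As promised above, here is a minimal sketch of how a negative-sampling objective could be wired up in PyTorch. It is illustrative only and is not part of the training code in this article: the full-softmax output layer is replaced by a second embedding table, each (center, context) pair is scored against a few randomly drawn negative words, and the module returns the loss directly. For simplicity the negatives are sampled uniformly, whereas the original word2vec samples from a unigram distribution raised to the power of 0.75.

class SkipGramNegativeSampling(nn.Module):
	def __init__(self, vocab_size, embedding_dim, num_negatives=5):
		super().__init__()
		self.in_embeddings = nn.Embedding(vocab_size, embedding_dim)   # center-word vectors
		self.out_embeddings = nn.Embedding(vocab_size, embedding_dim)  # context-word vectors
		self.num_negatives = num_negatives
		self.vocab_size = vocab_size

	def forward(self, center, context):
		# center, context: LongTensors of shape (batch,)
		v = self.in_embeddings(center)                            # (batch, dim)
		u_pos = self.out_embeddings(context)                      # (batch, dim)
		# Draw negative words uniformly at random (a simplification, see the note above)
		negatives = torch.randint(0, self.vocab_size, (center.size(0), self.num_negatives))
		u_neg = self.out_embeddings(negatives)                    # (batch, k, dim)
		pos_score = torch.sum(v * u_pos, dim=1)                   # (batch,)
		neg_score = torch.bmm(u_neg, v.unsqueeze(2)).squeeze(2)   # (batch, k)
		# Pull real context words closer, push sampled negatives away
		return -(nn.functional.logsigmoid(pos_score).mean() + nn.functional.logsigmoid(-neg_score).mean())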

Conclusion:

Word2Vec was a breakthrough in the field of NLP, offering an efficient way to extract and represent semantic and syntactic relationships between words as vectors. Its embeddings are used in a variety of NLP tasks and remain an important tool for working with text data to this day.
