A Beginner's Guide to Using Language Models

To work with natural human language, special models are used: language models. They can grasp the content of a text, continue sentences, and carry on a meaningful dialogue.

Together with data scientist and bioinformatician Maria Dyakova, we prepared a detailed guide on how the most popular language models are structured and what you need to know to start working with them.

Maria Dyakova

Senior Data Scientist and Bioinformatician
at TargetGene Biotechnologies

What are language models?

A language model is an algorithm that analyzes text, understands its context, and processes and generates new text. It is based on nonlinear and probabilistic functions with whose help the model predicts which word may come next in the text: it calculates a probability for each of the possible words.

For example, if the input sentence is “the weather is nice today”, a well-trained model is expected to continue the sentence with “it is warm and sunny outside”.
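To make this concrete, here is a minimal sketch of how such next-word probabilities can be inspected with the publicly available GPT-2 model from the Hugging Face transformers library (the same library used in the example later in this article); the prompt and model choice are purely illustrative:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# load a small pretrained model; enough to illustrate the idea
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The weather is nice today,", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # a score for every word in the vocabulary

# turn the scores for the next position into probabilities and show the top 5 candidates
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, 5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>12}  {p.item():.3f}")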

Language models are usually based on neural networks trained on a large amount of text information. This could be books, Wikipedia articles and dictionaries, forum posts, and much more. The expected result for a model depends on what exactly it was trained on.

For example, if you train a model on literature about Africa, it is quite possible that the expected response to the query “the weather is good today” will be “it is not hot today and it is raining.” And if the training dataset is articles on meteorology, the expected result may look like “temperature +23°, air humidity 60%.”

Tasks of language models

The main task of a language model is to “understand” the text based on the patterns in the data and generate a meaningful response. Thanks to fine-tuning, it can also be used for other tasks, for example, classification or NER (Named Entity Recognition), the recognition of named entities in text.

Here are some examples of what you can do with language models (a short code sketch follows the list):

  • analyze the tonality of texts, such as reviews in online stores;

  • sort news by category, say, “Finance” or “Society”;

  • detect and filter spam;

  • find key ideas in the text, for example, formulate a summary of a scientific article;

  • highlight names, addresses, product names and prices in the text – for example, to automatically fill databases, etc.
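As an illustration of the first and last items, here is a small sketch using ready-made pipelines from the Hugging Face transformers library; the default models are downloaded automatically, and the example texts are made up:

from transformers import pipeline

# sentiment analysis: classify a store review as positive or negative
sentiment = pipeline("sentiment-analysis")
print(sentiment("The delivery was fast and the product works great!"))

# named entity recognition: pull names, organizations, and places out of the text
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Maria ordered a laptop from Lenovo's store in Berlin."))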

In addition, language models can independently generate meaningful texts in response to a query. For example, there have already been cases where a model generated the plot of a book or the text of a thesis.

What are large language models?

A separate class of language models can be distinguished: LLMs, or large language models. These include the GPT and BERT model families that are popular today. Among the characteristic features of LLMs are:

  • very large size – such models use more than a billion parameters. The most famous LLMs have hundreds of billions of them;

  • training on a huge amount of input data – for example, 50 billion web pages from the Common Crawl database;

  • large computing resources required to create and train such a model;

  • the ability to process input data in parallel rather than sequentially.

Due to their size and architectural features, LLMs are more flexible. The same model can be used to generate code, simulate live dialogue, or create stories. A good example is the well-known ChatGPT.

Structure of language models

The structure of a language model depends on the mathematical approach used in its creation. There is no single structure: different approaches have been used in different years. The first language models were statistical, based on the probabilistic apparatus of Markov chains. Later ones were based on recurrent neural networks (RNNs), a type of neural network designed to process sequential data.

Modern large language models like BERT or GPT are based on an architecture called the transformer. This architecture turned out to be the most efficient and gives better results than statistical or RNN models.

A transformer is a mathematical model that consists of two parts, an encoder and a decoder:

  • The encoder encodes the input text, transforming it into a vector of numbers that describes the original data as accurately as possible.

  • The decoder converts the numeric vector back into text or another semantic representation that is required from the model. For example, this could be the category to which the input text belongs – fiction, a scientific article, etc.

The internal vector that the model works with describes the relationships between the source data and allows the model to process and generate text.

A simplified diagram of the transformer looks like this. Source

The operation of the transformer is based on the attention mechanism. This means that words in the text are not considered in isolation but in context: the meaning of a word depends on the words around it, its position in the sentence, and the frequency of specific word combinations. Thanks to this mechanism, the language model is able to analyze the text deeply and recognize its meaning, much as a person does.

Inside the encoder and decoder are different combinations of attention layers and feedforward neural networks. The attention layers determine the context and connections between tokens. They are based on three matrices:

  • Query matrix (Q) analyzes a word in the context of other words.

  • Key matrix (K) checks how a particular word relates to the input query.

  • Value matrix (V) defines what a word means not in the context of a sentence, but for the language as a whole.

Feedforward neural networks are placed after the attention layers. They add nonlinear transformations to the data – turning the computed data for each word into an N-dimensional vector.

There are also residual connections between layers that carry information from previous layers forward. They help avoid losing important information as the data passes through each layer.
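To make this more concrete, here is a minimal sketch of scaled dot-product attention followed by a feedforward block and a residual connection, written in PyTorch; the dimensions and layer names are illustrative and do not reproduce the exact layout of BERT or GPT:

import torch
import torch.nn.functional as F

d_model = 64                      # size of the token vectors (illustrative)
x = torch.randn(5, d_model)       # 5 tokens already turned into vectors

# Q, K and V are produced from the same input by three learned projections
W_q, W_k, W_v = (torch.nn.Linear(d_model, d_model) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

# attention weights: how strongly each token attends to every other token
scores = Q @ K.T / (d_model ** 0.5)
weights = F.softmax(scores, dim=-1)
attended = weights @ V            # context-aware token vectors

# a feedforward block adds a nonlinear transformation;
# the residual connection keeps information from the previous layer
ff = torch.nn.Sequential(
    torch.nn.Linear(d_model, 4 * d_model),
    torch.nn.ReLU(),
    torch.nn.Linear(4 * d_model, d_model),
)
out = attended + ff(attended)
print(out.shape)                  # torch.Size([5, 64])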

Before loading into the encoder, the input data goes through tokenization and embedding layers:

  • Tokenization is a process in which each word or character in the input text is assigned its own unique ID. The model receives a set of IDs as input, not the raw text.

  • Embedding is the conversion of the set of IDs into semantic vectors on the first layer of the language model. By themselves, the IDs only match numbers to words. Embedding converts them into vectors in such a way that words with similar meanings end up closer to each other in the vector space.

For example, the words “rain”, “sun”, “wind” will most likely be close to each other in vector space, because they all describe the weather. And unrelated words like “sun”, “computer”, “dog” will be far from each other. True, a lot depends on the training of the model. If it was trained on texts where the sun, computer and dog are mentioned in the same context, it can recognize them as semantically close words.
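This effect can be checked directly on the embedding layer of a pretrained model. Here is a minimal sketch with bert-base-uncased; sub-word vectors are averaged to get one vector per word, and the exact similarity values depend on the trained weights, as noted above:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
embedding_layer = model.get_input_embeddings()   # maps token IDs to vectors

def word_vector(word):
    ids = tokenizer.encode(word, add_special_tokens=False)
    return embedding_layer(torch.tensor(ids)).mean(dim=0)

cos = torch.nn.CosineSimilarity(dim=0)
print(cos(word_vector("rain"), word_vector("sun")).item())       # related words: usually higher
print(cos(word_vector("rain"), word_vector("computer")).item())  # unrelated words: usually lower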

A separate type of embeddings is positional. These are layers that determine the position of a word in the semantic vector based on its position in the sentence. They are useful in situations where a word changes meaning depending on its position.

For example, take two English phrases: “I can take” and “I take a can”. The words are the same, but in the first case can is a modal verb (“to be able to”), while in the second it is a noun (a container). Incidentally, the word take in the first phrase, depending on the rest of the sentence, can mean not only “to take” but also “to endure”. Positional embeddings help the model recognize such cases.
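For illustration, here is the sinusoidal positional encoding from the original transformer paper; BERT actually learns its positional embeddings during training, but the purpose is the same: the same word at different positions receives a different additive signal:

import math
import torch

def positional_encoding(seq_len, d_model):
    # each position gets a unique pattern of sines and cosines
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe[0][:4])   # signal added to the word at position 0
print(pe[3][:4])   # a different signal for the same word at position 3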

The complete diagram of the transformer model looks like this. This image is from the article that first described the new architecture. It was published in 2017 and marked the beginning of the era of transformers.

How Language Models Work

The operating principles may differ depending on the architecture, but in this article we will consider transformer-based models as the most relevant today. Two well-known model families, BERT and GPT, work on the same principle: they predict the hidden word that is most likely in a given context.

  1. The input text is tokenized and passed through the embedding layer.

  2. After that, it is loaded into the encoder, where it is passed through attention layers and fully connected layers in turn. At this stage, the input data is analyzed and important tokens are highlighted.

  3. From the encoder, the data goes to the decoder. The decoder receives the context information collected by the encoder and, based on it, generates new tokens — predicts based on the previous ones.

  4. At the output, the transformer produces a set of probabilities that are converted into words.

The difference between BERT and GPT lies in how they process data. In the former, the main work is done by encoders, while the latter is built on decoders. In practice, this means the following (a short sketch follows the list):

  • BERT predicts a word within a sentence and takes into account all surrounding words before and after the hidden one. It is most often used in pair-finding, classification, and transformation tasks for existing text data.

  • GPT always predicts the next word and only pays attention to the previous words in the sentence. The output is a set of probabilities for each hidden word. The model is more convenient to use for generating new texts from scratch.
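The difference is easy to see with ready-made pipelines from the transformers library; the models and prompts below are only for illustration:

from transformers import pipeline

# BERT-style: predict the hidden word using context on both sides of the mask
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The weather is [MASK] today."))

# GPT-style: continue the text, looking only at the words that came before
generator = pipeline("text-generation", model="gpt2")
print(generator("The weather is nice today,", max_new_tokens=10))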

The main difference between the models is in the direction of data processing. Source

How language models are trained

The learning process can be divided into several stages:

1. Preparing the dataset. Huge text databases are used to train language models. If the model is highly specialized, then the data for it is taken in a certain format (for example, scientific articles on a specific topic or comments on the Internet). The well-known ChatGPT was trained on data of very different formats to become universal.

It's not enough to collect data — it needs to be cleaned and prepared for loading into the model. Cleaning involves removing personal data, prohibited or incorrect information from a huge array of information. Otherwise, the model won't be usable: imagine a business chatbot that curses customers with obscenities.

Here's an example of what can happen if the data isn't cleaned up properly. T-Bank's chatbot “Oleg” was trained on open-source data. In early versions, it could be rude to, or even threaten, the clients who contacted it.

The cleaned data is prepared for loading: it is tokenized, and in the case of BERT, some words in phrases are replaced with a mask token. The model must learn to predict which words stand in place of the mask. After this, the dataset is divided into training, validation, and test sets; a small sketch of this preparation follows.
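Here is a rough sketch of this preparation step using the datasets and transformers libraries: a toy corpus is tokenized, split, and passed through a collator that randomly masks tokens. The texts and the 80/20 split are illustrative:

from datasets import Dataset
from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# a toy corpus; real training uses millions of documents
corpus = {"text": [
    "Transformers are amazing!",
    "The weather is nice today.",
    "Language models predict the next word.",
    "Tokenization turns text into IDs.",
    "Embeddings map IDs to vectors.",
]}
dataset = Dataset.from_dict(corpus).map(
    lambda example: tokenizer(example["text"], truncation=True),
    remove_columns=["text"],
)

# split off a held-out part for validation/testing; 80/20 is just an example
split = dataset.train_test_split(test_size=0.2, seed=42)
train_data, held_out = split["train"], split["test"]

# the collator randomly replaces ~15% of tokens with [MASK] and builds the labels
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([train_data[i] for i in range(len(train_data))])
print(batch["input_ids"][0])   # some IDs may be replaced with the mask token
print(batch["labels"][0])      # -100 everywhere except the masked positions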

2. Loading into the model. The prepared data is passed to the model for training. It is trained on each part of the dataset in turn:

  • On the training set. This sample is a set of examples that should show the model the distribution of connections between words. While training on it, the model adjusts its vectors and forms its own “representation” of the relationships between words;

  • On the validation set. This part of the dataset is used after training on the training sample: it shows how much the accuracy of the model has changed at different stages of training. Validation can be carried out several times to check at which point the model produces the best results;

  • On the test set. Test data is needed after training and validation are completed. It is used for a final check of the already trained model. Sometimes such sets are deliberately composed of examples that are not in the training and validation samples – to see how the model behaves when working with unfamiliar data.

The training itself follows roughly the same principle as for classical neural networks. It consists of three stages (a minimal code sketch follows the list):

  1. Forward pass through the network – the model processes the input and makes a prediction;

  2. Calculating the error – the model checks how correct its predictions were and computes the deviation from the correct values;

  3. Backward pass (backpropagation) – the model propagates the calculated error across the layers and adjusts its weights based on it to make more accurate predictions in the future.
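In PyTorch these three stages look roughly as follows; a tiny linear model and random data are used purely to show the structure of one training step:

import torch

model = torch.nn.Linear(10, 2)                         # a toy model instead of a transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))   # random "data" and labels

outputs = model(x)            # 1. forward pass: the model makes its predictions
loss = loss_fn(outputs, y)    # 2. calculate the error against the correct values
loss.backward()               # 3. backward pass: propagate the error through the layers
optimizer.step()              # adjust the weights based on the error
optimizer.zero_grad()         # reset the gradients before the next step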

The difference from conventional neural networks is that the stages are more computationally demanding: the layers are structured in a more complex way, and the processes occurring inside the model can be nonlinear. The model itself is much larger than classical neural networks, and huge datasets are used during training – all of this increases the risk of exploding gradients, overfitting, underfitting, and other problems typical of neural networks.

Therefore, the main feature of training language models is the need to tune the training strategy particularly carefully in order to avoid errors. Otherwise, the approach to training remains the same structurally and conceptually.

Training methods may also vary. The main ones include:

  • Pre-training on large text corpora. It is used to teach a model to understand language in general rather than specific topics. For example, a developer's task is to teach a model to understand articles on genetics in Russian, but there are not enough high-quality articles on this topic to train a large model. Therefore, the model is first trained on “regular” text data of different formats and then further trained on the specific ones.

  • Fine-tuning. This is further training of an existing model for a specific task. For example, a chatbot that already knows the language in general is further trained so that it understands youth slang, or an algorithm is trained to understand and analyze reviews on a website.

  • Prompt engineering. This is how already working models are adjusted and tuned: the adaptation happens through the requests themselves. Instructions for the model are formulated so that it produces the desired result; for example, data is fed to the input in a certain format for which the model gives a clearer answer.

  • Data augmentation. This is a variant of additional training using an artificially enriched data set. For example, for biological problems, texts are not fed into the model as-is but are first enriched with the names of genes and molecules. This teaches the model to recognize and understand specific terms.

  • Reinforcement learning. This method trains a model to generate text based on rewards. The model is given a “reinforcement” if the output looks a certain way. This helps, for example, to tune dialogue models to make speech sound more natural.

Pre-trained models are often used to solve real-world problems. They have already been trained on big data and understand the language in general. All that remains is to further train them on specific datasets, for example, using data augmentation – this will help solve specialized problems.

Building a language model

A simple model can be built from scratch on your own, but more often ready-made ones are used – BERT, GPT, and others. They are adapted to a specific task, while the structure and operating principle remain unchanged. To do this, standard models are loaded from specialized libraries such as Hugging Face Transformers, which runs on top of PyTorch or TensorFlow – often already pre-trained, with existing basic settings.

For example, this is what building and training a BERT model looks like:

from transformers import BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments
from datasets import Dataset
import torch

# Convert a sentence into tokens
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = "Transformers are amazing!"
tokens = tokenizer.tokenize(sentence)
print(f"Tokens: {tokens}")

# Prepare the training data
# Create a training example by masking one word
masked_sentence = "Transformers are [MASK]!"
input_ids = tokenizer.encode(masked_sentence, return_tensors="pt")

# Labels contain the IDs of the original (unmasked) sentence;
# all non-masked positions are set to -100 so they are ignored by the loss
labels = tokenizer.encode(sentence, return_tensors="pt")
labels[input_ids != tokenizer.mask_token_id] = -100

# Example of building a custom dataset
# A single example is used here for illustration
train_dataset = Dataset.from_dict({
    'input_ids': [input_ids[0].tolist()],
    'labels': [labels[0].tolist()]
})

# Load a pre-trained BERT model
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Set the training parameters
training_args = TrainingArguments(
    output_dir="./results",                # folder for output files
    learning_rate=5e-5,                    # learning rate
    per_device_train_batch_size=8,         # batch size
    num_train_epochs=3,                    # number of epochs
    logging_dir="./logs",                  # folder for logs
)

# Create the trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Train the model
trainer.train()

# Save the trained model
model.save_pretrained("./my_bert_model")
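After training, the saved model can be checked with a fill-mask pipeline. Note that three epochs on a single example will hardly change the pretrained predictions, so this is only a sanity check; the paths match the example above:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="./my_bert_model", tokenizer="bert-base-uncased")
print(fill_mask("Transformers are [MASK]!"))   # top candidates for the masked word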

In what areas and why are language models used?

Language models, particularly BERT and GPT, are the gold standard for natural language processing (NLP) tasks. In fact, they are the primary tool for solving them.

Here are the areas in which natural language processing is most often used:

  • Science. Language models generate abstracts — brief summaries of scientific articles that are published before the main text. Models also help in searching for scientific texts, classifying articles, processing research results, and much more;

  • Medicine. Models are used to search for specialized texts, analyze symptoms, and sometimes for diagnostics. For example, in 2018, researchers at the University of Pennsylvania developed a language model that recognizes signs of depression in people based on their social media posts. The accuracy was about 70% and depended on how active the person was on social media.

  • Creation of digital services. A huge number of IT solutions work with the help of language models: from search engines and translators to chatbots in social networks. For example, large companies create assistant bots with their own character and manner of speech.

  • Marketing. Language models are used to generate content plans, ideas for articles and stories, advertising posts and banners. They are used to come up with slogans and even names for new brands.

Most often, ready-made models are used in these areas. Own ones are mainly developed in two cases: when solving highly specialized problems and during research into the creation of new architectures.

How models may evolve in the future

Since GPT-3, the use of language models has increased dramatically and their development has accelerated. This became especially noticeable after the release of OpenAI's chatbot ChatGPT. The technology is becoming more powerful, and the possibilities for training ML models are growing wider. Most likely, this trend will continue. Here are just some of the possible directions for the development of language models:

  • The emergence of new architectures and algorithms. Research in this area is conducted by both non-profit organizations and large brands. Perhaps new, more effective structures will emerge, like the transformer once was.

  • Solving more abstract problems. In the future, language models will likely be tailored to goals rather than specific actions. For example, the model's task will no longer be “generate a work plan” but “optimize repair costs.”

  • Penetration into all spheres of life. Chatbots for searching and analyzing will become as common as online translators are now. The models' results will be more accurate, and they will be used more actively to search for information or generate ideas.

How to learn to work with language models

To work with language models, the following skills and knowledge are required:

  • Programming: Proficiency in Python and working with libraries such as TensorFlow and PyTorch.

  • Mathematics: understanding of linear algebra, probability theory and statistics, which underlie the operation of machine learning algorithms.

  • Natural Language Processing (NLP) Theory: knowledge of model architecture, principles of their operation, data preparation and model optimization.

These skills are most easily learned at colleges, where curricula encourage students to study computer science, mathematics, and machine learning in a sequential manner.

For example, you can master NLP, a popular area of Data Science, in the joint master's program from TSU and Skillfactory. Students study disciplines that develop linguistic and mathematical thinking for solving practical problems in the field of speech technologies, and they also practice applying NLP in various IT areas.

Skills can also be acquired in special courses or independently – with due desire and motivation.

At the same time, companies are primarily interested in a specialist's practical experience. A specific background is especially useful: for example, if a company works in the medical field, knowledge of biology or medicine may matter more than deep IT expertise, because setting up and training specialized models requires an understanding of the data the model analyzes.

To practice working with language models, basic knowledge of Python and the basics of at least one ML library is enough. You also need to understand the basic concepts of NLP and be able to prepare data.

Useful materials for independent study:

  • Denis Rothman, Transformers for Natural Language Processing: a book that examines modern natural language processing models in detail.

  • Rao and McMahan, Introduction to PyTorch: a practical guide to one of the leading libraries for developing deep learning models.

  • Tutorials and documentation from the developers: on the TensorFlow and PyTorch websites you can find many educational materials that will help you deepen your knowledge of machine learning and NLP.
