How programmers taught computers to talk

Neural networks today write news stories, ad copy, poems and even entire scripts. How did we get here? This is the story of language models – from a 1960s psychotherapist simulator to the first neural networks of the early 21st century.

The biggest news in artificial intelligence over the past few years is that neural networks have become far better at understanding language. Publicly available services like ChatGPT and DALL-E can parse long, complex text queries and produce a meaningful result – new text, an image or a video – and it all starts with a request in “human” language.

This functionality became possible thanks to language models – a type of statistical model used in natural language processing. Their task is to estimate the probability of a sequence of words in a text or sentence. During training, a language model learns to capture the probabilistic relationships between words in a language and to predict the next word based on the previous ones.

In fact, we were introduced to language models long before ChatGPT. You may remember T9 – it helped type text messages on push-button phones. T9 looked at the part of the word you had already entered, found it in a dictionary of the most frequently used words, and suggested what you probably wanted to write. Modern language models are faster and more relevant, and no longer suggest something obscene instead of “Yulia”, but the essence of their work remains the same.
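As a rough illustration of that idea (not the actual T9 implementation), a prefix lookup over a frequency-ranked dictionary might look like the sketch below; the word list and frequency counts are made up for the example.

```python
# A toy T9-style suggester: given the typed prefix, return the most
# frequent dictionary words that start with it. The word list and
# frequency counts below are invented for illustration.
FREQUENCY_DICT = {
    "hello": 950, "help": 870, "held": 120,
    "yes": 990, "yesterday": 410, "yellow": 300,
}

def suggest(prefix: str, limit: int = 3) -> list[str]:
    """Return up to `limit` candidate words, most frequent first."""
    candidates = [w for w in FREQUENCY_DICT if w.startswith(prefix)]
    return sorted(candidates, key=FREQUENCY_DICT.get, reverse=True)[:limit]

print(suggest("he"))   # ['hello', 'help', 'held']
print(suggest("yes"))  # ['yes', 'yesterday']
```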

Modern AI applications use large language models, or LLMs. They are called large because of the sheer number of parameters – the connections between neurons – which run into the millions or billions.

Like simpler language models, LLMs use what they have learned to predict the next word that could continue a given phrase. These networks are trained on huge amounts of text, so they can produce far more accurate and detailed results than ever before. They also have so much information at their disposal that they can not only reproduce existing texts but also combine them into something new.

The modern capabilities of language models are amazing, but the principles behind them were laid down more than 75 years ago. Alan Turing, one of the founding fathers of AI, was among the first to talk about the possibility of machine learning. In 1947, at a lecture in London, Turing proposed the concept of a machine that could “learn from experience” and “change its instructions on its own.”

The idea of adaptability, which underlies large language models, proved much harder to implement than to articulate. After the end of World War II, projects in natural language processing became especially popular: countries began to interact more closely and to gather information about one another, and language understanding and translation became tasks of national importance.

However, the projects of the late 1940s and 1950s did not bring much success. No matter how much funding the US government and IBM poured in, natural languages proved too counterintuitive and ambiguous for the computers of that era. Simply put, humanity did not yet have hardware powerful enough to handle the scale of the idea.

A significant breakthrough came in 1966, when MIT researcher Joseph Weizenbaum created the world's first program that simulated human conversation – the great-grandmother of modern chatbots.

The ELIZA program took a text input from the user and matched it with a relevant question to keep the conversation going. The first published dialogue with ELIZA touched on relations between men and women and followed the script of a session with a psychotherapist – the program “listened” to what the “client” said and asked leading questions.

The similarity with psychotherapy is not accidental. Weizenbaum designed ELIZA this way because therapists follow a similar strategy: listen, then ask questions based on context. In 1973, ELIZA even had its first machine “patient” – the PARRY algorithm, which imitated the speech of a person with paranoid schizophrenia. The first ever “conversation” between two algorithms was extremely impolite and awkward: the computers started quarreling by the third line.

The conversations ELIZA held were unnatural: the algorithm could only produce text from pre-written templates and rigid logical rules devised by its developer. That was not enough to navigate the complex and counterintuitive world of natural language.

A solution was found only in the early 1990s, when computing power had grown significantly and researchers adopted a new approach to language processing. Instead of working with predefined rules, they began to use statistical models that learned from sample texts. Such algorithms were more flexible and could handle a much wider range of language patterns.

The new approach was built on Markov chains and hidden Markov models (HMMs). A Markov chain is a sequence of events in which each new event depends only on the previous one and ignores everything that came before it. In language models, this means the algorithm generates text by repeatedly choosing the word most likely to follow the last one.
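A minimal sketch of this idea (a toy, not a production model): count which word follows which in a sample text, then generate new text by repeatedly sampling a successor of the last word only.

```python
import random
from collections import defaultdict

def build_chain(text: str) -> dict[str, list[str]]:
    """Map each word to the list of words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain: dict[str, list[str]], start: str, length: int = 10) -> str:
    """Walk the chain: each step depends only on the previous word."""
    word, output = start, [start]
    for _ in range(length):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)  # more frequent successors are more likely
        output.append(word)
    return " ".join(output)

corpus = "the cat sat on the mat and the cat slept on the sofa"
print(generate(build_chain(corpus), start="the"))
```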

In the late 1990s, Markov chain-based algorithms were extended into models built on n-grams – combinations of several words. Such models look not only at the previous word in the chain but at a series of previous words: a bigram model, for example, predicts a word from the one before it, and a trigram model from the two before it. Markov chain ideas also underpinned Google's famous PageRank (PR) algorithm, developed in 1996, which ranked pages in search results based on the number and quality of links pointing to them.
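Extending the earlier sketch to n-grams only changes the key: instead of a single previous word, the context is a tuple of the last n−1 words. Below is a toy trigram version (two-word context); the corpus is again invented for the example.

```python
import random
from collections import defaultdict

def build_ngram_chain(text: str, n: int = 3) -> dict[tuple, list[str]]:
    """Map each (n-1)-word context to the words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i : i + n - 1])
        chain[context].append(words[i + n - 1])
    return chain

corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_ngram_chain(corpus, n=3)
context = ("sat", "on")
print(random.choice(chain[context]))  # 'the' -- the only word seen after "sat on"
```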

The n-gram algorithms captured more context, but the more complex the language patterns they had to predict, the worse they did. Combinations of five words repeat in texts far less often than combinations of two, so it was much harder for the algorithm to gather enough statistics to predict five-word n-grams. Researchers overcame this barrier with neural networks.

Neural networks do not have the word limit that plagued Markov chain algorithms: even in the early 2000s, they could analyze patterns in texts hundreds of phrases long. A neural network takes vector representations of the previous words as input: phrases are converted into numeric form (encoding), and the result is called an embedding.

In such a system, each phrase becomes a sequence of numeric vectors, one per word. The model evaluates the vectors for the user's input and decides which numbers (that is, which words) would be the most logical continuation of the phrase. Modern neural networks have “read” vast numbers of texts, which is why they give such accurate results: for most requests, something similar has most likely already been written, and the network only has to analyze and reproduce it.
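A heavily simplified sketch of the encoding step, with a made-up vocabulary and tiny random vectors (real embeddings are learned during training and have hundreds or thousands of dimensions): each word gets an index, and the index selects a row of an embedding matrix as that word's vector.

```python
import numpy as np

# Made-up vocabulary; real models use tens of thousands of tokens.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# Embedding matrix: one row (vector) per word. Real embeddings are
# learned during training; here they are just small random numbers.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))  # 4 dimensions for readability

def embed(sentence: str) -> np.ndarray:
    """Encode a phrase as a sequence of vectors (one per word)."""
    indices = [vocab[w] for w in sentence.split()]
    return embedding_matrix[indices]

vectors = embed("the cat sat")
print(vectors.shape)  # (3, 4): three words, each a 4-dimensional vector
```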

The first language model powered by a neural network was presented in 2003 by the Canadian computer scientist Yoshua Bengio and his colleagues. In 2010, Stanford released the CoreNLP toolkit, which provided key language-processing functionality such as sentiment analysis and named entity recognition.

However, these models still had a limited understanding of text – they missed many grammatical subtleties and struggled with context. As a result, their output often sounded unnatural and was not of very high quality.

Solving this problem led researchers to the kind of language models used in modern tools. The most famous of them, OpenAI's ChatGPT service, hints at the technology in its name: GPT stands for Generative Pre-trained Transformer.

A transformer is a type of neural network architecture designed to process sequences of data and the dependencies within them. Transformers build a numerical representation of each element in a sequence that retains important information about the element and its surrounding context. These representations can then be passed to further layers of the network to uncover hidden patterns and relationships in the input.

The main advantage of transformers is their ability to capture dependencies across long sequences. They are also highly parallelizable, processing many elements of a sequence at once. These features make them ideal tools for working with natural language.

The way transformers work with text is based on the so-called attention mechanism. The network focuses on the most informative words and weighs how strongly each word relates to the others. Unlike earlier neural networks, a transformer works with the whole text at once rather than word by word: at each step, the model considers everything it has processed so far and selects what matters most.
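A bare-bones sketch of scaled dot-product attention, the core of this mechanism (the dimensions and inputs are arbitrary toy values): each position scores every other position, the scores become weights via softmax, and the output is a weighted mix of all positions computed in parallel.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax: turn scores into weights that sum to 1."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention over a whole sequence at once."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # how much each token attends to every other token
    weights = softmax(scores)      # attention weights per token
    return weights @ v             # weighted mix of the value vectors

# Toy example: a "sequence" of 3 tokens, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
output = attention(x, x, x)        # self-attention: queries, keys, values from the same tokens
print(output.shape)                # (3, 4)
```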

Thanks to this, modern neural networks have a far deeper grasp of texts and of the quirks of natural languages. The circumstances for language models could hardly be better: they have more data than ever before, they run on colossal computing power, and transformer architectures let them process entire texts rather than short windows of words.

The capabilities of modern neural language models are impressive, but so far they work best with utilitarian texts. A neural network that has read millions of news reports about football matches can easily put together a new article in that genre. Artificial intelligence has not yet learned to create truly new things – it simply reproduces, very convincingly, what human authors have already created. Will we see AI that can create and think independently, as the science fiction writers of the last century imagined? Time will tell.
