How language models work

A jargon-free guide

From the original author: I’ve been working on a series of articles that tries to map the psychology and behavior of language models: compression, expansion, translation, and remixing. I realized that in order to write it, I needed a clear, low-level explanation of how language models work. In other words, I needed to write a prequel to my article on language models as text compressors, a Phantom Menace of the series, if you will (except, I hope, actually good). This is that article. – Dan Shipper

If we want to use large language models (LLMs) in our work and still call the results creative, we have to understand how they work – at least at a high level.

There are many excellent tutorials on the inner workings of language models, but they are all quite technical. (A notable exception is Nir Zicherman’s article in Every on how LLMs are like food.) That’s a shame, because there are only a few simple ideas you need to grasp to get a basic sense of what’s going on under the hood.

I decided to lay out these ideas for you — and for myself — in as jargon-free a manner as possible. The explanation below is intentionally simplified, but it should give you a good idea of how it all works. (If you want to go beyond the simplifications, I suggest pasting this article into ChatGPT or Claude.)

Ready? Let's get started.

Let's imagine that you are a language model.

Imagine that you are a very simple language model. We’ll give you one word, and your job is to predict the word that comes next.

I’m your trainer. My job is to quiz you. If you get a problem right, I’ll reach into your brain and tweak your neural wiring so that you’re more likely to do it again in the future. If you get it wrong, I’ll tweak it again, but this time to make sure you don’t repeat the mistake.

Here are some examples of how I want you to work:

If I say “Donald”, you say “Trump”.

If I say “Kamala”, you say “Harris”.

Now you. If I say “Joe”, what do you say?

Seriously, try to guess before you move on to the next paragraph.

If you guessed “Biden”, congratulations – you're right! Here's a little treat. (If you guessed wrong, I would slap you on the wrist).

This is how we train language models. There is a model (you) and there is a training program (me). The training program tests the model and adjusts it depending on how well it works.
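If it helps to see that loop written down, here is a toy sketch in code. It is nothing like how real models are trained (they adjust millions of neural-network weights, not a little score table), and the word pairs, scores, and nudge sizes below are all invented purely for illustration:

```python
# A toy sketch of the "model + trainer" loop described above.
# The "model" here is just a table of scores; real language models
# adjust neural-network weights instead. Everything below is invented
# for illustration.
import random
from collections import defaultdict

# The model: for each prompt word, a score for every candidate next word.
scores = defaultdict(lambda: defaultdict(float))
candidates = ["Trump", "Harris", "Biden"]

def predict(word):
    """Guess the highest-scoring next word (or guess randomly if we know nothing)."""
    known = scores[word]
    if not known:
        return random.choice(candidates)
    return max(known, key=known.get)

training_pairs = [("Donald", "Trump"), ("Kamala", "Harris"), ("Joe", "Biden")]

# The trainer: quiz the model, then nudge its "wiring" up or down.
for _ in range(30):
    prompt, answer = random.choice(training_pairs)
    guess = predict(prompt)
    if guess == answer:
        scores[prompt][guess] += 1.0   # a little treat: make this guess more likely
    else:
        scores[prompt][guess] -= 1.0   # slap on the wrist: make it less likely
        scores[prompt][answer] += 0.5  # and point toward the right answer

print(predict("Joe"))  # after a few rounds of quizzing, this settles on "Biden"
```

The details don’t matter; the shape of the loop does: quiz, compare, nudge, repeat.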

We've tested you on simple problems, so let's move on to something more challenging.

Predicting the next word isn't always that easy

If I say “Jack”, what will you say?

Try guessing again before moving on to the next paragraph.

Obviously you would say “of,” as in “Jack of all trades, master of none!” That’s what my mom was afraid I’d turn out to be if I didn’t focus on my studies.

What? That's not what immediately popped into your head? Oh, you thought “Nicholson”? Or maybe you thought “Black.” Or maybe “Harlow.”

Fair enough. Context changes which word we think comes next. The previous examples were famous politicians’ first names followed by their last names. So you, the language model, predicted that the next word in the sequence would be another celebrity’s last name. (If you thought of “rabbit,” “in the box,” or “and the beanstalk,” we might have to tinker with your brain!)

If you had more context before the word “Jack” — maybe a story about who I am, my upbringing, my relationship with my parents, and my insecurities about being a generalist — you’d be much more likely to predict “of.”

So how can we get you to the right answer? If we simply boosted your mental abilities—say, put all the computing power in the world into your brain—you still wouldn’t be able to reliably predict “of” from “Jack” alone. You’d need more context to understand which “Jack” we’re talking about.

This is how language models work. Before predicting the word that comes after “Jack,” they spend a lot of time asking, “Which ‘Jack’ are we talking about?” They keep at it until they’ve narrowed down “Jack” enough to make a good guess.

A mechanism called “attention” is responsible for this. Language models pay attention to every word in the prompt that might be related to the last word, and use those words to update their understanding of what that last word really is. Then they predict what comes next.
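Here’s a deliberately tiny sketch of that idea. The three-number “meaning vectors” below are made up, and real attention uses learned weights across thousands of dimensions, but the shape is the same: score how related each word in the prompt is to the last word, then blend the related ones into it:

```python
# A heavily simplified sketch of attention: the vector for the last word
# is updated by blending in the vectors of related words from the prompt.
# The three-number "meaning vectors" are invented for illustration.
import math

# Invented dimensions: [movies-ness, music-ness, politics-ness]
vectors = {
    "Nicholson": [0.9, 0.1, 0.0],
    "Johnson":   [0.1, 0.9, 0.0],
    "Biden":     [0.0, 0.0, 0.9],
    "Jack":      [0.3, 0.3, 0.1],   # ambiguous on its own
}

def attend(last_word, context):
    """Blend the context into the last word, weighting related words more heavily."""
    last = vectors[last_word]
    # Relatedness score: dot product between the last word and each context word
    # (scaled up a little so the differences are easier to see).
    raw = [5 * sum(a * b for a, b in zip(last, vectors[w])) for w in context]
    # Softmax turns raw scores into attention weights that sum to 1.
    exps = [math.exp(s) for s in raw]
    weights = [e / sum(exps) for e in exps]
    # The updated word: the old vector plus the attention-weighted context.
    mixed = [sum(wt * vectors[c][i] for wt, c in zip(weights, context))
             for i in range(len(last))]
    return [a + b for a, b in zip(last, mixed)]

# "Jack" pays more attention to "Nicholson" than to "Biden",
# so the blended vector leans toward the actor meaning.
print(attend("Jack", ["Nicholson", "Biden"]))
```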

This is the main idea of language models:

There are many more words in the English language than we realize.

You and I might see the word “Jack” on the page, but as an LLM, you’ll see something else.

To its left will be an invisible hyphen trailing a bunch of extra words, which you’ll carry around with you like a runaway kid with an invisible bindle.

You’ll also be encoding things like the part of speech, whether the word refers to something real or not, and millions of other details that we’d have a hard time putting into words.

All of this makes it much easier to predict what’s coming next. It takes the word “Jack” and turns it into a much more specific word — let’s call it a superword — that looks something like this:

“Jack-Nicholson-the-iconic-Hollywood-actor-with-the-legendary-grin-known-for-his-die-hard-Lakers-fandom-and-his-restless-charisma.”

A much more complex version of the above is probably a word that exists somewhere in GPT-4, and based on that word the model can make a list of likely things that will come next.

Now the question is: how does the model do this?

Language models have the biggest, coolest vocabulary you've ever seen.

As a language model learns, it builds a huge dictionary containing all of these very complex, made-up superwords. It builds this dictionary by reading the entire internet and forging superwords out of the concepts it encounters.

To do this, it uses the same attention mechanism we talked about earlier: it works through text, fragment by fragment, calculating the statistical relationships between the words it encounters. It then encodes those relationships as words in its superword dictionary.
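To make “statistical relationships between words” a bit more concrete, here is a crude sketch that simply counts which words appear near each other in a tiny invented snippet of text. Real models learn far richer representations than raw counts, so treat this as a cartoon of the idea, not the real method:

```python
# A crude sketch of collecting statistical relationships from text by
# counting which words appear near each other. The tiny "corpus" is invented.
from collections import Counter, defaultdict

corpus = (
    "jack nicholson is an actor . jack nicholson loves the lakers . "
    "jack johnson is a musician . joe biden is a politician ."
).split()

window = 2  # how far apart two words can be and still count as neighbors
cooccur = defaultdict(Counter)

for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooccur[word][corpus[j]] += 1

# The entry for "jack" records which words it tends to sit next to:
# a primitive ancestor of a superword.
print(cooccur["jack"].most_common(5))
```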

Once trained, when a language model receives a prompt, all it has to do is take the last word of the prompt and repeatedly ask it, “What word are you, really?” until it has built up a much more complex superword. It then looks up that superword in its huge dictionary, which helps it predict what usually comes after that word. This process repeats over and over: the model combines the prompt and its answer so far into a single word, and calls itself again with this new prompt. And so on until it reaches a stopping point and returns an answer. The language model’s complete response is a record of this journey.
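Written as code, that loop might look like the sketch below. The tiny lookup table standing in for the model is invented for illustration; a real model would compute the next word from its superwords rather than from a hard-coded table:

```python
# A minimal sketch of the generate-one-word-at-a-time loop described above.
# The lookup table standing in for the model is invented for illustration.
toy_model = {
    "Jack": "Nicholson",
    "Nicholson": "loves",
    "loves": "the",
    "the": "Lakers",
    "Lakers": "<end>",
}

def generate(prompt_words, max_steps=10):
    words = list(prompt_words)
    for _ in range(max_steps):
        next_word = toy_model.get(words[-1], "<end>")  # "What usually comes next?"
        if next_word == "<end>":
            break                  # reached a stopping point
        words.append(next_word)    # the answer so far becomes part of the new prompt
    return " ".join(words)

print(generate(["Jack"]))  # -> "Jack Nicholson loves the Lakers"
```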

There’s just one problem. What if the model encounters a superword that isn’t in its vocabulary? This could happen, for example, if words start combining in new ways that the language model never saw during training.

For example, it is well known that Jack Nicholson is a Lakers fan. What would happen if he suddenly gave up his fandom, became a Pacers fan and moved to Indianapolis? It's unlikely that the language model would have encountered this during training, which means it's unlikely to have superwords in its vocabulary that represent Jack Nicholson as a Pacers fan.

This is bad! We don't want our language model to fail if it encounters superwords it hasn't seen before. It's not that big of a conceptual leap to make Jack Nicholson a Pacers fan rather than a Lakers fan. How can we build a vocabulary that allows us to do this?

Language models have a great solution. Their dictionaries aren’t simply lists of words. Instead, they work like a map.

Mapping your language using math

New York City is laid out on a rectangular grid. If you start on 1st Street and walk north on 1st Avenue, you'll eventually end up on 14th Street. If we were to create a dictionary of all the streets between 1st and 14th, it would look something like this:

1st Street
2nd Street
3rd Street
etc.

But between 1st and 2nd Streets, there’s a whole block of different shops and restaurants. And they’re constantly changing. The individual spots on that block wouldn’t fit in a list like the one above, even though they very much exist.

Instead, we map the locations of shops and restaurants. And we locate ourselves using latitude and longitude. This way, we can move in smaller increments than a grid of street names alone would allow.

This is what language models do with the superwords they store in their vocabularies. As they learn, they plot all the superwords they create on a map. Words whose coordinates—or location—are closer to each other are closer in meaning. But superwords can exist between any two points on a map, just as you can visit any place between, say, 1st and 2nd streets, even if that particular address is not marked on the grid.

By representing superwords as coordinates on a map, language models can “know” about words that lie between known points, even if those specific words were not present in the training data.
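A toy version of that map might look like the sketch below. The two-number coordinates are invented (real models use thousands of dimensions), but they show the key trick: any point on the map, even one that no training text ever named, still has a nearest meaning:

```python
# A small sketch of the "map" idea: each superword is a point, distance
# reflects similarity, and points between known words are still valid
# locations. The 2-D coordinates are invented for illustration.
import math

coords = {
    "Jack-Nicholson-the-actor":  (2.0, 8.0),
    "Jack-Johnson-the-musician": (8.0, 8.5),
    "Joe-Biden-the-politician":  (5.0, 1.0),
}

def nearest(point):
    """Find the known superword closest to an arbitrary spot on the map."""
    return min(coords, key=lambda w: math.dist(point, coords[w]))

# A brand-new point between two known superwords still lands
# somewhere meaningful on the map.
somewhere_in_between = (6.0, 8.2)
print(nearest(somewhere_in_between))  # -> "Jack-Johnson-the-musician"
```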

The most interesting thing is that this map allows you to do math with meaning. Let’s return to the word “Jack.” If you move across the map from it in any direction, you’ll come across different versions of the word. For example, on a language model’s map there is a direction that corresponds to being an actor. The further you go in that direction, the more likely it is that the word you construct refers to an actor.

There is also a direction “musician”, which has the same property. The further you go in the direction of “musician”, the more likely it is that the word refers to a musician. If you subtract the “actor” direction from the word “Jack” and add the “musician” direction, the superword you create is much more likely to be “Jack Johnson” than “Jack Nicholson.”
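And here is that arithmetic as a sketch, using invented two-dimensional coordinates and made-up “actor” and “musician” directions. Real models do the same kind of thing in vastly higher-dimensional space:

```python
# A sketch of doing arithmetic with meaning. The coordinates and the
# "actor"/"musician" directions are invented for illustration.
import math

coords = {
    "Jack-Nicholson": (2.0, 8.0),
    "Jack-Johnson":   (8.0, 8.5),
}
actor_direction    = (-3.0, 0.0)   # moving this way makes a word more actor-like
musician_direction = (3.0, 0.5)    # moving this way makes it more musician-like

def move(point, minus, plus):
    """Subtract one direction and add another, coordinate by coordinate."""
    return tuple(p - m + a for p, m, a in zip(point, minus, plus))

def nearest(point):
    return min(coords, key=lambda w: math.dist(point, coords[w]))

# Start at Jack Nicholson, subtract "actor", add "musician"...
result = move(coords["Jack-Nicholson"], actor_direction, musician_direction)
print(nearest(result))  # ...and the nearest superword is now "Jack-Johnson"
```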

What language models tell us about language

The way language models work reveals some profound properties of the nature of language and reality.

They tell us that what happens next is a result of what came before. What’s past is prologue, as Shakespeare wrote.

They also tell us that this does not happen through a simple list of static rules. Instead, everything happens in a continuous space of possibilities, where every bit of what came before contributes to the meaning of the word, and therefore to what comes next. Every bit of context matters.

As we've seen, language models represent superwords as locations on a giant map of meaning. The distance and direction between these locations reflects the complex relationships between words and concepts. This map is so vast that even combinations not encountered directly during training, such as Jack Nicholson becoming a Pacers fan, can be found if you move in the right “semantic direction.”

They also tell us that words are powerful. Every word we feed into a language model is actually a signpost pointing to a specific place in this vast landscape of language possibilities. And the model generates what comes next, plotting a path from that place, guided by the subtle interaction of all the signposts that came before.

This is their power as a tool: language models are only as good as the way we use them. Learning to use them well is a skill and an art.

This should be of interest to anyone who wants to use them in creative work.
