Efficient Launch and Inference of LLM on Your Server from Scratch (Part 1)

In this course, you will learn the details of efficient open-source LLM serving and fine-tuning, including methods for handling multiple requests from multiple users. Used together, several of these methods improve both latency and throughput. For example, by using the latest open-source technologies in our product, we have achieved up to 70x higher throughput per GPU compared to a default Hugging Face + PyTorch setup.

The course is too extensive even for a long read and contains a lot of practical code, so today I will start with the first lessons and release the next parts if I see keen interest. This is an adaptation rather than a direct copy-paste: in some places I expand the course with my own notes, and in others I shorten it. I would also note that translating LLM terminology into Russian is a rather thankless task, so some terms will remain in English.

What's inside the course?

  • Details of generating text with an LLM one token at a time and implementing KV-caching;

  • Batching to process multiple inputs simultaneously;

  • Continuous batching for processing a stream of requests in real time without waiting for the formation of full batches;

  • Quantization to reduce the model's memory consumption and hardware requirements;

  • LoRA as an effective fine-tuning method without changing the initial weights of the model;

  • Combining multiple LoRAs with continuous batching to serve dozens of fine-tuned models simultaneously.

From words to… the essence of the course

Following the course structure, in the first lesson I will show how to iteratively generate text with an LLM one token at a time, how to split this process into two phases – pre-fill and decode – and how to optimize it using KV-caching (caching part of the computations).

Loading LLM from Hugging Face

Let's start by downloading an LLM from Hugging Face to serve as our example model for inference (i.e., generating output tokens). The original course uses GPT-2, but its Russian language support is poor, so throughout the course I will use its Russian-language counterpart – rugpt3small_based_on_gpt2.

Let's now import the required dependencies (PyTorch, transformers, etc.) and then load our LLM and the corresponding tokenizer from Hugging Face.

# Import dependencies
import matplotlib.pyplot as plt
import numpy as np
import time
import torch

# Load the LLM and its tokenizer from Hugging Face
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "ai-forever/rugpt3small_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

If errors occur, read them and install the required libraries using pip install or similar.
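For reference, a typical installation looks something like this (exact package versions depend on your environment):

pip install torch transformers matplotlib numpy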

And if you want to take a deeper look at the model architecture, use the command below:

print(model)
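
If you prefer a compact summary to the full module tree, you can also pull a few key hyperparameters from the model config (a small optional sketch; the attribute names follow the GPT-2-style config this model uses):

# Key hyperparameters from the config instead of the full module tree
print(model.config.n_layer, model.config.n_head, model.config.n_embd, model.config.vocab_size)
# Total number of parameters
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")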

It is worth recalling here that GPT-2 and most modern LLMs are decoder-only models. While the original transformer followed the logic of “the encoder converts input tokens into embeddings, and the decoder generates output tokens based on these embeddings,” in GPT-2 the input is converted into embeddings right away and then passed through a stack of decoder blocks. The key feature of such models is that they generate text one token at a time, which makes them autoregressive. See the visualization below:

Token generation process

Okay, now that we've talked about the features of our model and loaded it, let's explore the text generation process itself. Let's take a simple prompt:

prompt = "Каждое утро я пью кофе с"  # "Every morning I drink coffee with"

Now let's run this prompt through the tokenizer:

inputs = tokenizer(prompt, return_tensors="pt")

Here return_tensors="pt" indicates that we want the results as PyTorch tensors. NumPy and TensorFlow formats are also available, but today we are working with PyTorch.

The result of tokenization is a dictionary containing several tensors:

print(inputs)
# {'input_ids': tensor([[ 6207,   900,   647,  7949,   417, 40946,  6715,   281]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

Let's analyze the obtained tensors:

  • input_ids: The main tensor representing the numeric mapping from our text to tokens. Each number corresponds to a specific token in the model's vocabulary.

  • attention_mask: This tensor consists of ones and has the same length as input_ids.

The concept of attention_mask will become clearer when we move on to batching. For now, it is enough to think of it as an auxiliary tensor that accompanies input_ids and tells the LLM which input tokens should be attended to during processing.
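
If you are curious which pieces of text these ids correspond to, you can map them back to the raw vocabulary tokens. This is a quick optional sanity check, not required for the rest of the lesson:

# Show the raw BPE tokens behind each id in input_ids
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))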

Once the input data has been tokenized, we are ready to feed it into the model and analyze the output. Here's how:

# Disable gradient computation to save memory during inference
with torch.no_grad():
    # Pass the tokenized input to the model
    outputs = model(**inputs)

# Extract the logits: the model's raw predictions
logits = outputs.logits
print(logits.shape)
# torch.Size([1, 8, 50264])

The resulting tensor has 3 dimensions:

1: the batch size (1 in our case, since we passed a single input);
8: the number of tokens in our input;
50264: the model's vocabulary size (the number of possible output tokens).

The next step after receiving logits from the model is to determine which token the model will predict as the most likely continuation of the sequence:

# Take the logits at the last position: they predict the next token
last_logits = logits[0, -1, :]
# Find the index of the token most likely to be next in the sequence
next_token_id = last_logits.argmax()
print(next_token_id)
# tensor(29258)

Now we use the tokenizer to decode the resulting token back into plain text:

predicted_token = tokenizer.decode(next_token_id)
print(predicted_token)
# ' молоком'

If you recall the original prompt, “Every morning I drink coffee with,” you'll notice that the continuation is grammatically correct and makes logical sense in the context of the given input.

Now, instead of picking just the most likely token, let's look at the top 10 most likely continuation options:

top_k = torch.topk(last_logits, k=10)
tokens = [tokenizer.decode(tk) for tk in top_k.indices]
print(tokens)
# [' молоком', ' сахаром', ' лим', ' кори', ' шоколад', ' конья', ' було', ' м', ' пирож', ' утра']

Give this post a like if you also enjoy coffee with cognac in the morning 🙂

When using different decoding strategies, the model may choose one of these alternative tokens. In LLMs, the degree of “creativity” in choosing less probable tokens is usually controlled by the temperature, but I will not go into this in depth in this lesson.
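
Still, to make the idea concrete, here is a minimal sketch of temperature sampling (my own illustration, not part of the course code; the temperature value is arbitrary):

# Divide the logits by a temperature before softmax: values below 1 sharpen the
# distribution, values above 1 flatten it; torch.multinomial then draws a random token
temperature = 0.8  # illustrative value
probs = torch.softmax(last_logits / temperature, dim=-1)
sampled_token_id = torch.multinomial(probs, num_samples=1)
print(tokenizer.decode(sampled_token_id.item()))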

In fact, after generating the first token continuing our phrase (' молоком', i.e. "with milk"), we can use it to build a new input tensor and keep generating text:

next_inputs = {
    # Update input_ids by appending the new token to the original sequence
    "input_ids": torch.cat(
        [inputs["input_ids"], next_token_id.reshape((1, 1))],
        dim=1
    ),
    # Update attention_mask by appending a 1 for the new token
    "attention_mask": torch.cat(
        [inputs["attention_mask"], torch.tensor([[1]])],
        dim=1
    )
}

After adding the new token, let's take a look at the updated input:

print(next_inputs["input_ids"], next_inputs["input_ids"].shape)
print(next_inputs["attention_mask"], next_inputs["attention_mask"].shape)
# tensor([[ 6207,   900,   647,  7949,   417, 40946,  6715,   281, 29258]]) torch.Size([1, 9])
# tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]]) torch.Size([1, 9])

If we analyze the results:

  • input_ids now includes the new token (29258) at the end;

  • the shape of the tensor has changed from [1, 8] to [1, 9], reflecting the added token;

  • the same applies to attention_mask.
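
As a quick optional sanity check, you can decode the updated sequence back to text; it should read as the original prompt followed by " молоком":

print(tokenizer.decode(next_inputs["input_ids"][0]))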

Measuring the speed

Now let's think about how fast we can generate tokens. This is one of the most important metrics our team regularly works on when optimizing LLMs.

Let's start by defining a function that combines everything we did in the previous cells. It takes a single inputs dictionary and generates the next token:

def generate_token(inputs):
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    last_logits = logits[0, -1, :]
    next_token_id = last_logits.argmax()
    return next_token_id

Now that we have the helper function generate_token, let's generate a few tokens and see how long it takes:

generated_tokens = []  # List to store the generated tokens as text
next_inputs = inputs   # Input data, updated at every step
durations_s = []       # List to store the duration of each iteration in seconds

# Generate 20 tokens
for _ in range(20):
    # Start timing before generating the token
    t0 = time.time()

    # Generate the next token
    next_token_id = generate_token(next_inputs)

    # Record the time spent on the operation
    durations_s += [time.time() - t0]

    # Update the inputs for the next iteration
    next_inputs = {
        "input_ids": torch.cat(
            [next_inputs["input_ids"], next_token_id.reshape((1, 1))],
            dim=1),  # concatenate input_ids with the new token id
        "attention_mask": torch.cat(
            [next_inputs["attention_mask"], torch.tensor([[1]])],
            dim=1)  # append a 1 to attention_mask
    }

    # Decode the token id into text and append it to the list
    next_token = tokenizer.decode(next_token_id)
    generated_tokens.append(next_token)

# Print the total generation time
print(f"{sum(durations_s)} s")
# Print the list of generated tokens
print(generated_tokens)
# 0.540771484375 s
# [' молоком', '.', ' ', ' И', ' каждый', ' вечер', ' я', ' пью', ' кофе', ' с', ' молоком', '.', ' ', ' И', ' каждый', ' вечер', ' я', ' пью', ' кофе', ' с']

The result shows that the generation of 20 tokens took about 0.54 seconds. To analyze the dynamics of this speed from token to token in more detail, let's look at the visual graph:

plt.plot(durations_s)
plt.show()

Let's notice several points:

  1. It can be assumed that the generation time of each token should increase: we keep appending new tokens to the input, so the sequence the model processes grows at every step. Even with only 20 tokens this trend is visible, although the plot is quite noisy.

  2. The first token usually takes a little longer to generate for a number of reasons we won't go into now. After that, the time per token drops sharply and then gradually increases toward the end of the sequence (with 20 tokens taking about 0.54 s in total, that is a few tens of milliseconds per token).

This graph shows where a significant portion of the cost comes from in LLM inference when using this straightforward approach. Now let's try to optimize this process.

Using cache to speed up calculations

For transformers, one of the most resource-intensive operations is computing attention. The attention mechanism deserves an article of its own, so I will explain it briefly and with some simplification. Attention uses three key components:

  • Q (Query): “What do you need now?”

  • K (Key): “Labels” for words in the input text;

  • V (Value): Factual information about the words.

When an LLM generates a long text in the straightforward way, token by token, for each new word it has to “reread” and “think about” the entire previous context, i.e. recompute these values for the whole sequence. It's as if you had to reread a long sentence from the beginning every time you wanted to continue it.

KV-caching avoids these repeated costs by splitting the process into two phases:

  • When generating the first token (pre-fill phase), the model computes Q, K, and V for the entire input text.

  • For subsequent tokens (decode phase):

    • Q, K and V are calculated only for the new token.

    • K and V for previous tokens are taken from the cache.

    • New K and V are added to the cache.

This avoids re-calculations for already processed tokens, significantly speeding up the generation process, especially for long sequences.
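
Before rewriting the generation loop, you can take an optional peek at what the model actually returns as its cache. Depending on your transformers version, past_key_values is either a tuple of (key, value) tensor pairs per layer or a Cache object; in both cases it stores K and V for every layer, with a sequence-length dimension that grows by one on each decode step:

# Run a single forward pass and inspect the returned cache object
with torch.no_grad():
    out = model(**inputs)
print(type(out.past_key_values), len(out.past_key_values))  # one entry per decoder layer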

So, we have already defined the function generate_token, which accepts only input_ids and attention_mask. Now let's create a new function, generate_token_with_past, which does the same thing but also uses and returns past_key_values: the K and V values already computed for a given model input.

def generate_token_with_past(inputs):
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    last_logits = logits[0, -1, :]
    next_token_id = last_logits.argmax()
    return next_token_id, outputs.past_key_values

Let's now repeat the process we did earlier to measure the generation time of 20 tokens. But this time we will use KV-caching and pass only the next token as input_ids:

generated_tokens = []
next_inputs = inputs
durations_cached_s = []
for _ in range(20):
    t0 = time.time()
    # Use the new function, which also returns past_key_values
    next_token_id, past_key_values = generate_token_with_past(next_inputs)
    durations_cached_s += [time.time() - t0]
    
    next_inputs = {
        # Now input_ids contains only the new token
        "input_ids": next_token_id.reshape((1, 1)),
        "attention_mask": torch.cat(
            [next_inputs["attention_mask"], torch.tensor([[1]])],
            dim=1),
        # Use the cached past_key_values
        "past_key_values": past_key_values
    }
    
    next_token = tokenizer.decode(next_token_id)
    generated_tokens.append(next_token)

print(f"{sum(durations_cached_s)} s")
print(generated_tokens)
# 0.3520181179046631 s
# [' молоком', '.', ' ', ' И', ' каждый', ' вечер', ' я', ' пью', ' кофе', ' с', ' молоком', '.', ' ', ' И', ' каждый', ' вечер', ' я', ' пью', ' кофе', ' с']

Now let's build a visual graph that will show the token generation time with and without KV-caching:

plt.plot(durations_s)
plt.plot(durations_cached_s)
plt.show()

The orange hockey stick-like line shows the new results using KV-caching. As expected, after the first token is generated, the time to generate each subsequent token drops sharply and remains low until the very end.

This approach alone has already reduced the overall generation time by about 35%. With more generated tokens or longer initial prompts, the difference will be even greater. Try experimenting yourself!
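
If you want a starting point for such experiments, here is a minimal sketch that reuses the two functions defined above and times the generation of n_tokens with or without the cache (the helper name time_generation is mine, not from the course):

# Time the generation of n_tokens for a given prompt, with or without KV-caching
def time_generation(prompt, n_tokens=50, use_cache=True):
    next_inputs = tokenizer(prompt, return_tensors="pt")
    t0 = time.time()
    for _ in range(n_tokens):
        if use_cache:
            next_token_id, past_key_values = generate_token_with_past(next_inputs)
            next_inputs = {
                # Only the new token goes into input_ids; K and V come from the cache
                "input_ids": next_token_id.reshape((1, 1)),
                "attention_mask": torch.cat(
                    [next_inputs["attention_mask"], torch.tensor([[1]])], dim=1),
                "past_key_values": past_key_values,
            }
        else:
            next_token_id = generate_token(next_inputs)
            next_inputs = {
                # Without the cache, the whole growing sequence is passed in again
                "input_ids": torch.cat(
                    [next_inputs["input_ids"], next_token_id.reshape((1, 1))], dim=1),
                "attention_mask": torch.cat(
                    [next_inputs["attention_mask"], torch.tensor([[1]])], dim=1),
            }
    return time.time() - t0

print(f"no cache:   {time_generation(prompt, use_cache=False):.3f} s")
print(f"with cache: {time_generation(prompt, use_cache=True):.3f} s")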

KV-caching is the first important step in accelerating LLM inference, and it is at the core of what modern serving libraries do. There are other, more sophisticated optimization techniques that aim to use the cache efficiently both in memory and in CUDA-level computation. Our ready-made Compressa LLM build uses the best modern approaches, such as PagedAttention, which allows 2-10x faster generation for a single query.

This concludes the first lesson on LLM inference optimization. In the next part, I will talk about the batching technique, which optimizes the processing of several requests simultaneously and allows us to “load” our hardware even more efficiently, increasing throughput.

So when is the next part?

I put a lot of effort into adapting the first part, so if you want more, ask questions in the comments, share feedback, send a link to this article to your friends and colleagues, and also save it to your bookmarks. This way I will understand that it is worth continuing and will release a lesson on batching. Thank you for reading to the end!
