How we saved $70k on LLMs

Recently, some friends approached me: they were actively integrating an LLM into their product, but the cost of the solution bothered them – they were paying about $8/hour for a Hugging Face Inference Endpoint running 24/7, which added up to roughly $70,000 a year. I needed to research the available ways of deploying large language models, understand their pitfalls, and pick the optimal serving option. I'm sharing the results of that research in this article)

What is an inference server?

Running an LLM locally is not a trivial task. You need to talk to the GPU to multiply matrices as fast as possible, queue incoming requests so that GPU memory doesn't overflow, combine queued requests into batches according to various rules, tokenize text, and so on.
In order not to clutter the main application with all this logic, it can be moved into a separate microservice.
An inference server is exactly that microservice: it lets you work with the LLM as a black box that takes text in and returns text out.
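To make the black-box idea concrete, here is a minimal sketch of how an application might talk to such a microservice. The URL, port and model name are placeholders; I'm assuming a server that exposes an OpenAI-style /v1/completions endpoint, which most of the servers discussed below can do.

```python
import requests

# Hypothetical local inference server exposing an OpenAI-style API
BASE_URL = "http://localhost:8000/v1"

def generate(prompt: str, max_tokens: int = 128) -> str:
    # Text goes in, text comes out - everything else (tokenization,
    # batching, GPU scheduling) is the server's problem.
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={"model": "my-model", "prompt": prompt, "max_tokens": max_tokens},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

print(generate("The capital of France is"))
```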

Full support

Probably the first and most important point when choosing a server is to check whether it can run the model you need. Be especially careful here if you are running something exotic, like an MoE model or, God forbid, a Mamba.

You also need to be sure that the server supports your GPU, even if it is an Nvidia card. Older cards support noticeably less functionality than newer ones, which is reflected in their Compute Capability number – the higher, the better. If you take a card with Compute Capability below 8.0, be prepared for some features not to work. Also, not all servers support parallelization across multiple GPUs, and those that do don't always do it optimally.
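If you already have PyTorch with a CUDA build installed, a quick way to see what you are dealing with is to print the Compute Capability of every visible GPU (a small sketch, nothing server-specific):

```python
import torch

# Prints the Compute Capability of each visible GPU so you know
# what your serving options are. Requires a CUDA build of PyTorch.
if not torch.cuda.is_available():
    print("No CUDA device visible")
else:
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        note = "" if (major, minor) >= (8, 0) else " (expect some features to be missing)"
        print(f"GPU {i}: {name}, Compute Capability {major}.{minor}{note}")
```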

Easy to Deploy

Unfortunately, not every AI startup has the budget to hire an MLOps team, so most likely regular developers will be the ones deploying the LLM. So when choosing a solution for running models, ease of deployment is a fairly important criterion. Ollama stands out among the leaders here – you run one command, it pulls everything it needs, detects your GPUs and starts serving in the background with an OpenAI-compatible API. You can also quite easily add your own models to it by describing how to work with them in a rather amusing Modelfile format.
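For example, assuming Ollama is already running locally (by default it listens on port 11434) and you have pulled a model, the official openai client can talk to it directly; the model name below is just an example:

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API, so the official client works
# as-is with a different base_url; the API key is ignored but required.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",  # any model you have pulled with `ollama pull`
    messages=[{"role": "user", "content": "Explain what an inference server is in one sentence."}],
)
print(response.choices[0].message.content)
```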

GPU offload

Modern LLMs with billions of parameters require tens of gigabytes of VRAM to run. Since not every consumer card can boast that much memory, the obvious workaround is to use ordinary RAM to store part of the weights and the CPU to process them. That is, some of the layers live in GPU memory; after they run, the intermediate activation vector is transferred to the CPU, and the computation continues there. In this setup, performance is heavily limited by the speed of the PCIe bus, as well as by how many layers end up on the CPU (the fewer, the faster, of course). The leading servers for the GPU-poor are llama.cpp and, once again, the wonderful wrapper around it – ollama.
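As an illustration, the llama-cpp-python bindings for llama.cpp expose this as a single parameter; the model path below is a placeholder for whatever GGUF file you have:

```python
from llama_cpp import Llama

# Keep only part of the layers on the GPU; the rest stay in RAM
# and run on the CPU.
llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # placeholder path to a GGUF model
    n_gpu_layers=20,  # how many layers to offload to the GPU (-1 = all of them)
    n_ctx=4096,
)

out = llm("Q: What is GPU offloading? A:", max_tokens=64)
print(out["choices"][0]["text"])
```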

Quantization

Quantization is the process of reducing the precision of the numbers in a neural network. By default, weights are stored as 32-bit floating point numbers (float32), i.e. 32 bits per number. Quantization reduces the memory needed to represent them by converting them to lower-precision integer formats such as int8 or int4. This cuts memory consumption and (potentially) speeds up computation thanks to the more compact number format.
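A toy illustration of the idea (real schemes like GPTQ or AWQ are considerably smarter than this): round a float32 weight matrix to int8 with a single scale factor and compare the memory footprint and the rounding error.

```python
import numpy as np

# Toy symmetric int8 quantization of a random "weight matrix".
w = np.random.randn(4096, 4096).astype(np.float32)

scale = np.abs(w).max() / 127.0                                   # one scale for the whole tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # quantize
w_restored = w_int8.astype(np.float32) * scale                    # dequantize back to float

print(f"float32: {w.nbytes / 2**20:.0f} MiB, int8: {w_int8.nbytes / 2**20:.0f} MiB")
print(f"mean absolute rounding error: {np.abs(w - w_restored).mean():.5f}")
```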

However, you need to be careful with quantization:

  • There are a zillion different ways to round model weights, and it is not guaranteed that your server supports the one you need

  • Quantization may be implemented suboptimally (as, for example, in vLLM), in which case using it may even make generation slower.

Batching

Often the LLM is the bottleneck of the system, and many requests pile up in the queue in front of it. Instead of processing them one by one, the system can take several requests from the queue at once and process them together as a batch, thanks to the magic of matrix multiplication.
That is, with batch size 2 the model can generate two responses in less than twice the time it would take with BS=1. Keep in mind, though, that increasing the batch size inevitably increases the latency of each individual response.
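A minimal sketch of batched generation with Hugging Face transformers; gpt2 is used here only because it is tiny, any causal LM would do:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Answer several prompts in one forward pass per generated token.
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The capital of France is", "To sort a list in Python you can"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)

for text in tokenizer.batch_decode(out, skip_special_tokens=True):
    print(text)
```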

Continuous batching

There may not be a long queue of tasks, but requests can still arrive in parallel. That is, while one request is being processed, a second one may arrive.
When an LLM generates text, the model is called many times in a row, once per token. So the batch can be refilled on the fly: as soon as one sequence in the batch finishes, its slot is immediately handed to the next request, without waiting for the rest of the batch to complete.
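Here is a toy scheduler that shows only the control flow of continuous batching (real servers such as vLLM do this at the level of attention caches; the request names and output lengths are made up):

```python
import random
from collections import deque

# One "decode step" produces one token for every active request.
# A finished request frees its slot immediately, and a queued request
# takes that slot on the very next step - no waiting for the whole batch.
MAX_BATCH = 4
queue = deque(f"request-{i}" for i in range(8))
active = {}  # request -> tokens still to generate

step = 0
while queue or active:
    # Refill free slots from the queue before every step.
    while queue and len(active) < MAX_BATCH:
        active[queue.popleft()] = random.randint(3, 10)

    # One decode step for the whole batch.
    for req in list(active):
        active[req] -= 1
        if active[req] == 0:
            print(f"step {step:2d}: {req} finished, slot freed")
            del active[req]
    step += 1
```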

LoRA support

Some servers allow you to switch adapters on a model on the fly. This way you can keep one copy of the base model in memory along with a large set of LoRA adapters, and generate answers in different styles. If you need this functionality, look towards LoRAX. You can read more about how LoRA works in my other article (“who is this LoRA of yours”)
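LoRAX has its own serving API, but the general idea – one base model in memory plus hot-swappable adapters – can be sketched with the peft library; the model id and adapter paths below are placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# One copy of the base model in memory...
base = AutoModelForCausalLM.from_pretrained("base-model-id")  # placeholder id

# ...plus several lightweight LoRA adapters switched on the fly.
model = PeftModel.from_pretrained(base, "./loras/formal", adapter_name="formal")  # placeholder paths
model.load_adapter("./loras/casual", adapter_name="casual")

model.set_adapter("formal")  # answers in a formal style
# ... generate ...
model.set_adapter("casual")  # same base weights, different style
# ... generate ...
```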

Operation speed

Even though all inference servers perform essentially the same operations under the hood, there is a large number of small optimizations, thanks to which some generate text faster than others. Broadly, there are two reasons for this: suboptimal code (which, once optimized, simply runs faster at the same quality) or various tricks that speed up the model at the cost of a slight drop in quality)
The main parameters to look at are listed below; a small sketch of how to measure them follows the list:

  • Throughput – generation speed at the optimal batch size. Important if you have a large stream of requests and can actually pick the optimal batch size

  • Single-stream t/s – generation speed without batching. Matters if there are few users and the batch often doesn't fill up. Two slightly more niche metrics:

  • Latency (ms) – how long it takes the LLM to produce the first token. Important for systems with TTS, where you want to start speaking the answer as early as possible

  • Prompt t/s – prompt processing speed (effectively, how much latency each prompt token adds).

    For specific numbers for different servers/GPUs/models, see the very end of the article
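If you want a rough feel for these numbers on your own setup, here is a small sketch that measures time to first token and single-stream speed against any OpenAI-compatible server; the URL and model name are placeholders:

```python
import time
from openai import OpenAI

# Rough measurement of time to first token and generation speed
# against an OpenAI-compatible server (URL and model are placeholders).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Write a short poem about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1  # counting chunks as a rough proxy for tokens

total = time.perf_counter() - start
print(f"latency to first token: {(first_token_at - start) * 1000:.0f} ms")
print(f"single-stream speed: ~{n_chunks / total:.1f} tokens/s")
```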

Conclusion

Instead of asking ChatGPT to pad the conclusion with filler, I'll just attach a table

| Server | Easy to use | Offload | Quantization | Batching | Speed | MoE | Mamba | Comment |
|---|---|---|---|---|---|---|---|---|
| vLLM | +/- | | + | + | + | + | | Best throughput |
| TensorRT | | | + | + | +/- | +/- | +/- | Difficult |
| exLlama | | | + | | + | | | Best single-stream throughput |
| llama.cpp | +/- | + | + | + | | + | | For the GPU-poor |
| ollama | + | + | + | | | + | | The simplest solution |
| deepspeed-mii | +/- | + | + | +/- | + | +/- | | Nvidia CC >= 8.0 |
| TGI | +/- (slow) | | + | + | | +/- | + | Couldn't get it to build( |

A more detailed, quantitative comparison of the different serving options on different hardware, with exact benchmark results, can be found in this table (it is being updated).

My personal blog
