How to set up LLM on a local server? A step-by-step guide for ML specialists

Businesses increasingly want to use LLMs deployed locally, on their own servers. To carry out this process, engineers need to complete two tasks:

  • Create a convenient “sandbox” for experiments to quickly test business hypotheses.
  • Effectively scale the found cases within the company, reducing resource costs whenever possible.

Wondering how to build fast and cost-effective LLM inference? In this article we share a detailed guide and the results we obtained.

Author: Alexey Goncharov, founder of Compressa.ai, a platform for building GenAI solutions on your own servers.

Compressa LLM platform on Selectel infrastructure

Rent ready-made LLM infrastructure from Compressa and Selectel, in the cloud or on dedicated servers. Speed up your development cycle and reduce token generation costs. Try it free with a two-week trial.

Problems with open-source LLMs and their solutions

There are three key factors that make it difficult for a business to run models on its own server: high cost, a shortage of specialists, and the low quality of open-source LLMs on applied tasks. Below I will cover each in more detail.

High cost of resources

If a team wants to use an LLM in production under high load or with a large number of users, they will need a server with at least one powerful GPU, for example an A100. But such equipment is expensive, and not every department or even company can afford it.

What if high-quality answers require additional training for different business tasks? Then you will need a whole GPU cluster to run that process, and a separate server to host each version of the LLM. As a result, computing costs grow even further.

For example, the cost of an NVIDIA A100 starts at 800 thousand rubles per unit, and administration, maintenance and other costs come on top of that.

Few experts

Unfortunately, there are not many specialists on the market who can come to a company and quickly deploy a local LLM. They must understand what infrastructure fits the company's specific needs, which optimization methods to apply and how they affect the model's behavior, and also take into account the context in which the LLM will be used and the priority metrics during setup. Few have such competencies.

Low quality of open-source LLM answers

Open-source LLMs, especially small ones, often show poor results on specialized tasks, so using them as-is to solve business problems will not work. You can improve a model's results if you collect a suitable dataset and properly train it for the desired task.

Tools

To address these issues, the open community is actively developing tools that reduce GPU costs.

  • Quantization converts the model to eight, four, or fewer bits to reduce the memory required to run the LLM.
  • LoRA adapters help retrain models faster and cheaper, and then run several of them on one GPU.
  • Inference frameworks increase the performance of the model during serving.

If you are interested in the topic of the article, join our “Milky Way” community on Telegram. There we discuss problems and best practices for running ML services in production and share our own experience. Digests on DataOps and MLOps are also published there once a week.

How to optimize LLM inference

Step 1. Choose a model

There are many open benchmarks that help measure model quality. However, they are often criticized because models can be overfitted to their test sets. Closed benchmarks, on the contrary, offer more reliable data, but access to them is limited.

Comparison of results on open academic benchmarks.

To find the right solution, we select a model in two ways at once. On the one hand, we study the benchmarks and how the models were run on them. On the other hand, we make a subjective assessment of answer quality in each specific scenario.

Comparison of models on applied problems.

At the same time, results in Russian are usually worse than the benchmark numbers suggest, so you need to carefully evaluate the answers on your own tasks. Fortunately, there are open-source projects such as Saiga: essentially a “Russification” recipe applied to well-known models such as Mistral or Llama.
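To illustrate the second, subjective check, here is a minimal sketch that loads two candidate models with Hugging Face Transformers and prints their answers to your own test prompts side by side. The model names and prompts are placeholders, not recommendations.

```python
# Minimal sketch of a subjective side-by-side comparison
# (model names and prompts are placeholders; adjust to your scenario and hardware).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

candidates = ["mistralai/Mistral-7B-Instruct-v0.2", "meta-llama/Meta-Llama-3-8B-Instruct"]
prompts = ["Summarize our refund policy in two sentences: ..."]  # your business prompts

for name in candidates:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
        # Decode only the newly generated tokens, skipping the prompt.
        answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        print(f"=== {name} ===\n{answer}\n")
```

Reading such answers next to each other on your own prompts usually says more about fitness for the task than a single benchmark score.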

Step 2. Train the model for the required task

When working with applied tasks, Compressa.ai does not do a full fine-tune of the model, but uses the LoRA adapter approach. A full fine-tune is an effective but complex operation: the company needs resources and expert specialists so as not to accidentally break the model. LoRA adapters do not change the weights of the original model, but adjust its responses to the desired task, for example, output in Russian.

A fully fine-tuned model may require a separate GPU for each version, whereas LoRA adapters allow you to attach several of them to one GPU at once and switch between them dynamically while processing requests.

How LoRA adapters operate on one GPU.

Of course, increasing the number of adapters per GPU reduces overall performance, so you need to find the optimal balance based on the required load and the hardware available to you. You may need to transfer some adapters to another video card.
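As a rough sketch of the LoRA approach, the snippet below attaches an adapter to a base model using the peft library. The base model, rank, and target modules are illustrative assumptions, not Compressa's actual configuration.

```python
# Minimal LoRA sketch with the peft library (base model, rank and target
# modules are illustrative assumptions, not Compressa's actual settings).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                 # adapter rank: smaller = fewer trainable parameters
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model

# After training (e.g. with transformers.Trainer on your dataset), only the
# small adapter weights are saved; the base model stays untouched.
model.save_pretrained("./lora-russian-support")
```

Several such adapters can later be loaded onto the same base model (for example, via peft's load_adapter and set_adapter) and switched per request, which is what makes it possible to serve many task-specific versions on one GPU.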

Step 3: Use Quantization

Every year the scientific community invents new methods for quantizing models. The best known are GPTQ, AWQ, SmoothQuant, OmniQuant, AdaRound, SqueezeLLM and others. They allow you to reduce the size of the model and its GPU requirements while minimizing losses in answer quality.

Model quantization methods.

The effect of quantization differs from method to method. To get a predictable result, you need to match the method to the scenarios in which the model will then be used.

For example, suppose we want quantization to be essentially lossless, but only for a limited set of tasks. To do this, we use methods that compress the model with virtually no loss of quality on those tasks. As a result, the model performs just as well in those limited scenarios, but noticeably worse in others.

There is another option: use more universal methods. The model will lose a small percentage on the metrics, but will work more or less stably across all tasks. From a mathematical point of view, these methods are orthogonal: they can be combined with each other to obtain new results.
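For illustration, here is a rough sketch of 4-bit AWQ quantization with the AutoAWQ library. The model path and quantization config follow typical public examples and are assumptions, not our exact production settings.

```python
# Rough sketch of offline 4-bit AWQ quantization with the AutoAWQ library
# (model path and quant_config are assumptions based on typical examples).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "mistral-7b-instruct-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrates on a default dataset and rewrites the weights in 4 bits.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The saved 4-bit checkpoint can then be loaded by an inference framework that supports AWQ, which is where most of the memory savings show up in practice.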

Step 4. Use inference frameworks

There are different frameworks for model inference, including Llama.cpp, LMDeploy, vLLM, TensorRT-LLM and others. Many of them originated from scientific papers in which researchers described new ways to optimize certain operations in the token generation process. For example, vLLM grew out of the PagedAttention approach, and the open community has since contributed other acceleration methods to it.

Each of these frameworks supports a different set of optimizations and adds new features at a different pace. The optimal choice depends on the available hardware, the average number of input and output tokens per request, the priority metrics for optimization and other factors. Compressa ML engineers constantly test different frameworks to understand their strengths and weaknesses, and we also contribute to some of them.
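For reference, minimal offline inference with vLLM might look like the sketch below; the model name and sampling parameters are placeholders.

```python
# Minimal vLLM offline inference sketch (model name and sampling settings
# are placeholders; a quantized or LoRA-adapted model can be served the same way).
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", gpu_memory_utilization=0.90)
sampling = SamplingParams(temperature=0.0, max_tokens=200)

prompts = ["Explain what LoRA adapters are in two sentences."]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```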

Step 5: Don't limit inference to the A100

If you handle a large flow of requests every day and want to run Llama 70B, you will definitely need several A100s. However, for some scenarios, setups built on budget GPUs are also suitable. Let's take a closer look at them.

Small models

Typically, practitioners start experimenting with the strongest and largest LLMs to see the maximum answer quality available for a particular business scenario. But some tasks are well served by small models with modest memory requirements, such as Phi-3. They can be fine-tuned and run efficiently even on a very low-cost GPU.
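As a rough illustration, a small model can also be loaded in 4-bit with bitsandbytes so that it fits a budget GPU; the model name and settings below are assumptions, and NF4 loading is only one of several possible options.

```python
# Rough sketch: load a small model in 4-bit so it fits a budget GPU
# (model name and quantization settings are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb_config, device_map="auto")

inputs = tokenizer("Classify the sentiment of: 'great service, slow delivery'", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True))
```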

Distributed LLM inference

If the desired LLM does not fit into the memory of a budget GPU, you should not necessarily switch to expensive hardware right away. You can connect multiple inexpensive GPUs into a shared cluster to increase the total amount of available memory. Such a system pays an overhead for transferring data between devices, so the optimal solution depends on your latency and throughput requirements.
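A sketch of such a setup using vLLM's tensor parallelism, assuming two CUDA devices are visible to the process (the model name is a placeholder):

```python
# Sketch: split one model across two inexpensive GPUs via tensor parallelism
# (assumes two CUDA devices are visible; model name is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # any model too large for a single budget GPU
    tensor_parallel_size=2,           # shard weights and KV cache across 2 GPUs
)
print(llm.generate(["Hello!"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```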

Build results


Let's say the solution is already set up. What results can you expect? There are two possible scenarios.

  1. If you run models on powerful GPUs, throughput increases by 10-70 times compared to plain Hugging Face / PyTorch inference. Much less hardware will be required to process the same flow of requests, so you can greatly reduce costs.
  2. If you don't yet have access to expensive GPUs or the ability to scale, you can use a budget RTX 2080. Throughput, of course, will be lower than on an A100, but will still remain high.

Acceleration results on different GPUs.

Case: LLM for search tasks


One of our customers uses an LLM to improve document indexing for search. They needed to process a large volume of documents with regular updates, and LLM performance became the limiting factor in scaling the solution.

To solve the problem, we optimized the LLM document-indexing pipeline for a large number of updates. After implementing the necessary optimizations, we achieved a 20-fold speedup: the system can now index 200 thousand pages instead of ten thousand in the same period. At the same time, resource costs were reduced from four A100s to just one GPU.

Article based on a Compressa.ai report at the MLechny Way conference. Video recording – follow the link.
