You are not allowed to use ChatGPT. What can be done?

There are situations when life circumstances simply do not allow you to use ChatGPT and you have to deploy an LLM locally. For example, grandma doesn't allow it. You could, of course, just live without AI, but the guys definitely won't understand that. Are there any ways to solve this problem?

If you find yourself in such a situation, you can breathe out: there is a solution. The following options currently exist:

1. Proprietary models:

a. Anthropic (Claude) – currently comparable to or better than ChatGPT 4 on some tasks, and it has a large context window, which makes it possible to solve many problems without resorting to RAG and other hybrid approaches

b. Yandex GPT – functions well in Russian, so if your grandmother is also a major, she will definitely appreciate this option

c. GigaChat – a model from Sberbank; it also works well in Russian, so see the point above

2. Open models:

  1. Llama 2 – the original open model from a well-known terrorist organization, on top of which over 100,500 different models have already been built, for which many thanks to that organization (no one still understands what prompted Mark to make this decision). The quality does not reach ChatGPT 4.

  2. ruGPT – a pretrained model from the GigaChat line released under the MIT license; Sber had a hand here too, thanks to them. It can be used.

  3. Mistral – a model developed in France by former Meta and Google DeepMind researchers. The quality does not reach ChatGPT 4, but on average it is better than Llama 2.

  4. Falcon – a model developed with Arab money by Europeans. Overall it is weaker than Llama 2, and the point of using it eludes me.

  5. Grok by X – a presumably "based" model from Elon himself. So far it works so-so, roughly at the level of ChatGPT 3.5, but Elon promises to tear everyone to shreds, and there are reasons to believe him.

Model rankings currently look something like this (you can see them here):

Our own user experience confirms that the models from OpenAI and Anthropic are ahead of the rest, with Anthropic even slightly in the lead.

OnPrem

What should you do if cloud solutions cannot be used (grandmother is afraid that scammers will find out where the pension stash is hidden)? There are two options:

  1. deploy locally

This will require NVIDIA A100-class video cards, each costing around $16 thousand.

How many you need depends on what you plan to do. Training a model from scratch can require thousands of GPU-hours and, accordingly, a large number of video cards (and, accordingly, from tens of thousands to millions of dollars). The Falcon 7B, for example, was trained on 400 A100s over the course of two weeks. 7B, Karl!
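To get a feel for the order of magnitude, here is a back-of-envelope estimate in Python. The $2/hour rental price for an A100 is my own assumption for illustration, not a figure from any provider's price list:

```python
# Rough training-cost estimate for the Falcon 7B example above.
# The rental price per A100-hour is an assumed round number, not a quote.
num_gpus = 400            # A100s used for training (figure from the text)
training_days = 14        # "over the course of two weeks"
price_per_gpu_hour = 2.0  # USD per A100-hour, assumed

gpu_hours = num_gpus * training_days * 24
cost_usd = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours, roughly ${cost_usd:,.0f}")
# -> 134,400 GPU-hours, roughly $268,800 — and that is "only" a 7B model
```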

To use the model (inference), it depends on the use case and the number of simultaneously connected users. Let's say you want to make a chatbot that will serve 100 users. Conservatively, the number of GPUs needed to host the Llama 2 70B model for 100 users depends on the amount of GPU memory. The exact requirements depend on the model configuration, but a 70B model in 16-bit precision needs on the order of 140 GB just for the weights, so a full-precision copy spans two NVIDIA A100 80GB cards, while a quantized copy (more on quantization below) can fit on a single card.

For 100 concurrent users, keep in mind that not all of them will demand an instant response at the same moment, but the system must still be robust enough to handle peak load.

Let's assume that a pair of NVIDIA A100 80GB cards can comfortably run one full-precision instance of the model (or that a single card can run a quantized one). Each instance should be able to serve multiple users, depending on how the chatbot is structured and how requests are batched.

Let's say that, with request batching, one GPU's worth of compute can serve up to 25 concurrent users (taking latency and processing into account). Thus, 100 concurrent users will require 4 GPUs. The video cards alone will cost approximately $65 thousand, or $75-90 thousand together with the server.
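Here is a minimal sizing sketch of the same estimate. The bytes-per-parameter figures and the 25-users-per-GPU number are illustrative assumptions, not measurements:

```python
import math

# Back-of-envelope VRAM and GPU-count estimate for serving Llama 2 70B.
# All constants below are rough assumptions for illustration.
params_billions = 70
bytes_per_param_fp16 = 2.0     # 16-bit weights
bytes_per_param_int4 = 0.5     # ~4-bit quantized weights
users_per_gpu = 25             # assumed, with request batching
concurrent_users = 100

weights_fp16_gb = params_billions * bytes_per_param_fp16  # ~140 GB -> two A100 80GB cards
weights_int4_gb = params_billions * bytes_per_param_int4  # ~35 GB  -> fits on one card
gpus_needed = math.ceil(concurrent_users / users_per_gpu) # -> 4

print(f"fp16 weights: ~{weights_fp16_gb:.0f} GB, 4-bit: ~{weights_int4_gb:.0f} GB, "
      f"GPUs for {concurrent_users} users: {gpus_needed}")
```

Note that this ignores the KV cache, which also grows with context length and the number of simultaneous requests, so in practice you want some headroom.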

  2. deploy in a data center

For example, let's take Selectel. A server with the configuration described above (4xA100) costs approximately 1200 rubles per hour. Not cheap, but it makes sense if you are not going to use it very intensively.
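A quick buy-versus-rent comparison under assumed numbers (the exchange rate and the $75 thousand hardware figure are rough assumptions based on the estimate above):

```python
# Break-even point: buying 4xA100 outright vs renting a 4xA100 server by the hour.
hardware_cost_usd = 75_000   # midpoint of the $65-90 thousand estimate above
rent_rub_per_hour = 1200     # Selectel price from the text
rub_per_usd = 90             # assumed exchange rate

rent_usd_per_hour = rent_rub_per_hour / rub_per_usd      # ~$13/hour
breakeven_hours = hardware_cost_usd / rent_usd_per_hour  # ~5,600 hours
print(f"Break-even after ~{breakeven_hours:,.0f} hours "
      f"(~{breakeven_hours / (24 * 30):.0f} months of 24/7 use)")
```

In other words, if the server would sit idle most of the time, renting wins; if you plan to load it around the clock for years, buying starts to pay off.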

Both scenarios are applicable in certain situations; here you need to evaluate what you want to get as a result.

No video cards, but you hang in there (aka quantization)

If your grandmother claims that there is no money (and you haven't found the stash yet), can you somehow cut the costs? Yes, you can use quantization. This is an optimization technique that reduces the amount of memory needed to store and run a model and speeds up its computation, usually with only a small loss in quality. It works by reducing the number of bits used to represent the model's weights: most often the precision drops from 32- or 16-bit floating point down to 8-bit or even 4-bit integers. As a rule, the quality does not drop dramatically, but you need to check on your specific tasks. This can reduce hardware requirements by a factor of 2-4, but you have to experiment.
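As a sketch of what this looks like in practice, here is how a model can be loaded in 4-bit precision with Hugging Face transformers and bitsandbytes. The model name is just an example (and is gated on the Hub), and the exact memory savings depend on the model and the settings:

```python
# Minimal sketch: 4-bit quantized loading via transformers + bitsandbytes.
# Requires: pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # example model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits instead of 16
    bnb_4bit_compute_dtype=torch.float16,  # do the math in fp16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",                     # spread layers across available GPUs
)

prompt = "Grandma, can I run an LLM at home?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Whether the quality after quantization is still acceptable is something you check on your own tasks, exactly as noted above.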

This is actually a very large topic and it is difficult to describe all the nuances in one article.

Ira, our expert in this field, will soon be holding a webinar on this topic (completely free). If you are interested in diving deeper and asking an expert your questions, the registration link is here.

All the best!
