What Is There Besides ChatGPT, and How Do You Deploy It?

…Bloom, which produced obscenities that even became the reason for temporarily shutting the model down. So, the main feature of LLM Data Protector is protecting businesses that use various LLMs (including ChatGPT) in their solutions from data-leak incidents and from someone gaining access to the model.

Model Life Cycle in MLOps

In this article, we look at the current state of LLMs through the prism of MLOps – we need some frame of reference, because otherwise we will drown in the sea of information and the abundance of models.
But before we move on to LLMs themselves, let's recall the basic life-cycle diagram of any ML model and reflect on how it has changed (and whether it has changed at all) with the advent of Large Language Models.

The scheme is described in more detail in this article; here I will focus only on a few steps that interest us, namely model inference. At the inference stage, we assume that we already have the model architecture and the weight files, and we can host the model and use it however your DS programmed it – or however your heart desires. There are several inference blocks on the diagram, although the main pipeline is one: the advantage is that we can run inference on different devices at once and in parallel.

Are LLMs Always Necessary?

An important topic I would like to discuss is when to use LLMs at all. Let's try to determine the pool of tasks where using such large models is justified (given that you pay for inference and expect good results), and where it is better to turn to classical approaches and models.

LLMs are currently riding a wave of hype, which attracts a lot of people to Data Science – and that's cool! But if you are solving one or two classic problems, using ChatGPT for them may not be the best option (though who's going to stop you).

Look: if we follow the path of a standard, old-but-gold classic model – a classifier, for example – and design and train it ourselves, we pay for DS hours. If we go with an LLM solution, we also pay, but for LLM hosting. The advantage of the classics is the ability to work with the model's output, modify it, and keep training the model itself until it becomes good.

A separate story is reproducibility of results. If that is critically important in your task, then an LLM out of the box (without additional training) also has problems here. Either additional training, or adding a classical model to the pipeline, or fully sticking to the classical approach will save the day.

LLMs are worth using for text generation, summarization, and analysis – that is their main killer feature. We use API-based LLMs when a quick result is needed without lengthy additional training.

Communication with users should definitely be delegated to LLMs. These can be various chatbots, voice bots, and so on. You can also run a blog in which the text is generated by a large language model. After all, human feedback was used during training, so it will write the text in a completely "human" way. You can also add extra instructions – for example, ask it to use a business style or, conversely, to put emoticons at the end of every sentence.
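
For example, a quick summarization call to an API-based model can be as small as the sketch below. This assumes the OpenAI Python SDK (v1+); the model name, prompts, and the summarize helper are placeholders I introduce for illustration, not something from the original text.

```python
# Minimal sketch of API-based summarization, assuming the OpenAI Python SDK (>= 1.0).
# The model name and prompts are placeholders; swap in your provider of choice.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize(text: str, style: str = "business") -> str:
    """Ask the model for a short summary in the requested style."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,  # lower temperature -> more stable output
        messages=[
            {"role": "system", "content": f"You are an assistant. Answer in a {style} style."},
            {"role": "user", "content": f"Summarize the following text in 3 sentences:\n\n{text}"},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(summarize("Large language models are trained on huge text corpora ..."))
```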

The Variety of LLMs: How to Choose and How to Use Them

All models are divided into open and closed.

Proprietary models: Anthropic's Claude, YandexGPT 2, GigaChat, Grok by xAI.

Open models: Llama 2, ruGPT, Mistral, Falcon.

How do you know whether you need an open or a closed model? To decide, determine how important security is to you, whether you want to host the solution in your own data center, and how well the model solves your particular problem in your domain.

Models differ in the datasets they were trained on, training methods, architecture, and other parameters. To get a general picture of the current LLM leaders for your domain, you can use open benchmarks and leaderboards.
For example:
1. huggingface_open_llm_leaderboard
2. Chatbot Arena Leaderboard

But it is important to remember that not all benchmarks can be reproduced, so you cannot blindly rely on a benchmark alone. I recommend selecting several LLMs – at least three or four – and seeing how they behave in practice by running an experiment and comparing the results.

We have covered model selection and benchmarks – it's time to talk about model hosting. But before hosting, there is an important legal aspect: LLM licensing.

Model Licenses Overview

There are currently several types of open-source licenses, each with a different usage policy. It is worth at least knowing they exist, as that can save a lot of energy, time, and most likely money for your LLM startup.

Llama 2 Hardware Requirements

Let's go back to the open-source Llama model. Llama 2 comes in three sizes: 7B, 13B, and the largest, 70B. The number is the parameter count in billions – roughly, how big the model is and how much it "knows". So, Llama 2 70B knows the most. But the larger the model, the more resources it takes to host it. We usually use a server with four Nvidia A100 GPUs for this.

In the picture above you can also see how many hours these models were trained for, and roughly estimate the cost of additional training for each of them.

To use the full version of Llama 2 70B, for example, four A100 GPUs with 40 GB each were not enough for us. The model requires a lot of memory: almost 80% of it is spent simply on loading the weights and the other artifacts the model acquired during training. As a result, if we want to load a context longer than 2,000 tokens, the model runs out of memory.
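
For a rough sense of why 40 GB cards fill up so quickly, here is my own back-of-the-envelope estimate; it ignores the KV-cache, activations, and framework overhead, so real requirements are higher.

```python
# Rough memory estimate for hosting Llama 2 70B weights at different precisions.
# My own back-of-the-envelope numbers: KV-cache, activations and framework overhead
# are ignored, so real usage is noticeably higher.
PARAMS = 70e9  # 70B parameters
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

GPU_MEM_GB = 40  # one A100 40GB
N_GPUS = 4
total_gpu_gb = GPU_MEM_GB * N_GPUS

for dtype, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1024**3
    share = weights_gb / total_gpu_gb * 100
    print(f"{dtype:>5}: ~{weights_gb:5.0f} GB for weights alone ({share:.0f}% of 4 x A100-40GB)")
```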

Given current GPU prices, additional training and inference of "vanilla" full-size models is expensive. But, as always, optimization methods come to the rescue: all kinds of model compression before and after training, pruning of layers, and so on. Let's see what is applied to LLMs.

Quantization and Fine-Tuning: "Remove Cannot Be Used"

I suggest everyone place the comma for themselves, since optimization methods are often presented under the banner of "we don't lose quality" or "we lose very little information – it wasn't needed anyway". But is that really the case?

LLM Quantization

Let's look at the holy of holies – weight quantization. This is a method of optimizing weights after training, and it is a great fit for LLMs, since the final weights and the architecture are often all that is available to us.

It is a fairly simple but interesting trick. All networks are built on matrices and on operations of convolution, or matrix multiplication and addition.

The default type is float32, which covers about 4 billion distinct values in the range [-3.4e38, 3.4e38]. But do we really need that many? Let's consider the standard quantization option: quantizing to int8. This means we take our entire four-billion-value float32 range and compress it into 256 int8 values.

We can't do without value mapping here, but how do we build a good mapping from the set of float32 weights to the set of int8 values with minimal loss of information?

In fact, model weights are approximately normally distributed (why?). Therefore, we allocate the levels of the target int8 set based on the normal distribution: the more values fall within a range, the more levels we allocate there.
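
To make the mapping concrete, here is a minimal sketch of the simplest symmetric (absmax) scheme: compress float32 weights into 256 int8 levels and measure what we lose. Production LLM quantizers (GPTQ, bitsandbytes, and others) are more sophisticated; this only illustrates the idea.

```python
import numpy as np


def quantize_int8(w: np.ndarray):
    """Symmetric (absmax) quantization: map float32 weights onto 256 int8 levels."""
    scale = np.abs(w).max() / 127.0  # size of one int8 step in float32 units
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original weights."""
    return q.astype(np.float32) * scale


w = np.random.normal(0, 0.02, size=4096).astype(np.float32)  # weights are ~normally distributed
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())  # the information we actually lose
```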

The picture below gives a clear picture of the different quantization algorithms and those same levels:

Quantization came to us from Computer Vision, but with the advent of LLMs and their large sizes, new algorithms have appeared – for example, double quantization – and Microsoft has even quantized a model down to int-1.

Fine-Tuning LLMs

Let's recall the MLOps paradigm again: in the case of LLMs, the usual training from scratch is replaced by widespread additional training of the model, commonly known as fine-tuning. The current state of the art (SOTA) in fine-tuning is PEFT (Parameter-Efficient Fine-Tuning).

This is a family of fine-tuning methods. Some of them we have been using for a long time and they are extremely easy to understand – for example, prompt tuning. Some are incredibly popular right now – LoRA – and some are not used by everyone – hello, AttentionFusion. Most likely, if you have encountered fine-tuning in articles or repositories, it was LoRA or QLoRA. They owe their popularity to the fact that they let you fine-tune models quickly using adapters and rank reduction. Let's take a look!

So, LoRA stands for low-rank adaptation. The input is W, the matrix of our initial weights, and it is frozen – we do not update anything in it. Then, during additional training, we want to add new information, that is, a matrix of new weights ΔW, and sum it with the old one. The essence of the method is to reduce the dimensionality of ΔW by representing it through matrices of much lower rank (why is this possible?) and working only with them. That is, we perform the most resource-intensive operations of additional training in the reduced dimensionality, then apply the inverse transformation back to the dimensionality of W, add the matrices W + ΔW (old and new information), and get a new, beautiful output.
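
Here is a toy sketch of that idea (my own illustration, not a reference implementation): the original weight matrix stays frozen, and only the two small low-rank factors A and B are trained.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Toy LoRA layer: y = x @ (W + scaling * B @ A)^T, with W frozen and only A, B trainable."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # low-rank factors, r << d
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init => no change at the start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.B @ self.A  # rank-r update: the "new information"
        return x @ (self.weight + self.scaling * delta_w).T


layer = LoRALinear(d_in=4096, d_out=4096, r=8)
y = layer(torch.randn(2, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable} vs frozen: {layer.weight.numel()}")
```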

QLoRA is LoRA, only with int-4 quantization of the base model. It is a cool feature that lets you run an LLM on a Mac or a CPU-only machine. But in practice the response time and the usable context during a dialogue drop significantly. So I would advise not to quantize down to int-4 right away, but to start with a higher precision.
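
In practice, QLoRA-style setups are usually assembled from ready-made libraries. Below is a sketch of loading a 4-bit base model, assuming recent transformers and bitsandbytes versions and a CUDA GPU; the model id is only an example.

```python
# Sketch of loading a base model in 4-bit for QLoRA-style fine-tuning.
# Assumes recent transformers + bitsandbytes and a CUDA GPU; the model id is an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat, built around normally distributed weights
    bnb_4bit_use_double_quant=True,      # the "double quantization" mentioned above
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-2-7b-hf"    # placeholder: any causal LM you have access to
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                   # spread layers across the available GPUs
)
# LoRA adapters (e.g. via the peft library) are then attached on top of this frozen 4-bit base.
```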

Vector Databases and LangChain

After selecting a model, possibly quantizing it, and fine-tuning it, the question arises of deploying the model itself and the environment around it. At the moment, there are several proven combinations that give good results and have shown themselves well at the production level.

The first thing I want to note here is vector databases. They operate on embeddings, i.e. vector representations, which speeds up search in the database and optimizes storage of a large dataset.

A vector representation (embedding) is the transformation of text, an image, or audio into a vector. The vector size depends on your task and your embedding model.

If you are not satisfied with the dimensionality of the resulting vectors, look into dimensionality reduction and PCA. If your embeddings are poorly separated and get confused with each other, I advise taking a closer look at using Triplet Loss in your embedding model.
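
Before reaching for a dedicated vector database, the core idea can be shown with plain NumPy: embed the texts, then search by cosine similarity. The sketch below assumes the sentence-transformers library, and the model name is just a commonly used example.

```python
# Minimal vector-search sketch without a dedicated vector database.
# Assumes the sentence-transformers library; the model name is just a common example.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How to reset my password",
    "Pricing plans for the enterprise tier",
    "Steps to deploy the model to production",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # one vector per document

query_vec = model.encode(["how do I ship the model to prod?"], normalize_embeddings=True)
scores = doc_vecs @ query_vec.T  # cosine similarity, since the vectors are normalized
print(docs[int(np.argmax(scores))])  # the nearest document
```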

Existing vector databases:

Dedicated: Pinecone, Weaviate, Qdrant, Milvus, Vespa.

General-purpose databases with vector-search support: MongoDB, Cassandra, PostgreSQL, and SingleStore.

LangChain is a framework for developing AI solutions. It contains everything you need: models you can choose for your request, prompt templates, and document indexing. LangChain also has a memory implementation and chains, which makes it a good tool for rapid prototype development.
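
A minimal prototype might look like the sketch below. Note that the LangChain API changes quickly; this assumes an older (pre-LCEL) interface with the langchain and openai packages installed, and the prompt is just an example.

```python
# Minimal LangChain prototype: prompt template + LLM chained together.
# Assumes an older (pre-LCEL) LangChain API and an OPENAI_API_KEY in the environment.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["product"],
    template="Write a short, friendly product description for: {product}",
)

llm = OpenAI(temperature=0.3)
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(product="a vector database for LLM applications"))
```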

Finale – What's New in LLMOps

Having absorbed all this new information, it's time to look at it from the perspective of the model's life cycle. You can see how some of its steps have changed slightly: instead of training a model from scratch, in the case of LLMs we have fine-tuning, which is more than enough at the moment; model testing now rarely goes without human evaluation, which was not always required before; and inference has begun to demand more resources and mandatory optimization.

Let's see together what else LLM will change in the usual processes of machine learning!
