Speeding up and lightening the models that power dialogue in the Salute virtual assistants

Based on a report for the Highload conference.

The architecture of the Salute voice assistant solves many problems: speech recognition, text preprocessing, named entity recognition, intent classification, annotation, and skill execution.

All of this is necessary to conduct a meaningful, interesting, empathetic, emotional dialogue. At the same time, the latency budget is no more than one second, meaning all of these tasks must run one after another within that single second.

How the core models in Salute are organized

Core models in Salute are built on large transformer architectures.

At first we had a BERT Large transformer, which occupied about 2 GB of GPU memory – and that is before any further training, so this footprint kept growing. After all, we trained it on huge datasets: millions of different dialogues and a large amount of specialized markup – named entities, intents, part-of-speech tags, text sentiment, and other information. All of this went into our training pipeline with BERT Large.

To make this model represent our text effectively as a vector, we used metric learning.

SBERT and Multi-GPU Learning

To fit all these tasks into our model training pipeline, we used HOROVOD for multi-GPU training. HOROVOD is a library that lets you train models on multiple GPUs out of the box, with both TensorFlow and PyTorch.

We use it like this:

All data is sharded into batches and distributed across several GPU workers, which train the same architecture in parallel. At the start, each replica is initialized randomly and independently; at the end of each training epoch the weights are "summed up" by exchanging gradients between GPUs. In essence, this is the DataParallel multi-GPU training approach. This is how we build an improved model from the weights of the replicas we distributed across the GPUs with HOROVOD. As a result, we got not plain BERT, but SBERT-multitask.
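Below is a minimal sketch of the standard HOROVOD data-parallel pattern in PyTorch. The toy model, random data, and hyperparameters are placeholders rather than our production pipeline; in HOROVOD's usual setup gradients are averaged with allreduce on every optimizer step.

```python
# Minimal sketch of data-parallel training with Horovod (PyTorch backend).
# The toy model and random data stand in for SBERT-multitask and real dialogue data.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                        # one process per GPU
torch.cuda.set_device(hvd.local_rank())           # pin each process to its GPU

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
data = torch.utils.data.TensorDataset(torch.randn(4096, 128),
                                      torch.randint(0, 10, (4096,)))

# Shard the data so every worker trains on its own slice.
sampler = torch.utils.data.distributed.DistributedSampler(
    data, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(data, batch_size=32, sampler=sampler)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())
# Wrap the optimizer so gradients are averaged across GPUs with allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
# Start all replicas from the same weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):
    sampler.set_epoch(epoch)                      # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()                          # allreduce happens inside the hook
```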

SBERT-multitask represents search queries efficiently. On top of these queries, we also teach it to conduct dialogues, identify named entities, and so on. There are several additional tasks as well, so all of this information has to fit into GPU memory.

Lightening Core NLU: Distillation and Compression

We use not only parallelized computation, but also model simplification through distillation and compression – transferring knowledge from a strong teacher model to a weaker student model with minimal loss of quality.

Here we use the LaBSE model as one of the examples; it is one of the best-known recent models with an effective sentence-embedding representation. We distill the embeddings of our heavily trained SBERT model into the LaBSE model. This lets us reduce the dimensionality of the search index by about 3.5 times. We do it with a compressing projection layer: instead of a vector dimensionality of 1024 we get 768, and ultimately we reached a 300-dimensional representation. In addition, distillation inherits useful properties of the teacher, which adds 4% to relevance quality and 10% to search speed. The search index built from the reduced-dimensionality vectors takes only 400 MB of memory.
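As a rough illustration (not our production code), here is a sketch of distilling 1024-dimensional teacher embeddings through a compressing projection down to 300 dimensions; the objective here – preserving the teacher's pairwise similarity structure – and the dimensions are assumptions for the example.

```python
# Sketch: compress 1024-d teacher embeddings into a 300-d representation with a
# projection head trained to preserve the teacher's pairwise similarities.
# The random tensors stand in for real SBERT sentence embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEACHER_DIM, STUDENT_DIM = 1024, 300

class CompressionHead(nn.Module):
    """Maps teacher-sized vectors to the reduced search-index dimensionality."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(TEACHER_DIM, STUDENT_DIM)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

head = CompressionHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

for step in range(1000):
    # Stand-in for a batch of sentence embeddings from the big teacher model.
    teacher_vecs = F.normalize(torch.randn(64, TEACHER_DIM), dim=-1)
    compressed = head(teacher_vecs)
    # Distillation objective: the compressed space should reproduce the
    # teacher's pairwise cosine-similarity matrix.
    loss = F.mse_loss(compressed @ compressed.T, teacher_vecs @ teacher_vecs.T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```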

We need to think not only about making the model fast, compressed, and memory-efficient, but also about keeping its conversational behavior adequate. For that, we have to think about its robustness to attacks.

Speedup and Robust Learning: Core NLU. Adversarial Attack

To make the model robust to attacks, we built two frameworks. The first is simple_aug, which applies all kinds of character-level substitutions to the input phrase, replacing characters with erroneous ones so that the training data matches real error patterns. On top of that we use RuTextFooler, which replaces whole words rather than characters – with words that are similar in spelling but different in meaning. This increases the model's robustness. Both of these modules are now part of our library augmentex.
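For illustration only – this is not the augmentex API – a toy character-level augmenter in the spirit of simple_aug might look like the sketch below; the confusion map is a hypothetical example.

```python
# Toy character-level typo augmentation: randomly swap characters for "neighbouring"
# erroneous ones. A sketch of the idea, not the augmentex library's actual interface.
import random

# Hypothetical confusion map: which wrong characters a given character is often typed as.
CONFUSIONS = {
    "o": ["0", "p", "i"],
    "e": ["3", "w", "r"],
    "a": ["q", "s", "@"],
}

def augment_typos(text, prob=0.1, seed=None):
    """Return a copy of `text` where some characters are replaced with typical typos."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        low = ch.lower()
        if low in CONFUSIONS and rng.random() < prob:
            out.append(rng.choice(CONFUSIONS[low]))
        else:
            out.append(ch)
    return "".join(out)

print(augment_typos("hello how are you", prob=0.3, seed=42))
```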

This robustness not only improves the quality of the model – its relevance and the diversity of its responses – but also affects the convergence rate: you get a better model in 5-7 epochs instead of 10. This speeds up training of the final model by 5-10%.

Sampling for training Core NLU

Since our model is large and trained on many tasks, a batch for each task fits no more than 24-32 samples. Multiplied by 10 tasks, that gives at most about 320 samples in total.

Even a DGX node with 32 GB V100 GPUs can barely fit that, so the batch stays small. But we know that the larger the batch, the better transformers perform.

The potential quality drop caused by the small batch can be compensated with the metric learning approach. We have a user question, a relevant answer, and an irrelevant answer to that question, and we feed the model such "triplets". Previously only unique triplets were used (a question plus its positive or negative pair), but then we started using FullBatching.
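As an illustration of this triplet-based metric-learning objective, here is a minimal PyTorch sketch; the random tensors stand in for encoder outputs, and the margin value is illustrative.

```python
# Sketch of metric learning on (question, relevant answer, irrelevant answer) triplets
# using PyTorch's built-in triplet margin loss.
import torch
import torch.nn.functional as F

anchor   = F.normalize(torch.randn(32, 300), dim=-1)  # question embeddings
positive = F.normalize(torch.randn(32, 300), dim=-1)  # relevant-answer embeddings
negative = F.normalize(torch.randn(32, 300), dim=-1)  # irrelevant-answer embeddings

# Pull each question towards its relevant answer and push it away from the irrelevant one.
loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.2)
```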

Accelerating learning: core NLU. FullBatching

When all the pairs in a batch are unique, then for each sample in the batch every sample that is not at the same row index is also a valid negative, and this information should be used too. This means we effectively have not 32 examples per batch, but n² of them – 32² = 1024 – even though only the original 32 samples are stored in memory.

We use this FullBatching when computing the loss, which lets us take much more information into account. It lets us fit in memory without loading much more data, and increases both convergence speed and accuracy: we see more negative and positive examples at once without increasing the number of stored samples, looking not only at our own pair but at the whole batch.
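A minimal sketch of the in-batch-negatives idea behind FullBatching, with random tensors standing in for question and answer embeddings (the temperature value is an illustrative assumption):

```python
# FullBatching sketch: for a batch of N question/answer pairs, every answer from
# another row serves as a negative, so the loss sees N*N pairs while only N samples
# are stored in memory.
import torch
import torch.nn.functional as F

N, D = 32, 300
questions = F.normalize(torch.randn(N, D), dim=-1)  # placeholder question embeddings
answers   = F.normalize(torch.randn(N, D), dim=-1)  # placeholder answer embeddings

scores = questions @ answers.T / 0.05   # N x N similarity matrix (temperature 0.05)
labels = torch.arange(N)                # the matching answer sits on the diagonal
loss = F.cross_entropy(scores, labels)  # all off-diagonal cells act as negatives
```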

Speeding up learning: p-tuning and adapters

"The cherry on top" is an approach that lets us retrain our models very quickly when new content arrives. All the previous stages take from two weeks to a month on a large supercomputer like Christofari. But when new dialogues appear, the model has to be retrained on them too, and we cannot put the model in a queue for a month or two for every new delivery.

The solution is to use the popular P-tuning or adapters: we freeze the main part of the model and retrain only the feed-forward layers – small modules on top of the transformer – thereby simply shifting the distribution for certain examples or labels.

This allows us to obtain a model with updated content data very quickly, within one sprint or release cycle.
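A minimal sketch of the freeze-and-tune idea (not our exact P-tuning setup): the transformer body is frozen and only a small head is trained. The Hugging Face checkpoint name, head architecture, and intent count are illustrative assumptions.

```python
# Sketch: freeze the transformer backbone and train only a small head on top,
# so new content can be absorbed quickly without retraining the whole model.
import torch
import torch.nn as nn
from transformers import AutoModel

backbone = AutoModel.from_pretrained("bert-base-multilingual-cased")
for p in backbone.parameters():
    p.requires_grad = False                  # the big model stays frozen

num_new_intents = 12                         # illustrative: intents in the new delivery
head = nn.Sequential(                        # small trainable module on top
    nn.Linear(backbone.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, num_new_intents),
)
# Only the head's weights are updated during the quick retraining cycle.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```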

Inference

I'll tell you how we made our models faster and lighter.

Making inference faster and easier

When we talk about speed, we have an upper limit on latency. If we manage to win back a little extra time, we always think about how to spend it on more features or a bigger model, because this directly affects online metrics.

Also, when we talk about speed, we mean the speed of processes – that is, the speed of how quickly changes can be delivered to production, be they bug fixes or new features.

Principles for implementing new ML functionality

It is important that all these improvements come with no degradation in the parameters and principles we have defined for ourselves:

  • metrics we track;

  • ease of maintaining this mechanism;

  • the ability to roll back to previous versions of models;

  • rapid iterations of innovations;

  • scalability.

General scheme of work

Let's take a look at how virtual assistants, or conversational agents, work.

Imagine you have a certain phrase from a user.

There is always a top-level Intent Recognizer module that routes traffic and determines the user's intent: to set a timer, turn on music, or chat.

Next, we get to the module this article is about – conversation on free topics. Here, too, there is an internal intent recognizer, because you will not send traffic directly to the chit-chat model. You have a certain set of intents that, for example, define the character's personality. This is called the character's bible: what is their name, what do they look like, how do they answer. Or these can be custom scenarios or promotions timed to specific dates. That is why this division exists.

The Annotators block is a set of classifiers on top of vectors, context, sentences, or individual tokens that enrich the query. Their predictions can be used either in scenario logic or simply as features in a ranker that selects the most interesting and relevant reply in the dialogue.

Vectorization and adapter models

We have a two-step approach:

There is a base, strong vectorizer model that maps phrases which are similar in meaning but different in spelling to the same point in the semantic space. Then there is a whole galaxy of very simple, lightweight models, each solving a specific task – for example, determining whether the current context is provocative, or identifying the emotion or topic of a query. This is much faster than training 20 huge models, because these models are small.
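A minimal sketch of this two-stage idea, assuming a sentence-transformers checkpoint as the heavy vectorizer and scikit-learn logistic regression as a lightweight annotator head; the model name and the toy data are illustrative.

```python
# Two-stage annotators sketch: one shared vectorizer produces sentence embeddings,
# and each annotator is a tiny classifier trained on top of those embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

vectorizer = SentenceTransformer("sentence-transformers/LaBSE")   # shared, heavy step

texts  = ["turn on some music", "you are so stupid", "tell me a joke"]
labels = [0, 1, 0]                      # toy labels, e.g. 1 = provocative context

embeddings = vectorizer.encode(texts)   # computed once, reused by every annotator

# Lightweight head: cheap to train on a small amount of data, easy to replace.
toxicity_annotator = LogisticRegression().fit(embeddings, labels)
print(toxicity_annotator.predict(vectorizer.encode(["be quiet, you fool"])))
```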

This approach makes it easy to add a new annotator or task: you just train a simple network on a small amount of data. It also makes it easy to swap models, since they are decoupled. But there is one thing that has to be monitored – consistency. When the downstream model depends directly on the vectors of the upstream one, it is important to make sure their versions match, because if the vectors diverge only slightly, simple functional tests may not even notice it.

Once the base vectorizer is fixed, it can be fronted with both a cold and a hot cache. I'll tell you a secret: a cache of 40 thousand distinct phrases covers half of the distribution even in chit-chat. After all, in everyday conversation we use roughly the same phrases: "hello", "how are you", and so on. So you can serve these from a cache and avoid running a large, heavy model yet again.
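A minimal sketch of a hot cache in front of the vectorizer; heavy_vectorizer here is a stand-in for the real SBERT call, and the normalization is a deliberately simplistic example.

```python
# Hot cache in front of the vectorizer: frequent phrases like "hello" or "how are you"
# are served from memory instead of hitting the heavy model.
from functools import lru_cache

def heavy_vectorizer(text):
    # Placeholder for the real SBERT forward pass.
    return (float(len(text)), float(text.count(" ")))

@lru_cache(maxsize=40_000)                 # roughly the cache size mentioned above
def _vectorize_normalized(phrase):
    return heavy_vectorizer(phrase)

def vectorize(phrase):
    # Normalize first so "Hello", "hello " and "HELLO" hit the same cache entry.
    return _vectorize_normalized(" ".join(phrase.lower().split()))

vectorize("Hello")
vectorize("hello")      # served from the cache; the heavy model is not called again
```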

Let's discuss intent recognition blocks and the ChitChat model.

The tasks of “Chat” and simple conversation can be solved in two ways:

  1. Use generative networks that reconstruct the answer autoregressively, token by token – for example, GPT. That is what we do: we use a large network.

  2. Solve it as an information retrieval problem: there is a huge base of prepared replies, you need to annotate it, retrieve the replies most suitable for the given context, and then rank them.

With the approach based on retrieval and intent recognition, the task comes down to vectorizing the current context. This is needed to see which candidates fit best, and it is solved by fast nearest-neighbour search.

FAISS for fast search

From the very beginning we used FAISS, a library for storing and searching information in vector representation. It generally satisfied our requirements, so we kept using it.

The snippet from the FAISS README shows that you can get going with literally three lines of code.
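For reference, roughly the kind of minimal usage the README demonstrates – a flat index built on random data:

```python
# Basic FAISS usage: build a flat index, add the base vectors, query nearest neighbours.
import faiss
import numpy as np

d = 300                                          # dimensionality of our reduced vectors
base = np.random.random((10_000, d)).astype("float32")
queries = np.random.random((5, d)).astype("float32")

index = faiss.IndexFlatL2(d)                     # exact L2 search, no compression
index.add(base)
distances, ids = index.search(queries, 5)        # top-5 nearest candidates per query
```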

But there is a nuance: we use fast nearest-neighbour search for different tasks, and different indexes can be configured in different ways. Since in our case the base does not hold on the order of a billion distinct samples, we do not use Product Quantization to compress the data – it is enough to convert everything to float16.

We look for the trade-off between accuracy, reproducibility, and speed that suits us. So, for retrieval-based intent recognition and chit-chat, at the candidate generation stage we use a combination of our SBERT and FAISS, with all the caches for the base prepared and the settings tuned.

I will mention a point that is not in the README – how you load this index into memory. You can load it on the CPU or on the GPU, or keep only some of the indexes on the GPU. You can serialize the entire large index and load it in one piece, or you can create the index and load the data into it in shards. The second option avoids a memory spike during pod start. This can be quite critical: if you run out of GPU memory, you can get stuck in an endless restart loop.
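A sketch of the sharded option, assuming the shards are pulled from a bucket (here replaced with random data); the GPU transfer lines are optional and require faiss-gpu.

```python
# Load the base into the index shard by shard instead of deserializing one huge file,
# to avoid a memory spike at pod start.
import faiss
import numpy as np

d = 300
index = faiss.IndexFlatIP(d)                    # inner-product index for normalized vectors

def load_shard(path):
    # Placeholder: in reality each shard is read from the bucket; here it is random data.
    return np.random.random((2_000, d)).astype("float32")

for shard_path in ["shard_0.npy", "shard_1.npy", "shard_2.npy"]:
    index.add(load_shard(shard_path))           # memory grows gradually, shard by shard

# Optionally move the finished index to the GPU (requires faiss-gpu):
# res = faiss.StandardGpuResources()
# index = faiss.index_cpu_to_gpu(res, 0, index)
```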

Consistency

Consistency here is about the project as a whole, not just the two-stage annotator system.

The thing is, the first prototypes were made simple: one Docker container held both the project code and the models. Naturally, this is not the way to go. Over time we separated the code from certain artifacts we call "statics" – our models and caches.

This is necessary because in most companies the release processes for code, statics, and model weights move at different speeds. So in our case it is much more convenient to keep only the code and libraries inside Docker: the image is lighter, so it is pushed to the registry and starts faster. The models and caches are stored separately and pulled from a bucket during pod start. This gives flexibility: you can update models independently and still be sure that everything works together.

But here some checks are needed:

  1. Checking the version match between code and statics.

  2. Checking that the vectorizer matches the caches that belong to it.

So we have a number of checks that verify consistency between the cache and the vectorizer's vectors.
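A sketch of what such a check might look like: recompute vectors for a sample of cached phrases and compare them to the stored ones (the cosine threshold is an illustrative choice).

```python
# Consistency check between the vectorizer and its cache: recompute vectors for a
# sample of cached phrases and fail if they drift too far from the stored ones.
import numpy as np

def check_cache_consistency(vectorizer, cache, sample_size=100, threshold=0.99):
    """cache: dict phrase -> stored vector; vectorizer: callable phrase -> vector."""
    for phrase in list(cache)[:sample_size]:
        stored = np.asarray(cache[phrase], dtype="float32")
        fresh = np.asarray(vectorizer(phrase), dtype="float32")
        cos = float(fresh @ stored /
                    (np.linalg.norm(fresh) * np.linalg.norm(stored) + 1e-9))
        if cos < threshold:
            raise AssertionError(f"Cache/vectorizer mismatch on {phrase!r}: cos={cos:.3f}")
    return True
```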

Conclusions

To sum up, here are the points that have been most beneficial for us in terms of training and inference:

  • Multitask training, when the base model is trained on several synergistic tasks – ones that do not pull the shared embeddings apart – which gives a stronger model. Multi-GPU on HOROVOD also makes the batch larger. Together this gives a stronger base vectorizer model.

  • Retrieval at inference time, with reduced dimensionality of the search base. When each sentence in the base is represented by only 300 floats instead of on the order of a thousand float32 values, it saves a lot of memory.

  • Adversarial attacks. When we started, we just wanted robustness to typos and ASR errors, but this turned out to be a really good investment: a very cheap step that improves convergence, lifts the metrics, and makes the model more robust.

  • Distillation.

  • For inference, we recommend:

– Use caches where possible, within reason.

– Use a library for fast nearest-neighbour search.

– Establish CI/CD processes.

– Consider a two-stage approach when you need to annotate or label the current context: one base vectorizer model and many small ones, each solving its own task. They can even be trained at the multitask stage and then tuned separately.

– Convert models to special inference formats. We convert all models to such formats – fortunately, both TensorFlow and PyTorch have them. This avoids unnecessary computation at inference time and compresses the models themselves (see the sketch below).
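As one example of such a format on the PyTorch side (our exact export pipeline may differ), here is a sketch of tracing a toy model into TorchScript.

```python
# Export a model to TorchScript so inference can run without the Python training code.
# The toy model is illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 2)).eval()
example_input = torch.randn(1, 300)

traced = torch.jit.trace(model, example_input)   # freeze the computation graph
traced.save("annotator.pt")                      # deployable artifact for inference

loaded = torch.jit.load("annotator.pt")          # at serving time: no model class needed
with torch.no_grad():
    print(loaded(example_input))
```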
