An LLM chatbot at the heart of a concierge service

Learning from our mistakes

Large language models like GPT have taken the IT world by storm, forcing everyone involved, from data scientists to traditional web developers, to reconsider their priorities. Remember when the web managed without GPT? Today hardly any startup can do without elements of artificial intelligence, and even quite traditional projects are inevitably drawn into using the new tools. This explosion of interest has not only sparked a wave of reskilling across the industry, it has also forced many teams to rethink established approaches to machine learning and to product development in general. The trouble with any rapidly evolving field, though, is that it leaves too many open questions.

When my team and I set out to build the MVP of our concierge service for hotel booking (a small part of the big world of travel technology), it seemed like an area where no significant problems had been left unresolved in a long time; all we had to do was make the process more flexible and convenient ourselves. In practice, of course, everything turned out to be somewhat more complicated.

Using a standard set of tools (Python for ML, custom backend code and web interfaces) in combination with the new commercial APIs opened up new horizons for us, but it also came with plenty of pitfalls that are quite typical for startups of this kind. We decided to write this short article about how we dealt with them. We hope our lessons will help you avoid our mistakes and speed up the development of your own prototype.

Step zero: Only basic tools

For the prototype of our chatbot we decided to use a minimal set of tools: a simple bot front end on socket.io for the interface, Node.js for the server side, and gpt-4-turbo-preview from OpenAI as the heart of it all. We did consider other options, including open-source solutions and competing commercial APIs, but in the end we settled on the most obvious ones. It is important to note that our findings are not tied to a specific architecture or product; they describe a general approach to using modern generative models.

The main idea was not to get bogged down in the complexities of the technical implementation. We needed a system that could flexibly pass data back and forth between the user and the AI. We deliberately left more complex features, such as advanced chat history persistence, context analysis and branching dialogue mechanics, outside the scope of the MVP: make it simple before you make it complicated. To our surprise, it was this approach that produced the best results, as we will show below.
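To make this concrete, here is a rough sketch of the kind of wiring we mean, assuming the official openai Node SDK and socket.io; the event names and the system prompt are illustrative placeholders rather than our production code:

```ts
// Minimal wiring: socket.io relays user messages to the OpenAI chat API and back.
import { Server } from "socket.io";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const io = new Server(3000, { cors: { origin: "*" } });

io.on("connection", (socket) => {
  // "user_message" / "bot_message" are illustrative event names, not a real protocol.
  socket.on("user_message", async (text: string) => {
    const completion = await openai.chat.completions.create({
      model: "gpt-4-turbo-preview",
      messages: [
        { role: "system", content: "You are the concierge of a hotel-booking service." },
        { role: "user", content: text },
      ],
    });
    socket.emit("bot_message", completion.choices[0].message.content ?? "");
  });
});
```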

Step one: Infinity is not the limit!

The very first primitive build (close to what the commercial ChatGPT now somewhat tautologically calls GPTs, and what a little earlier was simply called an "assistant") immediately showed itself to be a star. Our virtual travel assistant not only mastered the art of conversation, it also confidently hallucinated on all sorts of topics, generously promising every kind of miracle, discount and cashback. Of course, most of this was not true, but at the time that seemed like the least of our problems, because the thing worked!

All of our work on top of the base LLM came down to adding short instructions to the opening message: who are we, where are we, and to hell with the details. And, of course, we did not forget to feed the previous dialogue turns back into the system on each iteration, for consistency of responses. This chatbot felt almost alive: it recommended hotels (sometimes outdated or completely fictitious), planned budgets (nearly losing consciousness over currency conversions), and politely ignored any conversation not related to travel, all of which gave us confidence that we were on the right path.
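In code, that whole "step one" setup boils down to something like the following sketch (the system prompt text and the in-memory history are simplified assumptions, not our real implementation):

```ts
// Sketch of the step-one prompt: a short system instruction plus the running dialogue history.
import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

const openai = new OpenAI();

// Kept in memory per session in the prototype; no persistence yet.
const history: ChatCompletionMessageParam[] = [];

const SYSTEM_PROMPT =
  "You are the concierge of a hotel-booking service. " +
  "Recommend hotels, help plan budgets, and politely decline topics unrelated to travel.";

export async function reply(userText: string): Promise<string> {
  history.push({ role: "user", content: userText });

  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo-preview",
    // Previous turns are replayed on every request so the answers stay consistent.
    messages: [{ role: "system", content: SYSTEM_PROMPT }, ...history],
  });

  const answer = completion.choices[0].message.content ?? "";
  history.push({ role: "assistant", content: answer });
  return answer;
}
```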

Step two: The awareness fallacy

When we, feeling like experienced ML masters, discovered that our neural network could not do something we needed, the solution seemed obvious: teach it. We set to work with enthusiasm, rushing to assemble training samples and write glue code. GPT-4 from OpenAI was not yet available for custom training to the broad developer audience, so we would make do with the already rather dated gpt-3.5-turbo, which would do for a demo. Besides, the task did not look overwhelming: take the generative nonsense of the base model as the starting point for fine-tuning, adjust it slightly to reflect the reality of our specific service, add some variability, feed it into the machine, wait half an hour, and voilà: the customized model really did start making noticeably fewer factual errors.
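For reference, the additional training itself looked roughly like this sketch: chat-formatted examples in a JSONL file, uploaded and turned into a gpt-3.5-turbo fine-tuning job via the OpenAI API (the file name and example content here are made up for illustration):

```ts
// Sketch of the fine-tuning attempt we later abandoned: chat-formatted JSONL examples
// are uploaded and a gpt-3.5-turbo fine-tuning job is started.
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// training.jsonl: one {"messages": [...]} object per line, e.g.
// {"messages":[{"role":"system","content":"You are a hotel concierge."},
//              {"role":"user","content":"Where should I stay in Lisbon on a budget?"},
//              {"role":"assistant","content":"Consider guesthouses in Alfama or Graça..."}]}
async function startFineTuning() {
  const file = await openai.files.create({
    file: fs.createReadStream("training.jsonl"),
    purpose: "fine-tune",
  });

  const job = await openai.fineTuning.jobs.create({
    training_file: file.id,
    model: "gpt-3.5-turbo",
  });

  console.log("fine-tuning job started:", job.id);
}

startFineTuning();
```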

However, the key problem with this approach showed up immediately: our chatbot was rapidly getting dumber even relative to the base 3.5, and compared with the latest fourth version it was inferior in everything, above all in the amount of machine nonsense it produced. There was much more of it, and in places the generation literally began to stutter. We could have pushed further extensively, refining the training dialogues and expanding the subject area, but at that point we came to our senses and, fortunately, did not.

In essence, we were trying to turn a general-purpose LLM into a narrowly specialized one, and the "thin client" between the model and the user was rapidly growing fat, turning into a juggling act of different models and a pile of regular expressions, dragging development into the realm of, well, yes, traditional old-school ML with an emphasis on NLP, just at a different technological level provided by the newfangled transformer architecture. That was not what we had originally set out to do. Again, as we will show later, stopping here was the right decision; otherwise we would have headed off in entirely the wrong direction.

Step three: A second life for training dialogues

Fortunately, one artifact of the failed previous attempt was a set of ready-made dialogues of the kind we wanted our electronic concierge to produce. Since its main function is to prompt and advise rather than to replace the basic interfaces, it made sense to skip the lengthy preliminaries and get straight to the point.

Namely: since naively stuffing raw project documentation into the assistant's "header" is not the most economical approach (token consumption grows with the size of the prompt on every request), the help texts have to be either shortened or cut into thematic pieces, and that was exactly what generating the training dialogues had given us. So the main idea at this stage was the dosed insertion of these dialogues into the context as answers supposedly already given earlier by the assistant, and voilà: the level of nonsense dropped to a minimum while the prompts barely grew in volume.

An elegant solution came in the form of integration with Elasticsearch. We began feeding the chatbot the most relevant answers from our question-and-answer database as an augmentation, adding a little randomness and filtering by keywords from our project dictionary. This significantly reduced the level of machine nonsense in the responses while keeping the prompt size the same. In effect, we first looked up the most suitable "hat" for the assistant, and it then answered with that hat on. This approach not only improved the quality of the answers but also created a clear data flow in which any new documentation was mechanically turned into clear instructions for the bot.
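A minimal sketch of that augmentation loop, assuming a hypothetical "concierge-faq" index with question/answer fields, might look like this:

```ts
// Sketch of the Elasticsearch augmentation: pull the most relevant Q&A pairs and splice
// them into the context as if the assistant had already answered them earlier.
import { Client } from "@elastic/elasticsearch";
import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

const es = new Client({ node: "http://localhost:9200" });
const openai = new OpenAI();

interface FaqDoc {
  question: string;
  answer: string;
}

export async function answerWithFaq(userText: string): Promise<string> {
  // Full-text match against stored questions; a few top hits are enough for a "hat".
  const result = await es.search<FaqDoc>({
    index: "concierge-faq",
    size: 3,
    query: { match: { question: userText } },
  });

  const fewShot: ChatCompletionMessageParam[] = result.hits.hits.flatMap((hit) =>
    hit._source
      ? [
          { role: "user" as const, content: hit._source.question },
          { role: "assistant" as const, content: hit._source.answer },
        ]
      : []
  );

  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo-preview",
    messages: [
      { role: "system", content: "You are the concierge of a hotel-booking service." },
      ...fewShot,
      { role: "user", content: userText },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```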

Now it was time to teach it something the base model did not know at all, or knew only in fragments: how to choose the right hotels. Cities, countries and regions important for tourism the latest gpt-4-turbo-preview already described and recommended well enough right out of the box.

Step four: The prompt is not rubber

And here we ran into the full scale of the problem. The help files were fairly limited in volume, and in fact it was only token economy that stopped us from simply dumping all the key dialogues into the generation at once (don't do that); our database, however, contained more than 3 million hotels, and their raw text descriptions amounted to tens of gigabytes. Our proven method of Elasticsearch plus keywords worked here too, but only if the traveler knew exactly where they were going and why. We invested a lot of time in making sure our chatbot could pull all the necessary trip information out of the dialogue (goals, dates, budget, number of guests, and so on) in order to then suggest destinations and narrow things down to specific cities and countries.

However, when it came to choosing a specific hotel, our bot again showed its limitations, offering only a couple of dozen of the same hackneyed options from a pre-prepared list, which essentially duplicated the functionality of those same boring web interfaces. After all, the main task of any concierge is to offer not just any option, but exactly the one that is ideal for you, taking into account all your wishes and requirements. What is the point of a chatbot if it cannot do the job better than a standard website search?

And here we ran up against token limits that kept our robot from being flexible and versatile enough, especially for destinations as rich as, say, Paris with its huge selection of hotels. The bot gave similar, mechanical answers and could not handle complex requests like "I want to stay away from the center, but with convenient transport access."

Then we realized we needed one key mechanism, part of which we had already built while working on the basic functionality. This mechanism was meant to turn our chatbot into a real electronic concierge, capable of satisfying any traveler's needs.

Step five: A rabbit out of the magician's hat

One of the key features of working with natural-language queries is their ambiguity. It is not enough to know the context; you have to be able to infer the user's intentions even where they are not stated explicitly. Here, for example, is a completely natural dialogue:

— I'm going on vacation for a week.
— Oh, cool, in May?
— If only; whenever work lets me go. Sometime in mid-June.
— As usual, going to Antalya with your family?
— No, this time I'll pick something a little less budget: Greece or Cyprus.

What, from a formal point of view, would even the smartest robot of the pre-LLM era make of this conversation? No exact dates are given, there is no exact location, what does "with family" mean, and is "less budget" cheaper or more expensive? But modern large language models crack such puzzles like nuts. We quickly taught our robot to push the most formalized, machine-readable version of the intent into the context, from which high-level selections from the database were easy to obtain; the only thing that could not be solved head-on was mixing in more precise sets of suggestions: there was not enough personalization.
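The intent formalization step can be sketched roughly as follows, assuming the JSON response mode of the chat API; the field names of the intent object are our own illustrative choice:

```ts
// Sketch of intent formalization: compress the free-form dialogue into a machine-readable
// trip intent that can drive database queries.
import OpenAI from "openai";

const openai = new OpenAI();

interface TripIntent {
  month: string | null;         // e.g. "June"
  durationNights: number | null;
  party: string | null;         // e.g. "family", "couple", "solo"
  budgetLevel: string | null;   // e.g. "budget", "mid-range", "upscale"
  candidateDestinations: string[];
}

export async function extractIntent(dialogue: string): Promise<TripIntent> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo-preview",
    // JSON mode keeps the reply parseable; the schema itself lives in the prompt.
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "Extract the trip intent from the dialogue as JSON with keys: month, " +
          "durationNights, party, budgetLevel, candidateDestinations. Use null for unknowns.",
      },
      { role: "user", content: dialogue },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}") as TripIntent;
}
```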

And here we once again reused our earlier work on the site's knowledge base: using the same gpt-4, we prepared short briefs for each hotel for the main classes of trips (business, beach, romantic, etc.) and price cohorts, loaded them into Elasticsearch, and taught the robot to pick a random 20 out of the 50 hotels most relevant to the current context as an augmentation. We then asked it to analyze them and select the most suitable ones against the user's very specific requirements, such as available parking, accommodation away from the tourist center, and so on.
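The "random 20 out of the top 50" selection is easy to sketch against a hypothetical "hotel-briefs" index; the chosen briefs are then pasted into the prompt, and the model is asked to rank them against the user's specific requirements:

```ts
// Sketch of the "20 random out of the top 50" augmentation for hotel briefs in Elasticsearch.
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

interface HotelBrief {
  hotelId: string;
  brief: string; // short GPT-generated description for a trip class / price cohort
}

export async function pickBriefsForContext(
  intentQuery: string, // e.g. "family beach holiday, mid-range, Greece or Cyprus"
): Promise<HotelBrief[]> {
  const result = await es.search<HotelBrief>({
    index: "hotel-briefs",
    size: 50, // the 50 briefs most relevant to this intent
    query: { match: { brief: intentQuery } },
  });

  const top50 = result.hits.hits
    .map((hit) => hit._source)
    .filter((doc): doc is HotelBrief => Boolean(doc));

  // Shuffle and keep 20: the randomness keeps answers from repeating the same hackneyed list.
  for (let i = top50.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [top50[i], top50[j]] = [top50[j], top50[i]];
  }
  return top50.slice(0, 20);
}
```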

The result was very impressive: the answers were unique, hotels were selected exactly according to the user's requests, the overall variety of offers grew (the robot cheerfully suggested slightly more expensive options, factoring in discounts and cashbacks), and the offers were checked on the fly for availability.

Step six: Train on cats

By that time our chatbot had gained a new feature: it began pulling fresh hotel reviews in real time directly from public platforms. This turned our robot into a real critic: it learned to express doubts about hotels' stars and ratings and to invite users to read live reviews. On the one hand, this gave us a unique tool for revising hotel ratings that looked less than ideal next to real opinions. On the other hand, it let us think about retraining the model more thoroughly, so that it could tell which hotels really deserve attention even without pulling in user opinions.
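Schematically, the review-aware critique might look like the sketch below; fetchRecentReviews here is a hypothetical helper standing in for whatever source-specific fetching we will not go into:

```ts
// Sketch of the review augmentation: recent public reviews are fetched by an external helper
// (not shown, source-specific) and attached to the hotel context before generation.
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical helper wrapping whichever public review source is queried.
declare function fetchRecentReviews(hotelId: string, limit: number): Promise<string[]>;

export async function critiqueHotel(hotelId: string, officialBlurb: string): Promise<string> {
  const reviews = await fetchRecentReviews(hotelId, 5);

  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo-preview",
    messages: [
      {
        role: "system",
        content:
          "You are a concierge. Compare the official description with recent guest reviews " +
          "and flag any mismatch between the advertised rating and real opinions.",
      },
      {
        role: "user",
        content: `Official description:\n${officialBlurb}\n\nRecent reviews:\n- ${reviews.join("\n- ")}`,
      },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```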

This evolution of the bot revealed one drawback: our electronic assistant had become quite resource-hungry. Although the economics were still acceptable to us, we wanted to make the system cheaper to run. The solution? Fine-tune the base model on the specifics of hotel search and recommendations so that it works efficiently without constantly pulling in additional data.

As if by magic, we already had all the groundwork needed to begin such additional training. This opened up the opportunity to create a version of the chatbot that would not only respond to user queries accurately, but do so with lower augmentation costs. Ultimately, our efforts brought us back to the starting point of our MVP, but with a more mature codebase and a better understanding of our goals, capabilities and challenges. Which is as it should be: ideally, any development is a looped, never-ending process in which each innovation leads to new ideas and improvements.

Conclusions

So, while developing our concierge, we found that the most promising way to work with GPT was a combination of prompt engineering and data augmentation. This approach lets us take full advantage of pre-trained commercial LLMs while avoiding a pile of crutches and keeping the level of generative nonsense under control.

It is, in fact, the only way we found to move carefully, with predictable results, toward deeper fine-tuning of specialized models without losing their cognitive abilities. Our scheme also makes it possible to balance flexibly between models of different "gluttony": for example, GPT-3 can be responsible for collecting intent and context, while the source data store itself, even at significant volumes, poses no noticeable problem for such a setup.

Although we did not arrive at this solution right away and took a few wrong turns along the way, the level of reuse of our work and codebase was surprisingly high for an R&D project with such a high level of uncertainty. We hope our experience will be useful to colleagues looking for effective ways to bring LLMs into their products.
