AI Product Hack

How many times have you witnessed judging at a hackathon that, at first glance, seemed wrong? We think there have been many such cases.

Today we are Anna Tishchenko and Minko Bogdan participants AI Security Lab from master's degree Talent Hublet's look at the results AI Product Hack and we will try to figure out who was right after the awards of places: team members annoyed by the defeat or the judges.

In particular, we will consider the case of the Raft company – “Monitoring toxic content in AI products.”

The first fair question is, why toxic content? It's simple. It’s no secret to you, dear reader, that LLM is at the peak of popularity. And when you want to implement a smart assistant or RAG system in production, you will hardly be pleased to see the model’s hallucinating responses, which pose a potential danger. For example, let’s imagine the LLM pipeline integration team who sit in their office and are happy that they were able to save money after replacing a bunch of support operators with one chat bot. But suddenly, suddenly it turns out that for any whim, ill-wishers come, who can’t wait to send 100,500 attacks on the bot containing jailbreaks, prompt injections, etc. After this, no one is happy anymore, because his innovative solution sells goods at a minimum cost, drains users confidential information, behaves like a gigachad from Fourchan and issues dangerous instructions. All this leads to huge financial losses and lowers the company’s trust rating to the very bottom.

We think we are now ready to look at solutions to this problem in order to give an objective assessment to all hackathon teams. (final pitch)

There were 6 teams in total:

Since the total number of teams in the cases from the customer was relatively small, the places were distributed not among one, but across all cases. According to the results of the judges' decisions, the team LLaMasters took first place (for the case on Red Teaming, who by the way got article on Habr). The second place went to the analab team, and the third place went to the moon.

Description of solutions

ytsuken123

In this team, two models were additionally trained based on BERT, designed to identify toxic content. The first model allows binary classification of messages as toxic or not, and the second one performs multi-class classification.

These models were integrated into the user interface, providing interaction with the content that comes from the user's communication with the LLM. Connection of alerts is available.

The guys also implemented functionality to display message history in the form of graphs illustrating the level of toxicity.

Kazan Plova

Here we adapted the DeBERTa model, additionally training it on our own dataset to determine the necessary classes of toxic content. To simplify access to optimization, the model was converted to ONNX. The team used InfluxDB to store and analyze data.

The guys also implemented a Telegram alert system. Additional analytical data was visualized using Grafana.

Analab

The guys from this team distinguished themselves in their approach to the task. They created as many as 7 repositories in order to ultimately implement a decorator, with the help of which it is possible to connect to any service for LLM interaction with the user and monitor content according to the necessary parameters. This way, analyzers can perform any type of testing, either with or without a model. A service was added to each repository.

This particular toxic content analyzer was created using the classic Bert.

In total, they came up with a solution that allows them to use ready-made analyzers and/or add new ones.

We also tweaked the basic UI for user convenience and added the ability to connect alerts.

To the Moon

The guys developed a monitoring system integrated into Grafana. In addition, their solution includes the ability to run locally and interact with the user through a Telegram bot. To implement the ML component, we decided to use two Bert models to detect the user’s toxic prompt at the input and the generated LLM at the output. An analogue is a combination of the multilingual-e5-large embedder, which marks the input prompt according to a certain threshold, as well as Llamaguard 3 at the output. The guys also looked towards LLama Prompt-Guard-86M, LLama 3.1.

Monitox

Similar to the guys above, here we developed a service on Grafana, and also connected a Telegram bot with which the user interacts with it. From an ML perspective they presented two Bert models. One was further trained to classify the user’s prompt, and toxicbert was raised for LLM. The system uses the Mistral API, connecting to Mistral 7B, which also contains prompt instructions for protecting and classifying harmful content.

M.L. Hedgehogs

This team used a combination of TF-IDF and Log Reg to implement the basic model. They then further trained the Saiga Llama 3 8B model for a binary classification task. The Mistral 7B Instruct model was also additionally trained, and after that Saiga Llama 3 8B was used again, but to solve the multiclass classification problem.

Raised a solution through Langsmith. It is worth noting that this is the only team using Langsmith and LLM supplementary training using Lora, which clearly justified itself. It is also important to mention that there was a system prompt, which probably played a role in achieving the result.

Summary

Here we decided to identify the best solutions and also show that Bert is no longer SOTA for this task

Summary	Type of protection	Model	Metrics	Team
Best model	Input	Saiga Llama 3 8B	0.9962	M.L. Hedgehogs
Best model	Output	ruRoPeBert	0.8984	To the Moon

Input – checking the input prompt from the user
Output – check LLM response for toxicity

The metric used is f-beta, where beta=2

As a result, we have many similar solutions from the ML point of view and only a few teams, namely To the Moon and ML Hedgehogs showed a truly unique result.

Validation was carried out on an internal dataset, which none of the presented models had seen. Most of the data consists of unwanted content from both the user and the model. Since the recall metric cannot be the only solution in this situation, we decided to use the f-beta metric with the parameter beta = 2. This allows us to place more emphasis on recall compared to precision, without reducing the importance of the latter. Speaking of precision, all models for the input prompt showed more than 90% accuracy, which is quite good and suggests that there will be fewer errors in the first one, but it is worth remembering that most of the data is toxic typing that breaks LLM, or tries to do so .

Solution architecture

The architecture of all solutions has similar concepts. Let's look at them one by one for each team.

ytsuken123

The client can be either a human or a bot that accesses Nginx, acting as an API Gateway.

The central component is LLM Guard, which is responsible for filtering requests to the LLM model. An important role in this process is played by the LLM Request Filter API, which validates requests to FastAPI. Requests are sent to a queue via Message Queue, based on RabbitMQ, after which they are analyzed by various services, such as Toxic Service, Topic Classifier Service, etc.

The LLM Adapter API sends requests to the LLM. For notifications, services are provided that inform users about the status of the system via Telegram and email.

The system also includes a UI developed using Streamlit and FastAPI. User data is stored in PostgreSQL. System monitoring is carried out through Prometheus.

Kazan Plova

The monitoring system based on the pre-trained DeBERTa model analyzes incoming messages and sends the results to the InfluxDB database. The obtained results are further visualized through Grafana. Notifications about the detection of toxic content come through alerts in Telegram.

Analab

LightHouse Server is the control component of the system. The ability to integrate with ClickHouse has been implemented. Users interact through the Lighthouse UI to configure monitoring and data visualization, and Lighthouse Monitoring collects information and sends notifications to users through a Telegram bot.

To the moon

The guys identified two options for architecture in the format of a user story and the architecture itself. This approach allows you to see the top-level concept on the one hand, and see the details on the other.

Here the story is clear and simple, they decided to split the request into two, one goes to LLM, and the other goes to their service. This approach allows you to save waiting time, but the question arises as to whether it is worth sending a request to LLM and spending resources (computing power or tokens) to receive a response.

Here we see a more detailed transfer of jsons from the user to the developer (the person who conducts the monitoring). Data from the user, having passed the ML block, is collected in the database, and then goes to Grafana, in parallel, alerts will be sent to the Telegram bot.

Monitox

The guys presented a complete json flow diagram.

The formula is the same: the user interacts with the bot in telegram, from there the data first goes to the Rest service with ML, as well as to the Mistral API. This is then collected via Postgres and output to Grafana.

M.L. Hedgehogs

Hedgehogs displayed more ML components than the overall architecture. However, as mentioned above, everything is integrated into Langsmith and also has the ability to run locally.

Quality of code and repository design

During the time we were studying these solutions, many questions had accumulated, most of which were resolved. Nevertheless, it is worth noting that some teams did not write the readme very well, some did not have one at all, which made it difficult to analyze the results. Despite the detailed repository of the To The Moon team, we would advise next time to add to it the installation features of some models, especially Llamaguard3, which required exclusively local launch via ollama. This is not critical, but to check it I had to dig into the repository and try to understand why the HF token on Llama models does not work.

Team	Place at the hackathon	Versatility	Uniqueness	ML Metrics
analab	2	+	7 repositories that divide the solution into separate modules	–
To the Moon	3	+	E5 (embedder), Llama Guard, model on Output	best on output
Monitox	–	+-	model on Output	–
ML-Hedgehogs	–	+-	LLM+Lora (adaptability + reduced train costs)	the best among all on input
ytsuken123	–	+	Best architecture (scalability + reliability + performance)	–
KazanPlova	–	+	Long-term storage of time series in InfluxDB	–

Comparison of hackathon scores and technical implementation

Each team brought something different, and this “own” makes the choice of the winner not so obvious. Let's take ML-Hedgehogs. At first glance, the Hedgehog team, armed with three models trained using Lora, looked like the undisputed leader. But is it worth rushing to conclusions? Let's figure it out.

The ML-Hedgehogs team trained their models on English-language datasets, but despite this, they pleasantly surprised us with the results from the Russian-language test. The model did a really good job with interference on attacks. However, for all good things you have to pay: the Hedgehogs’ solution turned out to be the most resource-intensive. Was it worth it? Maybe.

Meanwhile, the analab and To the Moon teams didn’t turn out so interesting in terms of inference. On the other hand, they have rolled out fully developed solutions, and their services are already ready for use. Turn the models more powerful – and it will turn out impressive.

As for the rest of the teams, there is room for improvement: some were not strong enough in the basic solution, others spent too little time training their own models.

So who should have won?
By moving the leaderboard from the hackathon, we decided to distribute the prizes as follows inside the case:

To the Moon
Analab
M.L. Hedgehogs