LLM analysis of winning solutions

About a year ago, a Kaggle enthusiast named Darek Kleczek ran an interesting experiment: he collected all the available writeups of winning Kaggle solutions from recent years, ran them through an LLM, and compiled general statistics about which techniques and algorithms turn out to be the most "successful." His report turned out to be quite voluminous, interesting, and at times unpredictable. This article is a loose retelling of it. Let the prologue be a quote from Darek:

“I love winning. When I join a competition, I study winning solutions from past similar competitions. They take a lot of time to read and understand, but they are an incredible source of ideas and knowledge. But what if we could learn from all competitions? This seems impossible, because there are so many of them! Well, it turns out that with LLM we can do it…”

Mechanics

First, let's look at how all of this was implemented. Darek's analysis is built on OpenAI's GPT-4 API. He used the model in two stages: first to extract structured data from the solution writeups, and then to clean the extracted lists of methods of noise. The input was this dataset.

In the first step, each writeup is fed to the LLM so that the model extracts a list of the machine learning methods it mentions. It would seem that this should be enough, but the output turns out to be a rather noisy list, so we move on to the second step. Here all the extracted methods are combined and sent to the LLM again, this time with the task of standardizing the list, that is, mapping each method to a cleaner category or, if you like, a cluster. For example, "blur" and "rotation" are merged into the single category "augmentation."
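
To make the pipeline concrete, here is a rough sketch of the two-stage idea, assuming the current openai Python SDK; the prompts and function names are illustrative, not Darek's actual code:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def extract_methods(writeup: str) -> str:
    # Stage 1: pull a raw list of ML methods out of a single solution writeup.
    # The prompt wording here is paraphrased for illustration.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "List every machine learning method mentioned in this Kaggle writeup, one per line."},
            {"role": "user", "content": writeup},
        ],
    )
    return response.choices[0].message.content

def standardize_methods(raw_methods: list) -> str:
    # Stage 2: map the noisy method names onto a smaller set of standard categories.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Group these method names into standardized categories "
                        "(e.g. 'blur' and 'rotation' both map to 'Augmentation')."},
            {"role": "user", "content": "\n".join(raw_methods)},
        ],
    )
    return response.choices[0].message.content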

The resulting table: the methods column contains the noisy original method names, and the cleaned_methods column contains the standardized category.

The author mentions that he also did some additional manual data cleaning afterwards: he corrected some LLM hallucinations and removed noise the model could not cope with. Darek posted all the code here. By the way, he also shared the resulting dataset, so you can experiment with it yourself 🙂

The most popular methods and techniques

Well, now let's move on to the most interesting part and see which tricks and methods are most often used by the winners of Kaggle competitions. By the way, if any of the methods shown in the graph below are unfamiliar to you, subscribe to our tg channel Data Secrets: we post daily analyses of the latest papers, useful materials, and news from the world of ML.

So, the histogram above shows that ensemble methods hold a confident first place in the ranking. Perhaps that's not surprising: really, who among us hasn't tried to stack or blend 100,500 models?
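
For reference, the simplest version of this trick, a weighted blend of predictions from several models, fits in a few lines (the predictions and weights below are purely illustrative):

import numpy as np

# hypothetical test predictions from three separate models
preds_a = np.array([0.91, 0.12, 0.55])
preds_b = np.array([0.85, 0.20, 0.61])
preds_c = np.array([0.88, 0.05, 0.48])

# blend with weights tuned on validation data (values here are made up)
weights = [0.5, 0.3, 0.2]
blend = weights[0] * preds_a + weights[1] * preds_b + weights[2] * preds_c
print(blend)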

Meme taken from tg channel Data Secrets

It gets more interesting from there. Second place in the ranking of techniques goes to… augmentation. And everyone's favorite boosting slipped down to fourth, overtaken by convolutional neural networks.

Is Augmentation Important?

Or maybe we shouldn’t be so surprised that augmentation took an honorable second place? In real life, the most important aspect of building a good model is the data. On Kaggle, the data is already given to us, and the main way to diversify it and make the dataset larger is augmentation. If you look at this category “under a microscope”, you get the following picture:

Here, in addition to the general category, we see the popular TTA approach (test-time augmentation), a whole set of image transformation methods, and finally, adding data from external sources. Yes, yes, there is nothing to be ashamed of there either, especially now that LLMs are at hand.
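
To make TTA concrete, here is a minimal sketch of horizontal-flip TTA for a PyTorch classifier (the model and the batch of images are assumed to exist already):

import torch

def predict_with_tta(model, images):
    # Test-time augmentation: average predictions over the original batch
    # and a horizontally flipped copy of each image.
    model.eval()
    with torch.no_grad():
        logits = model(images)
        logits_flipped = model(torch.flip(images, dims=[-1]))  # flip along width
    return (logits.softmax(dim=-1) + logits_flipped.softmax(dim=-1)) / 2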

Speaking of Deep Learning

What about CNNs? What are they doing up in third place? Here it is worth remembering that computer vision competitions were incredibly popular on Kaggle in 2018-2023, so it is quite possible that they simply dominate the dataset the analysis was built on. The author was curious about this too, and decided to compare the popularity dynamics of the three main neural network architectures: convolutional networks, recurrent networks, and transformers, which are now considered the default best architecture for almost any task. Here's what came out:

The graph shows deep learning entering the Kaggle scene in 2016. We first got hooked on CNNs, and RNNs soon followed. Transformers were invented in 2017, started appearing on Kaggle in 2019, and reached their peak by 2022. As the popularity of transformers grew, the love for RNNs began to decline noticeably, but CNNs initially continued to thrive. Only in 2022, when the transformer gained real fame, did their popularity start to fall.

Gradient boosting

Here is another graph where gradient boosting joins our Deep Learning trinity:

What can I say: a real dramatic fall. The dominance of gradient boosting before the deep learning era is probably not surprising, especially given the popularity of tabular competitions at the time. Then, as Kaggle added more and more CV and NLP competitions, the method naturally became less common.

But pay attention to 2022: at this point the popularity of boosting starts growing again (the chart, by the way, is on a logarithmic scale). The point is that many participants have adapted to using GBDT in combination with transformers or other DL architectures, for example in NLP competitions. If you want to see how such combinations show up in practice, you can dig into the dataset mentioned above with this script:

# df: the shared dataset (columns include year, competition_name, place, link, cleaned_methods)
combined_methods = df.groupby(['year', 'competition_name', 'place', 'link']).agg({'cleaned_methods': list}).reset_index()
# flag solutions that combine gradient boosting with a deep learning architecture
combined_methods['both'] = combined_methods.cleaned_methods.apply(
    lambda x: 'Gradient Boosting' in x and ('Transformers' in x or 'CNNs' in x or 'RNNs' in x))
# print five random examples of such hybrid solutions
sample = combined_methods[combined_methods.both == True].sample(n=5, random_state=1)
for _, row in sample.iterrows():
    print(f'Place {row.place} in {row.competition_name} ({row.year}): {row.link}')
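
And if you are curious what the combination itself can look like in code, a minimal sketch is a GBDT trained on features produced by a deep model; lightgbm and the random stand-in "embeddings" below are just for illustration:

import numpy as np
import lightgbm as lgb

# stand-in for transformer sentence embeddings (in practice: pooled hidden states)
rng = np.random.default_rng(0)
X_embeddings = rng.normal(size=(1000, 768))
y = rng.integers(0, 2, size=1000)

# a GBDT trained on top of the deep-learning features
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_embeddings, y)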

Losses and optimizers

What else matters when we talk about training a quality model for a competition? Of course, the ability to tune it well: optimizing hyperparameters is one of the most important parts of winning. Darek presented two interesting graphs on this topic: the most "winning" losses and optimizers.

The picture turns out to be quite telling: default losses and optimizers prove to be the most reliable! It's probably not for nothing that they are the defaults 🙂 The Adam family has completely seized power in the world of optimizers. It's about the same with losses: with the exception of focal loss (you can read about it here), most solutions use standard CE/BCE/MSE.
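
For reference, here is a minimal sketch of binary focal loss in PyTorch; alpha and gamma are set to the commonly used defaults from the original paper, not values prescribed by the analysis:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Binary focal loss: down-weights easy examples relative to plain BCE.
    # targets are expected as floats (0.0 / 1.0).
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()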

Labeling and post-processing

It is not for nothing that the author analyzed these categories separately: there is an opinion that this is where all the magic of winning hides. In any case, here are the top methods worth keeping in your arsenal:

It is not surprising that the winners' favorite labeling tricks are pseudo-labeling and label smoothing: they really do work well. As for post-processing, there is definitely some magic in this list: threshold tuning, averaging, weighted averaging – all of it is worth keeping in mind.
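
As one concrete example of such post-processing, tuning the decision threshold on validation data instead of defaulting to 0.5 often gives an easy metric boost. A minimal sketch, assuming a binary task scored with F1:

import numpy as np
from sklearn.metrics import f1_score

def find_best_threshold(y_true, y_prob):
    # Sweep candidate thresholds and keep the one with the best validation F1.
    thresholds = np.linspace(0.05, 0.95, 91)
    scores = [f1_score(y_true, y_prob >= t) for t in thresholds]
    return thresholds[int(np.argmax(scores))]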

Conclusions

So, here's what all of the above teaches us, aka your checklist for the next Kaggle competition:

  • Data is the key: process it correctly and supplement it wisely

  • Don't forget about GBDT, CNN and RNN! You'll always have time to remember about transformers 🙂

  • Don't ignore default configs (optimizers, loss functions)

  • Try pseudo-labeling

  • Put a lot of time and effort into post-processing

  • And, of course, use good old ensembles!

More news, memes and Easter eggs from the world of ML in our tg channel. Subscribe so you don't miss anything!
