My experience with the Kaggle Ubiquant Market Prediction competition, or how poor organization can kill any competition

Between January 18 and July 18, 2022, the Ubiquant Market Prediction competition from the Chinese company Ubiquant Investment was held on Kaggle. I took part in it, and my experience was rather negative, primarily because of how poorly the competition was organized, but more on that later. First, let me describe what kind of competition it was.

About the competition

The company specializes in investing and trading in financial markets, and the competition itself was a financial one: based on the available data for each asset (a share, a cryptocurrency token, a fiat currency, etc.), we had to predict the value of one of its indicators (price, income, profit, etc.) over a certain period in the future.

Initially, it was not even clear which specific financial instruments we were working with. The input was an 18.5 GB dataset consisting of an asset id, a timeframe id (its duration was also unknown: half an hour/an hour/a day/a week, and on the test and live data it could differ), a row id (simply timeframe id + asset id), 300 anonymized features (normalized, with periodic components removed or smoothed, noise added, etc.) and a target, also anonymized.
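To give an idea of the scale: just loading this file into Kaggle's RAM was a task in itself. Here is a minimal sketch of the kind of loading code many participants used, assuming the column layout described above (time_id, investment_id, f_0..f_299, target); the specific dtypes are my choice, not something prescribed by the organizers.

```python
import numpy as np
import pandas as pd

# A sketch of loading the 18.5 GB train.csv within Kaggle's memory limits:
# read the 300 feature columns as float32, then downcast them to float16,
# and store the id columns as small integers. Column names are assumed to
# be time_id, investment_id, f_0..f_299 and target, as described above.
feature_cols = [f"f_{i}" for i in range(300)]
read_dtypes = {c: np.float32 for c in feature_cols}
read_dtypes.update({"time_id": np.uint16, "investment_id": np.uint16,
                    "target": np.float32})

train = pd.read_csv(
    "../input/ubiquant-market-prediction/train.csv",
    usecols=list(read_dtypes),
    dtype=read_dtypes,
)
train[feature_cols] = train[feature_cols].astype(np.float16)
print(f"{train.memory_usage(deep=True).sum() / 2**30:.1f} GiB in RAM")
```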

Accordingly, our task was to predict the target for each time period and financial instrument based on the available features. The Pearson correlation coefficient was used as the metric. The peculiarity of this competition was that the final result was evaluated on live market data collected after the close of the competition, from April 18 to July 18, 2022. So a problem as traditional for competitions as data leakage simply could not arise here. The second consequence was that for a successful submission it is not enough to upload a file with model predictions: you have to upload all the code and models needed to produce them. In my view this is a good thing, since it imposes limits on run time (9 hours) and RAM (13-16 GB, depending on the hardware configuration Kaggle runs the code on), which rules out insane solutions like five-level ensembles of 1000 models or monstrous neural networks with billions of parameters.
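For local validation it is useful to reproduce the scoring yourself. As far as I remember, the score was the Pearson correlation averaged over time periods; a minimal sketch (with assumed column names) looks like this:

```python
import pandas as pd

def competition_score(df: pd.DataFrame) -> float:
    # Pearson correlation between predictions and the target, computed
    # separately for each time_id and then averaged. This is my local
    # approximation of the official metric, with assumed column names
    # time_id, target and prediction.
    per_time = df.groupby("time_id").apply(
        lambda g: g["prediction"].corr(g["target"])  # .corr() is Pearson by default
    )
    return float(per_time.mean())
```

With something like this you can at least compare models locally before spending one of the limited daily submissions.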

So what went wrong then?

A crooked dataset

The data format (in particular, row id = timeframe id + asset id) was always the same on the training data, but not always the same on the test data. As a result, submissions from participants who used this column in their predictions worked fine on the training data but failed with an error on the test data. Since, due to Kaggle's competitive specifics, it was impossible to pinpoint the source of the error, we had to work blindly: comment out sections of the code one by one or wrap them in try/except.
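Since the hidden test run only reports a generic scoring error with no traceback, "debugging" came down to defensive coding. A sketch of the pattern (the helper and the derived column names here are hypothetical, just to illustrate the idea):

```python
def add_row_id_features(test_df):
    # On the training data row_id always looked like "<time_id>_<investment_id>",
    # but on the test data the format was apparently not guaranteed, so we
    # wrap the parsing in try/except and fall back to neutral values instead
    # of letting the whole submission crash.
    try:
        parts = test_df["row_id"].astype(str).str.split("_", expand=True)
        test_df["time_id_parsed"] = parts[0].astype("int32")
        test_df["investment_id_parsed"] = parts[1].astype("int32")
    except Exception:
        test_df["time_id_parsed"] = -1
        test_df["investment_id_parsed"] = -1
    return test_df
```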

Also, for some reason, submissions that used feature values from previous timeframes failed with an error. We never did figure out whether the problem was in our code or in the competition API: again, everything always went fine on the training data and crashed on the test data, while other participants apparently used these features successfully.
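For context, by "features from previous timeframes" I mean something like the following cache of the last seen values per asset, joined back in as lag features on the next iteration of the prediction loop (a hypothetical sketch, not the exact code that failed):

```python
import numpy as np

last_seen = {}  # investment_id -> feature vector from the previous timeframe

def add_lag_features(test_df, feature_cols):
    # Look up each asset's features from the previous timeframe; assets we
    # have not seen yet get zeros.
    n = len(feature_cols)
    lag_matrix = np.vstack([
        last_seen.get(inv, np.zeros(n, dtype=np.float32))
        for inv in test_df["investment_id"]
    ])
    for j, col in enumerate(feature_cols):
        test_df[f"{col}_lag1"] = lag_matrix[:, j]
    # Remember the current values for the next timeframe.
    for inv, row in zip(test_df["investment_id"], test_df[feature_cols].to_numpy()):
        last_seen[inv] = row.astype(np.float32)
    return test_df
```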

A crooked API

Literally a week after the start of the competition, first place was taken by a participant with a score of 0.663, which is fantastic by the standards of financial markets. So what was the trick? No, not a data leak.

It turned out that the API through which the system receives model predictions allowed already submitted predictions to be changed retroactively. Where does this lead? Well, suppose we have a feature that correlates strongly with target values from previous timeframes. We make predictions for the current timeframe, then move on to the next one, take the values of that feature from it, and simply replace the predictions of the previous timeframe with them. Of course, by rewriting predictions that have formally already been made, you can get unrealistic results, which is exactly what some participants did.
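In code, the exploit looked roughly like this. The loop follows the shape of the official starter notebook (the ubiquant module with env.iter_test() and env.predict()); the specific feature used as a proxy for the lagged target is a placeholder:

```python
import ubiquant

# env.predict() apparently kept a live reference to the submitted DataFrame
# instead of freezing a copy, so mutating it on a later iteration rewrote
# predictions that had formally already been made.
env = ubiquant.make_env()
prev_pred_df = None

for test_df, sample_prediction_df in env.iter_test():
    if prev_pred_df is not None and len(prev_pred_df) == len(test_df):
        # Overwrite *yesterday's* already-submitted predictions with a
        # feature from *today's* data that correlates with yesterday's
        # target ("f_0" is purely illustrative).
        prev_pred_df["target"] = test_df["f_0"].values

    sample_prediction_df["target"] = 0.0  # today's "honest" prediction
    env.predict(sample_prediction_df)
    prev_pred_df = sample_prediction_df   # keep the reference for next time
```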

As a result, the organizers had to delete such miracle submissions and change the rules, prohibiting any solutions that exploit this bug.

Rules are meant to be broken

Paragraph 7.C of the competition rules stated in plain text that public external data (i.e. various micro- and macroeconomic financial indicators plus other publicly available information) could be used.

Quite quickly, by analyzing how the target depended on time, the participants realized that the financial instruments in question were the so-called Chinese A-shares, that the timeframe was one day, and that the data had been collected from 2014 to 2018. This could be deduced because over a certain time interval the target of most assets behaved erratically, and the duration of this instability coincided with the 2015-2016 turbulence in the Chinese stock market.

As can be seen from the chart, in the period corresponding to timeframe ids from roughly 250 to 600, there are sharp spikes in the number of unique assets as well as in the mean and standard deviation of the target.

Then, knowing the market and the time frame, and assuming that the target is a disguised stock return over the previous day, the participants were able, by measuring the correlation between the target and real data, to match each financial instrument to a specific stock on the Chinese market.
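A sketch of that matching procedure, assuming you have already assembled a table of real daily returns for candidate A-shares aligned to the competition's time_id axis (the real_returns frame below is hypothetical):

```python
import pandas as pd

def match_investments(train: pd.DataFrame, real_returns: pd.DataFrame) -> pd.Series:
    # train: the competition data with time_id, investment_id, target.
    # real_returns: rows indexed by time_id, one column per real ticker
    # holding that day's return. For each anonymized investment_id we pick
    # the ticker whose return series correlates best with its target.
    target_wide = train.pivot(index="time_id", columns="investment_id", values="target")
    matches = {}
    for inv_id in target_wide.columns:
        corr = real_returns.corrwith(target_wide[inv_id])  # Pearson per ticker
        matches[inv_id] = corr.idxmax()
    return pd.Series(matches, name="ticker")
```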

Not that this helped much on the live data, since new shares were also used when evaluating the solutions, but such data could be used to add new features, for example the industry of the company issuing the shares, plus financial indicators for that industry or for the entire Chinese market over the previous year.

There are several threads on the forum discussing this topic, and then, surprise! the competition hosts wrote that this data could not be used. One frustrated participant, who had spent more than a week extracting the data and training a model on it, even created a thread in which he shared everything he had done, since it was now useless. True, the organizers later backtracked and wrote that external data could be used after all, only carefully, because you might overfit and all that.

Another participant later posted a similar dataset publicly, but the organizers then wrote to him and asked him to take it down. Frightened by the prospect of legal trouble with a large investment fund, he first deleted everything, but then restored it. After all, the rules do not forbid it!

More data needed!

A few weeks before the end (!!!) of the competition, the organizers released a supplementary dataset with data for 2019. For the time being this data was fake and was only needed so that participants could check that their models ran correctly on data shaped like the test set, catch errors, and so on; after the end of the competition it would be replaced with real data. In other words, once the competition closed and the test runs began, you could further train your model on this data (so-called online training) to get a better result.
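In practice, "online training" here simply meant refitting (or continuing to train) your model with the supplementary rows appended once they contain real values. A minimal sketch with LightGBM, assuming the supplementary file sits next to train.csv (the file name and columns are my assumption):

```python
import pandas as pd
import lightgbm as lgb

feature_cols = [f"f_{i}" for i in range(300)]
usecols = ["time_id", "investment_id", "target"] + feature_cols

# Load the original training data plus the supplementary 2019 data, which
# is fake until the evaluation phase starts and is then swapped for real rows.
train = pd.read_csv("../input/ubiquant-market-prediction/train.csv", usecols=usecols)
supp = pd.read_csv("../input/ubiquant-market-prediction/supplemental_train.csv",
                   usecols=usecols)

full = pd.concat([train, supp], ignore_index=True)

# Refit on the combined data inside the 9-hour / ~13 GB submission limits
# (in a real notebook you would also downcast dtypes, as sketched earlier).
model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05)
model.fit(full[feature_cols], full["target"])
```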

Note that the organizers made this decision at what was effectively the final stage of the competition, when most participants' pipelines were already in place: the data had been preprocessed, new useful features had been added and useless ones removed, and the models had been selected. All that remained was to tune the hyperparameters, combine the models into an ensemble, and think about post-processing.

And here the organizer completely changed the rules of the game, because most participants simply could not fit this data into their existing solutions: there was not enough RAM. Some had to hastily rework their solutions and chase bugs, some simply ignored the new data, but in the end everyone got a headache.

A crooked API, vol. 2

Along with the new data, the API finally broke. I personally spent half a day trying to figure out what I was doing wrong and why solutions that had worked yesterday suddenly started throwing errors, until I went to the forum and realized I was not the only one. As a result, several submissions and a whole day were lost, and, again, this was the final stretch of the competition, when every submission counts! The problem was fixed the next day.

A crooked API, vol. 3

Guess what happened to some participants' submissions on the last day? That's right, they started failing with errors again! The last day of the competition is clearly the perfect time to dig into the API! As a result, the organizers had to postpone the end of the competition by 48 hours.

Deadlines exist to be missed

After the competitive stage, a three-month evaluation stage followed, during which the participants' solutions were scored against live data from the financial markets. According to the hosts, the leaderboard was supposed to be updated every two weeks.

The first update took place a month later, after being postponed three times. And not everyone who had waited for it was delighted: many submissions simply failed with an error, some were deleted outright without explanation, and for some participants the wrong submissions were scored, not the ones chosen before the close of the competition. In general, a complete mess.

This fine tradition continued with the final update. In the end, some of the erroneous submissions were fixed and some of the deleted ones were restored. But only some.

Winners

Who won this competition?

1st place – 5 LGBMs + 5 TabNets, added 100 new features, used part of the training data plus the additional data.

2nd place – 5 LGBMs, added 100 new features, used all available training data plus the additional data.

3rd place – 5 transformers with 6 layers each, used the 300 original features and all available training data plus the additional data; notably, this solution failed with an error after the first leaderboard update, but fortunately the Kaggle team was able to fix it.

5th place – a single fully connected network with 4 layers, used the 300 original features and part of the training data plus the additional data.

As you can see, at least 4 of the top 5 teams used the additional data, which, let me remind you, was added just a few weeks before the end of the competition and whose use regularly led to inexplicable errors.

Results

Our team made it into the top 20%, which is not too bad given the nature of this competition and the fact that we did not use the additional data. Despite the numerous mistakes and the organizers' strange attitude to their own rules, I learned a lot and gained invaluable experience.

PS

The prize fund of the competition was $100,000. It is sad that people spent such serious money on the competition itself but for some reason were not interested in organizing it properly, and as a result got tons of negativity and many disappointed participants.
