Meet CatBoost 1.0.0

Hello everyone. My name is Stanislav Kirillov, and I work on the team responsible for developing the CatBoost machine learning library. We first shared it with the community four years ago – though, since we spend our days building binary trees, we prefer to count years in binary as well, which makes it 100. That is a joke, of course, but the "centenary" is a fitting occasion for the first production-ready release of the library, with the symbolic version number 1.0.0.

Today I will briefly explain why we consider the 1.0.0 release an important milestone and highlight the main changes, both in the new version and over the past year. And tomorrow I will give a talk at a meetup devoted entirely to the practical use of CatBoost and the rivalry between neural networks and gradient boosting. If these words mean anything to you, read on.


A little background

CatBoost is a gradient boosting machine learning library that we designed to work equally well out of the box with both numerical and categorical features (hence the name, from "categorical boosting" – no cats were harmed during development). We originally created it for use in Yandex services, so it inherited the requirements typical of a large company. For example, our services always run under high load, which makes model inference speed critical for CatBoost.

After the technology was open-sourced, these and other properties turned out to be in demand far beyond Yandex. We began working with the community and with industry peers, and what had started as an internal solution built without outside users in mind grew, commit by commit and version by version (76 of them, no less), acquiring useful improvements, fixes, and tooling.

We started small: several years ago, even within Yandex, the library saw limited use. CatBoost is now used in most of our services, from ranking and music recommendations to rainfall prediction and self-driving cars. In the outside world, we have not only accumulated 6 thousand stars on GitHub (which is a lot), but also see CatBoost used in projects at companies such as Avito, Lamoda, Sber, and Cloudflare.


(Image: just a snapshot of CatBoost on the WWDC 2018 main stage)

The version 1.0.0 we are presenting today is more than just another release. For us as a project team, and for the entire community, it is a sign that the library is no longer merely an internal solution published in open source. It is now a stable, full-featured library ready for production use not only at Yandex but in other companies as well. To back this up, let me walk through several of the changes.

Faster, Faster

In all four years that CatBoost has been publicly available, we have never stopped speeding up training. The graph below shows an example of that progress over 2.5 years on the Higgs dataset, training 100 trees in 28 threads.

In the 1.0.0 release, we sped up CPU training in binary classification mode by 15% to 35%. Beyond that, from March 2020 to the current release we have made CatBoost 1.7 times faster overall, through both algorithmic and engineering improvements. For example, with help from Intel's development team, we improved thread scaling and thereby nearly doubled CatBoost training speed on 128-core machines.

Distributed learning on terabytes of data

When we talk about training on big data across a cluster, the first technology that comes to mind is Apache Spark. In fact, distributed training has been available in CatBoost since 2018, but it was neither familiar nor convenient to use. That is why the ability to run distributed training through familiar interfaces such as Apache Spark was one of our most requested features.

The first version of this support appeared in early 2021, but it could not really be called distributed. Although the training data was indeed spread evenly across the compute nodes, the validation data was stored on the coordinator, which skewed resource consumption: with a large validation set, the coordinator needed a lot of memory to hold it. In release 1.0.0, both the storage of the validation set and the computation of metrics on it are distributed across the compute nodes.

In addition, we fixed several more critical issues, so we now consider Apache Spark support ready for use in production.

CPU vs GPU

CatBoost offers two ways to train models: on the CPU and on the GPU. However, the two have nearly independent boosting implementations, and their feature sets differ slightly. Over the past year we have finally brought them much closer together, both in functionality and in training quality.

In particular, we added support for:

– model size regularization, which penalizes computed features that are too expensive,
– using GPU-trained models from Python,
– a new data sampling method, Minimal Variance Sampling, invented by our researchers (you can read about it here),
– and one more idea from our researchers: training with stochastic Langevin dynamics (more here), which makes it possible to build models that estimate the uncertainty of their own predictions.

Open documentation

Previously, the documentation was written in DITA, a rather awkward XML-based format. The docs had many minor problems – inconvenient rendering of parameters, no way to customize other cosmetic details – along with some significant drawbacks:

– It became difficult to update the documentation and describe features (even for ourselves: the docs simply did not cover some capabilities, and the number of undocumented features snowballed).
– Users could not build the documentation locally.
– External contributors could not make changes.

We have now switched to Markdown, which is convenient and familiar to the entire open source community. The documentation sources live in our repository, so external contributors can finally make changes too. The docs have become cleaner and easier to use: we dropped the table-based layout and made the navigation more transparent.

But this is not the end: a lot of work still lies ahead to update the docs and describe the remaining undocumented features.

Multilabel classification

Multilabel classification mode is now available. Imagine that each of your objects can carry several labels from some set, and you need to predict which labels apply to a given object. CatBoost 1.0.0 introduces exactly this mode – it can be useful, for example, in behavioral analysis to predict the possible characteristics of an object.

R – halfway to CRAN

In the 1.0.0 release, we significantly improved our R package: we added support for text features and embeddings, and supported estimating the uncertainty of model predictions using virtual ensembles.

Our goal now is to get CatBoost into CRAN, the R package repository. We have already fixed most of the blocking problems along the way, but unfortunately we did not manage to resolve everything. Here we look forward to help from contributors who care about the fate of CatBoost in R.


On the one hand, version 1.0.0 is an important milestone for the whole project. On the other, I would like to believe that it is only the beginning of a long and interesting journey.
