Coolest Data Science Library I Found in 2021

Never waste time tweaking hyperparameters again

I became a data scientist because I enjoy finding solutions to complex problems. The creative part of the job and the insights I get from the data are what I like most. Boring things like data cleaning, preprocessing and hyperparameter tuning give me little pleasure, so I try to automate these tasks as much as possible.

If you also like to automate boring things, you will love the library I’m going to review in this article.

Currently, the field of machine learning is dominated by deep learning for perceptual problems and gradient boosting for regression problems.
Hardly anyone uses linear regression from Scikit-Learn to predict house prices in Kaggle contests these days, because XGBoost is usually more accurate.

However, XGBoost hyperparameters are difficult to tune. There are many of them, and machine learning engineers spend a lot of time setting them up.
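To get a feel for why manual tuning takes so long, here is a small back-of-the-envelope sketch that counts how many model fits an exhaustive grid search would need. The parameter names are real XGBoost hyperparameters, but the candidate values and the fold count are assumptions chosen only for illustration.

```python
from itertools import product

# Hypothetical search grid: the names are real XGBoost hyperparameters,
# the candidate values are made up for this illustration.
param_grid = {
    "n_estimators": [100, 300, 500, 1000],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "min_child_weight": [1, 3, 5],
}


def count_fits(grid, n_folds):
    """Number of model fits an exhaustive, cross-validated grid search needs."""
    n_combinations = 1
    for values in grid.values():
        n_combinations *= len(values)
    return n_combinations * n_folds


# Enumerate every combination explicitly to confirm the count.
combinations = list(product(*param_grid.values()))
print(len(combinations))          # 4 * 4 * 4 * 3 * 3 * 3 = 1728
print(count_fits(param_grid, 5))  # 1728 * 5 folds = 8640
```

Even this modest grid means thousands of training runs, which is why people either tune by hand in small steps or reach for an automated tool.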

Introducing Xgboost-AutoTune

I am glad to share with you the Xgboost-AutoTune library, developed by MIT’s Sylvia Olivia. This library has become my preferred option for automatically configuring XGBoost.

Let’s try it on an example climate dataset. We will predict the temperature increase as a function of greenhouse gas concentrations and evaluate the impact of each gas.

First of all, we will import the dataset and plot the concentrations of CO2, CH4, N2O and synthetic gases:

After running this code, we can see how the amount of greenhouse gases has increased over the past 140 years:

Cool. Now we can import the library I mentioned earlier. In case you haven’t downloaded the repository, I’ll show its code here too:

Basically, all you need to know is that the main function of this library is “fit_parameters”. You just call it, and it does all the hard work of finding the best values for your hyperparameters. Like this:
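To picture the kind of work such a function takes off your hands, here is a minimal, library-free sketch of a stepwise search that tunes one parameter at a time while keeping the others fixed. This is not the library's code or its API: `stepwise_search`, `toy_score` and the grid values are hypothetical, and the toy score simply stands in for a cross-validated error.

```python
def stepwise_search(score, grid, start):
    """Tune one parameter at a time, keeping the others fixed.

    score: callable taking a dict of parameters; lower is better.
    grid:  dict mapping parameter name -> list of candidate values.
    start: dict with an initial value for every parameter.
    """
    best = dict(start)
    best_score = score(best)
    improved = True
    # Keep sweeping over the parameters until a full pass brings no gain.
    while improved:
        improved = False
        for name, candidates in grid.items():
            for value in candidates:
                trial = dict(best, **{name: value})
                trial_score = score(trial)
                if trial_score < best_score:
                    best, best_score = trial, trial_score
                    improved = True
    return best, best_score


# Hypothetical stand-in for a cross-validated error: a bowl whose
# minimum sits at max_depth=5, learning_rate=0.1.
def toy_score(params):
    return (params["max_depth"] - 5) ** 2 + (params["learning_rate"] - 0.1) ** 2


grid = {"max_depth": [3, 5, 7, 9], "learning_rate": [0.01, 0.05, 0.1, 0.3]}
best, best_score = stepwise_search(
    toy_score, grid, {"max_depth": 9, "learning_rate": 0.3}
)
print(best, best_score)
```

A stepwise search evaluates far fewer combinations than a full grid, at the price of possibly missing interactions between parameters. With a real model, `score` would train and cross-validate it for each trial.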

Note that we have specified the scoring metric for the models (in this case RMSLE, the root mean squared logarithmic error), and that the initial model is XGBRegressor because this is a regression problem (the other case would be a classification problem).
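
For readers who have not met the metric before, RMSLE is the square root of the mean squared difference between log(1 + prediction) and log(1 + target). The sketch below is a plain-Python version under that standard definition; the function name `rmsle` is just a label chosen here, not necessarily the one used in the original code.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; lower is better.

    Uses log1p(x) = log(1 + x), so zero values are allowed, but every
    value must be greater than -1.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))


print(rmsle([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0 for a perfect prediction
```

Because the error is computed on logarithms, RMSLE penalizes relative differences rather than absolute ones. To use it for model selection in Scikit-Learn-style tools, it would typically be wrapped as a scorer where lower values are better.
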
Cool. We just built a well-tuned XGBoost model with two lines of code. Now let’s calculate the predictions:

The code will display a graph of the predicted temperature against the real values from the test dataset:

Looks good.

Now, if we want to know which gases contribute most to the warming effect, we can run the following code:

We will get a graph:

As expected, CO2 is the gas with the strongest effect, which is not surprising. But we also see that CH4 has a fairly strong effect and, most importantly, the model was very quick to train.


For regression and classification problems that do not require deep learning, gradient boosting is the most commonly used algorithm, thanks to its high accuracy, interpretability, and speed.

Unfortunately, while the Python ecosystem provides the XGBoost library, it is not as extensive as other libraries such as Scikit-Learn, and data scientists have to set the parameters manually, which is a real pain.
This is why I consider this library to be a gem to share.

My final thought: hiring data scientists is expensive, and their time is best spent on non-trivial work. Can you imagine a sales director making cold calls? Of course not; that is not their job.

Well, unfortunately, many data scientists are jacks of all trades. Their work very often includes finding data, cleaning it, ingesting it, choosing a model, coding it, writing a script to tune it, deploying it, presenting it to the business, and God knows what else.

So, the more automation tools a data scientist has, the more they can focus on their most important job: making sense of data and extracting value from it. I hope you enjoyed this article and that it will help you train your models faster. Happy coding!
