This article was written by Luca Monno, Head of Financial Analytics at CDP SpA.
One of the most useful and simplest machine learning methods is Ensemble Learning (EL). It is the idea behind XGBoost, Bagging, Random Forest and many other algorithms.
There are many great articles on Towards Data Science, and two stories in particular are my favourites. So why write another article on EL? Because I want to show how it works with a simple example, the one that made it clear to me that there is no magic here.
When I first saw EL in action (working with a few very simple regression models) I couldn’t believe my eyes, and I still remember the professor who taught me this method.
I had two different models (two weak learners) with out-of-sample R² values of 0.90 and 0.93, respectively. Before looking at the result, I thought I would get an R² somewhere between those two values. In other words, I expected the ensemble to perform better than the worse model but not as well as the better one.
To my great surprise, a simple averaging of the predictions gave an R² of 0.95.
At first I started looking for an error, but then I thought that there might be some magic lurking here!
What is Ensemble Learning
Using EL, you can combine the predictions of two or more models to produce a more reliable and efficient model. There are many methodologies for working with model ensembles. Here I will touch on two of the most useful ones to give an overview.
For regression, the predictions of the available models can be averaged.
For classification, you can let the models vote on the label. The label chosen most often becomes the prediction of the ensemble.
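Both approaches fit in a few lines. The following is a minimal sketch in plain Python, not taken from the original article: the function names `average_predictions` and `majority_vote` and all the sample values are made up for illustration.

```python
from collections import Counter

def average_predictions(*model_predictions):
    """Regression ensemble: average the models' predictions point by point."""
    return [sum(values) / len(values) for values in zip(*model_predictions)]

def majority_vote(*model_labels):
    """Classification ensemble: for each sample, keep the most frequent label.

    On a tie, Counter.most_common returns the label that was seen first.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*model_labels)]

# Hypothetical predictions of two regression models on four samples.
reg_a = [1.0, 2.0, 3.0, 4.0]
reg_b = [1.4, 1.8, 3.4, 3.6]
avg = average_predictions(reg_a, reg_b)

# Hypothetical labels from three classifiers on the same four samples.
clf_a = ["cat", "dog", "dog", "cat"]
clf_b = ["cat", "cat", "dog", "dog"]
clf_c = ["dog", "cat", "dog", "cat"]
votes = majority_vote(clf_a, clf_b, clf_c)
```

With an odd number of classifiers and two labels a tie cannot occur. In other settings you would need to decide on a tie-breaking rule yourself.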
Why EL works better
The main reason EL performs better is that every prediction has an error (we know this from probability theory). Combining two predictions can reduce that error and therefore improve the performance metrics (RMSE, R², etc.).
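One way to make this precise is the variance of an averaged error. The derivation below is a sketch under assumptions that the article does not state: the two models have zero-mean errors e_1 and e_2 with variances σ_1² and σ_2², and ρ denotes the correlation between those errors.

```latex
\hat{y}_{\text{ens}} = \tfrac{1}{2}\,(\hat{y}_1 + \hat{y}_2),
\qquad
e_{\text{ens}} = \tfrac{1}{2}\,(e_1 + e_2)

\operatorname{Var}(e_{\text{ens}})
  = \tfrac{1}{4}\left(\sigma_1^2 + \sigma_2^2 + 2\rho\,\sigma_1\sigma_2\right)

\sigma_1 = \sigma_2 = \sigma
\;\Rightarrow\;
\operatorname{Var}(e_{\text{ens}}) = \tfrac{1+\rho}{2}\,\sigma^2
```

With equal variances, perfectly correlated errors (ρ = 1) bring no gain, uncorrelated errors (ρ = 0) halve the variance, and negatively correlated errors reduce it even further.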
The following diagram shows how two weak learners perform on a dataset. The first has a steeper slope than necessary, while the slope of the second is almost zero (possibly due to over-regularization). The ensemble, however, gives a much better result.
In terms of R², the first and second learners score -0.01 and 0.22, respectively, while the ensemble scores 0.73.
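The same effect can be reproduced with a toy dataset. The sketch below does not use the article's data, which is not available here: the points, the noise pattern and the two slopes (2.0 and 0.2) are assumptions chosen by hand, in the same spirit as the hand-picked coefficients in the diagram.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up data: y = x plus an alternating +1 / -1 disturbance.
xs = list(range(-5, 6))
noise = [1 if i % 2 == 0 else -1 for i in range(len(xs))]
ys = [x + n for x, n in zip(xs, noise)]

# Two deliberately weak models with hand-picked (not fitted) slopes.
pred_steep = [2.0 * x for x in xs]   # slope larger than necessary
pred_flat = [0.2 * x for x in xs]    # slope close to zero

# The ensemble is the plain average of the two predictions.
pred_ensemble = [(a + b) / 2 for a, b in zip(pred_steep, pred_flat)]

r2_steep = r2_score(ys, pred_steep)
r2_flat = r2_score(ys, pred_flat)
r2_ensemble = r2_score(ys, pred_ensemble)
```

On this toy data the steep model scores just below zero, the flat model about 0.33, and the ensemble about 0.90, because the two slope errors largely cancel out when averaged.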
There are many reasons why an algorithm might produce a bad model even for a basic example like this: maybe you used regularization to avoid overfitting, or you decided not to exclude some outliers, or maybe you used polynomial regression and chose the wrong degree (for example, you used a second-degree polynomial while the test data shows a clear asymmetry, for which a third-degree polynomial would be better suited).
When EL works better
Let’s look at two other weak learners working with the same data.
Here combining the two models did not significantly improve performance. The R² values of the two learners were -0.37 and 0.22, respectively, and the ensemble scored -0.04. In other words, the ensemble landed roughly at the average of the two scores.
However, there is a big difference between these two examples: in the first example the model errors were negatively correlated, and in the second they were positively correlated. (The coefficients of the three models were not estimated; the author simply chose them as an example.)
Hence, Ensemble Learning can be used to improve the bias-variance trade-off in any case, but it is when the model errors are not positively correlated that EL can lead to better performance than either model alone.
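The role of error correlation can be checked directly. The following sketch uses made-up data and hand-picked slopes (none of them come from the article), and the helper names `pearson` and `compare` are hypothetical. It contrasts two models that err in opposite directions with two models that err in the same direction.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Made-up data: y = x plus an alternating +1 / -1 disturbance.
xs = list(range(-5, 6))
ys = [x + (1 if i % 2 == 0 else -1) for i, x in enumerate(xs)]

def compare(slope_a, slope_b):
    """Score two fixed-slope models and their average; report error correlation."""
    pred_a = [slope_a * x for x in xs]
    pred_b = [slope_b * x for x in xs]
    pred_ens = [(a + b) / 2 for a, b in zip(pred_a, pred_b)]
    err_a = [p - y for p, y in zip(pred_a, ys)]
    err_b = [p - y for p, y in zip(pred_b, ys)]
    return {
        "error_corr": pearson(err_a, err_b),
        "r2_a": r2_score(ys, pred_a),
        "r2_b": r2_score(ys, pred_b),
        "r2_ens": r2_score(ys, pred_ens),
    }

opposite = compare(2.0, 0.2)   # one slope too steep, one too flat
same_side = compare(2.0, 1.6)  # both slopes too steep
```

In the first case the error correlation is roughly -0.78 and the ensemble beats both models. In the second case it is roughly 0.98 and the ensemble's R² ends up between the two individual scores.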
Homogeneous and heterogeneous models
Very often EL is used on homogeneous models (as in this example or in a random forest), but in reality you can combine different kinds of models (linear regression + neural network + XGBoost) built on different sets of explanatory variables. This will most likely lead to less correlated errors and better performance.
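As a small illustration of mixing model types, the sketch below averages a one-feature least-squares line with a k-nearest-neighbours regressor. Both are written from scratch to keep the example self-contained. The helper names and the training data are hypothetical, and the example only shows the mechanics; it makes no claim about how much the ensemble improves.

```python
def fit_linear(xs, ys):
    """Ordinary least squares with one feature; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k=3):
    """k-nearest-neighbours regressor; returns a predict function."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def fit_ensemble(xs, ys, fitters):
    """Fit every model on the same data and average their predictions."""
    models = [fit(xs, ys) for fit in fitters]
    return lambda x: sum(model(x) for model in models) / len(models)

# Made-up, roughly quadratic training data.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [0.1, 0.9, 4.2, 8.8, 16.1, 25.3, 35.8]

linear = fit_linear(train_x, train_y)
knn = fit_knn(train_x, train_y, k=3)
ensemble = fit_ensemble(train_x, train_y, [fit_linear, fit_knn])

new_points = [0.5, 2.5, 5.5]
ensemble_predictions = [ensemble(x) for x in new_points]
```

The two models make different kinds of mistakes here (the line ignores the curvature, while the neighbour average is coarse near the edges), which is the situation in which averaging is most likely to help.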
Comparison with portfolio diversification
EL works similarly to diversification in portfolio theory, but with an even better outcome for us.
When diversifying, you try to reduce the variance of your performance by investing in uncorrelated stocks. A well-diversified portfolio of stocks will perform better than the worst single stock, but never better than the best.
To quote Warren Buffett:
“Diversification is protection against ignorance. It [diversification] makes very little sense for those who know what they are doing.”
In machine learning, EL helps reduce the variance of your model, but it can result in a model with overall performance better than the best original model.
Let’s sum up
Combining multiple models into one is a relatively simple technique that can help with the bias-variance trade-off and improve performance.
If you have two or more models that work well, don’t choose between them: use them all (but with care)!