How I regularly improve the accuracy of my machine learning models from 80% to 90+%

Ahead of the start of our basic Machine Learning course, we have prepared an interesting translation for you, and we also invite you to watch a free recording of a demo lesson on the topic: “How to start making the world a better place with NLP”.


Introduction

If you’ve completed at least a few of your own Data Science projects, you’ve probably already figured out that 80% accuracy isn’t too bad. But in the real world, 80% is often not good enough. In fact, most of the companies I’ve worked for expect a minimum accuracy (or whatever metric they care about) of at least 90%.

Therefore, I’m going to cover five things you can do to significantly improve accuracy. I highly recommend reading all five points, as there are many details that beginners may not know.

By the end of this article, you should realize that far more variables play a role in how well your machine learning model performs than you might imagine.

With that said, here are 5 things you can do to improve your machine learning models!

1. Handling missing values

One of the biggest mistakes I see is that people don’t handle missing values, and it’s not necessarily their fault. Many resources on the Internet say that missing values should be handled by mean/median imputation, that is, replacing empty values with the mean of that feature, and this is usually not the best solution.

For example, suppose we have a table with age and fitness values, and imagine that an eighty-year-old is missing a fitness score. If we impute the average fitness score over the 15-to-80 age range, the eighty-year-old will be assigned a higher value than they really have.

So the first question you should ask yourself is: “why is this data missing?”

Next, let’s look at other methods for handling missing values besides mean/median imputation:

  • Feature prediction modeling: Going back to my example with age and fitness scores, we can model the relationship between age and fitness, and then use that model to predict the expected value. This can be done in several ways, including regression, ANOVA, and others.

  • Imputation with K-nearest neighbors: with K-nearest neighbors (KNN), missing data is filled in with values from other similar samples; for those unfamiliar with it, similarity in KNN is determined by a distance function (e.g., Euclidean distance). See the sketch after this list.

  • Deleting the row: Finally, you can simply drop the row. This is usually not desirable, but it is acceptable when you have a huge amount of data.
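
As a small illustration of the KNN approach, here is a minimal sketch using scikit-learn’s KNNImputer (the tiny age/fitness table is made up for this example, and the choice of 2 neighbors is arbitrary):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: the eighty-year-old is missing a fitness score.
df = pd.DataFrame({
    "age":     [15, 22, 35, 47, 63, 80],
    "fitness": [9.1, 8.7, 7.5, 6.2, 4.8, np.nan],
})

# Each missing value is filled with the mean of its 2 most similar rows,
# where similarity is measured by (nan-aware) Euclidean distance.
imputer = KNNImputer(n_neighbors=2)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled)

Note how the imputed value (about 5.5) is much lower than the overall mean of the fitness column (about 7.3), which is exactly the bias that simple mean imputation would introduce.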

2. Feature engineering

The second way to significantly improve your machine learning model is feature engineering. Feature engineering is the process of transforming raw data into features that better represent the underlying problem you are trying to solve. There is no single prescribed way to do this, which is why Data Science is as much an art as it is a science. That said, here are some things you can focus on:

  • Converting a DateTime variable and extracting from it just the day of the week, the month, the year, etc.

  • Creating groups or bins for a variable (for example, for a height variable you could create groups of 100-149 cm, 150-199 cm, 200-249 cm, etc.)

  • Combining multiple features and/or values to create a new feature. For example, one of the most accurate models for the Titanic challenge created a new variable called “Is_women_or_child”, which was True if the person was a woman or a child, and False otherwise. A pandas sketch of these three ideas follows this list.
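
Here is a rough pandas sketch of these three ideas (the column names and values are hypothetical, purely to show the pattern):

import pandas as pd

# Hypothetical raw data, just to demonstrate the transformations above.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2021-01-04", "2021-02-15", "2021-03-30"]),
    "height_cm":   [148, 172, 201],
    "sex":         ["female", "male", "male"],
    "age":         [29, 8, 41],
})

# 1. Extract parts of a DateTime variable.
df["signup_dow"]   = df["signup_date"].dt.dayofweek
df["signup_month"] = df["signup_date"].dt.month
df["signup_year"]  = df["signup_date"].dt.year

# 2. Bin a continuous variable into groups.
df["height_group"] = pd.cut(df["height_cm"],
                            bins=[100, 150, 200, 250],
                            labels=["100-149", "150-199", "200-249"])

# 3. Combine features into a new one, in the spirit of "Is_women_or_child".
df["is_woman_or_child"] = (df["sex"] == "female") | (df["age"] < 16)

print(df)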

3. Feature selection

The third way to improve the accuracy of your model is feature selection, that is, selecting the most relevant/valuable features in your dataset. Too many features can cause your algorithm to overfit, and too few can cause it to underfit.

There are two main methods you can use for feature selection:

  • Feature importance: Some algorithms, such as random forest or XGBoost, let you determine which features were most “important” for predicting the value of the target variable. By building one of these models and examining its feature importances, you will get an idea of which variables matter most.

  • Dimensionality reduction: One of the most common dimensionality reduction techniques is principal component analysis (PCA). It takes a large number of features and uses linear algebra to reduce them to a smaller number. A sketch of both approaches follows this list.
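
As a compact sketch of both approaches (assuming scikit-learn; the built-in breast cancer dataset and the choice of 5 components are arbitrary, for illustration only):

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Feature importance: fit a random forest and rank the features it relied on most.
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))

# Dimensionality reduction: standardize, then compress 30 features into 5 components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())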

4. Ensemble Learning Algorithms

One of the easiest ways to improve your machine learning model is to choose the best algorithm. If you are not already familiar with ensemble methods, now is the time to get acquainted with them.

Ensemble learning is a technique in which multiple machine learning algorithms are used together. The idea is that, combined, they can achieve better predictive performance than any single algorithm on its own.

The most popular ensemble learning algorithms are random forest, XGBoost, gradient boosting, and AdaBoost. To explain why ensemble learning algorithms are so good, I will give an example with a random forest:

A random forest builds multiple decision trees on bootstrapped samples of the original data. The model then takes the mode (the majority vote) of the predictions of all the decision trees. What’s the point here? By relying on the principle that “the majority wins,” it reduces the risk of error from any individual tree.

For example, imagine four decision trees where three predict 1 and one predicts 0. If we relied on a single decision tree, say the third one, it would give us 0. But if we rely on the majority vote of all 4 trees, the predicted value will be 1. That’s the power of ensemble learning!
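
A minimal sketch of that difference (with scikit-learn, on an arbitrary built-in dataset), comparing one decision tree to a random forest that takes the majority vote over 100 of them:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# One decision tree vs. the majority vote of 100 bootstrapped trees.
tree   = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)

print("single tree:  ", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())

On most datasets the ensemble’s cross-validated accuracy comes out noticeably higher than that of the single tree.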

5. Tuning hyperparameters

Finally, something that is not talked about often enough but is extremely important: tuning the hyperparameters of your model. This is where it’s important to clearly understand the machine learning model you are working with; otherwise it will be difficult to understand what each hyperparameter does.

Take a look at all the random forest hyperparameters:

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

For example, it would be nice to understand what min_impurity_decrease does, so that if you ever want your machine learning model to be more forgiving, you can tweak this parameter! 😉
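
Here is a minimal grid-search sketch over a few of these hyperparameters (the grid values are arbitrary, and the dataset is just scikit-learn’s built-in example):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# An illustrative (not exhaustive) grid over a few of the hyperparameters above.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_impurity_decrease": [0.0, 0.01],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)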

Thanks for reading!

After reading this article, you should have some ideas on how to improve the accuracy of your model from 80% to 90+%. This information will also help you in your future projects. I wish you all the best in your Data Science endeavors.


If you are interested in the course, sign up for a free webinar, where our experts will tell you about the training program in detail and answer your questions.

Read more:

  • Risks and Caveats When Applying Principal Component Method to Supervised Learning Problems
