Feature Ranking with Recursive Feature Elimination in Scikit-Learn

Feature selection is an essential task in any machine learning application, and it is especially important when the dataset has many features. Choosing an appropriate number of features improves model accuracy. We can identify the most important features, and the optimal number of them, by computing feature importances or a feature ranking. In this article, we will take a look at feature ranking.

Recursive Feature Elimination

The first thing recursive feature elimination (RFE) needs is an estimator, such as a linear model or a decision tree.

Such estimators expose feature weights: coefficients (coef_) for linear models and feature importances (feature_importances_) for tree-based models. To select the optimal number of features, the estimator is trained and the features are ranked by these coefficients or importances. The least important features are removed, and the process is repeated recursively until the desired number of features remains.
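To make the process concrete, here is a minimal sketch of the idea (not scikit-learn's actual implementation); the decision tree, the synthetic dataset and the target of 4 features are assumptions chosen purely for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration: 10 features, 4 of them informative
X_toy, y_toy = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

remaining = list(range(X_toy.shape[1]))  # indices of the features still in play
n_features_to_keep = 4

while len(remaining) > n_features_to_keep:
    model = DecisionTreeClassifier(random_state=0).fit(X_toy[:, remaining], y_toy)
    # Drop the feature the tree considers least important
    weakest = remaining[np.argmin(model.feature_importances_)]
    remaining.remove(weakest)

print("Selected feature indices:", remaining)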

Application in Sklearn

Scikit-learn implements recursive feature elimination in the class sklearn.feature_selection.RFE. The class takes the following parameters:

  • estimator – a machine learning estimator that exposes feature importances through a coef_ or feature_importances_ attribute.

  • n_features_to_select – the number of features to select. By default, half of the features are selected.

  • step – an integer giving the number of features to remove at each iteration, or a float between 0 and 1 giving the fraction of features to remove at each iteration.

After fitting, you can access the following attributes (a short standalone sketch follows the list):

  • ranking_ – the ranking of the features.

  • n_features_ – the number of selected features.

  • support_ – an array indicating whether each feature was selected or not.
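As a quick, self-contained illustration of these parameters and attributes, here is a sketch on synthetic data (the make_classification dataset and the logistic-regression estimator are assumptions for the example, not part of the walkthrough below):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Keep 3 features, removing one feature per iteration (step=1)
rfe_demo = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe_demo.fit(X_syn, y_syn)

print(rfe_demo.n_features_)  # 3 – the number of selected features
print(rfe_demo.support_)     # boolean mask of selected features
print(rfe_demo.ranking_)     # rank 1 = selected; higher ranks were eliminated earlier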

Application

As stated earlier, we need an estimator that exposes a feature_importances_ or coef_ attribute. Let's walk through a quick example. The dataset initially has 13 features; we will work on identifying the optimal number of them.

import pandas as pd

df = pd.read_csv('heart.csv')
df.head()

Let's separate the features X from the target y:

X = df.drop(['target'], axis=1)
y = df['target']

We'll split the data into training and test sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Let’s make some imports:

  • Pipeline – to chain feature selection and model fitting inside cross-validation and avoid data leakage.

  • RepeatedStratifiedKFold – for repeated stratified k-fold cross-validation.

  • cross_val_score – for cross-validation scoring.

  • GradientBoostingClassifier – the estimator that we will use.

  • NumPy – to compute the mean of the cross-validation scores.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFE
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

In the first step, we create an instance of the RFE class, specifying the estimator and the number of features to select. In our case we choose 6:

rfe = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=6)

Next, we create an instance of the model we want to use:

model = GradientBoostingClassifier()

We use a Pipeline to chain the steps: in the Pipeline we specify rfe for the feature selection step and the model that will be fitted in the next step.

Then we set up RepeatedStratifiedKFold with 10 splits and 5 repeats. Stratified k-fold cross-validation ensures that each fold preserves the class proportions of the full dataset, and RepeatedStratifiedKFold repeats this k-fold procedure a given number of times with a different randomization on each repetition.
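As a small illustrative check (this snippet is not part of the original walkthrough), the splitter produces n_splits × n_repeats train/test splits, and each test fold keeps roughly the same class balance as y_train:

from collections import Counter
from sklearn.model_selection import RepeatedStratifiedKFold

cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=36851234)

# 10 folds x 5 repeats = 50 train/test splits in total
print(cv_demo.get_n_splits(X_train, y_train))

# Inspect one split: the test fold preserves the class proportions of y_train
train_idx, test_idx = next(cv_demo.split(X_train, y_train))
print(Counter(y_train.iloc[test_idx]))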

pipe = Pipeline([('Feature Selection', rfe), ('Model', model)])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=36851234)
n_scores = cross_val_score(pipe, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
np.mean(n_scores)

The next step is to fit the pipeline on the training data.

pipe.fit(X_train, y_train)

Now we can check support_ and ranking_. support_ indicates whether each feature was selected or not.

rfe.support_
array([ True, False,  True, False,  True, False, False,  True, False, True, False,  True,  True])

We can put this in a dataframe and see the result.

pd.DataFrame(rfe.support_, index=X.columns, columns=['Rank'])

We can also see the relative ranking.

rf_df = pd.DataFrame(rfe.ranking_, index=X.columns, columns=['Rank']).sort_values(by='Rank', ascending=True)
rf_df.head()

Automatic Feature Selection

Instead of manually tuning the number of features, it would be nice to find it automatically. This can be done by combining recursive feature elimination with cross-validation, which is what the class sklearn.feature_selection.RFECV provides. It takes the following parameters:

  • estimator – same as for the RFE class.

  • min_features_to_select – the minimum number of features to select.

  • cv – the cross-validation splitting strategy.

After fitting, the following attributes are available (a short standalone sketch follows the list):

  • n_features_ – the optimal number of features selected via cross-validation.

  • support_ – an array indicating whether each feature was selected or not.

  • ranking_ – the ranking of the features.

  • grid_scores_ – the scores obtained from cross-validation.
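Here is a minimal, self-contained sketch of these parameters and attributes (the synthetic dataset and the particular min_features_to_select and cv values are assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

rfecv_demo = RFECV(estimator=GradientBoostingClassifier(), min_features_to_select=3, cv=StratifiedKFold(n_splits=5))
rfecv_demo.fit(X_demo, y_demo)

print(rfecv_demo.n_features_)  # optimal number of features found via cross-validation
print(rfecv_demo.support_)     # boolean mask of the selected features
print(rfecv_demo.ranking_)     # rank 1 = selected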

The first step is to import the class and instantiate it.

from sklearn.feature_selection import RFECV

rfecv = RFECV(estimator=GradientBoostingClassifier())

Next, we define the pipeline and cv. In this pipeline we use the newly created rfecv:

pipeline = Pipeline([('Feature Selection', rfecv), ('Model', model)])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=36851234)
n_scores = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
np.mean(n_scores)

Now we apply the pipeline and get the optimal number of features.

pipeline.fit(X_train, y_train)

The optimal number of features can be read from the n_features_ attribute:

print("Optimal number of features : %d" % rfecv.n_features_)

Optimal number of features : 7

Ranking and support can be obtained in the same way as last time.

rfecv.support_

rfecv_df = pd.DataFrame(rfecv.ranking_, index=X.columns, columns=['Rank']).sort_values(by='Rank', ascending=True)
rfecv_df.head()

Using grid_scores_ we can plot the cross-validation scores against the number of features.

import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

Conclusion

For regression problems the method is applied in the same way; just use a regression metric instead of a classification metric such as accuracy (a short sketch follows below). Hopefully this article has given you some insight into how to choose the optimal number of features for your machine learning tasks.
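For instance, a rough sketch of the regression case might look like this (the synthetic dataset, the GradientBoostingRegressor estimator and the neg_mean_squared_error scoring are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

# Synthetic regression data for illustration only
X_reg, y_reg = make_regression(n_samples=300, n_features=10, n_informative=4, noise=10, random_state=0)

# Same idea as before, but with a regressor and a regression metric
rfecv_reg = RFECV(estimator=GradientBoostingRegressor(), cv=KFold(n_splits=5), scoring='neg_mean_squared_error')
rfecv_reg.fit(X_reg, y_reg)

print("Optimal number of features:", rfecv_reg.n_features_)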

