Ensemble Methods in Brief, with Examples

Bagging

Bagging, or Bootstrap Aggregating, is a technique that builds multiple models on random subsamples of the data, drawn with replacement. Combining their predictions reduces variance and improves overall accuracy.
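
Before turning to the library classes, here is a minimal sketch of the bootstrap itself, just to make "random subsamples" concrete (the toy array is purely illustrative):

import numpy as np
from sklearn.utils import resample

data = np.arange(10)  # a toy "dataset" of ten points

# Each bootstrap sample draws n points with replacement, so some points
# repeat and roughly a third are left out ("out-of-bag" points)
for i in range(3):
    sample = resample(data, replace=True, n_samples=len(data), random_state=i)
    print(f'Bootstrap sample {i}: {np.sort(sample)}')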

Let's start by implementing Bagging with RandomForestClassifier from the scikit-learn library:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy:.2f}')

Here we use RandomForestClassifier, which builds many decision trees on random subsamples of the data. Combining their votes yields a more robust model.
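
To see that the forest really is just a collection of trees, you can inspect the fitted model from the snippet above via its estimators_ attribute; a single tree usually scores worse than the full ensemble:

# The fitted forest exposes its individual trees
print(f'Number of trees: {len(model.estimators_)}')

# Each tree was grown on its own bootstrap sample; compare one tree to the ensemble
# (.to_numpy() avoids a feature-name warning, since sub-trees are fitted on raw arrays)
single_tree_pred = model.estimators_[0].predict(X_test.to_numpy())
print(f'Single tree accuracy: {accuracy_score(y_test, single_tree_pred):.2f}')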

Now let's see how you can implement Bagging using BaggingClassifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the base model
base_model = DecisionTreeClassifier()

# Train the Bagging model (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_model = BaggingClassifier(estimator=base_model, n_estimators=100, random_state=42)
bagging_model.fit(X_train, y_train)

# Predictions
y_pred_bagging = bagging_model.predict(X_test)

# Evaluate accuracy
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print(f'Bagging Test Accuracy: {accuracy_bagging:.2f}')

We create a DecisionTreeClassifier as the base model and let BaggingClassifier combine the trees' results.
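
A useful bonus of bagging is out-of-bag evaluation: each tree can be validated on the training points left out of its bootstrap sample, giving a test-like estimate without a separate holdout. A sketch using the same training split:

# Out-of-bag evaluation: score each estimator on the training points
# that were not part of its bootstrap sample
oob_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    oob_score=True,
    random_state=42,
)
oob_model.fit(X_train, y_train)
print(f'Out-of-bag score: {oob_model.oob_score_:.2f}')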

Boosting

Boosting is a method that trains models sequentially, with each new model focusing on the mistakes of the previous ones. This yields powerful ensembles that can significantly outperform any of their individual members.

Let's start with the simplest example, implementing AdaBoost:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the base model (a decision stump)
base_model_adaboost = DecisionTreeClassifier(max_depth=1)

# Train the AdaBoost model (again, the parameter is `estimator` in scikit-learn >= 1.2)
adaboost_model = AdaBoostClassifier(estimator=base_model_adaboost, n_estimators=100, random_state=42)
adaboost_model.fit(X_train, y_train)

# Predictions
y_pred_adaboost = adaboost_model.predict(X_test)

# Evaluate accuracy
accuracy_adaboost = accuracy_score(y_test, y_pred_adaboost)
print(f'AdaBoost Test Accuracy: {accuracy_adaboost:.2f}')

We use a DecisionTreeClassifier of depth 1 (a decision stump) as the base model.
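
The sequential nature of boosting is easy to observe: AdaBoostClassifier provides staged_predict, which yields the ensemble's predictions after each added stump, so you can watch accuracy evolve with the number of estimators:

# Accuracy of the fitted model as estimators are added one by one
# (the list may be shorter than n_estimators if boosting stops early)
staged = list(adaboost_model.staged_predict(X_test))
for n in (1, 10, 50, 100):
    if n <= len(staged):
        print(f'{n:3d} estimators: accuracy {accuracy_score(y_test, staged[n - 1]):.2f}')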

Now let's implement Gradient Boosting using XGBoost:

import xgboost as xgb

# Convert the data to XGBoost's DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Model parameters
params = {
    'objective': 'multi:softmax',
    'num_class': 3,
    'max_depth': 3,
    'eta': 0.1
}

# Train the model
xgboost_model = xgb.train(params, dtrain, num_boost_round=100)

# Predictions
y_pred_xgboost = xgboost_model.predict(dtest)

# Evaluate accuracy
accuracy_xgboost = accuracy_score(y_test, y_pred_xgboost)
print(f'XGBoost Test Accuracy: {accuracy_xgboost:.2f}')
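
The raw xgb.train API requires the DMatrix conversion; XGBoost also ships a scikit-learn-compatible wrapper that works directly on arrays and plugs into sklearn tooling. A sketch of roughly the same model via XGBClassifier (the objective and number of classes are inferred from the labels):

from xgboost import XGBClassifier

# Equivalent model through the scikit-learn-style interface
# (learning_rate corresponds to `eta` in the raw parameter dict)
xgb_clf = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
xgb_clf.fit(X_train, y_train)
print(f'XGBClassifier Test Accuracy: {xgb_clf.score(X_test, y_test):.2f}')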

Stacking

Stacking combines the predictions of multiple base models using a meta-model that takes those predictions as input. This lets the meta-model learn how to best weigh, and correct for, the base models' errors.

We implement stacking using StackingClassifier:

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# Define the base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('dt', DecisionTreeClassifier(max_depth=3)),
    ('ab', AdaBoostClassifier(n_estimators=100))
]

# Create the stacking model
stacking_model = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())
stacking_model.fit(X_train, y_train)

# Predictions
y_pred_stacking = stacking_model.predict(X_test)

# Evaluate accuracy
accuracy_stacking = accuracy_score(y_test, y_pred_stacking)
print(f'Stacking Test Accuracy: {accuracy_stacking:.2f}')

In this way, multiple models can be combined to obtain more accurate predictions; here LogisticRegression serves as the meta-model.
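
Under the hood, StackingClassifier builds the meta-model's training set from cross-validated predictions, so the meta-model never sees a base model's predictions on data that model was fit on. A hand-rolled sketch of that idea using cross_val_predict:

import numpy as np
from sklearn.model_selection import cross_val_predict

# Out-of-fold class probabilities from each base model become
# the feature matrix for the meta-model
meta_features = np.column_stack([
    cross_val_predict(clf, X_train, y_train, cv=5, method='predict_proba')
    for _, clf in base_models
])
print(f'Meta-feature matrix shape: {meta_features.shape}')  # (n_train_samples, 3 models * 3 classes)

meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_features, y_train)

At prediction time, the base models (refit on the full training set) produce probabilities for new samples and the meta-model combines them; StackingClassifier automates both steps.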

Now let's see how we can implement stacking for regression:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

# The iris targets are class labels, so we switch to a regression dataset
X_reg, y_reg = load_diabetes(return_X_y=True)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# Define the base regressors
base_regressors = [
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('lr', LinearRegression())
]

# Create the stacking model
stacking_regressor = StackingRegressor(estimators=base_regressors, final_estimator=LinearRegression())
stacking_regressor.fit(X_train_reg, y_train_reg)

# Predictions
y_pred_stacking_regressor = stacking_regressor.predict(X_test_reg)

# Evaluate the model (for regressors, .score returns the R^2 coefficient, not accuracy)
r2_stacking_regressor = stacking_regressor.score(X_test_reg, y_test_reg)
print(f'Stacking Regressor Test R^2: {r2_stacking_regressor:.2f}')

So stacking can be used not only for classification but also for regression; in this case we combine RandomForestRegressor and LinearRegression.

