Example ML project with Pipelines+Optuna+GBDT


Introduction (how it all started)

It all started when I discovered Kaggle. In particular, I am taking part in the public Spaceship Titanic competition, a “younger” version of the classic Titanic. At the time of this writing, 2737 people (teams) are participating in the competition. The code demonstrated in this article got me to 697th place in the public rankings from my second submission. I know it’s not perfect, and I’m working on it.

Data

The training dataset is available at link. To download it, you need to become a participant in the competition. In addition to the training dataset, a test dataset is available; for obvious reasons, it does not have a target column. There is also a sample submission file.

Data analysis and feature preparation

For data analysis I use Pandas Profiling and Sweetviz. These are very powerful libraries that save a lot of time.

An example of generating a report using Pandas Profiling

from pandas_profiling import ProfileReport

profile_train = ProfileReport(train_data, title="Train Data Profiling Report")
profile_train.to_file("train_profile.html")
profile_train.to_widgets()

An example of generating a report using Sweetviz

import sweetviz as sv

train_report = sv.analyze(train_data)

train_report.show_html(filepath="TRAIN_REPORT.html", 
            open_browser=True, 
            layout="widescreen", 
            scale=None)

train_report.show_notebook( w=None, 
                h=None, 
                scale=None,
                layout="widescreen",
                filepath=None)

The training data is balanced on the target and has no duplicates; there are some null values (I do not remove them), and there is some correlation among the numerical features.

The cabin number field by itself does not carry any useful signal. But if I extract the deck and the side from it, I get two additional synthetic features for the model.

# Extract the deck from the cabin number
def get_deck(cabin):
    if cabin is None:
        return None
    if isinstance(cabin, float):  # missing values arrive as NaN (float)
        return None
    return cabin.split("/")[0]

#print(get_deck('F/1534/S'))
#print(get_deck("G/1126/P"))

train_data['deck'] = train_data.apply(lambda x: get_deck(x.Cabin), axis=1)
test_data['deck'] = test_data.apply(lambda x: get_deck(x.Cabin), axis=1)

# Extract the side parameter from the cabin number
def get_side(cabin):
    if cabin is None:
        return None
    if isinstance(cabin, float):  # missing values arrive as NaN (float)
        return None
    return cabin.split("/")[2]

#print(get_side('F/1534/S'))
#print(get_side('G/1126/P'))

train_data['side'] = train_data.apply(lambda x: get_side(x.Cabin), axis=1)
test_data['side'] = test_data.apply(lambda x: get_side(x.Cabin), axis=1)

After that, I remove from the training dataset all fields that are not useful for training the model and split the remaining ones into numerical and categorical features.

num_cols = ['Age','RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']
cat_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'deck', 'side']
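
The article does not show how features_train and target_train are built, so here is a minimal sketch of one way to do it. The dropped columns and the Transported target follow the Spaceship Titanic schema, but this exact code is my assumption, not necessarily the author’s.

# Sketch (assumption): drop identifier/free-text fields and split off the target.
# 'Transported' is the target column in the Spaceship Titanic dataset.
drop_cols = ['PassengerId', 'Name', 'Cabin']

features_train = train_data.drop(columns=drop_cols + ['Transported'])
target_train = train_data['Transported']

features_test = test_data.drop(columns=drop_cols)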

Setting up pipelines

Pipelines are one of the most powerful modern tools used in machine learning projects. They greatly simplify the data preparation process and save a lot of time. Yes, you need to make some effort to understand how they are configured and how they work, and to study the documentation, but the time spent will certainly pay off. In addition, there are plenty of useful articles (cookbooks) on setting up pipelines; this article, for example, helped me a lot.

I created a separate Pipeline for numerical features and another for categorical features:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

For numerical features I use SimpleImputer, which fills the gaps with median values; for categorical features, the gaps are filled with the most frequently occurring values. In addition, I apply StandardScaler and OneHotEncoder. By the way, the official scikit-learn documentation has a great article that provides a comparative analysis of many scalers.

Based on these pipelines, I assemble a ColumnTransformer.

from sklearn.compose import ColumnTransformer

col_trans = ColumnTransformer(transformers=[
    ('num_pipeline', num_pipeline, num_cols),
    ('cat_pipeline', cat_pipeline, cat_cols)
    ],
    remainder="drop",
    n_jobs=-1)
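
As a quick sanity check (my own addition, assuming the features_train frame from the sketch above), the transformer can be fitted once to inspect the shape of the prepared matrix.

# Sketch: fit the ColumnTransformer and look at the transformed shape.
# One-hot encoding expands the 12 input columns into a wider numeric matrix.
X_prepared = col_trans.fit_transform(features_train)
print(X_prepared.shape)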

Configuring Pipelines and Optuna

Optuna is a very powerful framework that automates the selection of model hyperparameters during machine learning. It takes some time to configure it properly and start using it in a project. Fortunately, the developers’ official website has extensive documentation and videos that make the learning process easier. Below is the code I use in my project.

import optuna
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Hyperparameter search space (for CatBoostClassifier)
    param = {
        "objective": trial.suggest_categorical("objective", ["Logloss", "CrossEntropy"]),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type": trial.suggest_categorical(
            "bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]
        ),
        "used_ram_limit": "3gb",
    }

    if param["bootstrap_type"] == "Bayesian":
        param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
    elif param["bootstrap_type"] == "Bernoulli":
        param["subsample"] = trial.suggest_float("subsample", 0.1, 1)
    
    # Define the machine learning model that receives the hyperparameters
    estimator = CatBoostClassifier(**param, verbose=False)

    # Attach the pipelines
    clf_pipeline = Pipeline(steps=[
            ('col_trans', col_trans),
            ('model', estimator)
        ])
    # Compute the quality metric.
    # In this project I compute accuracy via cross-validation
    accuracy = cross_val_score(clf_pipeline, features_train, target_train, cv=3, scoring='accuracy').mean()
    return accuracy

# Initialize the hyperparameter search.
# All intermediate results can be saved to a SQLite database (I commented this line out)
#study = optuna.create_study(direction="maximize", study_name="CBC-2023-01-14-14-30", storage="sqlite:///db/CBC-2023-01-14-14-30.db")
study = optuna.create_study(direction="maximize", study_name="CBC-2023-01-14-14-30")
# Run the hyperparameter search
study.optimize(objective, n_trials=300)
# Print the best trial
print(study.best_trial)
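
After the search finishes, the best hyperparameters can be used to fit a final pipeline and produce a submission file. This is a minimal sketch under my own assumptions: features_test is the prepared test frame from the earlier sketch, and the submission format follows the competition’s sample file.

import pandas as pd

# Sketch (assumptions noted above): train the final model with the best found
# hyperparameters and write a Kaggle-style submission file.
best_model = CatBoostClassifier(**study.best_params, verbose=False)

final_pipeline = Pipeline(steps=[
    ('col_trans', col_trans),
    ('model', best_model)
])
final_pipeline.fit(features_train, target_train)

predictions = final_pipeline.predict(features_test)

# The competition expects True/False values in the Transported column
submission = pd.DataFrame({
    'PassengerId': test_data['PassengerId'],
    'Transported': predictions
})
submission.to_csv('submission.csv', index=False)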

A few words about gradient boosting

Any ML model can be assigned to the estimator variable. I experimented with DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression, but I only managed to achieve more or less significant results in the competition after I started using gradient boosting models. This article helped me a lot in understanding the material. I experimented with LGBMClassifier, XGBClassifier, and CatBoostClassifier. The attached example uses CatBoostClassifier.
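
For illustration, swapping the boosting implementation only requires changing the estimator and the search space inside objective(). Below is a sketch with LGBMClassifier; the parameter names are real LightGBM parameters, but the ranges are my own illustrative choices, not the article’s.

from lightgbm import LGBMClassifier

# Sketch: an alternative search space and estimator, placed inside objective()
param = {
    "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    "num_leaves": trial.suggest_int("num_leaves", 16, 256),
}
estimator = LGBMClassifier(**param)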

Conclusion

The code I have so far is available on GitHub. It’s not perfect, and I continue to improve it (as well as my skills). For example,

  • it’s pretty slow: the hyperparameter search can take from 2 to 5 hours (see the sketch after this list for one way to cap it);

  • I also want to work out a few ideas related to the generation of synthetic features.
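
One simple way to keep the search bounded, as a sketch of something I may try rather than code already in the repository: Optuna’s optimize() accepts a timeout in seconds, so the study stops once the time budget is spent.

# Sketch: cap the total search time in addition to the trial count
study.optimize(objective, n_trials=300, timeout=2 * 60 * 60)  # stop after at most two hours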
