Example ML project with Pipelines+Optuna+GBDT
Introduction (how it all started)
It all started when I discovered Kaggle. In particular, I took part in the public Spaceship Titanic competition, a "younger" version of the classic Titanic. At the time of this writing, 2737 people (teams) are participating in the competition. The code demonstrated in this article got me to 697th place on the public leaderboard on my second submission. I know it's not perfect, and I'm working on it.
Data
The training dataset is available at the link. To download it, you need to become a participant in the competition. In addition to the training dataset, a test dataset is available; for obvious reasons, it does not have a target column. There is also a sample submission file.
Data analysis and feature preparation
For data analysis I use Pandas Profiling and Sweetviz. These are very powerful libraries that save a lot of time.
An example of generating a report using Pandas Profiling
from pandas_profiling import ProfileReport

profile_train = ProfileReport(train_data, title="Train Data Profiling Report")
profile_train.to_file("train_profile.html")
profile_train.to_widgets()
An example of generating a report using Sweetviz
import sweetviz as sv

train_report = sv.analyze(train_data)
train_report.show_html(filepath="TRAIN_REPORT.html",
                       open_browser=True,
                       layout="widescreen",
                       scale=None)
train_report.show_notebook(w=None,
                           h=None,
                           scale=None,
                           layout="widescreen",
                           filepath=None)
The training data is balanced by target, has no duplicates, contains a certain number of zero values (which I do not remove), and there is some correlation among the numerical features.
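As a quick sanity check of the correlation claim, pairwise correlations can be computed with pandas. Below is a minimal sketch on toy data with two of the competition's spending columns; on the real data the same check would be `train_data[num_cols].corr()`.

```python
import pandas as pd

# Toy frame mimicking two of the numeric spending columns
df = pd.DataFrame({
    "RoomService": [0.0, 120.0, 45.0, 300.0],
    "Spa": [0.0, 110.0, 50.0, 290.0],
})

# Pairwise Pearson correlation between numeric features
corr = df.corr()
print(corr.loc["RoomService", "Spa"])
```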
The cabin number field is not useful on its own. But if we extract the deck and the side from it, we get two additional synthetic features for the model.
# Extract the deck from the cabin number
def get_deck(cabin):
    if cabin is None:
        return None
    if isinstance(cabin, float):  # missing values come through as NaN (float)
        return None
    return cabin.split("/")[0]

#print(get_deck('F/1534/S'))
#print(get_deck("G/1126/P"))

train_data['deck'] = train_data.apply(lambda x: get_deck(x.Cabin), axis=1)
test_data['deck'] = test_data.apply(lambda x: get_deck(x.Cabin), axis=1)

# Extract the side from the cabin number
def get_side(cabin):
    if cabin is None:
        return None
    if isinstance(cabin, float):  # missing values come through as NaN (float)
        return None
    return cabin.split("/")[2]

#print(get_side('F/1534/S'))
#print(get_side('G/1126/P'))

train_data['side'] = train_data.apply(lambda x: get_side(x.Cabin), axis=1)
test_data['side'] = test_data.apply(lambda x: get_side(x.Cabin), axis=1)
After that, I remove from the training dataset all fields that are not useful for training the model and separate the numerical and categorical features.
num_cols = ['Age','RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']
cat_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'deck', 'side']
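The column selection above can be wrapped in a small helper that keeps only the modelling columns and splits off the target. This is a minimal sketch, assuming the target column is named `Transported` (as in this competition); the helper name is mine, not the author's.

```python
import pandas as pd

num_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
cat_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'deck', 'side']

def split_features_target(df: pd.DataFrame):
    """Keep only the modelling columns; 'Transported' is the target (assumption)."""
    features = df[num_cols + cat_cols]
    target = df['Transported']
    return features, target

# Toy row with extra columns (Name, Cabin) that simply get left out
demo = pd.DataFrame([{**{c: 0 for c in num_cols + cat_cols},
                      'Transported': True, 'Name': 'x', 'Cabin': 'F/1/S'}])
features, target = split_features_target(demo)
print(list(features.columns))
```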
Setting up pipelines
Pipelines are one of the most powerful modern tools used in machine learning projects. They greatly simplify data preparation and save a lot of time. Yes, you need to make some effort to understand how they are configured and how they work, and to study the documentation, but the time spent will certainly pay off. In addition, there are many useful articles (cookbooks) online on setting up pipelines; this article, for example, helped me a lot.
I created a separate Pipeline for numeric features and a Pipeline for categorical features
num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])

cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
For numerical features I used a SimpleImputer that fills gaps with median values; for categorical features, gaps are filled with the most frequent values. In addition, I applied StandardScaler and OneHotEncoder. By the way, the official scikit-learn documentation has a great article which provides a comparative analysis of many scalers.
From these pipelines, I assemble a ColumnTransformer.
col_trans = ColumnTransformer(transformers=[
        ('num_pipeline', num_pipeline, num_cols),
        ('cat_pipeline', cat_pipeline, cat_cols)
    ],
    remainder="drop",
    n_jobs=-1)
Configuring Pipelines and Optuna
Optuna is a very powerful framework that automates the search for a model's hyperparameters during training. Configuring it properly and starting to use it in a project takes some time; fortunately, the developers' official website has extensive documentation and videos that ease the learning process. Below is the code I use in my project.
def objective(trial):
    # Hyperparameter search space (for CatBoostClassifier)
    param = {
        "objective": trial.suggest_categorical("objective", ["Logloss", "CrossEntropy"]),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type": trial.suggest_categorical(
            "bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]
        ),
        "used_ram_limit": "3gb",
    }
    if param["bootstrap_type"] == "Bayesian":
        param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
    elif param["bootstrap_type"] == "Bernoulli":
        param["subsample"] = trial.suggest_float("subsample", 0.1, 1)

    # Define the ML model that receives the hyperparameters
    estimator = CatBoostClassifier(**param, verbose=False)

    # Plug in the pipelines
    clf_pipeline = Pipeline(steps=[
        ('col_trans', col_trans),
        ('model', estimator)
    ])

    # Compute the quality metric.
    # In this project I compute accuracy via cross-validation
    accuracy = cross_val_score(clf_pipeline, features_train, target_train, cv=3, scoring='accuracy').mean()
    return accuracy

# Initialize the hyperparameter search.
# Intermediate results can be saved to an SQLite database (I commented this code out)
#study = optuna.create_study(direction="maximize", study_name="CBC-2023-01-14-14-30", storage="sqlite:///db/CBC-2023-01-14-14-30.db")
study = optuna.create_study(direction="maximize", study_name="CBC-2023-01-14-14-30")

# Run the hyperparameter search
study.optimize(objective, n_trials=300)

# Print the best result
print(study.best_trial)
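After the study finishes, the usual next step is to refit a final model on the full training set using `study.best_params`. The sketch below shows that pattern in a self-contained way: sklearn's GradientBoostingClassifier and a hand-written `best_params` dict stand in for CatBoostClassifier and Optuna's output, which are not part of this snippet.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for the real training data
X, y = make_classification(n_samples=200, random_state=0)

# Stand-in for study.best_params returned by Optuna
best_params = {"max_depth": 3, "n_estimators": 50}

# Refit a final model on the full training set with the best parameters
final_model = GradientBoostingClassifier(**best_params, random_state=0)
final_model.fit(X, y)
print(final_model.score(X, y))  # training accuracy
```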
A few words about gradient boosting
Any ML model can be assigned to the estimator variable. I experimented with DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression, but I only managed to achieve more or less significant results in competitions after I started using gradient boosting models. This article helped me a lot to understand the material. I experimented with LGBMClassifier, XGBClassifier, and CatBoostClassifier; the attached example uses CatBoostClassifier.
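One convenience of the pipeline approach is that the estimator step can be swapped without touching the preprocessing. A minimal sketch with sklearn estimators (the step names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline(steps=[('scale', StandardScaler()),
                       ('model', LogisticRegression())])
pipe.fit(X, y)

# Swap the estimator step without rebuilding the preprocessing
pipe.set_params(model=DecisionTreeClassifier(random_state=0))
pipe.fit(X, y)
print(type(pipe.named_steps['model']).__name__)
```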
Conclusion
The code I have so far is available on GitHub. It's not perfect, and I continue to improve it (as well as my skills). For example:

- it's pretty slow: the hyperparameter search can take from 2 to 5 hours;
- I also want to work through a few ideas related to the generation of synthetic features.