AutoVIML: Automated Machine Learning

The translation of the article was prepared especially for the students of the course “Industrial ML on Big Data”

Machine learning algorithms improve automatically with experience. There are many different machine learning algorithms and methods, and you usually have to try several of them to find the best predictive model for your dataset: the one with the highest accuracy.

Most machine learning methods, such as regression and classification models, are available in scikit-learn, but to choose the best one for a particular case we need to try these models together with hyperparameter tuning and find the one that performs best. All this work takes a lot of time and effort, which can be reduced by using the AutoVIML package in Python.
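As a rough illustration of the manual workflow described above, here is a minimal sketch using scikit-learn on synthetic data. The two candidate models, their parameter ranges and the generated dataset are assumptions chosen for brevity; they are not taken from AutoVIML.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (a stand-in for a real dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each candidate is an estimator plus the hyperparameter values to try.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50], "max_depth": [3, 5, None]},
    ),
}

results = {}
for name, (estimator, param_values) in candidates.items():
    # Try 4 hyperparameter combinations per model with 3-fold cross-validation.
    search = RandomizedSearchCV(
        estimator,
        param_values,
        n_iter=4,
        cv=3,
        scoring="f1_weighted",
        random_state=0,
    )
    search.fit(X, y)
    results[name] = search.best_score_

best_name = max(results, key=results.get)
print(best_name, round(results[best_name], 3))
```

Even with only two models this loop needs a fair amount of boilerplate, and it grows with every model and parameter added.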

AutoVIML is an open source Python package that makes machine learning easy. It runs your data through various machine learning models and finds the one with the highest accuracy for a given dataset. There is no need to preprocess the dataset before passing it to AutoVIML: it automatically cleans your data, helps you classify variables, performs feature reduction, and can handle data of various types (text, numeric, dates, etc.) in a single model.

In this article, we will learn how to use AutoVIML to reduce the time and effort involved in creating machine learning models. We will also see how the various AutoVIML parameters affect the predictions.

Install AutoVIML

Like any other Python library, autoviml is installed with pip:

pip install autoviml
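
To confirm that the installation worked before going further, a small check along these lines can help. The helper name is_installed is hypothetical; it relies only on the standard library.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("autoviml"):
    print("autoviml is not installed; run: pip install autoviml")
```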

Importing AutoVIML

from autoviml.Auto_ViML import Auto_ViML

Loading the dataset

Any dataset is suitable for mastering AutoVIML. For this article, I’m using a heart disease dataset that you can download from Kaggle. It has various attributes and a target variable.

import pandas as pd
df = pd.read_csv('heart_d.csv')
df
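
Before handing the data to AutoVIML, it can be worth confirming that the target column is present and seeing how its classes are distributed. The sketch below uses a tiny made-up table in place of heart_d.csv; only the column name TenYearCHD comes from this article, while the other columns and all values are invented for illustration.

```python
import io

import pandas as pd

# Made-up stand-in for heart_d.csv (values are illustrative only).
csv_text = """age,sysBP,TenYearCHD
39,106.0,0
46,121.0,0
48,127.5,0
61,150.0,1
46,130.0,0
43,180.0,0
63,138.0,1
45,100.0,0
"""
df = pd.read_csv(io.StringIO(csv_text))

target = "TenYearCHD"
if target not in df.columns:
    raise KeyError(f"target column {target!r} not found in the dataset")

# Number of missing values per column and share of each target class.
missing_per_column = df.isna().sum()
class_share = df[target].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

A strongly skewed class share is a hint that the imbalance-handling option described later may be relevant.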

Now let’s see how to use autoviml to build a predictive model with this dataset, and which parameters AutoVIML accepts.

# Basic example with all parameters
# (train, target, test and sample_submission are placeholders here)
model, features, trainm, testm = Auto_ViML(
    train,
    target,
    test,
    sample_submission,
    hyper_param="GS",
    feature_reduction=True,
    scoring_parameter="weighted-f1",
    KMeans_Featurizer=False,
    Boosting_Flag=False,
    Binning_Flag=False,
    Add_Poly=False,
    Stacking_Flag=False,
    Imbalanced_Flag=False,
    verbose=0,
)

From the code above, you can see how to create a model using AutoVIML, and what parameters we can use. Now let’s discuss in detail what these parameters are responsible for.

train: the path to your dataset or, if you have already loaded it into a DataFrame, the DataFrame itself. In our case we loaded it into a DataFrame named df, so we set it to df.
target: the name of the target variable. In our case it is "TenYearCHD".
test: the test dataset. If you do not have one, leave it empty (pass "") and AutoVIML will split the data into training and test sets itself.
sample_submission: we leave it empty so that the submission is created automatically in the local directory.
hyper_param: we will use RandomizedSearchCV because it is about three times faster than GridSearchCV. Let’s set it to "RS".
feature_reduction: set it to True so that only the most important predictor variables are used to build the model.
scoring_parameter: you can set your own scoring parameter, or it will be chosen according to the model. Here we use "weighted-f1".
KMeans_Featurizer: should be True for a linear classifier and False for XGBoost or a random forest classifier, otherwise there is a risk of overfitting.
Boosting_Flag: used for boosting. The basic example above leaves it False; the call below sets it to True.
Binning_Flag: defaults to False, but can be set to True when we want to convert the top numeric variables into binned variables.
Add_Poly: set it to False.
Stacking_Flag: defaults to False. If set to True, an additional feature derived from the predictions of another model is added. The call below sets it to True.
Imbalanced_Flag: if set to True, AutoVIML checks the data for class imbalance and corrects it using SMOTE.
verbose: controls how much of the progress is displayed. Let’s set it to 3.
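
Because the parameter names are case-sensitive (Boosting_Flag, not boosting_flag), it can help to collect them in one place and catch typos early. The helper below is a hypothetical convenience wrapper and not part of AutoVIML; the defaults simply mirror the basic example above, and only the two hyper_param values mentioned in this article are accepted.

```python
ALLOWED_HYPER_PARAMS = {"GS", "RS"}  # the two search modes named in this article


def build_auto_viml_kwargs(train, target, hyper_param="RS", **overrides):
    """Hypothetical helper: gather the article's keyword arguments in one dict."""
    if hyper_param not in ALLOWED_HYPER_PARAMS:
        raise ValueError(f"hyper_param must be one of {sorted(ALLOWED_HYPER_PARAMS)}")
    kwargs = {
        "train": train,
        "target": target,
        "test": "",
        "sample_submission": "",
        "hyper_param": hyper_param,
        "feature_reduction": True,
        "scoring_parameter": "weighted-f1",
        "KMeans_Featurizer": False,
        "Boosting_Flag": False,
        "Binning_Flag": False,
        "Add_Poly": False,
        "Stacking_Flag": False,
        "Imbalanced_Flag": False,
        "verbose": 0,
    }
    # Reject misspelled or wrongly cased names such as "boosting_flag".
    unknown = set(overrides) - set(kwargs)
    if unknown:
        raise TypeError(f"unknown parameter(s): {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


# "df_placeholder" stands in for a real DataFrame in this sketch.
kwargs = build_auto_viml_kwargs(
    "df_placeholder", "TenYearCHD", Imbalanced_Flag=True, verbose=3
)
print(kwargs["hyper_param"], kwargs["Imbalanced_Flag"], kwargs["verbose"])
# With autoviml installed and a real DataFrame as train: Auto_ViML(**kwargs)
```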

Now let’s use all these parameters with our dataset and build the best-performing, most accurate model with AutoVIML.

model, features, trainm, testm = Auto_ViML(
    train=df,
    target="TenYearCHD",
    test="",
    sample_submission="",
    hyper_param="RS",
    feature_reduction=True,
    scoring_parameter="weighted-f1",
    KMeans_Featurizer=False,
    Boosting_Flag=True,
    Binning_Flag=False,
    Add_Poly=False,
    Stacking_Flag=True,
    Imbalanced_Flag=True,
    verbose=3
)
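
The call above sets hyper_param to "RS". To give a feel for why a randomized search tends to be cheaper than a grid search, the sketch below counts the candidate settings each strategy would evaluate. The parameter grid and the sample size of 12 are made-up assumptions for illustration and do not reflect what AutoVIML searches internally.

```python
import itertools
import random

# Hypothetical hyperparameter grid (not AutoVIML's actual search space).
param_grid = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 200, 400],
}

# Grid search evaluates every combination of the listed values.
grid_candidates = [
    dict(zip(param_grid, values))
    for values in itertools.product(*param_grid.values())
]

# Randomized search evaluates only a fixed-size random sample of them.
rng = random.Random(0)
random_candidates = rng.sample(grid_candidates, k=12)

print(len(grid_candidates), len(random_candidates))  # 36 12
```

Each candidate is trained once per cross-validation fold, so the number of model fits scales directly with these counts.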

Let’s analyze the result.

1. Data analysis

2. Data cleaning and feature selection

3. Balancing data and creating models

4. Classification report and confusion matrix

5. Model visualizations

6. Feature importance and predicted probabilities

In the output above, we saw how AutoVIML processes the data, cleans it, balances the target variable and creates the best model, along with visualizations for better understanding.
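
To make the confusion matrix and the weighted-f1 score used in this article more concrete, here is a minimal pure-Python sketch of how both can be computed from true and predicted labels. The labels are made up for illustration, and the functions are a simplified re-implementation, not AutoVIML’s own code.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def weighted_f1(y_true, y_pred, labels):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    total = len(y_true)
    score = 0.0
    for i in range(len(labels)):
        true_positives = matrix[i][i]
        support = sum(matrix[i])                  # samples truly in class i
        predicted = sum(row[i] for row in matrix)  # samples predicted as class i
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        score += f1 * support / total
    return score


# Made-up labels: 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
score = weighted_f1(y_true, y_pred, labels=[0, 1])
print(matrix)
print(round(score, 4))
```

Weighting by class share means the majority class dominates the score, which is one reason to look at the per-class figures in the classification report as well.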

Likewise, you can explore AutoVIML with other datasets and share your experience in the comments. AutoVIML is an easy way to build powerful machine learning models from your dataset.