Introducing PyCaret: An Open-Source, Low-Code Python Machine Learning Library

Hello everyone. In anticipation of the start of the "Neural Networks in Python" course, we have prepared for you a translation of another interesting article.


We are glad to introduce PyCaret, an open-source Python machine learning library for training and deploying supervised and unsupervised models in a low-code environment. PyCaret lets you go from preparing data to deploying a model within seconds in the notebook environment of your choice.

Compared to other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with just a few words, making experiments exponentially faster and more efficient. PyCaret is essentially a Python wrapper around several machine learning libraries, such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy and many others.

PyCaret is simple and easy to use. All operations performed by PyCaret are stored sequentially in a pipeline that is fully ready for deployment. Whether it is imputing missing values, transforming categorical data, engineering features or tuning hyperparameters, PyCaret can automate all of it. To learn a little more about PyCaret, check out this short video.

Getting Started with PyCaret

The first stable release of PyCaret version 1.0.0 can be installed using pip. Use the command line interface or the notebook environment and run the command below to install PyCaret.

pip install pycaret

If you use Azure Notebooks or Google Colab, run the following command:

!pip install pycaret

When you install PyCaret, all dependencies will be installed automatically. You can see the list of dependencies here.

It couldn’t be easier

Walkthrough

1. Data acquisition

In this walkthrough we will use the diabetes dataset; our goal is to predict the patient outcome (binary 0 or 1) based on several factors such as blood pressure, blood insulin level, age, and so on. The dataset is available in the PyCaret GitHub repository. The easiest way to import it directly from the repository is to use the get_data function from the pycaret.datasets module.

from pycaret.datasets import get_data
diabetes = get_data('diabetes')


PyCaret can work directly with pandas dataframes

2. Setting up the environment

Any experiment with machine learning in PyCaret begins with setting up the environment by importing the necessary module and initializing setup(). The module that will be used in this example is pycaret.classification.

After importing the module, setup() is initialized by specifying the dataframe (diabetes) and the target variable ('Class variable'):

from pycaret.classification import *
exp1 = setup(diabetes, target = 'Class variable')

All preprocessing happens in setup(). Using more than 20 functions to prepare the data for machine learning, PyCaret creates a transformation pipeline based on the parameters defined in setup(). It automatically builds all the dependencies in the pipeline, so you do not need to manually manage the sequential execution of transformations on a test or new (unseen) dataset.

A PyCaret pipeline can easily be transferred from one environment to another or deployed to production. Below you can familiarize yourself with the preprocessing features that have been available in PyCaret since the first release.

Data preprocessing steps that are mandatory for machine learning, such as missing value imputation, categorical variable encoding, label encoding (converting yes or no into 1 or 0) and the train-test split, are performed automatically when setup() is initialized. You can learn more about PyCaret's preprocessing features here.
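
As an illustration, the sketch below shows how preprocessing behaviour can be controlled through setup() arguments. The train_size value matches the 70% default mentioned later; the normalize parameter name is an assumption based on the documented preprocessing features, so check the documentation for the exact spelling.

# hypothetical sketch: preprocessing options are passed directly to setup()
exp1 = setup(diabetes, target = 'Class variable',
             train_size = 0.7,   # train / hold-out split used later by predict_model
             normalize = True)   # scale numeric features (assumed parameter name)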

3. Comparison of models

This is the recommended first step when working on a supervised learning task (classification or regression). The compare_models function trains all models in the model library and compares their evaluation metrics using k-fold cross-validation (10 folds by default). The evaluation metrics used are:

  • For classification: Accuracy, AUC, Recall, Precision, F1, Kappa
  • For regression: MAE, MSE, RMSE, R2, RMSLE, MAPE

By default, the metrics are evaluated using 10-fold cross-validation. The number of folds can be changed via the fold parameter.

The resulting table is sorted by Accuracy, from the highest value to the lowest, by default. The sort order can be changed using the sort parameter.
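
A minimal sketch of this step, using the fold and sort parameters described above:

# compare all models in the library with 10-fold cross-validation (default)
compare_models()

# for example, use 5 folds and sort the table by AUC instead of Accuracy
compare_models(fold = 5, sort = 'AUC')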

4. Creating a model

Creating a model in any PyCaret module is as simple as writing create_model. The function takes a single parameter, the model name passed as a string, and returns a table with cross-validated scores and a trained model object.

adaboost = create_model('ada')

The adaboost variable stores the trained model object returned by create_model, which under the hood is a scikit-learn estimator. The original attributes of the trained object can be accessed using dot ( . ) notation after the variable name. You can find a usage example below.
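
For instance, a trained AdaBoost model exposes the usual scikit-learn attributes (the attributes below belong to scikit-learn's AdaBoostClassifier, not to PyCaret itself):

# the returned object is a scikit-learn AdaBoostClassifier under the hood
adaboost.feature_importances_   # impurity-based feature importances
adaboost.estimators_            # list of fitted weak learners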

PyCaret has more than 60 ready-to-use open-source algorithms. A complete list of the estimators / models available in PyCaret can be found here.

5. Model customization

The tune_model function is used to automatically tune the hyperparameters of a machine learning model. PyCaret uses a random grid search over a predefined search space. The function returns a table with cross-validated scores and a trained model object.

tuned_adaboost = tune_model('ada')
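
A sketch of tuning with non-default settings; the n_iter and optimize parameter names (for the random-search budget and the metric to optimize) are assumptions, so check the documentation for the exact names:

# tune for a different metric with a larger random-search budget (assumed parameters)
tuned_adaboost = tune_model('ada', n_iter = 50, optimize = 'AUC')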

The tune_model function in unsupervised modules such as pycaret.nlp, pycaret.clustering and pycaret.anomaly can be used in conjunction with the supervised modules. For example, PyCaret's NLP module can be used to tune the number of topics by evaluating an objective or loss function from a supervised model, such as Accuracy or R2.

6. The ensemble of models

The ensemble_model function is used to create an ensemble from a trained model. It takes a single parameter as input, the trained model object, and returns a table with cross-validated scores and a trained model object.

# creating a decision tree model
dt = create_model('dt')
# ensembling a trained dt model
dt_bagged = ensemble_model(dt)

By default the ensemble is built with the Bagging method; it can be changed to Boosting via the method parameter of ensemble_model.
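
For example (the exact string value 'Boosting' is assumed here):

# boosting instead of the default bagging
dt_boosted = ensemble_model(dt, method = 'Boosting')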

PyCaret also provides the blend_models and stack_models functions for combining several trained models.
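
A minimal sketch of both, assuming they accept a list of trained models via an estimator_list argument:

# train a few additional base models
lr = create_model('lr')
knn = create_model('knn')
# blending: combine the base models' predictions by voting
blended = blend_models(estimator_list = [dt, lr, knn])
# stacking: train a meta-model on top of the base models' predictions
stacked = stack_models(estimator_list = [dt, lr, knn])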

7. Model visualization

You can evaluate the performance of a trained machine learning model and diagnose it with the plot_model function. It takes the trained model object and the plot type as a string.

# create a model
adaboost = create_model('ada')
# AUC plot
plot_model(adaboost, plot = 'auc')
# Decision Boundary
plot_model(adaboost, plot = 'boundary')
# Precision Recall Curve
plot_model(adaboost, plot = 'pr')
# Validation Curve
plot_model(adaboost, plot = 'vc')

You can learn more about visualization in PyCaret here.

You can also use the evaluate_model function to view the plots through the notebook user interface.

The plot_model function in the pycaret.nlp module can be used to visualize text corpora and semantic topic models. You can learn more about them here.

8. Interpretation of the model

When the data is non-linear, which happens quite often in real life, we invariably see that tree-based models perform much better than simple Gaussian models. This, however, comes at the cost of interpretability, since tree-based models do not provide simple coefficients the way linear models do. PyCaret implements SHAP (SHapley Additive exPlanations) through the interpret_model function.
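
A minimal sketch, using a tree-based model such as XGBoost (SHAP-based interpretation applies to tree-based models):

# train a tree-based model and show the SHAP summary plot
xgboost = create_model('xgboost')
interpret_model(xgboost)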

The interpretation of a particular data point in the test dataset can be examined with the 'reason' plot. In the example below we inspect the first instance of the test dataset.
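
A sketch of that call; the observation index is assumed to be zero-based:

# SHAP 'reason' plot for the first observation in the test dataset
interpret_model(xgboost, plot = 'reason', observation = 0)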

9. Predictive model

Up to this point, all the results we obtained were based on k-fold cross-validation on the training dataset (70% of the data by default). To see the model's predictions and performance on the test / hold-out dataset, the predict_model function is used.

The predict_model function is also used to make predictions on an unseen dataset. For now, we will use the same dataset we used for training as a proxy for a new unseen dataset. In practice, predict_model would be used iteratively, each time on a new unseen dataset.
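
A minimal sketch of both uses, assuming the default hold-out split created by setup() and the data argument for scoring an external dataframe:

# train a random forest and evaluate it on the hold-out (test) set
rf = create_model('rf')
rf_holdout_pred = predict_model(rf)
# score "new" data; here the original dataframe stands in for an unseen dataset
new_predictions = predict_model(rf, data = diabetes)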

The predict_model function can also make predictions for a sequential chain of models, which can be created with the stack_models and create_stacknet functions.

The predict_model function can also make predictions directly against models hosted on AWS S3 that were deployed with the deploy_model function.

10. Deploy models

One way to use trained models to generate predictions for a new dataset is to call predict_model in the same notebook / IDE in which the model was trained. However, generating predictions for a new (unseen) dataset is an iterative process. Depending on the use case, the frequency of predictions can range from real-time predictions to batch predictions. PyCaret's deploy_model function allows you to deploy the entire pipeline, including the trained model, to the cloud from the notebook environment.

deploy_model(model = rf, model_name = 'rf_aws', platform = 'aws', 
             authentication =  {'bucket'  : 'pycaret-test'})

11. Save model / save experiment

After training, the entire pipeline, containing all the preprocessing transformations and the trained model object, can be saved as a binary pickle file.

# creating model
adaboost = create_model('ada')
# saving model
save_model(adaboost, model_name = 'ada_for_deployment')

You can also save the entire experiment, containing all the intermediate output, as a single binary file.

save_experiment(experiment_name = 'my_first_experiment')

You can load saved models and experiments using the load_model and load_experiment functions, available in all PyCaret modules.
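
For example:

# load the saved pipeline / experiment back into memory
saved_ada = load_model('ada_for_deployment')
saved_experiment = load_experiment('my_first_experiment')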

12. Next Guide

In the next guide, we will show how to use the trained machine learning model in Power BI to generate batch predictions in a real production environment.

You can also check out the beginner notebooks for the following modules:

What is a development pipeline?

We are actively working on improving PyCaret. Our upcoming development pipeline includes a new time series forecasting module, TensorFlow integration, and major scalability improvements for PyCaret. If you want to share your feedback and help us improve, you can fill out the form on our website or leave a comment on our GitHub or LinkedIn page.

Want to know more about a particular module?

Starting with the first release, PyCaret 1.0.0 offers the following modules. Follow the links below to check out the documentation and working examples.

Classification
Regression
Clustering
Anomaly Detection
Natural Language Processing (NLP)
Association Rule Mining

Important links

If you like PyCaret, give us a star on GitHub.

To hear more about PyCaret, you can follow us on LinkedIn and YouTube.


Learn more about the course.

