In this post, I put together a checklist that I constantly refer to while working on a comprehensive machine learning project.
Why do I need a checklist at all?
Since a project involves many elements (data preparation, framing questions, models, tuning, etc.), it is easy to lose track. A checklist guides you through these steps and prompts you to verify whether each task was completed successfully.
When you are struggling to find a starting point, a checklist helps you extract the right information (data) from the right sources, establish relationships, and surface hypotheses about correlations.
It is good practice to run each part of the project through such a verification process.
As Atul Gawande says in his book The Checklist Manifesto,
the volume and complexity of what we know has exceeded our individual ability to deliver its benefits correctly, safely, or reliably.
So let me walk you through this short, clear list of activities that will reduce your workload and improve your results.
Machine Learning Project Checklist
Here are 8 steps that apply to almost every project. Some of them can be performed in a different order.
1. Identify the problem from a high level perspective
This step is about understanding and formulating the business logic of the problem. You should determine:
- the nature of the problem (supervised / unsupervised, classification / regression),
- the type of solutions you can develop,
- the metrics you should use to measure performance,
- whether machine learning is the right approach to this problem,
- the manual approach to solving the problem,
- the inherent assumptions of the problem.
2. Identify data sources and get data
In many cases, this step can be performed before the first one: if you already have data, you may want to frame the problems around it in order to make better use of incoming data.
Based on the definition of your problem, you will need to determine the data sources, which can be a database, data warehouse, sensors, etc. To deploy the application in production, this step should be automated by developing data pipelines that provide incoming data to the system.
- List the sources and the amount of data you need.
- Check whether storage space will be a problem.
- Check whether you are allowed to use the data for your purposes.
- Get the data and convert it into a workable format.
- Check the data type (textual, categorical, numerical, time series, images).
- Set aside a sample for final testing.
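The last two points can be sketched as follows. This is a minimal example assuming scikit-learn is available; the synthetic dataset stands in for whatever source you actually pull from.

```python
# Sketch: get the data into a workable format and set aside a final test
# sample. make_classification stands in for a real database or sensor feed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Keep 20% aside for the final check in step 6; stratify to preserve the
# class balance in both splits.
X_work, X_test, y_work, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_work.shape, X_test.shape)  # (800, 10) (200, 10)
```

The held-out `X_test` / `y_test` should not be touched again until the very end of step 6.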
3. Initial data exploration
At this stage, you study every feature that affects your result / forecast / target. If you have a huge dataset, sample a subset for this step to make the analysis more manageable.
- Use Jupyter Notebook, as it provides a simple and intuitive interface for exploring data.
- Identify the target variable.
- Identify the feature types (categorical, numerical, textual, etc.).
- Analyze the relationships between features.
- Add several data visualizations to make the effect of each feature on the target variable easy to interpret.
- Document your findings.
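A quick pass over these points might look like the sketch below, assuming pandas is available; the columns are purely illustrative.

```python
# Sketch of an initial exploration pass over a small synthetic dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, 200),             # numerical feature
    "segment": rng.choice(["a", "b", "c"], 200),  # categorical feature
    "target": rng.integers(0, 2, 200),            # target variable
})

print(df.dtypes)                   # feature types at a glance
print(df.describe())               # summary statistics for numerical features
print(df.corr(numeric_only=True))  # relationships between numerical features
# In a notebook you would add df.hist(), scatter plots, etc. here.
```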
4. Research data analysis for data preparation
It’s time to act on the conclusions of the previous step by writing functions for data transformation, cleaning, feature selection / engineering, and scaling.
- Write functions to transform the data and automate the process for upcoming data batches.
- Write functions to clean the data (imputing missing values and handling outliers).
- Write functions for feature selection and engineering: remove redundant features, apply feature transformations, and other mathematical transformations.
- Scale the features (standardization).
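One common way to package these functions, assuming scikit-learn, is a Pipeline plus ColumnTransformer, so that every incoming data batch receives exactly the same treatment; the column names here are illustrative.

```python
# Sketch: imputation, scaling, and categorical encoding bundled into one
# reusable preprocessing object.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "income": [50_000, None, 72_000, 61_000],  # contains a missing value
    "city": ["NY", "SF", "NY", "LA"],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardize the feature
])
preprocess = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 1 scaled numeric column + 3 one-hot city columns
```

Because the fitted `preprocess` object remembers the imputation median and scaling statistics, applying it to a new batch with `transform` keeps production data consistent with training data.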
5. Develop a baseline model, then explore other models to select the best one
Create a very simple model to serve as a baseline for all the more complex machine learning models. Checklist of steps:
- Train some commonly used models, such as Naive Bayes, linear regression, SVM, etc., using default settings.
- Measure and compare each model’s performance against the baseline and against all the others.
- Use N-fold cross-validation for each model and compute the mean and standard deviation of the performance metric across the N folds.
- Explore the features that have the greatest impact on the target.
- Analyze the types of errors the models make when predicting.
- Engineer the features differently.
- Repeat the above steps several times (trial and error) to make sure you are using the right features in the right format.
- Shortlist the best models based on their performance metrics.
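The comparison loop can be as small as the sketch below (scikit-learn assumed; the model list is illustrative). A DummyClassifier serves as the trivial baseline to beat.

```python
# Sketch: compare default-settings models against a baseline with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)
models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "naive_bayes": GaussianNB(),
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```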
6. Fine-tune the models on your shortlist and try ensemble methods
This should be one of the decisive steps as you approach your final solution. Key points include:
- Tune hyperparameters using cross-validation.
- Use automated tuning methods such as random search or grid search to find the best configuration for your top models.
- Try ensemble methods such as a voting classifier, etc.
- Test the models with as much data as possible.
- Once done, use the test sample set aside at the beginning to check whether the model generalizes well.
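As a sketch of the tuning and ensembling steps (scikit-learn assumed; the parameter grid and model choices are illustrative):

```python
# Sketch: grid-search one model's hyperparameters, then combine the tuned
# model with another in a voting ensemble and check it on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
print("best params:", search.best_params_)

ensemble = VotingClassifier([
    ("svm", search.best_estimator_),
    ("logreg", LogisticRegression(max_iter=1000)),
])
ensemble.fit(X_train, y_train)

# The final check on the sample set aside at the beginning:
acc = ensemble.score(X_test, y_test)
print("held-out accuracy:", acc)
```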
7. Document the code and communicate your solution
Communication is a multifaceted process, and you must keep all existing and potential stakeholders in mind. The main points include:
- Document the code, as well as your approach to the entire project.
- Create a dashboard (for example with Voilà) or an insightful presentation with self-explanatory visualizations.
- Write a blog post / report on how you analyzed the features, tested various transformations, and so on. Describe what you learned (the failures as well as the techniques that worked).
- Conclude with the main result and the future scope (if any).
8. Deploy your model in production and monitor it
If your project requires deployment on real data, you should create a web application or a REST API to be used across platforms (web, Android, iOS). Key points (which vary by project) include:
- Save your final trained model to an h5 or pickle file.
- Serve your model with web services; you can use Flask to develop them.
- Connect the input sources and configure ETL pipelines.
- Manage dependencies using pipenv and Docker / Kubernetes (depending on scaling requirements).
- You can use AWS, Azure, or the Google Cloud Platform to deploy your service.
- Monitor performance on real data, or simply make the model available so that people can use it with their own data.
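The save-and-serve cycle from the first two points can be sketched as follows; a Flask view would wrap the final `predict` call, but the persistence logic is the same (scikit-learn assumed, file name illustrative).

```python
# Sketch: persist a trained model to a pickle file and reload it the way a
# web service would at startup.
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:   # save the final trained model
    pickle.dump(model, f)

with open(path, "rb") as f:   # what the web service does on load
    served = pickle.load(f)

# A Flask route would call served.predict(...) on the parsed request body.
print(served.predict(X[:3]))
```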
Note: the checklist can be adapted to the complexity of your project.