Classic machine learning

Hello! My name is Artem. I work as a Data Scientist at MegaFon (OneFactor, a platform for secure data monetization). We build credit scoring, lead generation, and anti-fraud models on telecom data, and also do geoanalytics.

In the previous article, I shared materials for preparing for one of the most exciting (for many) stages – Live Coding.

Let's recall which sections the Data Scientist interview process consists of:

  • Live Coding

  • Classic machine learning

  • Specialized machine/deep learning

  • Design of machine learning systems (middle+/senior)

  • Behavioral interview (middle+/senior)

A typical interview process through the eyes of Dall-E 3

In this article, we will look at materials that can be used to prepare for the section on classical machine learning.

Notes

I will provide links to materials not only in Russian, but also in English. If the same resource has a version in Russian and in English, then I will indicate both so that the reader can choose the appropriate option.

A large number of resources will be in English, so knowledge of English is a must-have not only for working in IT, but also for preparing for an interview. The original (English) version is often easier to read than a translation, because it contains many established terms and names that, when translated poorly, only confuse the reader.

For books, I will link to the publishers (where possible) so that you can choose where to purchase them. For technical literature, I recommend electronic versions, since you can always highlight an important passage, leave a comment, and quickly find the information you need.

By default, resources in Russian come before resources in English (where they exist).

Most of the materials in this article are free, but there are a few paid ones (marked paid). I recommend buying them only if you are sure you cannot or do not want to spend your own time searching for the information yourself.

I have marked my favorite materials with ⭐.

Content

Classic machine learning

A section on classical machine learning appears in every Data Science interview (in one form or another), because it tests the basic ML knowledge without which there is no point in running a section on specialized ML/DL (RecSys, NLP, Time Series, CV, RL, ASR/TTS, etc.).

In this section, expect questions on the following topics:

Materials

Now that we have figured out which topics need to be studied or refreshed, it's time to move on to the materials that will help us with this.

Books

⭐ Machine learning textbook from ShAD

An online machine learning textbook from ShAD – for those who are not afraid of mathematics and want to understand ML technologies. You will study classical theory and the intricacies of algorithm implementation, going from the basics of machine learning to topics raised in the latest scientific articles. New chapters will be added to the textbook, so stay tuned for updates, or better yet, subscribe to the news.

The best resource in Russian for quickly refreshing your knowledge of ML theory.

An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani + The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman

Analogues of the ML textbook from ShAD in English, with a deeper dive into selected topics.
The first book (An Introduction to Statistical Learning) is suitable for beginners, while the second (The Elements of Statistical Learning) will be easier to read for more experienced readers.

Machine Learning Simplified: A gentle introduction to supervised learning by Andrew Wolf

The primary goal of Machine Learning Simplified is to develop your intuition for machine learning. It uses intuitive examples to explain complex concepts, algorithms, and methods, and covers all the necessary math at a high level.

⭐ The Kaggle Book

The book describes in detail competition analysis, code examples, and end-to-end pipelines, as well as all the ideas, suggestions, best practices, tips, and recommendations that Luca Massaron and Konrad Banachewicz have collected over their years of Kaggle competitions (more than 22 years).

Interpreting Machine Learning Models With SHAP: A Guide With Python Examples And Theory On Shapley Values

This book will be your comprehensive guide to mastering the theory and practice of SHAP. It starts with the fascinating origins of game theory and explores how sharing taxi costs among passengers relates to explaining the predictions of machine learning models. Starting with using SHAP to explain a simple linear regression model, the book gradually moves on to more complex models. You will learn all the ins and outs of the most popular model interpretation method and how to apply it using the shap package.
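
To give a taste of how this looks in code, here is a minimal, hedged sketch of explaining a linear regression with the shap package (the dataset and plotting choices are my own illustration, not an example taken from the book):

```python
# A minimal sketch: explaining a linear regression model with SHAP.
# Illustrative only; the dataset choice is an assumption, not from the book.
import shap
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = LinearRegression().fit(X, y)

# For linear models, shap provides a fast, exact explainer.
explainer = shap.LinearExplainer(model, X)
shap_values = explainer.shap_values(X)

# Global view: how each feature pushes predictions up or down.
shap.summary_plot(shap_values, X)
```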

Interpretable Machine Learning. A Guide for Making Black Box Models Explainable by Christoph Molnar

This book is about how to make machine learning models interpretable.
As you explore the concepts of interpretability, you will learn about simple interpretable models such as decision trees and linear regression. The book focuses on model-agnostic methods for interpreting "black box" models, such as feature importance and accumulated local effects, as well as on explaining individual predictions with Shapley values and LIME. In addition, it introduces methods specific to deep neural networks.
All interpretation methods are explained in detail and critically discussed. How do they work under the hood? What are their strengths and weaknesses? How can their results be interpreted? The book will help you select and correctly apply the interpretation method most suitable for your machine learning project. It is recommended reading for machine learning practitioners, data scientists, statisticians, and anyone interested in making machine learning models interpretable.

Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson

The process of developing predictive models includes many steps. Most materials (books, blog posts, etc.) focus on algorithms, while other important aspects of the modeling process are ignored.
This book describes methods for finding the best subset of features to improve model quality. Various data sets are used to illustrate the methods, along with code (in R) to reproduce the results.
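
As a small illustration of the kind of approach the book covers, here is a hedged sketch of feature selection via recursive feature elimination with cross-validation in scikit-learn (the book itself works in R; the estimator and data here are my own assumptions):

```python
# A hedged sketch: recursive feature elimination with cross-validation.
# Illustrative only; the estimator and synthetic data are assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# Drop the least useful feature at each step, scoring candidates by cross-validation.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Selected feature mask:", selector.support_)
```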

Clean Machine Learning Code by Moussa Taifi paid

The book contains "recipes" for writing "clean" code for training and inference of ML models:

  1. Basics of Clean Machine Learning Code

  2. Name optimization

  3. Function optimization

  4. Style

  5. "Clean" machine learning classes

  6. Software architecture in machine learning

  7. Machine learning through testing

Courses

⭐ Open Machine Learning Course by Yury Kashnitsky

The course was created by Yury Kashnitsky and the ods.ai community in 2017.
It is a course balanced between theory and practice, providing both the knowledge and the skills (necessary, but not sufficient) of machine learning at the Junior Data Scientist level. It's not often you'll find detailed descriptions of the math behind the algorithms, Kaggle InClass competitions, and examples of business applications of machine learning in one course. From 2017 to 2019, Yury Kashnitsky and the large ODS team ran live course launches twice a year – with homework, competitions, and an overall participant rating (the names of the heroes are recorded here, and I, the author of this article, am even among them). The course is now available in self-paced mode in English.
Instructions on how to "take" this course in Russian are here.
A new course launch in Russian was planned, but has stalled.

Machine learning (course of lectures, K.V. Vorontsov)

The course covers the main tasks of learning from examples: classification, clustering, regression, and dimensionality reduction. It studies methods for solving them, both classical ones and newer ones created over the past 10–15 years. Emphasis is placed on a thorough understanding of the mathematical foundations, relationships, strengths, and limitations of the methods discussed. Theorems are mostly given without proofs.

⭐ Applied problems of data analysis (course of lectures, A.G. Dyakonov)

The goals of the course are:

  • Solving real problems, for example taken from competition platforms (Kaggle) or from business ("small data" ~ up to 200 GB):

    • Most convenient for beginners

    • Simple data formats

    • Can be solved on a PC

  • Study of theory (including advanced machine learning):

    • Estimates of mean, probability and density; weighting schemes

    • CASE: Forecasting visits of supermarket customers and the amount of their purchases

    • CASE: Traffic Jam Problem

    • The art of visualization:

      • Part 1 – Historical

      • Part 2 – Univariate Analysis

      • Part 3 – Multivariate analysis

    • Quality Metrics:

      • Part 1: Error functions in regression problems

      • Part 2: Hard (crisp) binary classification

      • Part 3: Scoring functions and curves in machine learning

      • Part 4: Multi-class problems, ranking, clustering

      • Part 5: Problems and cases

    • Data preparation

    • Feature generation

    • Ensembles

    • Social/Complex Network Analysis

    • Predicting the appearance of an edge in a dynamic graph (Link Prediction Problem)

    • Community Detection

    • Random Forest

    • Importance of features in tree ensembles

    • Gradient boosting

Machine Learning Algorithms from Scratch

In this course, you implement all the basic classical machine learning algorithms from scratch in pure Python, Pandas, and NumPy.

Unlike other machine learning courses, the emphasis here is on the algorithms from a programming point of view rather than a mathematical one, although the basic mathematical concepts are still covered.

To successfully complete the course, you will need an understanding of the basics of machine learning: why a sample is split into train and test sets, what a target and a feature are, and other basic concepts and professional jargon. In addition, you need a reasonably good command of Python and its algorithms and data structures.
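
To give a feel for the format, here is a hedged, minimal sketch of the kind of from-scratch implementation the course asks for – a linear regression trained with gradient descent in pure NumPy (my own illustration, not code from the course):

```python
# A minimal from-scratch sketch: linear regression via gradient descent in NumPy.
# Illustrative only; not taken from the course materials.
import numpy as np

class LinearRegressionGD:
    def __init__(self, lr=0.01, n_iter=1000):
        self.lr = lr
        self.n_iter = n_iter

    def fit(self, X, y):
        X = np.c_[np.ones(len(X)), X]                      # add a bias column
        self.w = np.zeros(X.shape[1])
        for _ in range(self.n_iter):
            grad = 2 / len(X) * X.T @ (X @ self.w - y)     # gradient of MSE
            self.w -= self.lr * grad
        return self

    def predict(self, X):
        X = np.c_[np.ones(len(X)), X]
        return X @ self.w

# Tiny usage example on synthetic data.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)
model = LinearRegressionGD(lr=0.1, n_iter=2000).fit(X, y)
print(model.w)  # should be close to [1, 3, -2]
```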

Stanford CS229: Machine Learning by Andrew Ng

A cult course on the basics of machine learning.
Topics studied in the course:

  • Supervised learning (generative/discriminative models, parametric/non-parametric models, neural networks, support vector machines);

  • Unsupervised learning (clustering, dimensionality reduction, kernel methods);

  • Learning theory (bias/variance trade-off, practical advice);

  • Reinforcement learning and adaptive control.

⭐ Kaggle Learn

Kaggle provides free access to its short but interesting courses. Here are just a few of them:

Google Machine Learning Courses

  • Basic courses (basic concepts of machine learning):

    1. Introduction to Machine Learning

    2. Machine Learning Crash Course

    3. Problem Statement

    4. Preparing data and feature space

    5. Testing and Debugging

  • Advanced courses (tools and methods for solving various problems):

    • Decision Forests

    • Recommender systems

    • Clustering

    • Generative Adversarial Networks

    • Image classification

    • Fairness in the Perspective API

  • Guides (step-by-step instructions for solving problems using best practices):

    • Rules of Machine Learning

    • People + AI Guidebook

    • Text classification

    • Good data analysis

    • Deep Learning Tuning Playbook

Websites

Machine learning for people. Let's understand in simple words

A great introduction for those who want to finally understand machine learning – in simple language, without formulas and theorems, but with examples of real problems and their solutions.
Suitable for those who are just starting to get into machine learning.

⭐ Small data analysis

An interesting blog by Alexander Dyakonov (author of the course "Applied problems of data analysis"), where new posts are published infrequently (the author now mostly writes in his Telegram channel), but the archive contains useful articles, for example:

Kaggle Competitions

Several types of competitions are available on Kaggle:

  • Getting Started
    The simplest competitions on Kaggle, intended for those who are just starting out in machine learning. Since no money is paid for winning these competitions, other users willingly share their solutions in the code section (for example) – which is exactly what beginners need to get up to speed.

  • Playground Competitions
    These competitions are one step above the Getting Started level in difficulty and are aimed at beginners or Kagglers interested in tackling new problems. Small cash prizes may already be awarded for winning here.

  • Research
    Research competitions feature more experimental problems. Winning them usually does not bring prizes or ranking points, but they provide an opportunity to work on problems that have no clear or easy solution and that are integral to a particular subject area.

  • Featured
    These are serious tasks, usually of a commercial nature. Such competitions attract the most distinguished experts and offer prize funds of up to a million dollars. However, they remain accessible to anyone and everyone. Whether you're an expert in the field or a complete beginner, featured competitions are a valuable opportunity to learn skills and techniques from the very best in the field.

Machine Learning Mastery

Jason Brownlee's website contains:

  • Guides (step-by-step guides divided into several levels):

    • Foundations: how to start learning machine learning, probability, statistical methods, linear algebra, optimization, calculus.

    • Beginner: Python, understanding ML algorithms, introduction to sklearn, time series forecasting, data preparation.

    • Intermediate: boosting, class imbalance, deep learning, ensembles.

    • Advanced: LSTM, NLP, CV, GANs, the attention mechanism, and transformers.

  • Tutorials (blog posts on the website on various topics)

  • E-books paid (extended materials from the site, combined into books by topic)

I admit I haven't studied this material in full, but in my experience, if the search results offer a choice between this site and Medium, Analytics Vidhya, etc., it is better to go here.

The Illustrated Machine Learning

Metrics for distance-based algorithms

The idea of the site is to make the complex world of machine learning more accessible through clear, concise illustrations. The goal is to give students, professionals, and anyone preparing for a technical interview a visual aid for better understanding the core concepts of machine learning. Whether you're new to the field or a seasoned professional looking to brush up on your knowledge, these illustrations will be a valuable resource on your journey to understanding machine learning.

MLU-EXPLAIN

Process of constructing an ROC curve

Visual explanations of basic machine learning concepts.
Machine Learning University (MLU) is Amazon's educational initiative designed to teach the theory and practice of machine learning.
MLU-Explain exists as part of this initiative to teach important machine learning concepts through visual essays in a fun, informative, and accessible way.
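
One of the essays on the site walks through how an ROC curve is constructed step by step. As a hedged companion in code, here is a minimal sketch of computing and plotting an ROC curve with scikit-learn (the dataset and model are my own illustrative choices, not taken from MLU-Explain):

```python
# A minimal sketch: building an ROC curve with scikit-learn.
# Illustrative only; dataset and model choice are assumptions.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]       # predicted probability of the positive class

# Each threshold on the scores gives one (FPR, TPR) point on the curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("ROC AUC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")          # random-guess baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```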

Cheat sheets

Supervised Learning:

  • Loss functions, gradient descent, likelihood

  • Linear models, Support Vector Machine (SVM), generative learning approach

  • Trees and ensembles, K-Nearest Neighbors (KNN), learning theory

Unsupervised Learning:

  • EM algorithm (Expectation Maximization), k-means, hierarchical clustering

  • Metrics for assessing the quality of clustering

  • Principal Component Analysis (PCA), Independent Component Analysis (ICA)

Tips and Tricks:

  • Confusion matrix, accuracy, precision, recall, F1-score, ROC

  • R², Mallows's Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC)

  • Cross-validation, regularization, bias/variance trade-off, error analysis
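
As a quick refresher on a few of the metrics and validation techniques listed above, here is a hedged scikit-learn sketch (the synthetic data and model are my own illustrative choices, not part of the cheat sheets):

```python
# A small sketch: classification metrics and cross-validation with scikit-learn.
# Illustrative only; not taken from the cheat sheets themselves.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))        # TP / FP / FN / TN counts
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class

# 5-fold cross-validation gives a less noisy quality estimate than a single split.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("CV ROC AUC:", cv_scores.mean(), "+/-", cv_scores.std())
```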

Miscellaneous

StatQuest with Josh Starmer

A channel in which Josh Starmer explains various ML algorithms and concepts in simple language, supported by visualizations and examples. The videos are suitable both for a first introduction to the material and for review.
You can use the channel's search to find the video you need.

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning by Sebastian Raschka

The proper use of model evaluation and algorithm selection methods is vital in academic machine learning research as well as in many real-world business problems. This article reviews the different methods that can be used for each of these sub-tasks and discusses the main advantages and disadvantages of each, with references to theoretical and empirical work. In addition, it makes recommendations to encourage best yet feasible practices in machine learning research and application.
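
One of the techniques the article discusses is nested cross-validation, where an inner loop tunes hyperparameters and an outer loop estimates how well the whole tuning procedure generalizes. Below is a hedged scikit-learn sketch (the estimator and parameter grid are my own illustrative choices, not prescribed by the paper):

```python
# A hedged sketch of nested cross-validation: the inner loop selects hyperparameters,
# the outer loop assesses the generalization quality of the whole procedure.
# Illustrative only; estimator and parameter grid are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
inner = GridSearchCV(SVC(), param_grid, cv=3)        # inner loop: model selection
outer_scores = cross_val_score(inner, X, y, cv=5)    # outer loop: model assessment

print("Nested CV accuracy:", outer_scores.mean(), "+/-", outer_scores.std())
```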

⭐ How to avoid machine learning pitfalls: a guide for academic researchers by Michael A. Lones

This article describes common mistakes that occur when machine learning is used and what can be done to avoid them. It covers five stages of the machine learning model development process: what to do before building a model, how to reliably build models, how to robustly evaluate models, how to fairly compare models, and how to report results.

A new perspective on Shapley values: Part I: Intro to Shapley and SHAP + Part II: The Naïve Shapley method

These two blog posts (1, 2) will help you study SHAP/Shapley values for interpreting models.

Let's sum it up

If you haven't read them yet, I recommend the Learning How to Learn and Let's summarize the results blocks from the first article, since everything said there also applies to preparing for the machine learning section.

The materials collected in this article will be useful in preparing for interviews for various positions in Big Data at MegaFon.

And if you are just starting your career in Data Science, then pay attention to internships at large companies, where you can not only improve your knowledge but also gain valuable experience applying theory to practical business problems. At MegaFon, an example of such an internship is the accelerator (email with the subject "internship in big data"), through which data scientists, data analysts, and data engineers find their first jobs every year.

What's next?

In the next article we will analyze materials for preparing for the section on specialized machine/deep learning.

You can find the latest resources for this series of articles in the Data Science Resources repository, which will be maintained and updated. You can also subscribe to my Telegram channel Data Science Weekly, in which I share interesting and useful materials every week.

If you know of any cool resources that I didn't include in this list, please write about them in the comments.

P.S. Thanks to Daria Shatko for editing and proofreading this post!
