How to find a problem that can be solved using machine learning

Hello, this is the Yandex Practicum data analysis team.
In this article, we cover some basics of machine learning: we show how to choose a problem that ML can solve and look at the process from a business perspective. The material will be useful not only for novice data scientists, but also for managers who are thinking about introducing machine learning into business processes.

The following helped prepare the article:

Sergey Komarov

Machine learning specialist at Ingosstrakh and senior reviewer on the course “Data Scientist” in Yandex Practicum

Anton Morgunov

Machine Learning Engineer at Basis Center and Curriculum Lead of the Machine Learning Engineer course

Before explaining where and how machine learning can be applied, we briefly describe the main concepts of this article: machine learning, trained model, algorithm, accumulated data, pattern.

What is machine learning

Thanks to the creation and development of artificial intelligence (AI), a computer can now solve many problems: predict market fluctuations, suggest suitable products and movies to the user, predict cyber attacks, and filter spam. To accomplish these tasks, experts train the computer, that is, they are engaged in machine learning.

Machine learning (ML) is the creation of algorithms that let a machine generalize from experience and find patterns in known cases. For example, AI can learn to find abnormalities in CT or MRI scans and determine whether a patient is healthy. The result of ML is a model that can be used for automated or semi-automated decision making.

A trained model is a kind of computer program that captures the desired pattern. With such a model, the dependence that was found can be applied to new individual cases.

Example:
A trained model can determine from a customer’s search history whether the customer is interested in the company’s advertising. In 2015, Avito held a machine learning competition in which participants had to use machine learning algorithms to find patterns linking a user’s search history on the site with their reaction to contextual ads. Such a pattern can be used to show an ad to exactly those users who will be interested in it. This not only makes the advertising more effective, but also improves the user experience with the service.

So, machine learning is used to identify useful patterns. This, in turn, requires learning algorithms and accumulated data about the object of observation.

To summarize, a machine learning problem always has three components:

  • a pattern that exists in explicit or implicit form;

  • a set of representative data within which a pattern can be traced;

  • algorithms (or models) that, based on data, reveal a pattern.

Let’s consider each of these components in more detail.

How to work with a pattern

We will call a pattern a certain set of features and their values, on the basis of which conclusions can be drawn. Features can be of very different kinds.

Example:
A person’s age is a feature that is taken into account when calculating the cost of an insurance policy. The values of this feature can, as a rough assumption, range from 1 to 100. The pattern hidden in the age feature may be as follows: the older the insured person, the higher the risk of health problems.

It is good when the pattern is simple: it is easy to detect or there are already experts who can describe it. But sometimes there are difficulties:

  • There are no rules at all.

  • The pattern is too complex to be generalized for practical use. It is impossible to describe the experience exhaustively.

  • The pattern changes over time.

If the pattern changes, you need to update the model, taking into account current experience. For example, if an online store enters the market of another country, then new customers may have different preferences and different behavior. But it’s not the end of the world! You can update the current pattern in automatic mode – this allows you to update the models as often as required.

In what problems can you find patterns with the help of ML

There are two “pillars” of application of ML: forecasting processes to make them more manageable and efficient, and automating routine actions. Anything can be predicted: from the demand for specific goods and services to the cost of these goods and services. Routine operations that are often performed by humans can be transferred to a computer. It’s faster and cheaper.

Think back to your experience when you registered your passport or driver’s license in a new car sharing app. Literally 5 minutes after you have photographed your document, the data from it is transferred to the system and you have access. This operation is most likely performed by technology, not people.

But machine learning is not a panacea. There are cases when it is impossible to obtain such a forecast that would be useful to the business. The accumulated data is not always detailed enough, and sometimes it cannot be obtained at all.

Example:
Now there is a big discussion about whether human health data can be given to third parties, whether it be an employer or a bank issuing a loan. After all, having this data, companies can see risks that were not available to them before. Based on them, decisions will be made that may not always be in favor of the job seeker or borrower.

The benefits of using machine learning can be measured in terms of financial impact, which determines the importance and quality of predictions: whether costs will be reduced, whether profits will increase.

How to collect data to look for patterns

Data for ML should contain sufficiently detailed information on already known cases. If you’re trying to build a model that can tell if an email is legitimate or spam, then the training data for the model should include a comprehensive set of both types of email.
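
A quick way to check that a training set really covers both types of email is to count the labels. The sketch below assumes a hypothetical list of (text, label) pairs. The emails, the function names, and the 10% threshold are invented for illustration and do not come from a real mailbox.

```python
from collections import Counter

# Hypothetical labeled emails: (text, label). Invented for illustration.
emails = [
    ("Meeting moved to 3 pm", "legitimate"),
    ("You won a prize, click here", "spam"),
    ("Invoice for March attached", "legitimate"),
    ("Cheap loans, no questions asked", "spam"),
    ("Lunch tomorrow?", "legitimate"),
]

def label_shares(records):
    """Return the share of each label in the dataset."""
    counts = Counter(label for _, label in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def missing_or_rare(records, expected_labels, min_share=0.1):
    """List expected labels that are absent or below min_share (an assumed threshold)."""
    shares = label_shares(records)
    return [label for label in expected_labels if shares.get(label, 0.0) < min_share]

shares = label_shares(emails)
problems = missing_or_rare(emails, ["legitimate", "spam"])
print(shares)    # share of each label
print(problems)  # labels that need more examples
```

If the second list is not empty, it makes sense to collect more examples of those classes before training.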

Collecting data to solve a problem can take time: you need to organize the collection and storage, analyze new information, determine the sufficiency of data. It may take tens of thousands of examples to achieve good results when training models. Gathering this data is not easy, but it is possible. Sometimes you can use ready-made models trained by other specialists on the data they have collected.

Hugging Face, a portal that is quite popular among professionals, hosts more than 200 thousand pre-trained models for a variety of tasks. On Hugging Face, you can also try out ready-made pre-trained models. For example, you can look at the output of the toxic-bert model, which estimates how toxic a text is. If you need to measure the level of negativity in the comments on your company’s posts, here is a ready solution. Take it and plug it in =)

Working with pretrained models most often concerns simple tasks related to automation. For more complex tasks, ready-made models can be upgraded and retrained. As a rule, this requires less data than if the model was trained from scratch to solve the problem.

The data itself can be different: texts, images, video, audio, tables, even X-rays of ancient manuscripts. The latter is the subject of the competition with the largest prize pool to date on Kaggle, a popular machine learning competition platform. Yes, by building a high-quality model, you can earn a very decent reward for your work. Machine learning is, in general, a well-paid field.

By the way, on Kaggle you can see what real problems machine learning is used to solve, what competitions they organize, and, by drawing parallels, find where ML can be useful to you.

Sometimes the data can be enriched with information from external sources, which helps in the search for patterns. For example, you can find and download suitable images from Google Images or Yandex Pictures. But in this case, the data that you later pass to the finished model for predictions must be enriched in the same way, and you need to set up procedures for doing so. Otherwise, the pattern that was found simply cannot be used.

Let’s talk a little more about what data types can be used to find patterns.

What data can be used to find patterns

We mentioned earlier that data can be of very different kinds: from tables and texts to images and videos. Let’s take a closer look at the principles of ML that apply to any type of data, using the example of tables or, as experts also call them, tabular data.

A row in a table is an ordered numerical sequence of a fixed length. Why fixed? Because the number of columns in the table is finite. In other words, each row (or record in the table) has n columns that characterize it. The columns are the attributes, or features, of the object that the record describes, and each number is the value of the corresponding feature.

Example:
The company has a new client who, over time, has left a history of purchases and views. Data about this client’s history enters the table as a new row, or record. What about features? The features that describe the customer and are stored in the table can be the following: the user’s location, the type of device, the categories of goods viewed, the average purchase price, the frequency of purchases. There can be a lot of information. We only gave a few reference points so that you can better understand what a record and features are in tabular data.

Mathematically, records about objects can be considered as vectors in some space. The machine learning model often cannot accept these vectors as they are in the table, so they are translated into a format suitable for use by machine learning algorithms. This stage is called data processing and transformation (data preprocessing/wrangling). Comparatively complex information, such as text, can be transformed in a variety of ways, which gives specialists room for experimentation.
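
To make the preprocessing step more concrete, here is a minimal sketch of turning table records into numeric vectors. It uses one-hot encoding for the categorical columns, which is only one of many possible transformations. The customer records, the column names, and the helper functions are hypothetical.

```python
# Hypothetical customer records; field names and values are invented.
customers = [
    {"city": "Moscow", "device": "mobile", "avg_purchase": 1200.0, "purchases_per_month": 3},
    {"city": "Kazan", "device": "desktop", "avg_purchase": 450.0, "purchases_per_month": 1},
    {"city": "Moscow", "device": "desktop", "avg_purchase": 3000.0, "purchases_per_month": 5},
]

CATEGORICAL = ["city", "device"]
NUMERIC = ["avg_purchase", "purchases_per_month"]

def fit_categories(records):
    """Collect the sorted set of values seen for each categorical feature."""
    return {name: sorted({r[name] for r in records}) for name in CATEGORICAL}

def to_vector(record, categories):
    """One-hot encode categorical features, then append the numeric ones."""
    vector = []
    for name in CATEGORICAL:
        vector.extend(1.0 if record[name] == value else 0.0 for value in categories[name])
    vector.extend(float(record[name]) for name in NUMERIC)
    return vector

categories = fit_categories(customers)
vectors = [to_vector(r, categories) for r in customers]
for v in vectors:
    print(v)  # every record becomes a vector of the same fixed length
```

Each record now has the same length (here two city slots, two device slots, and two numbers), which is the form most algorithms expect.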

When it comes to forecasting, the feature that the task asks you to predict is called the target feature, or simply the target.

Example:
You want to predict whether a bank customer will repay a loan. The bank has information about the loans of many customers and whether each loan was repaid. Using a client’s history and up-to-date information about them, you can estimate with some probability whether the client will pay the debt. Here you predict a class: the loan will or will not be repaid. This kind of problem is called a classification problem.
You can also build a model that predicts what percentage of the loan the client will return to the bank, where 0% means nothing is returned and 100% means the loan is returned in full. This kind of problem is called a regression problem.

Once the model has been trained, it is given a set of known feature values in the same format as those used in training. The output is a prediction of the target. The target value is unknown in advance, otherwise there would be no point in predicting it.

Features, on the contrary, must be known and available. Keep this in mind when choosing predictive features for model training: check whether each of them can actually be determined at the time of prediction for new cases.
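
One simple way to keep track of this is a feature catalog that records, for each column, whether its value exists at the moment of prediction. The sketch below is an assumption about how such a check could look. All feature names and the helper functions are hypothetical.

```python
# Hypothetical feature catalog for a loan-repayment model (all names invented).
# The flag says whether the value already exists when a new application arrives.
FEATURE_KNOWN_AT_PREDICTION = {
    "age": True,
    "monthly_income": True,
    "requested_amount": True,
    "days_overdue_on_this_loan": False,  # appears only after the loan is issued
    "collection_calls_made": False,      # a consequence of the outcome itself
}

def split_features(catalog):
    """Separate features usable for prediction from those that are not yet known."""
    usable = [name for name, known in catalog.items() if known]
    unavailable = [name for name, known in catalog.items() if not known]
    return usable, unavailable

def select(record, usable):
    """Keep only the usable features of a record before it goes to the model."""
    return {name: record[name] for name in usable}

usable, unavailable = split_features(FEATURE_KNOWN_AT_PREDICTION)
application = {"age": 34, "monthly_income": 85000, "requested_amount": 300000,
               "days_overdue_on_this_loan": 0, "collection_calls_made": 0}
model_input = select(application, usable)
print(usable)
print(unavailable)
```

Training only on the usable columns keeps the model from relying on information it will never have for a new client.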

How to choose an algorithm for finding patterns

All machine learning tasks can be divided into several types, and based on this, choose an algorithm to search for patterns.

Supervised learning

Supervised learning is the most common type of ML task and is associated with forecasting. In this case, the data contains the values of the predictor features known at the time of prediction and the corresponding values of the target feature. In other words, at the training stage there is already a historical data set in which the value being predicted is available.

Example:
Loan repayment prediction is a task for supervised learning. You already know for each borrower whether he repaid the loan and, if not, exactly how much was not repaid.
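
Here is a minimal sketch of what "learning from historical data with a known target" can look like. We use a k-nearest-neighbors classifier only because it fits in a few lines; it is not a recommendation for real credit scoring. The borrower data and the two features (income and debt-to-income ratio) are invented.

```python
import math

# Hypothetical history: (income in thousands, debt-to-income ratio) -> 1 if repaid, 0 if not.
history = [
    ((90.0, 0.10), 1),
    ((75.0, 0.20), 1),
    ((60.0, 0.25), 1),
    ((30.0, 0.70), 0),
    ((25.0, 0.80), 0),
    ((40.0, 0.65), 0),
]

def scale(features, lows, highs):
    """Min-max scale each feature to [0, 1] so that no feature dominates the distance."""
    return tuple((x - lo) / (hi - lo) for x, lo, hi in zip(features, lows, highs))

def fit(records):
    """'Training' for nearest neighbors: remember scaled examples and the scaling ranges."""
    columns = list(zip(*(f for f, _ in records)))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    examples = [(scale(f, lows, highs), y) for f, y in records]
    return {"lows": lows, "highs": highs, "examples": examples}

def predict(model, features, k=3):
    """Predict the majority target among the k closest known cases."""
    point = scale(features, model["lows"], model["highs"])
    nearest = sorted(model["examples"], key=lambda ex: math.dist(ex[0], point))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if votes * 2 > k else 0

model = fit(history)
print(predict(model, (80.0, 0.15)))  # new borrower similar to those who repaid
print(predict(model, (28.0, 0.75)))  # new borrower similar to those who did not
```

The important part is the shape of the task: the history contains both the features and the known answer, and the model is then asked about a borrower whose answer is not yet known.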

Tasks from the supervised class fall into two main groups: classification tasks (for example, comment toxicity) and regression tasks (for example, a bank loan). Such tasks have already been mentioned above.

The fundamental difference between the problems of regression and classification lies in the nature of the predicted feature – a number or a class:

  • If you want to predict a quantitative, measurable value, for example, the duration of a taxi ride, then it is a regression problem. Take a look at the New York City Taxi Trip Duration competition: participants predicted exactly that, the duration of a taxi ride.

  • If the predicted value cannot be quantified and instead reflects membership in a particular group, then the task is a classification problem. An example is Homesite Quote Conversion: predicting whether a client will purchase the proposed real estate insurance plan. What is predicted here is not a measurable value, but membership in one of two categories: those who bought insurance at the proposed rate, and those who declined the offer.
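
The difference between the two groups is easy to see in code: a regression model returns a number, a classification model returns a category. The sketch below uses two deliberately tiny models, a least-squares line and a single price cut-off. The data is invented and has nothing to do with the actual competition datasets.

```python
from statistics import mean

# Regression: numeric target. Hypothetical taxi trips: (distance_km, duration_min).
trips = [(1.0, 5.0), (2.0, 9.0), (3.0, 13.0), (4.0, 17.0)]

def fit_line(points):
    """Ordinary least squares for one feature: duration = intercept + slope * distance."""
    xs, ys = zip(*points)
    x_mean, y_mean = mean(xs), mean(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in points)
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope

intercept, slope = fit_line(trips)
predicted_duration = intercept + slope * 5.0  # a number, in minutes

# Classification: categorical target. Hypothetical insurance quotes: (price, outcome).
quotes = [(100.0, "bought"), (120.0, "bought"), (150.0, "bought"),
          (200.0, "declined"), (220.0, "declined")]

def fit_threshold(examples):
    """Pick the price cut-off that classifies the most training examples correctly."""
    def correct(cut):
        return sum(("bought" if price <= cut else "declined") == label
                   for price, label in examples)
    return max((price for price, _ in examples), key=correct)

cutoff = fit_threshold(quotes)
predicted_class = "bought" if 130.0 <= cutoff else "declined"  # a category, not a number

print(predicted_duration)
print(cutoff, predicted_class)
```

The quality metrics differ as well: for regression we look at how far the predicted number is from the real one, for classification at how often the predicted category is correct.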

Unsupervised learning

Machine learning can be used for more than just making predictions. Unsupervised learning algorithms do not predict the value of any feature, so there is no target and there are no predictors. These algorithms explore the structure of the data and the relationships between objects. Typically, unsupervised learning is used for analytics or to help solve supervised learning problems.

Example:
Clustering – combining similar records into groups, or clusters – is performed using unsupervised learning algorithms. Clustering is often used to group similar users together. Then customers from the same cluster can be shown similar discounts or recommendations.
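
As an illustration, here is a bare-bones version of k-means, one of the common clustering algorithms. This is a sketch under simplifying assumptions: the customer data is invented, the first k points are used as starting centers, and the features are not scaled, which you would normally do in practice.

```python
import math

# Hypothetical customers: (purchases per month, average purchase price).
customers = [(1.0, 20.0), (2.0, 25.0), (1.5, 22.0), (8.0, 90.0), (9.0, 95.0), (8.5, 100.0)]

def centroid(points):
    """Coordinate-wise mean of a group of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centers (a simplification)."""
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = centroid(members)
    return labels, centers

labels, centers = k_means(customers, k=2)
print(labels)   # cluster index for each customer
print(centers)  # the "typical" customer of each cluster
```

Note that no target is involved: the algorithm only looks at how close the records are to each other.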

Example:
If the available features need to be transformed, unsupervised learning algorithms can transform the records in the data table so that each record is represented by fewer features (columns) while most of the information is preserved. This compresses the data, which in turn saves RAM when training supervised models.
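
A classic technique of this kind is principal component analysis (PCA). The sketch below handles only the smallest possible case, two features compressed into one, so that it fits in plain Python. The records are invented, and real projects would normally use a library implementation.

```python
import math

# Hypothetical records with two strongly correlated features.
records = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0), (5.0, 9.8)]

def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Direction of largest variance for a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(angle), math.sin(angle)
    projected = [(x - mx) * ux + (y - my) * uy for x, y in points]
    # Share of the total variance kept by the single new feature.
    kept = (sum(p * p for p in projected) / n) / (sxx + syy)
    return projected, kept

projected, kept = pca_2d_to_1d(records)
print(projected)  # one number per record instead of two
print(kept)       # close to 1 when the two features are strongly related
```

The share of variance kept shows how much is lost by the compression. When the original features carry largely the same information, one column can replace two almost without loss.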

Thanks to dimensionality reduction techniques, you can also visualize complex multidimensional data. This is what a dataset for a bank loan default forecasting problem might look like.

Loan defaults are marked in blue, loans returned in green.  In practice, the dots usually overlap and become more difficult to distinguish.


Separately, it is worth highlighting the tasks of anomaly detection. If you need to predict some relatively rare phenomenon, such as bank fraud, it can be useful to use unsupervised learning, namely anomaly detection algorithms. These algorithms will help to detect atypical objects that stand out from the general structure. In practice, such methods can be used both independently and in combination with more common methods.
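
A very simple example of this idea is flagging values that lie far from the typical one. The sketch below uses a robust score based on the median and the median absolute deviation (MAD) with the commonly used cut-off of 3.5. The transaction amounts are invented, and real fraud detection looks at many features at once.

```python
from statistics import median

# Hypothetical transaction amounts; the last value is an invented outlier.
amounts = [120.0, 95.0, 130.0, 110.0, 105.0, 125.0, 115.0, 5000.0]

def find_anomalies(values, threshold=3.5):
    """Flag values whose robust z-score (median and MAD based) exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # at least half of the values are identical: the score is undefined
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

anomalies = find_anomalies(amounts)
print(anomalies)
```

No labels such as "fraud" or "not fraud" are needed here: the algorithm only reports what stands out from the rest of the data.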

Reinforcement learning

Everything that was discussed in the article earlier refers mainly to classical machine learning. Unlike classical machine learning, reinforcement learning does not use input data in the form of a finite number of pairs “a description of a known case – the corresponding value of the target feature”.

It is assumed that the algorithm, having no access to information about correct and incorrect actions, through trial and error, iteratively interacts with some environment. The purpose of the algorithm is to find a strategy for interacting with the environment that will provide the algorithm with the maximum gain (some numerical signal). At the same time, actions in the environment determine not only the gain, but can also influence the environment itself, change it.
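
The trial-and-error loop can be sketched with the simplest reinforcement learning setting, a multi-armed bandit with an epsilon-greedy strategy. This is a deliberately reduced case: here the actions do not change the environment, they only bring a reward. The win probabilities, the function name, and the parameter values are assumptions made for illustration.

```python
import random

def run_bandit(win_probabilities, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    n_actions = len(win_probabilities)
    pulls = [0] * n_actions
    value = [0.0] * n_actions  # running average reward per action
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: try a random action
        else:
            action = max(range(n_actions), key=lambda a: value[a])  # exploit
        # The environment answers with a reward; the agent never sees the probabilities.
        reward = 1.0 if rng.random() < win_probabilities[action] else 0.0
        pulls[action] += 1
        value[action] += (reward - value[action]) / pulls[action]
        total_reward += reward
    return value, pulls, total_reward

value, pulls, total_reward = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(3), key=lambda a: value[a])
print(value)        # the agent's own estimate of each action
print(pulls)        # how often each action was tried
print(best_action)  # the action the agent currently considers best
```

The agent is never told which action is correct. It builds its estimates only from the rewards it receives and gradually shifts toward the action that pays best.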

An illustrative example of reinforcement learning is discussed in this article, and you can dive into the theoretical aspects in the online machine learning tutorial from ShAD. Reinforcement learning is a broad topic that allows you to solve complex business problems. Usually, reinforcement learning is studied once supervised and unsupervised learning algorithms are already well mastered, because those cover the largest part of market needs.

If you want to learn how to work with data and machine learning algorithms to solve practical problems, then you can learn a new profession on the course “Data Scientist” from Yandex Practicum.
