unless you have a deep understanding of the process.
At Jet Infosystems, we implement machine learning in a wide variety of industries. Based on that experience, we have identified the components necessary for a successful implementation:
- a goal aimed at optimizing a priority business metric;
- a team of data scientists who are competent and ready to dive deep into the process;
- data that is relevant to the business task;
- an appropriate choice of method.
In practice, all of these elements rarely come together: according to statistics, only about 7% of ML projects are considered successful. Projects that have all of these components can safely be called breakthroughs! To illustrate, we have formulated several pieces of "bad advice" about introducing machine learning in business.
Bad advice No. 1: “The task is simply to implement ML”
Customers often formulate the task as “just introduce machine learning for some kind of optimization”, with no link to business metrics or to the prioritization of business tasks.
In this case, several negative scenarios are possible. For example, the target may change in the course of the work, which means that all the preprocessing and the choice of optimization methods will change too, because they depend directly on the meaning of the target. Or a data scientist will pick some machine learning metric, for example AUC, and keep improving it, bringing in every hyped framework and library and, guided by a personal sense of beauty, polishing the “fifth decimal place” of the chosen metric. Meanwhile, this work may be completely irrelevant to the business and never lead to a successful implementation. Or the team will start solving a task of minor business value while a much greater opportunity for machine learning sits right next to it.
As a result, you may encounter negative consequences:
- it is impossible to predict the timing and labor costs;
- models are improved in isolation from business metrics;
- the investment goes into a minor task.
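The gap between an ML metric and a business metric can be illustrated with a small sketch. The data is a toy example, and `hits_at_k` is a hypothetical business metric (how many real cases land among the k items the business has the capacity to act on); neither comes from a real project.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hits_at_k(labels, scores, k):
    """Hypothetical business metric: positives among the k top-scored items."""
    ranked = sorted(zip(scores, labels), reverse=True)
    return sum(y for _, y in ranked[:k])


# Toy data (an assumption for illustration): 3 positives, then 4 negatives.
labels = [1, 1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.05, 0.4, 0.3, 0.2, 0.1]
model_b = [0.8, 0.6, 0.5, 0.9, 0.3, 0.2, 0.1]

for name, scores in [("A", model_a), ("B", model_b)]:
    print(name, round(auc(labels, scores), 3), hits_at_k(labels, scores, k=2))
```

On this toy data, model B has the higher AUC, yet model A puts more real cases into the top two positions. Which model is "better" depends on the metric the business actually cares about, so that metric should be fixed before modeling starts.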
Bad advice No. 2: “Any data scientist will do”
There is an opinion that you can hire any data scientist on the market, seat them alone with a pile of Excel files, and they will magically figure out what needs to be optimized. In our opinion, the mindset of data scientists who work on production optimization is extremely important. They must be ready to dive deep into technological processes (for example, aluminum electrolysis, oxygen-alkaline cellulose treatment, blast furnace production, etc.). It also matters that they are willing to go on long business trips to talk in person with the technologists and operators at the plant and understand how everything really works. Without this, they are most likely doomed to many thoughtless iterations of trying one model after another, and may never reach a useful implementation.
Bad advice No. 3: “Work should be patchwork”
We regularly encounter the idea that work should be organized in as fragmented a way as possible, with maximum division of labor, in order to minimize costs. For example, there is an analyst who understands the process and communicates with customers and technologists. There is a data engineer who processes the data and generates features. And finally, there is a data scientist who just does import sklearn and fit/predict. As a result, the data scientist works in isolation from the realities of production, in purely laboratory conditions, and there is a high risk of making many errors and missing important aspects of the original task.
Bad advice No. 4: “Don't explain to data scientists how data is collected”
It is not always obvious that data scientists need to understand how and where the data is collected. There are even cases where ML implementation contracts are signed before anyone has looked at the data, and under such conditions there is a risk of never reaching the target metric values written into the contract. With this approach, problems will inevitably arise both in assessing the quality of the models and in applying them in practice.
Many properties of the data influence the choice of methods: data averaging, measurement errors, uneven sampling of examples, and time lags in measurements. It is important to clean noise from both the features and the targets correctly. The causes of noise vary: digitization errors, outliers, duplicated variables, instrument errors, etc.
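Two of the noise sources listed above, duplicated variables and outliers, can be screened with a few lines of code. This is a minimal sketch that assumes the data fits in memory as a dictionary of columns. The column names and values are invented for illustration, and the interquartile-range rule is only one of many possible outlier heuristics.

```python
from statistics import quantiles


def duplicated_columns(table):
    """Return pairs of column names whose values are identical."""
    names = list(table)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if table[a] == table[b]]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical sensor table: one duplicated column and one suspicious reading.
table = {
    "temp_c": [20.1, 20.3, 19.8, 20.0, 95.0, 20.2, 19.9, 20.1],
    "temp_copy": [20.1, 20.3, 19.8, 20.0, 95.0, 20.2, 19.9, 20.1],
    "pressure": [1.01, 1.02, 1.00, 1.01, 1.03, 1.02, 1.01, 1.00],
}

print(duplicated_columns(table))
print(iqr_outliers(table["temp_c"]))
```

A flagged value is only a candidate for removal. Whether the reading of 95.0 is an instrument error or a real event in the process is exactly the kind of question that has to be answered together with the people who collect the data.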
It is in the company's interest for data scientists to understand the nature of the data thoroughly; otherwise data processing will drag on and will not lead to successful modeling. Without a deep understanding of how the data is collected and stored, you may encounter the following problems:
- data preprocessing will take a lot of time;
- the model may not be applicable in real conditions;
- terms of the contract may be unattainable.
Bad advice No. 5: “Make data collection a complicated and incomprehensible process so that no one knows how it works. After the introduction of models, be sure to make changes to the process”
Technological processes that affect data collection often change in parallel with the development and implementation of the model. Imagine that a technological process needs to be optimized, and after the model is introduced some units are reconfigured in a way that affects data collection: features start to “float”, distributions change, and the training sample stops being representative. Of course, no one knows about this in advance. As a result, the model stops working and everything has to be redone. For example, tree-based models may run into an out-of-domain problem, because they cannot extrapolate beyond the range of values seen during training.
It is important to agree on all changes to technological processes with the data scientists in advance, so that they can quickly adapt the models to the new conditions.
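Even with such coordination, it is useful to watch for “floating” features automatically. The sketch below compares live values with the training sample and flags a feature when its mean has shifted strongly or when values leave the range seen in training. The function name, the feature names, and both thresholds are assumptions chosen for illustration, not recommended settings.

```python
from statistics import mean, stdev


def drift_report(train, live, max_shift=3.0, max_out_of_range=0.05):
    """Compare live feature values with the training sample, feature by feature."""
    report = {}
    for name, train_values in train.items():
        live_values = live[name]
        low, high = min(train_values), max(train_values)
        sigma = stdev(train_values)
        gap = abs(mean(live_values) - mean(train_values))
        # Shift of the live mean, measured in training standard deviations.
        if sigma > 0:
            shift = gap / sigma
        else:
            shift = 0.0 if gap == 0 else float("inf")
        # Share of live values outside the range covered by the training data.
        outside = sum(v < low or v > high for v in live_values) / len(live_values)
        report[name] = {
            "mean_shift_in_sigmas": shift,
            "share_out_of_range": outside,
            "drifted": shift > max_shift or outside > max_out_of_range,
        }
    return report


# Hypothetical data: a unit was reconfigured and the temperature level moved.
train = {"furnace_temp": [1500, 1510, 1495, 1505, 1490, 1500],
         "flow": [10, 11, 9, 10, 10, 11]}
live = {"furnace_temp": [1580, 1590, 1585, 1575],
        "flow": [10, 10, 11, 9]}

report = drift_report(train, live)
for name, row in report.items():
    print(name, row)
```

In this toy example `furnace_temp` is flagged, because every live value lies above the training range, while `flow` is not. A check like this does not fix the model, but it signals when retraining or a conversation with the technologists is needed.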
Bad advice No. 6: “Average the features”
Some kinds of averaging lead to problems, for example:
- the task is to predict hourly energy consumption, but consumption data is stored only at monthly granularity; in this situation nothing can be done until raw data has been accumulated;
- averaging is done over characteristics that were measured at significantly different points in time;
- moving averages are used whose window covers the prediction period (which leads to data leakage and a distorted model);
- worst of all, the data has been averaged in some way and nobody knows about it.
In such cases, the task may have no adequate solution until the relevant raw data appears.
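The moving-average leak is easy to reproduce. In this minimal sketch, `trailing_mean` uses only values strictly before the moment being predicted, while `centered_mean` is shown purely as a contrast: its window includes the current and future values. Both function names and the toy series are assumptions for illustration.

```python
def trailing_mean(series, window):
    """Mean of the `window` values strictly before each position t.

    The value at t itself is never included, so the feature is available
    at prediction time. Positions without a full history get None.
    """
    out = []
    for t in range(len(series)):
        if t < window:
            out.append(None)
        else:
            out.append(sum(series[t - window:t]) / window)
    return out


def centered_mean(series, half_width):
    """Leaky variant: the window is centered on t and includes t and the future."""
    out = []
    for t in range(len(series)):
        lo, hi = t - half_width, t + half_width + 1
        if lo < 0 or hi > len(series):
            out.append(None)
        else:
            out.append(sum(series[lo:hi]) / (hi - lo))
    return out


# Toy series with a spike at position 3, which is the value to be predicted.
series = [10, 12, 11, 50, 13, 12]

safe = trailing_mean(series, window=3)
leaky = centered_mean(series, half_width=1)
print(safe[3], leaky[3])
```

At position 3 the trailing feature is computed from 10, 12, and 11 only, while the centered feature already "knows" about the spike of 50. A model trained on the centered feature will look excellent in validation and fail in production, where future values are not available.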
Bad advice No. 7: “Do not give out additional data”
There are several scenarios in which data scientists ask for additional data:
- additional raw data is needed;
- new features need to be added to the data set. For example, in banking and product recommendation tasks it is useful to have as many socio-demographic attributes as possible;
- the data set needs to be larger: the amount of data is limited but can be expanded with historical data, or additional data can be generated, as is done in image and video processing tasks.
Data scientists ask for additional data when their experience with similar problems shows that it improves the result. Without it, the quality of the models can be much worse than what is potentially achievable.
Bad advice No. 8: “The accuracy of manual labeling is not important”
Suppose we need to predict product quality from manual labels, i.e. production operators record the target values by hand. If the operators are also rewarded for good results and penalized for bad ones, then:
- the target is likely to be biased;
- during training, this bias will be passed on to the model;
- the model will not predict the actual distribution of the target variable.
Similar problems can arise with crowdsourcing solutions (for example, Yandex.Toloka), where annotators are paid for labeling the data. In this case, the resulting labels need to be validated carefully. There are a number of approaches for this:
- Overlap: several independent annotators label the same items;
- Golden set: examples with known answers are added to the data in order to evaluate the accuracy of the annotators and to select the reliable ones;
- Majority voting: algorithms that choose the final verdict from the overlapping labels.
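The golden set and majority voting approaches can be sketched in a few lines. The annotator names, item identifiers, and labels below are invented for illustration. Real crowdsourcing pipelines usually also weight annotators by their measured accuracy, which this sketch does not do.

```python
from collections import Counter


def golden_set_accuracy(answers_by_annotator, golden):
    """Share of golden-set items that each annotator labeled correctly."""
    scores = {}
    for annotator, answers in answers_by_annotator.items():
        checked = [item for item in golden if item in answers]
        correct = sum(answers[item] == golden[item] for item in checked)
        scores[annotator] = correct / len(checked) if checked else None
    return scores


def majority_vote(labels_by_item):
    """Most frequent label per item; ties yield None for manual review."""
    verdicts = {}
    for item, labels in labels_by_item.items():
        ranked = Counter(labels).most_common(2)
        tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        verdicts[item] = None if tie else ranked[0][0]
    return verdicts


# Hypothetical labeling results: g1 and g2 are golden items with known answers.
golden = {"g1": "ok", "g2": "defect"}
answers_by_annotator = {
    "op_a": {"g1": "ok", "g2": "defect", "x1": "ok", "x2": "defect"},
    "op_b": {"g1": "ok", "g2": "ok", "x1": "ok", "x2": "ok"},
    "op_c": {"g1": "ok", "g2": "defect", "x1": "defect", "x2": "defect"},
}

# Overlap: collect every annotator's label for each non-golden item.
labels_by_item = {}
for answers in answers_by_annotator.values():
    for item, label in answers.items():
        if item not in golden:
            labels_by_item.setdefault(item, []).append(label)

accuracy = golden_set_accuracy(answers_by_annotator, golden)
verdicts = majority_vote(labels_by_item)
print(accuracy)
print(verdicts)
```

Here the golden set shows that `op_b` misses the known defect, and the overlap lets the other two annotators outvote that label on item `x2`.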
Conclusion: if the data is labeled manually, the labels must be checked; otherwise systematic errors may creep in.
Bad advice No. 9: “Use the most fashionable method”
Read popular articles and demand that the solution to the problem be based on a fashionable method.
Today data science is a fashionable field: many articles are published, conferences are held almost every day, and more and more methods are being created. However, this does not mean that an arbitrarily chosen popular method is optimal for industrial tasks. You usually do not need an LSTM to optimize pig iron production, nor do you need RL on small data sets in marketing or mining. In such tasks it is reasonable to start with traditional methods (for example, gradient boosting), although it can be quite difficult to convince customers of this. Fashionable ML methods are not always suitable for industrial tasks and often prove costly to implement.
This list of tips is not exhaustive, but we regularly meet all of them in practice. Follow them, and you will most likely conclude that ML does not work in industry and is simply a waste of money.
To summarize, the truly breakthrough cases are ML projects that are implemented on time and bring the business a stable, measurable profit. Achieving this requires competence in data analysis and machine learning, as well as conditions in which data scientists understand the whole picture of the business problem.
Posted by Irina Pimenova, Head of Mining, Jet Infosystems