Business scenarios that use ML (machine learning) span all areas of the business and use most types of data: tabular, text, audio, images, and so on. The number of projects keeps growing, while the number of specialists is growing more slowly. Hence the idea that part of the work of these “expensive” data scientists can be automated. This is where AutoML comes to the rescue.
AutoML means different things to different people. At SAP, we believe it is the automation of routine Data Science operations. There is probably no need to go into the definition in more detail in this article, since Aleksey Natekin has already done that quite well here.
If you have no desire to watch the video, here are some thoughts on the topic:
There is a good example on this subject. Once, in the DS group, we discussed a case from practice: a person applying for a Senior DS role came for an interview, while all he knew how to do was run one of the popular AutoML tools. When asked the reasonable question of how one can qualify for a Senior level with such knowledge, his answer was impeccable: “I bring money to the business, and this is my tool.” In other words, AutoML fits scenarios where the data is already neatly collected in data marts, domain features have been generated, and quality metrics are defined, which lets you launch a new service quickly. Yes, the result may be worse than that of a professional DS, but it is most likely better than that of a junior, and in some cases you can use it right away.
Here are more examples of what well-known people in the community think about this (the first comment refers to a discussion of the news that AutoML from Google took second place).
The heavy use of resources comes from the fact that there is no advanced meta-learning yet. More precisely, it exists only in isolated places in some solutions, at a very early stage of readiness, or in the form of prototypes. Everything else is random hyperparameter search or more promising approaches: TPE, Bayesian optimization, NAS, RL.
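To make the baseline concrete, here is a minimal sketch of random hyperparameter search in plain Python. It is an illustration only, not how any particular AutoML product works: the function toy_validation_loss is a made-up stand-in for "train a model and measure its validation loss", and the parameter names and ranges are assumptions chosen for the demo.

```python
import math
import random


def toy_validation_loss(params):
    """Hypothetical stand-in for 'train a model and return its validation loss'."""
    lr, depth = params["learning_rate"], params["max_depth"]
    # The minimum sits near learning_rate=0.1, max_depth=6 (arbitrary demo choice).
    return (math.log10(lr) + 1.0) ** 2 + 0.05 * (depth - 6) ** 2


def sample_params(rng):
    """Draw one configuration: log-uniform learning rate, uniform integer depth."""
    return {
        "learning_rate": 10 ** rng.uniform(-3, 0),  # roughly 0.001 .. 1.0
        "max_depth": rng.randint(2, 12),
    }


def random_search(objective, n_trials=50, seed=0):
    """Evaluate n_trials random configurations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(toy_validation_loss, n_trials=200, seed=42)
print(best_params, round(best_loss, 4))
```

Every trial here is independent of the previous ones, which is exactly why the approach burns resources: approaches such as TPE or Bayesian optimization use the history of earlier trials to decide where to sample next.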
There is an open benchmark so that AutoML solutions and approaches can be compared. Commercial vendors do not like such comparisons for a very simple reason: an open head-to-head contest is almost impossible. Besides accuracy, a great deal depends on data types, integration, and usage. Building the model itself is 15-20% of the work (or maybe less); on top of that there is a huge layer of other work, from data transfer to publishing the service.
SAP has its own position in the AutoML market. We have several different engines at different levels of maturity.
The SAP Automated Predictive Library in SAP HANA, which historically appeared after the acquisition of KXEN in 2013, was developed further exclusively as a tool for implementing models as quickly as possible. It is convenient when there is no large time budget for training models, but a reasonably high-quality result is still important. In effect, consider it a fast version of AutoGBDT. There is now a Python wrapper, familiar to most people, and it looks something like this (Fig. 1).
The second branch, the AutoML solution in SAP Data Intelligence, appeared in December 2019. It is an approach built on familiar open source tools and complemented by our own developments. Here you set the allowed computation time, and within the cluster the optimal combination of steps, algorithms, and hyperparameters is selected; the final pipeline looks like this (Fig. 2).
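The idea of searching over combinations of steps, algorithms, and hyperparameters under a fixed time budget can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the SAP Data Intelligence implementation: SEARCH_SPACE, toy_score, and search_pipelines are hypothetical names, the scores are invented, and the candidates are simply enumerated in order rather than chosen by a smarter strategy.

```python
import itertools
import time

# Hypothetical search space: a pipeline is one choice per step.
SEARCH_SPACE = {
    "imputer": ["mean", "median"],
    "encoder": ["one_hot", "target"],
    "model": ["linear", "gbdt"],
    "n_estimators": [50, 200],
}


def toy_score(pipeline):
    """Made-up stand-in for cross-validated quality; higher is better."""
    score = 0.70
    score += 0.05 if pipeline["model"] == "gbdt" else 0.0
    score += 0.02 if pipeline["encoder"] == "target" else 0.0
    score += 0.01 if pipeline["imputer"] == "median" else 0.0
    # n_estimators only matters for the tree-based model in this toy setup.
    if pipeline["model"] == "gbdt" and pipeline["n_estimators"] == 200:
        score += 0.03
    return round(score, 2)


def search_pipelines(space, score_fn, time_budget_s, clock=time.monotonic):
    """Try pipeline combinations until the time budget runs out; return the best."""
    keys = list(space)
    start = clock()
    best, best_score, tried = None, float("-inf"), 0
    for values in itertools.product(*(space[k] for k in keys)):
        if clock() - start >= time_budget_s:
            break  # budget exhausted: stop and report the best found so far
        pipeline = dict(zip(keys, values))
        score = score_fn(pipeline)
        tried += 1
        if score > best_score:
            best, best_score = pipeline, score
    return best, best_score, tried


best, best_score, tried = search_pipelines(SEARCH_SPACE, toy_score, time_budget_s=5.0)
print(best, best_score, tried)
```

The clock argument is injectable only so that the budget logic can be checked without real waiting. In a real system each evaluation would be a full training run, and the candidates would typically be distributed across the cluster.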
This AutoML is part of the SAP Data Intelligence platform and can run both in the cloud and on-premise. The platform also provides everything needed to manage datasets and integration and, perhaps most importantly, the standard mechanisms for integrating with SAP S/4HANA, including the generation of interfaces and services.
Looking at the next steps, it is fairly obvious that, from the business point of view, the data should be enriched with annotations that are relevant for particular tasks. These include domain features, the best forms of aggregation over particular relationships between business objects, and pre-trained micro neural networks that act as feature extractors.
If you look at competitions and articles in the field of AutoML, you can clearly identify the following areas:
1) AutoTable – tabular data
2) AutoCV – images and videos
3) AutoNLP – texts
4) AutoTS – time series
5) AutoGraph – graphs
6) AutoSpeech – sound
7) AutoAD – anomaly detection
I suppose there will also be solutions under AutoRL, for reinforcement learning.
Currently, in terms of AutoML solutions, SAP is focusing on tabular data, time series, and anomalies. The reason is simple: an intelligent enterprise can only be built with a huge number of models in each of the business areas. And, of course, each company has its own specifics, so if the standard (typical) models are not suitable, they need to be customized. The easiest way to do this is with tools that do not require the participation of DS specialists.
In general, a lot of new and interesting things await us in the future …
Posted by Dmitry Buslov, Senior Business Solutions Architect, SAP CIS.