Hello, Habr! We invite you to a free demo lesson, “Modern Big Data, Analysis and Performance Optimization of Distributed Applications”… In this article we also describe how the market for Data Science specialists, and for Big Data specialists in particular, is developing, and what awaits you on the course on industrial machine learning.
In large companies, data science reduced to fit-predict is a thing of the past
The first thing to note is that there is now an excess of juniors, and companies increasingly prefer to hire a Middle/Senior specialist, give them some time to study the company's infrastructure, and immediately assign them production tasks.
At the same time, a significant share of novice specialists still believe that a Data Scientist's job is only to build the model: train it on some data and hand it over to a Data Engineer, who will somehow figure out the rest. But training and validation processes are becoming so well established and transparent that even a non-specialist can run fit-predict. As a result, people who can do only this are not really needed in such streamlined pipelines.
In addition, there is the problem of training specialists who have engineering knowledge at least at a bird's-eye level. Classic courses offer little on this subject, partly because it is difficult to deploy the necessary infrastructure quickly, and Kaggle tasks do not require it. When you join a large company, you are greeted by a cluster of tens of petabytes, where you have to write distributed algorithms on frameworks that differ from the standard Data Scientist toolkit. On the one hand, this scares many people; on the other, those who understand it at least at a basic level get an advantage when hiring.
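To make the term concrete: in its narrowest sense, "fit-predict" is the two-call workflow sketched below. This is a toy illustration in plain Python, not material from the course; the class name `LineModel` and the data are made up for the example.

```python
from statistics import mean


class LineModel:
    """Toy one-feature linear regression with a fit/predict interface (hypothetical)."""

    def fit(self, xs, ys):
        # Ordinary least squares for a single feature.
        x_bar, y_bar = mean(xs), mean(ys)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = y_bar - self.slope * x_bar
        return self

    def predict(self, xs):
        # Apply the fitted line to new inputs.
        return [self.intercept + self.slope * x for x in xs]


# "fit": train on historical data; "predict": score new inputs.
model = LineModel().fit([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
predictions = model.predict([5.0, 6.0])
```

Everything around these two calls (getting the data to the model, deploying it, serving and monitoring it) is the engineering part that the rest of this article is about.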
Alternative specialty for Data Scientists and Software Engineers
The course “Industrial ML on Big Data” combines the skills of a Data Scientist and a Data Engineer. As a rule, such specialists are needed in large companies with a large-scale digital product, where they have to work with streaming data.
Accordingly, this profile can be mastered both by machine learning specialists and by people with a background in software engineering. The latter will find it somewhat easier, because basic ML is easier to learn than a full stack of engineering technologies.
Skills required to work with Big Data and distributed data
In short, you will need to understand the specifics of processing distributed data, master the Spark framework, and learn all the components of a production system.
We packed all of this (and a little more) into the online course “Industrial ML on Big Data”…
The program runs for 5 months and consists of 9 modules:
Module 1 covers the basic knowledge needed for the rest of the program. It is a quick review of ML: what models, metrics and types of training exist, and how we train models, measure results, validate them, and draw conclusions.
We also included a Scala lesson here… Although you can work with big data through the Spark framework in Python, we still suggest getting to know Scala so that you can use Spark through its native API. The module ends with a homework assignment in Scala.
In module 2, you will learn the technical foundations of distributed data processing… You will learn about storage, how parallel algorithms evolved, and which resource managers are used in such distributed systems. You will get started with Spark and do a homework assignment on it.
In module 3, we start to dive into distributed ML… We show how models are trained in the distributed paradigm on Spark and how to select hyperparameters. That is, we transfer the local computing experience familiar to a Data Scientist into a distributed setting.
Module 4 focuses on streaming… It is primarily useful for those who have done competitive data analysis or have worked with limited resources. These skills matter most in large companies, where there is a continuous stream of incoming data that has to be processed, appended, stored, and fed to ML models on the fly.
The objective of module 5 is to teach you to set long-term and short-term goals for an ML project… You will understand how to achieve these goals and measure the results. A couple of sessions are dedicated specifically to conducting A/B testing.
Module 6 answers the questions of how and why to train models. You will learn how to roll out models in your infrastructure: package, version, reproduce, serve, and so on. All of this is in the context of big data and the distributed paradigm.
Module 7 is reserved for Python… You will master various practices: how to write production code in it and how to package it, how to put a model into serving, create an API for it, pack it into containers, and roll it out, using cloud platforms such as Amazon as an example.
Module 8 is dedicated to advanced topics. Here we will look at how to run neural networks in production, cover reinforcement learning, and finish the module with gradient boosting, where you will learn to run it in a distributed fashion on a cluster.
Module 9 focuses on project work… Here you have two options:
You can take a real case that you are currently working on at your job. In that case you will solve the assigned task end to end: starting with data that arrives as a stream or is uploaded as a dataset, and ending with the output your models produce as a service, an export, and so on.
You can build a training project: a recommendation system based on the OTUS database.
The specialty this program teaches is not only highly applied but will also become more and more in demand every year. More and more digital products are built around data processing, and specialists are increasingly required not only to train a model but also to prepare it properly for production.
If you are interested in industrial ML, you can take the first steps in this direction as early as October 19 at the demo lesson “Launching ML models into the industrial environment using the example of online recommendations”, which will be led by Dmitry Bugaychenko, managing director at Sberbank. Since the lesson is designed for specialists with experience in working with data, you will need to pass an introductory test to register…
The course “Industrial ML on Big Data” itself starts on October 30… You can get acquainted with the teaching staff and the program here…
See you in class!