Reading time: 7 minutes
For: beginner data scientists
The text was prepared by journalist Ivan Survillo
How I got into data science
I studied at the Faculty of Economics. Graduates there do not have many paths to choose from: investment, accounting, and consulting. Consulting is the most popular, because the pay is good, and all students want to get in. To get in, you need to solve business cases. There are even case championships: you assemble a team, and a company gives you a business task, for example, to open new restaurant locations. You and the team develop a solution, then present it. What I didn’t like was that you have to lie a lot. Your solution is, in fact, not backed by anything.
When your theoretical calculation shows a project profit of one billion, you decide that is too much and write one million instead. In the end, it comes down to who has the prettier presentation and who has a plausible-looking figure. On top of that, our solutions are never applied anywhere afterwards.
Here is how data science differs: you say that with 95% probability the profit will be such-and-such. Your conscience is clear because you have built the error into your answer. This is what attracted me to data science.
What are we doing?
The company has data: numbers, statistics, revenue. There are goals that need to be optimized: profit, number of customers, margin. It is assumed that there is a relationship between data and goals. The job of a data scientist is to find this relationship.
For example, my current task is to search the entire database of legal entities in Russia for enterprises that could become customers of our company. I have the firms’ characteristics: type of activity, registered capital. I evaluate the base of current customers and look for the most similar companies, which will then be called and offered services.
Funnily enough, the names of transportation companies often contain the same substrings: “Sib Auto Logistics”, “Trans Auto Power”, “Cargo Sib Logist”. I noticed this and extracted the most common substrings; if they occur in a potential client’s name, the client is most likely engaged in transportation and logistics. To do this, I used the now-popular Byte Pair Encoding algorithm, although it is not intended for this purpose.
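The substring idea can be sketched with a small character-level Byte Pair Encoding loop. This is only an illustration, not the code used in the project: the function names, the `min_len` threshold, and the rule "any learned substring matches" are assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(names, max_merges=50):
    """Learn BPE merges over the lower-cased words of company names."""
    # Every word starts out as a tuple of single characters.
    words = Counter()
    for name in names:
        for word in name.lower().split():
            words[tuple(word)] += 1

    merges = []
    for _ in range(max_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats any more
        merges.append(a + b)

        # Rewrite every word, replacing the pair with the merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges


def frequent_substrings(names, min_len=3):
    """Keep only merged symbols long enough to be meaningful (assumed threshold)."""
    return {m for m in learn_bpe_merges(names) if len(m) >= min_len}


def looks_like_logistics(name, substrings):
    """Flag a name that contains any of the learned substrings."""
    lowered = name.lower()
    return any(s in lowered for s in substrings)


known_carriers = ["Sib Auto Logistics", "Trans Auto Power", "Cargo Sib Logist"]
markers = frequent_substrings(known_carriers)
print(sorted(markers))
print(looks_like_logistics("Auto Sib Trade", markers))
```

On real data the list of learned substrings would need manual review, since short fragments can also match unrelated names.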
I like it in data science. It is not like consulting. Here everyone works toward the same thing: everyone wants the field to develop. We have meetups every three weeks where we get together and tell each other what is new in the field and what cool things we have done. This is very motivating: you grow yourself when you see others developing.
The way you show yourself at the interview has a big impact on your salary.
There is a myth that salaries in data science are big, but here everything is individual, because the market is still poorly formed. I recently spoke with a colleague who said that one candidate asks for a certain amount and another asks for three times as much, and he cannot tell what is reasonable within that range. In general, data science is a bit of a Wild West. The way you show yourself at the interview has a big impact on your salary. In Moscow, an average specialist can earn 200 thousand, for example.
Sometimes a business thinks that data science is magic and hands over tasks that are not feasible at all. For example, you need to recognize whether the cook in the kitchen has rolled-up sleeves or not. First, you need to define what a rolled-up sleeve is. Second, you need to find data so that the model can learn: here the sleeves are rolled up, here they are not. Behind this lies a huge amount of work. It is important to understand that what you feed into an ML model determines what you get at the output. It cannot create magic out of nothing.
Recently, a colleague was asked to recognize whether the driver of a fuel truck is wearing boots with a metal toe cap or an ordinary leather one. The metal one protects the driver, the ordinary one does not. How do you recognize a metal toe cap? How do you formalize this? The neural network sees only color; it does not see what material something is made of. A neural network is not magic. I do not really like the inflated expectations that business has of data science. Businesses need to be educated. We have to explain to them what data science is, what it can do, and what it cannot do. Closer contact is needed.
A neural network is not magic.
“A NEURAL NETWORK IS, IN ESSENCE, A BLACK BOX THAT PRODUCES A RESULT. HOW IT GETS THERE, YOU DO NOT UNDERSTAND. AND YOU SHOULD.”
Now we have a project where we hold a demo every two weeks. By and large, the main customers are already quite used to DS concepts and vocabulary and ask the right questions. They tell me that the quality of the data needs to be verified, otherwise our model will be bad. And that is true.
We already have a number of areas where we compare the analysts’ approach with the data science approach. One example is forecasting fuel consumption. Previously, this was done without machine learning. Now we have a project where five forecast models compete at each gas station. The best of the five is then selected for that station, and this best model produces the forecast for the next month. That is, each of the more than a thousand gas stations gets a forecast from its own best model. So we compare two approaches: the analysts’ fully manual approach and our joint approach based on machine learning. Together, accuracy is higher and the analysts have less routine work.
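The per-station competition can be sketched as a small backtest: each candidate model forecasts the last few known points, and the one with the lowest error wins. The three toy models, the mean-absolute-error metric, and the three-point holdout are assumptions for illustration; the source does not say which five models or which metric the project uses.

```python
from statistics import mean


# Hypothetical candidate models: each maps a history to a one-step forecast.
def naive_last(history):
    return history[-1]


def overall_mean(history):
    return mean(history)


def moving_average_3(history):
    return mean(history[-3:])


CANDIDATES = {
    "naive_last": naive_last,
    "overall_mean": overall_mean,
    "moving_average_3": moving_average_3,
}


def backtest_mae(model, series, holdout):
    """Mean absolute error of one-step forecasts over the last `holdout` points.

    Requires len(series) > holdout so every forecast has some history.
    """
    errors = []
    for t in range(len(series) - holdout, len(series)):
        errors.append(abs(model(series[:t]) - series[t]))
    return mean(errors)


def pick_best_model(series, holdout=3):
    """Return (name of the best model, its forecast for the next period)."""
    scores = {
        name: backtest_mae(model, series, holdout)
        for name, model in CANDIDATES.items()
    }
    best = min(scores, key=scores.get)
    return best, CANDIDATES[best](series)


def forecast_all(stations, holdout=3):
    """Pick the best model separately for every station."""
    return {sid: pick_best_model(s, holdout) for sid, s in stations.items()}


stations = {
    "station_a": [10, 20, 30, 40, 50, 60],    # steadily growing demand
    "station_b": [10, 12, 8, 10, 12, 8, 10],  # demand fluctuating around a level
}
print(forecast_all(stations))
```

Choosing the winner on a holdout period rather than on the data the models were fitted to is what keeps the comparison honest.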
The beginner’s mistake in data science: they came to “play around”
The mistake beginners make in data science is that they do not see the ultimate goal. They like to build models, they like to play with data. But they do not see that the ultimate goal is to benefit the business. Because of this, there is a lot of skepticism toward the industry as a whole. But that skepticism must be overcome.
Another mistake is not writing good, flexible code. Small blocks can heavily affect the result, so you should be able to change them, quickly go back to an earlier version, and check something painlessly. In other words, experiments must be reproducible. After all, the difficulty is that you need to test as many hypotheses as possible in a short time. If you spend a month building one basic model, nobody needs it.
Two areas are popular in data science right now: video and text generation. There is a cool application called “Replica”. It is your personal chat bot, like in Black Mirror. My “Replica” asks me every evening how my day went, for example. Alice and Siri are also about text processing. Such areas are now on the rise. The developers of GPT-2, an advanced text-generating neural network, were even afraid to publish its large version on the Internet, because it could be misused, for example, to generate fake news. So they published only a small version, not the most powerful one.
What is the difference between a small version of a model and a large one? Models have layers. To turn text into numbers, you need to capture many nonlinear relationships: a text can have different meanings depending on the context, so you need to consider a lot of options. The more layers a neural network has, the more parameters it has. And the more parameters the network has, the more relationships it can capture. The simplest neural network is just a regression, a linear model. So the more parameters and weights, the larger the model and the more complex the relationships it can reveal. But do not forget about overfitting: a simple model may generalize better.
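The link between layers and parameter count can be made concrete for a plain fully connected network, where each layer has one weight per input-output pair plus one bias per output. This is a simplified sketch; text models such as GPT-2 use a different architecture, and the layer sizes below are made up.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    `layer_sizes` lists the width of every layer, input first, output last.
    """
    return sum(
        (n_in + 1) * n_out  # n_in weights plus 1 bias for each output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# No hidden layer: this is just linear regression on 10 features.
print(count_dense_parameters([10, 1]))
# Adding hidden layers of 32 units each increases the count quickly.
print(count_dense_parameters([10, 32, 1]))
print(count_dense_parameters([10, 32, 32, 1]))
```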
It happens that data science is used where it is not needed. For example, a telecom operator needs to understand whether a client has left or not. You can hire a team of data scientists, they will develop a model, use various factors, and you can get a good result. But you can start with a simple rule: if the client has not used the services for seven days, then most likely he has left.
In terms of effort invested versus results, this can be more profitable than using models. Sometimes basic analytics can do more than ten data scientists. But, of course, you need to run a careful A/B test and compare both options.
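The seven-day rule needs nothing more than a date comparison. A minimal sketch, assuming the only available signal is the date of the client's last activity (the function and parameter names are hypothetical):

```python
from datetime import date, timedelta


def has_probably_churned(last_activity, today, inactive_days=7):
    """Baseline rule: no activity for `inactive_days` days or more means churn."""
    return today - last_activity >= timedelta(days=inactive_days)


today = date(2024, 1, 8)
print(has_probably_churned(date(2024, 1, 1), today))  # silent for 7 days
print(has_probably_churned(date(2024, 1, 5), today))  # active 3 days ago
```

A baseline like this also gives any later model a concrete result to beat in the A/B comparison.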
What is clean and dirty data?
Clean data is a table with no gaps, in which every column has a specific type. You know which data is quantitative (just numbers) and which is categorical. One row in the table is one observation, one object of analysis.
Dirty data is when everything is stored in different formats and you need to put it together. The accountant’s statistics are in one place, there is some PDF, something else needs to be parsed from a website, a colleague has to dig up a forgotten Excel file, and some data comes as pictures. All this needs to be collected, cleaned, and converted into a table. Neural networks only work with numbers. You need to convert everything you want to work with into matrices: make a matrix out of text, make a matrix out of pictures. With pictures, the easiest option is to convert the image to grayscale and get a matrix of pixel intensities. Black will be low, white will be high.
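Turning a picture into a matrix of intensities can be sketched without any imaging library. The input format (rows of RGB tuples with values 0 to 255) is an assumption for the example, and the weights are the common ITU-R BT.601 luma coefficients; a plain average of the three channels would also work.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (R, G, B) pixels into a matrix of 0-255 intensities."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


image = [
    [(0, 0, 0), (255, 255, 255)],  # a black pixel and a white pixel
    [(255, 0, 0), (0, 0, 255)],    # a red pixel and a blue pixel
]
for row in to_grayscale(image):
    print(row)
```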
Words work differently: you create a dictionary and give each word its own index. Then you take the text “I bought a car” and replace its four words with four indexes from the dictionary. You get a vector that represents this text. But these numbers mean nothing, they are just indices. You need to run them through a neural network to get, from these numbers, new vectors that carry meaning. You map everything into a new space where words that are similar in meaning are close together. If you subtract the Man vector from the King vector and add Woman, you get something close to Queen: you have removed the male part from the meaning of the word “king” and put the female part in its place. This is the ideal situation, when the vectors you have assigned accurately reflect meaning.
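Both steps can be shown on a toy scale: first words become arbitrary indices, then each word gets a vector in which arithmetic is meaningful. The two-dimensional vectors below are written by hand purely for illustration (the axes are "royalty" and "maleness"); real embeddings are learned by a network and have hundreds of dimensions. The analogy is used in its usual form, king minus man plus woman.

```python
def build_vocab(texts):
    """Assign each distinct lower-cased word the next free index."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Replace every word with its dictionary index."""
    return [vocab[word] for word in text.lower().split()]


# Hand-made toy vectors: (royalty, maleness). Not learned from data.
TOY_VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to a - b + c."""
    target = tuple(
        x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])
    )

    def squared_distance(word):
        return sum((p - q) ** 2 for p, q in zip(vectors[word], target))

    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=squared_distance)


vocab = build_vocab(["I bought a car"])
print(encode("I bought a car", vocab))
print(analogy("king", "man", "woman", TOY_VECTORS))
```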
We are like detectives
It goes like this: there is a case, you had a hypothesis, you checked it, got a result, built a graph, looked at it, built a model, looked at it, removed factors, added factors, found another data source… A detective’s work is the same: you interviewed witnesses, gathered evidence, pinned it on your board, then took it into account, and something else turned up… There is an element of magic in this.
When the task is interesting and I feel that I have almost caught the solution (the model is good, the metric shows 70% accuracy, the relationship is almost clear, it only remains to tweak the parameters a little), I can really stay in the office for a long time. The most interesting moment is when there is just a little bit left. And sometimes you build a model and nothing works. Most often the problem is not the model but the data. The data must be of high quality, and there must be a lot of it for the machine learning model to be accurate. If there are patterns in the data, the model will always find them.
And there is despair when you work on a problem and it does not work out. You have tried all the models, you have taken all the factors into account. This, of course, is sad. Sometimes there is no relationship at all, and you cannot catch it either by eye or with models. You just have to honestly admit, both to yourself and to the business, that there is no relationship.
The neural network that writes rap
I wrote a neural network that writes rap in the style of my favorite artist, Machine Gun Kelly. I listen to him a lot, so I took all his lyrics and ran them through a generative neural network, and it learned to rap in his style.
I then put together a selection of the best bits and posted it on Instagram.
“Why did Joanne Rowling succeed?”
I read the book “Imperfect Accident”, which tells, for example, how many great writers and singers we have lost because they gave up early, thinking they were bad. In fact, a writer might have tried five times: the first time the editor was in a bad mood and rejected the manuscript, the second time it was delayed in the mail, the third time something else happened. This is not because the person is bad at what they do, but because external forces intervened. Why did Joanne Rowling succeed? Because she kept trying, and Harry Potter was accepted for publication only on the tenth attempt.
All kinds of businessmen on the Internet say: “I’m from a village, I got to Moscow, I got rich. And you will succeed too.” But most likely you will not. Because for him this coincided with that and with something else, he met the right person, and so on. You can influence the likelihood of your success to some extent, but you cannot bring it to 100%.