It is often assumed that a data scientist does most of the work with ready-made library solutions. In reality, typical problems require you to check how well the chosen method fits and, if necessary, adapt it to your conditions. Together with Peter Lukyanchenko, a teacher of higher mathematics for Data Science at OTUS and formerly Team Lead of Analytics at Lamoda, we look at how mathematics helps in real business problems.
The first of three parts of this topic is devoted to regression analysis.
Business challenge: a car-sharing company needs to determine how a set of factors – driving experience, weather, condition of the car and road surface, traffic, city population, etc. – affects the probability of getting into an accident.
For a data scientist, the task looks like this: derive an equation that describes the dependence of one set of observations on a set of other parameters.
Typical solution problem: the models that libraries offer assume a normally distributed error by default. This assumption is rather rough and rarely matches the dependence actually obtained. Moreover, including an inaccurate error in the equation means that with each new set of parameters the prediction becomes less and less accurate.
How mathematics saves
Let’s start by describing the relationship for a single factor – driving experience. The classic paired linear regression model has two coefficients. The first coefficient, α (alpha), is the unconditional term: the overall probability of an accident regardless of any parameters. The second coefficient, β (beta), determines how sensitive the probability of an accident is to the driving-experience factor; β is also called the slope of the dependence equation. And since there will always be factors that we forgot or were unable to take into account, we must add an error term Uᵢ to the equation.
We get the equation: yᵢ = α + βxᵢ + Uᵢ.
The analyst’s task, in fact, is to find the coefficients for which the errors Uᵢ are smallest.
There are quite a few ways to measure the error. The simplest and most popular is the absolute error: the deviation of the predicted value from the observed one, with the total error taken as the sum of the moduli of the deviations. The problem with the modulus is that this function is not differentiable everywhere. Mathematicians therefore took a smooth alternative to generalize the error and began summing the squares of the deviations instead. Since this function is differentiable, we can optimize it as a function of two variables: taking the derivatives with respect to α and β, we find the extremum points and classify them via the Hessian. This yields the two coefficients α′ and β′ of the least squares method. The Gauss-Markov theorem states that, under its assumptions, these least squares estimates are the best linear unbiased ones: no other linear unbiased method can beat them.
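The least squares estimates described above can be written in closed form for paired regression: β′ = cov(x, y)/var(x) and α′ = mean(y) − β′·mean(x). A minimal sketch, using hypothetical experience-vs-accident data purely for illustration:

```python
import numpy as np

# Hypothetical data: driving experience in years vs. observed
# accident probability for groups of drivers (made up for illustration).
x = np.array([1.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([0.30, 0.26, 0.21, 0.15, 0.12, 0.08])

# Closed-form least squares estimates for y_i = alpha + beta * x_i + U_i:
#   beta' = cov(x, y) / var(x),   alpha' = mean(y) - beta' * mean(x)
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

print(alpha, beta)  # beta is negative: more experience, fewer accidents
```

The same estimates come out of `np.polyfit(x, y, 1)`; writing the formulas out simply makes the covariance/variance structure of the solution visible.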
Scale the process
Now we come to the fact that the probability of getting into an accident is affected by many other parameters that can be expressed quantitatively. It turns out that Y depends on n variables X. To avoid repeating the same calculation of the coefficients α and β for each parameter, we move to the matrix form of the dependence equation. Differentiating carefully, we obtain a vector of coefficients, and in this way the paired regression equation is generalized to the multivariate case.
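In matrix form the least squares solution is coef′ = (XᵀX)⁻¹Xᵀy. A sketch with a synthetic design matrix (the factors and coefficient values below are invented for illustration; `np.linalg.lstsq` solves the same least squares problem as the normal equations, but more stably than an explicit matrix inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3  # 100 observations, 3 quantitative factors

# Design matrix: an intercept column plus hypothetical factor columns
# (experience, traffic index, road-condition score, ...).
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
true_coef = np.array([0.2, -0.05, 0.03, 0.01])     # assumed for the demo
y = X @ true_coef + rng.normal(scale=0.01, size=n)  # noisy observations

# Least squares in matrix form: solves min ||X @ coef - y||^2,
# i.e. the normal equations (X^T X) coef = X^T y.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # close to true_coef
```

One fit recovers all the coefficients at once, which is exactly the point of moving from repeated paired regressions to the matrix equation.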
Error is the key
Another important point in solving regression problems is the choice of the error. Analysts often choose a normally distributed error. In fact, this is a somewhat outdated habit: it still works well in theoretical settings, but it is too primitive for algorithms that keep growing more complex and striving toward the truth. For a competent specialist, the error is itself an object of study that helps to better understand the essence of the regression. Having built a regression, the analyst looks at the errors it generates and explores the entire cloud of residuals. For example, if the deviations grow, this is a sign of heteroskedasticity, i.e. we forgot to take some of the X variables into account. If the errors follow some pattern and show autocorrelation, this is a sign that we chose the wrong model. Ideally, you should strive to keep the errors’ deviation from zero as small as possible.
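A minimal residual diagnostic along these lines might look as follows (synthetic data, invented for the demo). The spread comparison is a crude heteroskedasticity check, and the Durbin-Watson statistic, which is close to 2 when residuals are uncorrelated, serves as the autocorrelation check:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
y = 0.3 - 0.02 * x + rng.normal(scale=0.02, size=x.size)  # synthetic data

# Fit a paired regression, then study the residuals, not just the coefficients.
beta, alpha = np.polyfit(x, y, 1)  # polyfit returns [slope, intercept]
residuals = y - (alpha + beta * x)

# Heteroskedasticity check: does the residual spread grow with x?
spread_low = residuals[: x.size // 2].std()
spread_high = residuals[x.size // 2 :].std()

# Autocorrelation check: Durbin-Watson statistic (about 2 means
# no first-order autocorrelation in the residuals).
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(spread_low, spread_high, dw)
```

On real data, a growing `spread_high` relative to `spread_low`, or a Durbin-Watson value far from 2, is exactly the signal described above that the model or the error assumption needs revisiting.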
So, what knowledge of higher mathematics did we need to build a complex dependence of the probability of an accident on a set of factors:
- Mathematical analysis (calculus), to optimize the regression function
- Linear algebra, i.e. the definition, properties, and differentiation of matrices, for the transition from paired regression to multivariate regression
- Analysis and selection of the error distribution. For example, a specialist may take a generalized normal distribution, a beta distribution, or Student’s t-distribution. This is especially necessary when there is no good sample and it cannot be improved, and also when the assumptions of the Gauss-Markov theorem are violated, so the regression equation has to be constructed differently or other methods of classification and probability estimation have to be used.
The ability to work with the mathematical apparatus is an important advantage for a data scientist: it lets him verify results and solve atypical problems. In the next article we will talk about the mathematics behind recommender services. In the meantime, we invite you to the mathematics for Data Science course, which starts this week.
There is still time to sign up and pass the entrance test.