How a package with packages helped an analyst solve a business problem, or keep calm and import statsmodels

Hi all!
My name is Sabina, I am the leader of the data science team at VkusVill. We help businesses make decisions based on data. Today I will tell you about one such case. The article will be useful to analysts who want to stop worrying and start using linear regression from the Python stasmodels library.

Business context
The B2B department is working on a new assembly and delivery service. An important change is that we will transport goods not in bags, but in reusable containers. They are presented in several volume options, and you need to understand which ones to order and in what ratio.
In addition, it is necessary to take into account the assembly rules: products of the “Cooking”, “FROV” and “Frozen” categories should be folded separately from all others.

Analytical task
It is necessary to determine what volumes couriers most often carry, round up to the volumes of the proposed containers and calculate their ratio.

Data preparation
The database contains a directory of goods, but we do not have volume data for most of them. Of course, no one is going to interfere with pickers in the dark store by measuring the average volume of Mango Egypt, or look for it in the density table in order to move from weight to volume.
As you know, the most popular product among retail buyers is a package. And B2B clients agree with them on this. And the package just has volume.

Then you need to prepare a table in which the row is one purchase, and the fields are:

number of packages
number of autonomous categories (Cooking, FROV, Freezing, Rest)
and for each autonomous category there are two fields: the number of units of piece goods and the total weight of goods by weight.

All that remains is to calculate how much volume a kilogram and piece of each autonomous category takes up, and the solution is in our pocket (or in the bag).

If we had many orders in which there was only one autonomous category and only one unit (kg or pcs), then everything would be solved with a SQL query. But in this case this is not the case, so we use linear regression.

Linear regression
We formulate a theoretical model on what the number of packets depends on:

$Number of packages \sim \\ number of categories + FROV_{weight} + FROV_{pcs} + Cooking_{weight} + Cooking_{pcs} \\+ Freezing_{weight} + Frozen_{pcs} + Rest_{weight} + Rest_{ pcs}$

And using the Python library statsmodels we assemble a mathematical model:

import statsmodels.api as sm

reg_cols = ['колво_пакетов',
            'cnt_cat',
            'ФРОВ_вес',
            'ФРОВ_шт',
            'Кулинария_вес',
            'Кулинария_шт', 
            'Заморозка_вес', 
            'Заморозка_шт',
            'Остальное_вес', 
            'Остальное_шт'
           ]

y = df['колво_пакетов'] # зависимая переменная
x = df[reg_cols].drop('колво_пакетов', axis=1) # независимые переменные

x = sm.add_constant(x) # не забываем про интерсепт

model = sm.OLS(y, x).fit() # обучение модели
print(model.summary())

After executing the code we see a summary:

The feature Cooking_weight p_value > 0.05, which means that the coefficient is not significant” title=”The feature Cooking_weight has p_value > 0.05, which means that the coefficient is not significant.” width=”965″ height=”829″ src=”https://habrastorage.org/getpro/habr/upload_files/d47/870/cef/d47870cef715ba10d08bb3e462f64dbb.png”/></p><p><figcaption>The feature Cooking_weight has p_value > 0.05, which means that the coefficient is not significant.</figcaption></p></figure><p>We exclude the insignificant variable and run it again.</p><p>What we pay attention to:</p><figure class=

– what proportion of the variance of the dependent variable was explained by the mathematical model. Can take values [0, 1]. An important indicator, but you should not focus only on it. And there is no hard trash hold, after which value the model is good. Even in published scientific articles it can be 0.3.
P-value the entire model. Must be below the significance level (usually 0.05)
Odds with independent variables, the models must be tested for adequacy. For example, the more categories (cnt_cat), the more packages – the relationship is direct, the coefficient is positive.
Module from t-statistics shows how much a specific variable contributes to the model.
P-value of the variable below the significance level (usually the significance level is taken to be 0.05)
Durbin-Watson statistics reflects the presence of autocorrelation. If there is no autocorrelation, equal to 2.

Now this is what the model was built for: we restore the number of packages per category by substituting the coefficients from the model.

df['Пакетов_ФРОВ'] = (0.4136 / df['cnt_cat'] + 0.5183) * df['is_ФРОВ'] \
                    + 0.0933 * df['ФРОВ_вес'] + 0.0595 * df['ФРОВ_шт']

df['Пакетов_Кулинария'] = (0.4136 / df['cnt_cat'] + 0.5183) * df['is_Кулинария'] \
                          + 0.0374 * df['Кулинария_шт']

df['Пакетов_Заморозка'] = (0.4136 / df['cnt_cat'] + 0.5183) * df['is_Заморозка'] \
                          + 0.1739 * df['Заморозка_вес'] + 0.0294 * df['Заморозка_шт']

df['Пакетов_Остальное'] = (0.4136 / df['cnt_cat'] + 0.5183) * df['is_Остальное'] \
                        + 0.1259 * df['Остальное_вес'] + 0.0307 * df['Остальное_шт']

Using the example of the first line, Packages_FROV:

Intercept = 0.4136, but if you add it when calculating the packages of each category, you get more than intended. That is, you need to give part of the intercept to each category (we divide it by the number of categories)
0.5183 – coefficient for the number of categories
The intercept and the number of categories always exist, therefore, in order not to issue an extra package when a category is not in the order, we multiply by the Boolean variable of its presence.
0.0933 and 0.0595 are the coefficients for FROV_weight and FROV_pcs, respectively.

Conclusion
Next, we round up the package categories up to the volume of reusable containers and calculate the ratio in which they were found. It turned out that almost 100% of orders can be delivered in the first three volume options: 14, 32 and 46 liters. This is what was recommended to the customer.

Additionally
The new service has made delivery more convenient for the client and the courier. But we have made the job of the collector more difficult: before, he would not hesitate to take a standard package, but now he has to figure out which container to take under the autonomous category based on the contents of the basket.
Our linear regression will also help the collector if it is built into the interface: the resulting formula can, based on the items in the cart, suggest which containers to take for each autonomous category.