How we divided a test bench between departments
Industrial mathematical programming is promoted in academia on standardized cases, but the details of real implementations are rarely disclosed, and when they are, it is often years later.
In this article I share my experience of developing an optimization solution based on mathematical programming and embedding it in company processes. The material is expanded with research elements and a small local benchmark.
Business task
As input, we have a large manufacturer of network equipment, for which one stage of product quality control is testing user, load, and functional scenarios on its own test benches.
A test bench is a physical network made up of equipment, testers, and servers connected to each other by wires and cables of various types (twisted pair, optical fiber, coaxial cable, and others). The task does not involve physically changing the bench, so we will treat it as a network with a fixed set of attributes and properties.
Various divisions of the company work on improving particular product characteristics or on testing use cases. Each such department develops its own test scenarios and runs them on the test bench.
What does a test consist of? A test consists of a script and requirements for the equipment on which it can be run. The contents of a test script are a black box, and this understanding is sufficient for solving the problem. The equipment requirements, in turn, break down into a topology and special properties of the equipment/connections. Based on these data, a test's requirements are mapped to the sections of the test bench where it can be executed.
Mapping the task onto the data
Approaching the problem with the available data was quite difficult. First of all, there was no understanding of which sections of the test bench a given test case could run on. From the description of a test's hardware requirements it was not easy to determine where on the bench it could be executed: each test is a separate network-constraint-satisfaction task.
The second factor we were unable to overcome was the expected duration of a test. This information would help partition the test bench according to the expected load. But the contents of the test scripts were kept hidden from us, and no history of (or estimates for) test script execution times was collected or analyzed.
Now about human greed. The share of each department's test coverage was determined individually, without general agreement or priorities. Obviously, everyone wants to pull the blanket over themselves. Here we were accommodated: a judge who sets priorities was added to the process. In addition to the priorities, the size of each department's test set was taken into account: the more tests a department has, the more pieces of equipment it is entitled to. Let's call this the fair division condition.
It is not always possible to divide the bench according to the specified test coverage requirements. This raises a question about the robustness of the model: should it return an error, or produce a solution with violations (soft constraints)? The second option is clearly the more practical from the user's point of view, especially since departments have priority markings.
Size does NOT matter! Running a test on one piece of equipment is equivalent to running a test on 5 pieces. This simplifying circumstance steers us away from a multi-criteria problem. The assumption was made to keep the article simple; in reality things are different.
Our interpretation of the conditions and our capabilities:
Proposal: pre-calculate all (or a subset of) the options for assigning tests to test bench sections, i.e., associate each test case with a set of valid test bench subnets.
Assumption: all tests have the same execution time (we started from what we had), and we ignore the number of pieces of equipment a test scenario requires. The load on the test bench is then determined by the number of tests performed.
Constraint: the test bench must be divided so that each department can execute a given percentage of its test set on its "piece" of the bench (the minimum coverage condition).
Mathematical model
Choosing a methodology: we started from what our task resembles and how similar problems are solved. The set partitioning problem (SPP), partitioning a set of elements into disjoint subsets, fits our formulation perfectly. SPP-type problems are mainly solved by constraint programming, integer programming, or heuristics.
Integer programming and constraint programming provide a general mechanism for searching a well-defined feasible region, while a heuristic approach follows a predetermined path.
Our choice fell on constraint programming: a ready-made solver takes part of the work off our hands, and less time is needed to develop and test solution-search algorithms. This lets us deliver results to the business faster; we acted on the 20/80 principle.
Let me move directly to the formulation as an integer linear program.
Indexes and sets
– $D$ – set of company departments between which the test bench must be divided;
– $T_d$ – set of test scenarios of department $d \in D$;
– $S_{d,t}$ – set of indexes of test bench subnets on which test $t \in T_d$ can be performed;
– $E$ – set of equipment making up the test bench;
– $S_{d,e}$ – subset of (test, subnet) pairs $(t, s)$ of department $d$ whose subnet uses equipment $e$.
Constants
– $N_d$ – number of department $d$'s test scenarios that must be covered;
– $w_d$ – weighted priority of covering the test scenarios of department $d$;
– $m_d$ – minimum number of pieces of equipment that must be allocated to department $d$.
Decision variables
– $x_{d,e}$ – binary variable; takes the value 1 if equipment $e$ is assigned to department $d$, 0 otherwise;
– $y_{d,t}$ – binary variable; takes the value 1 if test $t$ of department $d$ can be performed on the bench section assigned to the department, 0 otherwise;
– $z_{d,t,s}$ – binary variable; takes the value 1 if all equipment of subnet $s$ of test $t$ is assigned to department $d$, 0 otherwise;
– $p_d$ – integer variable; the volume of violation of the test coverage level for department $d$, $0 \le p_d \le N_d$.
Constraints
A department's test scenario can be executed on a subnet only if all equipment of that subnet is assigned to the department:

$$\sum_{(t,s) \in S_{d,e}} z_{d,t,s} \le |S_{d,e}| \, x_{d,e} \quad \forall d \in D,\; \forall e \in E \qquad (1)$$

Each piece of equipment can be assigned to at most one department:

$$\sum_{d \in D} x_{d,e} \le 1 \quad \forall e \in E \qquad (2)$$

A department's test scenario can be executed if at least one of its subnets uses equipment fully assigned to this department:

$$\sum_{s \in S_{d,t}} z_{d,t,s} \ge y_{d,t} \quad \forall d \in D,\; \forall t \in T_d \qquad (3)$$

Coverage condition for the test volume of each department; the constraint is soft, with the slack variable $p_d$ allowing violation:

$$\sum_{t \in T_d} y_{d,t} + p_d \ge N_d \quad \forall d \in D \qquad (4)$$

Fair division condition according to the test volume:

$$\sum_{e \in E} x_{d,e} \ge m_d \quad \forall d \in D \qquad (5)$$
Objective function
Minimize the weighted volume of coverage violations:

$$\min \sum_{d \in D} w_d \, p_d$$
Python implementation
Data
Privacy obligations do not allow a deep dive into the task. Therefore, I will use AI (a random generator) to complete the picture with the missing data.
dep_id | tc_id | sn_id | device_id |
---|---|---|---|
1 | 0 | 0 | d_3 |
1 | 0 | 0 | d_6 |
1 | 0 | 0 | d_26 |
1 | 0 | 0 | d_15 |
1 | 0 | 0 | d_19 |
… | … | … | … |
The input dataset consists of one table with four fields:
dep_id – department/division identifier;
tc_id – test script identifier;
sn_id – subnet identifier where the test script can be executed;
device_id – test bench equipment identifier.
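To make the structure concrete: one candidate placement of a test is one (dep_id, tc_id, sn_id) group, and the rows of the group enumerate the devices that subnet requires. A minimal sketch (df_sns is loaded in the Python code section below):

```python
# Sketch: list the devices required by each candidate subnet of each test.
devices_per_subnet = df_sns.groupby(["dep_id", "tc_id", "sn_id"])["device_id"].apply(list)
print(devices_per_subnet.head())
```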
The data table is minimalistic, but it contains enough information to decide how to split the test bench. Below is a summary of the input data: the number of test scenarios per department, the total number of subnets to run them on, and the number of pieces of test bench equipment involved in the subnets (the bench consists of 45 pieces of equipment). The last column, device_min, is the generated minimum equipment count discussed below.
dep_id | tc_cnt | sn_cnt | device_unique | device_min |
---|---|---|---|---|
1 | 44 | 6803 | 45 | 2 |
2 | 89 | 14066 | 45 | 5 |
3 | 147 | 21202 | 45 | 8 |
4 | 184 | 25021 | 45 | 10 |
5 | 241 | 33406 | 45 | 13 |
The model includes the parameter $N_d$, the number of test scenarios that must be covered for each department. Let's set it to 60% of the department's total number of tests: $N_d = \mathrm{round}(0.6 \cdot |T_d|)$. For example, department 1 with 44 tests gets $N_1 = 26$.
Different departments have different numbers of test scripts to run, so the load on the subnets differs. The load-dependent distribution is provided by constraint (5), where the parameter $m_d$ sets the minimum amount of equipment for a department. The values of $m_d$ also need to be generated.
Let's create a budget for the sum $\sum_d m_d$: subtract 7 units from the total equipment count (a shift introduced for extra variability of the task). We distribute the resulting 45 − 7 = 38 units of equipment in proportion to each department's number of tests, with an additional lower bound: each department must receive at least 2 units. For example, department 5 gets $m_5 = \mathrm{round}(38 \cdot 241/705) = 13$ units. The values are given in the table above (column device_min).
We set all weighting coefficients to $w_d = 1$, i.e., there is no priority between departments. Under the load constraints and coverage maximization, priorities provide finer tuning of the objective, which we can neglect in this experiment.
Python code
The syntax for describing CP and MIP models in OR-Tools differs, but the model definition scheme is similar. Below is the code for solving the problem with the cp-sat solver.
Parameter is_agg
I will go into more detail in the Numerical experiment section. For the case is_agg=True, the formulation fully corresponds to the mathematical description above.
Reading and processing input data.
# Load and process the input data
import pandas as pd
import numpy as np

df_sns = pd.read_csv("https://raw.githubusercontent.com/Lozkins/ORP/master/data/input/01_subnetworks.csv", sep=";", encoding="cp1251")

coverage_rate = 0.60  # share of each department's tests that must be covered
device_cnt_shift = 7  # shift of the total equipment count used to generate the per-department minimum

device_cnt = df_sns["device_id"].nunique()

# Generate the fair-division values (minimum equipment count per department)
dct_tc_parts = (
    df_sns
    .pipe(lambda _df: _df.groupby(["dep_id"], as_index=False)["tc_id"].nunique())
    .assign(tc_part=lambda _df: _df["tc_id"] / _df["tc_id"].sum())
    .assign(device_min=lambda _df: round(_df["tc_part"] * (device_cnt - device_cnt_shift)).astype(int))
    .assign(device_min=lambda _df: np.where(_df["device_min"] < 2, 2, _df["device_min"]))
    .set_index("dep_id")
    .to_dict()["device_min"]
)
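As a cross-check, the summary table from the Data section can be reproduced from the loaded data. A minimal sketch (my addition, not part of the original pipeline; it assumes sn_id uniquely identifies a candidate subnet):

```python
# Sketch: reproduce the per-department summary (tc_cnt, sn_cnt, device_unique)
# together with the generated fair-division lower bound device_min.
df_summary = (
    df_sns.groupby("dep_id")
    .agg(tc_cnt=("tc_id", "nunique"),
         sn_cnt=("sn_id", "nunique"),
         device_unique=("device_id", "nunique"))
)
df_summary["device_min"] = df_summary.index.map(dct_tc_parts)
print(df_summary)
```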
Solving the problem with the cp-sat solver from OR-Tools.
import pandas as pd
import numpy as np
from ortools.sat.python import cp_model


def cp_sat(time_limit, is_agg=True):
    """
    Solve the bench-partitioning problem with the cp-sat solver.
    """
    m = cp_model.CpModel()
    # Initialize the variables
    # x: equipment-to-department assignment variables
    df_x = df_sns[["dep_id", "device_id"]].drop_duplicates()
    df_x["var_id"] = np.arange(df_x.shape[0])
    df_x["var"] = df_x["var_id"].apply(lambda x: m.NewBoolVar(f"x_{x}"))
    # y: test executability variables
    df_y = df_sns[["dep_id", "tc_id"]].drop_duplicates()
    df_y["var_id"] = np.arange(df_y.shape[0])
    df_y["var"] = df_y["var_id"].apply(lambda y: m.NewBoolVar(f"y_{y}"))
    # z: test executability on a specific subnet
    df_z = df_sns[["dep_id", "tc_id", "sn_id"]].drop_duplicates()
    df_z["var_id"] = np.arange(df_z.shape[0])
    df_z["var"] = df_z["var_id"].apply(lambda z: m.NewBoolVar(f"z_{z}"))
    # p: slack variables, the coverage violation volume per department
    df_p = df_sns.groupby("dep_id", as_index=False)["tc_id"].nunique()
    df_p = df_p.rename({"tc_id": "w"}, axis=1)  # weights for the objective
    df_p["var_id"] = np.arange(df_p.shape[0])
    df_p["var"] = df_p.apply(lambda p: m.NewIntVar(0, round(p.w * coverage_rate), f"p_{p.var_id}"), axis=1)
    # Add the constraints to the model
    # Constraint (1): a department's test scenario can run on a subnet only if
    # all equipment of the subnet is assigned to the department
    dct_x = df_x.set_index(["dep_id", "device_id"]).to_dict()["var"]
    # Attach the equipment to the z-variable table
    df_tmp = df_z.merge(df_sns[["dep_id", "tc_id", "sn_id", "device_id"]], how="left",
                        on=["dep_id", "tc_id", "sn_id"])
    df_cnstr_1 = df_tmp.groupby(["dep_id", "device_id"], as_index=False).agg({"var": "sum", "sn_id": "count"})
    if is_agg:
        # Aggregated form (agg): one constraint per (department, device)
        for cnstr in df_cnstr_1.itertuples():
            key = cnstr.dep_id, cnstr.device_id
            m.Add(cnstr.var <= cnstr.sn_id * dct_x.get(key, 0))
    else:
        # Unpacked form (unzip): one constraint per (subnet, device) pair
        for cnstr in df_tmp.itertuples():
            key = cnstr.dep_id, cnstr.device_id
            m.Add(cnstr.var <= dct_x.get(key, 0))
    # Constraint (2): each piece of equipment is assigned to at most one department
    df_cnstr_2 = df_x.groupby("device_id").agg({"var": "sum"})
    for cnstr in df_cnstr_2.itertuples():
        m.Add(cnstr.var <= 1)
    # Constraint (3): a department's test scenario is executable if at least one
    # of its subnets uses equipment fully assigned to this department
    dct_y = df_y.set_index(["dep_id", "tc_id"]).to_dict()["var"]
    df_cnstr_3 = df_z.groupby(["dep_id", "tc_id"], as_index=False).agg({"var": "sum"})
    for cnstr in df_cnstr_3.itertuples():
        key = cnstr.dep_id, cnstr.tc_id
        m.Add(cnstr.var >= dct_y.get(key, 0))
    # Constraint (4): coverage condition for each department (soft, with slack)
    dct_p = df_p.set_index("dep_id").to_dict()["var"]
    df_cnstr_4 = df_y.groupby("dep_id", as_index=False).agg({"var": "sum", "tc_id": "count"})
    for cnstr in df_cnstr_4.itertuples():
        m.Add(cnstr.var + dct_p.get(cnstr.dep_id, 0) >= round(cnstr.tc_id * coverage_rate))
    # Constraint (5): fair division according to the test volume
    df_cnstr_5 = df_x.groupby("dep_id", as_index=False).agg({"var": "sum"})
    for cnstr in df_cnstr_5.itertuples():
        m.Add(cnstr.var >= dct_tc_parts.get(cnstr.dep_id, 0))
    # Objective: minimize the weighted volume of coverage violations
    m.Minimize(sum(df_p["var"]))
    # Initialize the solver
    solver = cp_model.CpSolver()
    # Solver settings
    solver.parameters.max_time_in_seconds = time_limit
    solver.parameters.log_search_progress = True
    # Solve the problem
    status = solver.Solve(m)
    # Check the status
    print("Optimal solution found:", status == cp_model.OPTIMAL)
    print("Objective value = %f" % solver.ObjectiveValue())
    # Extract the solution
    df_x["sol"] = df_x["var"].apply(solver.Value)
    df_y["sol"] = df_y["var"].apply(solver.Value)
    df_z["sol"] = df_z["var"].apply(solver.Value)
    df_p["sol"] = df_p["var"].apply(solver.Value)
    return dict(df_x=df_x, df_y=df_y, df_z=df_z, df_p=df_p)
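A usage sketch (my addition): solve with the one-hour limit used in the experiment and inspect the resulting split:

```python
# Usage sketch: run the model, then inspect the equipment split and the
# coverage violations per department.
solution = cp_sat(time_limit=3600, is_agg=True)
df_assigned = solution["df_x"].query("sol == 1")           # equipment-to-department binding
print(df_assigned.groupby("dep_id")["device_id"].count())  # equipment count per department
print(solution["df_p"][["dep_id", "sol"]])                 # uncovered tests per department
```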
Numerical experiment
The performance of open-source solvers is lower than that of commercial ones (when comparing solvers from the same mathematical programming paradigm). The problem size is close to real, so I set the time limit for the search for an optimal solution to 1 hour. We consider two variants of the model formulation: agg (the aggregated constraint (1)) and unzip (the unpacked one, described below).
The set of solvers: SCIP (MIP), a commercial solver (MIP), and cp-sat (CP), i.e., two open-source solvers and one commercial one.
Dimension of the original problem: 985 constraints, 19702 binary variables, and 5 integer variables (the case is_agg=True).
Alternative model
Building a model is a creative process, and the mathematical formulation of the problem affects the speed of finding a solution. As part of the experiment, I will demonstrate the influence of this factor on solver performance.
Consider constraints of type (1): a department's test scenario can be executed on a subnet only if all equipment of the subnet is assigned to the department:

$$\sum_{(t,s) \in S_{d,e}} z_{d,t,s} \le |S_{d,e}| \, x_{d,e} \quad \forall d \in D,\; \forall e \in E \qquad (1)$$

The left-hand side is the sum over the department's candidate subnets that use equipment $e$; if even one piece of equipment is not under the department's jurisdiction, this sum must equal 0. I propose moving to the following set of constraints:

$$z_{d,t,s} \le x_{d,e} \quad \forall d \in D,\; \forall t \in T_d,\; \forall s \in S_{d,t},\; \forall e \in E_s \qquad (1')$$

where $E_s$ is the set of equipment in subnet $s$.
The logic of the constraint has not changed, but the number of constraints grows: for each candidate subnet, a set of constraints is created whose size equals the number of pieces of equipment in the test scenario's topology.
Replacing constraints (1) with the new ones, we obtain a problem of the following dimension: 101258 constraints, 19702 binary variables, and 5 integer variables (the case is_agg=False). The number of constraints in the model has grown by a factor of roughly 103. Whether such a transition is worthwhile will be determined by experiment.
Performance of models and solvers
I considered three solvers, cp (cp-sat), scip, and com (the commercial one); each solved two problems: agg (the original constraint (1)) and unzip (the unpacked constraint (1')). The graph shows the objective values (y-axis) taken from the solver logs over time (x-axis). All solvers ran with default settings. The objective value is the size of the coverage violation, expressed as the number of test scenarios missing from the required coverage.
Let's look at the outsider of the experiment: scip. Its first solutions were found after 1959 s and 3261 s for the agg and unzip models, respectively. The low performance can be attributed to the solver being single-threaded at the MIP solving stage.
In the SCIP solver logs when solving the agg problem, you can find the following entry:
10 deleted vars, 235 deleted constraints, 100498 added constraints, 5 tightened bounds, 0 added holes, 0 changed sides, 4 changed coefficients
The scip solver added 100498 constraints to the model at the presolve stage, which suggests that the solver itself unpacked the constraints of type (1). If you are familiar with SCIP's internals, please comment on this guess.
The commercial solver immediately found a feasible solution to the agg problem, and found one for the unzip problem after 34 seconds. However, it reached a solution with 265 uncovered test scenarios faster in the unzip case (3483 s) than in the agg case (3600 s). The agg problem converges better in the first half of the allotted time, while unzip converges a little faster in the second half.
Constraint programming turned out to be the most suitable approach, even taking into account that cp-sat is open source. Besides the best objective value among the considered solvers (262 uncovered scenarios), in the unzip case the solution was found in 479 seconds. Congratulations to cp-sat on winning the local benchmark!
The number of runs is small, so the conclusion is not very objective. Still, the formulation with the unpacked constraint (1), unzip, shows better convergence and is more "convenient" for the solvers (at least on this problem).
I ran the experiment twice (2 × 6 = 12 calculations); the behavior of the solvers is similar. Below is the same kind of graph for the second run.
The commercial and scip solvers behave quite stably, while cp-sat flipped and showed the opposite dynamics: in the second run its best solution, 264 uncovered scenarios, came from the agg variant. Instability is the price of free availability.
Additionally, I would note the following: none of the solvers managed to improve the lower bound of the solution; it stayed at 0.
Result
Let's analyze the solutions of the first run. A provably optimal solution was not obtained within the established time limits. This was not a stumbling block for industrial use: the customer is satisfied and uses the model.
Experiments with a 12-hour time limit also did not yield a provably optimal solution, nor did they improve the objective value. The problem appears to be hard for the algorithms that estimate the lower bound.
The table shows the number of uncovered test scenarios per department. The cover_req column is the required number of covered tests ($N_d$). The first department suffered the most: not a single one of its tests was covered. The best coverage went to the largest department (dep_id = 5). Tuning the weights $w_d$ makes it possible to shift priority away from the "largest" department; smarter settings would also increase the variability of the division (make it more democratic).
dep_id | cover_req | cp_unzip | com_unzip | com_agg | cp_agg | scip_agg | scip_unzip |
---|---|---|---|---|---|---|---|
1 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |
2 | 53 | 51 | 50 | 49 | 51 | 50 | 53 |
3 | 88 | 73 | 70 | 73 | 74 | 83 | 88 |
4 | 110 | 82 | 91 | 86 | 86 | 101 | 100 |
5 | 145 | 30 | 28 | 31 | 31 | 63 | 63 |
Total | 422 | 262 | 265 | 265 | 268 | 323 | 330 |
The distribution of equipment between departments is the same in all solutions: {1: 2, 2: 5, 3: 8, 4: 10, 5: 20} (the lower bounds $m_d$ were {1: 2, 2: 5, 3: 8, 4: 10, 5: 13}). That is, all the free equipment (the shift of 7 units) was assigned to the "largest" department.
Conclusion
We developed a solution for dividing a test bench into non-overlapping subnets. The tool is not limited by the number of partitions and produces an acceptable solution within an hour. It has a couple of tunable parameters ($N_d$ and $m_d$, plus the weights $w_d$) for configuring the model, and it makes it possible to test various hypotheses at a lower cost than manual planning.
Time spent on prototype development: ~60 man-hours. Development was carried out in time "free" from the main tasks. The customer's investment paid off in full.
The article covered the basic formulation of the problem and its solution. The version put into production includes additional heuristics to speed up the search for a solution.
What we tried:
CP solvers support conditional constraints (OnlyEnforceIf in cp-sat) that can be applied to the model from this article; a minimal sketch follows. In my experiments they brought no significant speedup.
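A minimal sketch of this idea, reusing m, df_z, and dct_y from inside the cp_sat function above (an illustration of the API, not the formulation we shipped):

```python
# Sketch: a conditional form of constraint (3) via OnlyEnforceIf.
# If test (dep_id, tc_id) is marked executable (y = 1), then at least one
# of its subnets must be fully assigned to the department.
df_cnstr_3 = df_z.groupby(["dep_id", "tc_id"], as_index=False).agg({"var": "sum"})
for cnstr in df_cnstr_3.itertuples():
    y = dct_y[(cnstr.dep_id, cnstr.tc_id)]
    m.Add(cnstr.var >= 1).OnlyEnforceIf(y)
```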
Instead of soft coverage constraints, hard ones can be used; a sketch follows. In that case we lose the robustness of the model, because it can turn out infeasible, and in practical use we would have to log the violation points, adding work for the user. Moreover, with hard constraints the search becomes more difficult, and a heuristic for quickly finding an acceptable solution becomes necessary.
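A sketch of the hard variant of the coverage condition (4), reusing df_cnstr_4 and m from cp_sat (my illustration; with it the model may come back infeasible):

```python
# Sketch: hard coverage constraint, i.e. condition (4) without the slack p_d.
for cnstr in df_cnstr_4.itertuples():
    m.Add(cnstr.var >= round(cnstr.tc_id * coverage_rate))
# solver.Solve(m) may now return cp_model.INFEASIBLE, so violations would
# have to be detected and reported separately.
```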
The problem can also be solved by column generation. The number of feasible bench partitions (columns) is quite large, so an additional algorithm for smart column construction is required to reduce the options (building an ensemble of models).
Ways to speed up the search for a solution. You can tune solvers to the task, which speeds up the search (there are, for example, published auto-tuners for MIP solvers); a basic tuning sketch for this option is shown below. An alternative to diving into solver internals is to vary or decompose the model based on expert assumptions. We took the second path and succeeded in it.
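For illustration, a hedged sketch of basic cp-sat tuning (the parameter names are from the OR-Tools cp-sat API; the values are assumptions, not the settings we used):

```python
from ortools.sat.python import cp_model

solver = cp_model.CpSolver()
solver.parameters.max_time_in_seconds = 3600  # the same 1-hour budget as in the experiment
solver.parameters.num_search_workers = 8      # size of the parallel search portfolio
solver.parameters.log_search_progress = True  # keep logs to plot convergence
# status = solver.Solve(m)  # m is the CpModel built by cp_sat above
```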