Statistical analysis of DBMS load-testing results in a cloud infrastructure
A sample is considered suitable for further analysis if its median and mode are the same.
Median of a random variable: here, the number that divides the distribution in half. In other words, the median of a random variable is the number such that the probability of the variable falling to its right equals the probability of it falling to its left (both are equal to 1/2). Equivalently, the median is the 50th percentile, the 0.5-quantile, or the second quartile of a sample or distribution.
Median (statistics) – Wikipedia (wikipedia.org)
Mode: one or more values in a set of observations that occur most frequently.
Mode (statistics) – Wikipedia (wikipedia.org)
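The suitability condition above (median equals mode) can be sketched in a few lines. This is an illustration in Python rather than the PostgreSQL implementation used in the article; the function name is_suitable and the sample values are hypothetical. Because a sample may have several modes, the sketch assumes the sample is suitable when the median coincides with any one of them.

```python
from statistics import median, multimode


def is_suitable(sample):
    """Return True when the sample's median coincides with one of its modes.

    Assumption: with several equally frequent values, matching any of
    them counts as "median and mode are the same".
    """
    if not sample:
        return False
    return median(sample) in multimode(sample)


# Hypothetical performance readings, not data from the article.
print(is_suitable([2020, 2028, 2028, 2028, 2031]))  # median 2028, mode 2028
print(is_suitable([2000, 2000, 2028, 2030, 2040]))  # median 2028, mode 2000
```

For non-integer measurements, exact equality is fragile, so the values would probably need rounding to a fixed precision before the comparison.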
Analysis of samples for normal distribution
A number of statistical tests are available for checking whether a sample follows a normal distribution:
Shapiro-Wilk test,
skewness and kurtosis test,
Durbin's test,
D'Agostino test,
kurtosis test,
Vasicek test,
David-Hartley-Pearson test,
chi-square test,
Anderson-Darling test,
Filliben test.
The problem is that PostgreSQL does not yet provide implementations of normality tests, and its statistical tooling is limited in general.
Therefore, as a practical solution, it was decided to greatly simplify the process of testing a sample for normality.
To check how closely a sample approximates a normal distribution, the following technique was proposed, based on the PostgreSQL function normal_rand (provided by the tablefunc extension):
normal_rand(int numvals, float8 mean, float8 stddev) returns setof float8
The normal_rand function produces a set of random values that follow a normal (Gaussian) distribution.
The numvals parameter sets the number of values the function returns. The mean parameter sets the mean of the normal distribution (which coincides with its median), and stddev sets the standard deviation.
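For readers who want to experiment outside the database, the behaviour described above can be approximated in Python. This is a hedged stand-in, not the PostgreSQL function itself: it keeps the same parameter order as normal_rand, while the seed parameter is an addition made here for reproducibility.

```python
import random


def normal_rand(numvals, mean, stddev, seed=None):
    """Return numvals normally distributed values.

    Mirrors the parameter order of PostgreSQL's normal_rand; the seed
    argument is a hypothetical extra that does not exist in PostgreSQL.
    """
    rng = random.Random(seed)
    return [rng.gauss(mean, stddev) for _ in range(numvals)]


# Illustrative parameters, not measurements from the article.
values = normal_rand(1000, 2028.0, 3.0, seed=42)
print(len(values), min(values), max(values))
```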
Given the number of observations, the median, and the standard deviation of the original sample, a test sample is generated with normal_rand.
How closely the original sample approximates a normal distribution is then assessed by the variance of the differences between the values of the original sample and the sorted values obtained from normal_rand: the smaller this variance, the closer the sample is to normal.
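The assessment described above can be sketched as follows. This is an illustrative Python version under several assumptions, not the article's PostgreSQL code: both samples are sorted before differencing, the population standard deviation and variance are used (the text does not say which estimator was chosen), and random.gauss stands in for normal_rand. The name normality_score and the test data are hypothetical.

```python
import random
from statistics import median, pstdev, pvariance


def normality_score(sample, seed=None):
    """Variance of the pairwise differences between the sorted sample
    and a sorted normal reference sample of the same size, generated
    from the sample's median and standard deviation.

    Lower values suggest a closer fit to a normal distribution.
    """
    rng = random.Random(seed)
    center = median(sample)
    spread = pstdev(sample)  # assumption: population standard deviation
    reference = sorted(rng.gauss(center, spread) for _ in sample)
    differences = [x - r for x, r in zip(sorted(sample), reference)]
    return pvariance(differences)


# Synthetic data for illustration only.
data_rng = random.Random(1)
bell = [data_rng.gauss(2028, 3) for _ in range(1000)]
skewed = [2028 + data_rng.expovariate(1 / 3) for _ in range(1000)]

bell_score = normality_score(bell, seed=2)
skewed_score = normality_score(skewed, seed=2)
print(bell_score, skewed_score)
```

Because the reference sample is random, the score varies from run to run unless a seed is fixed, so it is better suited to ranking samples against each other than to serving as an absolute threshold.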
There are two approaches to choosing a sample:
Sort the samples by median performance, since we are looking for the sample with the maximum performance.
Sort the samples by variance, that is, look for the sample with the minimum variance of performance values.
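The two selection approaches can be sketched as below. The sample names and values are hypothetical, and the population variance is an assumption, since the text does not specify the estimator.

```python
from statistics import median, pvariance


def pick_by_max_median(samples):
    """Return the name of the sample with the highest median performance."""
    return max(samples, key=lambda name: median(samples[name]))


def pick_by_min_variance(samples):
    """Return the name of the sample with the lowest performance variance."""
    return min(samples, key=lambda name: pvariance(samples[name]))


# Hypothetical samples: name -> performance readings.
samples = {
    "run_a": [2010, 2028, 2028, 2040],
    "run_b": [2025, 2027, 2027, 2029],
    "run_c": [2000, 2035, 2035, 2060],
}

print(pick_by_max_median(samples))    # run_c (median 2035)
print(pick_by_min_variance(samples))  # run_b (variance 2)
```

As the example shows, the two criteria can select different samples, which is why both candidates are compared against a normal distribution later in the text.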
Practical implementation
Scenario
Trial load testing over 4 days:
19.08.2024 14:00 – 16:00
20.09.2024 11:00 – 17:00
21.09.2024 10:00 – 17:00
22.09.2024 10:00 – 17:00
During the trial load testing and the formation of the initial set, 235 samples satisfying the condition Median = Mode were obtained.
Preparing samples for normal distribution fit analysis
Maximum performance value
Minimum variance
Testing for normal distribution, generating test sample using normal_rand
Maximum performance value
Minimum variance
Result of the experiment
As the comparison shows, and as expected, the original sample with the minimum variance is the closest to a normal distribution and is therefore the solution to the problem.
Thus, based on the results of the load testing conducted during the period:
19.08.2024 14:00 – 16:00
20.09.2024 11:00 – 17:00
21.09.2024 10:00 – 17:00
22.09.2024 10:00 – 17:00
the following DBMS performance results can be recorded for this load and this infrastructure:
Performance value: 2028
Lower bound for performance degradation: 2022
Performance degradation: -1.28%