Statistical analysis of DBMS load-testing results in a cloud infrastructure

The measurements are smoothed with a median filter over a period of 1 hour.
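As a rough illustration of median smoothing, the sketch below applies a trailing median filter in Python. The function name `rolling_median`, the window of 3 points, and the readings are assumptions made for the example; the article does not say how many measurements fall into one hour or which tool performs the smoothing.

```python
from statistics import median

def rolling_median(values, window):
    """Trailing median filter: element i is the median of the last
    `window` values ending at i (fewer at the start of the series)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(median(values[start:i + 1]))
    return smoothed

# Hypothetical readings with a single outlier spike (500).
raw = [100, 102, 101, 500, 103, 99, 100]
print(rolling_median(raw, window=3))
```

The spike does not appear in the smoothed series, which is the reason a median filter is often preferred over a moving average for noisy performance data.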

A sample is considered suitable for further analysis if its median and mode are equal.

Median of a random variable: the number that divides the distribution in half. In other words, it is the value for which the probability of the random variable falling to its right equals the probability of it falling to its left (both equal 1/2). Equivalently, the median is the 50th percentile, the 0.5-quantile, or the second quartile of a sample or distribution.

Median (statistics) – Wikipedia (wikipedia.org)

Mode: the value or values that occur most frequently in a set of observations.

Mode (statistics) – Wikipedia (wikipedia.org)
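The Median = Mode suitability check can be sketched in Python as follows. The function name `is_suitable` and the sample values are hypothetical, and rejecting samples with several equally frequent modes is an assumption; the article does not say how ties are handled.

```python
from statistics import median, multimode

def is_suitable(sample):
    """Return True when the sample has a single mode equal to its median."""
    modes = multimode(sample)
    # Assumption: a sample with several equally frequent modes is rejected.
    return len(modes) == 1 and modes[0] == median(sample)

# Hypothetical performance readings, made up for illustration.
print(is_suitable([20, 22, 22, 22, 25]))  # median 22, mode 22
print(is_suitable([20, 20, 22, 25, 30]))  # median 22, mode 20
```

The check only makes sense for discrete or rounded values, since the mode of raw floating-point measurements is rarely informative.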

Analysis of samples for normal distribution

There are quite a few statistical tests available to determine whether a sample follows a normal distribution:

  • Shapiro-Wilk test,

  • skewness and kurtosis test,

  • Durbin's test,

  • D'Agostino test,

  • kurtosis test,

  • Vasicek test,

  • David-Hartley-Pearson test,

  • chi-square test,

  • Anderson-Darling test,

  • Filliben test.
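To give a feel for what such tests measure, the sketch below computes the two statistics behind the skewness and kurtosis test in plain Python. It is only the descriptive part, not a complete test: no critical values or p-values are produced, and the function name is an assumption made for this example.

```python
def skewness_and_excess_kurtosis(sample):
    """Moment-based skewness and excess kurtosis; both are close to 0
    for a large sample drawn from a normal distribution."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(sample) / n
    # Central moments of order 2, 3 and 4 (population form, divisor n).
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    if m2 == 0:
        raise ValueError("all observations are identical")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

skew, excess = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])
print(skew, excess)
```

A symmetric sample gives a skewness of 0, while a flat, uniform-like sample such as the one above gives a negative excess kurtosis.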

The problem is that PostgreSQL does not yet provide implementations of normality tests, and its built-in statistical functions are limited overall.

Therefore, as a practical solution, the check for normality was greatly simplified. The proposed technique relies on the PostgreSQL function normal_rand, which is provided by the tablefunc extension:

normal_rand(int numvals, float8 mean, float8 stddev) returns setof float8

The normal_rand function returns a set of random values that follow a normal (Gaussian) distribution.

The numvals parameter sets the number of values the function returns. The mean parameter sets the mean of the normal distribution (which for a normal distribution coincides with its median), and stddev sets the standard deviation.

  1. A test sample is generated with normal_rand from the number of observations, the median, and the standard deviation of the original sample.

  2. How closely the original sample approximates a normal distribution is assessed by the variance of the differences between the values of the original sample and the sorted values produced by normal_rand.
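The two steps above can be sketched in Python, with `random.gauss` standing in for PostgreSQL's normal_rand. Several details are assumptions not stated in the article: both samples are sorted before being paired, the sample (n − 1) forms of standard deviation and variance are used, and the names `difference_variance`, `bell_shaped` and `two_peaks` are invented for the example.

```python
import random
from statistics import median, stdev, variance

def normal_rand(numvals, mean, stddev, rng):
    """Stand-in for PostgreSQL's normal_rand: numvals Gaussian draws."""
    return [rng.gauss(mean, stddev) for _ in range(numvals)]

def difference_variance(original, rng):
    """Score how far a sample is from normal; lower means closer."""
    # Step 1: test sample with the same size, median and standard deviation.
    test = normal_rand(len(original), median(original), stdev(original), rng)
    # Step 2: pair the values by rank (assumption: both samples are sorted)
    # and take the variance of the pairwise differences.
    diffs = [o - t for o, t in zip(sorted(original), sorted(test))]
    return variance(diffs)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
bell_shaped = [rng.gauss(100.0, 5.0) for _ in range(500)]  # synthetic data
two_peaks = [90.0] * 250 + [110.0] * 250                   # synthetic data
print(difference_variance(bell_shaped, rng))
print(difference_variance(two_peaks, rng))
```

Because the test sample is random, the score changes from run to run unless the seed is fixed, so small differences between scores should not be over-interpreted.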

Two approaches can be used to choose the sample to test:

  1. Sort the samples by median performance, because we are looking for the sample with the maximum performance.

  2. Sort the samples by variance, that is, look for the sample with the minimum variance of performance values.
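The two selection approaches can be illustrated with a short Python sketch. The window labels and values are made up, and the helper names `best_by_median` and `best_by_variance` are assumptions introduced for the example.

```python
from statistics import median, variance

def best_by_median(samples):
    """Approach 1: the sample with the highest median performance."""
    return max(samples, key=lambda name: median(samples[name]))

def best_by_variance(samples):
    """Approach 2: the sample with the lowest variance of performance."""
    return min(samples, key=lambda name: variance(samples[name]))

# Hypothetical hourly windows; labels and values are invented.
samples = {
    "window_a": [140, 110, 160, 140, 140],
    "window_b": [127, 128, 128, 128, 129],
}
print(best_by_median(samples))    # window_a
print(best_by_variance(samples))  # window_b
```

The two criteria can select different samples, which is why both candidates are then compared against a normal test sample.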

Practical implementation

Scenario

Trial load testing was carried out over 4 days:

  1. 19.08.2024 14:00 – 16:00

  2. 20.09.2024 11:00 – 17:00

  3. 21.09.2024 10:00 – 17:00

  4. 22.09.2024 10:00 – 17:00

During the trial load testing and the formation of the initial set, 235 samples satisfying the condition Median = Mode were obtained.

Preparing samples for normal distribution fit analysis

  1. Maximum performance value

Fig. 1. Maximum performance value

Fig. 3. Probability distribution for the period 12:01 – 13:01 08/21/2024

Fig. 4. Probability distribution for the period 12:01 – 13:01 08/21/2024 – graph

  2. Minimum variance

Fig. 5. Minimum variance of performance

Fig. 6. Probability distribution for the period 10:32 – 11:32 08/21/2024

Fig. 7. Probability distribution for the period 10:32 – 11:32 – graph

Testing for normal distribution: generating a test sample with normal_rand

  1. Maximum performance value

Fig. 8. Test sample

Fig. 9. Test sample graph

Fig. 10. Graph of the original sample

Fig. 11. Difference table between the original and test samples

Fig. 12. Variance according to the difference table

  2. Minimum variance

Fig. 13. Test sample

Fig. 14. Test sample graph

Fig. 15. Graph of the original sample

Fig. 16. Difference table

Fig. 17. Variance according to the difference table

Result of the experiment

As the comparison shows, and as expected, the original sample with the minimum variance is the closest to a normal distribution and is therefore taken as the solution to the problem.

Thus, based on the results of the load testing conducted during the following periods:

  1. 19.08.2024 14:00 – 16:00

  2. 20.09.2024 11:00 – 17:00

  3. 21.09.2024 10:00 – 17:00

  4. 22.09.2024 10:00 – 17:00

the following DBMS performance results can be recorded for this load and this infrastructure:

  1. Performance value: 2028

  2. Lower bound for performance degradation: 2022

  3. Performance degradation: -1.28%
