Correlation Analysis for Resolving DBMS Performance Incidents

This article should have appeared before yesterday's post (https://habr.com/p/827156/), since the methods planned there for analyzing the performance of an individual query are a special case of a more general problem: analyzing the performance of the DBMS as a whole.

Introduction

A well-known basic concept from physics:

Performance is the amount of work completed per unit of time.

Rephrasing this definition for a DBMS, we introduce the concept of DBMS performance:

DBMS performance is the volume of information processed per unit of time.

The unit of information in a DBMS is a buffer cache page. During operation, the DBMS reads, writes, and modifies pages in the buffer cache. Thus, DBMS performance can be expressed as the modulus of a vector N of dimension (n1, n2, n3), where:

  • n1 – the number of blocks read per second;

  • n2 – the number of blocks written per second;

  • n3 – the number of blocks modified per second.

We also add two more dimensions to the vector: n4 = QPS (the number of queries completed per second) and n5 = TPS (the number of transactions committed per second).

Taken separately, QPS and TPS cannot serve as performance metrics, because they do not reflect the volume of information processed; they are only indicators of DBMS activity. The simplest analogy is a car's tachometer: the power a car actually delivers depends on engine speed, but is not determined by it alone.

As shown earlier (https://habr.com/ru/posts/827260/), the "DBMS response time" metric is not an indicator of DBMS performance and is not used in incident analysis.

So, the DBMS performance metric (hereafter CPI) is calculated as the modulus of the vector (n1, n2, n3, n4, n5) and will be used to monitor for and identify DBMS performance incidents.
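
For concreteness, a minimal sketch of the calculation in SQL, assuming the five per-second rates have already been collected into a hypothetical dbms_perf_samples relation (a collection sketch follows in the next section):

    -- CPI as the Euclidean norm (modulus) of the five per-second rates.
    -- dbms_perf_samples is defined in the collection sketch below.
    CREATE VIEW cpi AS
    SELECT ts,
           sqrt(n1^2 + n2^2 + n3^2 + n4^2 + n5^2) AS cpi
    FROM dbms_perf_samples;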

Using metrics to monitor DBMS performance

A simple cron script on the DBMS server regularly calculates the CPI metric, saves it to a service table, and sends it to monitoring. To suppress noise and outliers, the data is smoothed before charting.
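
A minimal sketch of such a collector, assuming PostgreSQL and a cron entry like * * * * * psql -d mydb -f collect_cpi.sql. Which cumulative counters best map onto n1–n5 is a judgment call: the mapping below (pg_stat_database for reads, tuple modifications, and commits; pg_stat_bgwriter for buffer writes, valid through PostgreSQL 16; pg_stat_statements for query counts) and all table names are assumptions:

    -- Hypothetical service table of raw cumulative counters, one row per run.
    CREATE TABLE IF NOT EXISTS cpi_raw (
        ts              timestamptz PRIMARY KEY,
        blks_read       bigint,
        blks_written    bigint,
        tuples_modified bigint,
        queries         bigint,
        commits         bigint
    );

    -- Snapshot of cumulative counters (run by cron, e.g. once a minute).
    INSERT INTO cpi_raw
    SELECT now(),
           (SELECT sum(blks_read) FROM pg_stat_database),
           (SELECT buffers_checkpoint + buffers_clean + buffers_backend
              FROM pg_stat_bgwriter),                       -- PostgreSQL <= 16
           (SELECT sum(tup_inserted + tup_updated + tup_deleted)
              FROM pg_stat_database),
           (SELECT sum(calls) FROM pg_stat_statements),     -- needs the extension
           (SELECT sum(xact_commit) FROM pg_stat_database);

    -- Per-second rates n1..n5 from consecutive snapshots; this is the
    -- dbms_perf_samples relation used by the cpi view above.
    CREATE OR REPLACE VIEW dbms_perf_samples AS
    SELECT ts,
           d_read / dt AS n1, d_written / dt AS n2, d_modified / dt AS n3,
           d_queries / dt AS n4, d_commits / dt AS n5
    FROM (SELECT ts,
                 extract(epoch FROM ts - lag(ts) OVER w)       AS dt,
                 blks_read       - lag(blks_read)       OVER w AS d_read,
                 blks_written    - lag(blks_written)    OVER w AS d_written,
                 tuples_modified - lag(tuples_modified) OVER w AS d_modified,
                 queries         - lag(queries)         OVER w AS d_queries,
                 commits         - lag(commits)         OVER w AS d_commits
          FROM cpi_raw
          WINDOW w AS (ORDER BY ts)) t
    WHERE dt > 0;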

To draw the graphs, median smoothing is used over 10-minute and 1-hour intervals.
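
A sketch of the 10-minute median smoothing, using PostgreSQL's exact-median aggregate percentile_cont over the hypothetical cpi view from above:

    -- Median of CPI over 10-minute buckets; the hourly variant just uses
    -- date_trunc('hour', ts) as the bucket.
    SELECT date_trunc('hour', ts)
           + make_interval(mins => 10 * (extract(minute FROM ts)::int / 10))
             AS bucket_10m,
           percentile_cont(0.5) WITHIN GROUP (ORDER BY cpi) AS cpi_median
    FROM cpi
    GROUP BY 1
    ORDER BY 1;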

The result is a graphical picture of the DBMS performance status:

Normal operation of the DBMS: blue graph – short moving average, red – long moving average.

Example of solving a DBMS performance degradation incident

A "shift" anomaly on the DBMS performance monitoring graph

Real degradation of DBMS performance

The task is to establish the reason for the drop in DBMS performance.

Step 1 – Pinpoint the time boundaries of the incident

It is almost impossible to determine the exact time of the incident from the monitoring chart alone. For clarity, the moving-average data is loaded into Excel (the same signal can also be computed in SQL; see the sketch after the timeline below).

Short minus long moving average

  • 11:10 – 12:58 – Normal performance

  • 12:58 – 13:08 – Performance degradation incident

  • 13:08 – 14:20 – Degraded performance: the problem persists.
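
The same short-minus-long signal can be computed directly in SQL instead of Excel; a sketch over the hypothetical cpi view, with trailing medians matching the smoothing windows described above:

    -- Trailing 10-minute and 60-minute medians and their difference;
    -- a sustained strongly negative value marks the degradation window.
    SELECT s.ts,
           (SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY c.cpi)
              FROM cpi c
             WHERE c.ts BETWEEN s.ts - interval '10 minutes' AND s.ts)
         - (SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY c.cpi)
              FROM cpi c
             WHERE c.ts BETWEEN s.ts - interval '60 minutes' AND s.ts)
             AS short_minus_long
    FROM cpi s
    ORDER BY s.ts;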

Step 2 – Confirm the performance degradation

We must rule out a drop in DBMS performance caused simply by a drop in load; in other words, check whether a decrease in the number of active sessions explains the decrease in performance (that would not be an incident). Conversely, an increase in the number of active sessions (which now includes sessions waiting for a resource to be released, not just executing queries) combined with a decrease in the volume of information processed per unit of time is a clear sign of a problem.

Calculating the correlation coefficient between the CPI values and the smoothed number of active sessions (Active connections) for each interval shows the following (a SQL sketch follows the list):

  • 11:10 – 12:58 – Normal performance: weak positive correlation between CPI and Active connections

  • 12:58 – 13:08 – Performance degradation incident: strong negative correlation between CPI and Active connections

  • 13:08 – 14:20 – Degraded performance: very weak negative correlation between CPI and Active connections
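
A minimal sketch of the per-interval calculation with PostgreSQL's built-in Pearson aggregate corr(), assuming active-session counts are sampled on the same schedule into a hypothetical sessions_smoothed table (ts, active_connections):

    -- Pearson correlation between CPI and active sessions in one interval;
    -- :'t_from' / :'t_to' are psql variables with the interval boundaries.
    SELECT corr(c.cpi, s.active_connections) AS cpi_vs_sessions
    FROM cpi c
    JOIN sessions_smoothed s USING (ts)
    WHERE c.ts >= :'t_from'::timestamptz
      AND c.ts <  :'t_to'::timestamptz;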

Hypothesis – the increase in the number of DBMS wait events in the period 12:58 – 13:08 caused the performance incident.

Step 3 – Establish the correlation between wait events and DBMS performance degradation

We calculate the correlation coefficient between the occurrence count of each wait event and the CPI metric.
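
A sketch of this calculation, assuming wait events are sampled regularly from pg_stat_activity into a hypothetical wait_samples table (ts, wait_event_type, wait_event, cnt):

    -- Correlation between each wait event's sampled count and CPI;
    -- the most negative coefficients (the main suspects) come first.
    SELECT w.wait_event_type,
           w.wait_event,
           corr(w.cnt, c.cpi) AS corr_with_cpi
    FROM wait_samples w
    JOIN cpi c USING (ts)
    GROUP BY w.wait_event_type, w.wait_event
    ORDER BY corr_with_cpi ASC NULLS LAST;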

Wait event count statistics sorted by correlation coefficient between wait event count and performance

A very important result of this calculation: the raw count of a wait event does not, by itself, correlate with DBMS performance.

Hypothesis – the wait events with the most negative correlation coefficients have the greatest impact on DBMS performance.

In other words, if you analyze wait monitoring graphs or wait-time reports, the top entries will be wait events that do not correlate with the drop in performance at all.

Step 4 – Identify the active sessions with the greatest negative impact on DBMS performance

What remains is a purely technical task: using any available method, obtain a sample of the history of the pg_stat_activity system view and pay special attention to sessions with wait_event_type = 'Lock' and wait_event = 'transactionid'.

These sessions are most likely the cause of performance degradation.
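
For a live check, the query below works against pg_stat_activity directly; for historical data, a sampling extension such as pg_wait_sampling or pgsentinel (or the same cron-based sampling) would be used:

    -- Sessions currently blocked waiting for another transaction's lock.
    SELECT pid,
           usename,
           state,
           xact_start,
           query
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock'
      AND wait_event      = 'transactionid';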

In this case, the cause was the use of queries of the form select 1 * from *** limit * for update.

Summary

Using a DBMS performance metric makes it possible to assess the state of the DBMS correctly.

Correlation analysis sharply narrows the range of hypotheses when searching for the causes of a drop in DBMS performance.
