Plotting a histogram of maximum and average query execution times for PostgreSQL
Background.
Emergency situation – the information system is terribly slow.
As usual, all managers are in panic.
As usual – “what’s up with the DBMS”?
As usual – “everything is fine with the DBMS – there are no critical errors, the response does not increase, there are no anomalies in the DBMS monitoring metrics.”
New introductory note: “All our requests have become very slow.”
Unfortunately, at the time of events, there was no way to get an objective picture of the query execution time.
The metric that allows you to obtain the DBMS response time shows the “average temperature in the hospital.” And given that the system is highly loaded, the average response time cannot be a reliable metric for assessing the performance of a DBMS. So, from various variations on the theme
SUM(total_time) / SUM(calls)
– not much benefit.
Convincing developers and managers that there are no problems on the DBMS side is not possible.
And that’s why the idea arose – to make
simple
a mechanism that allows you to obtain an assessment of the performance of the DBMS – histograms of the maximum and average query execution time, based on the extension already used
SQL – query
To build a histogram, use the standard function
width_bucket ( operand numeric, low numeric, high numeric, count integer ) → integer
width_bucket ( operand double precision, low double precision, high double precision, count integer ) → integer
Returns the number of the bucket in which operand falls in a histogram having count equal-width buckets spanning the range low to high. Returns 0 or count+1 for an input outside that range.
width_bucket(5.35, 0.024, 10.06, 5) → 3
The idea is as follows:
Build buckets of query execution time intervals:
- Up to one minute in 1 second increments
- Over 1 minute to 1 hour in 1 minute increments
As a result, a query was created to build a picture of the distribution of the average query execution time (for the maximum time it is similar, only the column is used
max_exec_timeinstead of mean_exec_time
from view
pg_stat_statements
):
WITH
total_count AS
(
select count(*) AS "count" from pg_stat_statements
),
under_1m AS
(
select
width_bucket(mean_exec_time, 0::double precision , 60000::double precision , 60) as b,
count(*) AS "count"
from
pg_stat_statements
group by
b
order by
b
),
over_1m AS
(
select
width_bucket(mean_exec_time, 60000::double precision , 3600000 , 60) as b,
count(*) AS "count"
from
pg_stat_statements
group by
b
order by
b
)
SELECT
CASE
WHEN u1m.b = 1 THEN 0
ELSE u1m.b-1
END AS "backet" ,
CASE
WHEN u1m.b = 1 THEN '< 1s'
ELSE '['||to_char(u1m.b-1 , '99')||'s -'||to_char(u1m.b , '99')||'s )'
END AS "range" ,
u1m.count AS "count" ,
ROUND( ( u1m.count::numeric / total_count.count::numeric) * 100 , 2 ) AS "pct" ,
repeat('*',( u1m.count::float / total_count.count * 100 )::int ) as bar
FROM
under_1m u1m , total_count
WHERE
u1m.b <= 12
UNION
SELECT
CASE
WHEN o1m.b = 1 THEN 61
ELSE (o1m.b-1) + 61
END AS "backet" ,
CASE
WHEN o1m.b < 61 THEN '['||to_char(o1m.b , '99')||'m -'||to_char(o1m.b+1 , '99')||'m )'
ELSE '> 1h'
END AS "range" ,
o1m.count AS "count" ,
ROUND( ( o1m.count::numeric / total_count.count::numeric) * 100 , 2 ) AS "pct" ,
repeat('*',( o1m.count::float / total_count.count * 100 )::int ) as bar
FROM
over_1m o1m , total_count
WHERE
o1m.b > 0
ORDER BY
1 ;
Result of the request:
backet | range | count | pct | bar
--------+---------------+-------+-------+-----------------------------------------------------------------------------------------------------
0 | < 1s | 4964 | 99.36 | ***************************************************************************************************
1 | [ 1s - 2s ) | 9 | 0.18 |
2 | [ 2s - 3s ) | 4 | 0.08 |
3 | [ 3s - 4s ) | 2 | 0.04 |
4 | [ 4s - 5s ) | 1 | 0.02 |
5 | [ 5s - 6s ) | 5 | 0.10 |
8 | [ 8s - 9s ) | 1 | 0.02 |
9 | [ 9s - 10s ) | 2 | 0.04 |
61 | [ 1m - 2m ) | 1 | 0.02 |
62 | [ 2m - 3m ) | 1 | 0.02 |
(10 rows)
Development of the idea
The rest is a matter of technique. You can make a bash script and use cron to receive “snapshots” in the form of text files, which, through simple manipulations, can be converted into Excel tables:
PS
All that remains is to wait for the next accident and save a lot of time searching for the real bottleneck of the information system.