Plotting a histogram of maximum and average query execution times for PostgreSQL

Background

An emergency: the information system is terribly slow.
As usual, all the managers are in a panic.
As usual – "what's wrong with the DBMS?"
As usual – "everything is fine with the DBMS: there are no critical errors, the response time is not growing, and there are no anomalies in the DBMS monitoring metrics."
And then a new claim arrives: "All our queries have become very slow."

Unfortunately, at the time these events took place, there was no way to get an objective picture of query execution times.

The metric that gives the overall DBMS response time shows only the "average temperature across the hospital." And since the system is heavily loaded, the average response time cannot serve as a reliable metric for assessing DBMS performance. So the various variations on the theme of

SUM(total_time) / SUM(calls)

are of little use.
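For reference, that kind of estimate looks roughly like this (a minimal sketch; on PostgreSQL 13+ the column is named total_exec_time rather than total_time):

-- overall average statement time in milliseconds, across everything tracked by pg_stat_statements
SELECT sum(total_exec_time) / sum(calls) AS avg_exec_time_ms
FROM pg_stat_statements;

A single number like this hides the distribution entirely, which is exactly the problem the histogram below is meant to solve.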

It was impossible to convince the developers and managers that there were no problems on the DBMS side.

And so the idea arose: create a simple mechanism for assessing DBMS performance, namely histograms of the maximum and average query execution times, based on the pg_stat_statements extension that was already in use.

SQL query

To build the histogram, the standard width_bucket function is used:

width_bucket ( operand numeric, low numeric, high numeric, count integer ) → integer

width_bucket ( operand double precision, low double precision, high double precision, count integer ) → integer

Returns the number of the bucket in which operand falls in a histogram having count equal-width buckets spanning the range low to high. Returns 0 or count+1 for an input outside that range.

width_bucket(5.35, 0.024, 10.06, 5) → 3
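As a quick check of how this maps onto the task at hand (a minimal sketch; the millisecond values below are made up for illustration, not taken from pg_stat_statements), the same call used in the query further down assigns execution times to 1-second-wide buckets:

select t as exec_time_ms,
       width_bucket(t, 0::double precision, 60000::double precision, 60) as bucket
from unnest(array[250, 1500, 59999, 75000]::double precision[]) as t;

-- 250    -> bucket  1  (under 1 s)
-- 1500   -> bucket  2  ([1 s, 2 s))
-- 59999  -> bucket 60  ([59 s, 60 s))
-- 75000  -> bucket 61  (above the range, i.e. over 1 minute)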

The idea is as follows:

Build buckets of query execution time intervals:

  • Up to 1 minute, in 1-second increments
  • From 1 minute to 1 hour, in 1-minute increments

The result is a query that builds a picture of the distribution of the average query execution time (the query for the maximum time is similar; it simply uses the max_exec_time column of the pg_stat_statements view instead of mean_exec_time):

WITH 
total_count AS
(
 -- total number of statements currently tracked by pg_stat_statements
 select count(*) AS "count" from pg_stat_statements
),
under_1m AS
(
 -- 60 one-second buckets for mean execution times under 1 minute
 -- (pg_stat_statements stores times in milliseconds)
 select 
  width_bucket(mean_exec_time, 0::double precision , 60000::double precision , 60) as b,
  count(*) AS "count"
 from 
  pg_stat_statements
 group by 
  b
 order by 
  b
),
over_1m AS
(
 -- 60 equal-width buckets for mean execution times from 1 minute up to 1 hour
 select 
  width_bucket(mean_exec_time, 60000::double precision , 3600000::double precision , 60) as b,
  count(*) AS "count"
 from 
  pg_stat_statements
 group by 
  b
 order by 
  b
)
-- sub-minute buckets: bucket 0 is '< 1s', bucket N is [N s, N+1 s)
SELECT 
  CASE 
    WHEN u1m.b = 1 THEN 0
    ELSE u1m.b - 1 
  END AS "bucket" , 
  CASE 
    WHEN u1m.b = 1 THEN '< 1s'
    ELSE '['||to_char(u1m.b-1 , '99')||'s -'||to_char(u1m.b , '99')||'s )'
  END AS "range" , 
  u1m.count AS "count" , 
  ROUND( ( u1m.count::numeric / total_count.count::numeric) * 100 , 2 ) AS "pct" ,
  repeat('*',( u1m.count::float / total_count.count * 100 )::int ) as bar
FROM
 under_1m u1m , total_count
WHERE 
 u1m.b <= 60   -- only the buckets up to one minute; longer queries are handled below
UNION 
-- buckets above one minute: numbered from 61 upwards, labelled in minutes
SELECT 
  CASE 
    WHEN o1m.b = 1 THEN 61
    ELSE (o1m.b-1) + 61
  END AS "bucket" , 
  CASE
    WHEN o1m.b < 61 THEN '['||to_char(o1m.b , '99')||'m -'||to_char(o1m.b+1 , '99')||'m )'
    ELSE '> 1h' 
  END AS "range" , 
  o1m.count AS "count" , 
  ROUND( ( o1m.count::numeric / total_count.count::numeric) * 100 , 2 ) AS "pct" ,
  repeat('*',( o1m.count::float / total_count.count * 100 )::int ) as bar
FROM
 over_1m o1m , total_count
WHERE 
 o1m.b > 0
ORDER BY
 1 ;
 

Result of the query:

 bucket |     range     | count |  pct  |                                                 bar
--------+---------------+-------+-------+-----------------------------------------------------------------------------------------------------
      0 | < 1s          |  4964 | 99.36 | ***************************************************************************************************
      1 | [  1s -  2s ) |     9 |  0.18 |
      2 | [  2s -  3s ) |     4 |  0.08 |
      3 | [  3s -  4s ) |     2 |  0.04 |
      4 | [  4s -  5s ) |     1 |  0.02 |
      5 | [  5s -  6s ) |     5 |  0.10 |
      8 | [  8s -  9s ) |     1 |  0.02 |
      9 | [  9s - 10s ) |     2 |  0.04 |
     61 | [  1m -  2m ) |     1 |  0.02 |
     62 | [  2m -  3m ) |     1 |  0.02 |
(10 rows)
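For the histogram of the maximum execution time, the query is identical except that max_exec_time replaces mean_exec_time in both CTEs. As a standalone sanity check (a sketch), the sub-minute bucketing for the maximum time looks like this:

select
  width_bucket(max_exec_time, 0::double precision, 60000::double precision, 60) as b,
  count(*) as "count"
from pg_stat_statements
group by b
order by b;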

Development of the idea

The rest is a matter of technique. You can write a bash script and run it from cron to collect "snapshots" as text files which, with a little manipulation, can be turned into Excel tables.
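For example (a minimal sketch, assuming the SQL above is saved as /var/lib/pgsql/scripts/exec_time_histogram.sql; the paths, output directory and connection parameters are placeholders that need adapting to the actual environment):

#!/bin/bash
# histogram_snapshot.sh -- dump the current histogram into a timestamped text file
OUT_DIR=/var/lib/pgsql/histograms            # assumed snapshot directory
TS=$(date +%Y%m%d_%H%M%S)
psql -d postgres -A -F ';' \
     -f /var/lib/pgsql/scripts/exec_time_histogram.sql \
     -o "${OUT_DIR}/histogram_${TS}.txt"

A crontab entry such as

*/10 * * * *  /var/lib/pgsql/scripts/histogram_snapshot.sh

then produces a snapshot every 10 minutes, and the unaligned, semicolon-separated output (-A -F ';') imports into Excel without any extra work.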

PS

All that remains is to wait for the next incident and save a lot of time in the search for the information system's real bottleneck.
