The RED (Rate, Errors, Duration) method is one of the popular approaches to performance monitoring. It is often used for monitoring microservicesthough nothing prevents you from using it for databases like MySQL.
IN Percona Monitoring and Management (PMM) v2, all the necessary information is collected in the ClickHouse database, and then it is a matter of technology to create using the built-in ClickHouse data source dashboard to visualize metrics.
When creating the dashboard, in addition to panels for RED, several additional panels were added to show some interesting things that can be done with Grafana + ClickHouse as a data source and the information that we store about MySQL query performance.
We see the classic RED panels showing the Query Rate (requests per second), Error Rate (errors), as well as the average and 99th percentile Query Latency (query execution time) for all nodes in the system. The panels below display information on specific nodes, which is very useful for comparing their performance. If one of the nodes starts to work differently from other similar nodes, then this is a reason for investigation.
With the help of filters (“Filters” at the top of the dashboard), you can view only the data you need. For example, you can select only “sbtest” schema queries for hosts located in the “datacenter4” region:
This ad-hoc filtering is very convenient. You can use regular expressions in filters, search by a specific QueryID, analyze requests from specific client hosts, etc. For a description of the columns available in ClickHouse, see the post Advanced Query Analysis in Percona Monitoring and Management with Direct ClickHouse Access…
From most of the panels, you can quickly go to Query Analytics to view detailed information about the performance of queries, or if you notice something unusual on one of the hosts, through the “Data Links” you can view the queries of this host – click on graph and follow the dedicated link:
For each of the systems separately, you can look at the same RED metrics as for the entire system as a whole. By default, I would leave these panels minimized, especially if you are monitoring many hosts.
We got acquainted with the panels of the RED method. Now let’s take a look at the Additional Dashboards in this dashboard.
Row Based Efficiency (row-based efficiency) shows how many rows were parsed per row returned or changed. Typically, values greater than 100 indicate bad indexes or very complex queries that read a lot of data and only return a few rows. Both of these cases require analysis.
Time-Based Efficiency (time-based efficiency) is based on the same math, but looks at the query execution time rather than the number of rows scanned. This allows you to diagnose problems with a slow disk or conflicting requests. Typically, a high-performance system should expect fractions of a millisecond to send or change a string to the client. Queries that return or update many rows will have a lower value.
Queries Per Host (the number of requests by host) speaks for itself and next to it it is very useful to see the Query Load Per Host (load by hosts), which shows the number of concurrent active requests. Here we can see that despite the fact that mysql4 does not have the highest number of queries (query rate), it has the highest load and the highest average number of active queries.
While thinking about what other metrics might be useful, I added the following additional panels:
These panels separate Query Processing Efficiency into READ queries (which return rows) and WRITE queries (which have row_affected).
QueryTime Based Efficiency – this is the same as described above, only with an emphasis on certain types of queries.
Data crunching efficiency (data processing efficiency) is a slightly different look at the same data. This shows how many rows are examined by the query versus the execution time of the query. This, on the one hand, shows the processing power of the system. A system with a lot of cores, having all the data in memory, can process millions of rows per second and do a lot of work. But that doesn’t mean query efficiency. In fact, systems that process a lot of data quickly often perform many full table scans.
Finally, there are several lists with requests.
Frequent queries, slowest queries (by average execution time), queries with the highest load, and queries that failed or failed. You can also see these queries in Query Analytics, but I wanted to show them here as an example.