One approach to monitoring and analyzing DBMS errors

One of the most unpleasant situations for a DBA: “something is happening with the DBMS, but there is no information about what exactly.” The first, and sometimes sufficient, way to avoid this uncertainty is to always have complete information about abnormal events in the DBMS, both for the current period and historically. Solving this fairly standard task comes down to monitoring and analyzing the DBMS log, and PostgreSQL is no exception here.

Below is one way to monitor PostgreSQL errors and obtain summary information using Zabbix. This article is not a tutorial or a roadmap, but rather a sketch intended for exchanging opinions with colleagues.

Monitoring DBMS errors

The monitoring method is very simple: continuously parse the DBMS log and send a notification when an error appears.

The service script, executed via cron, is built around a very simple construction (the Cyrillic tokens are the same severity labels as produced when the server log is written with a Russian locale):

cat $log_file | grep -E 'FATAL|ВАЖНО|PANIC|ПАНИКА|ERROR|ОШИБКА' 

This construction could be executed directly by the Zabbix agent, for example via a UserParameter as sketched after the list below, but in that case there are a couple of points to keep in mind:

  1. The Zabbix agent must have read access to the DBMS log directory.
  2. The Zabbix agent is subject to timeout restrictions, so under increased server load or during an infrastructure incident there may be gaps in monitoring; and it is precisely in such scenarios that error monitoring matters most.
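For illustration, the direct agent-side variant might look like the UserParameter below; the key name and log path are assumptions rather than part of the original setup:

# /etc/zabbix/zabbix_agentd.d/pgsql_log.conf (illustrative)
UserParameter=pgsql.log.errors,grep -cE 'FATAL|ВАЖНО|PANIC|ПАНИКА|ERROR|ОШИБКА' /var/log/postgresql/postgresql.log

Every poll of this item would run the grep inside the agent's check timeout, which is exactly the limitation described above.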

Therefore, a slightly different method is used: the Zabbix agent does not parse the DBMS log itself, but only reads metric values from files generated by a simple bash script on a cron schedule:

# ERROR | ОШИБКА
cat $log_file | grep -E 'ERROR|ОШИБКА' | wc -l > /tmp/error.count

# FATAL | ВАЖНО
cat $log_file | grep -E 'FATAL|ВАЖНО' | wc -l > /tmp/fatal.count

# PANIC | ПАНИКА
cat $log_file | grep -E 'PANIC|ПАНИКА' | wc -l > /tmp/panic.count
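Scheduled via cron, the script above could be wired up roughly like this; the script path, user, and one-minute interval are assumptions:

# /etc/cron.d/pg_log_counts (illustrative)
* * * * * postgres /usr/local/bin/pg_log_counts.sh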

As a result, no additional settings are needed on the Zabbix agent side, and the overhead of collecting the data is minimal.
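On the Zabbix side, the built-in vfs.file.contents item key is enough to read the counter files, so nothing has to change in the agent configuration; the keys below are only a sketch:

vfs.file.contents[/tmp/error.count]
vfs.file.contents[/tmp/fatal.count]
vfs.file.contents[/tmp/panic.count]

Creating these items with a numeric type of information makes them directly usable in triggers and graphs.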

For more in-depth analysis, separate files can and should be generated for specific error codes (SQLSTATE classes):

# Class 53 — Insufficient Resources
cat $log_file | grep -E 'FATAL|ВАЖНО|PANIC|ПАНИКА|ERROR|ОШИБКА' | grep -E ' 53000\|| 53100\|| 53200\|| 53400\|| 53500\|' | wc -l > /tmp/error53.count

# Class 58 — System Error (errors external to PostgreSQL)
cat $log_file | grep -E 'FATAL|ВАЖНО|PANIC|ПАНИКА|ERROR|ОШИБКА' | grep -E ' 58000\|| 58030\|| 58P01\|| 58P02\|' | wc -l > /tmp/error58.count

# Class XX — Internal Error
cat $log_file | grep -E 'FATAL|ВАЖНО|PANIC|ПАНИКА|ERROR|ОШИБКА' | grep -E ' XX000\|| XX001\|| XX002\|' | wc -l > /tmp/errorXX.count

To quickly respond to errors, you can and should configure a standard Zabbix alert:

Problem started at HH:MI:SS on YYYY.MM.DD
Problem name: ERRORXX
Host: XXX
...
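The alert itself can be driven by an ordinary trigger on one of these items. The expression below fires as soon as at least one PANIC has been counted; it uses Zabbix 6.x syntax, and the host and item names are placeholders:

last(/PostgreSQL host/vfs.file.contents[/tmp/panic.count])>0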

Zabbix metrics

Adding these metrics to Zabbix produces a picture that lends itself well to both operational and historical analysis:



Bottom line

A further development of the idea is to parse and analyze part (or all) of the DBMS log to obtain a summary report broken down by error code class:

Class 25 — Invalid Transaction State: 48
Class 42 — Syntax Error or Access Rule Violation: 453
Class 57 — Operator Intervention: 186
...
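Such a report can be produced with the same shell tools. A minimal sketch, assuming the SQLSTATE code is written to the log as a '|'-delimited field (the %e placeholder in log_line_prefix), as in the per-class greps above:

# count log lines per SQLSTATE class (the first two characters of the error code)
grep -E 'FATAL|ВАЖНО|PANIC|ПАНИКА|ERROR|ОШИБКА' "$log_file" \
  | grep -oE ' [0-9A-Z]{5}\|' \
  | cut -c2-3 \
  | sort | uniq -c | sort -rn

Mapping the class codes to their human-readable names, as in the report above, is then a matter of a small lookup table.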
