Problems using a MySQL database with Zabbix
While the database and the number of metrics stored in it were small, everything was fine. The regular housekeeper process, started by Zabbix Server itself, successfully deleted obsolete records from the database and kept it from growing. However, as soon as the number of collected metrics increased and the database reached a certain size, things got worse. The housekeeper no longer managed to delete data within its allotted time interval, and old data started to accumulate in the database. While the housekeeper was running, the load on the Zabbix Server rose and could stay high for a long time. It became clear that the situation had to be dealt with.
This is a known problem: almost everyone who has run large-scale monitoring on Zabbix has faced the same thing. There were several solutions, too: for example, replacing MySQL with PostgreSQL or even Elasticsearch, but the simplest and most proven one was to switch to partitioned tables for the metric data in the MySQL database. That is the path we chose.
Migrating from regular MySQL tables to partitioned ones
Zabbix is well documented, and the tables where it stores metrics are known. These are:

- history, where float values are stored;
- history_str, where short string values are stored;
- history_text, where long text values are stored;
- history_uint, where integer values are stored.

There is also the trends table, which stores the dynamics of changes, but we decided not to touch it: its size is small, and we will return to it a little later.
In general, it was clear which tables needed processing. We decided to make partitions per week, based on the days of the month, i.e. four partitions per month: from the 1st to the 7th, from the 8th to the 14th, from the 15th to the 21st, and from the 22nd to the 1st of the next month (so the last one is slightly longer). The difficulty was that we needed to convert these tables to partitioned ones "on the fly", without interrupting Zabbix Server or the collection of metrics.
Oddly enough, the very structure of these tables came to our aid here. For example, the history table has the following structure:
`itemid` bigint(20) unsigned NOT NULL,
`clock` int(11) NOT NULL DEFAULT '0',
`value` double(16,4) NOT NULL DEFAULT '0.0000',
`ns` int(11) NOT NULL DEFAULT '0',
KEY `history_1` (`itemid`, `clock`)
As you can see, each metric ends up in the table with two fields that are very important and convenient for us: itemid and clock. We can therefore create a temporary table, for example named history_tmp, configure partitioning on it, and transfer all the data from history into it; then rename history to history_old and history_tmp to history; then top up the data that arrived in the meantime from history_old, and finally delete history_old. This can be done completely safely: we will not lose anything, because the itemid and clock fields tie each metric to a specific point in time rather than to some sequence number.
The transition procedure itself
Attention! Before starting any of this, it is highly advisable to make a full backup of the database. We are all human and can make a mistake in a command, which can lead to data loss. Yes, the backup will not be perfectly up to date, but it is better to have one than none!
So, do not turn off or stop anything. The main thing is that the MySQL server itself must have enough free disk space: for each of the tables listed above (at the very least for the largest of them, history_uint) there must be enough space to create a copy with the "_tmp" suffix, given that it will be the same size as the original table.
We will not describe everything several times over for each of the tables; let us walk through the whole process using just one of them, the history table, as an example.

So, create an empty history_tmp table based on the structure of history:

CREATE TABLE `history_tmp` LIKE `history`;
Now create the partitions we need. For example, let's create them for one month ahead. Each partition is defined by a rule on the value of the clock field, which we compare against a Unix timestamp:
ALTER TABLE `history_tmp` PARTITION BY RANGE (clock) (
PARTITION p20190201 VALUES LESS THAN (UNIX_TIMESTAMP("2019-02-01 00:00:00")),
PARTITION p20190207 VALUES LESS THAN (UNIX_TIMESTAMP("2019-02-07 00:00:00")),
PARTITION p20190214 VALUES LESS THAN (UNIX_TIMESTAMP("2019-02-14 00:00:00")),
PARTITION p20190221 VALUES LESS THAN (UNIX_TIMESTAMP("2019-02-21 00:00:00")),
PARTITION p20190301 VALUES LESS THAN (UNIX_TIMESTAMP("2019-03-01 00:00:00"))
);
This statement adds partitioning to the history_tmp table we just created. To clarify: rows whose clock value is less than "2019-02-01 00:00:00" will fall into the partition p20190201; rows whose clock is greater than or equal to "2019-02-01 00:00:00" but less than "2019-02-07 00:00:00" will go into p20190207, and so on.
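To check how rows are distributed across the partitions, you can query the partition metadata (a quick sketch, assuming the usual information_schema access):

```sql
-- Check how rows are distributed across the partitions of history_tmp.
-- TABLE_ROWS is an estimate for InnoDB, but it is enough to see that
-- data is landing where expected.
SELECT PARTITION_NAME, PARTITION_DESCRIPTION, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'history_tmp';
```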
Important note: what happens if the table receives data whose clock value is greater than or equal to "2019-03-01 00:00:00"? Since no suitable partition exists for such rows, they will not make it into the table and will be lost. So you must not forget to create additional partitions in time to avoid this kind of data loss (more on that below).
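One optional safeguard against this kind of loss (not part of the scheme described above, so treat it as a variation) is a catch-all MAXVALUE partition. Note that once it exists, new dated partitions must be carved out of it with REORGANIZE PARTITION instead of ADD PARTITION:

```sql
-- Catch-all partition: rows newer than the last dated boundary land here
-- instead of being rejected.
ALTER TABLE `history_tmp`
  ADD PARTITION (PARTITION pMAX VALUES LESS THAN MAXVALUE);

-- Later, new dated partitions are split out of pMAX:
ALTER TABLE `history_tmp` REORGANIZE PARTITION pMAX INTO (
  PARTITION p20190307 VALUES LESS THAN (UNIX_TIMESTAMP('2019-03-07 00:00:00')),
  PARTITION pMAX VALUES LESS THAN MAXVALUE
);
```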
So, the temporary table is prepared; now fill it with data. The process can take quite a long time, but fortunately it does not block any other queries, so you just have to be patient:
INSERT IGNORE INTO `history_tmp` SELECT * FROM history;
The IGNORE keyword is not required during this initial fill, since the table is still empty; however, it will be needed later when topping up the data. It is also useful if you have to interrupt the process and start it again.
So, after some time (maybe even a few hours), the first data load has finished. As you understand, history_tmp now contains not all the data from history, but only what was there at the moment the query started. Here you actually have a choice: either make one more pass (if the fill took a long time), or proceed straight to renaming the tables as described above. Let's do the extra pass first. We need to find the time of the last record inserted into history_tmp:

SELECT MAX(clock) FROM history_tmp;
Let's say you got 1551045645. Now use this value in a second pass of the data fill:

INSERT IGNORE INTO `history_tmp` SELECT * FROM history WHERE clock >= 1551045645;
This pass should finish much faster. But if the first pass took hours and the second also ran for a long time, it may make sense to do a third pass, performed exactly like the second.
Finally, obtain once more the time of the last record inserted into history_tmp:

SELECT MAX(clock) FROM history_tmp;

Let's say you got 1551085645. Save this number – we will need it for the final top-up!
And now, when the initial fill of history_tmp is over, proceed to renaming the tables:

BEGIN;
RENAME TABLE history TO history_old;
RENAME TABLE history_tmp TO history;
COMMIT;
We wrapped this block in a transaction to avoid inserting data into a nonexistent table: between the first RENAME and the second, the history table does not exist. And even if some data does arrive for history between the RENAME operations, while the table does not yet exist, we will only get a small number of insertion errors that can be neglected (this is monitoring, not a bank).
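As an aside, MySQL can also perform both renames in a single RENAME TABLE statement, which is atomic, so there is no window during which the history table is missing; this is a variation rather than the author's method:

```sql
-- Atomic swap: both renames happen as one operation, so there is no
-- moment when the history table does not exist.
RENAME TABLE history TO history_old,
             history_tmp TO history;
```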
Now we have a new partitioned history table, but it is missing the data that arrived during the last pass of inserting into history_tmp. That data, however, is in history_old, and we will now top it up from there. To do this, we need the previously saved value 1551085645. Why did we save that value instead of taking the maximum fill time from the current history table? Because new data is already flowing into it, and we would get the wrong time. So, top up the data:

INSERT IGNORE INTO `history` SELECT * FROM history_old WHERE clock >= 1551085645;
After this operation completes, the new partitioned history table contains all the data that was in the old one, plus the data that arrived after the rename. The history_old table is no longer needed. You can delete it right away, or (if you are feeling paranoid) make a backup copy of it before deleting it.
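Before dropping the old table, a rough sanity check cannot hurt (the counts will not match exactly, since the new table already contains fresher data, but the new count should not be smaller):

```sql
-- Rough comparison: new_rows should be >= old_rows.
SELECT (SELECT COUNT(*) FROM `history`)     AS new_rows,
       (SELECT COUNT(*) FROM `history_old`) AS old_rows;

-- Once satisfied, drop the old table.
DROP TABLE `history_old`;
```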
The whole process described above needs to be repeated for the history_str, history_text and history_uint tables.
What needs to be fixed in Zabbix Server settings
From now on, maintaining the historical data in the database rests on our shoulders. This means Zabbix should no longer delete old data – we will do it ourselves. To stop Zabbix Server from trying to clean up the data on its own, open the Zabbix web interface, select "Administration" in the menu, then the "General" submenu, then choose "Housekeeping" in the drop-down list on the right. On the page that appears, uncheck all the boxes in the "History" group and click the "Update" button. This prevents unnecessary cleanup of the history* tables by the housekeeper.
On the same page, pay attention to the "Trends" group. This is exactly the trends table we promised to return to. If it, too, has grown too large and needs partitioning, uncheck that group as well, and then process the table exactly as we did the history* tables.
Further database maintenance
As mentioned earlier, for the partitioned tables to keep working normally, new partitions must be created in time. You can do it like this:

ALTER TABLE `history` ADD PARTITION (PARTITION p20190307 VALUES LESS THAN (UNIX_TIMESTAMP("2019-03-07 00:00:00")));
In addition, since we created partitioned tables and forbade Zabbix Server to clean them up, deleting old data is now our concern. Fortunately, this poses no problem at all: it is done simply by dropping the partition whose data we no longer need. For example:
ALTER TABLE history DROP PARTITION p20190201;
Unlike a DELETE FROM statement with a date range, DROP PARTITION completes in a couple of seconds, puts no load on the server at all, and works just as smoothly when MySQL replication is in use.
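To avoid forgetting to add partitions, this step can be scripted and run from cron. A minimal sketch, assuming GNU date and a mysql client configured via ~/.my.cnf; it covers only the month-boundary partition (the weekly boundaries follow the same pattern), and the database name "zabbix" is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: print the ALTER TABLE statement for the partition
# covering the 1st of next month, then pipe it into the mysql client, e.g.:
#   ./add_partition.sh | mysql zabbix
FIRST=$(date -u +%Y-%m-01)                               # 1st of the current month
NAME=$(date -u -d "$FIRST +1 month" +p%Y%m01)            # partition name, e.g. p20190301
TS=$(date -u -d "$FIRST +1 month" +"%Y-%m-01 00:00:00")  # partition boundary timestamp
echo "ALTER TABLE \`history\` ADD PARTITION (PARTITION $NAME VALUES LESS THAN (UNIX_TIMESTAMP('$TS')));"
```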
The described solution has stood the test of time: the data volume keeps growing, yet there is no noticeable slowdown in performance.