Zabbix Performance Tuning

When is it time to tune Zabbix?

If it seems that Zabbix could work faster, you are not imagining it. Visible performance problems usually show up in the following places:

  • Zabbix queue has a lot of pending items (see Administration -> Queue)

  • Frequent gaps in graphs, missing data for some elements

  • False positives of nodata() triggers

  • Slow web interface

  • No events at all, or far too many events

  • Suspicious messages in logs

The presence of one or more of the listed symptoms indicates that it is time to think about tuning Zabbix performance.

What are we going to tune in Zabbix?

Let's start right away with where to look for optimization points:

  • number of items and their collection interval

  • types of information and item types

  • number of triggers and their complexity

  • housekeeper settings and database size

  • low-level discovery (LLD) settings

  • number of users working with the Zabbix web interface

Let's see how the load changes with the number of hosts and items. To keep things simple, we treat the collection interval as a constant. Take 60 items per host, each collected once per minute:

Number of hosts    New values per second
100                100
1000               1000
10000              10000

And here is the picture with 300 items per host and the same collection interval of once per minute:

Number of hosts    New values per second
100                500
1000               5000
10000              50000

Notice how the number of new values per second (NVPS) changes when you change, for example, the collection interval or the number of items.
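The arithmetic behind both tables fits in one line. Here is a minimal sketch, assuming every item on every host uses the same interval; the function name nvps is ours, not something Zabbix provides:

```python
def nvps(hosts: int, items_per_host: int, interval_seconds: float) -> float:
    """Estimate new values per second for a uniform collection interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return hosts * items_per_host / interval_seconds


# Reproduce the two tables: 60 and 300 items per host, one value per minute.
for hosts in (100, 1000, 10000):
    print(hosts, nvps(hosts, 60, 60), nvps(hosts, 300, 60))
```

Halving the interval doubles the result: nvps(100, 60, 30) gives 200.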

The next place to look is the type of information. Numeric values are cheaper than String values to store and to process. Note that you can get rid of most String values by converting them to numbers at the preprocessing stage; to keep them human-readable, use value mapping. Numeric values have another advantage: Zabbix calculates trends from them, which keeps long-term storage compact.

Now let's look at item types at the highest level: active and passive. In terms of efficiency, active collection is preferable. The reason is simple: with passive checks, the Zabbix server or proxy pollers have to wait for each response. If something is slow to return data, the remaining items are collected more slowly too. Admittedly, asynchronous pollers appeared in Zabbix 7.0, but active checks are still faster. Ideally, move all data collection to a proxy.

The number of triggers and their complexity play a significant role in performance. To see where the room for optimization is, let's look at two functions used in trigger expressions: avg and trendavg.

Let's start with avg. To speed up the calculation of trigger expressions, calculated items, and some macros, Zabbix has a value cache. This cache serves historical data in place of SQL queries to the database. If the required values are missing from the cache, they are requested from the database and the cache is updated accordingly. Obviously, if you have a lot of such calculations, the cache should be kept large. Its size is set by the ValueCacheSize parameter in the Zabbix server configuration (8 MB by default; the allowed range is 128K to 64G).

Item values remain in the value cache until:

  • the item is deleted (cached values are removed after the next configuration synchronization)

  • the value falls outside the time or count range specified in the trigger or calculated item expression (the cached value is removed when a new value is received)

  • the time or count range specified in the trigger or calculated item expression is changed so that less data is required for the calculation (unnecessary cached values are removed after 24 hours)
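The cache-miss path described above can be pictured with a toy read-through cache. This is only a sketch of the idea, not the Zabbix implementation: the class ValueCacheSketch, the load_from_db callback, and the db_reads counter are all hypothetical names introduced for illustration.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List


class ValueCacheSketch:
    """Toy read-through cache: serve values from memory, hit the DB only on a miss."""

    def __init__(self, load_from_db: Callable[[int, int], List[float]]):
        self._load_from_db = load_from_db  # hypothetical DB accessor
        self._values: Dict[int, Deque[float]] = defaultdict(deque)
        self.db_reads = 0  # counts how often we had to fall back to the DB

    def add(self, item_id: int, value: float) -> None:
        """Store a freshly collected value (newest on the right)."""
        self._values[item_id].append(value)

    def last(self, item_id: int, count: int) -> List[float]:
        """Return the newest `count` values, oldest first."""
        cached = self._values[item_id]
        if len(cached) < count:
            # Cache miss: fetch the requested range once and keep it in memory.
            self.db_reads += 1
            cached = deque(self._load_from_db(item_id, count))
            self._values[item_id] = cached
        return list(cached)[-count:]
```

With this sketch, the first request for the last three values of an item costs one database read; later requests for the same or a shorter range cost none, which is exactly why a cache that is too small hurts trigger processing.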

Now let's move on to trendavg. While the value cache serves calculations over short periods, the trend function cache serves calculations over much longer time intervals. Trend calculations are less resource-intensive, but also less accurate, because they work with hourly trend values instead of raw history.

This type of cache is controlled by the Zabbix server parameter TrendFunctionCacheSize (by default it is 4MB, you can set it from 128K to 2G).
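To see why hourly values are cheaper but less accurate, here is a small sketch that folds raw samples into per-hour rows (count, min, avg, max) and compares two averages. The function hourly_trends and the sample data are ours; averaging the hourly averages is just one simple way to combine the rows and is not a claim about how Zabbix computes trendavg internally.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def hourly_trends(samples: Iterable[Tuple[int, float]]) -> Dict[int, dict]:
    """Group (unix_timestamp, value) samples into per-hour num/min/avg/max rows."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for clock, value in samples:
        buckets[clock - clock % 3600].append(value)  # key = start of the hour
    return {
        hour: {"num": len(v), "min": min(v), "avg": mean(v), "max": max(v)}
        for hour, v in sorted(buckets.items())
    }


# Three samples in the first hour, one in the second (made-up data).
samples = [(0, 10.0), (60, 20.0), (120, 90.0), (3600, 30.0)]
trends = hourly_trends(samples)
raw_avg = mean(value for _, value in samples)            # average over raw history
trend_avg = mean(row["avg"] for row in trends.values())  # average of hourly averages
print(raw_avg, trend_avg)  # 37.5 35.0
```

Four raw values shrink to two rows, and the two averages differ because the hours hold a different number of samples.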

Conclusion: choose the function that matches the period. Use avg for short periods and trendavg for long ones.

Housekeeper settings and database size are the next stop. The housekeeper is the process that cleans out obsolete data. If you leave this to the standard Zabbix mechanism (SQL DELETE queries) with the standard settings (once an hour), the database will get a regular burst of load. The optimization here is to set up partitioning in MySQL or to use TimescaleDB with PostgreSQL. Old data is then removed by dropping whole partitions in the first case and whole chunks in the second, which is much cheaper than deleting row by row.
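With partitioning, cleanup becomes a question of "which partitions are past retention?" and not "which rows?". A minimal sketch of that decision, assuming daily partitions named pYYYY_MM_DD; the naming scheme and the function partitions_to_drop are hypothetical and depend entirely on how your partitioning is set up:

```python
from datetime import date, timedelta
from typing import List


def partitions_to_drop(partitions: List[str], today: date, keep_days: int) -> List[str]:
    """Return names of daily partitions (assumed 'pYYYY_MM_DD') older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in partitions:
        year, month, day = (int(part) for part in name[1:].split("_"))
        if date(year, month, day) < cutoff:
            expired.append(name)
    return expired


names = ["p2024_01_01", "p2024_01_05", "p2024_01_10"]
print(partitions_to_drop(names, today=date(2024, 1, 10), keep_days=7))
```

Each name returned here would be removed with a single partition drop on the database side, however many rows it holds.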

We can't help but remind you about throttling, which lets you avoid writing duplicate values to Zabbix. It is configured with the preprocessing steps Discard unchanged and Discard unchanged with heartbeat. There are also custom intervals (flexible and scheduling), which let you skip data collection when it is not needed.
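The logic of throttling with a heartbeat is easy to sketch: keep a value if it differs from the last stored one, or if the heartbeat period has passed since the last stored one. The function below is our illustration of that rule under these assumptions, not the Zabbix preprocessing code:

```python
from typing import Iterable, List, Optional, Tuple


def discard_unchanged_with_heartbeat(
    samples: Iterable[Tuple[int, str]], heartbeat: int
) -> List[Tuple[int, str]]:
    """Keep a sample if its value changed or `heartbeat` seconds passed since the last kept one."""
    kept: List[Tuple[int, str]] = []
    last_value: Optional[str] = None
    last_clock: Optional[int] = None
    for clock, value in samples:
        changed = last_clock is None or value != last_value
        expired = last_clock is not None and clock - last_clock >= heartbeat
        if changed or expired:
            kept.append((clock, value))
            last_value, last_clock = value, clock
    return kept


# One value per minute (made-up data), heartbeat of two minutes.
samples = [(0, "up"), (60, "up"), (120, "up"), (180, "down"), (240, "down")]
print(discard_unchanged_with_heartbeat(samples, heartbeat=120))
# [(0, 'up'), (120, 'up'), (180, 'down')]
```

Five collected values turn into three stored ones, and the heartbeat still guarantees that a value arrives regularly, so nodata() triggers are not fooled by the silence.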

A "ready or not, here I come" approach is not very effective for low-level discovery. The key idea: the less often, the better. Take inventory of your most commonly used templates and check whether new objects really need to be looked for as often as currently configured.

We cannot tune the number of users itself, so we simply size the Zabbix configuration for the required load. The key factor is the number of connections to the database.

Diagnostic tools

There are different ways to conduct a primary diagnosis:

  • Use system utilities such as top, ntop, iostat, vmstat, and sar

  • Check process utilization metrics in the Zabbix web interface, using the Template App Zabbix Server, Template App Zabbix Proxy, and Template App Zabbix Agent templates (alerter, configuration syncer, DB watchdog, discoverer, escalator, history syncer, http poller, housekeeper, icmp pinger, ipmi poller, poller, trapper, and other processes)

  • View cache utilization metrics in the Zabbix web interface (history write cache, value cache, trend write cache, vmware cache, and others)

  • Use strace or look in the log, having previously increased the logging level

  • Check with ps aux | grep zabbix_server

  • Set LogSlowQueries=3000 and look for possible database problems (grep slow /var/log/zabbix/zabbix_server.log)

  • Enable Debug mode for a user in Zabbix web interface

  • Check database performance using innotop or pg_top

We have prepared some examples to help you understand the bad and good situations.

[Screenshot: an unpleasant, but not critical, situation with queues]

[Screenshot: a good situation with queues]

Example output of ps ax | grep sync in a problematic situation:

# ps ax | grep sync
history syncer #1 [synced 1020 items in 311.198752 sec, syncing history]
history syncer #2 [synced 915 items in 311.177799 sec, syncing history]
history syncer #3 [synced 3401 items in 311.936376 sec, syncing history]
history syncer #4 [synced 1194 items in 311.280719 sec, syncing history]

Example output of ps ax | grep sync when everything is good:

# ps ax | grep sync
zabbix_server: history syncer #1 [synced 2405 items in 0.458134 sec, syncing history]
zabbix_server: history syncer #2 [synced 31 items in 0.090514 sec, idle 4 sec]
zabbix_server: history syncer #3 [synced 0 items in 0.000018 sec, idle 4 sec]
zabbix_server: history syncer #4 [synced 0 items in 0.000009 sec, syncing history]
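If you want to check this automatically, the status lines are easy to parse. A minimal sketch that picks out slow history syncers from the ps output format shown above; the function slow_syncers and the one-second threshold are our assumptions, so adjust the threshold to your installation:

```python
import re
from typing import List, Tuple

# Matches e.g. "history syncer #1 [synced 1020 items in 311.198752 sec, ..."
SYNCER_RE = re.compile(r"history syncer #(\d+) \[synced (\d+) items in ([\d.]+) sec")


def slow_syncers(ps_output: str, max_seconds: float = 1.0) -> List[Tuple[int, int, float]]:
    """Return (syncer number, items, seconds) for every syncer slower than max_seconds."""
    slow = []
    for match in SYNCER_RE.finditer(ps_output):
        number, items, seconds = int(match[1]), int(match[2]), float(match[3])
        if seconds > max_seconds:
            slow.append((number, items, seconds))
    return slow


bad = "history syncer #1 [synced 1020 items in 311.198752 sec, syncing history]"
good = "zabbix_server: history syncer #1 [synced 2405 items in 0.458134 sec, syncing history]"
print(slow_syncers(bad))   # [(1, 1020, 311.198752)]
print(slow_syncers(good))  # []
```

A syncer that needs hundreds of seconds for about a thousand items, as in the problematic output, usually points at the database and not at Zabbix itself.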

After enabling Debug mode for a user in the Zabbix web interface, you can see how well your database and web server are coping. Let's look at what a database problem looks like:

******************** Script profiler ********************
Total time: 10.960905
Total SQL time: 10.749027
SQL count: 5636 (selects: 4065 | executes: 1571)
Peak memory usage: 180.5M
Memory limit: 2G

Now let's look at a case where the problem is in the web server:

******************** Script profiler ********************
Total time: 10.960905
Total SQL time: 0.749027
SQL count: 5636 (selects: 4065 | executes: 1571)
Peak memory usage: 180.5M
Memory limit: 2G

And here's when everything is almost perfect:

******************** Script profiler ********************
Total time: 0.960905
Total SQL time: 0.749027
SQL count: 5636 (selects: 4065 | executes: 1571)
Peak memory usage: 180.5M
Memory limit: 2G
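The three profiler outputs differ only in two numbers, so the reasoning can be written down as a rule of thumb: if the page is slow and most of the time is SQL time, look at the database; if the page is slow and SQL time is small, look at the web server. A sketch of that rule follows; the function diagnose_profiler and both thresholds are illustrative assumptions, not values recommended by Zabbix:

```python
def diagnose_profiler(total_time: float, sql_time: float,
                      slow_page: float = 3.0, sql_share: float = 0.5) -> str:
    """Classify a Script profiler summary; both thresholds are illustrative assumptions."""
    if total_time < slow_page:
        return "ok"  # the page renders quickly enough
    if sql_time / total_time >= sql_share:
        return "database"  # most of the time is spent waiting for SQL
    return "web server"  # the page is slow, but SQL is not to blame


print(diagnose_profiler(10.960905, 10.749027))  # database
print(diagnose_profiler(10.960905, 0.749027))   # web server
print(diagnose_profiler(0.960905, 0.749027))    # ok
```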

What conclusions can be drawn?

In this article, we tried to analyze the most effective points for optimizing Zabbix performance. There are more, but these are the ones you should pay attention to first. We hope that the article was useful and helped you look at your installation from a different angle. Thank you very much for your attention!
