Monitoring a distributed system with Zabbix, using Apache Ignite as an example

Introduction

Monitoring complex distributed systems can be a real headache, both in setting up metrics and keeping them up to date, and in terms of performance. The easiest approach is to prevent most problems in advance, as early as the design stage.

The most common problems are:

  • Performance – when processing a large number of metrics from many nodes, the monitoring system may be unable to keep up with the incoming data stream.

  • Impact of monitoring on system performance – collecting metrics can be quite expensive for the end system.

  • Unnecessary complexity – the monitoring system should be the part that you trust. The more complex the solution, the higher the probability of failure, especially in the event of any changes.

General recommendations when building a monitoring system:

  • Simpler is better.

  • Reduce the load on the metrics collection server – with a large number of nodes, it is better to perform complex calculations on the nodes themselves and send ready-made values to the server.

  • Reduce the frequency of metric collection – collect only as often as is needed to detect problems, especially for “heavy” metrics.

  • Automate regularly performed actions – as the number of nodes grows, manual actions will inevitably lead to errors.

Using these guidelines, we will create a template and configure monitoring for a test cluster. The resulting template is available in the Zabbix repository.

Creating a template

To build your dream monitoring, you need to have a good understanding of the product you want to track and evaluate.

Apache Ignite is an in-memory computing platform used as a cache, a distributed computing system, and a database. The official documentation is a good starting point for a more detailed study of the product. The key external indicators of system performance are a fairly standard set:

  • CPU load

  • RAM utilization

  • Disk utilization, if persistence is enabled

  • Network

Templates for monitoring these metrics already ship with Zabbix, and for advanced metrics such as disk utilization there are ready-made solutions in the Zabbix repository.

When external indicators reach extreme values, this points either to hardware failures, which distributed systems usually survive without consequences, or to complete unavailability, when the service is already down. To detect dangerous situations in advance, before an incident occurs, you need to monitor the internal parameters of the product. At the time of writing, there were no ready-made templates for Apache Ignite, so we will write our own.

Since Ignite is written in Java, JMX is the main way to monitor it. If we simply download and run Ignite, JMX will listen on the first free port between 49112 and 65535. This does not suit monitoring: the port is usually configured in advance, and by default there is no way to detect it automatically. A look at the startup script shows that the required port can be set with the IGNITE_JMX_PORT environment variable. So, by running export IGNITE_JMX_PORT=49112 and then starting the node, we get JMX access on a static port that we know.

Since Ignite 2.10, the JMX port is set differently

Starting with version 2.10, this behavior changed, and from that version on you need to use the JVM_OPTS variable. You can open the JMX port as follows:

export JVM_OPTS="-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.port=49112 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false"

One feature worth noting: by default, the path to an MBean includes a class loader ID, which changes after every node restart. It was added so that several Ignite instances can run within the same JVM without metric conflicts, but that is not our case. For us it means that Zabbix would discover the metrics anew after each restart. You can solve this by adding the JVM option -DIGNITE_MBEAN_APPEND_CLASS_LOADER_ID=false, which removes the class loader from the path.
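The flags discussed above can be assembled in one place. The following is a minimal Python sketch, not part of Ignite or Zabbix: the function name build_jvm_opts and its parameters are invented for illustration, while the flags themselves are the ones quoted in the text.

```python
def build_jvm_opts(jmx_port: int, append_class_loader_id: bool = False) -> str:
    """Return a JVM_OPTS value that opens JMX on a fixed port (no auth, no SSL)."""
    if not 1 <= jmx_port <= 65535:
        raise ValueError(f"invalid port: {jmx_port}")
    flags = [
        "-Dcom.sun.management.jmxremote",
        f"-Dcom.sun.management.jmxremote.port={jmx_port}",
        "-Dcom.sun.management.jmxremote.authenticate=false",
        "-Dcom.sun.management.jmxremote.ssl=false",
        # "false" removes the class loader ID from the MBean path,
        # so the path stays the same across node restarts.
        f"-DIGNITE_MBEAN_APPEND_CLASS_LOADER_ID={str(append_class_loader_id).lower()}",
    ]
    return " ".join(flags)


if __name__ == "__main__":
    # Prints a line that can be pasted into the shell before starting the node.
    print(f'export JVM_OPTS="{build_jvm_opts(49112)}"')
```

Disabling authentication and SSL is reasonable only for a test cluster like the one in this article.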

mbean with classloader

Ignite is open source, so if you believe some functionality should work differently, you can take part in its development. In this case I do, so I immediately opened a task in the ASF JIRA.

As a result, the metric tree will look like this:

Jconsole interface

To make it easier to add new metrics and to change them later, we first determine which objects are of the same kind. In Java, this is usually expressed by implementing a common interface. In our case, looking at the metrics in the Thread pool section, for example, you can see that all the objects implement the ThreadPoolMXBean interface.

Description of the object

For our purposes, what matters is that each of these objects has the same basic set of metrics. In terms of Zabbix templates, this means we can configure a discovery rule for these metrics, after which the monitoring server will detect all objects of the same kind based on our rule and apply the template to them.

For example, a rule to find all dataRegions would look like this:

Creating a discovery rule

For {HOST.CONN}:{HOST.PORT}, Zabbix substitutes the address of the host the template is applied to and the value specified as its JMX port.

Debugging JMX discovery
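To make the substitution concrete, here is a small Python sketch that assembles the same strings by hand. The function names are invented for this example and the address 192.0.2.10 is a placeholder; the endpoint and key formats follow the ones used in this article.

```python
def jmx_endpoint(host: str, port: int) -> str:
    """Build the JMX service URL that the Java gateway connects to."""
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"


def discovery_key(group: str, domain: str = "org.apache") -> str:
    """Build a jmx.discovery key that matches every bean of one metric group."""
    return f'jmx.discovery[beans,"{domain}:group={group},name=*"]'


if __name__ == "__main__":
    # 192.0.2.10 stands in for {HOST.CONN}, 49112 for {HOST.PORT}.
    print(jmx_endpoint("192.0.2.10", 49112))
    print(discovery_key("DataRegionMetrics"))
```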

If it becomes necessary to debug discovery in JMX, this can be done using the zabbix_get command. An example of a discovery request for a list of data regions:

zabbix_get -s localhost -p 10052 -k '{"request":"java gateway jmx","jmx_endpoint":"service:jmx:rmi:///jndi/rmi://HOST:49112/jmxrmi","keys":["jmx.discovery[beans,\"org.apache:group=DataRegionMetrics,name=*\"]"]}'

The result will look like this:

{
   "{#JMXDOMAIN}":"org.apache",
   "{#JMXOBJ}":"org.apache:group=DataRegionMetrics,name=sysMemPlc",
   "{#JMXNAME}":"sysMemPlc",
   "{#JMXGROUP}":"DataRegionMetrics"
},{
   "{#JMXDOMAIN}":"org.apache",
   "{#JMXOBJ}":"org.apache:group=DataRegionMetrics,name=default",
   "{#JMXNAME}":"default",
   "{#JMXGROUP}":"DataRegionMetrics"
},{
   "{#JMXDOMAIN}":"org.apache",
   "{#JMXOBJ}":"org.apache:group=DataRegionMetrics,name=TxLog",
   "{#JMXNAME}":"TxLog",
   "{#JMXGROUP}":"DataRegionMetrics"
}
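Quoting a key that itself contains quotes is easy to get wrong by hand, so it can help to let a JSON library build the request and read the answer. The sketch below is an illustration only: the function names are made up, and it assumes the discovered objects arrive wrapped in a JSON array.

```python
import json


def build_request(endpoint: str, key: str) -> str:
    """Serialize a Java gateway request; json.dumps escapes the quotes inside the key."""
    return json.dumps(
        {"request": "java gateway jmx", "jmx_endpoint": endpoint, "keys": [key]},
        separators=(",", ":"),
    )


def discovered_names(rows: list[dict]) -> list[str]:
    """Return the {#JMXNAME} value of every discovered bean."""
    return [row["{#JMXNAME}"] for row in rows]


if __name__ == "__main__":
    key = 'jmx.discovery[beans,"org.apache:group=DataRegionMetrics,name=*"]'
    print(build_request("service:jmx:rmi:///jndi/rmi://HOST:49112/jmxrmi", key))

    # Assumption: the objects shown above, wrapped in a JSON array.
    sample = json.loads("""[
      {"{#JMXDOMAIN}": "org.apache", "{#JMXGROUP}": "DataRegionMetrics",
       "{#JMXOBJ}": "org.apache:group=DataRegionMetrics,name=sysMemPlc", "{#JMXNAME}": "sysMemPlc"},
      {"{#JMXDOMAIN}": "org.apache", "{#JMXGROUP}": "DataRegionMetrics",
       "{#JMXOBJ}": "org.apache:group=DataRegionMetrics,name=default", "{#JMXNAME}": "default"},
      {"{#JMXDOMAIN}": "org.apache", "{#JMXGROUP}": "DataRegionMetrics",
       "{#JMXOBJ}": "org.apache:group=DataRegionMetrics,name=TxLog", "{#JMXNAME}": "TxLog"}
    ]""")
    print(discovered_names(sample))
```

The printed request can be passed to zabbix_get -k inside single quotes, so that the shell leaves the backslashes untouched.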

An example of a metric template:

Creating a metric based on a discovery rule

Zabbix takes {#JMXNAME} and similar parameters from the result of the discovery request.
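Conceptually, each discovered object is substituted into the metric prototype to produce a concrete item. The Python sketch below imitates that substitution; it is not how Zabbix is implemented, the function name is invented, and the attribute TotalAllocatedPages is used only as an assumed example of a data region attribute.

```python
def expand_prototype(prototype: str, row: dict[str, str]) -> str:
    """Replace every discovery macro found in the prototype with its discovered value."""
    result = prototype
    for macro, value in row.items():
        result = result.replace(macro, value)
    return result


if __name__ == "__main__":
    # One object from the discovery result shown earlier.
    row = {
        "{#JMXDOMAIN}": "org.apache",
        "{#JMXOBJ}": "org.apache:group=DataRegionMetrics,name=default",
        "{#JMXNAME}": "default",
        "{#JMXGROUP}": "DataRegionMetrics",
    }
    # Item key prototype in the jmx[object_name,attribute_name] form.
    print(expand_prototype('jmx["{#JMXOBJ}",TotalAllocatedPages]', row))
    print(expand_prototype("Data region [{#JMXNAME}]: allocated pages", row))
```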

As a result, I identified several groups of metrics for which the discovery mechanism should be used:

  • Data regions

  • Cache groups

  • Caches

  • Thread pools

The rest of the metrics, such as the current coordinator, the number of client and server nodes, the number of transactions on the node, etc., are added to a separate group.

Deployment and automation

Now that we have all the necessary templates and an understanding of how the product works, we will set up monitoring for a test environment. I will use Docker for process isolation.

How it will all work:

  1. Zabbix server registers a new node after receiving the first request from the zabbix agent.

  2. Zabbix server runs a script that adds a jmx port and applies templates to the new node.

  3. Zabbix server starts sending requests to the Java gateway, which, in turn, polls the application and returns metrics.

  4. Zabbix agent receives the list of active checks from the server and starts sending their values to the Zabbix server.

  5. Zabbix server requests the values of passively collected metrics from the Zabbix agent.

Metrics from the application are received via JMX; new nodes are registered after the Zabbix agent first contacts the server.

More on why a custom script is used in step 2
  • Initially, I wanted to use built-in Zabbix functionality, but out of the box Zabbix cannot assign a JMX port to new nodes, and without it a template that uses JMX cannot be linked. A feature request for this has been in the Zabbix JIRA since 2012, but it is still open.

  • Implementation of this functionality through the API is possible, but it will require the creation of a service user and will be more difficult if you need to register a large number of nodes.

  • The database option described in the ticket may be applicable to PostgreSQL, but it does not work for Oracle, MySQL, and MariaDB, since in these databases a trigger cannot insert into the same table that fired it.

  • The option with adding only the interface within the script was also unsuccessful, since Zabbix does not allow you to control the order of operations performed in an action. They are executed in the order in which they were created, but external scripts and sending an alert are placed in a separate queue that is processed after all other operations are completed.
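For reference, the API option mentioned above could look roughly like the following. This is only a sketch of the request body for the Zabbix API method hostinterface.create, not what autoreg.php does: the function name, host ID, and address are hypothetical, and authentication is left out because it depends on the Zabbix version and on the service user.

```python
import json

JMX_INTERFACE_TYPE = 4  # Zabbix interface types: 1 agent, 2 SNMP, 3 IPMI, 4 JMX


def jmx_interface_request(host_id: str, ip: str, port: int, request_id: int = 1) -> str:
    """Build a JSON-RPC body that adds a JMX interface to an existing host."""
    payload = {
        "jsonrpc": "2.0",
        "method": "hostinterface.create",
        "params": {
            "hostid": host_id,
            "type": JMX_INTERFACE_TYPE,
            "main": 1,    # make it the default JMX interface of the host
            "useip": 1,   # connect by IP address rather than DNS name
            "ip": ip,
            "dns": "",
            "port": str(port),
        },
        "id": request_id,
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # "10105" and 192.0.2.10 are placeholders, not values from the test cluster.
    print(jmx_interface_request("10105", "192.0.2.10", 49112))
```

The body only builds the request; sending it to the API endpoint and linking the templates would be separate steps.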

Test environment diagram

How to install:

  1. Download and install docker and docker-compose, if not already installed.

  2. Download all the necessary files from the repository.

  3. Go to the downloaded folder.

  4. Build the image: docker-compose -f docker-compose-zabbix.yml build

  5. Start the cluster and the monitoring server: docker-compose -f docker-compose-zabbix.yml up

  6. In a few seconds, zabbix will be available on port 80. The default account is Admin / zabbix
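If the steps above are scripted, the "few seconds" can be replaced by an explicit wait. The helper below is a minimal sketch with an invented name; it only checks that the TCP port accepts connections, which does not guarantee that the Zabbix frontend has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed again as soon as it has been established.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    print("Zabbix port reachable:", wait_for_port("localhost", 80))
```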

How to import templates:

  1. Go to Configuration -> Templates -> Import and import the ignite_jmx.xml template (located in the folder you downloaded earlier). Along with the template itself, the ‘Templates/Ignite autoregistration’ group will be added; this name is used later to add templates from this group to new nodes.

  2. Add every template that should be applied to the group created in the previous step. The Template App Ignite JMX template is already in it; I also added Template App Generic Java JMX and Template OS Linux by Zabbix agent.

Create a script for agent auto-registration:

  1. In the Zabbix interface, go to Configuration -> Actions, select Autoregistration actions in the drop-down list, and create a new action.

  2. Give the action a name. Here you can also set the conditions for adding a node.

  3. In operations, add the Add host item. This will create a new node in Zabbix if the conditions above are met.

  4. Add the launch of the autoreg.php script, which adds a JMX port to the host settings and applies the templates from the specified group to the host passed to it. If you deploy the test cluster from the image, the script is located in the /var/lib/zabbix folder; if you install everything yourself, take it from the repository mentioned above. In my case, it runs as the command php /var/lib/zabbix/autoreg.php {HOST.HOST} 'Templates/Ignite autoregistration' '{HOST.METADATA}'. It should look like this:

Adding a script launch to an auto-registration rule

If everything was done correctly, the nodes should appear in Zabbix with the JMX port configured and the templates from the group applied. If something went wrong, the first place I recommend checking is Reports -> Audit log.
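One detail in the script command is easy to miss: the group name contains a space, so it has to be quoted to reach the script as a single argument. The Python sketch below shows the difference, assuming the command is run through a POSIX-style shell; the host name ignite-node-1 and the metadata value ignite are hypothetical values standing in for the expanded macros.

```python
import shlex


def script_arguments(command_line: str) -> list[str]:
    """Split a command line the way a POSIX shell would and drop the 'php <script>' prefix."""
    return shlex.split(command_line)[2:]


if __name__ == "__main__":
    quoted = "php /var/lib/zabbix/autoreg.php ignite-node-1 'Templates/Ignite autoregistration' 'ignite'"
    unquoted = "php /var/lib/zabbix/autoreg.php ignite-node-1 Templates/Ignite autoregistration ignite"
    print(script_arguments(quoted))    # group name arrives as one argument
    print(script_arguments(unquoted))  # group name is split into two arguments
```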

Results and where to go next

When organizing monitoring, you will always have to choose between having more metrics and the performance of both the product and the monitoring system itself. As a result of this work, we got a two-node cluster with minimal monitoring, sufficient to start using the product on a production cluster.

After setting up monitoring for your solution, you will inevitably need to refine it. If the current configuration failed to prevent an incident, new metrics need to be added. It is also sometimes useful to remove metrics you do not use; this takes load off the monitoring system, the applications, and the hardware.

In addition to Apache Ignite, your solution is likely to contain many more components, such as client applications, frontends, queues, network equipment, storage systems, and the like – all this also requires monitoring, without which some of the emergency situations will not be detected in time.

Security will matter to many. Both JMX and Zabbix support SSL connections for transferring metrics and for the web interface. But that is a separate, large topic.

What to read

  1. Site Reliability Engineering

  2. Zabbix documentation portal

  3. Out-of-the-box monitoring and control tool for Ignite and Gridgain
