Monitoring systemd services in real time with Chronograf

All system administrators are familiar with systemd. Designed by Lennart Poettering and freedesktop.org, systemd is a very handy tool for managing services on Linux, and most modern software ships in the form of systemd services.

But what happens when a service crashes? In most cases, you only find out once some damage has already been done.

Today we will create a dashboard for monitoring systemd services in real time. It will show active, inactive, and failed services, and it will also send messages to Slack!

Dashboard for systemd

1. A few words about D-Bus

Before we get into the architecture and coding, let’s quickly recap what D-Bus is and how it can help us achieve our goal (if you’re a D-Bus expert, feel free to skip ahead to section 2).

D-Bus is an interprocess communication bus that allows multiple applications registered on it to communicate.

Applications using D-Bus act as either servers or clients: a client connects to a D-Bus server that listens for incoming connections. Once connected, an application registers on the bus and receives a name that identifies it.

The applications then exchange messages and signals, which can be intercepted by clients connected to the bus.
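To make this less abstract, here is a minimal Go sketch, assuming the github.com/godbus/dbus/v5 library, that connects to the session bus and prints the names of all applications currently registered on it:

package main

import (
    "fmt"
    "os"

    "github.com/godbus/dbus/v5"
)

func main() {
    // Connect to the per-user session bus.
    conn, err := dbus.ConnectSessionBus()
    if err != nil {
        fmt.Fprintln(os.Stderr, "failed to connect to the session bus:", err)
        os.Exit(1)
    }
    defer conn.Close()

    // Ask the bus daemon itself which names are currently registered on it.
    var names []string
    err = conn.BusObject().Call("org.freedesktop.DBus.ListNames", 0).Store(&names)
    if err != nil {
        fmt.Fprintln(os.Stderr, "failed to list bus names:", err)
        os.Exit(1)
    }
    for _, name := range names {
        fmt.Println(name)
    }
}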

D-Bus’s purpose may seem a little hazy at first, but it is a very useful tool for Linux systems.

For example, the UPower service (which monitors the power supplies) can communicate with the thermald service (which monitors the overall temperature) to reduce power consumption if it detects overheating problems (you won’t even notice).

But what’s the relationship between D-Bus and systemd service monitoring? Systemd is registered on D-Bus as the service org.freedesktop.systemd1. In addition, it sends signals to connected clients whenever the state of a systemd service changes, and that is exactly what we will use for our monitoring.

2. Receiving D-Bus signals

For this tutorial, I am using a machine with Xubuntu 18.04 and a stock kernel. It should already be running dbus-daemon and have the busctl utility installed.

This can be verified by running:

ps aux | grep dbus-daemon

The output should contain at least two entries: one for the system bus and one for the session bus. Next, run:

busctl status

This command checks the bus status and returns its configuration.

Identifying useful D-Bus signals

As stated earlier, systemd registers itself on the bus and sends signals when something happens to the services it manages.

When a service starts, stops, or crashes, systemd sends a bus signal to all available clients. Since systemd sends a lot of events, we will redirect standard output to a file to analyze them.

sudo busctl monitor org.freedesktop.systemd1 > systemd.output

In the file, we see a lot of messages, method calls, method returns, and signals.

Systemd bus signals

Notice the “ActiveState” property with the value “deactivating”? That is my InfluxDB service being stopped. We can even get the timestamp associated with the change!

The org.freedesktop.systemd1 service defines six different states: active, reloading, inactive, failed, activating, and deactivating. We are especially interested in the failed state, because it indicates a service failure.
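If you want to check a single unit by hand instead of watching the whole signal stream, you can ask systemd for the unit object and then read its ActiveState property directly. Note that influxdb.service is only an example unit, and that systemd escapes the dot in the object path as _2e:

busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager GetUnit s influxdb.service

busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/influxdb_2eservice org.freedesktop.systemd1.Unit ActiveState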

Now that we have the ability to manually intercept systemd signals on the system, let’s create a fully automated monitoring system.

3. Architecture and implementation

We will use the following architecture to monitor systemd services:

The ultimate systemd monitoring architecture

The architecture is pretty simple. The main thing is to make sure that dbus-daemon is running.

Next, we need to create a simple D-Bus client (in Go!) that will subscribe to signals from systemd. All incoming signals will be parsed and stored in InfluxDB.

After saving the data to InfluxDB, we will create a dashboard in Chronograf that displays statistics on services and their current state.

When a service crashes, Kapacitor (the streaming engine) picks up the signal and automatically sends a message to the system administrators in Slack.

It’s that simple! Right?

Building a D-Bus client in Go

The first step to catching signals from systemd is to create a simple client that will:

  1. Connect to the bus

  2. Subscribe to systemd signals

  3. Parse data and send it to InfluxDB

Note: You might be wondering why I chose Go for the D-Bus client. Client libraries for both D-Bus and InfluxDB are available in Go, making it an ideal candidate for this little experiment.

The client code is too long to include in full in this article, but below is the main function that does most of the work. The full code is available on GitHub.
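As a rough idea of what it looks like, here is a minimal sketch of such a client, assuming the github.com/godbus/dbus/v5 library; it only prints ActiveState changes, while the real client also parses the remaining properties and writes the result to InfluxDB:

package main

import (
    "fmt"
    "os"

    "github.com/godbus/dbus/v5"
)

func main() {
    // Connect to the system bus, where systemd is registered.
    conn, err := dbus.ConnectSystemBus()
    if err != nil {
        fmt.Fprintln(os.Stderr, "failed to connect to the system bus:", err)
        os.Exit(1)
    }
    defer conn.Close()

    // Ask systemd to start emitting signals; it stays silent otherwise.
    systemd := conn.Object("org.freedesktop.systemd1", "/org/freedesktop/systemd1")
    if call := systemd.Call("org.freedesktop.systemd1.Manager.Subscribe", 0); call.Err != nil {
        fmt.Fprintln(os.Stderr, "failed to subscribe to systemd signals:", call.Err)
        os.Exit(1)
    }

    // Only receive PropertiesChanged signals sent by systemd.
    err = conn.AddMatchSignal(
        dbus.WithMatchSender("org.freedesktop.systemd1"),
        dbus.WithMatchInterface("org.freedesktop.DBus.Properties"),
        dbus.WithMatchMember("PropertiesChanged"),
    )
    if err != nil {
        fmt.Fprintln(os.Stderr, "failed to add match rule:", err)
        os.Exit(1)
    }

    signals := make(chan *dbus.Signal, 64)
    conn.Signal(signals)

    for sig := range signals {
        // A PropertiesChanged body holds the interface name, the changed
        // properties, and a list of invalidated properties.
        if sig.Name != "org.freedesktop.DBus.Properties.PropertiesChanged" || len(sig.Body) < 2 {
            continue
        }
        props, ok := sig.Body[1].(map[string]dbus.Variant)
        if !ok {
            continue
        }
        if state, ok := props["ActiveState"]; ok {
            // This is where the real client builds a point and writes it to InfluxDB.
            fmt.Printf("%s -> ActiveState=%v\n", sig.Path, state.Value())
        }
    }
}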

For every single signal from systemd, a dot is created in InfluxDB. I chose this implementation because I wanted to have a complete history of all the changes taking place in different services. This can be quite useful for investigating some recurring failures over a period of time.

Implementation options

For the data structure in InfluxDB, I chose the service name as a tag (for indexing purposes), and the state (failed, active, activating, …) as the value.

I also use a simple mapping of state values to numbers, since InfluxQL aggregate functions work better with numeric values than with text.
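As an illustration, the mapping and the write could look like the sketch below. It assumes the InfluxDB 1.x Go client (github.com/influxdata/influxdb1-client/v2); the measurement name service_state, the field name value, and all numeric codes except -1 for failed are my own illustrative choices:

package main

import (
    "log"
    "time"

    client "github.com/influxdata/influxdb1-client/v2"
)

// stateCodes maps systemd states to numbers. Only -1 for "failed" matches the
// alert rule later in the article; the other codes are arbitrary illustrative
// values.
var stateCodes = map[string]int{
    "active":       1,
    "reloading":    2,
    "activating":   3,
    "deactivating": 4,
    "inactive":     0,
    "failed":       -1,
}

// writeState stores one state change as a point, with the service name as a
// tag (for indexing) and the numeric state as the field value.
func writeState(c client.Client, service, state string) error {
    bp, err := client.NewBatchPoints(client.BatchPointsConfig{Database: "systemd"})
    if err != nil {
        return err
    }
    pt, err := client.NewPoint(
        "service_state",                       // measurement name (illustrative)
        map[string]string{"service": service}, // tag: the service name
        map[string]interface{}{"value": stateCodes[state]},
        time.Now(),
    )
    if err != nil {
        return err
    }
    bp.AddPoint(pt)
    return c.Write(bp)
}

func main() {
    c, err := client.NewHTTPClient(client.HTTPConfig{Addr: "http://localhost:8086"})
    if err != nil {
        log.Fatal(err)
    }
    defer c.Close()

    // Example: record that influxdb.service has just become active.
    if err := writeState(c, "influxdb.service", "active"); err != nil {
        log.Fatal(err)
    }
}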

Note: In the above code snippet, you can see that I am getting a lot of properties from systemd, but I only retrieve the “ActiveState” property that you saw in the first section.

Now that we have a simple Go client, let’s turn it into a service, launch it, and navigate to Chronograf.

4. Dashboard for sysadmins

When we have data in InfluxDB, the fun begins. Let’s create a dashboard in Chronograf displaying service statistics and indicators for the individual services we want to track.

A dashboard consists of three main parts:

  1. The number of active, inactive, and failed services at a given time.

  2. A table showing the complete history of state changes over time for each service.

  3. 12 indicators showing 12 different systemd services that we want to highlight.

Note: It is assumed that you have some prior knowledge of Chronograf and know how to set it up and connect it to InfluxDB. Documentation is available here.

Number of active, inactive, and failed services

Here is one way to configure a single-stat cell:

Complete history of state changes

We create history tables in a similar way:

Indicators for individual services
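An indicator for a single service can be driven by a query along these lines, assuming the service_state measurement and value field from the sketch above (your names may differ):

SELECT last("value") FROM "service_state" WHERE "service" = 'influxdb.service'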

Of course, I encourage you to play with the widgets and create your own dashboards. Your dashboard should not be an exact copy of the above.

Now that we have a dashboard, we can monitor systemd services in real time. Great!

But what if we could also get real-time alerts in Slack?

5. Sending alerts when a service fails

We will be using Kapacitor (a streaming engine), which is responsible for generating and handling alerts in the event of a service failure.

After installing and running Kapacitor, let’s go back to Chronograf and go to the alert panel.

When you click the “Manage Tasks” button, you will see two sections: alert rules and TICKscripts. Let’s create a new alert by clicking the “Build Alert Rule” button.

Below is the complete alert configuration:

This alert is configured to send a message to the Slack webhook if a service is down (i.e., its state value is minus one) over a fifteen-minute period. On the Slack side, the alert looks like this:

6. Conclusion

I learned a lot by creating this little project. I had no prior experience with D-Bus or Go, and this experiment taught me that getting out of my comfort zone (even in programming) is the way to develop new skills.

The process of creating such a dashboard may seem time-consuming, but once deployed, it brings real value to operations teams and sysadmins.

If you like building your own monitoring solutions, you can certainly get some ideas from this article. But if you are more into using off-the-shelf tools, then I would definitely recommend SignalFX or Telegraf. These are reliable and efficient solutions for your infrastructure.


