In the previous article, we deployed the Prometheus + Grafana bundle, and now it is time to connect the sources and set up visualization. But first, let me recall which elements of the IT infrastructure we are going to collect metrics from. First of all, there are the hardware, operating systems, and supporting software: everything without which our application could not function normally. Next comes monitoring of the application itself, for example, which components consume which resources. And finally, there is monitoring of the application's business logic. This can include, for example, collecting information about user activity, sales receipts, and so on.
It is not enough simply to collect metrics; it is important to interpret them correctly. So first we will decide what exactly we want to monitor, and then we will visualize the necessary metrics. The information we need depends on the type of resource the metrics come from. If we collect metrics from a host, we need: CPU, memory, processes, disk, network, and so on. For a Docker container, we will collect: CPU, memory, network, block I/O, plus metrics from the Docker daemon itself.
With application monitoring, things are somewhat more complicated. As a rule, developers know better than anyone what the application is doing and can implement the most relevant metrics. The main goal of collecting metrics from applications is to assess the state and performance of the code. It is also worth monitoring how end users actually use the application. Examples of application metrics are query response time, the number of failed user logins, and so on.
A distinctive feature of application metrics is that they are almost impossible to collect by external means, as is done, for example, when monitoring the operating system. The collection of these metrics is always described in the code of the application itself.
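As a sketch of what in-code metric collection can look like, here is a minimal application that maintains its own counters and serves them in the Prometheus text exposition format using only the Python standard library. The metric names (`http_requests_total`, `failed_logins_total`) and the port are illustrative assumptions, not taken from the article:

```python
# Minimal sketch: an application exposing its own metrics in the
# Prometheus text exposition format. Names and port are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

# Counters maintained by the application code itself
METRICS = {
    "http_requests_total": 0,
    "failed_logins_total": 0,
}
_lock = threading.Lock()

def inc(name, value=1):
    """Increment a counter; called from the application's request handlers."""
    with _lock:
        METRICS[name] += value

def render_metrics():
    """Render all counters as Prometheus text exposition lines."""
    with _lock:
        return "".join(f"{name} {value}\n" for name, value in METRICS.items())

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves GET /metrics for Prometheus to scrape."""
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To actually serve the endpoint:
# HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever()
```

In a real project, one would normally use an official Prometheus client library instead of hand-rolling the exposition format, but the point stands: the metrics live inside the application's own code.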
The collection of business metrics can be carried out both by means of the application itself and, where possible, with the help of additional tools.
Recommendations for collecting metrics
Let us consider recommendations for collecting various metrics. The best-known method for collecting infrastructure metrics is USE: Utilization, for example disk load; Saturation, for example the disk queue; Errors, for example disk I/O errors.
Here is a typical list of resources that USE recommends collecting metrics from.
CPUs: sockets, cores, hardware threads (virtual CPUs)
Storage devices: I/O, capacity
Controllers: storage, network cards
Interconnects: CPUs, memory, I/O
Keep in mind that some components represent two types of resource: storage devices are both a service-request (I/O) resource and a capacity resource, and either type can become a bottleneck in the system. Some physical components, such as hardware caches (e.g., CPU caches, MMU TLB/TSB), have been omitted from the list. The USE method is most effective for resources that degrade under high load or saturation, producing a bottleneck; caches, by contrast, improve performance under high load.
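To make the three USE dimensions concrete, here is a small sketch that evaluates a disk from two samples of hypothetical kernel counters. The field names (`busy_ms`, `queue_len`, `io_errors`) are illustrative assumptions, not a real `/proc/diskstats` parser:

```python
# Sketch: evaluating a disk against the USE checklist from two samples of
# (hypothetical) kernel counters. Field names are illustrative.
def use_report(prev, curr, interval_s):
    """Return utilization, saturation and errors for one device.

    prev/curr are dicts with:
      busy_ms   - total time the device was busy, in milliseconds
      queue_len - current I/O queue length (saturation proxy)
      io_errors - cumulative I/O error count
    """
    busy_delta_ms = curr["busy_ms"] - prev["busy_ms"]
    return {
        # Fraction of the sampling interval the device was busy
        "utilization": min(busy_delta_ms / (interval_s * 1000.0), 1.0),
        # A non-empty queue means requests are waiting (saturation)
        "saturation": curr["queue_len"],
        # New errors since the previous sample
        "errors": curr["io_errors"] - prev["io_errors"],
    }

prev = {"busy_ms": 1000, "queue_len": 0, "io_errors": 0}
curr = {"busy_ms": 1900, "queue_len": 3, "io_errors": 0}
report = use_report(prev, curr, interval_s=1)
# report["utilization"] == 0.9 (device busy 90% of the interval)
```

A device at 90% utilization with a queue of 3 is exactly the kind of saturated resource the USE method is designed to surface.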
Deciding whether to include a particular resource in monitoring is best done empirically: first enable monitoring of the resources you think you need and look at the result. If it does not suit you (for example, the metric is uninformative, always 0, or constant), look in Prometheus for another, similar metric to monitor.
The author of the USE method, Brendan Gregg, also offers another way to determine which resources you need to collect metrics for: draw a functional block diagram of the system. Such a diagram shows the relationships between components, which can be very useful for finding bottlenecks in the data flow. Here is an example of such a diagram for a SunFire server:
On such a diagram of the hardware, you can annotate the bandwidth of buses and interfaces and, where applicable, the amount of memory, frequencies, temperatures, and other parameters. Thanks to such an "enriched" diagram, we can collect metrics and monitor the system effectively.
In general, the USE method is shown in flowchart form below.
If the USE method is more suitable for infrastructure monitoring, then the RED method is more suitable for selecting application and service metrics. The abbreviation RED stands for: Rate – requests per second, Errors – errors per second, Duration – time for each request. The main metrics that are proposed to be measured by the RED method are:
Rate (number of requests per second)
Errors (number of requests that failed)
Duration (the amount of time these requests take)
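The three RED metrics can be sketched as a small function that aggregates a window of completed requests. The record shape `(duration_seconds, succeeded)` is an illustrative assumption:

```python
# Sketch: computing RED metrics from a list of completed requests.
# Each request is (duration_seconds, succeeded) - an illustrative shape.
def red_metrics(requests, window_s):
    """Rate (req/s), Errors (failed req/s) and Duration (average seconds)."""
    total = len(requests)
    failed = sum(1 for _, ok in requests if not ok)
    avg_duration = sum(d for d, _ in requests) / total if total else 0.0
    return {
        "rate": total / window_s,       # requests per second
        "errors": failed / window_s,    # failed requests per second
        "duration": avg_duration,       # average time per request
    }

# Ten requests observed over a 5-second window, two of them failed
sample = [(0.1, True)] * 8 + [(0.5, False)] * 2
m = red_metrics(sample, window_s=5)
# m["rate"] == 2.0, m["errors"] == 0.4
```

In practice these values are usually computed by Prometheus itself (e.g., with `rate()` over counters and histograms), but the arithmetic is the same.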
A distinctive feature of the RED method is that it lets you monitor how satisfied your users are. If your site throws a lot of loading errors, or pages take tens of seconds to load, visitors are unlikely to be happy with it.
Four Golden Signals by Google
The Four Golden Signals is a metric selection principle described in Site Reliability Engineering by Google. These are the following four signals:
Latency – response time
Traffic – request frequency
Errors – error rate
Saturation – how heavily the resource is utilized (loaded)
By monitoring these four types of signals, you will be able to detect most of the problems and bottlenecks in the system. This method can be used for both infrastructure monitoring and application monitoring.
Why do you really need visualization? The first answer that comes to mind is: for beauty. And in fact, that answer is not entirely wrong. On good graphs, which are what visualization ultimately produces, you can effectively observe changes in systems, track trends, and analyze the results. So visualization of the collected metrics is a perfectly legitimate stage in the development of any monitoring system.
In the previous article, we deployed Prometheus and Grafana; now we will connect and visualize metrics from Docker as an example.
Keeping track of containers
To start collecting metrics from Docker, we first need to create the /etc/docker/daemon.json file with the following content:
{
  "metrics-addr" : "127.0.0.1:9323",
  "experimental" : true
}
Here metrics-addr is the address and port (9323) on which the Docker daemon will expose its metrics endpoint for Prometheus to scrape (in my case, everything runs on the same host). For the settings to take effect, restart Docker:
systemctl restart docker
Next, you need to change the Prometheus settings in the /etc/prometheus/prometheus.yml file. Find scrape_configs in it (it should already contain a block for collecting Prometheus's own metrics) and add the following block:
  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:9323']
It should look something like this:
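Assuming a default installation in which Prometheus also scrapes its own metrics, the resulting scrape_configs section of /etc/prometheus/prometheus.yml would read roughly as follows (the 'prometheus' job is the stock default; only the 'docker' job is new):

```yaml
scrape_configs:
  # Default job: Prometheus scrapes its own metrics
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Docker daemon metrics endpoint enabled in /etc/docker/daemon.json
  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:9323']
```

After editing the file, restart or reload Prometheus so it picks up the new scrape target.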
Now open Prometheus, go to Status -> Targets, and make sure there is a job collecting metrics from Docker.
That’s it for Prometheus. Now open the Grafana interface and check that the Data Sources section contains a Prometheus data source on port 9090. Let’s move on to creating a new dashboard. In my example, there will be four panels:
engine_daemon_image_actions (the graph will show the total number of actions with images),
engine_daemon_network_actions (network activities),
engine_daemon_events_total (total number of events) and
engine_daemon_container_states (statistics on container states).
To add panels, click New dashboard -> Add query, then specify the required metric. For example:
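As an illustration (the original screenshot is not reproduced here), entering just a metric name from the list above is already a valid PromQL query:

```promql
engine_daemon_container_states
```

More elaborate expressions are also possible, for example wrapping a counter such as engine_daemon_events_total in rate() to plot a per-second event rate.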
The entered query will display container statistics. Next, click the Visualization icon (on the lower left) and choose the chart type. In my example, I will keep the first option for this metric.
At the third step, you can leave everything unchanged. At the fourth, you can specify the conditions for creating an alert.
Repeating these steps for the remaining panels, we get the following dashboard.
We now have a dashboard that displays the state of Docker.
In this article, we reviewed the main recommendations for monitoring infrastructure and applications and, as an example, connected Docker metrics collection to Prometheus and Grafana and built the corresponding dashboard. The next article will be devoted entirely to collecting traces directly from applications.
In conclusion, I would like to recall one simple truth: collecting data is cheap, but not having it when you need it can be very costly. Therefore, make sure that all useful data that is reasonable to collect is in fact collected.
I would also like to invite you to a free webinar, where we look at the main tools for working with the network in Linux, found in such popular distributions as CentOS, Ubuntu, and Arch Linux.