Making Grafana dashboards from Prometheus metrics exporter output

While working on the data pipeline that fed data into Timescale (the data we visualized as heat maps in the previous article), we had many different components involved, each of which was prone to crashing or adding latency before the data showed up in the database and on the front end.

So monitoring also became my responsibility, and not only for the stack familiar to me as a Spring Boot developer (in our case it was Apache Camel), but for everything we could reach. There were no problems with Timescale, Apollo Node.js GraphQL, MQTT, JMeter (the test bench was under constant load testing) and the Golang services: after setting up the export of their standard metrics to Prometheus in the Micrometer format, we almost immediately found suitable ready-made dashboards on the Grafana website. Dashboards for the k8s metrics were even easier – they come provisioned with the kube-prometheus-stack Helm chart. With the rest of the zoo it turned out to be more difficult.

For Apache Camel and its custom metrics configured for each route, I had to substantially rework one of the ready-made dashboards, adapting it to visualize the data that Spring Boot Actuator exposes and that is scraped from the /actuator/prometheus endpoint. The dashboard that came out of this adventure is posted in the dashboard gallery with brief instructions on how to use it.
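The article doesn't include the Camel setup itself, but for context, here is a minimal sketch of how per-route Micrometer metrics are usually wired up in a Camel 3.x / Spring Boot application with camel-micrometer and Actuator on the classpath (the class and bean names are my own illustration, not taken from the project):

import org.apache.camel.CamelContext;
import org.apache.camel.component.micrometer.messagehistory.MicrometerMessageHistoryFactory;
import org.apache.camel.component.micrometer.routepolicy.MicrometerRoutePolicyFactory;
import org.apache.camel.spring.boot.CamelContextConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Registers Micrometer-based route metrics so they appear on /actuator/prometheus
@Configuration
public class CamelMetricsConfig {

    @Bean
    public CamelContextConfiguration camelMetricsConfiguration() {
        return new CamelContextConfiguration() {
            @Override
            public void beforeApplicationStart(CamelContext context) {
                // per-route counters and timers (exchanges total, failures, processing time, ...)
                context.addRoutePolicyFactory(new MicrometerRoutePolicyFactory());
                // per-node message history timers
                context.setMessageHistoryFactory(new MicrometerMessageHistoryFactory());
            }

            @Override
            public void afterApplicationStart(CamelContext context) {
                // nothing extra needed after startup
            }
        };
    }
}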

The situation with Apache Spark was much worse. Its metrics were enabled with a widely copy-pasted config snippet that exposes Spark's internal metrics in Micrometer format, filtering and renaming them a little, and which has become almost an industry standard. The moment these metrics started being collected, a huge pile of them fell on us, and we first had to eyeball each one to figure out what it was about and whether it was informative at all. The process had to be automated somehow: take the wall of text produced by the Apache Spark metrics endpoint and generate a dashboard of dozens of panels, so that the metrics could be understood simply by looking at them. This, to put it mildly, goes against the approach where Grafana dashboards are crafted carefully and very deliberately, avoiding anything incomprehensible or superfluous, but we needed it primarily for exploratory testing rather than production monitoring.
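The exact copy-pasted config isn't quoted in the article; one widely used option (built into Spark since 3.0) is the PrometheusServlet sink, which can be enabled straight from the job code. Take the snippet below as an assumption about that kind of setup, not as the author's actual config:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class SparkWithPrometheusMetrics {
    public static void main(String[] args) {
        // Expose Spark's internal metrics in Prometheus text format via the driver UI
        SparkConf conf = new SparkConf()
                .setAppName("pipeline-job")
                .set("spark.ui.prometheus.enabled", "true")
                .set("spark.metrics.conf.*.sink.prometheusServlet.class",
                        "org.apache.spark.metrics.sink.PrometheusServlet")
                .set("spark.metrics.conf.*.sink.prometheusServlet.path",
                        "/metrics/prometheus");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // ... job logic ...
        spark.stop();
    }
}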

So I wrote code that parses the copied output of the Prometheus metrics exporter in the Micrometer format and produces a JSON config that can be imported into Grafana as a new dashboard. It does this quite simply and primitively – it turned out that the human-readable metric descriptions that sometimes come in the # HELP lines of the Micrometer format are completely uninformative. The team remembered the metrics by their raw names and spelling, and if a panel deserved a more “business-like” title, that was done much later, not at the moment the dashboard was generated.
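The original code isn't published here, so the following is only a rough, hypothetical reimplementation of the idea: skip the # HELP / # TYPE comment lines, collect the metric names, and print a crude Grafana dashboard JSON with a raw panel and a rate() panel per metric (the real dashboard JSON model has more fields; Grafana fills in sensible defaults on import):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

public class Micrometer2Grafana {

    public static void main(String[] args) throws Exception {
        // collect unique metric names from the exporter output saved to a file
        Set<String> metrics = new LinkedHashSet<>();
        for (String line : Files.readAllLines(Path.of(args[0]))) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;   // skip HELP/TYPE comments
            // the metric name ends at the first '{' (labels) or space (value)
            int end = line.indexOf('{');
            if (end < 0) end = line.indexOf(' ');
            if (end > 0) metrics.add(line.substring(0, end));
        }

        // two panels per metric: the raw value and its rate()
        StringBuilder panels = new StringBuilder();
        int id = 1, y = 0;
        for (String m : metrics) {
            for (String expr : new String[]{m, "rate(" + m + "[5m])"}) {
                if (panels.length() > 0) panels.append(",\n");
                panels.append(String.format(
                        "{\"id\":%d,\"type\":\"timeseries\",\"title\":\"%s\"," +
                        "\"gridPos\":{\"x\":%d,\"y\":%d,\"w\":12,\"h\":8}," +
                        "\"targets\":[{\"expr\":\"%s\",\"refId\":\"A\"}]}",
                        id, expr, (id % 2 == 1) ? 0 : 12, y, expr));
                if (id % 2 == 0) y += 8;
                id++;
            }
        }
        System.out.println("{\"title\":\"Generated dashboard\",\"panels\":[\n"
                + panels + "\n],\"schemaVersion\":36}");
    }
}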

In addition, it quickly became clear that any metric is best analyzed by looking at two graphs: the raw measured value and its “first derivative” with respect to time, i.e. the PromQL function rate(). And when a dashboard has a lot of panels, it turned out to be more convenient to put rate(metric_name[5m]) right in the panel title, so the eye can instantly match the original graph of raw values with its rate counterpart.
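For example, for a hypothetical Camel counter the two neighbouring panels would query:

camel_exchanges_total

rate(camel_exchanges_total[5m])

The first graph only ever grows, while the second one shows the throughput at each moment.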

Later, another dashboard was added to monitor Hadoop. It was not planned to install any metric-exporting plugins for it, so we resorted to a trick. In Grafana you can connect Loki a second time, but as a “Prometheus” datasource. Then, by selecting this “Loki pretending to be Prometheus” as the datasource (Loki connected as a regular Loki datasource will not work here), you can attach alerts to queries that count log lines in Loki.

For example, you can ask it to build a metric like this:

count_over_time({app="hadoop"} |="INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving"[5m])

This gives a graph of how many times in 5 minutes a particular log line was written, and that log line is a sign of stable operation. From that graph you can then build a derivative (for example, rate() over the same selector), you can hang an alert on thresholds set for it – in general, the full functionality of a Prometheus metric. That is how we were able to set up alerts on Hadoop's health.

The resulting dashboards were uploaded to my humble gallery on grafana.com. It turned out that over the past year people even downloaded them, which can only mean one thing: monitoring folks keep running into a shortage of ready-made dashboards whenever they get a whole pile of new metrics they have never monitored before. Indeed, in addition to the standard metrics from applications and frameworks, developers can create a great variety of application-level metrics of their own, but they may not want to build a nice panel for each of them by hand. And if there is no panel, nobody will watch the metric, so life needs to be simplified: adding a metric and getting it monitored should be as easy as possible.
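As an illustration of why such metrics multiply so quickly: with Micrometer, registering a custom application-level counter takes only a few lines (the metric and class names below are made up for the example), and with Spring Boot Actuator it is exported on /actuator/prometheus automatically:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

// A made-up application-level metric: how many orders the service has processed
@Component
public class OrderMetrics {

    private final Counter ordersProcessed;

    public OrderMetrics(MeterRegistry registry) {
        this.ordersProcessed = Counter.builder("orders_processed_total")
                .description("Number of orders processed by the pipeline")
                .register(registry);
    }

    public void onOrderProcessed() {
        ordersProcessed.increment();
    }
}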

So I decided to make a small web utility that does online the same thing the earlier console application did for the Spark metrics. You can try it here: http://eljah.tatar/micrometer2grafana

How to use it? First, copy the metrics text exactly as it is returned by the exporter of the application you are interested in.

Example taken for Apache Camel under Spring Boot

Paste it into the input field. Choose whether you want the original metrics, their “time derivative” with the specified “step”, or both at once; you also need to enter the name of the datasource in your Grafana, plus a unique identifier and a title for the new dashboard (if the defaults don't suit you).

After publication, I hope additional fields and settings will appear, if there is feedback about them.

Copy the returned JSON – this is the config you import into Grafana.

I didn't have enough time and perseverance to make a page with a dedicated copy-to-clipboard output. In my opinion, a text/plain response that can be selected and copied in one click is all that is needed.
This is where you need to load the JSON config

After import, if you selected both the original metric and its derivative, you can see that the two sometimes tell quite different stories:

In the first case you can see the connection time growing, but this is not very informative, because a _sum metric will always accumulate and grow. The lower panel, which shows how the growth rate of that metric changes, says much more about how the system behaves.

If you found this article useful, tried the utility and have feedback, write in the comments. I will be monitoring!
