Monitoring NATS JetStream in Grafana

Hello, my name is Alexander, and I am a backend developer. In this post I would like to share my experience setting up NATS JetStream monitoring: explain why it might be needed in the first place, and give an example of the stack of services, running in docker, that is required for it. The article does not cover configuring dashboards in Grafana or the principles and features of NATS itself.

A fair question is why monitoring is needed at all, especially for a product that is already operational, because monitoring is not free: it requires setting up and maintaining several extra services. However, to optimize the application and add new features, you need to understand how its functionality is actually used, and that is exactly what metrics help with. A metric is a numerical measure of some property or behavior of the software. Unlike logs, metrics do not record every detail, only an aggregate: for example, the number of requests to a service.

A simple example: an endpoint takes, say, 10 seconds to respond, and at first glance it seems worth allocating resources to optimize it. But statistics collected over a month show that the endpoint is hardly ever used. A reasonable question then arises: is it worth spending resources on optimizing it at all?

First, a few words about NATS JetStream. NATS is a messaging system written in Go that first appeared in 2010 and is well suited for real-time messaging. JetStream is its persistence layer, which adds durability and guaranteed message delivery.

The following stack will be used to monitor NATS:

  • NATS JetStream

  • Prometheus nats exporter – a service that acts as an adapter between NATS and Prometheus, converting NATS monitoring data into the Prometheus format

  • Prometheus – a separate service for collecting telemetry

  • Grafana – for directly displaying results

To run everything in docker, you can use the following docker-compose file:

version: "3.8"
services:
  nats:
    image: nats:latest
    command: --js --debug --trace --sd /data -p 4222 -m 8222
    ports:
    - 4222:4222
    - 6222:6222
    - 8222:8222
    volumes:
    - ./jetstream-cluster/n1:/data

  prometheus-nats-exporter:
    image: natsio/prometheus-nats-exporter:latest
    command: "-connz -varz -channelz -serverz -subz -healthz -routez http://host.docker.internal:8222"
    ports:
      - "7777:7777"

  prometheus:
    image: prom/prometheus:latest
    hostname: prometheus
    volumes:
      - "./prometheus.yml:/etc/prometheus/prometheus.yml"
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    hostname: grafana
    ports:
      - "3000:3000"

Now, a little more detail about configuring each of these services.

In NATS, you need to enable monitoring on the desired port. This can be done with the -m parameter when starting the server; more details can be found in the official documentation.
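
For example, the same flags used in the compose file above look like this when the server binary is run directly:

nats-server --js --sd /data -p 4222 -m 8222

Here --js enables JetStream, --sd sets the storage directory, -p is the client port, and -m is the monitoring port.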

NATS exposes several groups of metrics (monitoring endpoints). The following can be distinguished:

  • varz – general statistics

  • connz – connection statistics

  • routez – information about routes

  • subsz – information about subscriptions

  • jsz – JetStream information (streams and consumers)

The values behind these metrics can be viewed directly in the NATS service at http://localhost:8222/<endpoint>, for example http://localhost:8222/varz
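
These endpoints return JSON, so they can also be checked from the command line:

curl -s http://localhost:8222/varz

The exporter described below converts this JSON into the Prometheus text format.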

Next, the prometheus-nats-exporter must be pointed at its data source, i.e. the NATS service (the port specified in the -m parameter), and told which metrics to collect. This is done on the command line:

prometheus-nats-exporter -varz "http://localhost:8222"

where the metric endpoints to be scraped are selected with flags prefixed by a hyphen (-varz, -connz, and so on).

The resulting metrics produced by prometheus-nats-exporter can be viewed at http://localhost:7777/metrics; this path can be overridden with the -path parameter.

Metrics are exposed in the Prometheus text format: first comes the metric's description and type, then the metric itself with its labels in curly braces and its value.
Label values can be used in Grafana to group, sort, or filter results. For example, the metric "CPU resource consumption (CPU, %)" looks like this:

gnatsd_varz_cpu{server_id="..."} 1
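
In full, a metric in the exporter output looks roughly like this (the HELP line below is paraphrased for illustration, not the exporter's exact wording):

# HELP gnatsd_varz_cpu CPU usage of the NATS server, in percent
# TYPE gnatsd_varz_cpu gauge
gnatsd_varz_cpu{server_id="..."} 1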

For more details about the available metrics, see the official documentation.

The Prometheus service collects the telemetry: it keeps a list of data sources (targets) and polls them periodically. To launch Prometheus you can use a prometheus.yml configuration file.
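
A minimal prometheus.yml for this setup might look like the following (the scrape interval and job name are arbitrary choices):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'nats-exporter'
    static_configs:
      - targets: ['host.docker.internal:7777']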

The targets block (['host.docker.internal:7777']) specifies the data source (host and port), which in this case is prometheus-nats-exporter. Note that when running in docker you cannot simply use localhost: inside the Prometheus container, localhost refers to the container itself, not to the ports open on the host machine. You need to use either host.docker.internal or an address with the container name. More about setting up Prometheus can be found in a separate article.

Next, you can check that the prometheus-nats-exporter data source is reachable: all data sources available to Prometheus are listed at http://localhost:9090/targets

[Figure: Prometheus data sources]

Finally, we need to add Prometheus as a data source in Grafana and configure the display of the metrics we are interested in, after which we can proceed directly to monitoring.
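
The Prometheus data source can be added through the Grafana UI or provisioned from a file. A minimal provisioning sketch, assuming an extra volume mounts it into /etc/grafana/provisioning/datasources/ inside the grafana container, could be:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

The url uses the prometheus hostname from the compose file, since both containers are on the same compose network.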

As an example of how the metrics behave, consider the following situation: two messages are published into a new stream, and a few minutes later a Subscriber appears and starts reading this stream.

To send a message, use the nats CLI utility; the publish command has the following form:

nats pub <subject> "<message>"
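
For example, assuming a stream has already been created that captures the subject events.demo (the subject and payloads are made up for illustration), the two test messages from the scenario above could be published like this:

nats pub events.demo "message 1"
nats pub events.demo "message 2"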

The moment the Subscriber appears is marked on the graphs with a vertical line. The graph of the number of Subscribers (the gnatsd_connz_total metric, i.e. the total number of client connections) looks like this:

[Figure: Number of Subscribers]

The graph of unprocessed messages (the jetstream_consumer_num_pending metric) shows that once the Subscriber appears, the number of unprocessed messages drops to zero.

[Figure: Number of unprocessed messages]

At the same time, no changes occur in the graph of the total number of messages (metric jetstream_server_total_messages).

[Figure: Total number of messages]

The following NATS metrics can be singled out as especially important:

  • gnatsd_varz_cpu – CPU resource consumption (CPU, %)

  • gnatsd_varz_mem – RAM consumption

  • gnatsd_connz_total – Number of client connections (Subscribers)

  • jetstream_server_total_messages – Total messages in jetstream

  • jetstream_server_total_message_bytes – Size of all messages in jetstream

  • jetstream_consumer_num_pending – Pending messages in jetstream

  • gnatsd_varz_in_msgs – Incoming messages from Publisher in NATS

  • gnatsd_varz_out_msgs – Outgoing messages delivered by NATS to Subscribers

All these metrics can be displayed on a single dashboard, which makes it possible to quickly assess the current state of the entire NATS JetStream service.
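
In Grafana panels these metrics are queried with PromQL. For instance, a panel showing incoming message throughput could use a rate over the counter (the 5-minute window here is an arbitrary choice):

rate(gnatsd_varz_in_msgs[5m])

Gauges such as gnatsd_varz_cpu or jetstream_consumer_num_pending can be plotted directly, without rate().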

A wide range of different metrics are available in NATS. The necessary metrics should be selected based on the specific situation and depending on the purpose of monitoring.
