Zabbix vs Prometheus. What to choose for heterogeneous infrastructure?
Introduction
It's no secret that regulators are currently pushing state-owned companies to adopt domestic Linux-based operating systems in their infrastructure. This creates headaches both for ordinary users accustomed to the familiar Windows interface and to the text editors and programs they have used for years, and for engineers who have spent years running Windows infrastructure. The latter now face the task of migrating services to another OS and, often, of training users to work in the new environment.
As a rule, sooner or later the system reaches an equilibrium: some users stay on Windows workstations while others work on Unix-like operating systems. The infrastructure becomes heterogeneous, and monitoring is essential to keep it running reliably.
In this article we will look at the advantages and disadvantages of two popular monitoring stacks, and deploy each of them in a heterogeneous Windows-Linux environment.
A little theory
What does monitoring do?
1. Detect problems in time
2. Evaluate the performance of servers and hosts and prevent them from going down
3. Detect abnormal activity and respond to intrusion attempts
4. Analyze trends and plan host and network optimization in good time
5. Prepare reports and analytics that help management see why the investment is needed
As always, the choice of tool depends on the task. Let's compare Zabbix with Prometheus (using Grafana to visualize metrics).
Zabbix
Advantages:
1. Complete solution from a single source: Zabbix is a comprehensive solution that offers monitoring, alerting and reporting in one system.
2. Ready to use out of the box: Has many pre-installed templates and integrations, making it easy to configure and deploy.
3. Real-time data collection: supports short polling intervals and can monitor data in near real time.
4. Support for various data types: works with agents for various platforms and supports SNMP, IPMI, JMX, and more.
5. Extensive notification system: Supports various notification channels and flexible trigger conditions.
Disadvantages:
1. Difficult to customize: Despite the availability of templates, integration and adaptation for specific needs can require significant effort.
2. Data storage: Uses relational databases to store data, which on a large scale can require serious resources.
3. Scalability: With large volumes of data, scalability and performance issues may arise.
Prometheus + Grafana
Advantages:
1. Modularity and flexibility: Prometheus is responsible for collecting, storing and processing metrics, while Grafana is used to visualize them, providing flexibility in the choice of tools.
2. Good scalability: Prometheus is designed to collect metrics on large and dynamic systems, providing high performance and scalability.
3. Powerful query language: PromQL allows you to perform complex analytical tasks and obtain the necessary information from the collected metrics.
4. Integration with cloud services: Support for clouds and container environments such as Kubernetes.
5. Community and Ecosystem: Large ecosystem and strong community of developers and users.
Disadvantages:
1. Short-term data storage: primarily focused on short-term storage, although integration with external storage solves this.
2. Less integrated alerting: Prometheus relies on the separate Alertmanager component, which is less tightly integrated than Zabbix's built-in alerting.
3. Complexity of setup and operation: you need to configure and administer two or more separate components (Prometheus and Grafana), plus Alertmanager.
Practice
We will deploy the systems one after another, following the diagrams.
Zabbix
Installing the Zabbix server is very easy: the official website provides a configurator with all the commands needed to prepare a Linux host, so we will not cover this step in detail.
Server configuration
If we want to automate connecting agents to the server, group them by operating system type, and link them to specific templates, we should configure auto-registration. To do this, go to:
Alerts → Actions → Autoregistration actions
Create action → Name: autoreg_linux
Conditions → Host metadata → contains → Linux
Operations:
Add host
Activate host
Send a message to users: Admin via all notification methods
Add to host groups: Station_linux
Link to templates: Linux by Zabbix agent
Create action → Name: autoreg_windows
Conditions → Host metadata → contains → Windows
Operations:
Add host
Activate host
Send a message to users: Admin via all notification methods
Add to host groups: Station_Windows
Link to templates: Windows by Zabbix agent
Don't forget to create the appropriate groups in advance.
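The same auto-registration actions can also be created programmatically through the Zabbix JSON-RPC API (`action.create`). A minimal sketch of the request payload; the numeric constants follow the Zabbix 6.x API documentation (eventsource 2 = autoregistration, conditiontype 24 = host metadata, operationtype 2/4/6 = add host / add to host group / link template), while the token, group ID and template ID are placeholders, not real values:

```python
import json

# JSON-RPC payload mirroring the autoreg_linux action created in the UI.
payload = {
    "jsonrpc": "2.0",
    "method": "action.create",
    "params": {
        "name": "autoreg_linux",
        "eventsource": 2,            # autoregistration event
        "filter": {
            "evaltype": 0,           # and/or
            "conditions": [
                {"conditiontype": 24,  # host metadata
                 "operator": 2,        # contains
                 "value": "Linux"},
            ],
        },
        "operations": [
            {"operationtype": 2},                                  # add host
            {"operationtype": 4, "opgroup": [{"groupid": "15"}]},  # add to host group (placeholder ID)
            {"operationtype": 6, "optemplate": [{"templateid": "10001"}]},  # link template (placeholder ID)
        ],
    },
    "auth": "PLACEHOLDER_TOKEN",
    "id": 1,
}

body = json.dumps(payload)
```

Send `body` in an HTTP POST to `http://<server>/zabbix/api_jsonrpc.php` with `Content-Type: application/json-rpc`.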
Preparing Hosts – Linux
Manual installation is as easy as the server installation.
Let's instead look at how to install the agent centrally and automatically. I use Puppet and Ansible for this: the former for long-lived infrastructure whose hosts are user workstations that may periodically be unavailable, the latter mainly for virtual hosts and servers.
Why so? Puppet uses agents on the hosts that periodically reconcile the local configuration against the one on the server; reconciliation happens regularly, so sooner or later every machine is brought into line. Ansible, by contrast, needs the host to be powered on and ready to accept the standalone Python payload the control node pushes to it. Ansible has no agents, which is its advantage and, in certain cases, its disadvantage.
In our case we use Ansible, and to get started we create three directories with the following structure:
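The layout boils down to an inventory with group variables, a roles tree, and a playbooks directory. A small sketch that scaffolds it in one pass (the project root name `ansible-zabbix` is my placeholder):

```python
from pathlib import Path

# Scaffold the Ansible project layout used in this article.
root = Path("ansible-zabbix")  # placeholder project root
for d in ("inventory/group_vars", "roles", "playbooks"):
    (root / d).mkdir(parents=True, exist_ok=True)

# Global variables file that the roles read (zabbix_server_addr etc.).
(root / "inventory/group_vars/all.yml").touch()
```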
One particular: we put the Zabbix server address in a global variable. I apologize in advance if this is unusual, but to keep an already long article compact I will give all commands and file contents inline in one listing, in this specific format: a line with the file path, then #> opening its contents and #<> closing them:
touch inventory/group_vars/all.yml
#>
---
zabbix_server_addr: 192.168.2.101
#<>
Role
mkdir roles
# create the role directory skeleton
ansible-galaxy role init roles/zabbix_agent
# edit:
roles/zabbix_agent/tasks/main.yml
#>
---
- name: Add repo Zabbix
  ansible.builtin.template:
    src: zabbix.list.j2
    dest: /etc/apt/sources.list.d/zabbix.list
    owner: root
    group: root
    mode: '0644'

- name: Add Zabbix GPG Key
  ansible.builtin.copy:
    src: zabbix-official-repo.gpg
    dest: /etc/apt/trusted.gpg.d/
    owner: root
    group: root
    mode: '0644'
  notify: Update_apt_cache

- name: Install Zabbix
  ansible.builtin.apt:
    name: zabbix-agent
    state: present

- name: Config Zabbix-agent
  ansible.builtin.template:
    src: zabbix_agentd.conf.j2
    dest: /etc/zabbix/zabbix_agentd.conf
    owner: zabbix
    group: zabbix
    mode: '0644'
  vars:
    current_hostname: "{{ ansible_facts.hostname }}"

- name: Ensure log directory exists
  ansible.builtin.file:
    path: /var/log/zabbix
    state: directory
    owner: zabbix
    group: zabbix
    mode: '0755'

- name: Start, autostart Zabbix-agent
  ansible.builtin.service:
    name: zabbix-agent
    state: restarted
    enabled: true

# --- service test
- name: Pause
  ansible.builtin.pause:
    seconds: 3
  tags: test

- name: Test service status
  ansible.builtin.systemd:
    name: zabbix-agent
    state: started
  register: service_status
  tags: test

- name: Message service fail
  ansible.builtin.fail:
    msg: "Service is not running!"
  when: service_status.status.ActiveState != 'active'

- name: Message service success
  ansible.builtin.debug:
    msg: "Service started successfully!"
  when: service_status.status.ActiveState == 'active'
#<>
touch roles/zabbix_agent/templates/zabbix.list.j2
#>
# Generate for Ansible
{% for repo in zabbix_agent__repo %}
{{ repo }}
{% endfor %}
#<>
roles/zabbix_agent/defaults/main.yml
#>
---
zabbix_agent__repo:
  - deb https://repo.zabbix.com/zabbix/6.4/debian bookworm main
  - deb-src https://repo.zabbix.com/zabbix/6.4/debian bookworm main
#<>
# copy the repository GPG key into the role
cp /etc/apt/trusted.gpg.d/zabbix-official-repo.gpg roles/zabbix_agent/files/
# handler
roles/zabbix_agent/handlers/main.yml
#>
---
# refresh the package manager cache
- name: Update_apt_cache
  ansible.builtin.apt:
    update_cache: true
#<>
# zabbix agent config template
touch roles/zabbix_agent/templates/zabbix_agentd.conf.j2
#>
# Generate for Ansible
PidFile=/run/zabbix/zabbix_agentd.pid
LogFile=/var/log/zabbix/zabbix_agentd.log
LogFileSize=0
Server={{ zabbix_server_addr }}
ServerActive={{ zabbix_server_addr }}
Hostname={{ current_hostname }}
# Include=/etc/zabbix/zabbix_agentd.d/*.conf
HostMetadataItem=system.uname
#<>
What the role does:
Adds the repository via a template and installs the corresponding GPG key, which must be downloaded in advance (it appears on the host when you install the Zabbix server)
Installs the agent and configures it through a Jinja template (the HostMetadataItem=system.uname line in the agent configuration lets the server's auto-registration actions determine which operating system they are dealing with)
Checks that the agent service is running without errors.
Don't forget to open port 10050 on the host and 10051 on the server.
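For illustration, the matching that auto-registration performs on this metadata is plain substring logic. A toy sketch of the decision the server-side conditions encode (not Zabbix code):

```python
def pick_autoreg_action(host_metadata):
    """Return the auto-registration action whose 'contains' condition matches."""
    # HostMetadataItem=system.uname yields strings like
    # "Linux deb01 6.1.0-18-amd64 ..." or "Windows WS01 10.0.19045 ...".
    if "Linux" in host_metadata:
        return "autoreg_linux"
    if "Windows" in host_metadata:
        return "autoreg_windows"
    return None  # no condition matched; the host is not auto-registered

pick_autoreg_action("Linux deb01 6.1.0-18-amd64 #1 SMP")  # -> "autoreg_linux"
```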
Playbook
mkdir playbooks
touch playbooks/zabbix.yml
playbooks/zabbix.yml
#>
- name: Zabbix
  hosts: station
  become: true
  roles:
    - zabbix_agent
#<>
After running the playbook, the specified hosts appear on our server and are linked to the necessary templates.
Preparing Hosts – Windows
Download the msi package from the official website, checking that the agent version matches the server version. The most convenient way to distribute the package is through Active Directory group policies; I will not describe this in detail, since it is straightforward and well covered elsewhere. I will only note that before distribution the msi package must be modified with the Orca editor. (You can also edit the configuration file instead, but I personally prefer this method.)
To do this, add rows with the following values to the Property table:
SERVER = 192.168.2.101
SERVERACTIVE = 192.168.2.101
LISTENPORT = 30001
Do not forget that for auto-registration on the server the agent configuration file must carry host metadata; for this we prepare a script and run it through group policies.
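The script itself only needs to make an idempotent edit to the agent configuration. A sketch of that logic in Python (on a real domain you would more likely write it in PowerShell; the path and the metadata value below are assumptions, not values from this setup):

```python
from pathlib import Path

def ensure_host_metadata(conf_path, value="Windows"):
    """Append or rewrite the HostMetadata line so auto-registration can match it."""
    path = Path(conf_path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    # Drop any existing HostMetadata line, then append ours: running the
    # script repeatedly via GPO leaves exactly one line in place.
    lines = [l for l in lines if not l.startswith("HostMetadata=")]
    lines.append(f"HostMetadata={value}")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")

# Example (hypothetical default install path of the Windows agent):
# ensure_host_metadata(r"C:\Program Files\Zabbix Agent\zabbix_agentd.conf")
```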
We update the policies, make sure the necessary ports are open, and return to the server.
Prometheus + Grafana
Here things are a little more involved. Prometheus is designed more for lightweight launches in containers, so to run it as a monitoring server we need to create services for the corresponding executables and connect the components required for alerting. Doing all of this by hand is extremely time-consuming, so let's deploy the server via Ansible using a role.
mkdir roles
# create the role directory skeleton
ansible-galaxy role init roles/prometheus_server
# edit:
roles/prometheus_server/tasks/main.yml
#>
---
# Prometheus
- name: Create Prometheus directory
  ansible.builtin.file:
    path: /etc/prometheus
    state: directory
    mode: '0755'

- name: Download Prometheus
  ansible.builtin.get_url:
    url: "{{ prometheus_server__url }}"
    dest: "{{ prometheus_server__archive_path }}"
    mode: '0644'
    force: false

- name: Extract Prometheus
  ansible.builtin.unarchive:
    src: "{{ prometheus_server__archive_path }}"
    dest: /tmp/
    remote_src: true

- name: Move Prometheus binaries
  ansible.builtin.copy:
    src: "{{ prometheus_server__extract_path }}/{{ item }}"
    dest: /usr/local/bin/
    mode: '0755'
    remote_src: true
  with_items:
    - prometheus
    - promtool

- name: Move Prometheus configs
  ansible.builtin.copy:
    src: "{{ prometheus_server__extract_path }}/{{ item }}"
    dest: /etc/prometheus/
    remote_src: true
    mode: '0644'
  with_items:
    - consoles
    - console_libraries

- name: Create Prometheus user
  ansible.builtin.user:
    name: prometheus
    shell: /bin/false

- name: Change ownership of Prometheus directories
  ansible.builtin.file:
    path: /etc/prometheus
    owner: prometheus
    group: prometheus
    recurse: true

- name: Create Prometheus data directory
  ansible.builtin.file:
    path: /var/lib/prometheus/
    state: directory
    mode: '0755'
    owner: prometheus
    group: prometheus

# Alertmanager
- name: Create configuration directory Alertmanager
  ansible.builtin.file:
    path: /etc/alertmanager
    state: directory
    mode: '0755'

- name: Download Alertmanager
  ansible.builtin.get_url:
    url: "{{ prometheus_server__alertmanager_url }}"
    dest: "{{ prometheus_server__alertmanager_archive_path }}"
    mode: '0644'

- name: Extract Alertmanager
  ansible.builtin.unarchive:
    src: "{{ prometheus_server__alertmanager_archive_path }}"
    dest: /tmp/
    remote_src: true

- name: Move Alertmanager binaries
  ansible.builtin.copy:
    src: "{{ prometheus_server__alertmanager_extract_path }}/{{ item }}"
    dest: /usr/local/bin/
    mode: '0755'
    remote_src: true
  with_items:
    - alertmanager
    - amtool

- name: Create Alertmanager user
  ansible.builtin.user:
    name: alertmanager
    shell: /bin/false

- name: Change ownership of Alertmanager directories
  ansible.builtin.file:
    path: /etc/alertmanager
    owner: alertmanager
    group: alertmanager
    recurse: true

- name: Create Alertmanager data directory
  ansible.builtin.file:
    path: /var/lib/alertmanager/
    state: directory
    mode: '0755'
    owner: alertmanager
    group: alertmanager

# config files
- name: Deploy config alert_rules
  ansible.builtin.copy:
    src: "files/{{ item }}"
    dest: /etc/prometheus/
    owner: prometheus
    group: prometheus
    mode: '0644'
  with_items:
    - alert_rules.yml

- name: Deploy config prometheus
  ansible.builtin.template:
    src: "prometheus.yml.j2"
    dest: /etc/prometheus/prometheus.yml
    owner: prometheus
    group: prometheus
    mode: '0644'

- name: Deploy config alertmanager
  ansible.builtin.template:
    src: "alertmanager.yml.j2"
    dest: /etc/alertmanager/alertmanager.yml
    owner: alertmanager
    group: alertmanager
    mode: '0644'

# services
- name: Template Prometheus service
  ansible.builtin.template:
    src: templates/prometheus.service.j2
    dest: /etc/systemd/system/prometheus.service
    mode: '0644'

- name: Template Alertmanager service
  ansible.builtin.template:
    src: templates/alertmanager.service.j2
    dest: /etc/systemd/system/alertmanager.service
    mode: '0644'

- name: Reload systemd
  ansible.builtin.systemd:
    daemon_reload: true

- name: Enable and restart Prometheus service
  ansible.builtin.systemd:
    name: prometheus
    enabled: true
    state: restarted

- name: Enable and restart Alertmanager service
  ansible.builtin.systemd:
    name: alertmanager
    enabled: true
    state: restarted

# --- tests
- name: Pause
  ansible.builtin.pause:
    seconds: 3
  tags: test

- name: Test service status prometheus
  ansible.builtin.systemd:
    name: prometheus
    state: started
  register: service_status_prometheus
  tags: test

- name: Test service status alertmanager
  ansible.builtin.systemd:
    name: alertmanager
    state: started
  register: service_status_alertmanager
  tags: test

- name: Message service fail prometheus
  ansible.builtin.fail:
    msg: "Service is not running! - prometheus"
  when: service_status_prometheus.status.ActiveState != 'active'

- name: Message service fail alertmanager
  ansible.builtin.fail:
    msg: "Service is not running! - alertmanager"
  when: service_status_alertmanager.status.ActiveState != 'active'

- name: Message service success prometheus
  ansible.builtin.debug:
    msg: "Service started successfully! - prometheus"
  when: service_status_prometheus.status.ActiveState == 'active'

- name: Message service success alertmanager
  ansible.builtin.debug:
    msg: "Service started successfully! - alertmanager"
  when: service_status_alertmanager.status.ActiveState == 'active'
#<>
# create the service and configuration file templates for prometheus and alertmanager
roles/prometheus_server/templates/alertmanager.service.j2
#>
[Unit]
Description=AlertManager Service
After=network.target
[Service]
User=alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager
Restart=always
[Install]
WantedBy=multi-user.target
#<>
roles/prometheus_server/templates/alertmanager.yml.j2
#>
route:
  group_by: ['alertname', 'instance', 'severity']
  group_wait: 20s
  group_interval: 20s
  repeat_interval: 12h
  receiver: 'telepush'
receivers:
  - name: 'telepush'
    webhook_configs:
      - url: '{{ prometheus_server__url_telepush }}'
        http_config: {}
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
#<>
roles/prometheus_server/templates/prometheus.service.j2
#>
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/
[Install]
WantedBy=multi-user.target
#<>
roles/prometheus_server/templates/prometheus.yml.j2
#>
global:
  scrape_interval: 20s
  evaluation_interval: 10s

rule_files:
  - alert_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  # server itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # linux
  - job_name: 'linux'
    static_configs:
      - targets: ['192.168.2.101:9100']
      - targets: ['192.168.2.103:9100']
  # windows
  - job_name: 'windows'
    static_configs:
      - targets: ['192.168.2.100:9182']
      - targets: ['192.168.2.102:9182']
#<>
# internal role variables that build the download URLs and file placement paths
roles/prometheus_server/vars/main.yml
#>
---
prometheus_server__url: "{{ prometheus_server__repo }}/v{{ prometheus_server__version }}/prometheus-{{ prometheus_server__version }}.linux-amd64.tar.gz"
prometheus_server__repo: "https://github.com/prometheus/prometheus/releases/download"
prometheus_server__archive_path: "/tmp/prometheus-{{ prometheus_server__version }}.linux-amd64.tar.gz"
prometheus_server__extract_path: "/tmp/prometheus-{{ prometheus_server__version }}.linux-amd64"
prometheus_server__alertmanager_url: "https://github.com/prometheus/alertmanager/releases/download/{{ prometheus_server__alertmanager_filename }}.tar.gz"
prometheus_server__alertmanager_filename: "v{{ prometheus_server__alertmanager_version }}/alertmanager-{{ prometheus_server__alertmanager_version }}.linux-amd64"
prometheus_server__alertmanager_archive_path: "/tmp/alertmanager-{{ prometheus_server__alertmanager_version }}.linux-amd64.tar.gz"
prometheus_server__alertmanager_extract_path: "/tmp/alertmanager-{{ prometheus_server__alertmanager_version }}.linux-amd64"
#<>
# handler that refreshes the package manager cache
roles/prometheus_server/handlers/main.yml
#>
---
- name: Update_apt_cache
  ansible.builtin.apt:
    update_cache: true
#<>
# the alert rules config lives in the files directory rather than templates: rendering it with Jinja would require escaping certain characters (the {{ $labels }} placeholders), and no variables are passed into it anyway
roles/prometheus_server/files/alert_rules.yml
#>
groups:
  - name: Critical alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
          summary: Instance {{ $labels.instance }} down
#<>
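A note on the `for: 5m` clause above: the expression must stay true continuously for that long, with the alert sitting in a pending state in the meantime. A toy model of that pending-to-firing logic (an illustration of the semantics, not Prometheus code):

```python
def alert_state(samples, for_seconds=300):
    """samples: list of (timestamp, up_value) pairs in time order.
    Returns the alert state after the last sample: inactive | pending | firing."""
    breach_start = None
    state = "inactive"
    for ts, up in samples:
        if up == 0:
            if breach_start is None:
                breach_start = ts  # condition just became true
            state = "firing" if ts - breach_start >= for_seconds else "pending"
        else:
            breach_start = None   # any recovery resets the 'for' timer
            state = "inactive"
    return state
```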
# and defaults hold the variables that can be overridden when running the playbooks
roles/prometheus_server/defaults/main.yml
#>
---
prometheus_server__version: "2.54.0"
prometheus_server__alertmanager_version: "0.27.0"
#<>
In my case notifications go through Telegram, so we add a file inventory/secrets.yml with the variable prometheus_server__url_telepush containing the webhook URL of the Telepush Telegram bot. Don't forget to encrypt this file with Ansible Vault.
In my case, notifications about hosts look like this.
Setting up Linux stations
To configure the stations, we need a role that installs Node Exporter and creates a service to run its binary.
mkdir roles
# create the role directory skeleton
ansible-galaxy role init roles/prometheus_nodeexporter
# edit:
roles/prometheus_nodeexporter/tasks/main.yml
#>
---
- name: Skip non-Debian hosts
  when: ansible_os_family != 'Debian'
  block:
    - name: Message
      ansible.builtin.debug:
        msg: "Skipping tasks."

- name: Install Node Exporter
  when: ansible_os_family == 'Debian'
  block:
    - name: Download Node Exporter
      ansible.builtin.get_url:
        url: "{{ prometheus_nodeexporter_url }}"
        dest: "/tmp/node_exporter.tar.gz"
        mode: '0644'

    - name: Extract Node Exporter
      ansible.builtin.unarchive:
        src: "/tmp/node_exporter.tar.gz"
        dest: "/usr/local/bin"
        remote_src: true

    - name: Move Node Exporter binary
      ansible.builtin.copy:
        src: /usr/local/bin/node_exporter-{{ prometheus_nodeexporter__exporter_version }}.linux-amd64/node_exporter
        dest: /usr/local/bin/node_exporter
        owner: root
        group: root
        mode: '0755'
        remote_src: true

    - name: Template service
      ansible.builtin.template:
        src: node_exporter.service.j2
        dest: /etc/systemd/system/node_exporter.service
        mode: '0644'

    - name: Reload systemd
      ansible.builtin.systemd:
        daemon_reload: true

    - name: Enable and start Node Exporter service
      ansible.builtin.systemd:
        name: node_exporter
        enabled: true
        state: restarted

    - name: Clean up
      ansible.builtin.file:
        path: /tmp/node_exporter.tar.gz
        state: absent

    # --- service test
    - name: Pause
      ansible.builtin.pause:
        seconds: 3
      tags: test

    - name: Test service status
      ansible.builtin.systemd:
        name: node_exporter
        state: started
      register: service_status
      tags: test

    - name: Message service fail
      ansible.builtin.fail:
        msg: "Service is not running!"
      when: service_status.status.ActiveState != 'active'

    - name: Message service success
      ansible.builtin.debug:
        msg: "Service started successfully!"
      when: service_status.status.ActiveState == 'active'
#<>
roles/prometheus_nodeexporter/templates/node_exporter.service.j2
#>
[Unit]
Description=Node Exporter
[Service]
User=nobody
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=default.target
#<>
roles/prometheus_nodeexporter/vars/main.yml
#>
---
prometheus_nodeexporter_url: "https://github.com/prometheus/node_exporter/releases/latest/download/{{ prometheus_nodeexporter_file }}"
prometheus_nodeexporter_file: "node_exporter-{{ prometheus_nodeexporter__exporter_version }}.linux-amd64.tar.gz"
#<>
roles/prometheus_nodeexporter/files/prometheus.yml
#>
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
#<>
roles/prometheus_nodeexporter/defaults/main.yml
#>
---
prometheus_nodeexporter__exporter_version: "1.8.2"
#<>
Setting up Windows stations
Just download the msi of the latest Windows Exporter from GitHub and install it via GPO.
Don't forget to open port 9182 on the Windows stations via GPO. For testing you can do this with a PowerShell command:
New-NetFirewallRule -DisplayName "Allow Port 9182" -Direction Inbound -Protocol TCP -LocalPort 9182 -Action Allow
Grafana
It is installed simply from a package from the official website:
sudo apt-get install -y apt-transport-https software-properties-common wget
wget https://dl.grafana.com/oss/release/grafana_8.4.2_amd64.deb
sudo dpkg -i grafana_8.4.2_amd64.deb
sudo systemctl enable grafana-server && sudo systemctl start grafana-server
sudo apt --fix-broken install
sudo ufw allow 3000/tcp
Let's go to http://localhost:3000
Default Grafana login and password: admin / admin
To connect Prometheus, look in the Grafana console – Connections – Data sources – Add data source – Prometheus – http://localhost:9090 – Save & Test
I won't cover the PromQL (Prometheus Query Language) query language here, but it is extremely powerful; that is definitely a topic for a separate article.
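For orientation: any PromQL expression can also be sent to Prometheus's HTTP API directly, which is exactly what Grafana does under the hood. A sketch that only builds the instant-query URL (the expression is a common node_exporter CPU query; no request is actually made):

```python
from urllib.parse import urlencode

def query_url(base, promql):
    """Build the Prometheus instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urlencode({"query": promql})

# CPU busy percentage over 5 minutes per instance, from node_exporter metrics:
url = query_url(
    "http://localhost:9090",
    '100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100',
)
```

A GET to `url` returns a JSON document with the per-instance values.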
So we simply import ready-made dashboards from the library:
Dashboards → Import dashboard → Data Source: Prometheus
For Windows I liked id 20763
For Linux – id 11074
In our case, it will be useful later to modify the ready-made dashboard so that it uses both Node Exporter and Windows Exporter as sources, in order to monitor all stations at once.
Prometheus and applications
Separately, I would like to note that Prometheus has many client libraries. For example, in a Python application I developed for inventory control of technical equipment, I collect several application-level metrics.
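To make the idea concrete: everything a client library does ultimately boils down to serving plain text in the Prometheus exposition format over HTTP. A dependency-free sketch of a minimal /metrics endpoint (the metric name app_equipment_total is made up for illustration; in a real application you would use the official prometheus_client package):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

EQUIPMENT_TOTAL = 42  # stand-in for a real application metric

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        # Prometheus text exposition format: HELP, TYPE, then samples.
        body = (
            "# HELP app_equipment_total Items currently tracked by the app.\n"
            "# TYPE app_equipment_total gauge\n"
            f"app_equipment_total {EQUIPMENT_TOTAL}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve(port=0):
    """Start the endpoint in a background thread; port 0 picks a free port."""
    srv = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv  # srv.server_address[1] is the bound port
```

Pointing a scrape job at this port is all it takes for the counter to appear in PromQL.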
Conclusions
Both monitoring systems have their advantages and disadvantages; as always, there is no "ideal" solution.
Zabbix is a comprehensive solution with a powerful alerting system and out-of-the-box support for many standards, but it is less flexible when working with metrics.
Prometheus scales excellently, is flexible, and is well suited to container and cloud environments as well as to applications, but it is more complex to configure.