We analyze the options and pitfalls of implementation.

What does a Patroni-based DBMS cluster consist of?

Architecturally, PostgreSQL-based clusters are of the “shared-nothing” type: they use neither shared disk storage nor shared memory. To ensure high availability and fault tolerance, a Patroni cluster is typically assembled in a minimum three-node configuration.

In some cases, it is possible to use two DBMS nodes plus a third node without a DBMS that only provides quorum for the distributed configuration store. But this is rather an exotic solution, used when resources are scarce. Since we have neither shared disks nor shared memory, data synchronization between cluster nodes is implemented through streaming physical replication from the master server to the slaves. If the master fails, one of the slaves takes over its role.
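As a quick illustration, once such replication is running you can observe it directly from psql. A minimal sketch (the user name and the way you invoke psql are assumptions, not part of the original setup):

# On the node that currently holds the primary role: list connected standbys
# and their replication state.
psql -U postgres -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"

# On any other node: check that it is in recovery, i.e. acting as a standby.
psql -U postgres -c "SELECT pg_is_in_recovery();"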

A note on the terminology and role names of Patroni cluster nodes used in articles and documentation

In articles on the Internet and in various documentation, you can find different names for the roles of cluster nodes: master/slave, leader/follower, primary/standby – all of these names are in use. Previously, the documentation used the master/slave terminology for clusters of any kind. Today, PostgreSQL clusters are usually described with the leader/follower and primary/standby terms, which better emphasize the directed nature of replication in the cluster.

You can start learning Patroni from the official site. For a Patroni-based cluster to operate normally, in addition to Patroni itself and the PostgreSQL instances it manages, we will need extra components that handle network connections, route traffic to the leader node, and balance requests.

Here are the components that a Patroni cluster with these supporting pieces may consist of:

  • PostgreSQL itself;

  • distributed storage of cluster configurations on one of the solutions: etcd, Consul, ZooKeeper or Exhibitor;

  • HAProxy for load balancing, distributing requests between nodes and transparent switching in case of primary node failure;

  • Keepalived – a service for managing a virtual IP address (VIP) that “floats” between cluster nodes and thus ensures high availability if one of the nodes fails.

Let's say you spent 4-5 hours studying the Patroni documentation, its possible limitations and requirements. You looked at component comparisons, for example “etcd vs zookeeper”, and figured out the typical architecture of a failover cluster, a diagram of which can be found in one of the projects on GitHub.

In the end, you settled on etcd as a distributed configuration storage, especially since it is most often used with Patroni.

Features of installation of components required for a cluster on Astra Linux

If you are solving the problem of import substitution, it is logical to take one of the popular domestic distributions. For example, Astra Linux 1.7.5 from the software registry.

But when installing components from the Astra Linux repositories, you will encounter some nuances:

  • only PostgreSQL version 11 is available; if you need a more recent version of PostgreSQL, you will have to build it yourself – “…but that’s a completely different story” (c);

  • Patroni is version 3.0.2 (dated 03/24/2023), which does not support PostgreSQL 16.x – although with the PostgreSQL version from the standard Astra Linux repositories this is not critical. The current Patroni release is 3.3.2 (dated 07/11/2024);

  • HAProxy and etcd 3.4.23 are also not the newest, but there are no critical limitations here.

Let's start with etcd. Read the manuals on etcd.io to understand how it is configured, operated and managed.

Important: in addition to etcd itself, you will need to configure the corresponding operating system services. Plan on roughly 15-30 minutes to configure everything specifically for Astra Linux, and you may have to reboot several times.
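A minimal sketch of what that OS-level wiring might look like; the unit file contents and paths are assumptions, not the exact Astra Linux packaging (the package may already ship a unit, in which case only enabling the service is needed):

# Sketch of a systemd unit for etcd driven by the /etc/etcd/etcd.conf variables.
sudo tee /etc/systemd/system/etcd.service >/dev/null <<'EOF'
[Unit]
Description=etcd key-value store
After=network-online.target

[Service]
Type=notify
EnvironmentFile=/etc/etcd/etcd.conf
ExecStart=/usr/bin/etcd
Restart=on-failure
User=etcd

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now etcd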

If you need support for secure connections, allow time for generating and configuring certificates.
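If a simple self-signed setup is acceptable in your environment, a rough outline of generating a CA and a per-node certificate might look like this (a sketch only; file names, SANs and validity periods are assumptions – a proper PKI may be mandated where you work):

# Create a local CA (sketch).
openssl req -x509 -newkey rsa:4096 -nodes -days 3650 \
  -keyout etcd-ca.key -out etcd-ca.crt -subj "/CN=etcd-ca"

# Issue a certificate for node db01 with matching SANs (repeat per node).
openssl req -newkey rsa:4096 -nodes \
  -keyout db01.key -out db01.csr -subj "/CN=db01"
openssl x509 -req -in db01.csr -CA etcd-ca.crt -CAkey etcd-ca.key \
  -CAcreateserial -days 3650 -out db01.crt \
  -extfile <(printf "subjectAltName=DNS:db01,IP:192.xxx.yyy.11")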

As a result, we get a config for the etcd server with the “leader” role, which will look something like this:

Example of configuration file /etc/etcd/etcd.conf

#[Member]
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_NAME="core"
ETCD_HEARTBEAT_INTERVAL="1000"
ETCD_ELECTION_TIMEOUT="5000"

#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://db01:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://db01:2379"
ETCD_INITIAL_CLUSTER="core=http://db01:2380,db02=http://db02:2380,db03=http://db03:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-my-cluster01"
ETCD_INITIAL_CLUSTER_STATE="existing"

Here and below, examples of configuration files are provided to estimate the volume of configuration parameters.
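Before moving on to Patroni, it is worth making sure the etcd cluster itself is healthy. Something along these lines, with the endpoints matching the node names used above:

# Check cluster membership and the health of each endpoint (etcd v3 API).
export ETCDCTL_API=3
etcdctl --endpoints=http://db01:2379,http://db02:2379,http://db03:2379 member list
etcdctl --endpoints=http://db01:2379,http://db02:2379,http://db03:2379 endpoint health
etcdctl --endpoints=http://db01:2379,http://db02:2379,http://db03:2379 endpoint status -w table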

Setting up Patroni

The next step is to configure Patroni. There are many example cluster configs for it on the Internet, but before reaching for them it is important to work through several questions.

One of them is the choice of slave node types in the cluster and the type of physical replication used (synchronous/asynchronous).

With synchronous replication, the performance penalty depends on the confirmation level required for replicated data. For example, at the remote_apply level the penalty starts at around 20%, and for small transactions it can exceed 50%.

An alternative is asynchronous replication, but it is not suitable for every industry – especially, for example, the financial sector. There is also a “mixed” option (synchronous + asynchronous), but it is not much faster than fully synchronous replication, since at least one synchronous node still remains in the cluster. It is also advisable to choose backup and recovery tools and strategies for the nodes inside the Patroni cluster (so that Patroni does not launch a full copy from the primary node at the moment it decides that this is necessary to restore a cluster node), but this requires time and deeper expertise.
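If you do opt for synchronous replication, it does not have to be decided at bootstrap time: Patroni can switch the cluster-wide settings at runtime through its REST API. A sketch, assuming the default REST API port and the restapi credentials from your patroni.yml; the parameter values are illustrative:

# Enable synchronous mode and set synchronous_commit cluster-wide via PATCH /config.
curl -s -u 'patron:********' -XPATCH http://db01:8008/config \
  -d '{"synchronous_mode": true, "postgresql": {"parameters": {"synchronous_commit": "on"}}}'

# The same keys can also be edited interactively:
# patronictl -c /etc/patroni/patroni.yml edit-config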

If you need to save time, it is easier to take a simple Patroni config. There are many examples on the Internet, here is one of them:

Example of configuration file /etc/patroni/patroni.yml

name: patroni-cl
scope: patr-cl

restapi:
  listen: 0.0.0.0:8008
  connect_address: db01:8008
  authentication:
    username: patron
    password: '********'

etcd:
  hosts: db01:2379,db02:2379,db03:2379
  username: 'root'
  password: '********'

bootstrap:
  dcs:
    ttl: 100
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 2097152
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        wal_keep_segments: 5120
        max_wal_senders: 5
        max_replication_slots: 5
        checkpoint_timeout: 30
  initdb:
    - auth-host: md5
    - auth-local: peer
    - encoding: UTF8
    - data-checksums
    - locale: ru_RU.UTF-8
  pg_hba:
    - host replication postgres ::1/128 md5
    - host replication postgres 127.0.0.1/8 md5
    - host replication postgres 192.xxx.yyy.11/24 md5
    - host replication postgres 192.xxx.yyy.12/24 md5
    - host replication postgres 192.xxx.yyy.13/24 md5
    - host all all 0.0.0.0/0 md5
    - host replication replicator samenet md5
    - host replication all 127.0.0.1/32 md5
    - host replication all ::1/128 md5
  users:
    admin:
      password: '********'
      options:
        - superuser

postgresql:
  listen: 0.0.0.0:5432
  connect_address: db01:5432
  conf_dir: /etc/postgresql/11/main
  bin_dir: /usr/bin/
  data_dir: /data/patron
  pgpass: /tmp/pgpass
  authentication:
    superuser:
      username: 'postgres'
      password: '********'
    replication:
      username: 'replicator'
      password: '********'
    rewind:
      username: 'rewind_user'
      password: '*******'
  parameters:
    unix_socket_directories: '/var/lib/pgsql/'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

Next, we will have to study and understand all the necessary configuration parameters; this will take another 1.5-2 hours.

Launching Patroni as services

The next step is to organize the launch of Patroni as services. At this stage, various problems may arise:

  • old Python starts instead of the required Python 3;

  • paths to binaries are missing or incorrect;

  • replication is not working due to missing replicator role;

  • non-existent directories and files (if the config is taken from the network);

None of these errors is fatal – they are easy to find in the logs and fix – but they do take some time.

Next, you need to adjust the config and roll it out to all three servers.
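A typical systemd unit for Patroni looks roughly like this. This is a sketch: the interpreter and paths depend on how Patroni was installed (the path shown assumes a pip installation) and are assumptions here.

sudo tee /etc/systemd/system/patroni.service >/dev/null <<'EOF'
[Unit]
Description=Patroni PostgreSQL HA manager
After=network-online.target etcd.service

[Service]
Type=simple
User=postgres
Group=postgres
# Point explicitly at Python 3 to avoid the "old Python" problem noted above.
ExecStart=/usr/bin/python3 /usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
KillMode=process
TimeoutSec=30

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now patroni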

Checking the cluster operation

After the cluster has been successfully launched, we check the replication. We create a test database on the first node and search for it on the others. If the database has replicated, then the cluster is working as it should.
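The same check from the command line might look like this (a sketch; there is no VIP yet, so we connect to the nodes directly, and db01 is assumed to hold the primary role):

# Which node is currently the leader, and are the replicas streaming?
patronictl -c /etc/patroni/patroni.yml list

# Create a test database on the leader...
psql -h db01 -U postgres -c "CREATE DATABASE repl_test;"

# ...and look for it on the replicas.
psql -h db02 -U postgres -c "SELECT datname FROM pg_database WHERE datname = 'repl_test';"
psql -h db03 -U postgres -c "SELECT datname FROM pg_database WHERE datname = 'repl_test';"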

But if we look back, we understand that everything took at least 3-5 hours (depending on skills and competencies). And at the moment we only have:

  • configured distributed configuration storage;

  • 3 PostgreSQL servers (in cluster configuration) with physical replication managed by Patroni.

Setting up a single connection point

But we don't have a single cluster IP or a single connection point yet. To get them, we configure Keepalived and HAProxy.

Keepalived is not a complicated service, but it has its pitfalls. Recall how Keepalived and the VRRP protocol work (what is sent to whom and how the nodes react), read the useful materials, and find out what happens if you accidentally take someone else's virtual_router_id.
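A quick way to see what VRRP actually sends on the wire – and to spot a clash of virtual_router_id with a neighbouring cluster – is to capture the advertisements (the interface name is an assumption):

# VRRP is IP protocol 112; watch the advertisements and check that the
# advertised router ID and priority are really yours.
sudo tcpdump -i eth0 -n 'ip proto 112'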

Next, we allocate an IP for the VIP and pick a config for the three-node setup (fortunately, there are many examples of pairing Keepalived with HAProxy):

Example of configuration file /etc/keepalived/keepalived.conf

global_defs {
    router_id db01
}

vrrp_script haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight 2
}

vrrp_instance admin-1 {
    state MASTER
    virtual_router_id 50
    advert_int 1
    priority 150
    interface eth0
    virtual_ipaddress {
        192.xxx.yyy.10/32 dev eth0
    }
    track_script {
        haproxy
    }
}

(Again) Solving problems with packages and utilities

When setting up Keepalived and HAProxy, we will likely encounter the absence of some packages and utilities required for the example scripts to work. For example, the killall utility is not installed by default.

We quickly fix the problems we find, start the service, start pinging the VIP and check how it behaves (the commands are sketched after the list below):

  • we disable network interfaces on the nodes one by one;

  • we make sure that the VIP “jumps” between nodes;

  • we check that the VIP returns to the original node;
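A rough version of those checks from the command line; the interface, VIP and service names are the ones from the example config above:

# Is the VIP currently on this node?
ip -4 addr show dev eth0 | grep 192.xxx.yyy.10

# Keep a ping running against the VIP from a separate host.
ping 192.xxx.yyy.10

# Simulate a failure on the node that currently holds the VIP
# (on a test stand only), then watch the VIP move and come back.
sudo systemctl stop keepalived   # or take the interface down: sudo ip link set eth0 down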

Setting up request routing from VIP to the leading (primary) PostgreSQL node in the Patroni cluster

It is important for us to send the network traffic received on the VIP to the leading (primary) PostgreSQL node in the Patroni cluster. For this purpose, we use HAProxy.

HAProxy is probably the most popular of all the auxiliary network services used in our cluster, and there are plenty of example configs for it. HAProxy health-checks the Patroni nodes and always redirects traffic from the port it listens on to the PostgreSQL port of the node where Patroni reports itself as the leader (Patroni leader = PostgreSQL primary).
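This is exactly what the health check relies on: Patroni's REST API answers 200 on /primary only from the node that currently holds the primary role, and 200 on /replica only from healthy replicas. You can reproduce the check by hand (a sketch, assuming the default REST API port 8008):

# Ask each Patroni node whether it considers itself primary:
# only the current leader answers 200, the replicas answer 503.
for node in db01 db02 db03; do
  printf '%s: ' "$node"
  curl -s -o /dev/null -w '%{http_code}\n' "http://${node}:8008/primary"
done

# Healthy replicas answer 200 on /replica – handy for read-only balancing later.
curl -s -o /dev/null -w '%{http_code}\n' http://db02:8008/replica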

Example of configuration file /etc/haproxy/haproxy.conf

global
    log 127.0.0.1 local2
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
    maxconn 300

defaults
    log global
    mode tcp
    retries 2
    option http-server-close
    timeout client 50s
    timeout server 50s
    timeout connect 8s
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

frontend patroni_front
    bind *:5433
    default_backend patroni_backend

backend patroni_backend
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server db01 db01:5432 maxconn 100 check port 8008
    server db02 db02:5432 maxconn 100 check port 8008
    server db03 db03:5432 maxconn 100 check port 8008

We configure HAProxy, check it – it works!
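A final end-to-end check might look like this (the VIP and the HAProxy front-end port are the ones from the examples above):

# Connect through the VIP to the HAProxy front-end port and confirm that the
# query lands on the primary (pg_is_in_recovery() must return false).
psql -h 192.xxx.yyy.10 -p 5433 -U postgres -c "SELECT pg_is_in_recovery();"

# The HAProxy stats page shows which backend server is currently marked UP.
curl -s http://192.xxx.yyy.10:7000/ | head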

In general, that's all. The cluster works, and traffic from the VIP is routed correctly. However, it may turn out that after a test or emergency shutdown of one of the cluster nodes, the Keepalived “master” (the node holding the dedicated Virtual IP) and the new PostgreSQL primary end up on different nodes, because the two services use different algorithms and protocols to elect their “masters”. In that case some “diagonal” traffic appears between the cluster nodes. Not critical, but you will have to tinker with this too.

Is it worth the time?

If your goal was to practice and refresh your knowledge of a bunch of network technologies, services and protocols, and at the same time to independently assemble a working PostgreSQL cluster from available components, then you can consider that everything went great. If you deploy the cluster consciously and without haste, it will take about 1–1.5 days.

But if we evaluate the result as a real business task, the picture is not so optimistic. In our example based on Astra Linux, we got:

  • old versions of PostgreSQL and Patroni;

  • open security issues within etcd and Patroni;

  • not very optimal default settings for backup and recovery of cluster nodes (by default, recovery of a cluster node requires a full backup on the primary PostgreSQL, which in general can be quite resource-intensive);

  • no balancing of read requests (although this can easily be added via HAProxy – see the sketch after this list);

  • there is no connection pooler, which is desirable – and sometimes simply necessary – once you have more than 300-400 network connections. You can bolt on the near-default PgBouncer, but it is single-threaded and its load alone can bring down the PostgreSQL node holding the primary role. As for alternative high-performance, multi-threaded poolers such as Odyssey, there is no ready-made Odyssey build for PG16, so again you have to build it yourself.
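On the read-balancing point mentioned above: one possible approach is a second HAProxy listener that health-checks Patroni's /replica endpoint and round-robins across the standbys. A sketch only – the extra port number is an arbitrary choice:

# Append a read-only listener to the HAProxy config (sketch).
sudo tee -a /etc/haproxy/haproxy.conf >/dev/null <<'EOF'

listen patroni_read_only
    bind *:5434
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    balance roundrobin
    server db01 db01:5432 maxconn 100 check port 8008
    server db02 db02:5432 maxconn 100 check port 8008
    server db03 db03:5432 maxconn 100 check port 8008
EOF

sudo systemctl reload haproxy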

The cluster works, but there are still many places that need to be optimized. Studying implementation options, the implementation itself and tests will take another 2-3 days to 2 weeks, depending on the volume of additions. At the same time, do not forget that we need to document each step to ensure repeatability of the configuration when transferring, for example, from a test environment to production.

Proxima DB: Deploying a Cluster in 4 Steps

You could say that everything written above is one very long introduction 🙂 But without a brief tour of the do-it-yourself path, the description of deployment in Proxima DB would not be nearly as exciting! So, watch closely and note the differences.

The entire deployment process in Proxima DB can be summarized in one sentence. You only need to enter the minimum required data set in 4 steps of the setup wizard and wait 10-15 minutes for the cluster to be deployed.

Step 1: Set the basic cluster parameters.

We enter the name and description of the unit (in Proxima DB terminology, a unit is a deployed DBMS instance, whether clustered or not) – in our case, this is the name and purpose of the future DBMS.

We select the “Distributed” type – this tells Proxima DB that we will be deploying a cluster configuration.

Step 2. The one step where you actually enter the minimum required data.

This is the main – and only – step where we enter the actual values and parameters for the cluster being deployed:

  • type of selected cluster configuration;

  • Virtual IP;

  • network interfaces that we will use on all servers;

  • logins and passwords;

  • FQDN of nodes for cluster node deployment, indicating the specific node that will become the leader after deployment;

Step 3. Additional PostgreSQL parameters (optional).

At this step, you can immediately specify additional parameters for PostgreSQL, such as setting work_mem and shared_buffers depending on the resources allocated to the physical or virtual machines to which you are deploying. However, you can set all these parameters later and skip step 3.

Step 4. Viewing and verifying all parameters of the future cluster.

No data entry is required here.

The installation process is fully logged. If an error occurs during installation, the location where it occurred is immediately highlighted in the report.

At the output, we get a fully prepared and configured fault-tolerant cluster based on PostgreSQL 16.2 with 1C support, built from fresh component builds.

And all this is fully compatible with Astra Linux 1.7.5.

The cluster is immediately configured with 4 connection ports in the following variations:

  • directly without a pooler and without balancing read requests;

  • directly without a pooler and with balancing of read requests;

  • via a pooler and without balancing read requests;

  • via a pooler and with balancing of read requests.

For the cluster database, you can immediately configure and enable a backup plan (full and incremental), or perform backups manually as needed. If you need to deploy a new cluster with exactly the same configuration, you can make a copy of the existing one. This allows you to deploy a new node or cluster and reduce the already small configuration effort.

Let's draw conclusions

Let me emphasize once again: with this post I am not denying – quite the opposite, I am confirming – that you can build your own DBMS cluster based on PostgreSQL and Patroni with your own hands. The task is entirely feasible: there are instructions for every step of the quest, and the quest itself is not that difficult and has no obscure branches. Having spent from one day to… probably seven, even a specialist without very deep Linux knowledge can assemble such a cluster.

Of course, the better you understand how a cluster running Patroni works, how physical, synchronous and asynchronous replication works, the faster and easier it will be for you to deploy a cluster and configure it specifically for your individual tasks.

The fact remains: a ready-made tool allows you to deploy a cluster in 15 minutes. And the ability to save and reuse configurations eliminates the need to document everything each time and guarantees “exactly the same environment” at every stage, from test to production.

If speed and reduced Time-to-Market, resource savings and reduced requirements for DBMS administrator skills are important to you, solutions like Proxima DB can be very useful.
