Building a Scalable and Highly Available Postgres System with Patroni 3.0 and Citus

Citus is a PostgreSQL extension that provides scalability by transparently distributing and/or replicating tables across one or more PostgreSQL nodes. Citus can be used both on the Azure cloud platform and on your own servers, because the Citus database extension is completely open source and you can download and install it anywhere.

A typical Citus cluster consists of a special node called the coordinator and several worker nodes (workers). Typically, applications send their queries to the Citus coordinator node, which passes them on to the workers and accumulates the results. (Unless, of course, you are using the optional “query from any node” feature introduced in Citus 11, in which case queries can be directed to any node in the cluster.)

Meanwhile, one of the most frequently asked questions is: “How does Citus handle coordinator or worker failures? What is the high availability (HA) story?”

Except when you use Citus as a managed service in the cloud, the answer so far hasn’t been very good: just use PostgreSQL streaming replication to run the coordinator and workers with HA, and how to handle failover is up to you.

Specifically for Citus, it is important that the coordinator and workers continue to work reliably even if one of them stops responding. This ensures continuous availability and allows applications to continue working without significant interruptions even if a failure occurs. These failures can occur for various reasons, such as hardware failures, software errors, etc.

High availability (HA) means that the system is configured to minimize downtime and disruption in the event of a failure. This may include automatic failure detection, failover to redundant components, and restoration of system operation without significant impact on users.

In this article, you will learn how Patroni 3.0+ can be used to deploy a highly available Citus database cluster by just adding a few lines to the Patroni configuration file. Let’s look at these topics in more detail:

  • What is Patroni?

  • Introducing Citus Support in Patroni 3.0

  • Our first distributed Citus cluster with Patroni

  • Our first controlled HA switchover using Patroni and Citus

  • Future plans and possible improvements

  • Conclusion: Combining Patroni and Citus for PostgreSQL High Availability in a Distributed Environment

Clarifying Terminology: Multiple Competing Meanings of the Term “Cluster”

In the Postgres world, the term “cluster” has many different meanings and can be used in different situations. This can cause confusion as the same word is used to describe different concepts. Here is how we will use the term in this article to avoid confusion:

  1. Database cluster (the SQL standard calls it a catalog cluster): a collection of databases managed by a single instance of a running database server.

  2. PostgreSQL cluster (or Patroni cluster): multiple database instances, a primary with a few standby nodes, usually connected via streaming replication.

  3. Citus Cluster: A distributed set of database nodes that form one or more logically connected PostgreSQL clusters using the Citus extension for Postgres.

  4. Kubernetes Cluster: A set of host machines for running containerized applications. Kubernetes can be used to deploy Citus or PostgreSQL clusters at scale.

In this article, we will mainly talk about Citus distributed clusters and Patroni-managed PostgreSQL clusters (or Patroni clusters).

What is Patroni? (you can skip this section if you already know everything)

Patroni is an open source tool that helps deploy, manage and monitor highly available PostgreSQL clusters using physical streaming replication. The Patroni daemon runs on all nodes of the PostgreSQL cluster, monitors the state of the Postgres process(es), and publishes that state to a distributed key-value store.

There are several properties required of a distributed key-value store, or Distributed Configuration Store (DCS):

  1. It must implement a consensus algorithm such as Raft, Paxos, Zab, or similar

  2. It must support Compare-And-Set (CAS) operations

  3. It should have mechanisms for key expiration, such as Sessions, Leases, and Time To Live (TTL)

  4. Ideally, it provides a watch API to subscribe to changes of specific keys and receive notifications about them

The last two properties are desirable, but Patroni can still work even if they are not supported/implemented, while the first two are required.

Patroni supports the following DCS: etcd, Consul, ZooKeeper and Kubernetes API:

  • Consul and etcd implement the Raft protocol

  • ZooKeeper implements the Zab protocol

  • The Kubernetes API is backed by etcd
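In practice, pointing Patroni at its DCS takes only a short section of patroni.yaml. As a sketch, a three-node etcd (v3 API) cluster could be configured like this (host names are illustrative):

```yaml
# DCS section of patroni.yaml: connect to a three-node etcd cluster
# over the v3 API. Host names and port are illustrative.
etcd3:
  hosts:
    - etcd1:2379
    - etcd2:2379
    - etcd3:2379
```

Patroni tries the listed endpoints in turn, so losing a single etcd node does not cut Patroni off from the DCS.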

Each Patroni/PostgreSQL cluster node maintains a member key in the DCS under its own name. The member key value contains the node address (host and port) and the PostgreSQL state: role (primary or standby), current Postgres LSN (log sequence number, a unique identifier for each change that occurs in the database), tags, and so on. Member keys allow the system to automatically discover all nodes of a given Patroni/PostgreSQL cluster.

Patroni running next to the primary PostgreSQL node also maintains the /leader key in the DCS:

  • The /leader key has a limited time to live (TTL); without regular updates (a kind of “heartbeat”), the key expires.

  • If the /leader key is missing, standby nodes start competing for the leader role by trying to create a new /leader key.

  • Patroni on the node that created the new /leader key promotes its Postgres to primary. On the remaining standby nodes, Patroni reconfigures Postgres for streaming replication from the new primary.

  • It is important to note that all operations on the /leader key are protected by Compare-And-Set.
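To make this concrete, the keys a small Patroni cluster keeps in the DCS might look roughly like the following. This is an illustrative sketch: by default Patroni namespaces its keys under /service/<scope>/, and the JSON values are abbreviated here.

```text
/service/demo/leader          -> coord2      # TTL-protected, refreshed by the leader's heartbeat
/service/demo/members/coord1  -> {"role": "replica", "conn_url": "postgres://...", "state": "running"}
/service/demo/members/coord2  -> {"role": "primary", "conn_url": "postgres://...", "state": "running"}
/service/demo/members/coord3  -> {"role": "replica", "conn_url": "postgres://...", "state": "running"}
/service/demo/config          -> {"ttl": 30, "postgresql": {"parameters": {}}}
```

If coord2 stops refreshing /leader, the key expires after the TTL and coord1/coord3 race to create a new one via Compare-And-Set; exactly one of them wins and gets promoted.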

Patroni on standby nodes uses the /leader and member keys to determine which node is the primary, and configures its managed (standby) PostgreSQL node to replicate data from the primary. In addition to automatic failover for HA, Patroni helps automate many management tasks:

  • Initializing new nodes using pg_basebackup or third-party backup tools like pgBackRest, WAL-G/WAL-E, Barman, and so on.

  • Provides synchronous replication.

  • When switching roles or recovering the cluster after a failover, Patroni supports running the pg_rewind tool, which helps the old primary node rejoin the cluster as a standby.

  • Support for point-in-time recovery (PITR): Patroni can help you restore data to a specific point in time. Instead of creating a new instance with initdb, Patroni can initialize new PostgreSQL clusters from a backup.

  • And much more.
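As a sketch, several of the features above are enabled with just a few lines in patroni.yaml. The stanza name and restore command below are illustrative, not taken from the article's demo:

```yaml
# Fragment of patroni.yaml (illustrative values).
postgresql:
  use_pg_rewind: true        # let a demoted primary rejoin the cluster as a standby
  create_replica_methods:    # tried in order when a new node needs initializing
    - pgbackrest
    - basebackup
  pgbackrest:                # custom replica creation via a backup tool
    command: /usr/bin/pgbackrest --stanza=demo --delta restore
    keep_data: true
    no_params: true
```

If the pgbackrest restore fails, Patroni falls back to the next method in the list (plain pg_basebackup).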

Figure 1. Typical deployment of a highly available (HA) Patroni-managed PostgreSQL cluster using etcd as distributed key-value storage (DCS) and HAProxy to provide a single point of connection for clients to the primary node.


Introducing Citus Support in Patroni 3.0

Patroni 3.0 introduces official Citus support. Although it was already possible to run Patroni with Citus before the 3.0 release (thanks to the flexibility and extensibility of Patroni!), version 3.0 made the integration with Citus for high availability more efficient and easier to use.

Patroni relies on the distributed configuration store (DCS) to discover PostgreSQL cluster nodes and set up streaming replication. As already explained in the terminology section, a Citus cluster is just a collection of PostgreSQL clusters logically linked together by the Citus extension. Therefore, it was logical to extend Patroni so that it could not only discover nodes in a given Patroni/PostgreSQL cluster, but also discover nodes in a Citus cluster, for example when adding a new Citus worker node. As Citus nodes are discovered, they are added to the pg_dist_node metadata on the Citus coordinator.

There are just a few simple rules to follow in order to enable Citus support on Patroni:

  1. Scope (cluster name): the scope must be the same for all Citus nodes. In other words, the cluster name must be identical on every node so that nodes can be discovered and work together properly; the scope determines which cluster a node belongs to.

  2. Superuser username/password: the superuser username and password should be the same on the coordinator node and the workers. If this is not the case, you need to configure superuser connections between nodes using client certificates. Of course, pg_hba.conf must allow superuser connections on all nodes.

  3. REST API access: worker nodes must be able to reach the Patroni REST API of the coordinator. For that, the workers must have the appropriate credentials (such as a username and password) or client certificates, if those are used for authentication.

  4. Adding Citus to the Patroni config file: add a citus section to the patroni.yaml file to tell Patroni about the Citus nodes and their parameters. Together, these rules ensure consistency between the nodes: a common scope, matching superuser credentials, and REST API access. When all of this is in place, Patroni and Citus can work together to provide high availability and reliability for your PostgreSQL cluster. A complete example of a Patroni configuration file is available on GitHub.

citus:
  group: X  # 0 for coordinator and 1, 2, 3, etc for workers
  database: citus  # must be the same on all nodes

That’s all! You can now launch Patroni and enjoy the integration with Citus.
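Putting the rules above together, a minimal coordinator configuration could look roughly like this. The hostnames and credentials are illustrative placeholders, not taken from the demo:

```yaml
scope: demo                    # cluster name: identical on all Citus nodes
citus:
  group: 0                     # 0 = coordinator; workers use 1, 2, ...
  database: citus              # must be the same on all nodes
restapi:                       # workers must be able to reach this API
  listen: 0.0.0.0:8008
  connect_address: coord1:8008
  authentication:
    username: patroni
    password: secret
postgresql:
  authentication:
    superuser:                 # same credentials on coordinator and workers
      username: postgres
      password: secret
```

Worker nodes would use the same file with a different group number and their own addresses.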

Patroni takes care of everything:

  1. The Citus extension will be automatically added to shared_preload_libraries (at the top of the list, as Citus requires!)

  2. If max_prepared_transactions is not explicitly set in the global dynamic configuration, Patroni will automatically set it to 2*max_connections. In other words, if max_prepared_transactions (the maximum number of prepared transactions that can be active at the same time) was not set manually in the PostgreSQL configuration, Patroni sets it to twice the value of max_connections (the maximum number of simultaneous connections to the database). This ensures that there are enough prepared transaction slots to handle potential load.

  3. The database named in citus.database will be created automatically first, followed by the CREATE EXTENSION citus; command.

  4. The current superuser credentials (from the patroni.yaml file) will be added to the pg_dist_authinfo table to allow inter-node communication. Don’t forget to update them if you later decide to change the superuser username/password/sslcert/sslkey!

  5. The primary coordinator node automatically keeps track of the available primary worker nodes. As soon as a new primary worker is discovered, it is registered in the pg_dist_node table using the citus_add_node() function. This lets Citus know about all primary workers and efficiently coordinate the distribution and replication of data between them.

  6. Patroni will also maintain the pg_dist_node table in case of failover/switchover (automatic/planned switching) on the coordinator or worker clusters. That is, Patroni ensures that the information in this table is correctly updated when node roles change, so that all nodes remain synchronized and ready for action.

  7. Finally, when performing a controlled switchover on a worker cluster, Patroni will also pause client connections on the primary coordinator node. This is done to prevent errors visible to clients.
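If you prefer to control max_prepared_transactions yourself instead of relying on the 2*max_connections default, you can state it explicitly in the global dynamic configuration, for example in the bootstrap section of patroni.yaml (the values below are just an example):

```yaml
# Fragment of patroni.yaml: explicit value overrides Patroni's
# automatic 2*max_connections default for Citus clusters.
bootstrap:
  dcs:
    postgresql:
      parameters:
        max_connections: 100
        max_prepared_transactions: 512
```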

The figure below shows an example of a Citus deployment in high availability (HA) mode using Patroni 3.0.0.

Figure 2: Patroni on the coordinator node automatically discovers and registers Citus workers in the cluster metadata. All connections between distributed Citus nodes work without middleware like HAProxy, which reduces the complexity and cost of infrastructure maintenance. A second HAProxy instance (on the right) is provided for the scenario where your application uses the optional “query from any node” feature in Citus, sometimes used to increase parallelization and throughput.


Our first distributed Citus cluster with Patroni

To deploy our test cluster locally, we will use Docker and the docker-compose tool. The Dockerfile.citus file is in the Patroni repository.

First we need to clone the Patroni repository and build the patroni-citus docker image:

$ git clone https://github.com/zalando/patroni.git
$ cd patroni
$ docker build -t patroni-citus -f Dockerfile.citus .
Sending build context to Docker daemon  573.6MB
Step 1/36 : ARG PG_MAJOR=15
… skip intermediate logs
Step 36/36 : ENTRYPOINT ["/bin/sh", "/entrypoint.sh"]
---> Running in 1933967fcb58
Removing intermediate container 1933967fcb58
---> 0eea66f3c4c7
Successfully built 0eea66f3c4c7
Successfully tagged patroni-citus:latest

Once the image is ready, we will deploy the stack with the following commands:

$ docker-compose -f docker-compose-citus.yml up -d
Creating demo-etcd1   ... done
Creating demo-work1-2 ... done
Creating demo-coord2  ... done
Creating demo-coord3  ... done
Creating demo-work1-1 ... done
Creating demo-etcd2   ... done
Creating demo-work2-2 ... done
Creating demo-coord1  ... done
Creating demo-work2-1 ... done
Creating demo-haproxy ... done
Creating demo-etcd3   ... done

We can then check that the containers are up and running:

$ docker ps
CONTAINER ID   IMAGE            COMMAND                  CREATED              STATUS              PORTS                              NAMES
e7740f00796d   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-etcd2
8a3903ca40a7   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-etcd3
3d384bf74315   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute   0.0.0.0:5000-5001->5000-5001/tcp   demo-haproxy
2f6c9e4c63b8   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-work2-1
4bd35bfdba58   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-coord1
8dce43a4f499   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-work1-1
e76372163464   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-work2-2
0de7bf5044fd   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-coord3
633f9700e86f   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-coord2
f50bb1e1d6e7   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-etcd1
03bd34403ac2   patroni-citus    "/bin/sh /entrypoint…"   About a minute ago   Up About a minute                                      demo-work1-2

In total we have 11 containers:

  • three containers with etcd (form a three-node etcd cluster),

  • seven containers with Patroni+PostgreSQL+Citus (a coordinator cluster of three nodes and two worker clusters of two nodes each), and

  • one container with HAProxy.

HAProxy (a load-balancing server) works as an intermediary between clients and database servers. On port 5000, it always connects to the primary node of the Citus coordinator, which acts as the central point of control. On port 5001, HAProxy load-balances across the primary worker nodes, distributing client requests among several workers for more efficient use of resources and better performance.
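The routing relies on Patroni's REST API for health checks: the /primary endpoint returns HTTP 200 only on the current leader of a group. A sketch loosely modeled on the demo's haproxy.cfg (addresses, ports, and section names are illustrative):

```text
# Port 5000: always route to the primary of the coordinator group.
listen citus_coordinator
    bind *:5000
    option httpchk OPTIONS /primary
    http-check expect status 200
    server coord1 172.19.0.8:5432 check port 8008
    server coord2 172.19.0.7:5432 check port 8008
    server coord3 172.19.0.6:5432 check port 8008

# Port 5001: round-robin across the primaries of all worker groups
# (for the optional "query from any node" feature).
listen citus_workers
    bind *:5001
    balance roundrobin
    option httpchk OPTIONS /primary
    http-check expect status 200
    server work1-1 172.19.0.5:5432 check port 8008
    server work1-2 172.19.0.2:5432 check port 8008
    server work2-1 172.19.0.9:5432 check port 8008
    server work2-2 172.19.0.4:5432 check port 8008
```

Because the health check asks Patroni (port 8008) rather than Postgres itself, HAProxy redirects traffic as soon as roles change, without inspecting replication state on its own.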

In a few seconds, our Citus cluster will be ready to go. We can check this using the patronictl tool from the demo-haproxy container:

$ docker exec -ti demo-haproxy bash
postgres@haproxy:~$ patronictl list
+ Citus cluster: demo ---------+--------------+---------+----+-----------+
| Group | Member  | Host       | Role         | State   | TL | Lag in MB |
+-------+---------+------------+--------------+---------+----+-----------+
|     0 | coord1  | 172.19.0.8 | Sync Standby | running |  1 |         0 |
|     0 | coord2  | 172.19.0.7 | Leader       | running |  1 |           |
|     0 | coord3  | 172.19.0.6 | Replica      | running |  1 |         0 |
|     1 | work1-1 | 172.19.0.5 | Sync Standby | running |  1 |         0 |
|     1 | work1-2 | 172.19.0.2 | Leader       | running |  1 |           |
|     2 | work2-1 | 172.19.0.9 | Sync Standby | running |  1 |         0 |
|     2 | work2-2 | 172.19.0.4 | Leader       | running |  1 |           |
+-------+---------+------------+--------------+---------+----+-----------+

Now let’s connect to the coordinator’s primary node via HAProxy and make sure the Citus extension is created and the workers are registered in the coordinator metadata:

postgres@haproxy:~$ psql -h localhost -p 5000 -U postgres -d citus
Password for user postgres: postgres
psql (15.1 (Debian 15.1-1.pgdg110+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

citus=# \dx
                    List of installed extensions
     Name      | Version |   Schema   |         Description
---------------+---------+------------+------------------------------
citus          | 11.2-1  | pg_catalog | Citus distributed database
citus_columnar | 11.2-1  | pg_catalog | Citus Columnar extension
plpgsql        | 1.0     | pg_catalog | PL/pgSQL procedural language
(3 rows)

citus=# select nodeid, groupid, nodename, nodeport, noderole
from pg_dist_node order by groupid;
nodeid | groupid |  nodename  | nodeport | noderole
-------+---------+------------+----------+----------
     1 |       0 | 172.19.0.7 |     5432 | primary
     3 |       1 | 172.19.0.2 |     5432 | primary
     2 |       2 | 172.19.0.4 |     5432 | primary
(3 rows)

So far so good :).

In this particular case, Patroni is configured to use client certificates in addition to passwords for superuser connections between nodes. Since Citus actively uses superuser connections for inter-node communication, Patroni also took care of configuring authentication parameters through pg_dist_authinfo:

citus=# select * from pg_dist_authinfo;
nodeid | rolename |                                                   authinfo
-------+----------+--------------------------------------------------------------------------------------------------------------
     0 | postgres | password=postgres sslcert=/etc/ssl/certs/ssl-cert-snakeoil.pem sslkey=/etc/ssl/private/ssl-cert-snakeoil.key
(1 row)

Don’t be intimidated by the password you see in the authinfo field. Why? Because, firstly, only the superuser has access to pg_dist_authinfo. Secondly, you can configure authentication using only client certificates, which is, in fact, the recommended approach.
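A sketch of what certificate-only superuser authentication could look like in pg_hba.conf (the CIDR range matches the demo's docker network; your addresses will differ):

```text
# Require a valid client certificate instead of a password for
# superuser connections between Citus nodes; with the "cert" method
# the certificate's CN must match the connecting role name.
hostssl all postgres 172.19.0.0/24 cert
```

With this in place, the password can be dropped from pg_dist_authinfo entirely, leaving only the sslcert/sslkey parameters.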

Our first controlled HA switchover using Patroni and Citus

In Postgres and Patroni high availability terminology, a “switchover” is an intentional change of the primary: a controlled process of swapping roles between nodes, typically for scheduled maintenance, as opposed to a failover triggered by an actual failure.

Before switching over with Patroni, let’s first create a distributed Citus table and start writing data to it using psql’s \watch command:

citus=# create table my_distributed_table(id bigint not null generated always as identity, value double precision);
CREATE TABLE
citus=# select create_distributed_table('my_distributed_table', 'id');
 create_distributed_table
--------------------------

(1 row)

citus=# with inserted as (
    insert into my_distributed_table(value)
     values(random()) RETURNING id
) SELECT now(), id from inserted\watch 0.01

The query under \watch 0.01 is executed every 10 ms and returns the inserted id plus the current time with microsecond precision. This will let us see how the switchover affects query execution.

Meanwhile, in another terminal, we initiate a switchover on one of the workers:

$ docker exec -ti demo-haproxy bash

postgres@haproxy:~$ patronictl switchover
Current cluster topology
+ Citus cluster: demo ---------+--------------+---------+----+-----------+
| Group | Member  | Host       | Role         | State   | TL | Lag in MB |
+-------+---------+------------+--------------+---------+----+-----------+
|     0 | coord1  | 172.19.0.8 | Sync Standby | running |  1 |         0 |
|     0 | coord2  | 172.19.0.7 | Leader       | running |  1 |           |
|     0 | coord3  | 172.19.0.6 | Replica      | running |  1 |         0 |
|     1 | work1-1 | 172.19.0.5 | Sync Standby | running |  1 |           |
|     1 | work1-2 | 172.19.0.2 | Leader       | running |  1 |         0 |
|     2 | work2-1 | 172.19.0.9 | Sync Standby | running |  1 |         0 |
|     2 | work2-2 | 172.19.0.4 | Leader       | running |  1 |           |
+-------+---------+------------+--------------+---------+----+-----------+
Citus group: 2
Primary [work2-2]:
Candidate ['work2-1'] []:
When should the switchover take place (e.g. 2023-02-06T14:27 )  1692195503:
Are you sure you want to switchover cluster demo, demoting current leader work2-2? [y/N]: y
2023-02-06 13:27:56.00644 Successfully switched over to "work2-1"
+ Citus cluster: demo (group: 2, 7197024670041272347) ------+
| Member  | Host       | Role    | State   | TL | Lag in MB |
+---------+------------+---------+---------+----+-----------+
| work2-1 | 172.19.0.9 | Leader  | running |  1 |           |
| work2-2 | 172.19.0.4 | Replica | stopped |    |   unknown |
+---------+------------+---------+---------+----+-----------+

Finally, after the switchover is complete, let’s check the logs in the first terminal:

Mon Feb  6 13:27:54 2023 (every 0.01s)

             now              |  id
------------------------------+------
2023-02-06 13:27:54.441635+00 | 1172
(1 row)

Mon Feb  6 13:27:54 2023 (every 0.01s)

            now              |  id
-----------------------------+------
2023-02-06 13:27:54.45187+00 | 1173
(1 row)

Mon Feb  6 13:27:57 2023 (every 0.01s)

             now              |  id
------------------------------+------
2023-02-06 13:27:57.345054+00 | 1174
(1 row)

Mon Feb  6 13:27:57 2023 (every 0.01s)

             now              |  id
------------------------------+------
2023-02-06 13:27:57.351412+00 | 1175
(1 row)

As you can see, before the switchover the queries were executing regularly every 10 milliseconds. Between ids 1173 and 1174 you may notice a short latency spike of 2893 milliseconds (less than 3 seconds). This is where the controlled switchover happened, and it did not cause a single client error!

After the switchover is complete, we can check pg_dist_node again:

citus=# select nodeid, groupid, nodename, nodeport, noderole
from pg_dist_node order by groupid;
nodeid | groupid |  nodename  | nodeport | noderole
-------+---------+------------+----------+----------
     1 |       0 | 172.19.0.7 |     5432 | primary
     3 |       1 | 172.19.0.2 |     5432 | primary
     2 |       2 | 172.19.0.9 |     5432 | primary
(3 rows)

As you can see, the nodename of the primary in group 2 was automatically changed by Patroni from 172.19.0.4 to 172.19.0.9.

Future plans and possible improvements

This article would not be complete if we did not talk about what further work on the integration of Patroni and Citus is possible. And there are really many options:

  1. Read scaling: we could register standby workers in pg_dist_node so that they can be used to scale read-only queries.

  2. Connection pooling: Citus can use a connection pooler for inter-node communication. For this, the pg_dist_poolinfo metadata table should be automatically populated and kept up to date in case of a failover/switchover.

  3. Multiple databases: Patroni currently only supports clusters with a single Citus-enabled database. The Citus extension turns a standard PostgreSQL database into a distributed system that works with data on multiple nodes and runs queries on them in parallel. In such a scenario, each distributed Citus database would be considered a separate Citus cluster, and Patroni supports only one of them per PostgreSQL cluster. However, there are users who have several.

Patroni and Citus together provide users of PostgreSQL distributed systems with a good solution for high availability

Patroni paves the way for automated, fully declarative deployment of Citus open source Postgres distributed database clusters with high availability (HA) – on any platform imaginable. In our examples, we used docker and docker-compose, but a real production deployment does not require the use of containers.

Although Patroni 3.0 supports Citus starting from Citus 10.0, we recommend using the latest versions of Citus and PostgreSQL 15 to take full advantage of transparent switchovers and/or worker restarts. On the Citus 11.2 release notes page, you can see the following:

A major improvement [for high availability in Citus 11.2] is that we now transparently reconnect when we detect that the cached connection to the worker was dropped while we weren’t using it.

There is great documentation available to get you started with both Citus and Patroni.

For Kubernetes lovers, we also have good news: check out the “Citus on Kubernetes” example in the Patroni repository. However, keep in mind that this is just an example and is not intended for production use. To apply it in real life, we recommend waiting for the Postgres operators from Crunchy, Zalando, or OnGres to start supporting Citus.

