Some Switchover Pitfalls in Patroni. Practical Stories from SRE Everyday Life. Part 8

Hello! I'm Evgeny Bezruchkin, a DevOps engineer at Flant, and we're continuing to share stories from our practice.

In one project, we run a PostgreSQL cluster on virtual machines managed by Patroni. At some point we needed to shrink the cluster master's CPU and memory. But when we swapped the master and replica roles (a switchover), a problem arose: only the master stayed in the cluster, and all the replicas dropped out. We found a way out of the situation and discovered a few interesting things along the way. In this article, we'll tell you how we managed to make the switchover go smoothly with the help of synchronous replication.

Background

The cluster consisted of a large master (32 CPU cores, 64 GB RAM) and two replicas. The replicas are asynchronous, since the cluster is geo-distributed and we can't afford to wait for data to land on a replica in another city.

At some point, such a large master was no longer needed, and we decided to shrink it to save money.

The plan was this:

  1. Wait until the cluster load is low.

  2. Perform a switchover: promote replica #1 to master.

  3. Reconfigure the VM of the former master.

  4. Switch back.

Performing the switchover

Well, I've waited for the cluster load to drop, and I'm starting the switchover. Since our replicas are asynchronous, there's a chance of losing data that hasn't yet reached them from the master. To avoid this, we usually do the following:

  1. CHECKPOINT; — flushes modified buffers to disk and creates a checkpoint indicating a consistent state of the cluster.

  2. SELECT pg_switch_wal(); — closes the current WAL segment and switches to a new one.

We also assumed that running CHECKPOINT on the replicas would force them to flush all incoming changes to disk, making them better prepared for the switch. We may have been wrong about that.

With the following query, I check how the replicas are doing and how far they lag behind. It shows the difference between the master's and the replicas' positions at different stages of the WAL lifecycle:

SELECT 
	pg_wal_lsn_diff( pg_current_wal_insert_lsn(),pg_current_wal_flush_lsn()) non_flushed,
	s.application_name,
	pg_wal_lsn_diff( pg_current_wal_insert_lsn(),s.sent_lsn) as sent_lag,
	pg_wal_lsn_diff( pg_current_wal_insert_lsn(),s.write_lsn) as write_lag,
	pg_wal_lsn_diff( pg_current_wal_insert_lsn(),s.flush_lsn) as flush_lag,
	pg_wal_lsn_diff( pg_current_wal_insert_lsn(),s.replay_lsn) as replay_lag 
FROM pg_stat_replication s;
 non_flushed |   application_name   | sent_lag | write_lag | flush_lag | replay_lag
-------------+----------------------+----------+-----------+-----------+------------
           0 | postgres-1           |        0 |         0 |         0 |          0
           0 | postgres-2           |        0 |         0 |         0 |          0

It looks like the master and the replicas are identical.

I run the Patroni switchover and get a surprise:

+ Cluster: main ----+--------------+---------+---------+----+-----------+------------------+
| Member            | Host         | Role    | State   | TL | Lag in MB | Tags             |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-0        | 10.1.4.20    | Replica | running | 8  |   unknown | clonefrom: true  |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-1        | 10.2.4.3     | Leader  | running | 9  |           | clonefrom: true  |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-2        | 10.3.4.7     | Replica | running | 8  |   unknown | clonefrom: true  |
+-------------------+--------------+---------+---------+----+-----------+------------------+

Congratulations, you have no replicas! What happened?

Patroni stopped the current master and handed its role over to the replica postgres-1. In between these steps, some changes arrived at the master while it was shutting down; they were applied on replica postgres-2 but never made it to postgres-1. So the master role went to the replica that was furthest behind.

The rule here: the master must always be ahead. If a replica has more transactions than the master, it can't rejoin the cluster. You either roll it back to the master's state with pg_rewind, or re-transfer everything from the master.
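
For reference, a manual rollback with pg_rewind looks roughly like the sketch below; the paths match the ones used later in this article, while the connection string to the new leader is an assumption. Patroni can also run pg_rewind automatically if use_pg_rewind is enabled.

# run on the stopped, diverged node as the postgres user;
# --target-pgdata is the local data directory to rewind,
# --source-server points at the new leader to take the correct history from
/usr/lib/postgresql/15/bin/pg_rewind \
    --target-pgdata=/var/lib/postgresql/15/main \
    --source-server="host=10.2.4.3 port=5432 user=postgres dbname=postgres" \
    --progress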

We've run into such situations before, and experience with this cluster showed that pg_rewind is slower here, so it's easier and faster to simply re-transfer the replicas. But it can be the other way around: if the database is huge and your kids would finish school before the data gets copied to the replica, it's better to roll back the state.

Before re-syncing the databases, you need to check what wasn't transferred during the switch. If there was something important, save it and put it back afterwards.

I look at the WAL logs and see:

postgres-0$ /usr/lib/postgresql/15/bin/pg_waldump /var/lib/postgresql/15/main/pg_wal/0000000800002265000000CE 
postgres-2$ /usr/lib/postgresql/15/bin/pg_waldump /var/lib/postgresql/15/main/pg_wal/0000000800002265000000CE 

CHECKPOINT_SHUTDOWN

But on postgres-1, the record of the master's shutdown never arrived.

Now it's clear where the comma goes in "nothing to save, re-syncing": there's nothing worth saving here. I launch patronictl reinit main postgres-2 without a second thought. Half an hour later, the replica reports that the process has finished. I look:

+ Cluster: main ----+--------------+---------+---------+----+-----------+------------------+
| Member            | Host         | Role    | State   | TL | Lag in MB | Tags             |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-0        | 10.1.4.20    | Replica | running | 8  |   unknown | clonefrom: true  |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-1        | 10.2.4.3     | Leader  | running | 9  |           | clonefrom: true  |
+-------------------+--------------+---------+---------+----+-----------+------------------+
| postgres-2        | 10.3.4.7     | Replica | running | 8  |   unknown | clonefrom: true  |
+-------------------+--------------+---------+---------+----+-----------+------------------+

Half an hour has passed, and the cluster is still in a half-working state. I look at the Patroni logs: the replica reports "synchronization was from postgres-0", that is, from the former master.

I run reinit again, and the log shows the same thing: the replica pulls data from the previous timeline. There must be a reason for this!

In any unclear situation, read the source code. And there it says in black and white: the database is cloned from the master only if there are no other replicas with the clonefrom tag. The idea is sound, so as not to load the master with the backup process. But Patroni does not check how up to date such a replica is.
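
For context, clonefrom is a per-node tag in the Patroni configuration; a minimal sketch of how it is set on a replica (the node name is taken from this article, the rest of the config is omitted):

# fragment of patroni.yml on a replica allowed to serve as a clone source
name: postgres-2
tags:
  clonefrom: true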

Well, since we still need to reconfigure the former master anyway, I remove postgres-2 and run:

patronictl reinit main postgres-0

As expected, it pulls the data from postgres-1. And, so as not to wait too long, I immediately start the second replica as well:

patronictl reinit main postgres-2

More surprises: postgres-0, which has already downloaded a dozen gigabytes, starts re-syncing from scratch. I stop it, run patronictl reinit main postgres-0 again, and the second one breaks. The logs say that Patroni always starts the base backup with the same label, which mixes up the two processes and makes them restart.

Okay, let's do things one at a time.

I start the re-sync again. But the working day is already in full swing, there are still no replicas, and all requests go to the master. It starts to slow down. As a result, the server's connections get clogged with requests that can't finish in time. And pg_basebackup, which Patroni uses to re-sync a replica, connects on the same terms, so it has to wait its turn.

The applications connect via haproxy, so I reduce the maximum number of connections there. This helps: pg_basebackup starts and replication gets going, though the applications complain. Afterwards I raise the haproxy connection limit back.
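
For illustration, the limit in question is the per-server maxconn in the haproxy section that routes application traffic to the master; the listener name, ports, and health-check details below are assumptions based on a typical Patroni + haproxy setup:

# haproxy.cfg fragment: temporarily lower maxconn so the master is not saturated
listen postgres_master
    bind *:5000
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server postgres-1 10.2.4.3:5432 maxconn 50 check port 8008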

There is time to grab a coffee, but no: the client comes with a problem, everything is working very, very slowly. Roughly speaking, the only working server should be handling three times the requests, yet the throughput has dropped not threefold but by an order of magnitude or worse.

Time for research.

I look at the master's database log and see a large number of messages about long-running COMMIT statements. So what is slow: the queries themselves because of the load, or the transaction commits?

I check: queries execute quickly, but transactions commit slowly. Could the pg_basebackup replication stream be getting in the way?
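
One way to see what the committing sessions are actually waiting on is the wait_event columns in pg_stat_activity (PostgreSQL 10 and newer); a minimal sketch of such a check:

-- a COMMIT blocked by synchronous replication shows wait_event = 'SyncRep'
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL;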

Then I spot a message that a transaction was cancelled because of a long commit on a replica. But wait, let me remind you, our replication is asynchronous. I check the master's settings:

synchronous_commit=on
synchronous_standby_names="*"

Great! It turns out that at some point synchronous replication had been enabled in the settings of the new master. Where these settings came from remains a mystery. This incident prompted us to add metrics and alerts to our monitoring for changes in the PostgreSQL configuration.
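
A quick way to find out where a particular setting currently comes from is pg_settings, which records the source (configuration file, ALTER SYSTEM, and so on); a sketch:

SELECT name, setting, source, sourcefile, sourceline
FROM pg_settings
WHERE name IN ('synchronous_commit', 'synchronous_standby_names');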

I simply clear the list of synchronous standbys, and the applications speed up immediately.
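
The clearing itself can be done directly on the master; a minimal sketch, assuming the value was set outside the Patroni-managed configuration (if Patroni manages it, patronictl edit-config is the place to change it). An empty list disables waiting for synchronous standbys:

ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();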

Everything is good, everything is great, but we are not done yet.

The switchover back to the original master is still ahead, and I really don't want to stumble a second time. Could synchronous replication, which has just been making the applications' life miserable, actually help us get through the switchover?

The plan came out something like this:

  1. Wait for postgres-0 to finish re-syncing.

  2. Re-sync postgres-2 so that the applications have a working replica.

  3. Set synchronous_standby_names="postgres-0" on the master.

  4. Perform the switchover.

My assumption is that in this case the master will close transactions and perform the checkpoint only once the switchover candidate has confirmed them on its side.
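
For completeness, the switchover itself is started with patronictl; a sketch of what the invocation might look like here, with the cluster and node names used in this article (the config path is an assumption, and older Patroni versions use --master instead of --leader):

patronictl -c /etc/patroni/patroni.yml switchover main --leader postgres-1 --candidate postgres-0 --force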

The finale

The reconfigured replica postgres-0 has come up and caught up with the master.

I enable synchronous replication:

patronictl edit-config
synchronous_standby_names="postgres-0"

I check the replication status in the database. Nothing.
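
The check itself is a look at pg_stat_replication on the master: when synchronous replication is in effect, sync_state shows sync or potential instead of async.

SELECT application_name, state, sync_state
FROM pg_stat_replication;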

In the Patroni log there is an error about invalid parameter usage. PostgreSQL doesn't like hyphens (-): you can't simply name columns, tables, databases, or servers with a hyphen in them. For this to work, the name has to be wrapped in double quotes ("postgres-0"). But Patroni knows nothing about this.
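
For reference, in plain postgresql.conf syntax a standby name containing a hyphen has to be double-quoted inside the string for PostgreSQL to accept it:

# note the inner double quotes around the hyphenated name
synchronous_standby_names = '"postgres-0"'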

I set all replicas to be synchronous: synchronous_standby_names='*'. I launch the switchover. Everything goes smoothly, as planned.

Results

If your Patroni cluster doesn't have fast enough connectivity between nodes, you may run into the same problem during a switchover or failover. There's not much you can do about failover, but a switchover can be made a bit safer:

  1. Switch the replicas to synchronous mode and disable automatic pg_rewind (see the configuration sketch after this list).

  2. Check in journalctl -u patroni that the changes have been applied.

  3. Make sure the replicas don't lag behind.

  4. Do a switchover.
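
As an illustration of the first item, the corresponding knobs in the Patroni dynamic configuration (edited with patronictl edit-config) might look roughly like this; treat it as a sketch rather than an exact recipe, and note that Patroni also has a native synchronous_mode setting as an alternative to managing synchronous_standby_names by hand:

# patronictl edit-config (dynamic configuration), sketch
postgresql:
  use_pg_rewind: false
  parameters:
    synchronous_commit: 'on'
    synchronous_standby_names: '*'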

