How to fit “free” PostgreSQL into a harsh enterprise environment

Many are familiar with PostgreSQL, and it has proven itself in small installations. But the trend towards Open Source keeps growing, even in large companies with enterprise requirements. In this article we will show how to embed Postgres in a corporate environment and share our experience of building a backup system for this DBMS, using Commvault as an example.

PostgreSQL has already proven its worth – it works well, it is used by trendy digital businesses such as Alibaba and TripAdvisor, and the absence of license fees makes it a tempting alternative to monsters like MS SQL or Oracle DB. But as soon as we start thinking about PostgreSQL in an enterprise landscape, we immediately run into strict requirements: “What about fault tolerance of the configuration? Disaster recovery? Where is the comprehensive monitoring? What about automated backups? What about tape libraries, both as primary and as secondary storage?”

On the one hand, PostgreSQL lacks the built-in backup facilities that “grown-up” DBMSs come with, such as Oracle’s RMAN or SAP’s database backup tools. On the other hand, the vendors of enterprise backup systems (Veeam, Veritas, Commvault), although they do support PostgreSQL, in practice only work with a specific (usually standalone) configuration and with a set of assorted restrictions.

Backup tools designed specifically for PostgreSQL, such as Barman, WAL-G and pg_probackup, are very popular in small PostgreSQL installations, or where heavy backups of other parts of the IT landscape are not needed. Besides PostgreSQL, however, an infrastructure may include physical and virtual servers, OpenShift, Oracle, MariaDB, Cassandra and so on, and all of it should be backed up with a common tool. Deploying a separate solution just for PostgreSQL is a bad idea: the data is first copied to disk somewhere and then has to be moved to tape. This double handling increases the backup time and, more critically, the recovery time.

In an enterprise landscape, what gets backed up is a clustered installation with several nodes. Commvault, for example, can only work with a two-node cluster in which Primary and Secondary are rigidly pinned to specific nodes. And it only makes sense to back up from the Primary, because backup from the Secondary has its limitations: due to the way the DBMS works, a dump cannot be taken on the Secondary, which leaves only file-level backup.

To mitigate downtime risks, a fault-tolerant setup uses a live cluster configuration in which the Primary can migrate between servers. For example, Patroni itself launches the Primary on a randomly selected cluster node. A backup system has no way of tracking this out of the box, and if the configuration changes, its processes break. In other words, introducing external cluster management keeps the backup system from working effectively, because its management server simply no longer knows which node the data must be copied from.
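
To give an idea of how such tracking can be done at all: Patroni exposes a REST API that reports each member's role, so in principle a backup system can poll it to find the current Primary. Below is a minimal sketch of that idea (not our production integration); the node addresses are placeholders and 8008 is simply Patroni's default REST port.

    # Minimal sketch: ask each Patroni node for its role and return the Primary.
    # Node IPs are placeholders; 8008 is Patroni's default REST API port.
    import requests

    CLUSTER_NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical addresses
    PATRONI_PORT = 8008

    def find_primary(nodes=CLUSTER_NODES, port=PATRONI_PORT, timeout=2):
        """Return the address of the current Primary, or None if none answered."""
        for node in nodes:
            try:
                # GET / on the Patroni REST API returns the member's status as JSON
                status = requests.get(f"http://{node}:{port}/", timeout=timeout).json()
                if status.get("role") in ("master", "primary"):  # value depends on Patroni version
                    return node
            except requests.RequestException:
                continue  # node unreachable, try the next one
        return None

    if __name__ == "__main__":
        print("Current Primary:", find_primary())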

Another problem is how backup is implemented in Postgres itself. A dump is possible, and it works fine on small databases. But on large databases the dump takes a long time, consumes a lot of resources and can even bring down the database instance.
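
For reference, a logical dump of a single database boils down to a pg_dump call; the sketch below shows the idea (database name, role and output path are placeholders).

    # Minimal sketch: a logical backup of one database with pg_dump.
    # The custom format allows selective restore later with pg_restore.
    import subprocess

    subprocess.run(
        [
            "pg_dump",
            "--format=custom",
            "--file=/backup/appdb.dump",   # placeholder output path
            "--username=backup_user",      # placeholder role
            "appdb",                       # placeholder database name
        ],
        check=True,  # raise if pg_dump exits with an error
    )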

File-level backup improves the situation, but on large databases it is still slow because it runs in a single thread. On top of that, vendors impose further restrictions: either file and dump backups cannot be used together, or deduplication is not supported. There are plenty of problems, and it often ends up being easier to pick an expensive but proven DBMS instead of Postgres.
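
The file-level alternative is a physical copy of the whole instance, for example with pg_basebackup, which streams everything in a single pass; a minimal sketch follows (paths and the role name are placeholders).

    # Minimal sketch: a physical (file-level) backup of the whole instance
    # with pg_basebackup. Paths and the role name are placeholders.
    import subprocess

    subprocess.run(
        [
            "pg_basebackup",
            "--pgdata=/backup/base",      # target directory for the copy
            "--format=tar",               # write tar archives
            "--gzip",                     # compress them
            "--wal-method=stream",        # also stream the WAL needed for a consistent restore
            "--username=replication_user",
        ],
        check=True,
    )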

Nowhere to retreat: behind us is Moscow (and the developers)!

Recently, though, our team faced exactly this challenge: in the AIS OSAGO 2.0 project, where we built the IT infrastructure, the developers chose PostgreSQL for the new system.

It is much easier for large software companies to adopt “trendy” open-source solutions: Facebook, for instance, has enough specialists to keep this DBMS running. In the case of RSA (the Russian Association of Motor Insurers), however, all the “day two” tasks fell on our shoulders: we had to provide fault tolerance, assemble a cluster and, of course, set up backups. The logic of our actions was as follows:

  • Teach the backup system to take backups from the Primary node of the cluster. To do that it has to find that node, which means integrating with whatever solution manages the PostgreSQL cluster. In our case, Patroni was used for this.
  • Decide on the backup type based on the data volume and the recovery requirements: where granular restore is needed, use a dump; where the databases are large and granular restore is not required, work at the file level (the sketch after this list illustrates this choice).
  • Add block-level backup to the solution to get multi-threaded backups.
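
To illustrate the second step, here is a sketch of how the choice between a dump and a file-level backup could be automated by looking at the database size. The threshold, connection parameters and the function itself are assumptions for illustration, not code from the project.

    # Illustrative sketch only: pick the backup type from the database size.
    # The threshold and connection parameters are assumptions.
    import psycopg2

    SIZE_THRESHOLD_GB = 500  # hypothetical cut-off between "dump" and "file" backup

    def pick_backup_type(dbname, **conn_params):
        with psycopg2.connect(dbname=dbname, **conn_params) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT pg_database_size(%s)", (dbname,))
                size_gb = cur.fetchone()[0] / 1024**3
        # Small database where granular restore matters -> logical dump;
        # large database restored as a whole -> file-level backup.
        return "dump" if size_gb < SIZE_THRESHOLD_GB else "file"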

From the start we aimed for an effective and simple system without a monstrous tangle of extra components: the fewer crutches, the lower the workload on staff and the smaller the risk of the backup system failing. We immediately ruled out approaches that combined Veeam with RMAN, because needing two products at once already hints at an unreliable system.

A bit of magic for an enterprise

So, we had to guarantee reliable backups for 10 clusters of 3 nodes each, with the same infrastructure mirrored in the backup data center. As far as PostgreSQL is concerned, the data centers work in active-passive mode. The total volume of databases was 50 TB. Any enterprise-level backup system can handle that amount easily. The catch is that out of the box Postgres has no hook for full, deep integration with backup systems, so we had to look for the solution that offered the most functionality with PostgreSQL to begin with, and then refine it.

We ran three internal “hackathons”: we looked at more than fifty solutions, tested them, adjusted them to our hypotheses and tested them again. After weighing the options we chose Commvault. Out of the box it could work with the simplest clustered PostgreSQL installation, and its open architecture gave us hope (which came true) of successfully extending and integrating it. Commvault can also back up PostgreSQL logs, whereas Veritas NetBackup, for example, can only take full backups of PostgreSQL.

A bit more about the architecture. Commvault management servers were installed in each of the two data centers in a CommServe HA configuration. The system is mirrored, managed through a single console and, from an HA point of view, meets all enterprise requirements.

We also deployed two physical media servers in each data center and connected them over a Fibre Channel SAN to disk arrays and tape libraries dedicated to backups. Stretched deduplication databases provide resiliency for the media servers, and connecting every server to every CSV allows operation to continue if any single component fails. The architecture lets backups carry on even if one of the data centers goes down.

Patroni determines the Primary node for each cluster. It can be any available node, but only in the main data center; in the backup data center all nodes are Secondary.

For Commvault to know which cluster node is the Primary, we integrated it (thanks to the solution’s open architecture) with Postgres: we wrote a script that reports the current location of the Primary node to the Commvault management server.

In general, the process looks like this:

Patroni selects the Primary → Keepalived brings up the cluster IP and runs the script → the Commvault agent on the selected cluster node is notified that it is the Primary → Commvault automatically reconfigures the backup within the pseudo-client.

The advantage of this approach is that it affects neither data consistency, nor log correctness, nor the recovery of the Postgres instance. It also scales easily, because the Primary and Secondary nodes no longer have to be pinned for Commvault: it is enough that the system knows where the Primary is, and the number of nodes can grow to almost any value.

The solution does not claim to be ideal and has its own nuances. Commvault can only back up an instance as a whole, not individual databases, so a separate instance was created for each database. The real clients are grouped into virtual pseudo-clients: each Commvault pseudo-client is a UNIX cluster to which we add the cluster nodes that have the Commvault agent for Postgres installed. As a result, all virtual nodes of a pseudo-client are backed up as a single instance.

Within each pseudo-client the active node of the cluster is specified, and this is exactly what our integration with Commvault determines. The principle is quite simple: if the cluster IP comes up on a node, the script sets the “active node” parameter in the Commvault agent binary – in effect, it writes “1” into the required place in memory. The agent passes this on to CommServe, and Commvault takes the backup from the right node. The script also validates the configuration, which helps avoid errors when starting a backup.
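
To give a feel for the shape of such a hook, here is a hypothetical sketch of a Keepalived notify handler. Everything Commvault-specific is hidden behind a placeholder function, because the real call depends on the agent installation; the cluster IP is just an example.

    #!/usr/bin/env python3
    # Hypothetical sketch of a Keepalived notify handler. Keepalived invokes the
    # notify script when the VRRP state changes; if this node has just become
    # MASTER and really holds the cluster IP, we flag it as active for Commvault.
    # mark_node_active_in_commvault() is a placeholder for the real, agent-specific call.
    import subprocess
    import sys

    CLUSTER_IP = "10.0.0.100"  # hypothetical virtual IP managed by Keepalived

    def node_holds_cluster_ip(ip=CLUSTER_IP):
        """Check whether the cluster IP is currently configured on this host."""
        out = subprocess.run(["ip", "-o", "addr", "show"], capture_output=True, text=True)
        return ip in out.stdout

    def mark_node_active_in_commvault():
        """Placeholder: set the 'active node' flag for the local Commvault agent."""
        pass  # intentionally omitted, depends on the Commvault installation

    if __name__ == "__main__":
        # Keepalived passes the new state (MASTER/BACKUP/FAULT) among the script's arguments
        if "MASTER" in sys.argv and node_holds_cluster_ip():
            mark_node_active_in_commvault()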

Large databases are backed up at the block level in several threads, which meets the RPO and backup-window requirements. The load on the system is insignificant: full copies are taken relatively rarely, and on the other days only logs are collected, during low-load periods at that.

Incidentally, we applied separate policies for backing up PostgreSQL archived logs: they are stored under different rules, copied on a different schedule, and deduplication is disabled for them, since these logs contain unique data.
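
Before relying on log backups, it is worth making sure WAL archiving itself is keeping up; PostgreSQL exposes this in the pg_stat_archiver view. A minimal sketch of such a check (connection parameters are placeholders):

    # Minimal sketch: check WAL archiving health via pg_stat_archiver.
    # Connection parameters are placeholders.
    import psycopg2

    with psycopg2.connect(host="localhost", dbname="postgres", user="monitor") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT archived_count, failed_count, last_archived_time "
                "FROM pg_stat_archiver"
            )
            archived, failed, last_archived = cur.fetchone()
            print(f"archived={archived} failed={failed} last={last_archived}")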

To keep the entire IT infrastructure consistent, separate Commvault file clients are installed on each cluster node. They exclude Postgres files from their backups and are intended only for OS and application backups. This data has its own policy and its own retention period as well.

For now the backup system does not affect production services, but if that changes, load throttling can be enabled in Commvault.

Is it good? Good!

So we ended up with not just a workable but a fully automated backup for a clustered PostgreSQL installation that meets all enterprise requirements.

The RPO and RTO targets of 1 hour and 2 hours are met with a margin, which means the system will keep meeting them even as the volume of stored data grows significantly. Despite many doubts, PostgreSQL and the enterprise environment are quite compatible, and we now know from our own experience that backups of this DBMS are possible in a wide variety of configurations.

Of course, along the way we had to wear out seven pairs of iron boots, overcome a number of difficulties, step on a few rakes and fix a number of mistakes. But the approach has now been proven and can be used to bring Open Source in place of proprietary DBMSs into the harsh enterprise environment.

Have you tried PostgreSQL in a corporate environment?

Authors:

Oleg Lavrenov, Design Engineer of Data Storage Systems, Jet Infosystems

Dmitry Erykin, Design Engineer of Computing Systems, Jet Infosystems
