A little about Durability in Postgres. Part 1
As you know, many relational databases, and PostgreSQL in particular, promise that our transactions will satisfy the ACID criteria (Atomicity, Consistency, Isolation, Durability), provided certain settings are properly configured.
Despite the title of the text, and with apologies to the reader in advance, we will postpone the conversation about Durability itself, the property that ensures data survives a physical hardware failure: first of all a power outage, but also disk damage, network failure, and other disasters. We will return to it towards the end of the first part and talk about it in more detail in the second.
Why not talk about it right now? Because first we would like to cover how a query is executed in principle. We will talk about a single Postgres machine, without touching on replication and replica consistency.
Let's imagine we have a database with a simple table:
CREATE TABLE movies (
    movie_id SERIAL PRIMARY KEY,
    title text
);
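So that the UPDATE below has a row to modify, let's assume the table already contains some data. The IDs match the page diagrams shown later in the text; the titles are invented purely for illustration:

INSERT INTO movies (movie_id, title) VALUES
    (23, 'Alien'),
    (71, 'The Lord of The Ring'),  -- the old title that our UPDATE will overwrite
    (91, 'Interstellar');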
A query arrives from the client:
UPDATE movies SET title = 'The Lord of The Rings' WHERE movie_id = 71;
Let's figure out what happens to the query next. Let me clarify right away that the query pipeline is full of many different mechanisms and nuances; here we highlight the key ones that cannot be ignored.
Parser
Inside Postgres, the query is passed to the parser. The parser is the mechanism that converts the string representation of an SQL query into a tree data structure, an AST (Abstract Syntax Tree). The parser tokenizes the SQL query, breaks it into its component parts, identifies the key elements, and builds a query tree.
This is what the tree that the parser will build looks like:
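(A heavily simplified sketch: the node names below are real PostgreSQL parse-tree node types, but most of their fields are omitted.)

UpdateStmt
 ├── relation: RangeVar (relname = "movies")
 ├── targetList
 │    └── ResTarget (name = "title", val = A_Const('The Lord of The Rings'))
 └── whereClause
      └── A_Expr (name = "=")
           ├── lexpr: ColumnRef ("movie_id")
           └── rexpr: A_Const (71)

The real tree is far more verbose; you can see it for yourself by setting debug_print_parse = on and looking at the server log.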
Rewrite
The query tree constructed by the parser is passed to the rewriter. This mechanism can modify the tree if any Rules exist for the given operation on the given table.
For example, there might be a rule like this (albeit a contrived one):
CREATE RULE update_movies AS ON UPDATE TO movies DO INSTEAD
    UPDATE movies SET title = 'Matrix'
    WHERE movie_id = 71;
As a result, at the later stages a query already rewritten according to the rule would be executed, and for that the tree has to be rebuilt. (Strictly speaking, Postgres would reject this particular rule at execution time with an infinite-recursion error, since it fires an UPDATE on the very table it is attached to, one more reason the example is contrived.) We will not rebuild anything and will pretend that in our case no such rule exists; we mention it only to describe the mechanism as an integral part of the pipeline.
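By the way, the rules defined on a table can be inspected through the pg_rules system view (a minimal sketch, assuming the rule from the example above was actually created):

SELECT rulename, definition
FROM pg_rules
WHERE tablename = 'movies';

-- and to get rid of the contrived rule:
DROP RULE update_movies ON movies;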
Planner
Next, the final query tree goes to the planner (sometimes called the optimizer). The planner is responsible for optimizing the query. We can see the strategy that the planner will choose in each specific case using EXPLAIN:
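For our UPDATE the plan looks roughly like this (the costs are illustrative and will differ on your machine):

EXPLAIN UPDATE movies SET title = 'The Lord of The Rings' WHERE movie_id = 71;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Update on movies  (cost=0.15..8.17 rows=0 width=0)
   ->  Index Scan using movies_pkey on movies  (cost=0.15..8.17 rows=1 width=38)
         Index Cond: (movie_id = 71)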
Executor
Once the plan is formed, the query is sent for execution. Execution proceeds strictly according to the plan generated by the planner.
In our particular case, the planner determined that there is an index on the primary key (movie_id), and that the relevant parts of the index (called pages, 8 KB by default in Postgres) need to be loaded from disk into RAM, since they contain the entry we are interested in.
The index page looks something like this:
+-------------------------+ 0
| Page Header             | <-- Metadata, flags, page size, etc.
+-------------------------+ 24 bytes
| Index Entry 1           | <-- Key (indexed value) and a pointer to the row (TID).
+-------------------------+
| Index Entry 2           |
+-------------------------+
| ...                     |
+-------------------------+
| Index Entry N           |
+-------------------------+
| Free Space              | <-- Space not yet used, available for new entries.
+-------------------------+ Special Area size varies
| Special Area            | <-- Metadata (B-tree sibling pointers, etc.); in Postgres it sits at the end of the page.
+-------------------------+ 8192 bytes (8 KB), end of page
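You can peek at real index pages with the pageinspect extension that ships with Postgres (a sketch, assuming superuser rights; in a small B-tree, block 0 is the metapage and block 1 is typically the first page with entries):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Index entries of one page of the primary key index:
SELECT itemoffset, ctid, data
FROM bt_page_items('movies_pkey', 1);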
Once the entry with movie_id = 71 is found in the index, it gives us a pointer (a TID, tuple identifier) that identifies a specific data page (not to be confused with the index page shown above) containing the full row data, in our case (movie_id, title), plus an offset within that page, because a page holds more than one row (far more data fits into 8 KB).
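This pointer is even visible from SQL: every row exposes a hidden ctid column of the form (page number, offset within the page). The values below are illustrative:

SELECT ctid, movie_id, title FROM movies WHERE movie_id = 71;

 ctid  | movie_id |         title
-------+----------+-----------------------
 (0,2) |       71 | The Lord of The Rings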
Data page visualization:
+-------------------------+ 0
| Page Header             | <- Page metadata, flags, checksum.
+-------------------------+ 24
| Item Pointer Array      | <- Pointers to each row in the Row Data Area.
+-------------------------+ Variable length
| Free Space              | <- Free space for new rows.
+-------------------------+ Varies
| Row Data 3              | <- Full row data (movie_id=91, title="...")
+-------------------------+
| Row Data 2              | <- Full row data (movie_id=71, title="...")
+-------------------------+
| Row Data 1              | <- Full row data (movie_id=23, title="...")
+-------------------------+ 8192 bytes (8 KB)
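The same pageinspect extension can show the contents of a heap (data) page, including the item pointer array from the diagram (again a sketch: lp is the line pointer number, lp_off the row's offset within the page):

SELECT lp, lp_off, lp_len, t_ctid
FROM heap_page_items(get_raw_page('movies', 0));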
This data (not just the specific row we found, but the entire page the index pointed to, even before we locate the row at the offset the index gave us) is loaded into RAM, into a pre-reserved segment of memory that Postgres calls the shared buffers (the buffer pool). This is a large region of memory that primarily acts as a cache, storing pages of index and table data for frequently executed queries, or for the overlapping pages of different queries, so that repeated accesses can be served from memory instead of going to the disk device (SSD, sometimes HDD) every time.
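The size of this region is controlled by the shared_buffers setting, and the pg_buffercache extension (also shipped with Postgres) shows what is cached at the moment. A sketch; for example, we can count how many 8 KB buffers are currently occupied by pages of movies:

SHOW shared_buffers;   -- e.g. 128MB, the default

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('movies');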
A brief digression from our query.
The main thing being saved is time: disk operations are very expensive (slow compared to RAM; for an SSD the difference is tens or hundreds of times), but in a highly loaded system we can also talk about saving direct disk accesses themselves, because each device has its own limiting resource.
For example, server SSD drives often have an IOPS (input/output operations per second) limit in the region of tens of thousands of read operations per second, and several times fewer write operations. Besides the physical limit on operations per second, SSD drives are subject to wear, and their service life is finite. We will not delve into these issues (those interested can start exploring the topic with the term "write amplification") and will limit ourselves to this information, taking it on faith.
When the necessary data has been loaded into the shared buffers, the query executor (in our case, executing the UPDATE) makes the required changes. The output is a modified version of the page data affected by our UPDATE, so far only in memory. Now the old data on disk needs to be replaced with the new version.
And at this step we come to that same Durability. WAL comes into play.
The Postgres engine writes the changes first of all to the WAL (Write-Ahead Log), which provides durability, crash protection, and data recovery. For now, it is enough to think of it as a binary (not readable as text) log of all changes.
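A couple of built-in functions let you observe the WAL position (the values shown are illustrative):

SELECT pg_current_wal_lsn();                   -- current write position, e.g. 0/16B3B88
SELECT pg_walfile_name(pg_current_wal_lsn());  -- name of the 16 MB segment file holding it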
The long-awaited COMMIT
The data is written to a WAL segment (one of many files, 16 MB each by default) on disk. WAL records first accumulate in a special buffer in memory, but more on that later. Note that the modified data pages themselves are not written to the data files at this point: they remain dirty in the shared buffers and are flushed to the data files later, in the background (by the checkpointer and background writer processes). Data files are the physical representation of the DBMS's table data.
Once the transaction's WAL records are safely on disk, the commit happens: COMMIT. The data is now protected against failures, because even if the server crashes before the data files are updated, the changes can be replayed from WAL. Once committed, the transaction has completed successfully, and the modified version of the data is available for clients to read, just like any other operation.
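The sizes and settings involved can be checked directly (a sketch; the comments describe the usual defaults):

SHOW wal_segment_size;     -- 16MB: the size of one WAL segment file
SHOW wal_buffers;          -- the in-memory WAL buffer mentioned above
SHOW synchronous_commit;   -- on: COMMIT waits until the WAL records are flushed to disk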
Whether data is always written to WAL first, how this mechanism can be configured differently and optimized, what pitfalls exist and how to find the balance between reliability and speed, as well as how WAL itself works, we will discuss in the second part, which will be published tomorrow.