Storing data in Postgresql

Disclaimer

The speed of writing this text was influenced by such insurmountable circumstances as: summer, barbecue mood, sun and laziness abundance of work. Perhaps this information has already been published in various variations over the past few months, but I honestly haven't seen it.

Main part

In this text I would like to take a closer look at data storage in PostgreSQL at the physical level.

First, let's define some common things. Data are stored in tablesthe tables are in schemesthe schemes, in turn, in databases. By data here I mean one or more lines. As an example, we will consider the standard of criticism, in my personal opinion, the quotes of Linus Torvalds

Commonly known things

Commonly known things

This is what we have lost. Real criticism.

This is what we have lost. Real criticism.

Quotes from a discussion of O_DIRECT between a bunch of smart guys (google what it is yourself).
(https://yarchive.net/comp/linux/o_direct.html)

I have loaded all the data into the database and now I plan to find the physical location of this masterpiece and at the same time understand what and how is located in this PostgreSQL of yours.

Hidden text
CREATE TABLE public.test_table (
	id int4 NULL,
	author varchar NULL,
	phrase varchar NULL
);

INSERT INTO test_table
VALUES 
(1
, 'Linus Torvalds'
, 'The thing that has always disturbed
me about O_DIRECT is that the whote interface is just stupid,
and was probabty designed by а deranged monkey on some serious mind—controlling substances'),
(2
, 'Linus Torvalds'
, 'Sadly, database реор1е don''t seem to have аnу understanding of good taste,
and various 0S реорlе end uр usually just saying “Yes, Мr Oracle“,
I''ll open up any orifice I have for уоur pleasure'); -- наши записи

INSERT INTO test_table 
SELECT ROW_NUMBER () OVER()+2
, gen_random_uuid()
FROM pg_catalog.generate_series(0,1500) --немного синтетики

Let's start with a general overview of the database and gradually move down to the level of individual rows.
First you need to find the data files used by the database cluster, usually they are stored in the cluster data directory, which is classically called PGDATA /var/lib/pgsql/data на Linux,
\Program Files\PostgreSQL\[version]\data на Windows)
.

This directory contains a bunch of subdirectories and several cluster-specific configuration files (https://postgrespro.ru/docs/postgresql/15/storage-file-layout). We will not analyze all the subdirectories, because there are quite a lot of them, but we will focus on those that will help find the data string.

Lots of subdirectories and several configuration files

Lots of subdirectories and several configuration files

pg_tblspc

First of all, all data is located on one of the physical media.

Tablespaces define the physical location of the data. In fact, a tablespace is This is a file system directory. For example, tablespaces can be used to place archived data on slow media and actively processed data on fast media.

https://postgrespro.ru/education/books/internals – PostgreSQL16 Internals
I recommend it to everyone

When a cluster is initialized, two spaces are created:

  • pg_defaultwhich “lies” in PGDATA/base and is used as the default space unless otherwise specified

  • pg_globalwhich “lies” in PGDATA/global and stores common system directory objects

In the subdirectory pg_tblspc there are symbolic links to directories with user table spaces. By following them, we will find directories of the type PGDATA/basewhich also contain subdirectories with database oids. If a database has objects located on different physical media, then a separate database subdirectory (with the same oid) is created for each space, and the files of the objects located in this space are added to each of these directories.

Hidden text
абличные SELECT * FROM pg_catalog.pg_tablespace;
Select

Select

File system (oid 2 additional spaces of disk E and D)

File system (oid 2 additional spaces of disk E and D)

base

Any database is a directory that is stored in PGDATA/base (the name most often corresponds to the oid of the object). Each of these directories contains files that are intended for storing and accessing data. For each DB object, there are one or more files. For example, for a regular table without indexes, there are three files (four if the table is UNLOGGED). Each of the files corresponds to one of the layers.

Hidden text
SELECT relfilenode , oid, relname
FROM pg_class 
WHERE relname="test_table";

Oid и relfilenode таблицы, которую мы создали вы 20508
I have quite a lot of objects in my DB (table files are highlighted)

I have quite a lot of objects in my DB (table files are highlighted)

The remaining files are other tables, TOAST files, indexes, sequences, materialized views.

Base layer
Files without postfixes (in our example /PostgreSQL/15.7/data/base/20508). The location where the data is actually located. Maximum file size 1 GB(can be increased during assembly by parameterwith-segsize), when it is reached, the next file of the same layer (segment) is created. The ordinal number is added to the end through a dot (if the test_table table grows more than 1 GB, a new file will be created 20508.1)

Free space map (free space map)
Postfix _fsm. This is a relation layer consisting of FSM pages that helps quickly find a page from the main layer to write a new row version. Maximum file size 1 GB(can be increased during assembly by parameterwith-segsize). To ensure fast searching of free space, the map is stored as a tree (generally in the form of many trees, but that's not what we're talking about today).

Hidden text

The purpose of the free space map is to quickly locate a page with enough
free space to hold a tuple to be stored; or to determine that no such page
exists and the relation must be extended by one page. As of PostgreSQL 8.4
each relation has its own, extensible free space map stored in a separate
“fork” of its relation.

Github link

To look inside the map there is a special module pg_freespacemap

--pg_freespace(rel regclass IN, blkno OUT bigint, avail OUT int2)

Выдаёт объём свободного пространства на каждой странице отношения
, согласно FSM. Возвращается набор кортежей (blkno bigint, avail int2)
, по одному кортежу для каждой страницы в отношении.

SELECT * FROM pg_freespace('test_table');

blkno|avail|
-----+-----+
    0|   32|
    1|   32|
    2|   32|
    3|   32|
    4|   32|
    5|   32|
    6|   32|
    7|   32|
    8|   32|
    9|   32|
   10|   32|
   11|   32|
   12|   32|
   13|   32|
   14|   32|
   15|  256|

There is a more specific and complex article written about FSM. If you are interested, you can read it

Visibility map (visibility map)

Postfix _vm. This layer allows you to determine whether you need to clean or freeze the page. Also helps with index only scan (when a transaction tries to read a row from such a page, it does not have to check its visibility, which allows using an index-only scan).Files of this layer are usually quite small. Each page is allocated 2 bits (1 bit – all row versions are up-to-date, 2 bits – all row versions are frozen). GitHub Link
Visibility map is not created for indices.

Pages

Each file is logically divided into blocks(pages)- This is the minimum amount of data that is read or written.. This is done to optimize I/O operations. The page size is specified by a variable BLKSZby default it is 8 KB, maximum 32 KB (it can be configured during assembly, but this is done never rarely).
Each relation (table, index…) is stored as an array of such pages. They fill the file until it reaches the maximum size (SEGSIZE), after which a new segment is created, and the cycle continues.

Regardless of which layer the files belong to, they are used by the server in roughly the same way. Pages are first placed in the buffer cache (where processes can read and modify them), and then flushed back to disk when needed.

Inside the page

The page is also divided into parts. Link to code with more detailed description

Headline

First comes the title, which takes up 24 bytes, and stores general information.

  • First 2 blocks: the last WAL entry associated with this page(8 bytes) and page checksum (2 bytes)

  • Next 2 bytes flags:

    • PD_HAS_FREE_LINES 0x0001. If this flag is set, it means that there may be free space on the page that can be used to store new tuples without having to split the page.

    • PD_PAGE_FULL 0x0002 .Set when the page is completely filled with data.

    • PD_ALL_VISIBLE 0x0004 .Set when all records on the page are visible to all transactions.

    • PD_VALID_FLAG_BITS 0x0007 .A mask that defines the allowed values ​​for a bit field.

  • Next come 3 blocks 2 bytes, indicating displacement:

    • pd_lower. Offset to the beginning of free space

    • pd_upper.Offset to the end of free space

    • pd_special .Offset to the beginning of the special space (to the end of the page)

  • Below is information about the page size and layout version number (2 bytes)

  • And at the end is the oldest uncleaned xmax identifier on the page, or zero if there is none (necessary for VACUUM optimization)

SELECT * 
FROM page_header(get_raw_page('test_table',0)); 

lsn        |checksum|flags|lower|upper|special|pagesize|version|prune_xid|
-----------+--------+-----+-----+-----+-------+--------+-------+---------+
40/CED9E4C8|    -905|    4|  432|  480|   8192|    8192|      4|0        |

Pointers

The title is followed by array of pointers on the line pointers version. Each pointer takes 4 bytes and contains:

  • lp_off . Offset of line relative to the beginning of the page

  • lp_flags. Set of pointer state flags

    • LP_UNUSED |0 |/* unused (should always have lp_len=0)*/

    • LP_NORMAL |1 |/* used (should always have lp_len>0) */

    • LP_REDIRECT|2 |/* HOT redirect (should have lp_len=0) */

    • LP_DEAD |3 |/* dead, may or may not have storage */

  • lp_len . Line length

Link to the source code for further immersion

SELECT lp, lp_off, lp_flags, lp_len 
FROM heap_page_items(get_raw_page('test_table',0)) sub

lp |lp_off|lp_flags|lp_len|
---+------+--------+------+
  1|  7944|       1|   241|--тут есть фраза 
  2|  7680|       1|   261|--тут есть фраза
  3|  7608|       1|    65|--тут только имя 
  .
102|   480|       1|    65|
Оффсет уменьшается так как записи добавляются с конца к началу

String versions

After the pointers to the actual data (row versions) comes free space block. Actually, there is nothing particularly interesting here, no fragmentation, a single block of empty space. The presence of free space is precisely what is noted in map of free space.

The row versions are placed at the end of the free space, and the pointers are placed at the beginning.

Hidden text
Слотированная страница
*
 * +--------------------+---------------------------------------+
 * | Заголовок страницы | Указатель1 Указатель2 Указатель3...   |
 * +--------------------+---------------------------------------+
 * | ... УказательN |		            						|
 * +------------------------------------------------------------+
 * |		   		^ Смещение до начала свободного пространства|
 * |												 		    |
 * |			 v Смещение до конца свободного пространства    |
 * +------------------------------------------------------------+
 * |			 | кортежN ...                        		    |
 * +------------------------------------------------------------+
 * |	   ... кортеж3 кортеж2 кортеж1 | "Специальная область"  |
 * +--------------------+---------------------------------------+
 *						^ pd_special			
 *

A page is considered full when nothing more can be added between the start and end of free space markers.

Now comes the most interesting part – row versions (tuples, rows, tuples).

In the case of tables, we don't just talk about rows, but about row versions (tuples), since multiversioning implies that there are multiple versions of the same row. Indexes are not multiversioned; instead, indexes reference all possible table versions of rows, and visibility rules select the appropriate ones.

The row version consists of a header and the data itself. The record header (tuple_header) is quite heavy (minimum 23 bytes), it contains service information about the record.

  • t_xmin A field that is on par with xmax plays a key role in ensuring multi-version concurrency (MVCC). xmin (minimum transaction ID) is the identifier of the transaction that created the record.

  • t_xmax (maximum transaction ID) is the identifier of the transaction that deleted or updated the record.

  • t_field.The field is divided between three virtual fields. Cmin,Cmax,Xvac

    We store five “virtual” fields Xmin, Cmin, Xmax, Cmax, and Xvac in three physical fields. Xmin and Xmax are always really stored, but Cmin, Cmax and Xvac share a field. This works because we know that Cmin and Cmax are only interesting for the lifetime of the inserting and deleting transaction respectively. If a tuple is inserted and deleted in the same transaction, we store a “combo” command id that can be mapped to the real cmin and cmax, but only by use of local state within the originating backend. See combocid.c for more details. Meanwhile, Xvac is only set by old-style VACUUM FULL, which does not have any command sub-structure and so does not need either Cmin or Cmax.

    link

  • t_ctid. Physical location of the record (block number, offset)

  • t_infomask. Various flags

    /*
     * information stored in t_infomask:
     */
    #define HEAP_HASNULL			0x0001	/* Наличие NULL */
    #define HEAP_HASVARWIDTH		0x0002	/* Наличие атрибутов переменной длины */
    #define HEAP_HASEXTERNAL		0x0004	/* Хранится ли что-то в TOAST */
    #define HEAP_HASOID_OLD			0x0008	/* Есть ли поле OID */
    #define HEAP_XMAX_KEYSHR_LOCK	0x0010	/* Наличие key_share блокировки */
    #define HEAP_COMBOCID			0x0020	/* t_cid является combo CID*/
    #define HEAP_XMAX_EXCL_LOCK		0x0040	/* Эксклюзивная блокировка */
    #define HEAP_XMAX_LOCK_ONLY		0x0080	/* Транзакция, установившая xmax, является единственным владельцем записи. */
    
     /* xmax is a shared locker */
    #define HEAP_XMAX_SHR_LOCK	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
    
    #define HEAP_LOCK_MASK	(HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
    						 HEAP_XMAX_KEYSHR_LOCK)
    #define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
    #define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
    #define HEAP_XMIN_FROZEN		(HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
    #define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
    #define HEAP_XMAX_INVALID		0x0800	/* t_xmax invalid/aborted */
    #define HEAP_XMAX_IS_MULTI		0x1000	/* t_xmax это MultiXactId */
    #define HEAP_UPDATED			0x2000	/* Это обновленная версия строки*/
    #define HEAP_MOVED_OFF			0x4000	/* Запись перемещена в другое место вакуумом*/
    #define HEAP_MOVED_IN			0x8000	/* Запись перемещена из другого место вакуумом */
    #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
    
    #define HEAP_XACT_MASK			0xFFF0	/* Маска видимости */

    Another link to the source code

  • t_infomask2 Some more flags

    /*
     * Информация сохраняемая в t_infomask2:
     */
    #define HEAP_NATTS_MASK			0x07FF	/* 11 бит для количества аттрибутов */
    /* bits 0x1800 are available */
    #define HEAP_KEYS_UPDATED		0x2000	/* строка была обновлена или удалена */
    #define HEAP_HOT_UPDATED		0x4000	/* строка была HOT-updated */
    #define HEAP_ONLY_TUPLE			0x8000	/* heap-only строка */
    
    #define HEAP2_XACT_MASK			0xE000	/* биты видимости аттрибутов */
    

    Link

  • t_hoff (tuple header offset) A field that specifies the offset needed to access the first data in a tuple. In other words, it is a kind of “pointer” to the start of the data payload within the tuple.

  • t_bits Bitmap of NULL values. May be absent if there is no NULL.

SELECT
	t_xmin,
	t_xmax,
	t_field3,
	t_ctid,
	t_infomask2,
	t_infomask,
	t_hoff,
	t_bits
FROM
	heap_page_items(get_raw_page('test_table',0)) sub

t_xmin|t_xmax|t_field3|t_ctid |t_infomask2|t_infomask|t_hoff|t_bits  |
------+------+--------+-------+-----------+----------+------+--------+
110286|0     |       0|(0,1)  |          3|      2306|    24|        |
110286|0     |       0|(0,2)  |          3|      2306|    24|        |--нет NULL
113319|0     |       0|(0,3)  |          3|      2307|    24|11000000|--третий атрибут NULL 
113319|0     |       0|(0,4)  |          3|      2307|    24|11000000|
113319|0     |       0|(0,5)  |          3|      2307|    24|11000000|
113319|0     |       0|(0,6)  |          3|      2307|    24|11000000|
.
113319|0     |       0|(0,102)|          3|      2307|    24|11000000|

Next comes the actual data, which is stored in hexadecimal and looks threatening.

SELECT tt.author, tt.phrase
FROM heap_page_items(get_raw_page('test_table',0)) sub
JOIN test_table tt 
	ON tt.ctid = sub.t_ctid; -- не надо так делать, ctid может не соответствовать
And here is the data we were looking for.

And here is the data we were looking for.

Conclusion

I hope the material was interesting and you spent these 15 minutes usefully. I would like to finish with words from the book Postgres Internalswhich I think are relevant not only for Postgres:

“Think about it, experiment, check all the information yourself.”

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *