Part 1, or Commitfest 2024-07
This article opens a series covering the new features of the upcoming 18th version of PostgreSQL. Let's take a look at the following changes accepted in the July commitfest.
Planner: Hash Right Semi Join support
Planner: materializing the inner row set for nested loop joins in a parallel plan
Planner support functions for generate_series
EXPLAIN (analyze): per-worker statistics for the Parallel Bitmap Heap Scan node
Min and max functions for composite types
Parameter names for regexp* functions
Debug mode in pgbench
pg_get_backend_memory_contexts: path column instead of parent and new type column
pg_get_acl function
pg_upgrade: pg_dump optimizations
Predefined role pg_signal_autovacuum_worker
Planner: Hash Right Semi Join support
commit: aa86129e1
The following query plan in version 17 uses a hash join and scans both tables sequentially:
17=# EXPLAIN (costs off, analyze)
SELECT * FROM flights
WHERE flight_id IN (SELECT flight_id FROM ticket_flights);
QUERY PLAN
------------------------------------------------------------------------------------------------
Hash Join (actual time=2133.122..2195.619 rows=150588 loops=1)
Hash Cond: (flights.flight_id = ticket_flights.flight_id)
-> Seq Scan on flights (actual time=0.018..10.301 rows=214867 loops=1)
-> Hash (actual time=2132.969..2132.970 rows=150588 loops=1)
Buckets: 262144 (originally 131072) Batches: 1 (originally 1) Memory Usage: 7343kB
-> HashAggregate (actual time=1821.476..2114.218 rows=150588 loops=1)
Group Key: ticket_flights.flight_id
Batches: 5 Memory Usage: 10289kB Disk Usage: 69384kB
-> Seq Scan on ticket_flights (actual time=7.200..655.356 rows=8391852 loops=1)
Planning Time: 0.325 ms
Execution Time: 2258.237 ms
(11 rows)
However, the hash is built from the larger table, ticket_flights, which obviously takes more resources than building it from the small flights table.
In version 18, the planner can swap the hashed table: in the fourth line of the plan below we see Hash Right Semi Join, in this case in parallel mode as well. As a result, the execution time drops noticeably:
QUERY PLAN
---------------------------------------------------------------------------------------------------
Gather (actual time=56.771..943.233 rows=150590 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Right Semi Join (actual time=41.754..909.894 rows=50197 loops=3)
Hash Cond: (ticket_flights.flight_id = flights.flight_id)
-> Parallel Seq Scan on ticket_flights (actual time=0.047..221.511 rows=2797284 loops=3)
-> Parallel Hash (actual time=40.309..40.309 rows=71622 loops=3)
Buckets: 262144 Batches: 1 Memory Usage: 23808kB
-> Parallel Seq Scan on flights (actual time=0.008..6.631 rows=71622 loops=3)
Planning Time: 0.555 ms
Execution Time: 949.831 ms
(11 rows)
Planner: materializing the inner row set for nested loop joins in a parallel plan
commit: 22d946b0f
Prior to version 18, the planner did not consider materializing the inner row set of a nested loop join in parallel plans.
17=# EXPLAIN (costs off)
SELECT *
FROM ticket_flights tf
JOIN flights f USING (flight_id)
WHERE f.flight_id = 12345;
QUERY PLAN
----------------------------------------------------
Nested Loop
-> Index Scan using flights_pkey on flights f
Index Cond: (flight_id = 12345)
-> Gather
Workers Planned: 2
-> Parallel Seq Scan on ticket_flights tf
Filter: (flight_id = 12345)
(7 rows)
The plan for the same query in version 18 uses materialization of flights:
QUERY PLAN
--------------------------------------------------------------
Gather
Workers Planned: 2
-> Nested Loop
-> Parallel Seq Scan on ticket_flights tf
Filter: (flight_id = 12345)
-> Materialize
-> Index Scan using flights_pkey on flights f
Index Cond: (flight_id = 12345)
(8 rows)
It should be noted that query performance with the new plan has not changed much. In both cases, essentially the same actions are performed. But the planner now has more opportunities to choose the optimal plan in other situations, which is good news.
Planner support functions for generate_series
commit: 036bdcec9
A new support function tells the planner how many rows generate_series will return for a range of dates and times:
EXPLAIN
SELECT *
FROM generate_series(current_date, current_date + '1 day'::interval, '1 hour');
QUERY PLAN
---------------------------------------------------------------------
Function Scan on generate_series (cost=0.01..0.26 rows=25 width=8)
(1 row)
In previous versions, the number of rows was always estimated at 1000 (the default value of the ROWS parameter for functions).
Note that a similar support function for integer ranges has existed for a long time:
EXPLAIN
SELECT *
FROM generate_series(1, 42, 2);
QUERY PLAN
---------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..0.21 rows=21 width=4)
(1 row)
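For comparison, the default estimate of 1000 rows can still be observed with a set-returning function that has no support function. A minimal sketch (the my_series function below is made up purely for illustration):

CREATE FUNCTION my_series(n integer) RETURNS SETOF integer
LANGUAGE sql
AS 'SELECT generate_series(1, n)';

EXPLAIN SELECT * FROM my_series(5);
-- The Function Scan line shows rows=1000: without a support function
-- (or an explicit ROWS clause), the planner falls back to the default estimate.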
EXPLAIN (analyze): per-worker statistics for the Parallel Bitmap Heap Scan node
commit: 5a1e6df3b
The EXPLAIN command now shows statistics (the number of exact and lossy heap blocks) for each worker process participating in a parallel bitmap heap scan. In the following example, these are the lines starting with Worker 0 and Worker 1.
EXPLAIN (analyze, costs off, timing off, summary off)
SELECT count(*) FROM bookings
WHERE total_amount < 20000 AND book_date > '2017-07-15';
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Finalize Aggregate (actual rows=1 loops=1)
-> Gather (actual rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (actual rows=1 loops=3)
-> Parallel Bitmap Heap Scan on bookings (actual rows=6415 loops=3)
Recheck Cond: (total_amount < '20000'::numeric)
Filter: (book_date > '2017-07-15 00:00:00+03'::timestamp with time zone)
Rows Removed by Filter: 67072
Heap Blocks: exact=6345
Worker 0: Heap Blocks: exact=3632
Worker 1: Heap Blocks: exact=3470
-> Bitmap Index Scan on bookings_total_amount_idx (actual rows=220460 loops=1)
Index Cond: (total_amount < '20000'::numeric)
Min and max functions for composite types
commit: a0f1fce80
Values of composite types have long been comparable with each other; the comparison is performed element by element. For example, let's sort the rows of the following table:
CREATE TABLE seats(
line text,
number integer
);
INSERT INTO seats VALUES
('A', 42), ('B', 1), ('C', 27);
SELECT * FROM seats s ORDER BY s DESC;
line | number
------+--------
C | 27
B | 1
A | 42
(3 rows)
However, there were no min and max aggregate functions for composite types. They were added:
SELECT min(s.*), max(s.*) FROM seats s;
min | max
--------+--------
(A,42) | (C,27)
(1 row)
Parameter names for regexp* functions
commit: 580f8727
Functions for working with regular expressions now have parameter names for greater clarity:
SELECT regexp_replace(
pattern => '$',
replacement => 'Postgres!',
string => 'All You Need Is '
);
regexp_replace
---------------------------
All You Need Is Postgres!
The parameter names can be found in the documentation or in the output of psql's \df command. Previously, parameters could be passed only positionally.
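The names work for the other functions of the family as well. A small sketch with regexp_count (assuming its two-argument form received the names string and pattern too):

SELECT regexp_count(
    string  => 'All You Need Is Postgres!',
    pattern => 's'
);
-- counts the non-overlapping matches of the pattern in the string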
Debug mode in pgbench
commit: 3ff01b2b
Most server utilities use the -d option to specify the database to connect to. In pgbench, however, this option used to enable debug mode. For consistency, the pgbench options have been changed: the database can now be specified with -d or --dbname, and debug mode is enabled only by the long option --debug.
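A small illustration, assuming a benchmark database named bench has already been created:

$ pgbench -i bench                # initialize the standard pgbench tables
$ pgbench -d bench -c 4 -T 10     # -d/--dbname now selects the database
$ pgbench --debug -d bench        # debug output is available only via the long option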
The change has been available since version 17, although it was formally completed in the July commitfest of version 18.
pg_get_backend_memory_contexts: path column instead of parent and new type column
commit: 32d3ed81, f0d11275, 12227a1d
Server developers and those who like to examine memory usage will appreciate the changes to the pg_backend_memory_contexts view.
Linking parent and child contexts with each other has become more convenient and reliable. For this purpose, a path column has been added containing an array of context identifiers. The first element of the array is always the identifier of the top-level context, TopMemoryContext, and the last element is the identifier of the current row's context. In addition, for the convenience of writing queries, the numbering in the level column has been adjusted: it now starts at 1 rather than 0.
Here is an example query that shows the TopPortalContext memory context and all of its child contexts:
WITH memory_contexts AS (
SELECT * FROM pg_backend_memory_contexts
)
SELECT child.name, child.type, child.level, child.path, child.total_bytes
FROM memory_contexts parent, memory_contexts child
WHERE parent.name = 'TopPortalContext' AND
child.path[parent.level] = parent.path[parent.level];
name | type | level | path | total_bytes
--------------------------------+------------+-------+-----------------------+-------------
TopPortalContext | AllocSet | 2 | {1,20} | 8192
PortalContext | AllocSet | 3 | {1,20,31} | 1024
ExecutorState | AllocSet | 4 | {1,20,31,136} | 65632
tuplestore tuples | Generation | 5 | {1,20,31,136,138} | 32768
printtup | AllocSet | 5 | {1,20,31,136,139} | 8192
ExprContext | AllocSet | 5 | {1,20,31,136,140} | 8192
ExprContext | AllocSet | 5 | {1,20,31,136,141} | 8192
tuplestore tuples | Generation | 5 | {1,20,31,136,142} | 8192
ExprContext | AllocSet | 5 | {1,20,31,136,143} | 8192
Table function arguments | AllocSet | 5 | {1,20,31,136,144} | 8192
ExprContext | AllocSet | 5 | {1,20,31,136,145} | 32768
pg_get_backend_memory_contexts | AllocSet | 6 | {1,20,31,136,145,146} | 16384
(12 rows)
Previously, contexts were linked via the parent column (removed in the second commit), which referenced name. This required a more complex recursive query. More importantly, that combination was less reliable, since context names are not necessarily unique.
Additionally, a type column has been added to the view (third commit). Four context types are currently in use: AllocSet, Generation, Slab, and Bump. You can read more about memory contexts in the source code: src/backend/utils/mmgr/README.
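For example, the new type column makes it easy to see how much memory each kind of context occupies in the current session. A simple query sketch:

SELECT type, count(*) AS contexts,
       pg_size_pretty(sum(total_bytes)) AS total
FROM pg_backend_memory_contexts
GROUP BY type
ORDER BY sum(total_bytes) DESC;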
pg_get_acl function
commit: 4564f1ce, e311c6e5, d898665b
After executing the following commands, the test table becomes dependent on the alice role.
CREATE ROLE alice;
CREATE TABLE test (id int);
GRANT SELECT ON test TO alice;
Now the alice role cannot be dropped without first removing the dependency on the table:
DROP ROLE alice;
ERROR: role "alice" cannot be dropped because some objects depend on it
DETAIL: privileges for table test
Information about object dependencies is stored in two system catalogs: pg_depend (for dependencies between objects within the same database) and pg_shdepend (for dependencies involving shared cluster objects, such as roles, as in our case).
Right now, pg_shdepend contains exactly one row describing this dependency:
SELECT * FROM pg_shdepend\gx
-[ RECORD 1 ]-----
dbid | 5
classid | 1259
objid | 16528
objsubid | 0
refclassid | 1260
refobjid | 16527
deptype | a
Deciphering the information in pg_shdepend is not so easy, since objects are represented there by identifiers: the OID of the system catalog the object belongs to (classid), the OID of its row in that catalog (objid), and, for tables, the column number (objsubid). The referenced objects are represented by refclassid and refobjid, respectively (there are no references to table columns, so there is no refobjsubid column).
A number of functions are provided for conveniently working with objects by their identifiers. For example, the pg_identify_object function returns a description of an object:
SELECT * FROM pg_identify_object(1259,16528,0);
type | schema | name | identity
-------+--------+------+-------------
table | public | test | public.test
The object on which the table depends:
SELECT * FROM pg_identify_object(1260,16527,0);
type | schema | name | identity
------+--------+-------+----------
role | | alice | alice
This family of functions has gained an addition: the new pg_get_acl function shows the access privileges of an object without having to query a specific system catalog.
The following query will show exactly what privileges are granted to alice:
SELECT *,
pg_identify_object(classid, objid, objsubid) AS object,
pg_identify_object(refclassid, refobjid, 0) AS ref_object,
pg_get_acl(classid, objid, objsubid)
FROM pg_shdepend\gx
-[ RECORD 1 ]---------------------------------------------
dbid | 5
classid | 1259
objid | 16528
objsubid | 0
refclassid | 1260
refobjid | 16527
deptype | a
object | (table,public,test,public.test)
ref_object | (role,,alice,alice)
pg_get_acl | {postgres=arwdDxtm/postgres,alice=r/postgres}
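Since pg_get_acl returns an array of aclitem, it can be combined with the long-existing aclexplode function to unpack the individual privileges. A sketch for the same dependency row:

SELECT a.grantee::regrole, a.privilege_type, a.is_grantable
FROM pg_shdepend AS d,
     aclexplode(pg_get_acl(d.classid, d.objid, d.objsubid)) AS a
WHERE d.refobjid = 'alice'::regrole;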
pg_upgrade: pg_dump optimizations
commit: 6e16b1e4, 64f34eb2e, 2329cad1b, 23687e925, 68e962998, c8b06bb96, bd15b7db4, 6ebeeae29
Several optimizations have been made to the server upgrade process, and it makes sense to consider them together. All of them are related to the pg_dump utility, which is used during the upgrade to dump the contents of the system catalogs.
The first one: pg_upgrade now runs pg_dump with the --no-sync option. During an upgrade, there is no need to guarantee that files are flushed to disk, since if anything goes wrong, the entire procedure has to be repeated from scratch anyway. Skipping file synchronization therefore saves time.
Thanks to the second optimization, database creation commands will use the FILE_COPY strategy instead of the default WAL_LOG:
CREATE DATABASE .. STRATEGY = FILE_COPY;
With this strategy, the contents of the template database are not written to WAL; instead, reliability is guaranteed by performing checkpoints before and after copying the files. In normal operation this is less efficient, but in binary upgrade mode the server will not perform the unnecessary checkpoints (only one will be required, after creating the template0 database).
The effectiveness of these two optimizations will increase with the number of databases in the cluster being upgraded.
The third optimization speeds up dumping table definitions in pg_dump's --binary-upgrade mode, which is exactly the mode in which pg_upgrade runs pg_dump. The speedup comes from pg_dump fetching information about all tables with a single query against pg_class at the start and storing it in a sorted array in memory. In previous versions, a separate query against pg_class was executed for every table being dumped, which wasted a lot of resources.
The fourth optimization speeds up dumping sequences in the same way: at startup, pg_dump stores the sequence information in a sorted array, so no further system catalog lookups are needed. Unlike the previous one, this optimization works not only in --binary-upgrade mode but also in regular mode.
The third and fourth optimizations are not entirely free: additional RAM is needed to store the table and sequence information. But the developers are confident that this overhead is well justified.
In databases with a small number of tables and sequences, the last two optimizations will not have much effect. But in systems with tens or hundreds of thousands of tables and sequences, the speedup is more than noticeable.
To test this, let's create 100,000 tables, each with an identity column (backed by a sequence), in a database on version 17, and then repeat the same steps in a database on version 18:
$ psql -d test -o /dev/null <<EOF
SELECT format('CREATE TABLE t%s (id int GENERATED ALWAYS AS IDENTITY)', g.i)
FROM generate_series(1,100_000) AS g(i)
\gexec
EOF
Now let's measure how long it takes pg_dump to run in --binary-upgrade mode.
Version 17:
$ time pg_dump -d test --schema-only --binary-upgrade > /dev/null
real 1m40,677s
user 0m5,403s
sys 0m3,265s
Version 18:
$ time pg_dump -d test --schema-only --binary-upgrade > /dev/null
real 0m7,427s
user 0m2,220s
sys 0m0,264s
The acceleration is more than an order of magnitude.
Predefined role pg_signal_autovacuum_worker
commit: ccd38024b
Roles that are members of the predefined role pg_signal_backend can send server processes a signal to cancel the current query (pg_cancel_backend) or even to terminate (pg_terminate_backend), with one limitation: these signals cannot be sent to superuser processes.
However, when maintaining a server, situations arise where autovacuum processes need to be interrupted. Non-superusers who are members of the new predefined role pg_signal_autovacuum_worker can use these same two functions to make an autovacuum worker stop vacuuming the current table or terminate altogether.
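A possible usage sketch (the admin role name here is made up for illustration): grant the new role, then, as that role, find autovacuum workers in pg_stat_activity and signal them.

GRANT pg_signal_autovacuum_worker TO admin;

-- executed as admin: ask every autovacuum worker to stop vacuuming its current table
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
-- pg_terminate_backend(pid) would terminate the worker process instead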
That's all for now. There's news from the September commitfest ahead.