PostgreSQL 13: parallel VACUUM

The other day, Amit Kapila committed a patch by Masahiko Sawada that allows vacuuming to run in parallel. The table itself is still vacuumed by a single (leader) process, but to clean the indexes it can now launch background worker processes, one per index. In manual mode this speeds up vacuuming of large tables with multiple indexes; autovacuum does not use this feature yet.

As you know, vacuuming a table consists of several phases.

First, the table is scanned and pointers to unneeded ("dead") row versions are collected in memory. The memory is limited by maintenance_work_mem, so not all pointers may fit in a single pass.

Then all indexes are scanned sequentially (as before), and pointers to the dead row versions that were found are removed from them.

Then the already-scanned part of the table is passed over again, and the dead row versions themselves are removed from it.

If not all dead row versions could be processed in one pass, the memory is freed and the whole process repeats from where it left off.
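While a vacuum is running, these phases can be observed from another session in the pg_stat_progress_vacuum view (a sketch; the column names below are as documented for PostgreSQL 13):

demo=# SELECT phase, heap_blks_scanned, heap_blks_vacuumed,
              index_vacuum_count, num_dead_tuples
       FROM pg_stat_progress_vacuum;

The phase column cycles through values such as "scanning heap", "vacuuming indexes", and "vacuuming heap", matching the steps described above, and index_vacuum_count shows how many times the index-cleanup phase has already been repeated.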

This whole scheme remains unchanged, but indexes can now be cleaned in parallel. To do this, the leader process launches several worker processes, which divide the existing indexes among themselves and process them. Each index is handled by exactly one process. Once all indexes are cleaned, the workers exit and the leader moves on to the next phase.

As an example, take the ticket_flights flight table from the demo database. It has one index, but we can create a couple more.

demo=# CREATE INDEX ON ticket_flights (fare_conditions);
demo=# CREATE INDEX ON ticket_flights (amount);

Only indexes whose size exceeds the min_parallel_index_scan_size parameter take part in parallel processing. Our indexes qualify:

demo=# SHOW min_parallel_index_scan_size;
 min_parallel_index_scan_size
------------------------------
 512kB
(1 row)
demo=# SELECT pg_size_pretty(pg_relation_size(indexname::regclass))
FROM pg_indexes
WHERE tablename = 'ticket_flights';
 pg_size_pretty
----------------
 325 MB
 187 MB
 180 MB
(3 rows)

Update half the rows to give vacuum some work:

demo=# UPDATE ticket_flights SET amount = amount + 1 WHERE random() > 0.5;
UPDATE 4194262


demo=# VACUUM VERBOSE ticket_flights;
INFO:  vacuuming "bookings.ticket_flights"
INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO:  scanned index "ticket_flights_fare_conditions_idx" to remove 4194262 row versions by parallel vacuum worker
DETAIL:  CPU: user: 1.84 s, system: 0.41 s, elapsed: 11.82 s
INFO:  scanned index "ticket_flights_amount_idx" to remove 4194262 row versions by parallel vacuum worker
DETAIL:  CPU: user: 2.31 s, system: 0.44 s, elapsed: 12.95 s
INFO:  scanned index "ticket_flights_pkey" to remove 4194262 row versions
INFO:  "ticket_flights": found 4194262 removable, 8391852 nonremovable row versions in 104885 out of 104885 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 630
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 9.91 s, system: 4.40 s, elapsed: 121.40 s.

You can see that the leader process launched two workers and took one of the indexes for itself.

The number of worker processes can be specified explicitly (in any case it will, of course, not exceed max_parallel_maintenance_workers, which in turn cannot exceed max_worker_processes). An explicit value can be used, in particular, to disable parallelism:

demo=# VACUUM (PARALLEL 0, VERBOSE) ticket_flights;
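Conversely, you can request a specific number of workers. For example (the actual number launched is still capped by max_parallel_maintenance_workers and by how many indexes are eligible for parallel processing):

demo=# VACUUM (PARALLEL 2, VERBOSE) ticket_flights;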

I hope this topic will be developed further. The table itself could be scanned in parallel by several processes (this was in Sawada's original patch but did not make it into the final version), and each individual index could also be cleaned in parallel. Autovacuum also needs to be taught to use parallelism.

P.S. By the way, we are looking for a person for our education team: someone who can, and loves to, dig into things and explain them. We know for sure that this person is out there somewhere, but still hiding.
