PostgreSQL Antipatterns: fighting hordes of the “dead”

The internals of PostgreSQL allow it to be very fast in some situations and not so fast in others. Today we'll dwell on a classic example of a conflict between how a DBMS works and what a developer does with it: UPDATE vs. the principles of MVCC.

Briefly, the plot, from an excellent article:

When a row is modified by the UPDATE command, two operations are actually performed: DELETE and INSERT. In the current row version, xmax is set to the number of the transaction that performed the UPDATE. Then a new version of the same row is created, whose xmin value matches the xmax value of the previous version.

Some time after this transaction completes, the old version (or the new one, depending on COMMIT/ROLLBACK) will be recognized as a dead tuple when VACUUM passes over the table, and will be cleaned away.
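You can watch this mechanism at work through the xmin and xmax system columns that every table exposes. A minimal sketch in psql (the table and values here are illustrative):

CREATE TABLE demo(v integer);
INSERT INTO demo VALUES (1);
SELECT xmin, xmax, v FROM demo; -- xmax = 0: this version has not been deleted by anyone
UPDATE demo SET v = 2;
SELECT xmin, xmax, v FROM demo; -- a new version, whose xmin is the id of the updating transaction
-- the old version is now a dead tuple, invisible to queries, waiting for VACUUM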

But this cleanup does not happen right away, while problems with the "dead" can be acquired very quickly: with repeated or massive updates of records in a large table, you can soon find yourself in a situation where even VACUUM cannot help.

# 1: I Like To Move It

Suppose your business-logic method is going about its work and suddenly realizes that it needs to update field X in some record:

UPDATE tbl SET X = $2 WHERE pk = $1;

Then, as execution continues, it turns out that field Y should be updated too:

UPDATE tbl SET Y = $2 WHERE pk = $1;

… and then Z as well – why stop at trifles?

UPDATE tbl SET Z = $2 WHERE pk = $1;

How many versions of this record are now in the database? Yes, four! Of these, one is current, and the other three will have to be cleaned up after you by [auto]VACUUM.
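If you want to check this claim yourself, the statistics views count dead row versions. A sketch (the stats collector updates asynchronously, so the numbers may lag for a moment):

SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'tbl';
-- after the three single-field UPDATEs above: n_live_tup = 1, n_dead_tup = 3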

Don't do this! Update all the fields in a single query – almost always, the method's logic can be reworked like this:

UPDATE tbl SET X = $2, Y = $3, Z = $4 WHERE pk = $1;

# 2: Use IS DISTINCT FROM, Luke!

So, you still want to update many, many records in a table (while running a migration script or a converter, for example). And something like this makes its way into the script:

UPDATE tbl SET X = $3 WHERE pk BETWEEN $1 AND $2;

A query in roughly this form comes up quite often, and almost always not to fill in a newly added empty field, but to fix some errors in the data. At the same time, whether the existing data is already correct is not taken into account at all – and in vain! That is, the record gets rewritten even if it already contains exactly the desired value – but why? Let's fix that:

UPDATE tbl SET X = $3 WHERE pk BETWEEN $1 AND $2 AND X IS DISTINCT FROM $3;

Many people don't know this wonderful operator exists, so here's a cheat sheet on IS DISTINCT FROM and other comparison operators to help:
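The essence, in a few lines: ordinary comparisons return NULL ("unknown") as soon as a NULL is involved, while IS DISTINCT FROM always returns TRUE or FALSE:

SELECT NULL = NULL;                -- NULL: "unknown", filtered out by WHERE
SELECT NULL IS DISTINCT FROM NULL; -- false: two NULLs are "not distinct"
SELECT 1 IS DISTINCT FROM NULL;    -- true
SELECT 1 IS NOT DISTINCT FROM 1;   -- true: a NULL-safe "equals"

This is exactly why the corrected query above also touches rows where X is still NULL, which a plain X <> $3 condition would silently skip.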

… and a little about operations on composite ROW() expressions:
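IS DISTINCT FROM also compares row constructors field by field, which lets you guard a whole multi-field update at once. A sketch combining it with the single-UPDATE advice from #1 (the parameter numbering is illustrative):

UPDATE tbl SET (X, Y, Z) = ($2, $3, $4) WHERE pk = $1 AND (X, Y, Z) IS DISTINCT FROM ($2, $3, $4);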

# 3: I’ll Recognize My Dear by … the Lock

Suppose two identical parallel processes start up, each trying to mark a record as being "in work":

UPDATE tbl SET processing = TRUE WHERE pk = $1;

Even if these processes do substantively independent things, within the scope of a single ID the second client will "hang" on this query until the first transaction completes.
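What this looks like live, in two parallel psql sessions (pk = 1 is illustrative):

-- session 1
BEGIN;
UPDATE tbl SET processing = TRUE WHERE pk = 1; -- acquires a row-level lock

-- session 2
UPDATE tbl SET processing = TRUE WHERE pk = 1; -- hangs until session 1 commits or rolls back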

Solution No. 1: reduce the task to the previous one

Just add IS DISTINCT FROM again:

UPDATE tbl SET processing = TRUE WHERE pk = $1 AND processing IS DISTINCT FROM TRUE;

In this form, the second query simply won't change anything in the database – everything there is "already as it should be" – so no blocking occurs. The fact that the record was "not found" is then handled in the application-level algorithm.
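As a usage note: the application can tell "I claimed the record" apart from "someone already had it" either by the affected-row count the driver reports or with a RETURNING clause. A sketch:

UPDATE tbl SET processing = TRUE WHERE pk = $1 AND processing IS DISTINCT FROM TRUE RETURNING pk;
-- a row comes back: this client claimed the record; no rows: it was already "in work"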

Solution No. 2: advisory locks

This is a big topic for a separate article, where you can read about usage patterns and the pitfalls of advisory locks.
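Still, the core idea fits in a few lines. A minimal sketch using the built-in pg_try_advisory_xact_lock (assuming pk can serve as the bigint lock key):

BEGIN;
SELECT pg_try_advisory_xact_lock($1); -- returns false at once if another session holds the key
-- proceed with the work only if the call returned true
UPDATE tbl SET processing = TRUE WHERE pk = $1;
COMMIT; -- the transaction-scoped advisory lock is released automatically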

Solution No. 3: avoid mindless calls

But are you sure this simultaneous work on the same record should be happening at all? Maybe you just got the call flow of your client-side business logic wrong, for example? It's worth thinking about…
