Recommendations for working with PostgreSQL

Hello, current_user!

Welcome to an exciting journey through the world of development with PostgreSQL, a powerful database management system. In this article, we'll dive into the key aspects of creating code and structure for PostgreSQL, with an emphasis on effective development strategies.

These are not rules, but rather useful recommendations that will help you avoid unpleasant situations in the future. The experience presented is my own, and of course it may be a little subjective since I am relatively far from being a super developer 🙂

Here we will consider points that relate specifically to the development and its organization (approach). We will not touch on server configuration, backups/replicas/shards, etc.

I'll try to be brief: just point by point.


Normalization of structure

When learning SQL, we were all taught that databases have levels of normalization. But it is worth organizing not only the data across different objects, but also those objects themselves (tables, functions, roles, grants) across different schemas.

Use separate schemas for different modules

Let's say you need to implement a system for sending push notifications to clients' Android devices. You are writing a service that will work with FCM, and in the database you write functions that return lists of device tokens (in reality there are quite a few fields, of course).

The point is that such entry points ("handles") for different modules (backends) should live in separate schemas.

Here's a rough example of such schemas (a minimal sketch follows the list):
  • logging

  • payments_common

  • payments_tinkoff

  • payments_sber

  • notifications
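
A minimal sketch of what that separation might look like (all names here are made up for illustration):

-- each backend module gets its own schema
create schema notifications;

-- a table and a "handle" function for the push-notification service live in that schema
create table notifications.device_tokens (
    client_id bigint,
    platform  text,
    token     text
);

create function notifications.get_android_push_tokens()
returns table (client_id bigint, token text)
language sql stable
as $$
    select client_id, token
    from notifications.device_tokens
    where platform = 'android';
$$;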


Use role groups and assign grants specifically to roles

Let's assume you have implemented the point above, and now your service needs to connect to the database and run its queries. But under which user? Surely not under postgres or admin, right?)
So we create a separate user for the service and let it connect as that user. Hmm, but it still needs privileges? Well, let's grant them.

I recommend the following approach: create roles, treat them as groups, and grant privileges specifically to those roles.

For example, these are the roles:

  • api_role – for accessing internal api functions

  • client_api_role – for mobile client access

  • admin_api_role – for site access (well, for example, this is the admin panel)

  • notify_role – for access to notification systems

  • developer – what is available to developers (for example, some part of the tables is available for reading in production)

Then create users and attach them to the roles. Grant privileges only to the roles; don't make grants personal.

Example for the Developers group

Let's say you want to give some developers read-only access in production (not to all tables, of course).

It's easier to create a developer role and then execute once:
grant select ... to developer

and when adding a new developer you will only need to run: grant developer to "new user"
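
A rough sketch of that flow (role names are just examples):

-- the group role, no login of its own
create role developer nologin;

-- grant read access once, to the role itself
grant usage on schema public to developer;
grant select on all tables in schema public to developer;

-- adding a new developer is then a one-liner
create role ivan login password 'change_me';
grant developer to ivan;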

In any case, adhere to the following rules:

  1. For each new backend (application) – its own user

  2. Only the cluster administrator plus a few developers (for example, the technical lead) should have administrator rights


Use extensions

PostgreSQL is a powerful DBMS, but it cannot be universal either. Don't assume that everything has to be done with stock PostgreSQL features alone, and don't rush to reinvent the wheel for every problem.

Many good, ready-made tools have existed for a long time and are widely used. Take one, study it, apply it; nothing complicated.

If you don't know what is best to use in a particular case, test and compare the options yourself, or ask for help, for example on HabrQA or in a Telegram channel about PostgreSQL.


Use separate schemas for different extensions

And, returning to the first recommendation: give each extension its own schema (or at least put all extensions into one common extensions schema):

-- create a schema
CREATE SCHEMA hstore;

-- install the extension into its own schema
CREATE EXTENSION hstore WITH SCHEMA hstore;

-- don't forget to add the schema to search_path right away
ALTER SYSTEM SET search_path = "$user", public, hstore;


Use pg_notify

Let's assume that you have a backend that must take data from the database and process it.
The standard solution would be something like this:
update tasks set processed_at = now() where processed_at is null returning *;

And this is a perfectly workable option, but when do you run it? Say, once every 5 seconds? Yes, that can be fine.

But what if the records we need to pick up appear rarely (less than once every 5 seconds), yet we have to react to them very quickly?

Yes, this is where pg_notify comes to our aid.
You simply make your service listen on the desired channel, and when a notification arrives you fetch the data with a query (in a pinch you can even pass the data itself through NOTIFY).

Fast, clean, and with no extra churn in the system.
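
A minimal sketch of the idea (the channel, table and function names are made up):

create table tasks (id bigserial primary key, payload jsonb, processed_at timestamptz);

-- notify listeners whenever a new task appears
create function notify_new_task() returns trigger
language plpgsql
as $$
begin
    perform pg_notify('new_task', new.id::text);
    return new;
end;
$$;

create trigger tasks_notify
after insert on tasks
for each row execute function notify_new_task();

-- the service simply runs LISTEN new_task;
-- and, when a notification arrives, fetches the work with an ordinary query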


Store date/time values in timestamptz

Make it a habit to always store timestamptz, and someday you will be very grateful for it.
This is a very common mistake among both beginners and experienced developers.

Yes, this is not a hard rule; sometimes you do need timestamps without a time zone. But if you don't know which to use, use the one with a time zone.

Example of use with timezone

Trading on the exchange begins at 10:00 Moscow time.
If your client from Hong Kong opens the exchange app at 13:00 local time:
that is 13:00:00+08, which is 08:00:00+03 Moscow time, so, as we can see, the exchange should still be closed. If we stored this without a time zone, we would have to write the "is the exchange open / how long until it opens" calculation by hand.

Let me add: if we store in the database that the exchange opens at 12:00:00+03, then select '2024-01-01 12:00:00+03'::timestamptz - '2024-01-01 13:00:00+08'::timestamptz will tell us that there are 4 hours left before opening.

Example of use without timezone

We want to display the `Get New Year's bonus` button in our application on January 1 from 09:00 to 10:00.

This is exactly the period of time we need to store in the database without reference to a timezone.

Then both at 9 a.m. Moscow time and at 9 a.m. Hong Kong time the check against the interval will succeed:
set local time zone $client_timezone;
select now()::timestamp <@ tsrange('2024-01-01 09:00:00', '2024-01-01 10:00:00');

A little about how timestamptz is stored in the database:
PostgreSQL does not store the time zone you pass in; it converts everything to UTC and stores it that way.
When it returns a value to a client, it converts it into the client's time zone; when you pass it a time filter, it converts your filter to UTC and works with that.
I think that makes it clear.
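
A small illustration of that behavior:

-- the input offset itself is not stored; the value is normalized to UTC
create temp table t (ts timestamptz);
insert into t values ('2024-01-01 13:00:00+08');

set time zone 'Europe/Moscow';
select ts from t;   -- 2024-01-01 08:00:00+03

set time zone 'Asia/Hong_Kong';
select ts from t;   -- 2024-01-01 13:00:00+08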


A little about programming, monitoring and optimization

I decided to limit this block to the points that are related, one way or another, to programming itself. Some of the advice may be debatable, but I'll mention it anyway.


Don't forget to set the correct server configuration

Although this point is usually taken for granted: yes, the default PostgreSQL configuration is so conservative that your cluster could start on a potato. But you don't have to leave it that way 🙂

As soon as you have set up a cluster that will serve as dev/prod, even if it has only 1-2 clients, configure the server for your hardware and workload. In practice a fair share of problems are related to the cluster configuration in one way or another.

Information about which configs you need to look at right away and which you can look at later can be found on the Internet.

One rule: don't overdo it, otherwise the server may pass out 🙂
For the sake of example:

  • give each client 10 GB of RAM for operations (work_mem) and then connect 100 clients;

  • or set shared_buffers to 100% of RAM, and then wonder why the system keeps falling over;

  • or simply set that parameter unrealistically large and then watch pg_stat_activity fill up with LWLock: BufferMapping waits.
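
Just to show the mechanics, a sketch with purely illustrative values (they are not a recommendation, tune them for your hardware):

alter system set shared_buffers = '4GB';          -- a common starting point is ~25% of RAM
alter system set work_mem = '64MB';               -- per sort/hash operation, per connection!
alter system set effective_cache_size = '12GB';
-- shared_buffers requires a restart, the other two take effect after a configuration reload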


Use metrics. Use pg_stat_statements

I also recommend immediately installing extensions for monitoring queries (for example pg_stat_statements).

Even if only occasionally, go into production, look at the statistics, reset them, aggregate them. Don't wait until QA comes kicking your door down.

In general, set up alerts for some metrics!
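
For example, working with pg_stat_statements might look like this (assuming the extension is already listed in shared_preload_libraries; column names are as of PostgreSQL 13+):

create extension if not exists pg_stat_statements;

-- top 10 queries by total execution time
select queryid, calls, total_exec_time, mean_exec_time, query
from pg_stat_statements
order by total_exec_time desc
limit 10;

-- reset the counters once you have looked at / exported them
select pg_stat_statements_reset();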


It is better to use CTEs and long queries in functions rather than temporary tables

Perhaps this point sounds controversial.

PostgreSQL writes temporary tables through local buffers and onto disk, and that costs time. Use common table expressions instead, and preferably write everything as a single statement: at the default isolation level (Read Committed) different statements inside a function can see different data.
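
For example, instead of materializing an intermediate result in a temporary table, keep it in a CTE inside a single statement (the tables here are made up just so the example runs):

create temp table clients (id bigint, name text);
create temp table orders  (client_id bigint, amount numeric, created_at timestamptz);

-- one statement, one snapshot, no intermediate temp table
with recent_orders as (
    select client_id, sum(amount) as total
    from orders
    where created_at >= now() - interval '1 day'
    group by client_id
)
select c.id, c.name, r.total
from clients c
join recent_orders r on r.client_id = c.id;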


Don't partition everything you can (and what you can't, either)

Yes, yes, partitions are cool. But I would recommend using them only in certain cases:

  • a huge table

  • quick data removal (dropping whole partitions)

  • when you are sure a query will always hit one specific partition

  • separating processed data from operational data

    • This is probably a special case of the point above. For example, you have tasks for an external service: what has been processed goes into monthly archive partitions, what has not stays in the operational table (the default partition)

In any of these cases it is extremely important that queries touch only a small number of partitions (i.e. the query always has a condition on the partition key).
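
A rough sketch of the "operational table + monthly archive" layout from the last point (names are made up):

create table external_tasks (
    id           bigint,
    processed_at timestamptz,
    payload      jsonb
) partition by range (processed_at);

-- unprocessed rows (processed_at is null) land in the default partition
create table external_tasks_operational
    partition of external_tasks default;

-- processed rows go into monthly archive partitions
create table external_tasks_2024_01
    partition of external_tasks
    for values from ('2024-01-01') to ('2024-02-01');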


Don't forget about extended statistics

Yes, when the planner misestimates the output, you can end up with very slow queries.

Let me give you an example:
You have patients in a table: full name, gender, pregnancy flag, other fields.

From the data we know that 90% of the patients are women, and 50% of those are pregnant.
What the database knows after collecting statistics: 10% are men, and 45% (of everyone) are pregnant.

And when you ask for all the pregnant men, the planner will assume the columns are independent and estimate roughly 10% × 45% ≈ 4.5% of the table, although in reality there are no such rows at all.
This is where extended statistics come in: you can describe the dependency between the columns, and then the planner will understand that with gender = male there are no pregnant patients.

The example may not be the best, but in analytical workloads extended statistics help the planner avoid stumbling into a nested loop when some hash join would do better.

Also check your query plans: if the estimated row count differs from the actual one by an order of magnitude or more, you have a statistics problem.
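
In terms of the example above, it could look something like this:

create table patients (
    full_name   text,
    gender      text,
    is_pregnant boolean
);

-- tell the planner these columns are correlated
create statistics patients_gender_pregnancy (dependencies, mcv)
    on gender, is_pregnant from patients;

analyze patients;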


Try not to shove everything into TOAST

It is probably worth adding right away that you should not keep data used for analytics in TOAST (for example, in json/text columns with the default storage parameter).
You can run into very serious performance problems.

Literally half a year ago I ran into this in practice. It turned out that to build a report we were pulling a single field out of a rather large json for the calculation.
We moved that value out of the json into a plain table column, and the query went from running about 3 hours to just 1-4 seconds.
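
One way to do this, if the structure allows, is to pull the hot field out of the json into a regular (for example, generated) column, so the query no longer detoasts the whole document; a sketch with made-up names:

create table reports (
    id      bigint,
    payload jsonb   -- large document, ends up in TOAST
);

-- extract the single field needed for the calculation into its own column
alter table reports
    add column amount numeric
    generated always as ((payload ->> 'amount')::numeric) stored;

-- aggregation now reads a small plain column instead of detoasting the json
select sum(amount) from reports;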


Use explain before create index

Yes, yes, very often there are problems when a query starts to slow down and the programmer's first thought is: why not just add another index?

This is a very sore subject. Sit down, run explain analyze, analyze it completely and find bottlenecks.
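
For example (the table and filter are made up, the point is the command itself):

create temp table orders (client_id bigint, amount numeric, created_at timestamptz);

-- see where the time actually goes before adding yet another index
explain (analyze, buffers)
select *
from orders
where client_id = 42
  and created_at >= now() - interval '7 days';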

Believe me, this helps a lot. And also raises your heart rate 🙂

And in general, the question of which indexes to use for which tasks, as well as how to configure them, is a completely separate topic.


Use plan visualizers

Viewing a query plan in pgAdmin is not very convenient. Yes, you can see everything there, but there are more convenient tools that make the task easier.

For example I use this.


Watch the volatility and parallel-safety attributes of the functions you create

Unfortunately, no one pays attention to this.

Make it a rule to always check the volatility of functions (volatile/stable/immutable).
In my practice I stepped on this rake when I wrote internal API (helper) functions that simply returned data but carried the volatile attribute (the default).
Each time this had to be fixed, because in production such functions turned ordinary OLTP queries from 0.05-0.1 seconds into 4-6 seconds.

For the most part I haven't run into problems with parallelism, but it is still better not to leave that attribute at its default either (just read up in advance on when you need it and when you don't).

Also, do not forget that volatile functions run with their own snapshot (in the MVCC sense) and can see more recent data than the calling statement; this behavior can break your business logic.

Here's a trivial example
-- not describing absolutely everything, purely for the sake of example

-- create a table
create table users(id int, created_at timestamptz);

-- generate some data
insert into users
select g, now() from generate_series(1, 10) as g;

-- create a volatile function (volatile is the default) that returns created_at by id
create function get_user_created_at(p_user_id int) returns timestamptz
language plpgsql volatile
as $$
begin
    return (select created_at from users where id = p_user_id);
end;
$$;

-- run the query:
select created_at, get_user_created_at(id), pg_sleep(1)
from users;

-- while it is running, in a parallel session execute:
update users set created_at = now();

-- in the query above we will see created_at and get_user_created_at(id) diverge,
-- because the volatile function runs with its own snapshot
-- and sees the created_at written by the concurrent update


Implement locks correctly

A very sore subject.
I won’t say much here, but will simply give a link to a very good article that helped me get rid of a very large number of deadlocks.


Organization of coding

So, it's time for the last block. It partly covers how writing code is organized, its design, etc., which is described in more detail here.


Make a special dev schema in production

Let's say that in your project some database developers are allowed into production with read-only rights, for example because they need to debug and frequently reproduce queries (and, for the same reason, run explain analyze).

So I advise making a special schema, for example dev. Give full access there and let them copy, modify and run functions in their own sandbox (after all, we can't always make a full copy of production for testing).

Note that you will have to remind developers to create their tables with the UNLOGGED parameter so that they don't end up on the replica.
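
A rough sketch (the developer role here is the group role from the example above):

create schema dev;

-- developers get full rights only inside their sandbox
grant usage, create on schema dev to developer;

-- and remind them to create their tables as UNLOGGED
create unlogged table dev.scratch_orders (id bigint, payload jsonb);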


Keep the code in a repository and version it

Don't write in the merge request of some backend service that such-and-such procedure needs a fix and then change the code right there (or, worse, keep the new version of a procedure or the queries that add columns in the comments).

Create a separate folder and store the code in the repository, so you always have a history of changes to the database objects.

The simplest way to do this is to take a snapshot of the database structure and dump it as separate files (one file per object). Available tools include pgCodeKeeper, DataGrip, etc.

We are currently testing a setup based on pg_dump + pg_query_go.


Deploy to production only through tools, never by hand

Now that the code is in the repository, you can build migrations from it and roll them out to production.

Use CI/CD tooling and roll out database changes to production with dedicated migration utilities.
Examples of such utilities: Liquibase, Flyway, pgCodeKeeper, pg-migrator and a ton of others.

It all depends on how you create migrations and what your database is like.
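
For example, with Flyway a migration is just a versioned SQL file whose name follows the V<version>__<description>.sql convention; a made-up sketch:

-- migrations/V2__add_notifications_schema.sql
create schema if not exists notifications;

create table notifications.device_tokens (
    client_id bigint not null,
    platform  text   not null,
    token     text   not null
);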

Never make changes in production without committing the code to the repository (or at least somewhere). Otherwise you will run into problems, 100% guaranteed.


Write comments everywhere

Force programmers to write comments for almost all functions, tables, their columns, etc.
This will help new developers support the project in the future, and also makes it possible to generate documentation.

Yes, yes, the point is banal to the point of disgrace.
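
In PostgreSQL this is just COMMENT ON (using the made-up objects from the examples above):

comment on table notifications.device_tokens is 'Push tokens of client devices';
comment on column notifications.device_tokens.token is 'FCM registration token';
comment on function notifications.get_android_push_tokens() is 'Token list for the push notification service';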


Use cross-review

I think there is no need for comments here

Agree with the team on a unified coding style

It would seem that the last three points should go without saying, and yet very often they are not followed.

If problems arise, use formatters/linters, for example pgFormatter.

Conclusion

By following certain principles and using advanced PostgreSQL development and maintenance tools, you can ensure the stability, performance and security of both the database and your project's code base.
