An article about how and why it is worth organizing applied partitioning “in theory” has already been, here we will focus on the practice of applying some approaches within the framework of our monitoring service for hundreds of PostgreSQL servers.
“Cases of bygone days …”
Initially, like any MVP, our project started under a fairly small load – monitoring was carried out only for the top ten most critical servers, all tables were relatively compact … But time passed, there were more and more monitored hosts, and once again tried to do something with one of 1.5TB tables, we realized that it’s possible to live on like this, which is very inconvenient.
Times were almost epic, different PostgreSQL 9.x versions were relevant, so all the partitions had to be done “manually” – through table inheritance and triggers routing with dynamic
The resulting solution turned out to be universal enough so that it could be translated to all tables:
- An empty “header” parent table was declared, on which all were described desired indexes and triggers.
- Recording from the point of view of the client was made in the “root” table, and inside with routing trigger
BEFORE INSERTthe record was “physically” inserted into the desired section. If there wasn’t one yet, we caught an exception and …
- … via
CREATE TABLE ... (LIKE ... INCLUDING ...)by the template of the parent table was created restricted date sectionso that when data is extracted, reading is done only in it.
PG10: First Attempt
But partitioning through inheritance has historically not been well suited to work with an active write stream or a large number of descendant sections. For example, you may recall that the algorithm for selecting the desired section had quadratic complexitythat works with 100+ sections, you understand how …
PG10 optimized this situation by implementing support native sectioning. Therefore, we immediately tried to apply it immediately after the migration of the storage, but …
As it turned out after digging the manual, the natively partitioned table in this version:
- does not support description of indexes
- does not support triggers on it
- cannot be itself any “descendant”
- do not support
INSERT ... ON CONFLICT
- can’t spawn section automatically
Painfully getting a rake on our forehead, we realized that we could not do without modifying the application, and postponed further research for six months.
PG10: Second Chance
So, we began to solve the problems in turn:
- Because triggers and
ON CONFLICTwe were in some places still needed, for their development did an intermediate proxy table.
- Get rid of the “routing” in triggers – i.e. from
- Carried out separately template table with all indicesso that they do not even appear on the proxy table.
Finally, after all this, the main table has already been natively partitioned. Creating a new section so far remained on the conscience of the application.
As in any analytical system, we also had “Facts” and “cuts” (dictionaries). In our case, in this capacity were, for example, body of the “template” same slow queries.
The “facts” have been partitioned by days for a long time, so we calmly deleted the obsolete sections, and they did not bother us (logs!). But with the dictionaries the trouble turned out …
Not to say that there were a lot of them, but approximately on 100TB of “facts” turned out a dictionary on 2.5TB. You can’t conveniently delete anything from such a table, you won’t squeeze it in adequate time, and writing to it gradually became more and more slow.
It seems like a dictionary … in it, each entry should be presented exactly once … and that’s right, but! .. Nobody bothers us to have a separate dictionary for every day! Yes, it brings a certain redundancy, but it allows you to:
- write / read faster due to the smaller section size
- consume less memory by working with more compact indexes
- store less data due to the ability to quickly remove obsolete
As a result of the whole complex of measures CPU load reduced by ~ 30%, disk load by ~ 50%:
At the same time, we continued to write exactly the same thing to the database, just with less load.
# 2 Database evolution and refactoring
So, we settled on the fact that we have every day has its own section with data. Actually
CHECK (dt = '2018-10-12'::date) – and there is a partition key and a condition for a record to fall into a particular section.
Since all the reports in our service are built in the context of a specific date, then the indexes since the “non-partitioned times” for them were all types (Server, date of, Plan Template), (Server, date of, Plan Node), (date of, Error class, Server), …
But now on each section live own copies of each such index … And within each section date – constant… It turns out that now we are in each such index we enter the constant corny as one of the fields, which makes more of its volume, and the search time on it, but does not bring any result. They themselves left a rake, oops …
The direction of optimization is obvious – just remove the date field from all indices on partitioned tables. With our volumes, the gain is about 1TB / week!
Now let’s notice that this terabyte still had to be written down somehow. That is, we also drive should now load less! In this picture, the effect obtained from the cleaning, which we devoted a week to, is clearly visible:
# 3 “Smear” the peak load
One of the big troubles of loaded systems is excessive synchronization some operations do not require that. Sometimes “because they didn’t notice”, sometimes “it was easier”, but sooner or later you have to get rid of it.
We zoom in the previous picture – and we see that we have a disk “Shakes” the load with double amplitude between neighboring samples, which obviously “statistically” should not be with so many operations:
This is quite simple to achieve. We have already started monitoring almost 1000 servers, each is processed by a separate logical stream, and each stream discards the accumulated information for sending to the database with a certain frequency, something like this:
The problem here lies precisely in the fact that all threads start at about the same time, therefore, the moments of sending from them almost always coincide “to the point”. Oops number 2 …
Fortunately, it’s easy to correct, by adding a “random” run by time:
setInterval(sendToDB, interval * (1 + 0.1 * (Math.random() - 0.5)))
#4. We cache that
The third traditional highload problem is lack of cache where he is could be.
For example, we made it possible to analyze by plan nodes (all of these
Seq Scan on users), but immediately think that they, for the most part, are the same – they forgot.
No, of course, nothing is written to the database repeatedly, this cuts off the trigger with
INSERT ... ON CONFLICT DO NOTHING. But to the base, these data reach anyway, and even superfluous read to check for conflict I have to do it. Oops number 3 …
The difference in the number of records sent to the database before / after enabling caching is obvious:
And this is a concomitant drop in storage load:
Terabyte-per-day only sounds scary. If you do everything right, then this is just 2 ^ 40 bytes / 86400 seconds = ~ 12.5MB / sthat even IDE desktop screws held. 🙂
But seriously, even with a tenfold “skew” of the load during the day, you can easily meet the capabilities of modern SSDs.