Saving big on large data volumes in PostgreSQL

Continuing the topic of recording large data streams, raised by the previous article on partitioning, in this one we look at ways to reduce the “physical” size of what is stored in PostgreSQL, and their impact on server performance.

We will talk about TOAST settings and data alignment. “On average”, these techniques will not save an enormous amount of resources, but they require no changes to the application code at all.

Our experience, however, turned out to be very productive in this regard, since the storage of almost any monitoring system is by its nature mostly append-only in terms of the data recorded. And if you want to know how to teach the database to write half as much to disk instead of 200MB/s, read on.

Little Big Data Secrets

Given the profile of our service, packs of text regularly fly in to it from the logs.

And since the VLSI complex, whose databases we monitor, is a multi-component product with complex data structures, queries tuned for maximum performance turn out to be real “multi-volume works” with intricate algorithmic logic. So the volume of each individual query instance, or of the resulting execution plan, in the log that reaches us is “on average” quite large.

Let’s look at the structure of one of the tables into which we write the “raw” data, that is, the original text from the log entry:

CREATE TABLE rawdata_orig(
  pack -- PK
    uuid NOT NULL
, recno -- PK
    smallint NOT NULL
, dt -- partition key
    date
, data -- the most important part
    text
, PRIMARY KEY(pack, recno)
);

A typical table like this (already partitioned, of course, so it is a section template), where the most important thing is the text. And sometimes it is quite voluminous.

Recall that the “physical” size of a single record in PG cannot exceed one page of data, while its “logical” size is a different matter entirely. To write a bulky value (varchar / text / bytea) into a field, the TOAST technique is used:

PostgreSQL uses a fixed page size (usually 8 KB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the server code. The technique is known as TOAST …

In fact, for each table with “potentially large” fields, PostgreSQL automatically creates a paired table that “slices” each “large” record into 2KB segments:

TOAST(
  chunk_id
    integer
, chunk_seq
    integer
, chunk_data
    bytea
, PRIMARY KEY(chunk_id, chunk_seq)
);

That is, if we have to write a row with a “large” value in data, then the actual write will happen not only to the main table and its PK, but also to the TOAST table and its PK.
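The TOAST table paired with a given table is not hidden from us: it can be found through the system catalog. A sketch of such a lookup (assuming the table has text columns and therefore a TOAST table):

```sql
-- Find the TOAST table paired with rawdata_orig
-- and see how much space it occupies on its own.
SELECT c.relname
     , t.relname AS toast_relname
     , pg_size_pretty(pg_relation_size(t.oid)) AS toast_size
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid  -- reltoastrelid = 0 if no TOAST table
WHERE c.relname = 'rawdata_orig';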

Reducing the TOAST effect

But most of our records are still not that big, they ought to fit in 8KB. How can we save on that?..

Here the STORAGE attribute of a table column comes to our aid:

  • EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression is attempted first, then out-of-line storage if the row is still too big.
  • MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort, when there is no other way to make the row small enough to fit on a page.)

In fact, this is exactly what we need for the text: compress it as much as possible, and only if it does not fit at all, move it to TOAST. This can be done directly “on the fly”, with a single command:

ALTER TABLE rawdata_orig ALTER COLUMN data SET STORAGE MAIN;
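The current storage mode of a column can be verified in pg_attribute; a sketch of such a check:

```sql
-- Verify the storage mode of the data column.
-- attstorage: 'm' = MAIN, 'x' = EXTENDED, 'e' = EXTERNAL, 'p' = PLAIN
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'rawdata_orig'::regclass
  AND attname = 'data';
```

After the ALTER TABLE above, this should show 'm' for the data column.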

How to evaluate the effect

Since the data flow changes every day, we cannot compare absolute numbers, but in relative terms, the smaller the share we wrote to TOAST, the better. There is a danger, though: the larger the “physical” volume of each individual record, the “wider” the index becomes, because we have to cover more data pages.

Section before change:

heap  = 37GB (39%)
TOAST = 54GB (57%)
PK    =  4GB ( 4%)

Section after changes:

heap  = 37GB (67%)
TOAST = 16GB (29%)
PK    =  2GB ( 4%)
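A breakdown like the two above can be obtained with a query along these lines (a sketch, assuming the partition has a TOAST table):

```sql
-- heap, TOAST, and index sizes of a partition
SELECT pg_size_pretty(pg_relation_size(c.oid))           AS heap
     , pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast
     , pg_size_pretty(pg_indexes_size(c.oid))            AS indexes
FROM pg_class c
WHERE c.oid = 'rawdata_orig'::regclass;
```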

In fact, we started writing to TOAST half as often, which unloaded not only the disk but also the CPU:

Note that we also started “reading” the disk less, not only “writing”: when inserting a record into a table, you also have to “read” part of the tree of each of its indexes to determine the record’s future position in them.

Who lives well on PostgreSQL 11

After upgrading to PG11, we decided to continue “tuning” TOAST and noticed that starting with this version, the toast_tuple_target parameter became available for tuning:

The TOAST processing code is triggered only when a row value to be stored in a table is larger than TOAST_TUPLE_THRESHOLD bytes (usually 2 KB). The TOAST code will compress and/or move field values out of the table until the row value is smaller than TOAST_TUPLE_TARGET bytes (adjustable, usually 2 KB as well) or no further reduction is possible.

We decided that our data is usually either “very short” or “very long” right away, so we limited ourselves to the minimum possible value:

ALTER TABLE rawplan_orig SET (toast_tuple_target = 128);
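Per-table options set this way end up in reloptions, where they can be double-checked; a sketch of such a check:

```sql
-- Confirm that the per-table storage option was applied.
SELECT reloptions
FROM pg_class
WHERE oid = 'rawplan_orig'::regclass;
-- expected to contain: toast_tuple_target=128
```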

Let’s see how the new settings affected disk load after the migration:

Not bad! The average disk queue length dropped by about 1.5 times, and disk “busy” time by 20 percent! But maybe this somehow affected the CPU?

At the very least, it certainly did not get worse. Although it is hard to judge, given that even such volumes still cannot push the average CPU load above 5%.

Rearranging the terms does change the sum!

As they say, take care of the pennies and the rubles will take care of themselves; with our storage volumes of about 10TB/month, even a small optimization can yield a good profit. So we turned our attention to the physical structure of our data: how exactly the fields are “laid out” inside the record of each table.

Because data alignment directly affects the resulting volume:

Many architectures align data on machine-word boundaries. For example, on a 32-bit x86 system, integers (type integer, 4 bytes) will be aligned on a 4-byte word boundary, as will double-precision floating-point numbers (type double precision, 8 bytes). On a 64-bit system, double values will be aligned on an 8-byte word boundary. This is another reason for incompatibility.

Due to alignment, the size of a table row depends on the order of the fields. Usually this effect is not very noticeable, but in some cases it can lead to a significant increase in size. For example, if you interleave char(1) and integer fields, 3 bytes will usually be wasted between each pair.

Let’s start with synthetic models:

SELECT pg_column_size(ROW(
  '0000-0000-0000-0000-0000-0000-0000-0000'::uuid
, 0::smallint
, '2019-01-01'::date
));
-- 48 bytes

SELECT pg_column_size(ROW(
  '2019-01-01'::date
, '0000-0000-0000-0000-0000-0000-0000-0000'::uuid
, 0::smallint
));
-- 46 bytes

Where did the extra pair of bytes come from in the first case? It’s simple: the 2-byte smallint is padded to a 4-byte boundary before the next field, while when it is the last field, there is nothing to align and no need to pad.
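The padding rules can be traced back to each type’s length and alignment requirement, which are recorded in pg_type. A sketch of such an inspection (note that smallint’s internal type name is int2):

```sql
-- typlen: size in bytes (-1 for variable-length types)
-- typalign: required alignment ('c' = 1, 's' = 2, 'i' = 4, 'd' = 8 bytes)
SELECT typname, typlen, typalign
FROM pg_type
WHERE typname IN ('uuid', 'int2', 'date');
```

With a 24-byte row header, the two results above add up exactly: 24 + 16 (uuid) + 2 (smallint) + 2 (padding) + 4 (date) = 48, versus 24 + 4 + 16 + 2 = 46 with no padding needed.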

In theory, everything is fine, and you can rearrange the fields as you like. Let’s check it on real data, using the example of one of the tables whose daily section occupies 10-15GB.

Source structure:

CREATE TABLE public.plan_20190220
(
-- Inherited from table plan:  pack uuid NOT NULL,
-- Inherited from table plan:  recno smallint NOT NULL,
-- Inherited from table plan:  host uuid,
-- Inherited from table plan:  ts timestamp with time zone,
-- Inherited from table plan:  exectime numeric(32,3),
-- Inherited from table plan:  duration numeric(32,3),
-- Inherited from table plan:  bufint bigint,
-- Inherited from table plan:  bufmem bigint,
-- Inherited from table plan:  bufdsk bigint,
-- Inherited from table plan:  apn uuid,
-- Inherited from table plan:  ptr uuid,
-- Inherited from table plan:  dt date,
  CONSTRAINT plan_20190220_pkey PRIMARY KEY (pack, recno),
  CONSTRAINT chck_ptr CHECK (ptr IS NOT NULL),
  CONSTRAINT plan_20190220_dt_check CHECK (dt = '2019-02-20'::date)
)
INHERITS (public.plan)

Section after changing the column order: exactly the same fields, only in a different order:

CREATE TABLE public.plan_20190221
(
-- Inherited from table plan:  dt date NOT NULL,
-- Inherited from table plan:  ts timestamp with time zone,
-- Inherited from table plan:  pack uuid NOT NULL,
-- Inherited from table plan:  recno smallint NOT NULL,
-- Inherited from table plan:  host uuid,
-- Inherited from table plan:  apn uuid,
-- Inherited from table plan:  ptr uuid,
-- Inherited from table plan:  bufint bigint,
-- Inherited from table plan:  bufmem bigint,
-- Inherited from table plan:  bufdsk bigint,
-- Inherited from table plan:  exectime numeric(32,3),
-- Inherited from table plan:  duration numeric(32,3),
  CONSTRAINT plan_20190221_pkey PRIMARY KEY (pack, recno),
  CONSTRAINT chck_ptr CHECK (ptr IS NOT NULL),
  CONSTRAINT plan_20190221_dt_check CHECK (dt = '2019-02-21'::date)
)
INHERITS (public.plan)

The total volume of a section is determined by the number of “facts” and depends only on external processes, so let’s divide the heap size (pg_relation_size) by the number of records in it, obtaining the average size of an actual stored record:
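This averaging can be sketched as a query against one of the daily sections (assuming a non-empty partition, e.g. plan_20190221):

```sql
-- Average "physical" size of a stored record:
-- heap size divided by the number of tuples in it.
SELECT pg_relation_size('plan_20190221') /
       (SELECT count(*) FROM plan_20190221) AS avg_record_size;
```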

Minus 6% of the volume, excellent!

But everything, of course, is not so rosy: in indexes we cannot change the order of the fields, so “overall” (pg_total_relation_size) …

… we still saved 1.5% here without changing a single line of code. Yes indeed!

Note that the field arrangement above is not necessarily the most optimal one. We did not want to “tear apart” some blocks of fields for aesthetic reasons, for example, the (pack, recno) pair, which is the PK of this table.

In general, finding the “minimum-padding” arrangement of fields is a fairly simple brute-force task. So you may get even better results on your data than we did, give it a try!
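A common starting point for such an arrangement is simply ordering fixed-size columns from widest to narrowest, with variable-length ones at the end. A sketch of a catalog query that lists a table’s columns in that order (using public.plan from the example above):

```sql
-- List columns ordered by descending type size: fixed-size types first,
-- variable-length types (typlen = -1, e.g. numeric) last.
SELECT a.attname, t.typname, t.typlen, t.typalign
FROM pg_attribute a
JOIN pg_type t ON t.oid = a.atttypid
WHERE a.attrelid = 'public.plan'::regclass
  AND a.attnum > 0          -- skip system columns
  AND NOT a.attisdropped
ORDER BY t.typlen DESC;
```

This is only a heuristic, not the true minimum: as noted above, constraints like keeping PK fields together may override it.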
