What's new in Greenplum 7. Part 2

In the previous part of this review of changes and new features in Greenplum 7, we looked at the migration of the Append Optimized table engine to the table access method interface, the optimization of adding columns to tables, and changes related to index support.

Today we will talk about another new type of index for Greenplum and more.

BRIN indexes

None of the index-related improvements mentioned earlier look like a must-have for an analytical workload. How often do you need to fetch individual values or, even more so, unique ones from a table containing millions or billions of records? Exactly. These improvements will most likely be useful for auxiliary or reference tables.

A much more common scenario is selecting a subset of rows that match the desired criteria for subsequent joins, aggregations, and so on. Here, techniques that narrow the search range and exclude data blocks that are known not to be needed are in demand: data skipping, metadata pruning. Previously, the user had the following options for this purpose:

  • Table partitioning. The data is physically split into different tables and, as a result, different files. It becomes possible to scan only those partitions that may contain the desired data, and to maintain, back up and reload the table in parts. The price is rigidity of the structure, increased load on the system catalog, and higher demands on the planners, which must be aware of the table topology in order to build an efficient plan for it.

  • Bitmap indexes, which work well for columns with low cardinality (otherwise their size on disk outweighs all of their advantages). They allow several indexes on different columns to be combined in one query plan through operations on bit masks.

  • Btree indexes, which, unlike bitmap ones, can also handle columns with a high percentage of unique values, but require significant overhead to maintain and use.

Starting with Postgres 9.5, and now Greenplum 7, another option is available to users – brin indexes (Block Range Index). The idea behind this type of index is that the entire address space of the table is divided into ranges of a fixed size. For heap tables, a range is a number of pages (configurable for each index on the table) whose numbers form the most significant bits of the row version identifier. For each range, the index stores meta information describing the values stored in it: in the simplest case, for ordered types, the minimum and maximum; in more complex cases, covering values – for example, an area that includes the coordinates of all points stored in the range. And since only a few values are stored per range, such an index is much more compact than a btree.

When accessing such an index, we can obtain the identifiers of all blocks that could potentially store the requested values and spend resources only on scanning those. Obviously, each extracted value has to be rechecked, since the requested interval does not necessarily cover the entire range of blocks selected for scanning. The narrower and more precise the description of each block range of a given size, the more efficient an index scan will be: fewer blocks will be selected, and they will contain fewer “unnecessary” values.

Therefore, good candidates for this type of index are columns where the way data is loaded into the table correlates with the queries subsequently run against it. Data loaded in chronological order or daily and then queried over longer time periods (a week, a month) is a great fit. As a guideline, it is often suggested to use columns for which, according to statistics, there is a direct or inverse correlation between a value and its position in the table. You can force such a correlation by rewriting the table in the desired order: with the CLUSTER command when an index is present, or with ALTER TABLE ... REPACK BY COLUMNS, implemented for Greenplum 7 (8ee44e8), even without one. The latter required one more function in the table access method interface – table_relation_copy_for_repack.
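As a rough illustration (table, column and index names are hypothetical, and the exact REPACK syntax should be checked against the Greenplum 7 documentation), the workflow might look like this:

```sql
-- Values of correlation close to 1 or -1 suggest a good BRIN candidate.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'events';

-- A compact BRIN index on a column that is loaded in chronological order.
CREATE INDEX events_ts_brin ON events USING brin (event_ts);

-- CLUSTER rewrites the table in the order of a clusterable (e.g. btree) index:
CREATE INDEX events_ts_btree ON events (event_ts);
CLUSTER events USING events_ts_btree;

-- The Greenplum 7 REPACK variant does not need an index; the syntax below is
-- sketched from the description above, so check the documentation for the exact form:
-- ALTER TABLE events REPACK BY COLUMNS (event_ts);
```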

Let's look at how a search using a brin index is performed (see bringetbitmap, the implementation of the amgetbitmap index access method callback) to understand what pitfalls arose when adapting the index to AO tables:

  1. We request the number of pages in the table.

  2. The description in the index is stored not for each page, but for a range of pages. Therefore, we traverse each page range of the table and use the index to determine whether it can contain anything of interest to us.

  3. The first pages of a brin index always contain the so-called reverse range map (revmap). It occupies a contiguous range of pages, and for simplicity it can be thought of as an array of pointers that lets us obtain the description of the desired page range.

  4. If a description for a block range has not been found (for example, information about the range has not been summarized yet), all of its pages are marked for scanning unconditionally.

  5. Otherwise, a support function is called, which interprets the summarized information and decides whether scanning the given block range is worthwhile.

  6. As a result, a bitmap is built that marks all pages potentially containing the requested rows.

Now let’s look carefully at step 3 again and recall what a row version identifier looks like for an AO table. Remember that here we have no fixed-size pages, and the row number generally does not reflect a physical page number at all. However, the upper 7 bits store the number of the segment file that holds a particular row version. Thus, if at least one row is stored in the highest available segment file, number 127, we will need a reverse range map able to address at least 0xFE000000 pages. A single 32 KB brin index page in a Greenplum build fits about 5454 pointers by default, so the map would take up more than 781 thousand pages, or over 23 GB. For comparison, in a Postgres build with 8 KB pages, such a page number would correspond to a table occupying more than 32 TB of disk space, whereas in the case of an AO table such an identifier may belong to the only row in the entire table.
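These figures are easy to verify right in the database (the pointers-per-page value of 5454 is taken from the text above):

```sql
-- Highest addressable block number when the top 7 bits of the ctid hold
-- segment file 127, the number of 32 KB revmap pages needed at ~5454
-- pointers per page, and the resulting size of the reverse range map.
SELECT (127::bigint << 25)                AS max_block_number,  -- 0xFE000000
       ceil((127::bigint << 25) / 5454.0) AS revmap_pages,      -- ~781 thousand
       pg_size_pretty((ceil((127::bigint << 25) / 5454.0) * 32768)::bigint) AS revmap_size;
```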

For this reason, brin indexes underwent significant modifications to support AO tables (63d916b). The entire range of available row version identifiers can be divided by the table access method into sequences (BlockSequences) without such significant gaps within each of them. Yes, for this purpose yet another function appeared in the access method interface – relation_get_block_sequences. For heap tables it returns exactly one sequence, from the first page to the last. For Append Optimized tables there will be as many sequences as there are segment files. For each segment file, the first page is the one whose number consists of the segment file number (the high 7 bits) and 25 zero bits. The last page is the one corresponding to the current (!) value of the row version number generator for this segment file (FastSequence). Since the generator value is not reset during the table's lifetime, intensive updates followed by garbage collection will still result in information about non-existent pages being stored in the reverse range map. The pages themselves are purely logical, covering ranges of 32768 row version identifiers, which corresponds to the two low bytes of the ctid. Such a logical page is the minimum unit for which summarized information is stored in the index. This means that the minimum amount of data read from disk depends on the row width of a particular table and, in the case of columnar storage, on the columns required by the query, in contrast to a heap table, for which the page size is fixed. This is important to keep in mind when choosing the pages_per_range value at index creation.
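To get a feel for the scale (hypothetical table name): with logical pages of 32768 row identifiers, one summarized range at the stock Postgres default of pages_per_range = 128 covers 128 × 32768 ≈ 4.2 million row versions within a segment file, so for wide rows or column-oriented storage a smaller value may be worth choosing:

```sql
-- For an AO table, a BRIN "page" is a logical block of 32768 row identifiers,
-- so pages_per_range = 16 means one summary entry per ~524 thousand row versions.
CREATE INDEX sales_ao_date_brin ON sales_ao USING brin (sale_date)
    WITH (pages_per_range = 16);
```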

Thus, for AO tables, a brin index scan is performed as follows:

  1. For each segment file of the table, we build a sequence of logical pages (Block Sequence).

  2. For each page sequence, we read from the meta page the number of the first page of the reverse range map.

  3. For each page range, starting with the range corresponding to the segment file number:

  4. We find the required reverse range map page in the chain by discarding the most significant bits corresponding to the segment file number and dividing by the maximum number of ranges per map page.

  5. We move on to the next page of the reverse range map while the current page number is less than the desired one.

  6. We compute the item pointer number within the map page as the remainder of dividing the page number by the maximum number of ranges that fit on it.

  7. Using the resulting offset, we obtain the identifier (page plus record number) of the entry describing the current range.

  8. We fetch the page and the record containing the description of the range.

  9. Using the support function, we check whether the summarized range information matches the requested condition. If the range may contain values of interest to the user, or the summary information is missing (for example, it has not been collected yet), each of its logical pages is marked in the bitmap.

  10. Once the bitmap has been built (the Bitmap Index Scan operator), we sequentially extract the row versions whose identifiers belong to the marked logical pages (Bitmap Heap Scan) and recheck the user-specified condition for them. The block map (Block Directory) helps us obtain the offsets of the physical blocks in the files that correspond to the start of each logical page, which reduces the amount of data that has to be read from disk.

As an example, consider scanning a three-column table with btree and brin indexes on an integer column b, using a query that retrieves a range containing approximately 0.7% of the records. Below are the query plans and their execution times.
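The exact schema used for this example is not reproduced here; a minimal sketch of a comparable setup (hypothetical names, an append-optimized column-oriented table) might be:

```sql
CREATE TABLE t (a int, b int, c text)
    WITH (appendoptimized = true, orientation = column)
    DISTRIBUTED BY (a);

CREATE INDEX t_b_btree ON t USING btree (b);
CREATE INDEX t_b_brin  ON t USING brin  (b);

-- A range predicate selecting roughly 0.7% of the rows.
EXPLAIN (ANALYZE)
SELECT count(*) FROM t WHERE b BETWEEN 100000 AND 107000;
```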

ORCA in Greenplum 7.1 misjudges the cost of the index scan and falls back to a sequential scan:

Meanwhile, to obtain the same result using the brin index, the segment storing most of the matching rows only needs to scan two logical pages (Heap Blocks). These blocks contain only 23430 matching rows; the remaining 42105 are filtered out when the condition is rechecked.

A similar result can be achieved with a btree index. The difference is that this access method, given the number of values retrieved, allows building an exact bitmap (down to the row version identifier) and avoids rechecking the condition. The price for this is the size of the index, which is two orders of magnitude larger.

Parallel operations within a segment

Postgres 9.6 introduced the ability to parallelize many operations within a query. However, the Greenplum developers did not find a quick way to combine query parallelism across the cluster with parallelism within an individual segment, and, to be on the safe side, disabled intra-segment query parallelism:

To get an idea of ​​the potential implementation challenges, just look at the parallel plan for Postgres:
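The original plan is not reproduced here, but a typical way to obtain one in stock Postgres is a join over large tables (hypothetical names; the plan shape in the comment is the usual one for a parallel hash join):

```sql
-- A typical parallel plan shape in Postgres:
--   Gather
--     ->  Parallel Hash Join
--           ->  Parallel Seq Scan on orders o
--           ->  Parallel Hash
--                 ->  Parallel Seq Scan on customers c
EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;
```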

Then try to imagine what a distributed version of such a plan would look like:

Here X is the number of worker processes, not the number of segments in the cluster. Replacing Gather with Gather Motion looks natural; with Redistribute Motion it is more difficult: each worker process on a segment should receive only the rows belonging to that segment, yet evenly distributed among the workers.

In addition, the question of parallelizing scans of Append Optimized tables remains open. The existing heap approach relies on a fixed page size and on the number of pages being known when the scan starts.

On the other hand, this could significantly simplify cluster management, as well as the distribution of resources within a single host, including by regulating the number of worker processes – provided that one Postgres instance can utilize the resources of an entire physical machine.

Extending the capabilities of Foreign tables

Postgres provides developers with an API for foreign tables (Foreign Data Wrappers) that allows retrieving or modifying data in arbitrary sources. Such a source can be a file, S3 storage, Postgres, Kafka, or even another Greenplum cluster.

The process of retrieving rows from external storage is hidden behind the Foreign Scan plan node. Let's imagine that we have two tables in an external Postgres whose join result we want to use in Greenplum:
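A minimal postgres_fdw setup along these lines (hypothetical names and connection parameters) could look like this:

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER ext_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'pg.example.com', port '5432', dbname 'sales');

CREATE USER MAPPING FOR CURRENT_USER SERVER ext_pg
    OPTIONS (user 'reader', password 'secret');

CREATE FOREIGN TABLE orders_ext (
    id          int,
    customer_id int,
    amount      numeric
) SERVER ext_pg OPTIONS (schema_name 'public', table_name 'orders');

CREATE FOREIGN TABLE customers_ext (
    id   int,
    name text
) SERVER ext_pg OPTIONS (schema_name 'public', table_name 'customers');
```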

In Postgres 9.4 and Greenplum 6, FDW capabilities were limited to pushing predicates down to the remote system. The join itself and its subsequent processing would be performed on the query initiator (in this case, the coordinator).

However, life in the Postgres world went on. Postgres 9.6 introduced pushdown of joins, sorts and UPDATE/DELETE operations. Postgres 10 added support for aggregates. Postgres 12 expanded sort pushdown and also implemented pushdown of LIMIT. Having inherited these changes, we can see the following query plan, in which the result is evaluated entirely on the remote system:
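As a hedged sketch built on the foreign tables above, a query of roughly this shape – join, aggregate, sort, and LIMIT – is the kind of query that can now be evaluated entirely on the remote server, collapsing the local plan into a single Foreign Scan node:

```sql
EXPLAIN (VERBOSE)
SELECT c.name, sum(o.amount) AS total
FROM orders_ext o
JOIN customers_ext c ON c.id = o.customer_id
GROUP BY c.name
ORDER BY total DESC
LIMIT 10;
```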

Now this arsenal of capabilities can be used in connectors to various systems. However, to take advantage of them, the second optimizer shipped with Greenplum – ORCA – will also need to be modified: in current versions, ORCA builds a plan similar to what we saw in Greenplum 6.

Database activity monitoring capabilities

A query is taking a long time to complete, and there is no time to wait for EXPLAIN ANALYZE. All that the standard tooling in Greenplum 6 could offer was the wait_reason column of the pg_stat_activity view with the values lock, replication or resgroup. Not much. One had to walk around the entire cluster looking for the bottleneck, attaching a tracer, a debugger, and so on. In Postgres 9.6 this situation began to change (53be0b1) and continued to improve in Postgres 10 (6f3bd98 and 249cf07). Now a wide list of wait events for the current process is available to the user for analysis.

In this example, we can see that the coordinator has dispatched the query to the segments and is waiting for it to complete. Two of the three segments were blocked on I/O at the time of the snapshot: they were waiting for table files to be extended in order to load new data. The third one was waiting for its synchronous replica (mirror) to catch up before continuing the insert. There is currently no full list of wait events in the Greenplum documentation; you can use the Postgres documentation or study the sources for Greenplum-specific wait events.

Special attention should be paid to the new gp_stat_activity view, which collects information from the entire cluster – previously you had to write such a wrapper yourself.
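A hedged example of such a cluster-wide check (assuming gp_stat_activity exposes the same columns as pg_stat_activity plus the segment identifier):

```sql
-- Cluster-wide view of what every backend is waiting on right now.
SELECT gp_segment_id, pid, state, wait_event_type, wait_event, query
FROM gp_stat_activity
WHERE state != 'idle'
ORDER BY gp_segment_id;
```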

Also, thanks to the Postgres 10 patches, monitoring of system background processes has become available. In the example below, the WAL writer process was caught waiting for I/O while flushing the contents of the write-ahead log buffer to disk:
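The captured output is not reproduced here, but a query along these lines shows background processes such as the WAL writer (the backend_type values follow the Postgres naming):

```sql
-- Background processes are visible alongside client backends since Postgres 10.
SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'walwriter';
```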

And that is not all

We have looked at the impact of several more Postgres changes on the functionality of the new version of Greenplum. In the final, third part, we will continue reviewing such improvements and also talk about how capabilities specific to our database have changed. Stay tuned, and thanks for your attention!

I thank our designers @kawaii_anya and @nastoika_lera for their help in preparing this article.
