Greenplum Series. Part 2. Optimal DDL

Hi all!

In the previous article, we figured out how Greenplum's MPP architecture works. Today, in collaboration with @imzorin, we'll go deeper and look at DDL in this storage, highlighting the main points to pay attention to when choosing a table type, distribution, and so on.

Types of tables

There are two main types of tables in Greenplum: heap and AOT (Append-Optimized Table).

If you have worked with Postgres before, heap tables will be easy to understand, as they use the standard physical structure from Postgres: on each Greenplum segment, table records are stored sequentially in a single file.

This type of table is chosen when the table is small and its data will be actively modified later (deleted and/or updated). For example, it is the preferred choice for dictionary tables.

To create a heap table, set the APPENDONLY parameter to false in the DDL. For example:

create table t1 (
  col1 integer,
  col2 varchar(100) 
) with (APPENDONLY=false) 
distributed by (col1);

In the example you can see the line distributed by (col1). This is how the distribution is set; we will talk about it in more detail later.

Append-optimized tables (AOT) are tables with a special structure and a number of distinctive features. For example, any data orientation is available for this type of table, as well as all types of compression. AOT tables use fewer resources, and statistics on them are collected faster.

This type of table has a more complex internal structure: the table contains links to files that store large blocks of data. Each table has an internal index that supports its operation; this index is built on system fields and is not used in queries. Every append-optimized table creates several system objects in Greenplum and several files on disk.
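If you want to see these auxiliary objects, they are registered in the system catalog. Below is a minimal sketch that queries pg_appendonly; the exact set of columns (for example, columnstore) differs between Greenplum versions, so treat it as an illustration rather than a universal query:

-- list append-optimized tables together with their storage settings
-- and the auxiliary segment-file relation Greenplum creates for them
select c.relname            as table_name,
       a.columnstore        as is_column_oriented,   -- present in Greenplum 6
       a.compresstype,
       a.compresslevel,
       a.segrelid::regclass as ao_segment_relation
from pg_appendonly a
join pg_class c on c.oid = a.relid;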

It is recommended to create AOT tables if:

• The table will contain a lot of data

• The table consists of many columns

• Aggregation is performed only on a small number of columns

• Operations of querying and adding data in batches will be used predominantly

• There are individual columns that are updated frequently without changing the other values in the row

To create an AOT table, set the APPENDONLY parameter to true in the DDL. For example:

create table t1 (
  col1 integer,
  col2 varchar(100) 
) with (APPENDONLY=true) 
distributed by (col1);

Data orientation

The orientation of data in tables in GP can be row or columnar.

Row orientation is recommended for small tables whose data changes frequently (i.e. for heap tables), or for AOT tables in which most of the fields are used. The reason is that reading such data is much more efficient, because a table's data is stored in a single file per segment.

However, RLE_TYPE compression cannot be applied to such tables (more about compression below).

Columnar orientation is available only (!) for AOT. The data for each column is stored in a separate file, which slows down operations that touch many columns, but any type of compression can be chosen for such data, and it works more efficiently. It is recommended to choose this orientation if:

• Queries use only some of the table fields

• Aggregation is often used in queries

• The table is large and needs to be compressed well

Data orientation is specified at the DDL creation stage:

create table t1 (
  col1 integer,
  col2 varchar(100) 
) with (appendonly=true, ORIENTATION = [row/column]) 
distributed by (col1);

Compression

Compression stores data more compactly on disk. Different compression types and levels can be applied to different columns of a single table, but compression consumes a significant amount of CPU. In return, compression speeds up I/O, allowing more data to be obtained with each read from disk.

Greenplum offers several types of compression:

• QuickLZ (compression level 1) – fast and light on CPU, but a low compression level.

• ZLIB (levels 1-9) – the opposite of QuickLZ: a decent compression level, but slower and more resource-intensive. Used by default. It verifies the integrity of uncompressed data and is also suitable for streaming data.

• RLE_TYPE (levels 1-4) – run-length encoding, considered the most efficient method in terms of compact data storage on disk. The high compactness comes at the cost of extra time and CPU cycles to compress data on write and decompress it on read. RLE works best for data with many repeated values.

• ZSTD (levels 1-19) – the optimal option: fast operation combined with a high compression level.

When choosing compression, remember that the ZLIB and ZSTD algorithms are available for both row and column orientations of tables. The RLE_TYPE algorithm, in turn, is available only for column orientation of tables.

When choosing the compression type and level for tables, consider the following factors:

CPU load. There must always be CPU power available to apply compression.

Disk size. A minimal footprint on disk is one of the goals of compression, but it is not the only factor. Find the optimal settings that compress the data effectively without causing excessively long compression times or slow scan speeds.

Speed. Higher compression levels may result in slower system performance. Performance with compressed tables depends on hardware, query tuning, and other factors. Before making a final choice of compression type and level, we recommend benchmarking to determine the actual performance for your configuration.

In our practice, based on input from various customers, the most popular choice is ZSTD with a compression level of 1-5.
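One simple way to benchmark is to load the same sample of data into several tables created with different COMPRESSTYPE/COMPRESSLEVEL settings and compare their size on disk. A minimal sketch is below; the table name t1_zstd is illustrative, and get_ao_compression_ratio is the built-in helper for append-optimized tables (availability may vary slightly between Greenplum versions):

-- physical size of the candidate table on disk
select pg_size_pretty(pg_total_relation_size('t1_zstd'));

-- achieved compression ratio for an append-optimized table
select get_ao_compression_ratio('t1_zstd');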

Let's look at two examples of applying compression:

1. Compression applied to the entire table:

create table t1 (
  col1 integer, 
  col2 varchar(100) 
) with (appendonly=true, orientation = column,
COMPRESSTYPE=[quicklz/zlib/rle_type/zstd], COMPRESSLEVEL=1) 
distributed by (col1);

2. Compression of a specific column:

create table t1 (
  col1 integer ENCODING (COMPRESSTYPE=quicklz),
  col2 varchar(100)     
) with (appendonly=true, orientation = column) 
distributed by (col1);

Let's summarize the two table types in a short comparison:

• Data orientation: heap – row only; AOT – row or column.

• Physical storage: heap – records are stored sequentially in a single file per segment; AOT – the table references files that store large blocks of data.

• Compression: heap – not available (the COMPRESSTYPE/COMPRESSLEVEL storage options apply only to append-optimized tables); AOT – any compression type, for the whole table or for individual columns.

• Typical use: heap – small, frequently changing tables; AOT – large tables in which only some fields change.

Distribution

Distribution is how data is spread across Greenplum segments. It is the most important factor in system optimization and speed: to fully utilize parallel processing, the distribution key must be chosen correctly. There are several types of distribution, each with its own advantages and limitations.

If we look globally, the purpose of distribution is:

• Uniform distribution of data

• Locality of operations

Factors affecting the uniform distribution of data:

• Data selectivity. The more unique the key values, the more evenly the data is distributed, since identical values end up on the same segment.

• The key must not contain null.

• The key must not contain default values.

Factors affecting locality:

• Use of the key in join, group by, and partition by. The same key in different tables is the key to speed when they interact.

• Number of fields in the key. If the key consists of several fields, operations stay local only when they involve all of the key fields, in exactly the same order. Hashing several fields also takes longer than hashing one. In other words, for Greenplum, distributed by (col1, col2) ≠ distributed by (col2, col1).

• Field types in different tables. Different data types result in different hash values, which affects locality.
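To illustrate locality, here is a small sketch (the tables and columns are made up for the example): both tables are distributed by the same integer key, so a join on that key can be executed locally on each segment, while joining on anything else would generally force Greenplum to move data between segments (a Redistribute or Broadcast Motion in the plan).

-- two tables that share the same distribution key
create table orders (
  order_id  integer,
  client_id integer,
  amount    numeric(12,2)
) with (appendonly=true, orientation=column)
distributed by (client_id);

create table payments (
  payment_id integer,
  client_id  integer,
  paid_dt    date
) with (appendonly=true, orientation=column)
distributed by (client_id);

-- the join key matches both distribution keys, so each segment joins
-- only its own slice of data; the plan should not need a Redistribute Motion
explain
select o.client_id, sum(o.amount)
from orders o
join payments p on p.client_id = o.client_id
group by o.client_id;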

Data distribution is divided into 3 main types:

DISTRIBUTED BY. The most commonly used type of distribution (one could even say that the other two are rather exceptions). Hash values are computed for the selected fields, and rows are distributed across the primary segments based on those hashes.

create table t1 (
  col1 integer, 
  col2 varchar(100) 
) with (appendonly=false) 
distributed by (col1);

DISTRIBUTED REPLICATED. Often used for small tables (e.g. dictionaries). The full table contents are duplicated on every segment.

create table t1 (
  col1 integer,  
  col2 varchar(100) 
) with (appendonly=false) 
distributed replicated;

DISTRIBUTED RANDOMLY. As the name suggests, this is a roughly even distribution of data between segments; Greenplum uses a round-robin algorithm for it. In practice, this distribution is used when neither of the two previous ones is suitable.

create table t1 (
  col1 integer, 
  col2 varchar(100) 
) with (appendonly=false) 
distributed randomly;

To check the distribution of an existing table, you can run: select pg_get_table_distributedby('schema.my_table'::regclass::oid);
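Besides the distribution method itself, it is worth checking how evenly the rows actually landed on the segments. A minimal sketch using the hidden system column gp_segment_id (replace schema.my_table with your table):

-- row count per segment: a skewed table shows one segment
-- with noticeably more rows than the others
select gp_segment_id, count(*) as row_cnt
from schema.my_table
group by gp_segment_id
order by row_cnt desc;

The gp_toolkit schema also ships ready-made skew views (for example, gp_skew_coefficients) that serve the same purpose.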

Here are some tips on choosing a distribution key:

• It is preferable to use Integer type fields for the distribution key, since the hash for them is calculated much faster.

• Fields that are potential filters (i.e. that will often appear in WHERE clauses) should not be chosen as the distribution key. Otherwise, only some of the segments will do the work when queries filter on the distribution key.

• Always explicitly specify the distribution method in the DDL. Greenplum defaults to the first field (or PK) of the table as the key, which is unlikely to be a good solution for preserving locality and preventing data skew.

Partitioning

Partitioning is dividing a table into separate logical parts. In general, the partitioning functionality in Greenplum is the same as in Postgres. However, thanks to the MPP architecture, GP lets you create larger partitions than other types of databases.

The first type of partitioning, by range, is used for intervals (most often time periods):

create table t1 (
  col1 integer,
  col2 varchar(100),
  processed_dt date
)
distributed by (col1)
PARTITION BY RANGE(processed_dt)
(
  PARTITION p1 START('2020-01-01'::DATE) END('2029-12-31'::DATE)
  EVERY('1 month'::INTERVAL),
  DEFAULT PARTITION EXTRA
);

Above we created partitions (p1) on the processed_dt field from January 1, 2020 to December 31, 2029 at one-month intervals. If a value does not fall into this range, the row is placed in the default partition extra.

The second type of partitioning is by a list of values (list partitioning):

create table t1 (  
  col1 integer,  
  col2 varchar(100) 
)
distributed by (col1) 
PARTITION BY LIST(col2) 
(   
  PARTITION p1 VALUES('A'), 
  PARTITION p2 VALUES('B'), 
  DEFAULT PARTITION OTHERS   
);

The third type of partitioning is by the hash values of a field:

create table t1 (
  col1 integer,
  col2 varchar(100),
  processed_dt date
)
distributed by (col1)
PARTITION BY HASH(col1);

Partitions can also be added and removed:

ALTER TABLE t1 ADD PARTITION p2 START('2030-01-01'::DATE) END('2039-12-31'::DATE);
ALTER TABLE t1 DROP PARTITION FOR ('A');

When selecting a field for partitioning, it is important to understand that UPDATE is prohibited for such fields. Also, you should not choose the same field as both the distribution key and the partition key: in that scenario only one segment will work when a partition is accessed, which reduces query execution speed.

You can create a multi-level partition structure with subpartitions. Using a subpartition template ensures that every partition has the same subpartition design, including partitions you add later. For example, the following SQL creates a two-level partition structure:

CREATE TABLE sales (
  trans_id int,
  date date,
  amount decimal(9,2),
  region text
)
DISTRIBUTED BY (trans_id)
PARTITION BY RANGE (date)
SUBPARTITION BY LIST (region)
SUBPARTITION TEMPLATE (
  SUBPARTITION usa VALUES ('usa'),
  SUBPARTITION asia VALUES ('asia'),
  SUBPARTITION europe VALUES ('europe'),
  DEFAULT SUBPARTITION other_regions
)
(
  START (date '2011-01-01') INCLUSIVE
  END (date '2012-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month'),
  DEFAULT PARTITION outlying_dates
);

When creating multi-level partitions in ranges, it is easy to create a large number of subpartitions, some of which contain little or no data. This can add many unnecessary entries to the system tables, increasing the time and memory required to optimize and execute queries. To avoid this, you can increase the range interval or choose a different partitioning strategy to reduce the number of subpartitions created.
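To see how many partitions a table has actually ended up with, you can count the rows of the pg_partitions view (present in Greenplum 6 and earlier); the schema and table names below are illustrative:

-- number of leaf partitions created for a table
select count(*)
from pg_partitions
where schemaname = 'public'
  and tablename = 'sales';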

You can view partition metadata in the system catalog pg_partition:

• parrelid (oid) – object identifier of the partitioned table.

• parkind (char) – partition type: R for a range partition, L for a list partition.

• parlevel (smallint) – the partition level of this row: 0 for the top-level parent table, with each further level numbered sequentially (1, 2, ...).

• paristemplate (boolean) – whether this row represents a subpartition template definition (true) or an actual partitioning level (false).

• parnatts (smallint) – the number of attributes that define this level.

• paratts (int2vector) – an array of the attribute numbers involved in defining this level.

• parclass (oidvector) – the identifier(s) of the operator classes of the partition columns.
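As a quick example, the following query (for Greenplum versions that still have pg_partition; Greenplum 7 moves to the native PostgreSQL partitioning catalogs) lists partitioned tables with their partition type and level:

-- partitioned tables, the kind of partitioning and the level
select parrelid::regclass as table_name,
       parkind,
       parlevel,
       paristemplate
from pg_partition
order by table_name, parlevel;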

Do not create more partitions than necessary. Creating too many partitions can slow down management and maintenance tasks such as cleanup, segment repair, cluster expansion, disk usage checks, and more.

To answer the question of whether you should partition a table, review the checklist below.

The table is quite large. Large tables are good candidates for partitioning. If a table has millions or billions of records, you can see performance gains by logically breaking that data into smaller chunks.

Performance seems unsatisfactory. As with any performance tuning initiative, a table should only be partitioned if queries against that table result in slower response times than expected.

Table columns often appear in WHERE filters. For example, if most queries search for records by date, partitioning by month or day may be useful.

The warehouse maintains a rolling window of historical data. For example, you need to keep data for the last twelve months. If the data is partitioned by month, it is always easy to drop the oldest partition and load current data into the most recent partition.
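For the rolling-window scenario, the monthly maintenance might look like the sketch below. The dates and table name are illustrative and reuse the ALTER TABLE syntax shown earlier; note that if the table has a default partition, Greenplum requires splitting it rather than a plain ADD PARTITION.

-- drop the partition holding the oldest month
alter table t1 drop partition for ('2020-01-01'::date);

-- add a partition for the upcoming month
-- (if the table has a default partition, use
--  ALTER TABLE ... SPLIT DEFAULT PARTITION instead)
alter table t1 add partition
  start ('2030-01-01'::date) inclusive
  end ('2030-02-01'::date) exclusive;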

Conclusion

Now we know a little more about Greenplum. We identified some rules worth considering when choosing distribution and data orientation, refreshed our memory of partitioning, and can move on.

The next article will be the final one: in it we will figure out what helps optimize queries and how to monitor and anticipate problems.
