what they are, how to use them, and why you shouldn't create one for every query

Hello, my name is Ivan Yakunin, and I am a product analyst on the Fintech Marketplace team. We work on billing at Avito, study how users buy on the platform, and use this data to build safe and fast payment tools.

At Avito, we use several analytical databases to build our DWH and the analytics on top of it. One of them is Vertica. In this article I will talk about projections – one of the distinctive features of Vertica that we use to optimize our data marts and dashboards. With data volumes constantly growing and cluster resources limited, projections are a good way to optimize computations in the database.

What is a projection

In most databases, a table is a physical structure: a single object stored on disk. In Vertica, a table is a logical structure, and the physical object stored on disk is the projection. Moreover, one table (a logical structure) can refer to several projections (physical structures).

When you create a table, its first projection is created along with it; in Vertica it is called the super-projection. Any further projections you create manually hold a copy of the super-projection's data, but you can change the sorting, segmentation, filtering, and grouping of that data.

The closest analogy in other databases is a materialized view. Vertica updates projections whenever the super-projection changes, much like other databases refresh materialized views created with the ON COMMIT option (refreshed whenever the view's sources change).

Because a table is a logical structure that can refer to several projections at once, SQL queries can be executed more efficiently: you reference one table, and the Vertica optimizer analyzes which projection on disk lets the query run cheaper and faster. Simply put, a table is a symbolic link to multiple copies of the same data, assembled in different ways, and the query reads the copy that is cheapest to use.

Now let's look at what projections look like in code. Let's create a table to work with and fill it with a small sample of data:

CREATE TABLE public.test_projection (
	id int,
	event_date date,
	amount int
);

INSERT INTO public.test_projection SELECT 1, '2024-03-09', 19584;
INSERT INTO public.test_projection SELECT 2, '2024-03-10', 54686;
INSERT INTO public.test_projection SELECT 3, '2024-03-11', 79387;
INSERT INTO public.test_projection SELECT 4, '2024-03-12', 37693;
INSERT INTO public.test_projection SELECT 5, '2024-03-13', 68438;
INSERT INTO public.test_projection SELECT 6, '2024-03-14', 39868;

Let's see at what point the projection appears. To do this, let's look at the query execution plan when we read the entire table we created, with something like:
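
EXPLAIN SELECT * FROM public.test_projection;

The output: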

Access Path:
+-STORAGE ACCESS for test_projection [Cost: 6, Rows: 6] (PATH ID: 1)
|  Projection: public.test_projection_b0
|  Materialize: test_projection.id, test_projection.event_date, test_projection.amount
|  Execute on: All Nodes

We see that what is actually read from disk is the projection test_projection, but with the suffix b0 added to its name. Let's find out where that suffix came from.

SELECT GET_PROJECTIONS('public.test_projection');

Output:

Current system K is 1.
# of Nodes: X.
Table public.test_projection has 2 projections.
 
Projection Name: [Segmented] [Seg Cols] [# of Buddies] [Buddy Projections] [Safe] [UptoDate] [Stats]
----------------------------------------------------------------------------------------------------
public.test_projection_b1 [Segmented: Yes] [Seg Cols: "public.test_projection.id", 
"public.test_projection.event_date", "public.test_projection.amount"] [K: 1] 
[public.test_projection_b0] [Safe: Yes] [UptoDate: Yes] [Stats: Yes]
public.test_projection_b0 [Segmented: Yes] [Seg Cols: "public.test_projection.id", 
"public.test_projection.event_date", "public.test_projection.amount"] [K: 1] 
[public.test_projection_b1] [Safe: Yes] [UptoDate: Yes] [Stats: Yes]

Now it is clear that we were reading one of the projection's replicas. There are two of them because the cluster's K-safety is 1, so each projection has one buddy copy on another node. Before we move on to creating our own projections, let's record the “before” result so that we have something to compare with. Let's look at the plan of a query that groups the data by week.

EXPLAIN
SELECT
	date_trunc('WEEK', event_date) as date_week,
	sum(amount) as purchase_amount
FROM public.test_projection
GROUP BY 1;

Output:

Access Path:
+-GROUPBY HASH (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 5, Rows: 6] (PATH ID: 1)
|  Aggregates: sum(test_projection.amount)
|  Group By: date_trunc('WEEK', test_projection.event_date)
|  Execute on: All Nodes
| +---> STORAGE ACCESS for test_projection [Cost: 4, Rows: 6] (PATH ID: 2)
| |  	Projection: public.test_projection_b0
| |  	Materialize: test_projection.event_date, test_projection.amount
| |  	Execute on: All Nodes

Now let's create a new projection for this table.

CREATE PROJECTION test_projection_week AS
SELECT
	date_trunc('WEEK', event_date) as date_week,
	sum(amount) as purchase_amount,
	max(id) as purchase_id
FROM public.test_projection
GROUP BY 1;
 
SELECT start_refresh();

We truncate the date to the week and aggregate the values within the table. The call to the start_refresh function is required to trigger the initial fill of the projection from the super-projection. Until it is called, the UptoDate flag for this projection stays at “No”, and the projection cannot be used. In addition, once the projection is created, we can query it directly by name, bypassing the link through the main table.
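
To check whether the projection is ready, you can query the projections system catalog (here I am assuming the standard v_catalog.projections table and its is_up_to_date column):

SELECT projection_name, is_up_to_date
FROM v_catalog.projections
WHERE anchor_table_name = 'test_projection';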

Let's see how the execution plan of the previous query has changed.

Access Path:
+-GROUPBY PIPELINED [Cost: 2, Rows: 2 (NO STATISTICS)] (PATH ID: 1)
|  Aggregates: sum(test_projection_week.purchase_amount)
|  Group By: test_projection_week.date_week
|  Execute on: All Nodes
| +---> STORAGE ACCESS for public.test_projection_week (Rewritten LAP) [Cost: 1, Rows: 2 (NO STATISTICS)] (PATH ID: 2)
| |  	Projection: public.test_projection_week
| |  	Materialize: test_projection_week.date_week, test_projection_week.purchase_amount
| |  	Execute on: All Nodes

Now this query reads our new projection, since it is cheaper to run with it – the data is already grouped by week, so the database only needs to read two rows instead of six. On a test example we won't feel the difference between six rows and two, but on a real data mart this saves a lot of time and resources.

What happens if we add data to the table? Our projection is grouped by week, but the super-projection is not. Let's try to add a few days that belong to an existing week. First, let's record what the values look like before changing the data. We can query the projection directly by its name, bypassing the anchor table; for our sample rows the result should look something like this:
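
SELECT * FROM public.test_projection_week ORDER BY date_week;

 date_week  | purchase_amount | purchase_id
------------+-----------------+-------------
 2024-03-04 |           74270 |           2
 2024-03-11 |          225386 |           6

(These numbers are just the sample rows aggregated by week; Vertica's date_trunc('WEEK', …) snaps each date to the Monday of its week.)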

Now let's insert the data:

INSERT INTO public.test_projection SELECT 7, '2024-03-15', 38657;
INSERT INTO public.test_projection SELECT 8, '2024-03-16', 69382;
INSERT INTO public.test_projection SELECT 9, '2024-03-17', 68482;

After inserting the data, we do not have to trigger a refresh: from now on, Vertica keeps the projection up to date on its own. Here is what the same direct query should show after inserting the new days:
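
SELECT * FROM public.test_projection_week ORDER BY date_week;

 date_week  | purchase_amount | purchase_id
------------+-----------------+-------------
 2024-03-04 |           74270 |           2
 2024-03-11 |          401907 |           9

The week of 2024-03-11 has absorbed the three new rows, while the first week is unchanged.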

When to use projections

Above we looked at the main feature of projections: they are kept up to date with the super-projection automatically, and all of this is hidden behind one and the same table. This lets you speed up heavy, repetitive calculations – and not only the exact same piece of code repeated verbatim, but any code that uses similar slices, since Vertica decides which projection to read by analyzing the cost of the query. Most often such situations arise in routine calculations, when one data mart feeds many data marts that depend on it, or when visualizing data on dashboards. Next we will look at the second case and see how projections can speed up building charts on a dashboard.

Let's imagine that we have a table with a lot of raw data, and we need to build a dashboard with a large number of slices. For example, these could be events from one of our products, for which we need to set up funnel analytics and all the metrics we want to see. There can be a lot of events, so the dashboard can take quite a while to load. Dashboard users also tend to change chart parameters often, which produces frequent calls to the database with similar queries. With projections, we can precompute the heaviest and most frequent of these queries in advance and speed up the dashboard.

As an example, let's take the same table we created earlier, add a couple more projections to it, grouped by month and by day, and write a very simple query in Redash that draws a line chart of the purchase amount by date with a selectable granularity.
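
Here is a sketch of what this could look like. The new projections mirror test_projection_week, and {{granularity}} is a Redash dropdown parameter with the values DAY, WEEK, and MONTH (the parameter name here is my own, for illustration):

CREATE PROJECTION test_projection_month AS
SELECT
	date_trunc('MONTH', event_date) as date_month,
	sum(amount) as purchase_amount,
	max(id) as purchase_id
FROM public.test_projection
GROUP BY 1;

CREATE PROJECTION test_projection_day AS
SELECT
	date_trunc('DAY', event_date) as date_day,
	sum(amount) as purchase_amount,
	max(id) as purchase_id
FROM public.test_projection
GROUP BY 1;

SELECT start_refresh();

And the Redash query itself:

-- {{granularity}} is substituted by Redash before the query reaches Vertica
SELECT
	date_trunc('{{granularity}}', event_date) as date_x,
	sum(amount) as purchase_amount
FROM public.test_projection
GROUP BY 1
ORDER BY 1;

Whichever granularity the user picks, the query references only the anchor table, and the optimizer rewrites it to the matching aggregate projection, so each chart reads a handful of pre-aggregated rows instead of the raw data.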

This way we can speed up dashboards that are built on a large number of slices, and we do not have to worry about updating each slice, scheduling those updates, or writing bulky queries.

But it's important to remember that projections take up storage space, so you shouldn't create one for every query. If we create a projection for a query that will be executed only once, computing the projection will take as long as running the query itself, and nothing is gained. What matters is the balance between the speed-up of calculations and the disk and CPU resources we spend on computing and storing the projection.
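
To keep an eye on that balance, you can check how much disk space a table's projections occupy, for example through the projection_storage system table:

SELECT projection_name, SUM(used_bytes) AS used_bytes
FROM v_monitor.projection_storage
WHERE anchor_table_name = 'test_projection'
GROUP BY 1;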

Bottom line

Projections are pre-processed data from the main super-projection of the table.

The fact that Vertica updates projections on its own opens up a wide field for optimizing data marts and queries in data visualization systems. Projections are worth using when you have frequently reused pieces of code, especially over large amounts of data. The key word is “frequently”: creating projections for one-off queries is a bad idea – they take up storage space and do not save any calculation time.
