immersion into the topic

The structure of databases and tables

To understand how indexes work, let's look at how databases are structured.

This is what our cluster looks like now:

The cluster data files are located in the data directory. Each database has its own subfolder in data/base, and each table and index has its own file. Such a table file is called a Heap File. It contains a list of unordered records of varying lengths and must not exceed 1 GB. When a file reaches 1 GB, the next segment file is created, so a large table will be stored in two or more files. Each file, in turn, consists of an array of pages: a page holds a number of table rows, and the set of pages makes up the table on disk.

Each page contains:

  • a header;

  • rows, each consisting of a row header and the row data itself;

  • row pointers (ctid) – metadata consisting of a page number and an offset within the page; with their help the DBMS can quickly locate the required row.

Picture for clarity:

Along with the table file, there is an FSM file on the server — the Free Space Map. Since the default page size is 8 KB, the FSM helps the server immediately see where new data can be written, instead of scanning all the table pages in search of free space. Importantly, the FSM is not updated every time a row in the table is updated or deleted. There is a logic to this: keeping old row versions is necessary for the concurrency mechanism to work correctly, so that different users can work with the DBMS simultaneously.

TOAST mechanism

The server does not allow a single row to span multiple pages. For oversized values, the TOAST mechanism – The Oversized-Attribute Storage Technique – is used instead. Each table has an associated TOAST table, which stores large values cut into chunks of about 2 KB. The column of the original table then simply holds a reference to the place in the TOAST table where the real values are stored.
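As a quick check, the TOAST table associated with a given table can be found through the pg_class system catalog. A minimal sketch, where my_table is a placeholder name:

```sql
-- Find the TOAST table associated with a given table.
-- 'my_table' is an illustrative name; substitute your own table.
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'my_table';
```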

You can read more about the TOAST mechanism in this longread from the developer of PostgreSQL.

Fill Factor

The theoretical basis above should help in understanding what Fill Factor is and how it works. It is a parameter that can be set when creating or rebuilding an index, telling the server what percentage of each page to fill. By default this value is 100%, that is, every page is packed to the limit. But this is not always good. If all pages are filled to the limit and we then, through some index, try to insert a row into one of those pages, it simply will not fit. In that case the server has to perform quite a lot of extra work to accommodate the row.

At the same time, if we use a classic clustered index with an increasing ID value and new rows are always inserted at the end, leaving a lot of empty space on the data pages by making the Fill Factor small is also inefficient, since that too hurts performance: the data becomes more spread out, and we have to a) read more pages into memory and b) use more space in the cache, which is bad.

Conclusion: the more often and the more heavily our data changes, the lower the Fill Factor value should be.
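In PostgreSQL the fill factor is set through the storage-parameter syntax when creating or altering an index. A minimal sketch; the table and column names are illustrative:

```sql
-- Leave 30% of each index page free for future inserts and updates:
CREATE INDEX orders_customer_idx
    ON orders (customer_id)
    WITH (fillfactor = 70);

-- The value can be changed later and applied with a rebuild:
ALTER INDEX orders_customer_idx SET (fillfactor = 80);
REINDEX INDEX orders_customer_idx;
```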

What else is important to know about Fill Factor:

  • you cannot apply the same fill factor to all indexes;

  • Fill Factor applies only to indexes, not to all tables;

  • does not apply to LOB pages, that is, to large data that is no longer stored in the tables themselves, but in TOAST tables associated with them;

  • does not affect new pages inserted at the end of the index;

  • being a non-trivial parameter, it is used mostly by experienced database specialists.

VACUUM

The VACUUM command is used to clean up old, no-longer-needed row versions. Strictly speaking, it does not return space to the operating system, i.e. it does not physically delete such rows, but only marks the space they occupy as reusable. A full rewrite of the table can be done with the VACUUM FULL command. Along with the table file, a VM file — the Visibility Map — is also stored on the server. It tracks the presence or absence of dead row versions: if a page's bit is set to 1, that page contains no outdated row versions.

Data fragmentation increases if the database is not maintained, so it is necessary to run VACUUM periodically. For actively updated databases, it is recommended to run VACUUM every night. But VACUUM FULL is best used only when a lot of data relative to the table itself has been deleted – about 30-40%.

The VACUUM ANALYZE command collects statistics about the contents of tables and stores the results in a special system catalog. The query planner can use the statistics obtained to determine the most efficient query execution plan. VACUUM ANALYZE can be used with a parameter for a specific table in the database.
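A minimal sketch of how these commands are typically invoked; the table name is a placeholder:

```sql
-- Mark dead row versions in one table as reusable space:
VACUUM orders;

-- Also refresh planner statistics while vacuuming:
VACUUM ANALYZE orders;

-- Fully rewrite the table, returning space to the OS
-- (takes an exclusive lock, so use sparingly):
VACUUM FULL orders;
```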

Autovacuum is a background process that performs all the cleanup and marking of dead row versions automatically. It is enabled by default, and disabling it is not recommended.

Indexes: theoretical basis

Now let's move on directly to indexes.

An index is a database object that can be created and deleted, just like any table. Indexes are created for table columns and even for views. They allow you to find the necessary values without scanning everything. Here's a simple analogy: when we need to find, say, a book by Stephen Hawking in a store, we go to the shelves labeled “Pop Science.” Indexes work in much the same way.

In essence, indexes can be called a method of accessing data. They establish a correspondence between the key, that is, the value of the column we are indexing, and the table rows in which this key occurs. Using index scanning, we can quickly find those rows in which this key is present.

The number of records the query filters determines whether an index or a full table scan is used. Generally, if the query filters a small number of records relative to the table itself, such as 100 rows in a million-row table, then we use an index.

What else is important to know:

  • whenever indexed data in a table is updated, the indexes also need to be updated;

  • indexes have maintenance costs because they have their own structure;

  • indexes have their own limitations, including on the operations they support.

Scanning methods

Scan methods are how the system will execute our query to the database. The scan method is determined by the optimizer, which searches for the most efficient plan for executing the query.

There are four main scanning methods:

  1. Sequential scan: simply iterates over every row; used by default.

  2. Index scan: used when the data set being indexed is large; the result set will potentially contain relatively few rows.

  3. Index only scan: some indexes store the values themselves along with the row identifiers, which allows reading the result directly from the index without accessing the table at all. Index-only scanning is more efficient than a simple index scan. But there is a nuance: we must consult the Visibility Map to find out whether the records are up to date. Only if a page contains no outdated rows can an index-only scan be used.

  4. Bitmap scan: used when the result set is relatively large — too large for an index scan to be efficient, but not so large that a plain sequential scan of every row would win. As the result set grows, so does the probability of reading the same pages many times, and in such cases a bitmap scan is used. An advantage of this method is that it can combine the results of scanning more than one index.

Index types

PostgreSQL supports different types of indexes for different purposes, including:

  • B-Tree, short for Balanced Tree – a balanced tree;

  • Hash – a hash index; unlike a B-Tree, it stores hash codes of the values rather than the values themselves;

  • and others.

If we write the command SELECT amname FROM pg_am in pgAdmin, we can see what index types are available on the server. In about 95% of cases, the B-Tree index type is used, since it is applicable to any data that can be sorted and covers the widest range of tasks. Other index types are used only in exceptional situations, so I will not dwell on them in detail.
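For reference, the catalog query mentioned above looks like this; the exact list of access methods depends on the server version and installed extensions:

```sql
-- List the index access methods available on this server.
-- Typical results include btree, hash, gist, gin, spgist, brin.
SELECT amname FROM pg_am;
```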

B-Tree

When we simply create an index with the CREATE INDEX index_name ON table_name (column_name) command, this type of index is created by default. B-Tree supports the <, >, <=, >=, = operators and LIKE with a fixed prefix, such as LIKE 'abc%'. Conversely, it does not help with patterns like LIKE '%abc'. B-Tree can also serve queries with IS NULL and IS NOT NULL conditions.
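A minimal sketch of the supported operators; the table and column names are illustrative:

```sql
-- The default index type is B-Tree:
CREATE INDEX books_title_idx ON books (title);

-- Queries a B-Tree can serve:
SELECT * FROM books WHERE title = 'A Brief History of Time';
SELECT * FROM books WHERE title LIKE 'A Brief%';   -- fixed prefix
SELECT * FROM books WHERE title IS NOT NULL;

-- A pattern this index cannot help with:
SELECT * FROM books WHERE title LIKE '%of Time';   -- no fixed prefix
```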

The complexity of searching this index is logarithmic. Since the indexed data is kept sorted, we can narrow the search by halving the range at each step, reaching the result in about log₂N comparisons. For example, in a table of eight rows, this algorithm finds the required value in three iterations:

There are two subtypes of B-Tree index: clustered (such as PRIMARY KEY, UNIQUE, and ID) and nonclustered.

Comparison of B-Tree Types

| Criterion | Clustered | Non-clustered |
| --- | --- | --- |
| Storage procedure | Sorts and stores data according to the sorting rule; uses the primary key to structure the data in the table; always changes the physical order of the stored data | Does not organize physical storage, using labels to access the data; creates its own separate data structure, independent of the actual table it is built for |
| Index leaf nodes | Used for storage | Not used for storage |
| Disk space | Takes up a lot of space | Takes up little space |
| Access speed | Fast | Slow |
| Additional disk space | Not needed | Needed, because the index is stored separately from the physical table data |
| Ease of use | Does not require an explicit declaration and is created by default when a key is defined, so it is easier to use, including in terms of syntax | Requires an explicit definition; contains only the table columns on which it is defined, so the query system needs an additional operation to get data from the index |
| Sorting data | Possible | Not possible |
| Multiple indexes on one table | Not possible | Possible |
| Specificity | Can improve performance when retrieving the required data | Applies only to non-key columns that are used in join queries |

As you can see from the table, each of the B-Tree subtypes has its own advantages and disadvantages. In general, non-clustered indexes are quite useful for very large data sets, because they do not change the physical order of the stored data. A subtype of the non-clustered index is the covering index: an index that contains enough data to answer a query without accessing the table itself, which is certainly faster than going to the table through a regular index. However, this should not be abused either: as we include more and more information in a covering index, it becomes heavier and larger, and search time over it grows.
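In PostgreSQL (version 11 and later), a covering index can be built with the INCLUDE clause. A sketch with illustrative names:

```sql
-- The index stores customer_id for searching and order_date as payload,
-- so the query below can be answered by an index-only scan:
CREATE INDEX orders_customer_covering_idx
    ON orders (customer_id) INCLUDE (order_date);

SELECT customer_id, order_date
FROM orders
WHERE customer_id = 42;
```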

EXPLAIN

Before building an index, we need to understand which type of index to use. The EXPLAIN (SELECT …) command shows the query execution plan, including whether a particular index would be used. In turn, the EXPLAIN ANALYZE (SELECT …) command actually executes the query and shows, alongside the plan, the real run times and row counts, so we can determine the most efficient index to use.
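A minimal sketch; the table and column names are illustrative, and the plan output will differ on your data:

```sql
-- Show the planned execution strategy without running the query:
EXPLAIN
SELECT * FROM books WHERE title = 'A Brief History of Time';

-- Run the query and report actual timings and row counts:
EXPLAIN ANALYZE
SELECT * FROM books WHERE title = 'A Brief History of Time';
```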

I recommend reading more about the EXPLAIN command in the PostgreSQL documentation.

When indexes are not needed

In some cases, indexes are not used at all. These include:

  • small tables – since index scan and sequential scan methods will be virtually identical in time;

  • tables with frequent mass UPDATE and INSERT changes – because indexes also require rebuilding, and this will be inefficient;

  • frequently modified columns;

  • columns with a large number of NULL values;

  • columns with image, text or varchar(max) data types – indexing large, heavy data will take too long.

Conclusion

I hope this article has saved you some time for your first dive into indexes in PostgreSQL. The main thing to remember is that the purpose of indexing is to speed up the execution of our database query. If we try to apply an index to an inappropriate type of data or use an incorrect index type, then the query execution time risks increasing. So the delicate issue of indexing should be approached boldly, but wisely.

I will talk about the practical side of indexing databases in PostgreSQL in another article. In the meantime, ask your questions about indexes in the comments – I will try to answer everything.
