Databases. Development Basics and Working with Indexes in PostgreSQL

We continue sharing useful insights from our corporate meetups. Today's speaker, Factory5 senior backend developer Roman Gordienko, talks about the basic concepts of relational databases and how to use indexes in PostgreSQL. The information will be useful to anyone who works with relational databases or plans to do so in the future.

Relational databases

A relational database is a collection of data with predefined relationships between the items. The data is organized into a set of tables consisting of columns and rows, and it reflects some part of the real world.

Features of relational databases:

• The data model is pre-defined and strongly typed.

• Data is stored in tables, which may contain zero rows but must have at least one column.

• Each column has a name and a type that the values in all of its rows must follow.

• Columns are arranged in the order defined when the table was created.

• Database queries return results in the form of tables.

One of the fundamental concepts for working with these types of databases is a transaction.

Transaction — a group of sequential database operations that represents a single logical unit of work with the data.

Let's give a simple example that is often used to explain the essence of a transaction – transferring money from one account to another:

  1. Read the balance of account #1.

  2. Read the balance of account #2 – the target account.

  3. Decrease the balance of account #1.

  4. Increase the balance of account #2.

  5. Read the balance of account #1 and confirm that it has decreased.

  6. Read the balance of account #2 and confirm that it has increased.

  7. Commit the transaction.

All operations within a single transaction must be completed successfully, then and only then is the transaction considered completed. If at least one operation fails, the transaction is considered not completed – in this case, a ROLLBACK occurs and the data in the database is returned to the state it was in before the transaction started.
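In SQL, the transfer above collapses into a single transaction. A minimal sketch, assuming a hypothetical accounts table with id and balance columns:

```sql
BEGIN;

-- move 100 units from account #1 to account #2
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;  -- on any error, issue ROLLBACK instead
```

If either UPDATE fails, a ROLLBACK returns both balances to their pre-transaction state.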

ACID principle

Now that we've covered transactions, we can talk about the ACID principle, which most relational databases comply with. ACID is an abbreviation for:

• Atomicity – guarantees that each transaction is executed in full or not at all.

• Consistency – each successful transaction records only valid results.

• Isolation – while a transaction is running, concurrent transactions must not affect its outcome.

• Durability – once the user has received confirmation that a transaction has completed, they can be sure the changes will not be undone by a failure.

Primary and Foreign Key

Just like a transaction, the concepts of primary and foreign keys are fundamental entities when working with relational databases.

Primary key — a column or set of columns whose values uniquely identify each record. It is declared with the PRIMARY KEY constraint.

Types of Primary Keys:

• Natural (passport number, SNILS)

• Surrogate (INTEGER ID, UUID)

Primary Key Constraints:

• NOT NULL

• UNIQUE

• INDEX

A foreign key provides an unambiguous logical connection between tables of one database.

For example, there are two tables, A (shoes) and B (colors). Table A has a primary key “size”, and table B has a column also called “size” that references it. Accordingly, “size” in table B is a foreign key that logically links tables B and A.
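The shoes/colors example could be sketched in DDL like this (table and column names are illustrative):

```sql
CREATE TABLE shoes (
    size  integer PRIMARY KEY,  -- natural primary key from the example
    model text NOT NULL
);

CREATE TABLE colors (
    id   serial PRIMARY KEY,              -- surrogate key
    name text NOT NULL,
    size integer REFERENCES shoes (size)  -- foreign key linking B to A
);
```

PostgreSQL will now reject any row in colors whose size has no matching row in shoes.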

Normalization

In order for work with the database to be effective and correct, as well as to eliminate various anomalies when saving data, it is necessary to adhere to the principles of data normalization when designing the database.

Normalization is a way of organizing data. A normalized database contains no repeated data, is easier to work with, and can be structured differently for different tasks.

During normalization, data is transformed so that it takes up less space and element lookup is fast and efficient. To do this, additional tables are created and linked to each other by keys – columns in which there are no repeating elements.

First Normal Form (1NF)

For a database to be in first normal form, its tables must satisfy these principles:

• There should be no duplicate rows in the table.

• Each table cell stores an atomic value.

• The column stores data of one type.

• There are no arrays or lists of any kind.

Second Normal Form (2NF)

For a database to be in second normal form, its tables must meet these requirements:

• The table is in first normal form.

• The table must have a key.

• All non-key columns of the table must depend on the complete key (if it is composite).

Third normal form (3NF)

The requirement for this form includes two points:

• The table must be in second normal form.

• All columns in a table depend on the primary key and are independent of each other.

There are six normal forms in total, but for most tasks it is sufficient for data to arrive in the database in 3rd normal form.
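As a sketch of what normalization fixes, here is a table that violates 1NF by storing a list in one cell, and its normalized replacement (names are illustrative):

```sql
-- Violates 1NF: the phones column holds a comma-separated list
CREATE TABLE customers_flat (
    id     serial PRIMARY KEY,
    name   text NOT NULL,
    phones text  -- e.g. '555-1234, 555-5678'
);

-- Normalized: each phone number is an atomic value in its own row
CREATE TABLE customers (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE phones (
    customer_id integer REFERENCES customers (id),
    phone       text NOT NULL
);
```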

Let's talk about indexes

An index is a data structure that allows a database to find and retrieve information from tables more quickly and efficiently. Indexes work like the table of contents in a book – they indicate the location of specific data and provide quick access to it. Indexes are needed to reduce query processing time and reduce the load on the system.

However, indexes do not always have a positive effect on performance. When inserting, updating, and deleting data, indexes also need to be updated, which requires additional time and resources. Therefore, it is necessary to carefully analyze and select indexes so that they improve query performance and do not increase the load on the system.

Indexes in PostgreSQL

PostgreSQL has the following types of indexes:

• B-tree (supports all standard comparison operations and can be used with most data types – for sorting, uniqueness constraints and searching by range of values);

• Hash (provide fast access to data by equality; do not support sorting or searching by range of values);

• GiST (generalized and multi-purpose indexes that are needed to work with complex data types: geometric objects, texts and arrays; allow fast searches on spatial, text and hierarchical data);

• SP-GiST (needed for working with non-intersecting and unevenly distributed data; effective for searching in geometric and IP address data);

• GIN (used for full-text search and search by arrays, JSON and trigrams);

• BRIN (provide a compact representation of large amounts of data, especially when the values in the table are stored in a specific order; effective for storing and processing time series and geographic data).
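The index type is chosen in CREATE INDEX with the USING clause; B-tree is the default. A sketch with hypothetical tables:

```sql
-- B-tree (default): range queries, sorting, uniqueness
CREATE INDEX ON orders (created_at);

-- Hash: equality lookups only
CREATE INDEX ON users USING hash (email);

-- GIN: arrays, JSONB, full-text search
CREATE INDEX ON articles USING gin (tags);

-- BRIN: large tables whose values correlate with physical order (time series)
CREATE INDEX ON metrics USING brin (recorded_at);
```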

Let's look at the data types that are found in PostgreSQL.

• Text data types, including 'text', 'varchar', 'char', 'citext'. B-tree indexes can be used to index text data.

• Numeric data types. They can also be indexed using B-tree – they also support comparison operations.

• Date and time. Data types like 'date', 'time', 'timestamp', 'interval' also support indexing using B-tree indexes.

• Arrays. They can be indexed using GIN or GiST indexes. GIN indexes are suitable for quickly searching for elements in an array, while GiST indexes can be used to find intersections between arrays.

• Spatial data types. 'geometry', 'geography' can be indexed using GiST and SP-GiST indexes.

• JSON and JSONB can be indexed using GIN or GiST indexes.
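For example, a GIN index on a jsonb column lets the containment operator @> avoid a full table scan (the events table and its payload are illustrative):

```sql
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);

CREATE INDEX idx_events_payload ON events USING gin (payload);

-- @> ("contains") can use the GIN index
SELECT * FROM events WHERE payload @> '{"type": "click"}';
```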

Basics of Explain in PostgreSQL

You have probably encountered a situation where you write a SQL query and it initially runs in a second or half a second, but over time it becomes very slow. To understand why execution has slowed down, you need to use the EXPLAIN command.

How does EXPLAIN work? Let's say we create a table and fill it with random data, then run a SELECT prefixed with EXPLAIN. This gives us the query plan: EXPLAIN shows how the planner will execute the query. First the parser breaks the query down into its constituent parts, then the analyzer steps in, and then the rewriter may rewrite the query into a more efficient form.
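A session along those lines might look like this (a demo table foo with columns c1 and c2 is assumed; the plan numbers are illustrative and will vary):

```sql
CREATE TABLE foo (c1 integer, c2 text);

INSERT INTO foo
SELECT i, md5(random()::text)
FROM generate_series(1, 1000000) AS i;

EXPLAIN SELECT * FROM foo;
--                          QUERY PLAN
-- ------------------------------------------------------------
--  Seq Scan on foo  (cost=0.00..18334.00 rows=1000000 width=37)
```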

Seq Scan means reading the data of table foo sequentially, block by block.

cost is not time but an abstract unit the planner uses to estimate the expense of operations; the two numbers are the cost of obtaining the first row and of obtaining all rows.

rows — the approximate number of rows the Seq Scan operation is expected to return. This estimate comes from the planner.

width — the average size of one row in bytes.

If you run EXPLAIN and see a Seq Scan, you either need to think about how your database is designed – perhaps you need to add columns or split data across tables (we talked about the three normal forms earlier; maybe you need to bring your database into one of them) – or you need to add indexes.

In addition to EXPLAIN, there is also the ANALYZE command. When it is executed:

• a certain number of table rows, selected at random, are read;

• statistics are collected on the values in each of the table's columns.

That is, EXPLAIN reflects the planner's expectations, while ANALYZE gathers up-to-date statistics about the physical data on which those expectations are based.

To compare the planner's expectations with the actual execution, we use EXPLAIN (ANALYZE). The following fields are added to the command output:

actual time — the actual time in milliseconds taken to retrieve the first row and all rows, respectively

rows — the actual number of rows obtained from Seq Scan

loops — how many times the Seq Scan operation had to be performed

Total runtime — total query execution time (recent PostgreSQL versions report this as Planning Time and Execution Time)
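An EXPLAIN (ANALYZE) run on the same demo table might produce output along these lines (timings are illustrative):

```sql
EXPLAIN (ANALYZE) SELECT * FROM foo;
--  Seq Scan on foo  (cost=0.00..18334.00 rows=1000000 width=37)
--                   (actual time=0.012..96.2 rows=1000000 loops=1)
--  Planning Time: 0.05 ms
--  Execution Time: 135.4 ms
```

Note that EXPLAIN (ANALYZE) really executes the query, so when profiling INSERT, UPDATE or DELETE statements, wrap it in a transaction you can roll back.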

How does query performance change when using an index?

Without an index, the database performs a Seq Scan – a sequential scan from the first row to the last, through every row in the table.

If you add an index on the filtered column (in our example, c1), you can see how the QUERY PLAN and the execution time change: instead of a Seq Scan, the planner chooses an index scan.

To summarize: if we see Seq Scan in the query, it means that the selection is made without using indexes. In this case, adding an index can speed up our query.
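A sketch of the effect, again on the demo table foo (plan output is illustrative):

```sql
CREATE INDEX idx_foo_c1 ON foo (c1);

EXPLAIN SELECT * FROM foo WHERE c1 < 500;
--                          QUERY PLAN
-- -------------------------------------------------------------------
--  Index Scan using idx_foo_c1 on foo  (cost=0.42..25.43 rows=499 width=37)
```

Note that the planner uses the index only when the condition is selective; for a predicate such as c1 > 500 that matches almost every row of foo, a Seq Scan remains cheaper and the planner will keep using it.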

During long-term development, when creating services with a long life cycle, the following problems may arise when using indexes:

• unused indexes

• prefix “clones”

• timestamp “in the middle”

• indexed boolean

• arrays in index

• NULL garbage.

Conclusion

Today we got acquainted with relational databases and learned the basics of working with them effectively. The most complete information about a specific database is usually found in its own documentation. We hope this article helps novice database developers do their work effectively. See you soon!
