PostgreSQL — One for All? How to Work with Atypical Data in a Relational System

SAGE project data-flow diagram. Source: http://www.fortwiki.com/File:Sage_Data_Flow.jpg

The following principles formed the basis of the SAGE database:

  • one single database for the whole country;

  • writers enter their available data once;

  • readers always see up-to-date data;

  • separate views are created for users with different tasks.

The project, codenamed SAGE, was first offered to IBM, but the company declined due to a shortage of programmers. At that time, IBM employed about 200 developers, and the project, according to their estimates, required thousands of specialists. As a result, a separate company was created for the project — System Development Corporation. At various times, up to 7,000 people worked on SAGE.

Why am I telling this story? As databases spread, companies had to hire programmers to write the applications that accessed them, and not all of those programmers had a deep mathematical background. The first DBMSs were libraries of subroutines that moved data between memory and files. Their storage structures were either very simple (plain text) or mirrored in-memory data structures (lists and pointers). The problem was that developers were forced to write low-level, complex, and very fragile code, in which a tiny error could destroy the entire data set rather than just the record being accessed.

In fact, this is the problem that relational DBMSs were designed to solve (the term comes from the mathematical notion of a “relation”). The concept, which hides the physical storage level from developers and lets them work with a higher-level and safer logical representation, was invented at IBM in the late 1960s by the mathematician Edgar Codd. In the relational model there are no pointers, only values in table columns; rows are connected solely through matching values, without any “physical” links.

Edgar Codd described “Codd's 12 rules” that a relational DBMS must satisfy (in reality there are 13 of them, because the numbering starts from zero). Here are the rules:

0. The system must be able to manage databases using only its relational capabilities.

1. All information in the database at the logical level must be clearly represented in a single way: by values in tables.

2. Each individual data value must be logically accessible using a combination of table name, column name, and primary key value.

3. Unknown or missing values (NULL) must be supported for all data types when performing any operations.

4. The data dictionary must be stored in the form of relational tables, and the DBMS must support access to it using standard language tools.

5. The system must support at least one complete data sublanguage covering data definition, view definition, data manipulation, integrity constraints, authorization, and transaction management.

6. Every view that is theoretically updatable must support data manipulation operations: selection, insertion, modification, and deletion.

7. High-level insert, update, and delete operations must be supported over any set of rows.

8. Applications should not depend on the methods used to store data on physical media.

9. The presentation of data in the application should not depend on the structure of relational tables.

10. All information necessary to maintain and control integrity must be contained in the data dictionary.

11. The operation of the application should not be affected by the physical location of the database.

12. If the system provides a low-level (record-at-a-time) access language, it must not be possible to use it to bypass the integrity rules expressed in the high-level relational language.

Simply put, we see the following key principles:

  • separation of the logical and physical structure of the database;

  • logical structure: no pointers or links;

  • only tables whose rows have the same structure;

  • linking rows in different tables: the same value in the corresponding columns (see the sketch after this list);

  • a language that allows complex compound queries to be expressed;

  • the physical structure is NOT a copy of the logical structure; it can be very complex.
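To make the “matching values instead of pointers” idea concrete, here is a minimal sketch; the users and orders tables are hypothetical, invented just for illustration.

CREATE TABLE users (
    user_id int  PRIMARY KEY,
    name    text NOT NULL
);

CREATE TABLE orders (
    order_id int     PRIMARY KEY,
    user_id  int     NOT NULL REFERENCES users (user_id),  -- link by value, not by pointer
    amount   numeric NOT NULL
);

-- Rows are connected only because the user_id values match
SELECT u.name, o.amount
FROM users u
JOIN orders o ON o.user_id = u.user_id;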

Later, another requirement was added to Codd's list: transactionality. A database management system must be able to execute groups of operations (transactions) so that either all of them take effect or, if anything fails along the way, the database rolls back to its original state and no trace of the attempt remains.
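A minimal sketch of what this looks like in SQL; the accounts table and the amounts are hypothetical. Either both updates take effect, or neither does.

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- If anything fails before COMMIT, the whole transaction is rolled back
-- and the database returns to its original state.
COMMIT;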

The main problem of relational DBMSs

As is often the case, the problems stem from the specifics of the solution. The weak point of relational databases is scaling. Transactions are implemented inside the DBMS with a local log or record versioning, and making that machinery global or distributed is quite difficult.

To solve this problem, and at the same time increase fault tolerance, special mechanisms were invented – sharding and replication.

Sharding

Sharding is horizontal scaling of a cluster, in which separate parts of one database (shards) are placed on different nodes.

With this approach, writes can proceed in parallel across several shards, and their independent placement protects against everything failing at once. However, if even one node fails, the part of the data stored on it becomes inaccessible. This is countered by another mechanism: replication.

Replication

Replication is the automatic copying of data between multiple nodes. A replica is a copy of the data.

Replication increases the fault tolerance of the system and allows reads to be served in parallel. The limitations appear when the node that is supposed to receive a copy of the data does not respond for some reason. Another node can either keep trying to reach it, which stretches the write time by the duration of the wait, or store the data locally and report success, risking inconsistency between replicas. This trade-off is described more rigorously and in detail by the CAP theorem.

Sharding and replication

The two mechanisms are usually used together.

Creating a cluster of relational databases with sharding and replication is not such a difficult task, but for a long time each company had to implement it independently. Google, for example, kept the data behind its advertising platform in sharded MySQL clusters for years and fully switched to its own technologies such as Spanner only in the early 2010s. Relational systems themselves began to provide these mechanisms out of the box quite recently: sharding appeared in Oracle only in 2018, and in PostgreSQL it is fully implemented only in the paid Postgres Pro Shardman.

Alternative data storage systems

Meanwhile, developers kept looking for a replacement for the “imperfect” relational model. Thus, in the late 2000s, NoSQL database management systems such as MongoDB entered the arena. Its developers offered automatic sharding with replication, albeit requiring some upfront configuration, right out of the box.

The NoSQL class includes such common types of systems as key-value, document, graph, and object-oriented databases. All of these DBMSs, along with time-series databases, search engines, and vector databases, were invented at one time as ways to mitigate the shortcomings of relational DBMSs: above all, the hard-to-detect failures of large clusters built on consumer-grade hardware and open-source software without special reliability requirements, and the architectural mistakes of systems that used highly normalized relational data models.

Atypical data and ways to extend PostgreSQL

However, relational databases in general and PostgreSQL in particular are not standing still. Today, various DBMS extensions allow solving most problems without resorting to a “zoo” of many different systems. Let's consider working with individual types of atypical – that is, non-tabular – data.

How to store and search documents

In the key-value model mentioned above, every piece of data is addressed by a key, under which an arbitrary byte array can be stored and later retrieved. This model has the simplest possible API, with three basic operations: put, get, and delete.
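A minimal sketch of the same three operations on top of an ordinary PostgreSQL table; the kv table and the sample key are assumptions for illustration.

CREATE TABLE kv (
    key   text  PRIMARY KEY,
    value bytea NOT NULL
);

-- put: insert the value or overwrite an existing one
INSERT INTO kv (key, value)
VALUES ('user:42', '\x0102')
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;

-- get: fetch the value by key
SELECT value FROM kv WHERE key = 'user:42';

-- delete: remove the key
DELETE FROM kv WHERE key = 'user:42';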

A document is essentially a variant of a value in the key-value model, but with additional capabilities. You can access a document by its parts (fields), use specialized query languages such as XPath for XML, validate data automatically, build indexes on fields rather than searching only by key, and perform special operations. Similar capabilities are implemented in MongoDB, among others.

PostgreSQL tried to do similar things back in the late 1990s, but a decent result was achieved only in 2014 with the introduction of the binary JSONB type. Today, Postgres offers JSONB and JSONPATH for working with documents.

JSONB is a data type that allows a text document to be parsed as it is written and then stored in binary format. Primitive types are mapped to PostgreSQL types: text, numeric, and boolean. JSONB provides a variety of functions for creating and processing documents, additional operators for accessing keys and nested elements, and the ability to index the entire document or individual keys.
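A minimal sketch of what this looks like in practice; the docs table and its contents are made up for illustration.

CREATE TABLE docs (
    id  serial PRIMARY KEY,
    doc jsonb  NOT NULL
);

INSERT INTO docs (doc) VALUES
    ('{"title": "Report", "author": {"name": "Ann"}, "tags": ["q1", "draft"]}');

-- Access nested fields with the -> / ->> operators
SELECT doc->'author'->>'name' AS author FROM docs;

-- Containment search, accelerated by a GIN index over the whole document
CREATE INDEX docs_doc_idx ON docs USING gin (doc);
SELECT id FROM docs WHERE doc @> '{"tags": ["draft"]}';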

As for JSONPATH, it is a language for selecting parts of documents, inspired by XPath, the analogous language for XML. It lets you declaratively describe the route through a document to the elements you need; for example, you can select a field with a certain name without knowing in advance at what nesting level it sits.
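A hedged sketch of exactly that, reusing the hypothetical docs table from the previous example:

-- Find every "name" field anywhere in the document (the .** accessor
-- descends recursively, so the nesting level does not need to be known)
SELECT jsonb_path_query(doc, '$.**.name') FROM docs;

-- Filter rows whose document matches a jsonpath predicate
SELECT id FROM docs
WHERE doc @? '$.tags[*] ? (@ == "draft")';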

The first releases of JSONB had some limitations, but in 2021 the storage scheme was significantly improved, and operations are now much faster. Today, JSONB is a full-fledged working replacement for MongoDB. Perhaps the only remaining advantage of MongoDB here is ease of scaling, primarily thanks to sharding. With Postgres Pro Shardman, the case for MongoDB disappears entirely; if you are on the free version of PostgreSQL, out-of-the-box document storage remains confined to a single node.

How to Store and Work with Time Series

A time series is a sequence of measurements of some parameter. This could be temperature, pressure, taxi movements, financial transactions, stock market data, or call center calls. Each element of a time series contains a timestamp and a value. Unlike many other databases, queries to time series databases very rarely access a single record and usually involve aggregation. For example: output the average, maximum, and minimum temperature from a given sensor for each minute of the last hour.

PostgreSQL uses the open source solution TimescaleDB to work with time series. The essence of the approach is that a so-called hypertable is created, i.e. a table automatically partitioned by a column with a timestamp in accordance with a specified interval, for example, an hour per section. Each section is stored and indexed separately, and after a specified period, the “cold” sections, to which no more records will be added, are compressed for more compact storage. Other convenient features of the solution include a policy of automatic deletion of obsolete data, distributed storage of sections with replication, and the presence of specialized hyperfunctions.

It looks something like this:

CREATE TABLE conditions (
    tstamp      timestamptz NOT NULL,
    device      varchar(32) NOT NULL,
    temperature float       NOT NULL
);

-- Variant 1: partition by time only, one chunk per day
SELECT create_hypertable(
    'conditions', 'tstamp',
    chunk_time_interval => INTERVAL '1 day'
);

-- Variant 2: additionally split each daily chunk into 8 partitions by device
SELECT create_hypertable(
    'conditions', 'tstamp',
    partitioning_column => 'device',
    number_partitions   => 8,
    chunk_time_interval => INTERVAL '1 day'
);

In the second variant, each chunk still covers one day, but within that interval the data is additionally split into eight partitions according to the device that sent it.
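Queries like the one described earlier (per-minute statistics for the last hour) can then use TimescaleDB's time_bucket hyperfunction; the device name here is hypothetical.

SELECT time_bucket('1 minute', tstamp) AS minute,
       avg(temperature) AS avg_t,
       max(temperature) AS max_t,
       min(temperature) AS min_t
FROM conditions
WHERE device = 'sensor-42'
  AND tstamp > now() - INTERVAL '1 hour'
GROUP BY minute
ORDER BY minute;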

Overall, TimescaleDB works great with time series. There aren't many time series databases out there. Perhaps the closest open competitor to TimescaleDB in this area is ClickHouse. Both solutions are good in their own way, and to understand which one will work better on your data and your queries, it's best to try both and make a decision based on the pilot.

How to store embeddings and find them quickly

Another type of data that is poorly suited for relational DBMSs is vectors. Each record in a vector database is a large array of numbers, or a vector in a multidimensional space. A typical query for such a database is to find the vectors closest to a given vector. The proximity of vectors can be determined in various ways, such as using the cosine measure or the Euclidean distance. A full response to a query in non-specialized systems requires enumerating all stored vectors, since traditional indexes — B-tree and Bitmap — are unable to effectively speed up the solution of such a task.

The vector story is extremely popular right now because it underlies work with large language models like ChatGPT within the retrieval-augmented generation (RAG) approach. Working with large language models involves encoding text (tokens) into such vectors, called embeddings. We can encode all our documents as embeddings and store them in the database, then encode a query the same way and find the embeddings closest to it. The documents closest in meaning are added to the prompt to enrich the query context before it is passed to the large language model. Many neural-network products are now built on the RAG approach, which has driven the surge in popularity of vector databases. In the same way, by the way, you can search for images or sounds in vector databases, build recommendations, and detect anomalies, that is, the absence of sufficiently close vectors.

PostgreSQL has a simple extension for all of this called pgvector. It adds the vector data type and provides two search algorithms, IVFFlat and HNSW, to choose from when creating an index. IVFFlat builds faster and uses less memory but is less accurate; HNSW gives better accuracy and query speed at the cost of a slower build and more memory. IVFFlat essentially clusters the vectors and then scans only the clusters closest to the query, while HNSW organizes them into a multi-layer proximity graph.
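A rough sketch of the setup; the table t, the 1536-dimensional embeddings, and the choice of cosine distance are assumptions for illustration.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE t (
    text      text PRIMARY KEY,
    embedding vector(1536) NOT NULL
);

-- HNSW index: better recall and query speed, more memory and build time
CREATE INDEX ON t USING hnsw (embedding vector_cosine_ops);

-- Alternative, IVFFlat: cheaper to build, somewhat lower accuracy
-- CREATE INDEX ON t USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);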

Syntactically it looks like this:

SELECT text,
       embedding <=> (
           SELECT embedding FROM t WHERE text = '??'
       ) AS distance
FROM t
ORDER BY distance;

It is worth saying that there are many vector databases available today, including multi-million-dollar startups such as Pinecone and Qdrant. PostgreSQL with the pgvector extension is a pretty decent option; the main limitation, once again, is scaling.

How to build multidimensional cubes

The last specific kind of data not originally meant for relational databases is multidimensional cubes. They were invented as a reporting tool for data warehouses, designed for NON-programmers. To keep non-programmers from wrecking the database, they were given special data marts built on multidimensional cubes. OLAP (online analytical processing) technology consists of preparing aggregated information from large arrays of data structured along multiple dimensions. Data marts let users describe queries and report layouts themselves through a simple interface, and they reduce the risk of overloading the database with non-optimal queries.

Standard extensions for implementing OLAP scenarios in relational databases were added to the SQL language a quarter of a century ago: GROUP BY ROLLUP, GROUP BY CUBE, and GROUP BY GROUPING SETS. These extensions let you build a data mart on top of the source data and then put an interface over it.
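A minimal sketch of how such aggregation looks; the sales table and its columns are hypothetical.

-- ROLLUP produces per-region subtotals and a grand total in one query
SELECT region, product, sum(amount) AS total
FROM sales
GROUP BY ROLLUP (region, product);

-- GROUP BY CUBE (region, product) would additionally add totals per product
-- across all regions; GROUPING SETS lets you enumerate the groupings explicitly.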

Verdict

Relational databases were a response to a very important need in the industry: to decouple the physical storage level from the logical one, so that employees could write queries at the logical level without risking breaking the database. Relational databases gave us that simplicity, and it would be strange to give it up today. Adding NoSQL solutions, which could handle data atypical for relational databases but lacked a simple query language, made sense in the 2000s. Today, however, relational DBMSs, including PostgreSQL, have already answered the NoSQL wave with a number of tools that cover most tasks, including full-text search, time series, and even vector data. Of course, you will have to learn how these tools work first, but that is definitely easier than adopting a new DBMS for every type of task and hiring a staff of specialists who understand each of them.
