How it all began
In the 1960s there arose a need for a reliable model for storing and processing data, generated primarily by banks and financial institutions. At that time there were no uniform standards for working with data and models, and the work itself consisted of manually ordering and organizing the stored information.
At best, banks managed to record transaction information as files in a pre-prepared structure. Each organization had its own understanding of how it all should look and work. Concepts such as data consistency and data integrity did not yet exist. Files often contained duplicates of customer data and transactions, which had to be reconciled and cleaned up somehow, mostly by hand. In general, the data-handling problems of that era fall into several main types:
Each file represented its structure differently.
Data in different files had to be reconciled to keep information consistent.
Applications working with specific data were hard to develop and maintain, and had to be updated whenever the file structure changed.
In fact, what we see here is the opposite of the “clean architecture” that Robert C. Martin would later describe: application logic was tightly coupled to the physical structure of the data.
It should be noted that there were attempts to create models that would bring order to data and its processing. One such attempt was the hierarchical model, in which data is organized in a tree structure. The hierarchical model was in demand, but not flexible: each record could have only one “ancestor”, even though a record could have several “descendants”. As a result, such databases could represent only one-to-one or one-to-many relationships. The inability to model many-to-many relationships could lead to data problems and complicate the model. Moreover, the issues of data consistency and duplication of information were not addressed at all. The first hierarchical DBMS was IBM's IMS.
The network data model came to the aid of the hierarchical one, and this new concept finally implemented many-to-many relationships. The approach was proposed as a specification of the CODASYL model within the Data Base Task Group (DBTG).
But all these models were difficult to maintain. Edgar F. Codd managed to simplify the task of collecting and processing data: his fundamental work led to the emergence of the relational databases that almost every industry now needs. Codd proposed the Alpha language for manipulating relational data, and his colleagues at IBM, Donald Chamberlin and Raymond Boyce, created one of the languages influenced by his work. They named it SEQUEL (Structured English Query Language), but shortened the name to SQL because of an existing trademark.
The emergence of relational databases and their evolution
The active development of database technology began around 1970, when Codd published the paper that served as the basis for the relational data model. Among the advantages of this model, it is worth highlighting:
No data duplication.
Elimination of a number of errors and data anomalies found in other models.
All data represented as facts, stored in relations with columns (attributes) and rows (tuples).
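To make these points concrete, here is a minimal sketch of a relation as a table of facts, using Python's built-in sqlite3 module (the `account` table and its columns are purely illustrative, not from any real system):

```python
import sqlite3

# A relation: named columns (attributes) and rows (tuples).
# The primary key guarantees there are no duplicate tuples,
# so each row states exactly one fact.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,  -- unique key: no duplicate facts
        owner      TEXT NOT NULL,
        balance    REAL NOT NULL
    )
""")
conn.execute("INSERT INTO account VALUES (1, 'Alice', 100.0)")
conn.execute("INSERT INTO account VALUES (2, 'Bob', 250.0)")

# The set of rows is the relation; each row is one fact.
rows = conn.execute(
    "SELECT owner, balance FROM account ORDER BY account_id"
).fetchall()
print(rows)  # [('Alice', 100.0), ('Bob', 250.0)]
```

Trying to insert a second row with `account_id = 1` would raise an integrity error, which is exactly the "no duplicate facts" guarantee in action.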
One of the first RDBMSs was dBase-II from Ashton-Tate, released in 1979. The company managed to sell about 100,000 copies, which made it the most popular database product of its time. Ashton-Tate, by the way, was later acquired by Borland. Strictly speaking, though, calling this product a relational DBMS was quite a stretch.
But a start had been made, and other companies began to present their products. This is how Oracle, Ingres, and Focus appeared.
By the 1980s there was both an architectural (high-level) and an engineering (low-level) understanding of how an RDBMS should function, and the decision was made to introduce a standard (the ANSI and ISO SQL standards).
Database development intensified with the spread of local area networks, and later of the global network. Networks allowed equipment to be shared, and users wanted to work with an RDBMS together (in parallel), which accelerated the development of multi-user applications for LANs. The client-server data processing architecture was born.
Since the 90s the technology has become more user-friendly; according to the marketers, anyone could now figure out a DBMS. That is an exaggeration, of course, but the general direction of evolution is clear.
At the same time, the first online services began to appear (flower and gift delivery, postcards, blogging, etc.). Most of these services started out (and continue) to work on the PHP + MySQL stack.
The proliferation of cellular communications and cell phones from 1996 onward led to specialized databases for processing information on mobile devices. These have since evolved into a separate database stack (In-Memory) and are used either as a data cache in front of the main database, or as a Warm / Hot logical layer in Lambda / Kappa architectures.
In my opinion, a kind of evolution took place in the 2000s, when the non-relational model was integrated into the relational database (I am talking about the integration of the XML format). It became possible to define a model inside a model (in fact, any model can be described in a single table column). Examples of the technical implementation of such “fractal modeling” were Oracle Nested Tables, the XML type, and later JSON. These models were called post-relational models, and we can say that this is where work with NoSQL models of the key-value / key-document kind began to take shape.
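A short sketch of this "model inside a model" idea: a whole key-document model lives in a single column of a relational table and is queried with JSON functions. Here SQLite's JSON1 functions stand in for the Oracle/PostgreSQL JSON support mentioned above (assuming a Python build whose bundled SQLite includes JSON1, which modern builds do); the table and document shape are purely illustrative.

```python
import sqlite3

# Relational on the outside, key-document on the inside:
# the 'doc' column holds an entire nested document model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute(
    "INSERT INTO orders VALUES (1, ?)",
    ('{"customer": "Alice", "items": [{"sku": "A1", "qty": 2}]}',),
)

# JSON path expressions reach inside the document stored in one column.
customer = conn.execute(
    "SELECT json_extract(doc, '$.customer') FROM orders WHERE id = 1"
).fetchone()[0]
print(customer)  # Alice
```

The outer query stays ordinary SQL; only the path expression `$.customer` knows anything about the inner model, which is exactly the "fractal" quality described above.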
The value of data gradually became recognized by business, and the need for the right specialists grew. Database technologies began to be used not only for OLTP and OLAP workloads, but also for deep data research (searching for anomalies and correlations, applying statistical methods, etc.).
Amazon's AWS cloud service has been operating since 2006; according to its representatives, it already hosts over 20,000 private Data Lakes built inside the cloud.
But now the role of databases has grown even further. After all, data is generated by every smart device, and there are more and more of them: not only TVs and smartphones, but also toothbrushes and even kettles (IoT traffic, not to be confused with Index-Organized Tables!) ;)
True, as data volumes grow, more and more narrowly focused databases appear, specializing in particular types of information and models.
There are not that many relational DBMSs (compared to non-relational databases), but the architecture of each is unique, with its own pros and cons. It is also worth noting that no RDBMS in the world fully implements Edgar F. Codd's mathematical relational model, except one named Rel with its Tutorial D language. Why is that? Is it really the complexity of technically implementing relational theory? Of course not. In my opinion, it is simpler than that: business implicitly dictates its own conditions for how data is stored and processed. Let's recall some basic properties of relations in relational theory and compare them with real life.
1. “All tuples are unique”. This means we store exactly one fact about an event that happened in the real world. It also gives us the simplest unique key for a relation: the entire set of its attributes.
2. “The order of the rows is irrelevant”. Then why do we need such a concept as sorting rows at all, let alone by specific attributes, in ascending / descending order, or by a formula? Hmm, in real life this requirement is unacceptable!
3. “The order of the columns is irrelevant”. We can rearrange the columns (attributes) as we like, since R(A, B, C) = R(B, C, A).
4. “Each attribute has a unique name within a table”. Each relation attribute must have a unique name.
5. “Each column represents a single data element”. On the one hand, this is a question of normalization; on the other, of denormalization and speeding up the model.
6. “NULL value support”. NULL means unknown or inapplicable information, and in practice we can use NULL as the status of a value in our model.
Moreover, each RDBMS interprets NULL values differently. Try the command below in PostgreSQL (in MySQL, `||` is logical OR by default, so use CONCAT() or enable the PIPES_AS_CONCAT mode first):
SELECT 'Hello' || NULL || ' world!';
and the command for Oracle:
SELECT 'Hello' || NULL || ' world!' FROM dual;
It's just that Oracle treats NULL and the empty string '' as identical: in PostgreSQL the first query returns NULL, while in Oracle it returns 'Hello world!'. This very feature caused a lot of pain during codebase migrations from Oracle to PostgreSQL (especially if your project has data marts with materializations based on sorts in window functions, and the window functions themselves rely on attribute concatenations).
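A quick way to see the standard behavior that PostgreSQL follows is SQLite (via Python's sqlite3), which also propagates NULL through concatenation, unlike Oracle. The COALESCE call at the end is the usual defensive fix applied when migrating code from Oracle:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Standard SQL semantics: any concatenation with NULL collapses to NULL.
result = conn.execute("SELECT 'Hello' || NULL || ' world!'").fetchone()[0]
print(result)  # None, i.e. the whole expression is NULL

# Oracle-style behavior has to be requested explicitly,
# by coalescing the possibly-NULL operand to an empty string.
fixed = conn.execute(
    "SELECT 'Hello' || COALESCE(NULL, '') || ' world!'"
).fetchone()[0]
print(fixed)  # Hello world!
```

This is exactly why concatenation-heavy window functions break silently after an Oracle-to-PostgreSQL migration: every row where one operand is NULL turns the whole concatenated key into NULL.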
Each turn of the spiral in the IT industry is marked by the discovery of another model or implementation approach. So it was with relational databases; then object-oriented databases appeared on the scene; then NoSQL broke in with the slogan “Down with the rigid structures of relations!” I am sure this story will repeat itself over and over, depending on the challenges the IT community faces.
Nevertheless, note that four of the top five databases today are relational (according to Solid IT's DB-Engines ranking).
Why is this so? How, given its half-century history, can the RDBMS claim such a high place in the modern world? Is it the legacy of code and generated model structures that must be supported within relational databases? Or is it simply too expensive to migrate to more modern engines?
You might recall that relational databases have the ACID principles and fundamental mathematics behind them, while non-relational databases have BASE semantics, with “softer” conditions for functioning and modeling, plus a set of algorithms that support working with data. Semantics versus principles: sometimes you just want to live by principles, without data anomalies and in SERIALIZABLE mode.
But the truth, as usual, lies somewhere in between, and you can see it in various project architectures where a “zoo” (in the good sense of the word) of heterogeneous databases is used and integrated. This is especially true when building shared data warehouses (DWH + Data Lake) in large corporations.
How do we combine OLAP and OLTP and get full-fledged “diagonal” databases? Why not look towards deductive databases? Finally, why not fully involve the entire Data Science apparatus in optimizer engines and information processing?
A new, separate discipline, Data Engineering, gives a strong impetus not only to researching data itself, but also to the search for architecture / integration / validation approaches for a specific business domain. Standards for defining data / information / knowledge / wisdom are emerging, developing into a generalized logical perception of information across different logical layers of data (here I can refer to the DAMA-DMBOK book), supported by new engineering roles and a delineation of responsibilities within Data Engineering.
If you have interesting ideas about the future of databases, models, standards, and Data Engineering in general, let's discuss them in the comments.