An alternative approach to storing and developing SQL code in a repository

Hello current_user()!

In the world of database development, storing a codebase and keeping it up to date on different instances are tasks that many specialists face. One common approach to this problem is to use migration files, which contain changes to the database state and move it from one version to another. However, apart from this method, there are other migration management tools and techniques that can be more efficient and convenient in various scenarios.

In this article, we review existing approaches to storing and updating a database codebase and introduce a new method that may be an effective alternative. We'll discuss the main challenges and shortcomings of the standard migration approach, and then introduce a concept based on modern design and automation principles that can significantly improve the database change management process.

The goal of this article is to provide readers with new ideas and tools to improve their database change management process, allowing them to move forward with their projects more efficiently and securely.

Preface

This journey started by accident, when I was simply experimenting with a new approach. Recently, interest in this approach has grown within our company, and we decided to implement it. We are currently actively working on it and making many changes, as we have run into a considerable number of pitfalls.

In this article, we will focus solely on the concept of storing and developing the codebase. We won't offer any off-the-shelf tools here (except perhaps some pseudocode or bash to explain what's going on).

For ease of understanding, the article is divided into the following parts:

  • Preface

  • An example of how migration files are currently used

  • Formulation of the problem

  • Description of the concept structure

  • Description of the development concept

  • Usage example

  • Conclusion, notes, etc.

So, let's begin.


An example of how migration files are currently used

Every tool and approach looks a little different, but I'll try to describe a generalized version.

The idea here is very simple: we have the current version of the product's codebase, and we want to add a new feature. For this feature we need to add some columns, change some functions, and add some triggers.

All these manipulations are recorded in a migration file named $number+1.sql, where number is the number of the last migration in the migrations directory.

After this, you can apply the migration using whatever tool the project uses.

Example

Let's create a migration file:

# path to the migrations directory
MIGRATION_PATH="$HOME/git/project/sql/migrations"

# write number+1 here right away
NUMBER="101"

# the feature added the users table
echo "create table users(id int, name text);" >> $MIGRATION_PATH/$NUMBER.sql

# suppose an index on users.name was also created
echo "create index on users(name);" >> $MIGRATION_PATH/$NUMBER.sql

That's it, the migration is ready; all that remains is to feed it to your migration tool.

There are a great many tools for applying migrations. Here, for example, are some of them: flyway, liquibase, pg_mig, db-migrate, pg_codekeeper.

In general, the algorithm of such tools (except pg_codekeeper) looks like this (a rough sketch follows the list):

  1. Collect all migration files as input

  2. Check which of these migrations have already been applied to the database

  3. Apply the ones that have not been applied yet
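Here is a rough bash sketch of that algorithm. It is only an illustration, not any particular tool, and the schema_migrations tracking table is a hypothetical assumption:

# minimal sketch of a migration runner; assumes a hypothetical tracking table
# schema_migrations(number int) exists in the target database
DB_URL="postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$DB_NAME"
MIGRATION_PATH="$HOME/git/project/sql/migrations"

for file in $(ls "$MIGRATION_PATH"/*.sql | sort -V); do
  number=$(basename "$file" .sql)
  # has this migration already been applied?
  applied=$(psql -At -d "$DB_URL" \
    -c "select count(*) from schema_migrations where number = $number")
  if [ "$applied" = "0" ]; then
    # apply it and record that fact
    psql -v ON_ERROR_STOP=1 -d "$DB_URL" -f "$file"
    psql -d "$DB_URL" -c "insert into schema_migrations(number) values ($number)"
  fi
done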


Formulation of the problem

This section is not named entirely accurately.

It describes the characteristics of the development workflow that the concept described in this article is suited for.

Objective (criteria):


Description of the concept structure

I categorically disliked the fact that we had to create a separate migration file for each feature. During parallel development it often happened that two features changed the same function; as a result, two migrations were created that did not see each other's changes, and when the migrations were applied to production we could lose part of the functionality.

An example of such a situation

Let's say we had a function:

-- migration #100
create function myfunc(
  p_user_id bigint,
  p_amount numeric)
returns bool
as $$
begin
  return exists(
    select from users
    where
      id = p_user_id
      and amount > p_amount);
end;
$$ language plpgsql;

We are tasked with adding another check here, status = 'active', so in the first MR we create a migration file containing a function like this:

-- migration #101
create or replace function myfunc(
  p_user_id bigint,
  p_amount numeric)
returns bool
as $$
begin
  return exists(
    select from users
    where
      id = p_user_id
      and amount > p_amount
      and status = 'active');
end;
$$ language plpgsql;

In parallel with this task, we were given another one: the amount check should include a 100 USD reserve, i.e. amount > p_amount + 100. So in the second MR we create a migration file containing this function:

-- migration #102
create or replace function myfunc(
  p_user_id bigint,
  p_amount numeric)
returns bool
as $$
begin
  return exists(
    select from users
    where
      id = p_user_id
      and amount > p_amount + 100);
end;
$$ language plpgsql;

And here is the whole problem:

When the migration tool runs, it will first apply the first migration, then the second. As a result, the changes from the first task will be applied to the database and then immediately overwritten.

To avoid this, you have to re-check everything many times and write a pile of ad-hoc checks, or else forbid parallel development of the same database object (in this case, a function).

To get rid of such problems, you can try to store each database object in its own migration file (as in traditional programming). For example, the creation of the table `users` is stored in the file `public/table/users.sql`, and the function `myfunc` is stored in `public/function/myfunc.sql`.

We also need a tool that can apply such migrations not only when a new file appears in the list, but also when an existing file is modified. This is exactly what liquibase can do.

So what we arrive at is this: we take the code of all database objects and split it into separate migration files. Later, when these migration files change, a copy of the object is not created (as in the example above); the existing migration file is simply modified and re-applied.
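To make the "re-apply on change" idea concrete: liquibase, for instance, tracks checksums of changesets and can re-run a changeset whose content has changed (runOnChange). Below is a minimal bash sketch of the same idea; the migration_checksums tracking table (unique by filename) is a hypothetical assumption, not part of any real tool:

# re-apply a per-object file whenever its checksum changes;
# migration_checksums(filename text primary key, checksum text) is hypothetical
for file in $(find "$MIGRATION_PATH" -name '*.sql' | sort); do
  sum=$(md5sum "$file" | cut -d' ' -f1)
  stored=$(psql -At -d "$DB_URL" \
    -c "select checksum from migration_checksums where filename = '$file'")
  if [ "$stored" != "$sum" ]; then
    psql -v ON_ERROR_STOP=1 -d "$DB_URL" -f "$file"
    psql -d "$DB_URL" -c "insert into migration_checksums(filename, checksum)
                          values ('$file', '$sum')
                          on conflict (filename) do update set checksum = excluded.checksum"
  fi
done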

But this alone won't be enough. If we add, for example, a column to a table, re-running the updated script will not add it to an existing database. The so-called "additional" migrations will help us here; I will describe them a little later.

As a result, taking into account some improvements, our migration repository turns into something like this (an illustrative layout follows the list):

  • migrations/before_scripts/

    • Scripts that are executed before the main part is applied. For example, scripts that set up session variables

  • migrations/migrations_scripts/

    • Scripts that migrate the database structure for individual objects. For convenience, let's call them "additional" migrations

  • migrations/migrations/

    • The main part: per-object scripts containing the code of the database objects themselves (tables, functions, views, etc.)

  • migrations/data_scripts/

    • Scripts that contain not the database schema but table data (for example, a translation table, a table with configs, etc.)

  • migrations/after_scripts/

    • Scripts that are executed after all migrations have been applied. For example, updating configs, rebuilding some views, resetting pg_stat_statements statistics, etc.
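Putting it all together, a feature branch might contain something like this (the file names and contents are purely illustrative):

migrations/
  before_scripts/
    00_session_settings.sql      -- e.g. session variables, search_path
  migrations/
    public/table/users.sql       -- per-object code
    public/function/myfunc.sql
  migrations_scripts/
    0042_users_add_age.sql       -- "additional" migrations
  data_scripts/
    translations.sql             -- table data: translations, configs, etc.
  after_scripts/
    99_reset_statistics.sql      -- e.g. rebuild views, reset pg_stat_statements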

Description of the development concept

In this section I will try to describe everything that should happen to the MR and the database from the start of feature development until it is merged into the main branch (or deployed to production).

So, our entire development is divided into several stages:

  • Initializing feature development

  • Development

  • Merging the feature branch into the dev branch

Now let's look at each stage more specifically:

Note: below I will use some possibly unclear expressions, so I will give their explanations here:

Initializing feature development

Before programmers start working with the database, they need to create it. To do this, we make a structural copy of the database from the dev branch (I think the variables used here need no explanation):

psql -d postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$sourceDB -c "create database $targetDB"
pg_dump --format=c --schema-only -d postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$sourceDB \
  | pg_restore --format=c -d postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$targetDB

What does this code do?

  1. Creates the database $targetDB

  2. Dumps the structure of $sourceDB (our dev DB)

  3. Restores that structure into $targetDB

After this, all external cron jobs, services, etc. will need to be connected to this database. But that is DevOps territory and we won't cover it here; we are only interested in the database and the developer.

It is very convenient to do all this on the same cluster as the dev branch, because all shared libraries are already loaded, all roles are already created, and pg_hba is already configured.

Development

In this section, the developer will rejoice like no one else.

In this approach, the developer does not need to record their code anywhere separately (I mean migration files, MR comments, chats, etc.). All the developer does is work directly in the feature database (yes, there are nuances, but there are very few of them).

When developers need to merge the result of their work into the repository and send it for review, they perform the following actions:

  1. Create a dump of the feature database structure

  2. Parse this dump and place it into the required folders/files in the repository

  3. Commit and push

Yes, some programs can do all of this in one go, but pg_dump still inspires more trust 🙂

A small example

I’ll try to describe here a small example of what it looks like.

# first of all, dump the schema
pg_dump --file "dump_schema.sql" --format=p --schema-only -d postgresql://$1:$2@$3:$4/$5

# run the script (in this case in python) that parses all the objects and puts everything into the right files
./pg_schema_split.py dump_schema.sql $project_path/sql/migrations

So, we have the dump; now let's parse it.

A convenient way to parse it is to identify the DDL/DCL statements with some SQL parsing library and gradually distribute them into the necessary folders/files. An example of such a library: pg_query_go.

In the beginning, for testing, I wrote a simple home-grown parser in Python.
The thing is, pg_dump outputs scripts in the following form:

--
-- Name: users; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.users (
    id integer NOT NULL,
    name text
);


ALTER TABLE public.users OWNER TO postgres;

i.e. it prints a header for each block of code, and then the script itself in the body of that block. If you parse this carefully, you can quickly work out which object each block of code belongs to.
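As a rough illustration (not the actual parser), an awk sketch that cuts a plain-format dump into chunks at those headers could look like this; a real splitter would also map the Name/Type/Schema fields to paths like public/table/users.sql instead of numbered chunks:

# split dump_schema.sql into numbered chunks at every "-- Name: ...; Type: ..." header;
# lines before the first header (SET commands, etc.) are skipped
mkdir -p chunks
awk '
  /^-- Name: .*; Type: / {
    close(out); n++
    out = sprintf("chunks/%04d.sql", n)
  }
  n > 0 { print > out }
' dump_schema.sql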

After parsing, you still need to rework the scripts so that they can be run multiple times (for example, change create table to create table if not exists).
Please note that not all objects support if not exists / or replace; for some you have to write a wrapper like this:

DO $$
BEGIN
  IF NOT EXISTS ( <check for the object's existence> ) THEN
    <object creation script>
  END IF;
END
$$;
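For example, CREATE TYPE has no IF NOT EXISTS form, so the wrapper for a (purely hypothetical) enum type might look like this:

DO $$
BEGIN
  IF NOT EXISTS (SELECT 1
                 FROM pg_type t
                 JOIN pg_namespace n ON n.oid = t.typnamespace
                 WHERE t.typname = 'user_status' AND n.nspname = 'public') THEN
    CREATE TYPE public.user_status AS ENUM ('active', 'blocked');
  END IF;
END
$$;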

Examples of the output are shown below (they are old screenshots, produced back when I used the Python script).

If problems are discovered as a result of review or tests, you can easily edit everything directly in the database (just as during development) and merge it into the feature branch with the same script.

Once the developer has finished, the tests have passed, and the reviewer has approved everything, you can start preparing to merge into the dev branch.

Merging the feature branch into the dev branch

It is at this stage that those mysterious "additional" migrations get created.

An example of why such migrations are needed:

There was a table:

create table if not exists users(
  id int,
  name text
);

As a result of feature development, we added the field age int to this table, and our migration script for the table became the following:

create table if not exists users(
  id int,
  name text,
  age int
);

As we all understand, executing this code again will not create the new column on production.

We need a query that will create a column separately:

alter table if exists users add column if not exists age int;

Such scripts are easy to generate using various sql-diff utilities.

The sequence will be as follows:

  1. Create a new database from the dev branch (we already know the script, I won't duplicate it)

  2. Apply to it the main migrations obtained at the previous stage

  3. Compare the new database with the feature database using some sql-diff utility, and write the result to the "additional" migration file

Here is an example of using pg_codekeeper:

$run_pgcodekeeper $newDB_url $targetDB_url >> $migrationsPath/migration_scripts/$continueMigrationName.sql
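Assembled into one rough sketch (variable names mirror the earlier examples; the loop over per-object files is a simplification that ignores dependencies between objects and the before/after scripts):

# 1. create a fresh structural copy of the dev database ($sourceDB) as $newDB
psql -d postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$sourceDB -c "create database $newDB"
pg_dump --format=c --schema-only -d postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$sourceDB \
  | pg_restore --format=c -d postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$newDB

# 2. apply the main per-object migrations from the feature branch to the fresh copy
for file in $(find "$migrationsPath/migrations" -name '*.sql' | sort); do
  psql -v ON_ERROR_STOP=1 -d "$newDB_url" -f "$file"
done

# 3. diff the fresh copy against the feature database and store the result
#    as the "additional" migration
$run_pgcodekeeper $newDB_url $targetDB_url >> $migrationsPath/migration_scripts/$continueMigrationName.sql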

When the "additional" migration is created, all that's left is to check it one more time (automation is cool, but you don't want to take production down either), and after that you can close the MR.

As a result, the flowchart of the workflow looks something like this:


Usage example

I will attach screenshots here to make it a little clearer what is where and how.

The pgAdmin and sublime-merge programs will be shown here

We have a function and a table in the database:

We are given this task:

  1. Create a new table user_cards

  2. Add a new field count_cards to users

  3. Create a new function get_user_count_cards($user_id) that returns the number of cards a user has

  4. In the existing function my_func, change the result from a+a to a*a

Create a new table user_cards:

Well, it's simple here:

create table user_cards(
  id serial not null primary key,
  user_id int not null, -- an FK would be appropriate here, but I'm attaching old screenshots and don't want to change anything in them :)
  card_number text,
  other_data jsonb
);

Add a new field count_cards to users:

I don't think you even need to write the query here, everything is clear (yes, you can even create the column manually in pgAdmin).

Create a new function get_user_count_cards($user_id) that returns the number of cards a user has:
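The screenshot is not reproduced here, but the function might look roughly like this (the exact signature and types in the screenshot may differ):

create or replace function get_user_count_cards(p_user_id int)
returns bigint
as $$
begin
  -- count the user's cards in the new user_cards table
  return (select count(*) from user_cards where user_id = p_user_id);
end;
$$ language plpgsql;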

In the existing function my_func, change the result from a+a to a*a:
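Again, the screenshot is missing here; since the real signature of my_func is not shown in the text, this is only a hypothetical sketch of the change:

create or replace function my_func(a int)
returns int
as $$
begin
  return a * a;  -- previously: return a + a;
end;
$$ language plpgsql;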

So, the task is done; now we want to commit the database schema to the repository as migration files for each object. Let's run pg_dump plus the script that splits the SQL code into files.

Now let's check what has changed in the repository.

Changes in the file public/function/my_func.sql

Changes in the file with the users table

New file with the get_user_count_cards function

Migrations created for the user_cards table. Since it uses serial, a sequence has to be created and then attached, so everything ends up spread across several files.

The reviewer has checked our work and the task is done; now let's create the "additional" migration, since the field count_cards was added to the users table.

Let's run the scripts (described above) that do the work of creating such a migration. The output is a file:

"finishing" migration

“Additional” migration
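The file itself is shown in a screenshot that is not reproduced here; judging by the task, its content would be roughly a single statement like this (the column type depends on what was actually created in the feature database):

alter table if exists users add column if not exists count_cards int;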

If after this a change to the requirements "suddenly" arrives (well, it happens), all we need to do is remove the "additional" migration from the branch, do the development, and send the branch for review again, creating a new "additional" migration afterwards.

If an "additional" migration is not needed, you don't have to create one.

"Additional" migrations that have already been applied to all existing databases (I mean review databases and production databases) can be deleted; we won't need them anymore (the clutter has to be cleaned up).

Conclusion, notes, etc.:

This approach allows you to store the SQL code of database objects in separate files and avoid duplicating scripts.

All of this relies on the migration tool re-applying migrations that have been modified.

"Additional" migrations are created specifically for those cases where the per-object migrations cannot bring the database structure to the required version on their own (the column-adding example above).

All of this makes it possible to support concurrent development of the same objects, avoid code duplication, and gives a convenient review process and workflow for both the developer and the reviewer. True, you will have to pay for it with the nerves spent on setup, and you will also have to give up rollback migrations (in the standard approach both a forward migration and a rollback migration are created; such rollbacks are not supported in this architecture).


I want to emphasize once again that such a system is well suited specifically to projects where the database essentially is the backend, and where backend development mainly consists of modifying functions.

A logical question arises: why not just use pg_codekeeper, since it can already store objects as structured files and create migrations?

There are several answers here.
Firstly, if you generate one large migration as the output and save it to migration files or apply it to production, the concurrent-development problems arise again (as in the example at the beginning).
And if you store the object code in the repository and generate the migration only at deployment time, you end up having to review a great deal at once, which is quite inconvenient.

Secondly, its functionality is somewhat modest; you can only control it with include/exclude parameters.

The approach described here allows plenty of customization on top. For example: splitting the code into parts (modules), substituting parameters (for example, overriding the default owner), saving and transferring table data, managing deployments through contexts (in liquibase terms), and so on. I can't list all the goodies exactly yet, because we haven't managed to squeeze everything out of it, but what is supported now already makes me happy.
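As one small example of such a modification, overriding the default owner in the generated files can be done with a simple substitution during the split step (the owner names here are purely illustrative):

# illustrative only: rewrite the owner recorded by pg_dump before committing the files
find "$project_path/sql/migrations" -name '*.sql' \
  -exec sed -i 's/OWNER TO postgres;/OWNER TO app_owner;/g' {} +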

I'll say it once more: I just wanted to describe an approach to storing migrations that is convenient for developers and reviewers.

And I repeat once again that everything is still at the stage of testing, finishing features, and eliminating pitfalls.

Why did I write this article?

To show how development can be organized.

Perhaps someone in the comments will ask a couple of questions that give us food for thought, and we will be able to fix things at this early stage. Or maybe someone will want to implement something like this themselves and, after reading this article and its comments, will know whether it will work for them.

Thank you for your attention, good luck to everyone!
