Taming Liquibase. How do you feed a list of 10K offices to the database without resorting to hacks later?


2. Generating a CSV file from Excel

Let’s start with a little theory. To migrate data from files, Liquibase needs the file to be in CSV format.

The data in the file is split into rows, just like in a table. Values within a row are separated by commas, and the first line contains the column names.
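For illustration, a minimal file in this format might look like this (the rows are made up; note how a value containing the separator is wrapped in quotes):

```csv
city,index,address
Ulyanovsk,432017,"Goncharova st., 1"
Moscow,101000,"Tverskaya st., 5"
```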

The Liquibase documentation says that imported data must be supplied as CSV in UTF-8 encoding. You can, of course, specify another encoding with the encoding parameter, but why bother when Excel has a built-in “CSV UTF-8” export? It is perfect for our purposes.

In Excel you will not notice any difference, but if you open the file in a text editor, the difference in encoding becomes visible.


3. Data normalization

After converting our list to CSV, we also need to put the data in order so that the list is convenient to use and modify. To do this, several problems must be solved.

3.1 Selecting the primary key

First, you need to choose a field that will become the primary key. Our main task here is to make the list easy to update. The main problem is that there are almost certainly several environments already; at the very minimum, dev and prod. If updating the list requires manually changing something in the database, that is, first, very inefficient, and second, obtaining the necessary access (if it is possible at all) takes a long time. And you should always leave yourself a way to update the list. The client may say that this list will definitely never change, or that it is an intermediate list that will soon be removed, but in business anything can change at the last moment.

To be able to update the list, there are two options:

  1. More preferable – using a unique field. The list already has a field that is unique for each element. Unfortunately, when the list consists of 1000+ elements, this is rarely the case. In my task, each office had a number, but there were complications: either two offices shared the same number but had different addresses, or one office had the same address but different additional parameters, producing several almost identical rows. With 1000+ offices it is impossible to figure out on site how everything is organized, and the team often has no way to find out why. So we turn to the second option.

  2. More complex – using an id. When no single column can uniquely identify an element, there is the following solution. In theory, a composite primary key could be used, but practice has shown that in huge lists anything can change, and if one of the fields in the table changes, a reference somewhere will break. Therefore, the best option is to add a new column to the file that will contain an id. It is important that the ids are pre-set in the file rather than generated when the data is loaded into the table. Ideally, make the id a numeric value starting, for example, from 0. That way all environments will have identical references, which lets you update the list simply by reloading the table with a new file. It also makes it easy to communicate within the team about what exactly needs to be updated and where: it is much easier to talk about element 500 than about a huge UUID or a chunk of one. Besides, since elements are only added manually, their count will almost never exceed the capacity of an int.

3.2 Converting data to match the column types

Next, you need to look through the table and check that the values match the column types. For example, the Main office field is a Boolean, which Liquibase stores as “t” and “f”, while the CSV contains “yes” and “no”, so these values need to be converted.

3.3 Data filtering

Since the list is huge, it is bound to contain errors, for example, stray commas that break the CSV format. In that case, instead of Excel you can use another converter and replace the comma separator with, say, a semicolon. But that is not all that can go wrong, so the best way to filter out errors is to run the migration in your local environment, load the data into the database, and find and fix the errors Liquibase complains about. The easiest way to do this is with grep on Linux; on Windows, use WSL or a Docker container with a mounted folder. But that’s a topic for another article 🙂
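In the spirit of the grep approach mentioned above, a quick field-count check can be sketched with awk. The file name and its contents here are made up, and the sketch assumes a simple CSV without quoted separators:

```shell
# A sketch: find CSV rows whose field count differs from the header's.
# Assumes no commas inside quoted values; offices.csv is a made-up file.
cat > offices.csv <<'EOF'
city,index,address
Ulyanovsk,432017,Goncharova st. 1
Moscow,101000,Tverskaya st. 5,extra
EOF

# Print the line number and content of every row that breaks the format
awk -F',' 'NR == 1 { n = NF } NF != n { print NR ": " $0 }' offices.csv
# → 3: Moscow,101000,Tverskaya st. 5,extra
```

For real-world files with quoted values, a proper CSV parser is safer, but for spotting an obviously broken line this one-liner is usually enough.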

As a result, we get one of two variants of tables:

  1. A table where the primary key is the City field

  2. A table where the primary key is the id field that we added ourselves


4. Project setup

I will not describe the standard instructions for adding Hibernate and Liquibase here; I just want to focus on one Hibernate setting in the YAML config called ddl-auto. For all migrations to work properly, this setting must be set to validate.

spring:
  jpa:
    hibernate:
      ddl-auto: validate


5. Creating the table

Now that we have decided on the primary key, the rest is a matter of technique: writing a change set that creates the table. For convenience, the example here uses the version with the added id field.
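The create-table change set itself was shown in the original article as a screenshot; a minimal sketch of what it might look like is below. The changeSet id and the column types are my assumptions, chosen to match the loadData column mapping used later:

```yaml
databaseChangeLog:
  - changeSet:
      id: 001_create_simbirsoft_offices_table   # assumed id
      author: TreatHunter
      changes:
        - createTable:
            tableName: simbirsoft_offices
            columns:
              - column:
                  name: id
                  type: int              # pre-set numeric ids from the CSV
                  constraints:
                    primaryKey: true
                    nullable: false
              - column:
                  name: city
                  type: varchar(255)
              - column:
                  name: index            # postal code
                  type: varchar(10)
              - column:
                  name: address
                  type: varchar(255)
              - column:
                  name: is_head          # the Main office boolean
                  type: boolean
```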

5.1 Loading data into a table

Project structure for storing different versions of the list

Since Liquibase is essentially a Git for the database structure, CSV files will accumulate, and they must somehow be stored in the project structure. I could not find a generally accepted layout, so I will show an example of how I do it.

By default, Liquibase looks for migrations in the src/main/resources/db/changelog path. I organize the contents of this folder like this:
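The layout itself was shown as a screenshot; reconstructed from the file paths used in the change sets in this article, it looks roughly like this (the master changelog name is an assumption):

```
src/main/resources/db/changelog/
├── db.changelog-master.yaml                  (assumed name)
├── migrations/
│   ├── db.changelog-0.2-importToSimbirsoftOfficesTable.yaml
│   └── db.changelog-0.4-updateOfficesWithLoadUpdateData.yaml
└── csv/
    ├── changelog-0.2-importToSimbirsoftOfficesTable/
    │   └── Список офисов Simbirsoft с добавленным основным ключом.csv
    └── changelog-0.4-updateOfficesWithLoadUpdateData/
        └── Список офисов Simbirsoft с добавленным основным ключом v2.csv
```

Keeping each CSV in a folder named after its change set makes it obvious which migration a given version of the list belongs to.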

Description of the loadData command

databaseChangeLog:
  - logicalFilePath: db/changelog/migrations/db.changelog-0.2-importToSimbirsoftOfficesTable.yaml
  - changeSet:
      id: 002_import_csv_to_simbirsoft_offices_table
      author: TreatHunter
      changes:
        - loadData:
            columns:
              - column:
                  index: 4 # the index is the ordinal number of the CSV column
                  name: id # the name of the table column in the database
              - column:
                  header: Город # instead of an index, the CSV column header can be used
                  name: city
              - column:
                  index: 1
                  name: index
              - column:
                  index: 2
                  name: address
              - column:
                  index: 3
                  name: is_head
            tableName: simbirsoft_offices
            encoding: UTF-8 # encoding of the CSV file; can be omitted if it is UTF-8
            separator: ',' # the character separating columns in the file; can be omitted if it is a comma
            quotchar: '"' # the quote character used to escape a value that contains the separator; defaults to "
            file: "db/changelog/csv/changelog-0.2-importToSimbirsoftOfficesTable/Список офисов Simbirsoft с добавленным основным ключом.csv"

It is worth adding that the order of the column elements does not matter: you can list them in CSV order, table order, or any other order.

The column element also has a third parameter, type. It specifies the format of the data in the CSV. The Liquibase library is usually smart enough to figure this out on its own, so you do not have to write it; the only cases it may not handle are dates and times, where you should check the documentation for the exact type to set. I mention this parameter because it has an interesting value, SKIP, which skips the values from that CSV column and places null in the table.
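As an illustration, a hypothetical column mapping that ignores an extra CSV column might look like this (the index and the names are made up):

```yaml
              - column:
                  index: 5    # an assumed extra CSV column we do not want to import
                  name: notes # hypothetical table column
                  type: SKIP  # values from this CSV column are skipped
```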

I did not use every parameter listed in the Liquibase documentation, only those that cover 80% of situations. The encoding, separator, and quotchar parameters turned out not to be needed in my case.

After the migration runs, Liquibase logs the result of the import.


6. How to update data using loadUpdateData. Examples

Now let’s talk about updating the list. Imagine the client comes to us and says there is an error in the directory and the list needs to be updated. Since we prepared the CSV up front, we simply send it over and ask the client to correct the file, explaining the CSV structure if necessary. When we receive the corrected file and try to write another change set with the loadData method, the project fails to start with an error: the new data cannot be inserted, because the table already contains records with these primary keys. Liquibase will also not let you delete all records in the table with the delete method, complaining that other tables reference this one. Assessing the situation, we understand two things:

  1. Liquibase is great at keeping us from breaking anything.

  2. We need to find another method.

That method is loadUpdateData. In my example, I added 50 to every house number.

databaseChangeLog:
  - logicalFilePath: db/changelog/migrations/db.changelog-0.4-updateOfficesWithLoadUpdateData.yaml
  - changeSet:
      id: 004_update_simbirsoft_offices_table_with_LoadUpdateData
      author: TreatHunter
      changes:
        - loadUpdateData:
            columns:
              - column:
                  index: 4
                  name: id
              - column:
                  header: Город
                  name: city
              - column:
                  index: 1
                  name: index
              - column:
                  index: 2
                  name: address
              - column:
                  index: 3
                  name: is_head
            tableName: simbirsoft_offices
            primaryKey: id # the name of the table column (not the CSV column) used to check whether a given row already exists
            file: "db/changelog/csv/changelog-0.4-updateOfficesWithLoadUpdateData/Список офисов Simbirsoft с добавленным основным ключом v2.csv"

This method is almost the same as loadData, but you must also specify a table column by which Liquibase determines which rows to replace; if no match is found, it simply inserts a new row. Logically, that column should be the primary key.

Here is a link to the official documentation of this method, but it is so poor that the parameter list says nothing at all about primaryKey, and the example there was also incorrect at the time of writing this article.

Here’s what the data in the table will look like after a successful migration.

For those who did not get enough from the article and want to try everything hands-on, I suggest heading to GitHub, where I have put a project that contains all the examples plus some simple logic, so you can play around and see how everything works in practice. You can follow the link here or here.


Conclusion

After writing the first version of this article, I realized I had ended up with dry documentation that did not cover all the important details I, as a developer, had to deal with on the project. So I turned to the experience of running my small YouTube channel, where I recorded step-by-step guides for friends and classmates on setting up and running Minecraft servers.

As a result, this tutorial ended up including all the steps for creating and maintaining large lists in the database, so that you understand not just the concept in a vacuum but also the problems that have to be solved along the way, together with a set of solutions I consider good. If this tutorial let you solve a similar problem in your project without googling anything else, then I have achieved my goal.

Thank you for your attention!

We also publish materials for backend developers on our social networks: VK and Telegram.
