A non-trivial approach, or how we discovered a bug in a domestic DBMS and successfully bypassed it

Three years ago, one of our customers, the largest Russian bank, needed to rebuild its payment service for GIS GMP (fines, duties and taxes) and GIS Housing and Utilities, as well as for charge requests (subscriptions). The choice on the market was limited, since we needed a product with a ready-made application software stack and a DBMS certified by the Federal Security Service (FSB) and FSTEC of Russia. After reviewing several options, we settled on a comprehensive open-source solution from the Russian companies ID Systems and Red Soft. Another point in its favor was that one of the bank’s departments already used a similar package, only with different SMEV adapters.

The bottleneck

At first the solution ran stably, and release updates did not take much time. The first warning bell during operation was the transaction counter in DBMS version 2.6.0.13370. We were running the 32-bit build, whose counter is limited to 2^32 - 1, and with customer charge requests numbering in the millions we could easily hit that ceiling, which would stop the software. Rather than leave it to chance, we decided to test resetting the counter. In Red Database (RDB) this is done with the following commands (a consolidated script sketch follows the list):

  1. Switch the database to read_only mode: ..rdb\bin\gfix -mode read_only -user login -password password {path_to_DB}.fdb

  2. Next, we perform a backup / restore with the commands:

    1. backup: ..rdb\bin\gbak -b -v -g -nod -user login -password password {path_to_DB}.fdb {path_to_backup_directory}.fbk

    2. restore: ..rdb\bin\gbak -c -v -n -user login -password password {path_to_DB_backup_copy}.fbk {path_to_new_DB}.fdb

  3. After that, we return the database to normal mode: ..rdb\bin\gfix -mode read_write -user login -password password {path_to_DB}.fdb
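
For repeated runs, the whole cycle is easier to script. Below is a minimal sketch in Python, assuming the RDB utilities sit in ..\rdb\bin; the paths, login and password are placeholders, not values from a real installation:

    # Sketch of the transaction-counter reset cycle:
    # read_only -> backup -> restore -> read_write.
    import subprocess

    RDB_BIN = r"..\rdb\bin"                  # assumed location of the RDB utilities
    DB      = r"C:\data\payments.fdb"        # hypothetical source database
    BACKUP  = r"C:\backup\payments.fbk"      # hypothetical backup file
    NEW_DB  = r"C:\data\payments_new.fdb"    # hypothetical restored database
    USER, PWD = "SYSDBA", "password"         # placeholders

    def run(*args):
        """Run an RDB command-line utility and stop on the first error."""
        subprocess.run(args, check=True)

    # 1. Switch the database to read_only mode
    run(rf"{RDB_BIN}\gfix.exe", "-mode", "read_only", "-user", USER, "-password", PWD, DB)
    # 2a. Backup the database
    run(rf"{RDB_BIN}\gbak.exe", "-b", "-v", "-g", "-nod", "-user", USER, "-password", PWD, DB, BACKUP)
    # 2b. Restore it into a new file
    run(rf"{RDB_BIN}\gbak.exe", "-c", "-v", "-n", "-user", USER, "-password", PWD, BACKUP, NEW_DB)
    # 3. Return the restored database to read_write mode
    run(rf"{RDB_BIN}\gfix.exe", "-mode", "read_write", "-user", USER, "-password", PWD, NEW_DB)

How the restored file then replaces the original depends on the local layout, so that step is left out of the sketch.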

After running this operation once, we saw no noticeable downtime, since the database was still small at the time. However, watching its volume grow month over month, we decided to act proactively: create an archive database and upgrade the DBMS to the 64-bit version.

Setting up the archive database posed no problems. In essence it is a routine transfer of data from the source database to the archive, followed by removal of the transferred data from the source (Fig. 1).

Fig.1. How queries work in a schema with an archive database
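
To illustrate the idea in Fig. 1, here is a minimal sketch of such a transfer in Python, assuming a Firebird-compatible driver (the fdb package) also talks to RDB; the REQUESTS table, its columns and the cut-off date are hypothetical:

    # Move "cold" rows from the production database to the archive database,
    # deleting them from the source only after the archive commit succeeds.
    from datetime import date
    import fdb

    CUTOFF = date(2020, 1, 1)   # hypothetical archiving threshold

    src = fdb.connect(dsn=r"prod-host:C:\data\payments.fdb",
                      user="SYSDBA", password="password")
    dst = fdb.connect(dsn=r"arch-host:C:\data\payments_archive.fdb",
                      user="SYSDBA", password="password")
    src_cur, dst_cur = src.cursor(), dst.cursor()

    # 1. Read the rows that are old enough to archive
    src_cur.execute("SELECT ID, PAYLOAD, CREATED FROM REQUESTS WHERE CREATED < ?", (CUTOFF,))
    rows = src_cur.fetchall()

    # 2. Copy them into the archive database and commit there first
    dst_cur.executemany("INSERT INTO REQUESTS (ID, PAYLOAD, CREATED) VALUES (?, ?, ?)", rows)
    dst.commit()

    # 3. Only then remove the transferred rows from the source
    src_cur.execute("DELETE FROM REQUESTS WHERE CREATED < ?", (CUTOFF,))
    src.commit()

    src.close()
    dst.close()

The order of commits is the important part of the pattern: the source rows are deleted only after the archive copy has been committed.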

But the task of updating the RDB DBMS from version 2.6.0.13370 to version 3.0.9-rc.4 held a “pleasant surprise” for us: instead of the three hours a financial organization can allow at most, the backup and restore took 14!

The longest downtime items were the backup of the version 2.6.0.13370 DBMS and the creation of the 3.0.9-rc.4 database from the restored copy. The point is that the second-generation (2.x) DBMS performs backup in a single thread, which drags out both the backup and the restore.

The execution time of these tasks varied with database size: for a 140 GB database the backup took about 4 hours and the restore about 7. Adding the rest of the software update process, the estimated downtime of the payment service came to at least 14 hours.

Of course, such downtime is a complete fiasco for a payment service, and it was unacceptable to the customer.

The way out: a joint approach

The bug was discovered by the head of the banking systems administration group at Jet Infosystems, who immediately reported it to the ID Systems engineers. To give them their due, they took it with understanding, actively worked on the vendor side to resolve the situation, and eliminated the bug together with us.

The specialists ran stress tests and audited the hardware and disk speeds, since the initial assumption was that slow disks were to blame. Testing, however, showed the system under almost no load, so the team decided to abandon the standard update method (Fig. 2).

Fig.2. Load testing

Since the RDB DBMS offers archive-core technology, which can be configured either on the same server or on a separate one and is able to pull data from the main database, the plan was as follows:

  1. Prepare a new server with the DBMS already updated to version 3.0.9-rc.4 and a fresh database. Deploy on it, in effect, a duplicate of the “combat” server, transferring in advance all application software (PPO) settings, including the SMEV adapters, task schedulers and CIPF.

  2. Schedule overnight work for the switchover, within the maintenance window allocated by the customer.

  3. Shut down the PPO on the main server and change the server’s name.

  4. Give the new server the name and IP address of the main one.

  5. Start the software on the new server (a quick post-switchover check is sketched right after this list).
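
Once the new server has taken over the old name and IP address, a quick sanity check can confirm that clients will reach it. A minimal sketch, with a hypothetical hostname and database path and the default Firebird/RDB port 3050:

    # Post-switchover sanity check: the old hostname must now lead to the new
    # server, and the database there must accept connections.
    import socket
    import fdb

    HOST = "payments-db"                 # the "combat" name inherited by the new server
    DB_PATH = r"C:\data\payments.fdb"    # hypothetical database path

    # 1. The RDB/Firebird listener must be reachable (port 3050 by default)
    with socket.create_connection((HOST, 3050), timeout=5):
        print("port 3050 is reachable")

    # 2. The database itself must accept a connection
    con = fdb.connect(dsn=f"{HOST}:{DB_PATH}", user="SYSDBA", password="password")
    print("database connection OK")
    con.close()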

During the test run the required functionality worked flawlessly, and the switchover downtime was only 35 minutes, which, of course, satisfied the customer. Bingo! So, together with the ID Systems engineers, we carried the plan out for real.

Moreover, we were able to do the rest of the work in the transition plan at a calm pace rather than in “scalded cat” mode:

  1. From the former “combat” database, transfer the remaining data to the archive database.

  2. Update the DBMS version on the old server and apply the changes to the archive database.

  3. Finally, transfer the archive database to the new “combat” server using the archive-core technology.

We understand that there are no ideal products, but in our case we got an “ideal team”: a system integrator plus a vendor who got involved quickly and managed to resolve the customer’s non-standard situation. Notably, the customer’s service kept running smoothly at every stage of the transition.
