Smart storage, or how we reduced Oracle database growth by 1.5 times

Today we'd like to discuss data compression in Oracle.

Situation: One of our large clients in the telecom industry is striving to grow its subscriber base and encourage service consumption and traffic growth. This allows the company as a whole, and each of its systems separately, to develop, including ours: the system of mutual settlements with partners, Partner Relationships Management.

Problem: the database grows steadily by 0.6 TB per month, which adds up to 7.2 TB per year. Yet only the data for the last two to three months is actively modified; the rest accumulates and is kept for reporting. With this approach the database grows very quickly, and every SSD in the storage array becomes worth its weight in gold. On top of that, consistency must be maintained between the backup and test databases.

There are two options: throw an endless stream of disks at the database, as if feeding a furnace, or optimize how the data is stored. We chose the latter.

In this article, Kirill Soldatov, chief software engineer for business systems, explains what exactly we did at Nexign. The information will be useful to anyone who, like us, needs to manage large data volumes in an Oracle database.

Intro: why we need to compress data

Every month, after customer charges are calculated for mobile, fixed-line, and interconnect services (the settlement system between operators), we run our own billing to determine remuneration for the dealers and agents who work with the operator. To do this, we download from the billing system the subscriber reference data, aggregated customer charges for the month, information on consumed services and activated packages, and data on subscriber registrations at points of sale.

Over 10 years of system operation, the amount of data could have become unacceptably large, more than 80 TB (7.2 TB per year over 10 years, plus technological data and calculation results). In fact, as of March 2024, we have 41 TB. To understand how the storage volume changed over the entire period, consider graph 1. It shows the amount of data for each month, in terabytes, from March 2019 to March 2024.

Graph 1. Statistics of space occupied in PRM database files

When we started collecting statistics, the database occupied 32 TB, and over five years this figure grew to 41 TB. That math does not match the 7.2 TB per year growth I stated earlier, and there are reasons for this:

  1. In 2019 and 2022, work was carried out to remove old, irrelevant data and optimize the occupied space. The graph dips during these periods.

  2. Compression of old data was also applied to subpartitions storing data older than 12 months; I will cover this later in the article.

Since data younger than 12 months has not yet been compressed, let's consider a smaller, more recent period covering the last year: from March 2023 to March 2024.

For 2023-2024 the occupied space grows linearly, and the subpartitions for this period have not yet been archived. From these data, the actual average monthly growth is 0.57 TB. This is comparable to the average total size of the subpartitions created each month over the last year: 0.414 TB.

Occupancy statistics for the period from March 2023 to March 2024, without compression

The remaining data (the difference up to 0.57 TB) is added incrementally to non-partitioned tables, so it is not included in the statistics in the next table. Subpartition size statistics for March 2023 – March 2024:

Month      Total size of subpartitions for the month, GB
2023.03    373.4631
2023.04    380.2969
2023.05    376.6112
2023.06    381.6728
2023.07    385.8993
2023.08    392.797
2023.09    417.98
2023.10    430.4785
2023.11    440.8479
2023.12    440.8199
2024.01    441.5043
2024.02    463.7789
2024.03    454.9307
Average    413.9
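For reference, figures like these can be pulled from the data dictionary. A minimal sketch, assuming a hypothetical schema name PRM_OWNER and subpartition names ending in YYYY_MM (your naming convention will differ); note that in dba_segments the partition_name column holds the subpartition name for subpartition segments:

select substr(s.partition_name, -7) as month_key,
       round(sum(s.bytes) / 1024 / 1024 / 1024, 4) as size_gb
  from dba_segments s
 where s.owner = 'PRM_OWNER'                 -- hypothetical schema
   and s.segment_type = 'TABLE SUBPARTITION' -- subpartition segments only
 group by substr(s.partition_name, -7)
 order by month_key;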

For comparison, consider the graph of growth in occupied space for the earlier period, from February 2020 to April 2022.

Statistics for February 2020 to April 2022, with active compression

During this period the old-data compression functionality was already in place. Thanks to it, the average monthly data growth with compression came to 0.38 TB, against 0.57 TB for the period before archiving was introduced. That is a 1.5x reduction in the monthly growth of occupied space, which confirms the effectiveness of the compression method applied to old subpartitions. Let's look at the Oracle partitioning and archiving functionality that helped us achieve this result.

What we did to stop the rapid growth of the database

The reduction in monthly growth was made possible by controlling compression at the subpartition level. Partitioning is a very convenient tool that allows you to divide tables into smaller parts, called partitions and subpartitions.

Each partition can have its own physical attributes: for example, compression enabled or disabled, the compression type, the physical storage settings, and the tablespace. These capabilities provide a framework for fine-tuning availability and performance. In addition, each partition can be managed individually, which simplifies administration and speeds up backups.

Partially or fully partitioned tables are accessed and managed by the same rules and operations as non-partitioned ones.

We have implemented LIST partitioning by a list of values and RANGE subpartitioning by event date, namely by month. In this article I will only talk about dividing subpartitions by month.
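As an illustration, here is a minimal sketch of such a composite LIST-RANGE table (the table, column, and partition names are hypothetical, not our actual schema):

create table prm_owner.subscriber_charges (
    service_type  varchar2(10) not null,  -- LIST partition key
    event_date    date         not null,  -- RANGE subpartition key
    subscriber_id number       not null,
    charge_amount number(14,2)
)
partition by list (service_type)
subpartition by range (event_date)
subpartition template (
    subpartition sp_2024_02 values less than (date '2024-03-01'),
    subpartition sp_2024_03 values less than (date '2024-04-01')
)
(
    partition p_mobile values ('MOBILE'),
    partition p_fixed  values ('FIXED')
);

Each month a new subpartition is added for the new period, and it later becomes the unit of compression.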

Now let's talk about compression

A question arises: if compressed tables are no different to work with than regular ones, why not apply compression to the entire table at once? Why bother with partitioning?
To answer that, let's look at how data compression works in Oracle.

Table compression works by removing duplicate values from database tables. Compression is performed at the block level: when a table is defined as compressed, the server reserves space in each block to store a single copy of each value that occurs in multiple places. This reserved area is called the symbol table.

Information marked for compression is stored only in the symbol table, not in the data rows. When data marked for compression appears in a row, a pointer to the corresponding entry in the symbol table is stored instead of the data itself. The space saving comes from removing the redundant copies from the table.

Compression reduces disk usage, lowers the amount of memory the database buffer cache consumes, and can improve query speed when reading data.

There are three types of compression:

  1. Basic Table Compression – a high compression level with low CPU overhead. Compression happens only during direct-path loads (insert /*+ APPEND */ into … select …, create table … as select …, or SQL*Loader with the DIRECT=TRUE directive); see the sketch after this list. Updating basic-compressed rows decompresses them immediately and returns them to storage in ordinary row format, so this option is not optimal if the compressed data will undergo frequent, massive changes.

  2. Advanced Row Compression – a high compression level with low CPU overhead that supports on-the-fly compression for OLTP operations (for example, insert into x values …). This works by deferring compression until a block holds enough data to make it worth compressing, so it can compress rows that were inserted into the same block separately, in different sessions and at different times. The compression ratio is comparable to or slightly worse than basic compression. The main drawback is that the Advanced Compression Option is licensed from Oracle for an additional fee.

  3. Hybrid Columnar Compression (HCC) – this option is divided into Warehouse compression and Archive compression, for which the compression level and CPU cost depend on the chosen LOW or HIGH setting.
    HCC can compress a table 10-50 times! This result is achieved because data in HCC tables is stored organized by columns rather than by rows, which optimizes compression and minimizes duplicates.

But this conversion of the storage format has a negative impact on performance, so this option does not suit us.
Archive compression, as the name suggests, is ideal for compressing tables with old data that is accessed infrequently.
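To make the direct-path requirement from point 1 concrete, here is a minimal sketch of basic compression with a direct-path load (the table names are hypothetical):

-- empty compressed copy of a source table
create table prm_owner.charges_archive compress basic
as select * from prm_owner.charges where 1 = 0;

-- direct-path insert: rows are compressed as they are loaded
insert /*+ APPEND */ into prm_owner.charges_archive
select * from prm_owner.charges
 where event_date < add_months(trunc(sysdate, 'MM'), -12);
commit;

A plain insert … values … into the same table would store its rows uncompressed.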

For the Basic, Warehouse, and Archive compression types, data is compressed when compression is first enabled for a subpartition and during subsequent inserts (changes), but only for bulk (direct-path) or array inserts.

In the databases of the Nexign Partner Relationships Management system we use Basic compression; this choice is dictated by how we work with the data. At the beginning of each month, subpartitions are created to store data for the new period. After billing (for mobile subscribers and fixed-line clients), the newly created subpartitions are filled with data in one pass and stored for a year, after which compression is applied to them.
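Cutting a subpartition for a new month boils down to a command like this (a sketch using the same hypothetical names as above):

alter table prm_owner.subscriber_charges
  modify partition p_mobile
  add subpartition sp_2024_04 values less than (date '2024-05-01');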

Over those twelve months, as the system operates and internal corporate requests are analyzed, individual records can be changed by employees, but we rarely have to update older data.

Data is loaded into each subpartition in bulk, once, in the first days of the month, after which it rarely changes. In the absence of numerous and constant DML operations (insert, update, merge, which basic compression does not compress on the fly), basic compression is the most suitable option: Advanced requires extra licensing costs, and Hybrid requires more hardware and execution time.

How data compression works in Nexign

As I noted earlier, we do not compress new data immediately as it appears, but only after a year of use. I set up automatic launches of the compression processes in the middle of each month, when the load from the monthly calculations has already passed (and the appetite for adventure has not yet returned). The algorithm looks like this (a PL/SQL sketch follows the list):

  1. Based on a pre-prepared list of tables, a cursor is built that selects subpartitions storing data older than 12 months.

  2. In a loop over the cursor's result set, bitmap indexes are disabled, if the table has any.

  3. The subpartition is compressed with several parallel workers (8, if anyone is interested).

  4. The indexes for the processed subpartition are rebuilt.

  5. The global indexes of the entire table are rebuilt.
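A simplified PL/SQL sketch of this job (the table list, the SP_YYYY_MM subpartition naming convention, and the schema are hypothetical; our production code is more involved and also handles bitmap indexes, global indexes, and logging):

declare
  cursor c_subparts is
    select table_owner, table_name, subpartition_name
      from dba_tab_subpartitions
     where table_name in ('CHARGES', 'SERVICES')  -- pre-prepared table list
       and compression = 'DISABLED'
       and subpartition_name <
           'SP_' || to_char(add_months(sysdate, -12), 'YYYY_MM');
begin
  for r in c_subparts loop
    -- compress the subpartition with 8 parallel workers
    execute immediate 'alter table ' || r.table_owner || '.' || r.table_name ||
                      ' move subpartition ' || r.subpartition_name ||
                      ' compress parallel 8';
    -- rebuild local index subpartitions invalidated by the move
    for i in (select index_owner, index_name
                from dba_ind_subpartitions
               where subpartition_name = r.subpartition_name
                 and status = 'UNUSABLE') loop
      execute immediate 'alter index ' || i.index_owner || '.' || i.index_name ||
                        ' rebuild subpartition ' || r.subpartition_name ||
                        ' parallel 8';
    end loop;
  end loop;
end;
/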

Some overview statistics: in our case, analyzing and compressing a subpartition holding 78 million rows (15 GB of data) takes Oracle about 2 minutes with 8 parallel workers. In total, we process up to 1,400 subpartitions of varying fullness and rebuild the same number of indexes.

Because the processes run in 8 threads with 8 parallel workers each, the work fits into the maintenance window of 2 hours at night. In other words, it goes unnoticed by business users.
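A launch like this can be scheduled inside the database itself. A minimal sketch using DBMS_SCHEDULER (the job name and the stored procedure it calls are hypothetical):

begin
  dbms_scheduler.create_job(
    job_name        => 'COMPRESS_OLD_SUBPARTITIONS',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'PRM_OWNER.PKG_ARCHIVE.COMPRESS_OLD',  -- hypothetical
    repeat_interval => 'FREQ=MONTHLY;BYMONTHDAY=15;BYHOUR=1', -- mid-month, at night
    enabled         => true);
end;
/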

Compression of partitions and subpartitions

Data compression can be applied both to a table as a whole and to its individual partitions and subpartitions. Everything is done with a simple command (depending on the type of partitioning):

alter table TABLE_OWNER.TABLE_NAME compress;

alter table TABLE_OWNER.TABLE_NAME move partition PARTITION_NAME compress;

alter table TABLE_OWNER.TABLE_NAME move subpartition SUBPARTITION_NAME compress;

For optimal use of database resources, you can add the parallelism clause parallel N. You may also need to move old data to a separate tablespace; in that case we add tablespace TABLESPACE_NAME to the move command. Such an archive tablespace can, for example, be physically placed on separate, cheaper storage disks.
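Put together, archiving a single subpartition might look like this (ARCH_TS is a hypothetical archive tablespace name):

alter table TABLE_OWNER.TABLE_NAME move subpartition SUBPARTITION_NAME compress tablespace ARCH_TS parallel 8;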

Index compression

The situation is similar for indexes: compression can be applied to an entire index or to its parts, regardless of whether the table itself is compressed.

alter index INDEX_OWNER.INDEX_NAME rebuild compress;

alter index INDEX_OWNER.INDEX_NAME rebuild partition PARTITION_NAME compress;

alter index INDEX_OWNER.INDEX_NAME rebuild subpartition SUBPARTITION_NAME compress;

It is important to note a peculiarity of working with bitmap indexes: before compressing, they must be marked unusable, and after compressing, rebuilt.

alter index INDEX_OWNER.INDEX_NAME unusable;

alter index INDEX_OWNER.INDEX_NAME rebuild;

This approach works if the index is a single, indivisible whole. If the index is partitioned, each of its smallest parts must be rebuilt individually.

alter index INDEX_OWNER.INDEX_NAME rebuild partition PARTITION_NAME;

alter index INDEX_OWNER.INDEX_NAME rebuild subpartition SUBPARTITION_NAME;
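The parts that need rebuilding can be found in the data dictionary. A sketch (dba_ind_subpartitions works the same way for subpartitions):

select index_owner, index_name, partition_name
  from dba_ind_partitions
 where status = 'UNUSABLE';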

Oracle also allows you to compress individual columns in compound indexes. But that's a completely different story.

Let's sum it up

Different products and different database use cases call for their own compression type: Basic Table Compression, Advanced Row Compression, or Hybrid Columnar Compression (HCC). Archiving is especially convenient in combination with dividing tables into partitions and subpartitions.

The archiving functionality in Oracle lets you significantly optimize stored data volumes. In our case, we reduced the monthly growth by 1.5 times: from 0.6 TB to 0.38 TB. For a 40 TB database (and we have three of them), I think this is a good result, and it saves on purchasing additional storage systems.

I hope my experience will help optimize data storage in your systems too!
