Using hash values ​​with collision handling as surrogate keys in DWH directories


Introduction

It is well known that surrogate keys are used in data warehouses to link fact tables with reference books. In most cases, this is an integer counter that one-to-one determines the business key (or business key plus time dependency for slowly changing reference books). With an increase in the amount of information being processed, in the case of a large cardinality of directories, the use of counters as surrogate keys becomes a problem in terms of performance, because when loading facts, it is necessary to determine the value of the surrogate key using a fairly large reference book. To solve this problem, many companies are moving to the formation of surrogate values ​​based on the hash values ​​of business keys. This approach is described in great detail in the report by Evgeny Ermakov and Nikolai Grebenshchikov on the data model of the Yandex Taxi storage (Evgeny Ermakov, Nikolay Grebenshchikov – Highly Normalized Hybrid Model – YouTube)

However, when using hash values ​​as a surrogate key, collisions can occur when the hash function yields the same surrogate key hash value for different business key values. The number of collisions depends on the cardinality of the directory and the selected hash function, but remains in any case.

We faced this problem explicitly in one of the tender requests and later in a real project, as a result of which it became clear that it was great to develop some kind of template solution, especially since a quick search on the Internet did not give a ready-made solution.

Formulation of the problem

In the general case, there is an array of data at the input, which must be loaded into the target fact table, replacing business keys with surrogate ones for linking with directories.

In conditions where a significant part of the directory values ​​is formed on the basis of these facts, the formation of directories with surrogate keys based on hash values ​​can be carried out in parallel with the loading of facts. Many companies do just that, neglecting possible collisions as negligible.

Very often, the MD5 algorithm is used as a hash function, which gives a low probability of duplicate values, but at the output it produces a string of 32 characters, which is not optimal in terms of data storage and use as a link key. A small discussion on the experience of using these hashing algorithms can be found at the link: https://www.sql.ru/forum/1266277/hash-vs-surrogate-keys

Oracle 11.2g has a built-in hash function, ORA_HASH() , which produces an integer output but has a significant chance of getting duplicate values. If the problem with duplicates is solved, then it can be used to generate surrogate keys.

Suggested Solution

Directory formation

The general idea in the formation of a directory with surrogate keys based on a hash function with collision processing was that for most directory values, hash values ​​are used that take positive values, and for duplicates, a value counter in the negative area is used. Thus, the spaces of possible values ​​do not intersect, which eliminates duplicates.

To demonstrate the implementation of this idea, we need the following tables:

The SRC table is our input data stream, which in the simplest case contains only the business key bkeyfor which we want to calculate the surrogate key

The TMP table is a table with temporary data that includes only those business keys that are not in the target HUB directory within one download cycle. Those. this is the delta for which the surrogate key needs to be generated.

Table fields:

bkey – business key

h key – business key hash value

Hkey_qty – sequence number of the hash value for the business key within the delta

Skey – surrogate key for the double, calculated through a sequence/counter having a negative value.

The HUB table is our target directory, which in its simplest form contains the primary key Pk_bkey and business key bkey. The set of values ​​of the primary key is the union of the sets of values h key and Skey.

The Excl table is an exclusion table containing surrogate keys for duplicate business keys. Used when loading facts.

The diagram below shows the directory generation algorithm that implements the approach described above.

On the first step we clear our temporary table.

On the second step records are inserted from the SRC input stream into the TMP temporary table for those business keys that are not in the target HUB directory. When filling, the hash values ​​of the business key are calculated, and for those business keys that have duplicates on the input stream data, the exclusion key is calculated based on the sequence.

Request example:

INSERT INTO tmp
           (bkey, hkey, hkey_qty, skey)
    SELECT DISTINCT
           bkey,
    ORA_HASH(bkey) as hkey,
           DENSE_RANK() OVER (PARTITION BY ORA_HASH(bkey) ORDER BY bkey) as HKEY_QTY,
           CASE
	      WHEN DENSE_RANK() OVER (PARTITION BY ORA_HASH(bkey) ORDER BY bkey) > 1
		THEN -1* seq_excl.nextval
	    END as skey
      FROM src
     WHERE bkey IS NOT NULL
       AND not in (SELECT bkey FROM hub)

On the third step we check if there are duplicate hash values ​​for new entries in relation to those entries that are already in the HUB directory. For such records, an exclusion key is calculated based on the same sequence.

Request example:

UPDATE tmp
  SET tmp.hkey_qty = hkey_qty + 1,
      tmp.skey = -1* seq_excl.nextval
  WHERE tmp.skey is null
    and tmp.hkey in (select pk_bkey from hub)

As a result of the third step, hash values ​​were calculated for all new business keys in the TMP table, and exception key values ​​were generated for duplicates, and everything is ready for the final step – inserting records into target tables.

On the fourth step records are inserted into the target directory HUB and the exclusion table EXCL.

Request example:

INSERT INTO hub
            (pk_bkey, bkey)
     SELECT nvl(skey, hkey) as pk_bkey
          , bkey
       FROM tmp;
 
INSERT INTO EXCL
            (bkey, skey)
     SELECT bkey, skey 
       FROM tmp
      WHERE skey is not null;

Loading Facts

Now that the EXCL exception table has been formed, facts can be loaded from the source (let’s call it LOADF) with reference to the HUB directory without referring to this directory using the query:

INSERT INTO fact
            (pk_bkey, …)
     SELECT nvl(EXCL.skey, ORA_HASH(LOAD.bkey)) AS pk_bkey
            , …
       FROM LOADF 
         LEFT JOIN EXCL ON (EXCL.bkey = LOADF.bkey);

Application experience

The approach described above was tested in one of the projects to optimize the existing structure of the data warehouse based on the clicks (visits) of the online store.

Prior to optimization, visit strings were loaded into the data warehouse, parsed, and the parsing results were written directly to the fact table in the appropriate columns along with the original string.

For example, from a line like:

Column

Meaning

url

/katalog/bumaga-i-bumazhnye-izdeliya/bumaga-dlya-ofisnoj-tekhniki/formatnaya-bumaga/c/135000/?sort=price-asc&listingMode=GRID&sortingParam=price-asc&email=polydkinamargo%40mail.ru&sc_src=email_5359903&sc_lid=3214301= ipz4a29dvS&sc_llid=16237&sc_eh=53cbe15b802f05341&utm_source=mailing1-mail-ems-v4&utm_campaign=20220419_1709_mailing1-priyatnyye_syurprizy_20220418-b2c-op_z1-nbr-ntr-ntm-v4&utm_term=B2C&utm_medium=%D0%91%D1%83%D0%BC%D0%B0%D0 %B3%D0%B0+%D0%904+%D0%BF%D0%BE+%D1%81%D1%83%D0%BF%D0%B5%D1%80%D1%86%D0%B5%D0% BD%D0%B5&utm_content=Akcion_block

A line like this was formed:

Column

Meaning

url

/katalog/bumaga-i-bumazhnye-izdeliya/bumaga-dlya-ofisnoj-tekhniki/formatnaya-bumaga/c/135000/?sort=price-asc&listingMode=GRID&sortingParam=price-asc&email=polydkinamargo%40mail.ru&sc_src=email_5359903&sc_lid=3214301= ipz4a29dvS&sc_llid=16237&sc_eh=53cbe15b802f05341&utm_source=mailing1-mail-ems-v4&utm_campaign=20220419_1709_mailing1-priyatnyye_syurprizy_20220418-b2c-op_z1-nbr-ntr-ntm-v4&utm_term=B2C&utm_medium=%D0%91%D1%83%D0%BC%D0%B0%D0 %B3%D0%B0+%D0%904+%D0%BF%D0%BE+%D1%81%D1%83%D0%BF%D0%B5%D1%80%D1%86%D0%B5%D0% BD%D0%B5&utm_content=Action_block

cat_lvl_1

catalog

cat_lvl_2

bumaga-i-bumazhnye-izdeliya

cat_lvl_3

bumaga-dlya-ofisnoj-techniki

cat_lvl_4

formatnaya paper

utm_source

mailing1-mail-ems-v4

utm_campaign

20220419_1709_mailing1-priyatnyye_syurprizy_20220418-b2c-op_z1-nbr-ntr-ntm-v4

utm_term

B2C

utm_medium

A4 paper at a super price

utm_content

Action_block

As a result of the refactoring, it was proposed:

  • split the fact table by separating underused wide source fields into a separate table (in the example given, this is the URL field);

  • logically group the fields obtained as a result of parsing into directories with the generation of surrogate keys based on hash values ​​(in the example above, directories of the structure of the catalog of site pages – Cat_lvl_1, Cat_lvl_2, Cat_lvl_3, Cat_lvl_4 and a directory of marketing promotions – utm *) are highlighted.

As a result, the majority of analytical queries began to be executed on a much “narrower” fact table, and at the same time, due to the allocation of directories, the disk space occupied was reduced by 30%.

Taking into account the fact that non-column Oracle 11.2g was used as the DBMS for the storage, the gain was significant.

Here are some numbers:

No.

Type of transaction

Operation description

Operation time – Up to

Operation time – After

one

Loading data

Stage layer

25 min

69 min

2

Loading data

DDS layer

71 min

10 min

3

Loading data

Full data loading to targets

96 min

84 min

four

Execute sql query

Number of transactions per day

2731 sec

6 sec

5

Execute sql query

Number of transactions per calendar week

2864 sec

10 sec

6

Execute sql query

Running a scheduled report with a selection of data for a calendar week

980 sec

281 sec