Using hash values with collision handling as surrogate keys for DWH dimension tables
Introduction
Surrogate keys are commonly used in data warehouses to link fact tables with dimension (reference) tables. In most cases the surrogate key is an integer counter that maps one-to-one to the business key (or to the business key plus a validity period for slowly changing dimensions). As data volumes grow, high-cardinality dimensions make counters a performance problem: every fact load must resolve surrogate keys by looking them up in a large dimension table. To avoid this lookup, many companies derive surrogate keys directly from hash values of business keys. This approach is described in detail in the talk by Evgeny Ermakov and Nikolay Grebenshchikov on the Yandex Taxi warehouse data model (Evgeny Ermakov, Nikolay Grebenshchikov – Highly Normalized Hybrid Model – YouTube).
However, when hash values are used as surrogate keys, collisions can occur: the hash function yields the same hash value for different business key values. The number of collisions depends on the cardinality of the dimension and on the chosen hash function, but it is never zero.
We ran into this problem explicitly in one of our tender requests, and later in a real project. It became clear that a reusable template solution was worth developing, especially since a quick search on the Internet did not turn up a ready-made one.
Formulation of the problem
In the general case, the input is an array of data that must be loaded into the target fact table, with business keys replaced by surrogate keys for linking to dimension tables.

When a significant share of the dimension values originates from the facts themselves, dimension tables with hash-based surrogate keys can be populated in parallel with fact loading. Many companies do exactly that, dismissing possible collisions as negligible.
Very often the MD5 algorithm is used as the hash function. It gives a low probability of duplicate values, but its output is a 32-character string, which is suboptimal both for storage and for use as a join key. A short discussion of practical experience with these hashing algorithms can be found at: https://www.sql.ru/forum/1266277/hash-vs-surrogate-keys
Oracle 11g Release 2 has a built-in hash function, ORA_HASH(), which produces an integer (by default a value in the range 0..2^32-1) but carries a significant chance of duplicate values at high cardinality. If the duplicate problem is solved, it can be used to generate surrogate keys.
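For intuition about why duplicates are "significant" for a 32-bit hash, the birthday approximation can be used to estimate the probability of at least one collision among n distinct keys (a back-of-the-envelope sketch, not part of the original project):

```python
import math

def collision_probability(n, bits=32):
    """Approximate probability of at least one collision among n distinct
    keys for a uniform hash with 2**bits buckets (birthday approximation:
    p ~ 1 - exp(-n*(n-1) / (2 * 2**bits))."""
    buckets = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2 * buckets))
```

For a 32-bit hash such as ORA_HASH with its default bucket range, the collision probability already reaches roughly 50% at about 77,000 distinct business keys, and is practically certain at a million keys, so collision handling cannot be skipped for large dimensions.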
Suggested Solution
Dimension table formation
The general idea behind forming a dimension table with hash-based surrogate keys and collision handling is this: for most dimension values, the (non-negative) hash value is used directly, while duplicates receive a value from a counter that runs in the negative range. The two value spaces therefore never intersect, which eliminates duplicates.
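The core idea can be shown as a minimal in-memory sketch (hypothetical illustration only: `zlib.crc32` stands in for ORA_HASH and a plain Python counter for the Oracle sequence):

```python
import zlib

def build_surrogate_keys(business_keys, hash_fn=None):
    """Assign surrogate keys with collision handling: the first business key
    seen for a given hash value keeps the non-negative hash itself; every
    later key that collides on that hash gets the next negative counter
    value, so the two key spaces never intersect."""
    if hash_fn is None:
        hash_fn = lambda b: zlib.crc32(b.encode())
    hash_owner = {}   # hash value -> business key that "owns" it
    surrogate = {}    # business key -> surrogate key
    next_excl = -1    # negative counter standing in for a sequence
    for bkey in business_keys:
        if bkey in surrogate:
            continue  # key already resolved earlier in the stream
        h = hash_fn(bkey)
        if h not in hash_owner:
            hash_owner[h] = bkey
            surrogate[bkey] = h
        else:
            surrogate[bkey] = next_excl
            next_excl -= 1
    return surrogate
```

Passing a constant `hash_fn` (e.g. `lambda b: 7`) forces every key to collide and shows the negative-counter path; with a real hash, almost all keys keep their hash value.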
To demonstrate the implementation of this idea, we need the following tables:
- SRC is the input data stream, which in the simplest case contains only the business key bkey for which the surrogate key must be calculated.
- TMP is a temporary table that, within one load cycle, holds only those business keys that are not yet in the target HUB table, i.e. the delta for which surrogate keys must be generated. Its fields are: bkey, the business key; hkey, the hash value of the business key; hkey_qty, the ordinal number of the business key among those sharing the same hash value within the delta; skey, the surrogate key for a duplicate, drawn from a sequence/counter and stored as a negative value.
- HUB is the target dimension table, which in its simplest form contains the primary key pk_bkey and the business key bkey. The set of primary key values is the union of the sets of hkey and skey values.
- EXCL is an exception table containing the surrogate keys of duplicate business keys; it is used when loading facts.
The diagram below shows the dimension-table load algorithm that implements the approach described above.

On the first step, we clear the temporary table.
On the second step, records from the SRC input stream are inserted into the TMP temporary table for those business keys that are not yet in the target HUB table. During this insert the hash values of the business keys are calculated, and for business keys whose hash values collide within the input stream, an exception key is drawn from the sequence.
Example query:
INSERT INTO tmp
  (bkey, hkey, hkey_qty, skey)
SELECT DISTINCT
       bkey,
       ORA_HASH(bkey) AS hkey,
       DENSE_RANK() OVER (PARTITION BY ORA_HASH(bkey) ORDER BY bkey) AS hkey_qty,
       CASE
         WHEN DENSE_RANK() OVER (PARTITION BY ORA_HASH(bkey) ORDER BY bkey) > 1
         THEN -1 * seq_excl.nextval
       END AS skey
FROM src
WHERE bkey IS NOT NULL
  AND bkey NOT IN (SELECT bkey FROM hub)
On the third step, we check whether any of the new records have hash values that duplicate primary keys already present in the HUB table. For such records, an exception key is generated from the same sequence.
Example query:
UPDATE tmp
SET hkey_qty = hkey_qty + 1,
    skey = -1 * seq_excl.nextval
WHERE skey IS NULL
  AND hkey IN (SELECT pk_bkey FROM hub)
As a result of the third step, hash values have been calculated for all new business keys in the TMP table and exception keys have been generated for all duplicates, so everything is ready for the final step: inserting records into the target tables.
On the fourth step, records are inserted into the target HUB table and the EXCL exception table.
Example query:
INSERT INTO hub
  (pk_bkey, bkey)
SELECT NVL(skey, hkey) AS pk_bkey,
       bkey
FROM tmp;

INSERT INTO excl
  (bkey, skey)
SELECT bkey, skey
FROM tmp
WHERE skey IS NOT NULL;
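The four steps can be sketched end to end on SQLite (a hypothetical translation of the Oracle queries: `zlib.crc32` stands in for ORA_HASH, an `itertools` counter for the seq_excl sequence, and window-function ranking is done in Python for brevity):

```python
import itertools
import sqlite3
import zlib

seq_excl = itertools.count(1)  # stands in for the Oracle sequence seq_excl

def make_schema(con):
    con.executescript("""
        CREATE TABLE tmp  (bkey TEXT, hkey INTEGER, hkey_qty INTEGER, skey INTEGER);
        CREATE TABLE hub  (pk_bkey INTEGER PRIMARY KEY, bkey TEXT);
        CREATE TABLE excl (bkey TEXT, skey INTEGER);
    """)

def load_dimension(con, src_keys, hash_fn=lambda b: zlib.crc32(b.encode())):
    cur = con.cursor()
    cur.execute("DELETE FROM tmp")                       # step 1: clear the delta
    hub_keys   = {r[0] for r in cur.execute("SELECT bkey FROM hub")}
    hub_hashes = {r[0] for r in cur.execute("SELECT pk_bkey FROM hub")}
    # step 2: insert the delta, ranking keys that share a hash within it
    seen = {}
    for bkey in sorted(set(src_keys) - hub_keys - {None}):
        h = hash_fn(bkey)
        seen[h] = seen.get(h, 0) + 1
        skey = -next(seq_excl) if seen[h] > 1 else None
        cur.execute("INSERT INTO tmp VALUES (?,?,?,?)", (bkey, h, seen[h], skey))
    # step 3: first-ranked rows whose hash already exists in hub also collide
    for bkey, h in cur.execute(
            "SELECT bkey, hkey FROM tmp WHERE skey IS NULL").fetchall():
        if h in hub_hashes:
            cur.execute("UPDATE tmp SET hkey_qty = hkey_qty + 1, skey = ? "
                        "WHERE bkey = ?", (-next(seq_excl), bkey))
    # step 4: publish to the target tables
    cur.execute("INSERT INTO hub  SELECT COALESCE(skey, hkey), bkey FROM tmp")
    cur.execute("INSERT INTO excl SELECT bkey, skey FROM tmp WHERE skey IS NOT NULL")
    con.commit()
```

Injecting a constant `hash_fn` makes the collision branches easy to exercise; with the default hash, nearly every key lands in HUB with its positive hash value and EXCL stays small.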
Loading Facts
Now that the EXCL exception table has been populated, facts from the source (let's call it LOADF) can be loaded with surrogate keys consistent with the HUB dimension table without joining HUB itself; only the small EXCL table is needed:
INSERT INTO fact
  (pk_bkey, …)
SELECT NVL(EXCL.skey, ORA_HASH(LOADF.bkey)) AS pk_bkey,
       …
FROM LOADF
LEFT JOIN EXCL ON (EXCL.bkey = LOADF.bkey);
Application experience
The approach described above was tried in a project to optimize the existing structure of a data warehouse built on clickstream (visit) data from an online store.
Before the optimization, visit strings were loaded into the warehouse and parsed, and the parsing results were written directly into the appropriate columns of the fact table alongside the original string.
For example, from a record whose url column holds the value:

/katalog/bumaga-i-bumazhnye-izdeliya/bumaga-dlya-ofisnoj-tekhniki/formatnaya-bumaga/c/135000/?sort=price-asc&listingMode=GRID&sortingParam=price-asc&email=polydkinamargo%40mail.ru&sc_src=email_5359903&sc_lid=3214301=ipz4a29dvS&sc_llid=16237&sc_eh=53cbe15b802f05341&utm_source=mailing1-mail-ems-v4&utm_campaign=20220419_1709_mailing1-priyatnyye_syurprizy_20220418-b2c-op_z1-nbr-ntr-ntm-v4&utm_term=B2C&utm_medium=%D0%91%D1%83%D0%BC%D0%B0%D0%B3%D0%B0+%D0%904+%D0%BF%D0%BE+%D1%81%D1%83%D0%BF%D0%B5%D1%80%D1%86%D0%B5%D0%BD%D0%B5&utm_content=Akcion_block