Business key and surrogate key are both needed

Data schema that links business keys to a surrogate key

Data schema that links business keys to a surrogate key

A couple of days ago, I campaigned for a new standard for a well-respected data warehousing expert. surrogate keys UUIDv7 for highly loaded systems, and received a response from him:

“Actually, and in my humble opinion, NO surrogate key is good enough, or should be used. In my humble opinion, real and true business keys (alphanumeric values ​​with human readable meaning) are the way forward and the right way to design, identify, architecture and link data together.An example of this would be the vehicle identification number (a global standard identifier with a human readable meaning.) Yes, it contains serial numbers as a component, but the key itself is much more .

To me, all surrogates (machine generated identifiers) are useless meaningless bits of data. They only serve to “uniquely identify a row/data set”. Once “disconnected” from the data it represents, the value of the key becomes completely meaningless, while the real business key (as described above) still retains its value to humans, even as a separate key.

I had to write a response letter from which this article was born.

Indeed, it would be great if the same alphanumeric code could be used both to refer to external data (for example, registries or classifiers) and to link rows in different database tables. But examples of such business keys (natural keys) in the form of alphanumeric codes are very difficult to find, and on closer examination they turn out to be unusable. For example, it was an unpleasant surprise for me that the Russian TIN is not unique (in rare cases). In addition, external identifiers change from time to time (for example, changing the last name), and therefore they cannot be used in databases without tricks. And when it became necessary to combine all types of collateral in one table, then the cadastral number of the property had to be added to the column c VIN (vehicle identification number, vehicle identification number), and there were doubts about the uniqueness of the mixed identifier in the column. Also, Japanese cars don’t have a VIN! They have different markings. This can be a chassis number, or a body or frame number. Instead of the conventional VIN of 17 numbers and letters, it contains 9 to 12 characters.

The identifier, as a rule, has a dual nature. On the one hand, the identifier must point to real world objects, display values ​​from external classifiers and directories as they are, and change with changes in external classifiers and directories (this is a business key). On the other hand, to link rows in different database tables, the identifier must be absolutely stable (it’s a surrogate key). Trying to combine these incompatible requirements in one identifier always leads to huge problems in the databases (data defects, recoding, excessive increase in the number of business key fields, etc.). Therefore, the association of business keys with a surrogate key must always be maintained. Each type of key should perform only one of its narrow functions.

The preferred data schema for associating business keys with a surrogate key is located at the beginning of this article. A hub is a database table related to a single entity (people or cars, etc.). The hub contains an association of business keys with a surrogate key. One surrogate key can correspond to several different business keys. For example, a person can be identified by TIN, or by full name, date and place of birth, or by passport number. The variability of the business key is provided by the attributes of historicity (changes in reality) and versioning (changes in accounting), for example, datetime_from, datetime_before, record_datetime. A business key can consist of several attributes, such as last name, first name, and patronymic. The remaining tables containing attributes of an entity or relationships of this entity with other entities are connected to the hub using a surrogate key.

When it is necessary to compile a report for business users and external organizations, not surrogate, but actual business keys as of the reporting date should be used.

As for the best type of surrogate key, it’s now undeniable UUIDv7. Read my article “Meet the Next Generation UUID for High-Load System Keys” and stay tuned.

If, in addition to the real world and the information system, a third party arises – a person (analyst and user of the data warehouse), then there is a need for human-readable identifiers. With a small number of possible values, it is common to use hand-crafted short alphabetic or alphanumeric codes as the key, which are easily confusing. For UUIDv7 it is also easy to get a human-readable identifier 26 characters long by applying an encoding Crockford’s Base32. Here is an example of such an identifier: 01H7K17QKJ5ND32CQS0NB03FXW. It may take a long time to pronounce it, but on the other hand, you can easily find a similar identifier, copy it to the clipboard, and compare it with another identifier by eye.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *