How to calculate the “similarity” of passport numbers. And find duplicates even with typos
HFLabs products search for duplicate customers in the databases of federal-scale companies. The most obvious way to find cards belonging to the same customer is to compare passports or other identity documents.
Previously, we compared document numbers strictly: identical means a match, otherwise sorry. Because of a single typo in the number, even cards with identical names and addresses were left for manual review. This approach put an unnecessary burden on the customer’s staff.
So we dug into the data, studied the statistics, and derived criteria for when different numbers are genuinely different and when it is just a typo. Here is how the algorithm works.
We introduced a “similarity” coefficient for numbers
Splitting passport and other document numbers into plain “match / no match” is too crude a decision. You can act with more precision and catch simple mistakes.
Let’s say the company has the following rules for finding duplicates (DUL stands for identity document):
- “Full name, address, and DUL matched completely” – duplicate factor 100;
- “Full name and DUL matched completely” – 97;
- “Full name and address matched completely” – 95;
- “Full names matched completely” – 80.
Automation merges cards with a factor above 97. The rest will someday be reviewed by dedicated specialists – data stewards – if their turn ever comes.
The result: the queue for manual review fills with quite obvious duplicates, even cards where the name and address match and the passport numbers differ only by a common typo. That is the case with 46 01 859473 and 45 01 859473 (the 6 and 5 keys are adjacent, so they are often confused). Data stewards get distracted by simple typos, and real duplicates are detected more slowly.
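The pair of numbers above differs in exactly one character, and the differing keys sit next to each other on the keyboard. In code, such a check might look like this (a minimal sketch; the adjacency pairs below are illustrative, not our full table):

```python
# Illustrative keyboard-adjacency pairs; the real table of common
# typos is far larger (assumption for this sketch).
ADJACENT_KEYS = {("5", "6"), ("6", "5"), ("1", "2"), ("2", "1")}

def differs_by_adjacent_key(a: str, b: str) -> bool:
    """True if the strings differ in exactly one position and the
    differing characters sit on adjacent keys."""
    a, b = a.replace(" ", ""), b.replace(" ", "")
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and diffs[0] in ADJACENT_KEYS

print(differs_by_adjacent_key("46 01 859473", "45 01 859473"))  # True
```

A pair like this is exactly what should not reach a data steward.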
Looking at what was happening, we taught our products to calculate the “similarity” of numbers in documents. Customers are already using the new option in their rules for automatic duplicate merging.
We calculate “similarity” by clear rules
When comparing documents, the algorithm first cleans the numbers of garbage, leaving only letters and digits: A–Z, А–Я, 0–9. Then begins the magic for which I wrote this article – calculating the “similarity” coefficient.
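The cleanup step might look like this (a sketch following the description above; uppercasing before filtering is my assumption, and the exact character set in our products may differ):

```python
import re

# Keep only Latin letters, Cyrillic letters, and digits, uppercased.
_GARBAGE = re.compile(r"[^A-ZА-ЯЁ0-9]")

def clean_number(raw: str) -> str:
    """Strip everything except A-Z, А-Я, 0-9 from a document number."""
    return _GARBAGE.sub("", raw.upper())

print(clean_number("№ 46 01-859473"))  # "4601859473"
```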
An important caveat: the coefficient is not a probability. It is a number we need to split duplicates into groups with the same type of error. It does not even matter what the “similarity” is in absolute terms – it is just a parameter for comparing numbers.
And now – the rules of calculation.
| Rule | Similarity coefficient | Comment |
|---|---|---|
| Full match | 100 | Nothing to discuss here, everything is clear |
| Transgraphics | 100 | Transgraphics is when characters of one alphabet are replaced with identical-looking characters from another: Cyrillic in one number, Latin in the other. A typical harmless typo |
| One common typo | 95 | A common typo is when the characters sit close together on one of the keyboard’s number blocks or look similar in writing. The algorithm finds common typos using a similarity table compiled by our analysts (better download it soon, before colleagues force us to remove the link) |
| Layout change | 94 | Applies only when one string contains nothing but digits and Cyrillic and the other nothing but digits and Latin. Otherwise it does not look like the person honestly mixed up the layout |
| Roman numerals replaced with Arabic | 93 | Works only at the beginning of the string. The logic: “honest” Roman numerals can appear only in the series, and the series only at the beginning |
| One uncommon typo | 90 | An uncommon typo is one that is not in the table of common typos |
| One transposition of two characters | 90 | A typical typo, nothing to add |
| Pairs of characters swapped | 89 | Works only for numbers longer than four characters, and we count it as a typo only at the beginning of the string. This is a typical operator mistake when entering the document series: no wonder, since on the form the series is printed as two pairs of digits. In the middle or at the end of the string such permutations are an error |
| One number contained in another | 88 | This comparison catches “lost the series” cases. Works only for strings of six characters or more; six characters is the minimum document number length known to us. We count it as a typo only at the beginning or end of the string. Otherwise, instead of accidentally truncated identifiers, we would catch fragments of one sequence inside another and could, say, take a postal code found inside a TIN for an honest typo |
| Any two typos | 80 | Already quite close to the boundary, but the mistakes still look like “honest” typos |
| All other cases | 0 | Treating the remaining discrepancies as typos is dangerous; the probability of a mistake is too high |
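A few of the rules above can be sketched in code. This is a simplified illustration, not our production implementation: the transgraphics map covers only a small subset of lookalike characters, and most rules (common and uncommon typos, layout change, Roman numerals, pair swaps, two typos) are omitted:

```python
# Illustrative map of visually identical Latin -> Cyrillic characters.
TRANSGRAPHICS = str.maketrans("ABCEHKMOPTXY", "АВСЕНКМОРТХУ")

def similarity(a: str, b: str) -> int:
    """Return a similarity coefficient for two cleaned numbers,
    checking rules from the strongest down, as in the table."""
    if a == b:
        return 100  # full match
    if a.translate(TRANSGRAPHICS) == b.translate(TRANSGRAPHICS):
        return 100  # transgraphics
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]]):
            return 90  # one transposition of two characters
    if len(a) >= 6 and len(b) >= 6 and (
            b.startswith(a) or b.endswith(a)
            or a.startswith(b) or a.endswith(b)):
        return 88  # one number contained in another ("lost the series")
    return 0  # all other cases

print(similarity("AB123456", "АВ123456"))  # 100 (Latin vs Cyrillic)
```

Checking the rules from the highest coefficient down guarantees that each pair of numbers lands in the strongest group it qualifies for.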
We use “similarity” as a parameter when searching for the same customers
A federal bank is already using the new rules, searching with their help for duplicates among potential customers. A large insurance company will be connected next.
During integration, we tune the duplicate-search scripts so that they take into account the “similarity” of numbers in documents.
Recall the typical rules for finding duplicates that I described at the beginning:
- “Full name, address, and DUL matched completely” – duplicate factor 100;
- “Full name and DUL matched completely” – 97;
- “Full name and address matched completely” – 95;
- “Full names matched completely” – 80.
Having introduced the new rules for comparing numbers, we change the customer’s duplicate-search scripts:
- “Full name, address, and DUL matched completely” – 100;
- “Full name and address matched, DUL similarity 90 or above” – 98;
- “Full name and DUL matched completely” – 97;
- “Full name and address matched completely” – 95;
- “Full names matched completely” – 80.
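With the updated rule set, the merge decision reduces to taking the strongest matching rule and comparing it against the auto-merge threshold. A sketch under the assumption that the name and address comparisons are already boolean and the 97 rule means “full name plus DUL” (the function and its interface are mine, not our product’s API):

```python
AUTO_MERGE_THRESHOLD = 97  # automation merges cards scoring above this

def duplicate_factor(name_match: bool, address_match: bool,
                     doc_similarity: int) -> int:
    """Pick the duplicate factor from the rule list, strongest first.
    doc_similarity is the document-number coefficient (0-100)."""
    if name_match and address_match and doc_similarity == 100:
        return 100
    if name_match and address_match and doc_similarity >= 90:
        return 98
    if name_match and doc_similarity == 100:
        return 97
    if name_match and address_match:
        return 95
    if name_match:
        return 80
    return 0

# A one-common-typo pair (coefficient 95) with matching name and
# address now scores 98 and merges automatically.
print(duplicate_factor(True, True, 95) > AUTO_MERGE_THRESHOLD)  # True
```

Under the old rules the same pair would have scored 95 and waited in the manual-review queue.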
Automation still “glues together” all cards with a coefficient above 97. But under the new rules, cards that differ only by a typo in the document number no longer go to manual review. Obvious duplicates collapse instantly, and data stewards sort out the genuinely complex cases.
Article first published on the HFLabs blog.