How to calculate the “similarity” of passport numbers. And find duplicates even with typos
HFLabs products search for duplicate customers in the databases of federal-scale companies. The most obvious way to find cards belonging to the same customer is to compare passports or other identity documents.
Previously, we compared document numbers strictly: identical means a match, otherwise sorry. Because of a single typo in the number, even cards with identical names and addresses were left for manual review. This approach put an unnecessary burden on the customer’s staff.
So we dug into the data, studied the statistics, and derived criteria for when different numbers are genuinely different and when it is just a typo. Here is how the algorithm works.
We introduced a “similarity” coefficient for numbers
Splitting passport and other document numbers into plain “match / no match” is too crude a decision. You can act with more precision and catch simple mistakes.
Let’s say the company has the following rules for finding duplicates (DUL stands for identity document):
- “Full name, address, and DUL matched completely” – duplicate factor 100;
- “Full name and DUL matched completely” – 97;
- “Full name and address matched completely” – 95;
- “Full names matched completely” – 80.
Automation merges cards with a factor above 97. The rest will someday be reviewed by dedicated specialists – data stewards – if their turn ever comes.
The result: the queue for manual review fills with quite obvious duplicates, even cards where the name and address match and the passport numbers differ only by a common typo. That is the case with 46 01 859473 and 45 01 859473 (the 6 and 5 keys are adjacent, so they are often confused). Data stewards get distracted by simple typos, and real duplicates are detected more slowly.
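The pair of numbers above differs in exactly one character, and the differing keys sit next to each other on the keyboard. In code, such a check might look like this (a minimal sketch; the adjacency pairs below are illustrative, not our full table):

```python
# Illustrative keyboard-adjacency pairs; the real table of common
# typos is far larger (assumption for this sketch).
ADJACENT_KEYS = {("5", "6"), ("6", "5"), ("1", "2"), ("2", "1")}

def differs_by_adjacent_key(a: str, b: str) -> bool:
    """True if the strings differ in exactly one position and the
    differing characters sit on adjacent keys."""
    a, b = a.replace(" ", ""), b.replace(" ", "")
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and diffs[0] in ADJACENT_KEYS

print(differs_by_adjacent_key("46 01 859473", "45 01 859473"))  # True
```

A pair like this is exactly what should not reach a data steward.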
Looking at what was happening, we taught our products to calculate the “similarity” of numbers in documents. Customers are already using the new option in their rules for automatic duplicate merging.
We calculate “similarity” by clear rules
When comparing documents, the algorithm first cleans the numbers of garbage, leaving only letters and digits: A–Z, А–Я, 0–9. Then begins the magic for which I wrote this article – calculating the “similarity” coefficient.
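The cleanup step might look like this (a sketch following the description above; uppercasing before filtering is my assumption, and the exact character set in our products may differ):

```python
import re

# Keep only Latin letters, Cyrillic letters, and digits, uppercased.
_GARBAGE = re.compile(r"[^A-ZА-ЯЁ0-9]")

def clean_number(raw: str) -> str:
    """Strip everything except A-Z, А-Я, 0-9 from a document number."""
    return _GARBAGE.sub("", raw.upper())

print(clean_number("№ 46 01-859473"))  # "4601859473"
```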
An important caveat: the coefficient is not a probability. It is a number we need to split duplicates into groups with the same type of error. It does not even matter what the “similarity” is in absolute terms – it is just a parameter for comparing numbers.
And now – the rules of calculation.
| Rule | Similarity coefficient | Comment |
|---|---|---|
| Full match | 100 | Nothing to discuss here, everything is clear |
| Transgraphics | 100 | Transgraphics is when characters of one alphabet are replaced with identical-looking characters from another: Cyrillic in one number, Latin in the other. A typical harmless typo |
| One common typo | 95 | A common typo is when the characters sit close together on one of the keyboard’s number blocks or look similar in writing. The algorithm finds common typos using a similarity table compiled by our analysts (better download it soon, before colleagues force us to remove the link) |
| Layout change | 94 | Applies only when one string contains nothing but digits and Cyrillic and the other nothing but digits and Latin. Otherwise it does not look like the person honestly mixed up the layout |
| Roman numerals replaced with Arabic | 93 | Works only at the beginning of the string. The logic: “honest” Roman numerals can appear only in the series, and the series only at the beginning |
| One uncommon typo | 90 | An uncommon typo is one that is not in the table of common typos |
| One transposition of two characters | 90 | A typical typo, nothing to add |
| Pairs of characters swapped | 89 | Works only for numbers longer than four characters, and we count it as a typo only at the beginning of the string. This is a typical operator mistake when entering the document series: no wonder, since on the form the series is printed as two pairs of digits. In the middle or at the end of the string such permutations are an error |
| One number contained in another | 88 | This comparison catches “lost the series” cases. Works only for strings of six characters or more; six characters is the minimum document number length known to us. We count it as a typo only at the beginning or end of the string. Otherwise, instead of accidentally truncated identifiers, we would catch fragments of one sequence inside another and could, say, take a postal code found inside a TIN for an honest typo |
| Any two typos | 80 | Already quite close to the boundary, but the mistakes still look like “honest” typos |
| All other cases | 0 | Treating the remaining discrepancies as typos is dangerous; the probability of a mistake is too high |
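A few of the rules above can be sketched in code. This is a simplified illustration, not our production implementation: the transgraphics map covers only a small subset of lookalike characters, and most rules (common and uncommon typos, layout change, Roman numerals, pair swaps, two typos) are omitted:

```python
# Illustrative map of visually identical Latin -> Cyrillic characters.
TRANSGRAPHICS = str.maketrans("ABCEHKMOPTXY", "АВСЕНКМОРТХУ")

def similarity(a: str, b: str) -> int:
    """Return a similarity coefficient for two cleaned numbers,
    checking rules from the strongest down, as in the table."""
    if a == b:
        return 100  # full match
    if a.translate(TRANSGRAPHICS) == b.translate(TRANSGRAPHICS):
        return 100  # transgraphics
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]]):
            return 90  # one transposition of two characters
    if len(a) >= 6 and len(b) >= 6 and (
            b.startswith(a) or b.endswith(a)
            or a.startswith(b) or a.endswith(b)):
        return 88  # one number contained in another ("lost the series")
    return 0  # all other cases

print(similarity("AB123456", "АВ123456"))  # 100 (Latin vs Cyrillic)
```

Checking the rules from the highest coefficient down guarantees that each pair of numbers lands in the strongest group it qualifies for.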
We use “similarity” as a parameter when searching for the same customers
A federal bank is already using the new rules, searching with their help for duplicates among potential customers. A large insurance company will be connected next.
During integration, we tune the duplicate-search scripts so that they take into account the “similarity” of numbers in documents.
Recall the typical rules for finding duplicates that I described at the beginning:
- “Full name, address, and DUL matched completely” – duplicate factor 100;
- “Full name and DUL matched completely” – 97;
- “Full name and address matched completely” – 95;
- “Full names matched completely” – 80.
Having introduced the new rules for comparing numbers, we change the customer’s duplicate-search scripts:
- “Full name, address, and DUL matched completely” – 100;
- “Full name and address matched, DUL similarity 90 or above” – 98;
- “Full name and DUL matched completely” – 97;
- “Full name and address matched completely” – 95;
- “Full names matched completely” – 80.
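With the updated rule set, the merge decision reduces to taking the strongest matching rule and comparing it against the auto-merge threshold. A sketch under the assumption that the name and address comparisons are already boolean and the 97 rule means “full name plus DUL” (the function and its interface are mine, not our product’s API):

```python
AUTO_MERGE_THRESHOLD = 97  # automation merges cards scoring above this

def duplicate_factor(name_match: bool, address_match: bool,
                     doc_similarity: int) -> int:
    """Pick the duplicate factor from the rule list, strongest first.
    doc_similarity is the document-number coefficient (0-100)."""
    if name_match and address_match and doc_similarity == 100:
        return 100
    if name_match and address_match and doc_similarity >= 90:
        return 98
    if name_match and doc_similarity == 100:
        return 97
    if name_match and address_match:
        return 95
    if name_match:
        return 80
    return 0

# A one-common-typo pair (coefficient 95) with matching name and
# address now scores 98 and merges automatically.
print(duplicate_factor(True, True, 95) > AUTO_MERGE_THRESHOLD)  # True
```

Under the old rules the same pair would have scored 95 and waited in the manual-review queue.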
Automation still “glues together” all cards with a coefficient above 97. But under the new rules, cards that differ only by a typo in the document number no longer go to manual review. Obvious duplicates collapse instantly, and data stewards sort out the genuinely complex cases.
Article first published on the HFLabs blog.