How the DLP system and the OCR module prevented employees from forging passport scans

Remember the leak of passport data belonging to 500 million customers of the Marriott hotel chain? The data may have ended up in attackers' hands, and the hotel group even promised to cover the cost of replacing passports for affected guests. There are many similar cases, and it is clear why: today more than 50% of companies store more than half of their documents as scans, screenshots, or PDFs. Three years ago, such documents made up no more than a third of an organization's documents. In a new survey by SearchInform, 51% of companies said that the number of documents in image format had increased.

Lately, the documents that most often leak as images are legally significant ones, such as contracts. In second place in the “risk group” are financial documents: balance sheets, profit and loss statements, and so on. Losing such data not only damages a company's reputation but can also disrupt deals. To protect important data from outsiders and intruders, companies install DLP (data leak prevention) systems in their information systems.

We have already written on Habr about how “SearchInform Information Security Circuit” (CIB) works with the OCR module based on ABBYY FineReader Engine. Now, together with the SearchInform product implementation team, we have collected four stories about leaks of different types of data through corporate and personal mailboxes, and worked out how a DLP system with an OCR module can detect them.

At a travel company, an employee sent image files to a personal email address. Using ABBYY technologies, the system established that the attachments were passport scans. This was a gross violation of the rules for handling identity documents and a serious breach of the company's security policy.

How exactly did it become clear that the image files were passport scans? Using its built-in OCR technologies, the DLP system recognized the text on each scan, analyzed it, and determined that the document contained a passport number. Other features are characteristic only of passports, for example phrases such as “Passport issued” and “Unit code”. In addition, the DLP system uses the ABBYY classifier to recognize a number of document types, including passports. The classifier refines the OCR results, which ultimately improves accuracy.
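The check described above can be sketched roughly as follows. This is only an illustration, not the DLP system's or ABBYY's actual logic: the function name, the phrase list, and the assumed number layout (a 4-digit series followed by a 6-digit number) are hypothetical and would need adjusting for real documents.

```python
import re

# Marker phrases named in the article; a real policy would use a longer list.
PASSPORT_PHRASES = ("passport issued", "unit code")

# Assumption: a 4-digit series plus a 6-digit number, e.g. "45 10 123456".
# Passport number formats differ between countries.
PASSPORT_NUMBER = re.compile(r"\b\d{2}\s?\d{2}\s?\d{6}\b")


def looks_like_passport(ocr_text: str, min_phrases: int = 1) -> bool:
    """Return True if OCR text has a passport-like number and marker phrases."""
    text = ocr_text.lower()
    phrase_hits = sum(phrase in text for phrase in PASSPORT_PHRASES)
    has_number = PASSPORT_NUMBER.search(ocr_text) is not None
    return has_number and phrase_hits >= min_phrases


if __name__ == "__main__":
    sample = "Passport issued by Dept. 12\nUnit code 770-001\n45 10 123456"
    print(looks_like_passport(sample))
```

Requiring both a number and at least one marker phrase reduces false alarms on documents that merely contain a long digit sequence, such as invoices.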

Information security (IS) specialists began investigating the incident and found that the confidential files had been sent from the account of the company's designer, from his computer. All the documents had similar names: “Scans”, “Scans_new”, “Scans_1”.

Individual screenshots of the designer's workstation, captured by the MonitorController module of the DLP system, showed that he was working with passport scans in Photoshop. He cut the photos out of them and inserted new ones in their place.

After analyzing all of the designer's actions, the security service concluded that the employee was forging document scans. High-quality fakes could be used to register for online services when an attacker does not want to reveal his real identity. Automatic verification systems would find it difficult to determine whether the information in such images is authentic.

Thus, the technology helped detect both the data leak and the forged passport scans. Thanks to this, the company eliminated the risk to its reputation.

A petrochemical company kept questionnaires with employee data that had been filled in by hand. The DLP system recorded that these questionnaires were sent outside the organization: the security policy on sending personal data was triggered.

The DLP system raised the alert because the built-in OCR module can work with handwritten text and recognizes it with an accuracy of over 88%. This is done using a structural classifier. We have already described ABBYY's intelligent character recognition (ICR) technologies in more detail on Habr.

The presence of personal data in the questionnaires triggered an investigation of the incident. It turned out that the questionnaires also contained phone numbers and detailed information about employees' health. If data is leaking, someone needs it. For example, it may interest those who advertise medical services or engage in social engineering.

The questionnaire scans could easily have ended up in the public domain, with irreparable consequences. Attackers could have extracted this data and harmed not only the employees but also the reputation of the whole company. An employee whose questionnaire fell into the wrong hands could complain to the labor inspectorate or Roskomnadzor, or tell the story on social networks.

What makes this case difficult is that not every technology can recognize handwritten text, but the ABBYY OCR module can. Here is an example. Below is a hand-filled questionnaire:

And the result of recognizing that questionnaire:

ABBYY’s text recognition module helped uncover an industrial espionage scheme. A top manager the company had hired, who had moved to Russia from abroad, sent image files from his personal email to his former colleagues. The DLP system detected this.

Thanks to the OCR module, the DLP system extracted the text from the photos and established that the employee was sending out photos of technical documentation for the company's current developments. The DLP system then analyzed the texts using its “search for similar” algorithm, which can identify texts that are close in content, or even in meaning, to a reference document.
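The article does not say how “search for similar” is implemented. As a rough illustration of the general idea, the sketch below compares OCR output with reference documents using Jaccard similarity over word shingles, a common near-duplicate technique swapped in here for demonstration. All names and the threshold are hypothetical, and this approach captures only overlap in wording, not closeness in meaning.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Split text into overlapping word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def find_similar(ocr_text: str, references: dict, threshold: float = 0.3) -> list:
    """Return names of reference documents that the OCR text resembles."""
    return [name for name, ref in references.items()
            if jaccard(ocr_text, ref) >= threshold]


if __name__ == "__main__":
    # Hypothetical reference texts standing in for protected documents.
    refs = {
        "pump_spec": "the pump housing is rated for a pressure of 16 bar at 80 degrees",
        "canteen_menu": "soup of the day and a choice of two salads",
    }
    photo_text = "pump housing is rated for a pressure of 16 bar at 80 degrees celsius"
    print(find_similar(photo_text, refs))
```

Because `\w+` matches Unicode letters in Python 3, the same sketch works on Cyrillic text, but OCR errors lower the score, so a production threshold would need tuning.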

The difficulty was that the confidential documents were in the language of one of the CIS countries. But both the DLP system and the OCR module can work with this language. The OCR module recognizes documents in 210 languages (printed text) and 126 languages (handwritten text), including languages with alphabets based on Latin, Cyrillic, Greek, and Armenian characters, among many others. It can even handle documents in mixed languages, for example when words in the CIS language appear alongside names in English.

Moreover, technical documentation always contains many tables, drawings, graphs, and diagrams. Often you need to understand what is written in them, since this information can be significant. The OCR module recognizes tables and other complex structures in documents well. Thanks to this, it can extract the information from graphs, for example to determine whether the data is current or already outdated.

The DLP system alerted the information security service to the leak of technical documentation. The specialists analyzed the incident and confirmed that the alert was not a false positive and that the photos really had been taken of confidential documents. As a result, a review of the manager's work correspondence began. Information security experts found that he was leaking valuable data to his friends abroad, data that competitors from another country could use (spoiler: and did). For example, his letters included an informal conversation in which he boasted about how “his friends will conquer the market first and outpace everyone”, including the company where the top manager worked at the time.

But the story does not end there. The security service continued the investigation using the capabilities of the DLP system, which helped find his correspondence with customers. It turned out that the top manager had set up his own legal entity and passed it off as an authorized service center of his “native” company. He took some of the repair orders away from his employer, but used discarded parts rather than new ones. This led to customer complaints against the main company and damaged its reputation. As a result, the company lost its competitive advantage and also lost profit as orders went elsewhere.

The head of the engineering department at a large company took out a sick leave certificate. This would not have attracted attention if the DLP security policy that records the forwarding of air tickets had not been triggered earlier.

The fact is that a letter with a graphic attachment in PDF format had earlier arrived in the employee's mailbox. Thanks to the OCR module, the text in the PDF was recognized, and the DLP system's phrase search module determined that the attached file was a ticket. It did so using a set of phrases typical only of electronic tickets, for example “departure time”, “booking code”, “flight”, and “electronic ticket”. As a result, it turned out that the flight dates coincided with the sick leave.
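A phrase-set check of this kind, followed by a date comparison, might look roughly like the sketch below. It is a simplified illustration rather than the DLP module's real implementation: the function names, the minimum number of hits, and the dates are hypothetical, and only the four phrases come from the article.

```python
from datetime import date

# Phrases the article lists as typical of electronic tickets.
TICKET_PHRASES = ("departure time", "booking code", "flight", "electronic ticket")


def matched_phrases(ocr_text: str, phrases=TICKET_PHRASES) -> list:
    """Return the marker phrases found in the OCR text (case-insensitive)."""
    text = ocr_text.lower()
    return [phrase for phrase in phrases if phrase in text]


def looks_like_ticket(ocr_text: str, min_hits: int = 3) -> bool:
    """Flag the document when enough ticket phrases occur together."""
    return len(matched_phrases(ocr_text)) >= min_hits


def falls_within(day: date, start: date, end: date) -> bool:
    """Check whether a date lies inside an inclusive period."""
    return start <= day <= end


if __name__ == "__main__":
    pdf_text = "ELECTRONIC TICKET\nBooking code: ABC123\nFlight SU 100\nDeparture time 10:40"
    # Hypothetical dates, for illustration only.
    flight_day = date(2020, 3, 12)
    sick_leave = (date(2020, 3, 10), date(2020, 3, 14))
    if looks_like_ticket(pdf_text) and falls_within(flight_day, *sick_leave):
        print("Ticket date falls inside the sick leave period")
```

Requiring several phrases at once matters because a single word such as “flight” appears in many ordinary letters.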

Further investigation showed that the head of the engineering department was traveling to another city for an interview. This was confirmed by his subsequent correspondence with a competitor's HR staff, which the security service found and analyzed. Thus, the DLP system helped company management put the situation under special control and prepare for the employee's dismissal. The company was able to stop a potential leak of important data to competitors and maintain continuity of work.


As you can see, the cases differ, but in every one of them the documents could be recognized and analyzed. If you have examples of unusual leaks of documents as images or photographs, share them in the comments. We will help sort these situations out.
