Implementing the neural network module of a Russian DCAP system

Hello, my name is Mikhail, and I am a senior machine learning specialist at Makves (part of the Garda group of companies). I prefer to call myself a deep learning engineer, since most of my working time is spent training neural network models and putting them into production. The task described in this article comes up quite often in practice, but it is not well covered in the Russian-language materials available to me.

In this article I will talk about how we at Makves trained neural networks to solve corporate security problems, or more precisely, to create a smart DCAP system.

You will learn how a DCAP system that "sees" works, what a neural network classifier does, how threshold values are selected, and more.

Let's imagine a large company with a corporate network in which employees work, interact with each other, exchange files, and so on. Files are continuously created, copied, and modified. What happens if, for example, salary slips become publicly available (even if only within the corporate network)? What if someone mistakenly posts personal information about employees? What if this data is then copied to a flash drive and leaked to competitors or hackers? To prevent such situations, it is necessary to control the huge amount of data stored within the network in very different formats, and this is precisely the problem that a DCAP system solves.

To do this, the system runs a robot that "crawls" the corporate network and inspects any information it can access. Such a robot can spot a huge number of problems: passwords that have not been changed for a long time, passport data stored in clear text, and so on.

DCAP system that sees

What should our robot do when, while inspecting the corporate network, it sees thousands (or millions) of graphic files? What is in them? Which ones contain potentially sensitive data? Which files should an information security specialist be alerted to? You can try heuristics: if a file is named `sales_report_scan_final.jpg`, or sits in a `SALES_2024` folder, it most likely contains sales data. Analyzing directory structure and file names is quite fast, but what do you do with a file like `/common/temp/work/unnamed/022/h/img_6657.JPG`? You cannot solve this without a neural network.


So, the challenge: there are thousands or millions of graphic files. It is necessary to quickly analyze them and provide the information security specialist with the most useful information about the contents of the files. Let's analyze the files in two stages.

Stage 1. Classification

In different situations, the specific contents of files matter differently. Therefore, first of all, the files in the IT infrastructure need to be classified. Samples of the required classes are easy to synthesize, and a simple classifier network can be trained on them.

It would seem to be a classic computer vision problem, but there is a nuance. The customer is interested in `N` classes that require a response. However, the network will also contain a huge number of graphic files that do not belong to the required classes: scans of internal memos, orders, drawings, screenshots, desktop wallpapers, photographs from corporate events, and many other materials that may not be work-related at all. It turns out that the neural network must handle `N+1` classes: the `N` customer classes plus an "other" class, to which we assign all remaining images.
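As a minimal sketch of the `N+1` idea (the class names here are illustrative, not taken from the real system), the classifier head simply carries one extra output for "other":

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to class probabilities."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Hypothetical setup: N = 3 customer classes plus one "other" class.
classes = ["payslip", "passport", "contract", "other"]

logits = np.array([0.2, 0.1, 0.3, 2.5])  # example network output
probs = softmax(logits)
print(classes[int(np.argmax(probs))])  # other
```

The "other" output competes with the customer classes on equal terms, which is exactly why it needs representative training data of its own.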

Neural network classifier


Having analyzed the needs of customers, we decided to further divide the "other" class into two categories: "other" and "other documents". We classify as "other" all images that most likely contain no work information: photographs from corporate events, vacations, children's drawings, desktop wallpapers, and so on. Scans of internal memos, drawings, orders, and flowcharts are better classified as "other documents". This gives the information security specialist more insight: at a minimum, they can quickly estimate how much disk space is occupied by images that could potentially simply be deleted.

Implementing the “other” class is not so easy precisely because of the potential diversity of files. It may also include photographs of nature, people, cars, animals, abstract paintings, etc. Collecting a dataset containing all possible variants of such images is a very difficult (and sometimes unsolvable) task.

In practice, two approaches have worked well for this problem: selecting threshold values and collecting a highly diverse dataset.

Both approaches have their advantages and disadvantages. Let's look at each one.

Selection of threshold values

The idea is quite simple: at the output of the neural network we obtain logits, which (after softmax) can be interpreted as the probabilities of the input belonging to a particular class. If the probabilities for all customer classes are low, we are dealing with the "other" class. The model can be trained, for example, with the `triplet-loss` function [1], which maps input data to embeddings in a compact Euclidean space, where the distances between embeddings can be directly interpreted as the degree of similarity of the inputs.

The `triplet-loss` function is described as follows:

$$ L(a, p, n) = \max\{d(a_i, p_i) - d(a_i, n_i) + \alpha, 0\}, $$

where $d(x_i, y_i) = ||x_i - y_i||_p$ is the distance between embeddings, and $\alpha$ is a hyperparameter of the loss function.

The idea behind the function is quite simple: during training we try to minimize the distance between embeddings belonging to the same class ($d(a_i, p_i)$) and maximize the distance between embeddings belonging to different classes ($d(a_i, n_i)$).
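The formula above is straightforward to sketch in code. This is a minimal NumPy version for illustration; in a real training loop one would use the framework's built-in implementation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2, p=2):
    """Triplet margin loss: pull same-class embeddings together and
    push different-class embeddings apart by at least the margin alpha."""
    d_ap = np.linalg.norm(anchor - positive, ord=p, axis=-1)  # d(a_i, p_i)
    d_an = np.linalg.norm(anchor - negative, ord=p, axis=-1)  # d(a_i, n_i)
    return np.maximum(d_ap - d_an + alpha, 0.0)

# Toy embeddings: the positive is much closer to the anchor than the
# negative, so the margin is satisfied and the loss is zero.
a = np.array([0.0, 0.0])
pos = np.array([0.1, 0.0])
neg = np.array([2.0, 0.0])
print(triplet_loss(a, pos, neg, alpha=0.2))  # 0.0
```

Once the margin condition $d(a_i, p_i) + \alpha \le d(a_i, n_i)$ holds, the loss vanishes and the triplet stops contributing to the gradient.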

The advantage of this approach is simplicity: we do not need to collect documents for the "other" class; it is enough to train the neural network on the customer's classes and cut off "other" by a threshold value.

Implementation of additional classes

You can take the direct route: add an extra class to the dataset and fill it with the most diverse data possible. Bear in mind that very different (and often unexpected) files may turn up on the customer's network: photographs of dogs, cats, people, abstract paintings and drawings; furniture, interiors, cars; images of cups, spoons, and Pithecanthropus. Ideally, the training dataset should contain a huge number of different images.

The disadvantage of this approach is obvious: it is very difficult to assemble a sufficiently diverse class. On the other hand, it has an important advantage: flexibility. For example, at some point we discovered that the neural network misclassified a photograph of a chair (it saw no chairs during training and does not know which class to assign it to). The problem is solved quite simply: we collect several photographs of chairs (and armchairs, while we are at it), add them to the dataset, and retrain the network.

Solution

We decided to test both approaches and compare them on our tasks. When comparing them, we kept in mind that it is extremely important to reduce the number of type I errors (false positives).

We hoped to implement the approach based on threshold values, but, unfortunately, experiments showed that on our data it is inferior to the second approach precisely in terms of the FPR metric (false positive rate, the proportion of type I errors).
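For reference, the metric we compared on can be computed directly from predictions. In this sketch a false positive is a genuinely "other" file that the model flagged as one of the customer classes (labels are illustrative):

```python
def false_positive_rate(y_true, y_pred, positive_classes):
    """FPR = FP / (FP + TN): the share of genuinely 'other' files
    that the model wrongly assigned to a customer class."""
    fp = sum(1 for t, p in zip(y_true, y_pred)
             if t == "other" and p in positive_classes)
    tn = sum(1 for t, p in zip(y_true, y_pred)
             if t == "other" and p not in positive_classes)
    return fp / (fp + tn) if fp + tn else 0.0

y_true = ["other", "other", "payslip", "other"]
y_pred = ["payslip", "other", "payslip", "other"]
print(false_positive_rate(y_true, y_pred, {"payslip"}))  # 1/3
```

Every false positive here is an alert an information security specialist has to triage by hand, which is why we optimized for this metric in particular.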

Collecting data for the "other" and "other documents" classes turned out to be quite a challenge. From time to time we found that the neural network misclassified certain images because we had not thought to add the corresponding examples to the training set, but we managed to significantly reduce the proportion of false positives.

A separate issue was drawing the boundary between classes. A memo belongs to "other documents," and an employee's photo from a fishing trip to "other." But to which class should we assign, say, a photograph of the company's exhibition stand? It is clearly not a document, yet it can be considered a work file. What if some documents are visible in the photo of the stand? What if a photo of the stand is inserted into a report and scanned?

Subtotal

The goal was to improve the awareness of information security specialists by providing them with the data needed to analyze a large number of images. At the first stage, we implemented a neural network classifier that selects images of the classes of interest to customers and divides the remaining files into documents and everything else.

Stage 2. Detection

Image classification makes it possible to find predefined image types and to set aside potentially irrelevant files. This is an important part of the analysis. Keep in mind, however, that image files we classify as "other documents" may also contain sensitive information. We need to identify indirect signs by which information security specialists can judge the contents of such files.

To assess the importance of files in the "other documents" class, we decided to add an object detection block to the neural network module. Similar approaches are often used, for example, to analyze video from surveillance cameras, but we decided to apply them to office documents.

Object detection

Our approach builds on the fact that business documents often contain typical elements: tables, graphs, seals, stamps, blocks of details, and so on. If we train a neural network to detect such graphic elements, its output will allow us to indirectly assess the content of an image. For example, an information security specialist will be able to filter images containing tables and corner stamps in order to find scans of letters with any kind of statistics.
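The filtering step described above reduces to a simple set query over per-file detection results. This is a sketch; the file paths and element labels are made up for illustration:

```python
def filter_by_elements(detections, required):
    """Keep the files whose detected document elements include
    every label in `required`."""
    return [path for path, elems in detections.items() if required <= elems]

# Hypothetical detector output: file path -> set of detected elements.
detections = {
    "/scans/letter_01.jpg": {"table", "corner_stamp", "signature"},
    "/scans/photo_02.jpg": {"seal"},
    "/scans/report_03.jpg": {"table", "chart"},
}

print(filter_by_elements(detections, {"table", "corner_stamp"}))
# ['/scans/letter_01.jpg']
```

The same query mechanism works for any combination of detected elements, so the specialist can compose filters without retraining anything.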

We have compiled a list of elements of business documents that need to be detected. Unfortunately, it was not possible to find data sets in open sources that would be close to the task, so we had to create our own dataset from scratch.

We have collected many examples of business documents from open sources. We paid special attention to the diversity of the collected data so that our training sample included examples of a wide variety of documents of varying quality (scanned documents and their copies, old documents, etc.).

After collecting a fairly large sample (more than 40,000 individual pages), we began annotation. This stage turned out to be difficult, and not only because of the labor intensity of the process. To let several annotators work in parallel, we used free software, and for consistency we wrote annotation guidelines.

Annotating objects inside business documents presented unexpected challenges. When annotating physical objects such as pedestrians or cars, given good image quality, it is easy to determine whether something really is a pedestrian or a car and where its physical boundaries lie. But the entities we annotated (tables, graphs, stamps) are concepts that are hard to define formally.

For example, it is quite difficult to clearly formulate what a table is. Imagine you are labeling data. Look at the 6 pictures in front of you. Which of them show tables?

It seems the first picture is not a table after all (even though formally there are cells and they are grouped); rather, it is some kind of fill-in field. The second picture is more of a diagram. The third and fourth are definitely tables, although the cells are not separated. In the fifth and sixth pictures there are, perhaps, tables, but how many? One table per image or two? And where are their borders? Similar questions had to be answered for almost all detected classes. The domain of business documents is very diverse, so preparing the dataset turned out to be a highly non-trivial task.

Subtotal

To further improve the awareness of information security specialists regarding the content of graphic files in the "other documents" class, we collected a dataset for the object detection task. We annotated the main graphic elements of business documents so that, based on their presence in a given image, information security specialists can indirectly assess the document's content.

Result

We have implemented a two-stage neural network system for analyzing graphic files. Based on its output, information security specialists receive data for quickly assessing the importance of a huge number of documents stored in the corporate network. The extracted data lets them filter and quickly locate files containing critical or sensitive information.

Conclusions

Based on the results of the work carried out, several conclusions can be drawn:

· neural network technologies have long reached the level of maturity that allows their implementation in information security products;

· labeled data is of great value, but the process of data labeling itself is complex and labor-intensive;

· despite the enormous progress of modern multimodal large language models (LLMs), deploying them for some problems is often impossible due to the risk of data leakage, the cost of implementation, and the time it takes to obtain results;

· at the same time, small neural network models are free from the above disadvantages and can be successfully deployed to solve complex problems.

I invite you to discuss even more data protection issues at the "Save Everything: Information Security" conference on October 24 at the Soluxe Convention Center. Our CEO, Roman Podkopaev, will speak in the "Data Protection Platform: A Look from the Inside" section.
