ML-based binary dependency analysis. Final part

In the previous article we described the idea behind our component analyzer and shared the results of some experiments carried out in the lab. The results obtained on a small part of the dataset (3000 libraries) were quite encouraging. In this article we describe the difficulties we ran into when scaling the solution to ~105k libraries and how we dealt with them.

First attempts to scale the solution

After the experiments, we had everything we needed to scale to a larger number of libraries. As a result, we obtained a dataset of 25 million vectors of dimension 768. Uncompressed, this is about 71 GB of data, and the volume would grow further once an index is built on top of it.

Let us point out one unpleasant detail: to search for the nearest vectors in Milvus, the entire dataset together with the index must be loaded into RAM. We did not have that much memory, so we decided to use DiskANN, a newer algorithm that searches data stored on NVMe disks. It builds a weighted graph over the points of the dataset and, for a query, walks along the edges of the graph, converging on the target point.

The DiskANN index turned out to be very resource-hungry: for our dataset, 200 GB of memory was the lower limit. In addition, we ran into difficulties building the index, which were resolved by switching to a new version of Milvus that had been released only days earlier.

Visualization of the graph construction process in DiskANN on 200 two-dimensional points

As a result, the accuracy turned out to be near zero when we tested the model on a jar file built from the dataset's dependencies. At first the index simply had poor recall; we quickly fixed that by tuning the hyperparameters and discovered another problem: there were too many duplicate vectors in the database.

A common situation turned out to be that a library class is so typical that it appears in more than 500 other libraries. This problem was also present in the experiments from the previous article, but there it was handled with a search heuristic: for each query we retrieved the 30 nearest vectors from Milvus and grouped them by distance, which was enough to tell whether the same method appears in several libraries. On the large dataset this approach no longer held up.
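For reference, a minimal sketch of that heuristic with pymilvus 2.x (collection and field names are illustrative, not taken from our code):

```python
from pymilvus import Collection

collection = Collection("class_embeddings")  # hypothetical collection name
collection.load()

def libraries_for_vector(query_vec, eps=1e-3):
    """Fetch the 30 nearest vectors and keep the hits whose distance is
    (almost) equal to the best one: these are copies of the same code
    living in different libraries."""
    hits = collection.search(
        data=[query_vec],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=30,
        output_fields=["library"],
    )[0]
    best = hits[0].distance
    return {h.entity.get("library") for h in hits if h.distance - best < eps}
```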

In addition, the model requires a GPU, since we use a heavy RoBERTa transformer, and even then processing the test jar with ~4k classes took about 15 minutes. We set two further strategic goals: get rid of duplicate vectors in the database and switch to a lighter model, ideally one that can run inference on a CPU.

Transition to a lighter language model

We settled on all-MiniLM-L6-v2 because of its light weight and good quality compared to other models. We had already tested all-MiniLM-L6-v2 in the first experiments on the small dataset:

Compared to RoBERTa, the quality dropped only on obfuscated code, and we were willing to trade that for better performance. We converted the model to the ONNX format and ran it on a 6-core 2.6 GHz CPU.
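One possible way to do this is via Hugging Face Optimum (a sketch; our actual export pipeline may have differed):

```python
import torch
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the checkpoint to ONNX; inference then runs on the CPU via onnxruntime
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc).last_hidden_state              # (batch, seq_len, 384)
    mask = enc["attention_mask"].unsqueeze(-1)        # (batch, seq_len, 1)
    emb = (out * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling over real tokens only
    return torch.nn.functional.normalize(emb, dim=1)  # unit-length vectors for similarity search
```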

As a result, processing a test jar with ~4k classes took about 11 minutes on average. The embedding dimension here is 384, which also roughly halves data storage costs.

Removing duplicate vectors in the database

For the duplicates, we could not find a ready-made solution or algorithm. Because of how Milvus works, getting rid of duplicates in an existing database is harder than building a new database and keeping it duplicate-free from the start.

The difficulty is that the most straightforward algorithm, a linear scan of the database, takes time quadratic in the number of classes across all libraries: to make sure a vector is not a duplicate, you have to query the database for every vector. For 25 million objects this is prohibitively long, so we cannot do without an index. But an index does not guarantee 100% search recall; accuracy depends heavily on the hyperparameters. The way out is to build and update the index on the fly, adjusting its hyperparameters to keep search accuracy and speed optimal.

Our data layout also proved inadequate: for each vector, the information about which library it came from was stored in Milvus next to the vector itself, which is inconvenient.

We decided to create a relation in a Postgres database that stores, for each library, the indexes of its vectors in Milvus. The resulting algorithm looks like this (a rough code sketch follows the list):

  1. We create an empty dataset without an index.

  2. We compute the vectors of library classes and try to add them to the database, checking before insertion that they are not already there:

    1. it is more efficient to accumulate a batch of vectors and send it to Milvus in a single request to search for nearest neighbours;

    2. among the vectors that were not found, we select the unique ones, since there may be duplicates within a single batch;

    3. we add the unique vectors to the database and get their indexes back;

    4. after the previous steps we have, for every vector in the batch, the index of its counterpart in the database – all that remains is to write this information into the Postgres relation.

  3. When a certain dataset size is reached, we build an IVF_FLAT index on it with optimal hyperparameters.

  4. We continue to update the vector database by repeating step 2 and rebuilding the index to maintain optimal hyperparameters.
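A rough sketch of steps 2 and 4, assuming a Milvus collection with an auto-generated primary key plus a single embedding field and a Postgres relation library_vectors(library_id, vector_id); names, thresholds and the empty-collection corner case are illustrative and left out:

```python
from pymilvus import Collection
import psycopg2

collection = Collection("method_embeddings")   # deduplicated vector store (name is illustrative)
pg = psycopg2.connect("dbname=deps")           # relation: library_vectors(library_id, vector_id)
DUP_EPS = 1e-6                                 # L2 distance below which two vectors count as the same

def add_library(library_id, vectors):
    # 1. one batched query to Milvus for the nearest neighbour of every vector
    hits = collection.search(vectors, anns_field="embedding",
                             param={"metric_type": "L2", "params": {"nprobe": 64}},
                             limit=1)
    ids, new_vecs, new_keys = [], [], {}
    for vec, hit in zip(vectors, hits):
        if hit and hit[0].distance < DUP_EPS:          # already in the database: reuse its id
            ids.append(hit[0].id)
        else:
            key = tuple(round(float(x), 6) for x in vec)   # 2. dedupe within the batch itself
            if key not in new_keys:
                new_keys[key] = len(new_vecs)
                new_vecs.append(vec)
            ids.append(("new", new_keys[key]))
    # 3. insert the genuinely new vectors and get their Milvus primary keys back
    inserted = collection.insert([new_vecs]).primary_keys if new_vecs else []
    ids = [inserted[i[1]] if isinstance(i, tuple) else i for i in ids]
    # 4. record "library -> vector ids" in the Postgres relation
    with pg, pg.cursor() as cur:
        cur.executemany("INSERT INTO library_vectors (library_id, vector_id) VALUES (%s, %s)",
                        [(library_id, v) for v in ids])
```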

The index parameters were chosen based on articles from the Milvus blog. We took IVF_FLAT because the algorithm divides the entire dataset into nlist clusters, and a search first goes over the cluster centers and then runs linearly within the nprobe nearest clusters. The index requires no additional memory, and build time depends on the clustering algorithm; Milvus uses Lloyd's heuristic for k-means, which gives linear asymptotics.
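For reference, this is roughly what creating such an index looks like in pymilvus; the nlist/nprobe values here are illustrative, not the ones we settled on:

```python
from pymilvus import Collection

collection = Collection("method_embeddings")
# build IVF_FLAT once the collection is large enough
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                  "params": {"nlist": 4096}},
)
# at query time, only the nprobe nearest clusters are scanned linearly
search_params = {"metric_type": "L2", "params": {"nprobe": 32}}
```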

The algorithm ran for 1.5 days, and the resulting dataset contained 2.64 million vectors. This also solved the data volume problem: the vectors now fit comfortably into 16 GB of RAM. We immediately had the idea of moving to a finer granularity: using method vectors instead of class vectors.

Transformer models are limited to 512 tokens of input. When we computed class embeddings, we took the mean-pooling of the method embeddings. With a finer split we can avoid losing information in the averaging and instead put method embeddings into the database directly. We computed the dataset of method vectors – 4.93 million elements – and the computation took 5 days. All further tests were carried out on methods.
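For clarity, the class-level embedding we used before was just the average of the embeddings of the class's methods, roughly:

```python
import numpy as np

def class_embedding(method_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (n_methods, 384) matrix into one class vector by mean pooling.
    Switching to method-level vectors simply skips this averaging step."""
    return method_embeddings.mean(axis=0)
```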

Tests, problems and a little cheating

We agreed to solve problems as they came up, so for a while we forgot about obfuscation; all tests were run with n_neighbours=1.

First, we checked on a test jar file: out of 21 dependencies, 19 were found, and 500 libraries were returned as false positives. The two dependencies the model missed were not in the dataset. On closer examination, it turned out that most of the false positives are packages whose source code contains only one, two or three methods. If we remove those packages, we get 15 true dependencies and 41 false positives, and 35 of those 41 are incorrect versions of the real dependency libraries.

So we face a problem: there are libraries that, within our method, cause a significant number of false positives because they have few methods and those methods appear elsewhere; and even after removing them, many version errors remain.

We came up with an approach that does not exclude the problematic libraries from consideration but uses additional information in the form of constants. There is another option: assume that such packages are of little interest to us and remove them from the dataset by filtering on the vectors. We will come back to it later.

The constants-based method guarantees 100% recall, but precision is questionable. So we decided to do the following (a sketch of the combination follows the list):

— take not all the constants of each library, but a small subset – 500 of them;

— create a Postgres relation that stores which library each constant came from;

— intersect the baseline prediction with the prediction of our model so that the unneeded small libraries are filtered out.
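A sketch of that intersection, assuming the constants live in a Postgres relation library_constants(library_id, constant_hash); the schema and names are illustrative:

```python
def baseline_libraries(jar_constants, cur):
    """Libraries sharing at least one of the jar's constants (the constant baseline)."""
    cur.execute(
        "SELECT DISTINCT library_id FROM library_constants WHERE constant_hash = ANY(%s)",
        (list(jar_constants),),
    )
    return {row[0] for row in cur.fetchall()}

def combined_prediction(vector_prediction, jar_constants, cur):
    # keep only libraries confirmed by both the vector model and the constant baseline
    return vector_prediction & baseline_libraries(jar_constants, cur)
```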

Yes, it looks like cheating, but if most of the files we see carry such a useful feature, why not take advantage of it? Moreover, the solution cost us nothing: the baseline was already implemented, and thanks to the limit on the number of constants we got a small relation in Postgres and fast queries. But the results were again disappointing: 19 true positives and 34 false positives, all of which were incorrect versions of the true dependencies.

We were not going to give up and went through all the false-positive cases by hand. It turned out that they are indistinguishable by code from the true dependency version. Here is an example with the google-http-client-apache-v2 1.42.0 library. The prediction contains 15 versions of it (besides these, there is one more version in the dataset – 1.33.0):

1.34.0

1.38.0

1.38.1

1.39.0

1.39.2

1.40.1

1.41.0

1.41.1

1.41.4

1.41.7

1.41.8

1.42.0

1.42.2

1.42.3

1.43.1

If we look at the diff between 1.34.0 and 1.43.1 in the official repository, the google-http-client-apache-v2 folder contains no significant semantic changes to the source code: only comments and new fields in methods.

The same happened with the rest of the dependencies. We formed a hypothesis: the different versions of a library produced in the prediction are actually indistinguishable by code, in the sense that many of their classes and methods are identical, as is the corresponding bytecode.

Testing the identical-code hypothesis

To test the hypothesis about different library versions, we changed the experimental setup. Previously we fed the model a single library and expected that same library at the output; now we brought the experiment closer to realistic conditions: we fed the model jar files assembled from arbitrary libraries of the dataset and expected those libraries at the output.

We assembled a dataset of 1000 jar files, each containing 4-15 dependencies. The metrics were computed element by element, that is, for each jar file:

We calculate the metrics for each element of the dataset: recall_i is the share of correctly predicted dependencies among the true ones for element i, and precision_i is the share of correctly predicted dependencies among all predictions for element i
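In formulas, with $T_i$ the set of true dependencies of jar file $i$ and $P_i$ the set of predicted ones:

```latex
\mathrm{recall}_i = \frac{|P_i \cap T_i|}{|T_i|}, \qquad
\mathrm{precision}_i = \frac{|P_i \cap T_i|}{|P_i|}
```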

In the basic variant we compared library names together with versions, that is, any mismatch in name or version counted as an error. But we had to account for the fact that the code of different versions can be identical. So we decided: if a false-positive case is a library of which no other version appears among the true dependencies, we count it as a false positive. In all other cases it is a version mix-up: we compare the bytecode of the extra version with the true one and count it as a false positive only if the bytecode differs.
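The decision rule for each predicted (name, version) pair that is not literally among the true dependencies can be sketched like this; bytecode_equal is a placeholder for whatever bytecode comparison is used, not a real function from our pipeline:

```python
def classify_false_positive(pred, true_deps, bytecode_equal):
    """pred and the elements of true_deps are (name, version) pairs."""
    same_name = [t for t in true_deps if t[0] == pred[0]]
    if not same_name:
        return "fp"                      # no version of this package is a true dependency
    true_version = same_name[0]
    if bytecode_equal(pred, true_version):
        return "indistinguishable"       # version mix-up with identical code: not counted as an error
    return "fp"                          # the extra version really differs in bytecode
```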

This let us count how many unjustified false-positive cases occur: formally a false positive by name, but in fact a version of some dependency that is indistinguishable by code. There is nothing wrong with such cases, because in the end the resulting SBOM will be used for specific tasks such as CVE analysis.

Distribution of metrics for each element of the dataset

We can say that the hypothesis about false positives being indistinguishable by code was confirmed: the metric improves significantly if matching code is not counted as an error. It turns out that only 1204 of the 35,081 false-positive versions differ from the true version in code. But we must not forget the cases where a false positive is not just another version of a true dependency. Because of these, 416 elements of the dataset still contain code-level errors; that is, although within a single element we make few mistakes on average, overall we still make mistakes quite often.

Two extremes: only constants or only vectors

In the following experiment we used only constants:

The example shows how much the use of code information in analysis improves the accuracy of determining dependencies

Another experiment looked at whether it is possible to use only vectors: before the experiment we had set robustness to obfuscation as a goal, yet now we rely on constants.

When only vectors are used for prediction, many small libraries with unrepresentative methods appear in the output. From the relation in Postgres we can easily obtain the distribution of the number of libraries a vector belongs to:

The graph is trimmed at the 0.95 quantile – 44 libraries – for better visualization

A simple idea immediately suggests itself: let's exclude from the analysis vectors that occur in more than X packages. In a sense, X controls recall: the smaller it is, the fewer libraries in the dataset we can potentially recognize. In other words, we drop some libraries from consideration (lowering recall), but the libraries that remain are more distinguishable from each other, so confidence in correctness is higher and precision grows.
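With the library_vectors(library_id, vector_id) relation assumed above, both the distribution and the X-filter come down to one GROUP BY query; a sketch:

```python
X = 40  # a vector used in more than X packages is ignored during search

def admissible_vector_ids(cur, x=X):
    """Vectors that occur in at most x libraries; dropping GROUP BY's HAVING clause
    gives the per-vector library counts used for the distribution plot above."""
    cur.execute("""
        SELECT vector_id
        FROM library_vectors
        GROUP BY vector_id
        HAVING COUNT(DISTINCT library_id) <= %s
    """, (x,))
    return {row[0] for row in cur.fetchall()}
```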

The question arises: how many libraries do we potentially lose for different values of X?

At X = 40 we lose about 5000 libraries. If we want precision = 1, we can take X = 1, but then we can potentially recognize only a third of the dataset. Can we find a trade-off by choosing an optimal X? Let's look at the metrics at X = 40:

It turned out noticeably worse than vectors combined with constants, but better than constants alone. Now let's try X = 10:

In fact, you can filter vectors not only by the number of packages they appear in. You can try other vector statistics: count packages at the level of library name, ignoring versions (for example, if an embedding occurs in three versions of library A and ten versions of library B, the statistic equals 2). Or count libraries not for a single vector but for some neighborhood of it.
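For instance, the version-agnostic variant of the statistic could be computed like this (assuming library identifiers of the form name:version; the format is an assumption for illustration):

```python
def distinct_library_names(libraries):
    """Count distinct libraries ignoring versions:
    {"A:1.0", "A:1.1", "B:2.0"} -> 2 (all versions of A collapse into one)."""
    return len({lib.rsplit(":", 1)[0] for lib in libraries})
```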

What conclusions did we draw?

Unfortunately, we did not have time to evaluate the quality of the model on obfuscated data. The time allotted for the task ran out, and we spent too much of it getting the model to work properly on the large dataset and understanding the reason for the large number of false positives.

Still, we consider the experience we gained quite valuable. For example, duplicate vectors in the database are an evil that keeps the model from working, and we found a way to get rid of them in acceptable time.

Looking only at package names when computing metrics is a bad idea, because it turned out there are plenty of versions that are indistinguishable by code. Also, the approach is not really tied to jar files: the only jar-specific feature we used was the large amount of information in the constants.

And, as we found out in the last section, reasonable accuracy can be achieved through careful selection of vectors. In general, the approach can be transferred to other binary formats: use the same rizin as a disassembler and take functions or a connected component of the call graph as the unit of partitioning – the rest stays exactly the same.

With this, our internship came to an end, and we joined the team full-time to tackle even more interesting problems. But that's a completely different story.
