Behavioral analysis for malware detection
Malware has long been one of the main threats in information security. There are different approaches to analyzing and protecting against such attacks, and broadly they fall into two categories: static and dynamic analysis.
The task of static analysis is to find patterns of malicious content in a file or in process memory. These can be strings, fragments of encoded or compressed data, or sequences of compiled code. The search can target not only individual patterns but also combinations of them with additional conditions (for example, constraints on where a signature is located, or checks on the relative distance between matches).
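To make this concrete, here is a minimal Python sketch of such a combined signature check; the byte patterns, the file path, and the distance threshold are illustrative, not real rules:

PATTERN_A = b"cmd.exe /c "
PATTERN_B = b"vssadmin delete shadows"
MAX_GAP = 4096  # the second pattern must appear within this many bytes of the first

def matches_signature(data: bytes) -> bool:
    # Both patterns must be present, in this order, within MAX_GAP bytes of each other.
    pos_a = data.find(PATTERN_A)
    if pos_a == -1:
        return False
    pos_b = data.find(PATTERN_B, pos_a)
    return pos_b != -1 and pos_b - pos_a <= MAX_GAP

with open("sample.bin", "rb") as f:  # hypothetical sample path
    print(matches_signature(f.read()))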
Dynamic analysis is the analysis of program behavior. A program can be run in a so-called emulated mode, where its actions are interpreted safely, without damaging the operating system. Another option is to run the program in a virtual environment (a sandbox). In this case the actions are actually executed on the system, and the resulting calls are recorded. The level of logging detail is a trade-off between the depth of observation and the performance of the analysis system. The output is a log of the program's actions in the operating system (a behavior trace), which can then be analyzed further.
Dynamic, or behavioral, analysis provides a key advantage: no matter how the program code is obfuscated or how hard the attacker tries to hide their intentions from the virus analyst, the malicious impact will be recorded. Reducing malware detection to the analysis of actions lets us hypothesize that a detection algorithm built this way will be robust. And the reproducibility of behavior, ensured by starting every analysis from the same environment state (a snapshot of the virtual server), simplifies the task of classifying behavior as legitimate or malicious.
Approaches to behavioral analysis are often based on sets of rules. Expert analysis is translated into signatures, on the basis of which the detection tool draws its conclusions about files and malware. However, this creates a problem: only attacks that strictly match the written rules are detected, while attacks that do not meet these conditions, but are still malicious, can be missed. The same problem arises when the same malware is modified. It can be addressed either by using softer trigger criteria, that is, writing a more general rule, or by writing a large number of rules for each malware family. In the first scenario we risk a flood of false positives; the second requires a significant amount of time, which can delay necessary updates.
So there is a need to extend existing knowledge to similar cases: ones we have not encountered before and have no rules for, but which, based on the similarity of certain features, we can conclude are likely malicious. This is where machine learning algorithms come to the rescue.
Correctly trained ML models have a generalizing ability. This means that the trained model has not simply memorized the examples it was trained on, but can make decisions about new examples based on the patterns found in the training set.
However, in order for the generalization ability to work, it is necessary to take into account two main factors at the training stage:
The set of features should be as complete as possible (so that the model can see as many patterns as possible and, accordingly, better extend its knowledge to new examples), but not redundant (so that we do not store and process features that carry no useful information for the model).
The data set should be representative, balanced and regularly updated.
Since we had the opportunity to collect the required amount of data, and we had a hypothesis that machine learning could extend the existing solution, we decided to carry out this research: form a set of features, train a model on them, and achieve an accuracy that lets us trust the model's verdicts on the maliciousness of files.
How expert knowledge is transferred to machine learning models
In the context of malware analysis, the source data is the files themselves, while the intermediate data is the helper processes they create. Processes, in turn, make system calls. The sequences of such calls are the data that we need to transform into a set of features.
The compilation of the dataset began on the expert side. Features were selected that, in the experts' opinion, should be significant for malware detection. All of these features could be reduced to n-grams over system calls. Then, using the model, we assessed which features contribute most to detection, discarded the excess, and arrived at the final version of the dataset.
Initial data:
{"count":1,"PID":"764","Method":"NtQuerySystemInformation","unixtime":"1639557419.628073","TID":"788","plugin":"syscall","PPID":"416","Others":"REST: ,Module=\"nt\",vCPU=1,CR3=0x174DB000,Syscall=51,NArgs=4,SystemInformationClass=0x53,SystemInformation=0x23BAD0,SystemInformationLength=0x10,ReturnLength=0x0","ProcessName":"windows\\system32\\svchost.exe"}
{"Key":"\\registry\\machine","GraphKey":"\\REGISTRY\\MACHINE","count":1,"plugin":"regmon","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe","PPID":"1324","PID":"616"}
{"count":1,"PID":"616","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","plugin":"syscall","PPID":"1324","Others":"REST: ,Module=\"nt\",vCPU=0,CR3=0x4B7BF000,Syscall=19,NArgs=5,KeyHandle=0x1F8,KeyInformationClass=0x7,KeyInformation=0x20CD88,Length=0x4,ResultLength=0x20CD98","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe"}
Intermediate data (sequences):
syscall_NtQuerySystemInformation*regmon_NtQueryKey*syscall_NtQueryKey
The feature vector is presented in the table:
… | syscall_NtQuerySystemInformation*regmon_NtQueryKey | regmon_NtQueryKey*syscall_NtQueryKey | syscall_NtQuerySystemInformation*syscall_NtQueryKey | … |
… | 1 | 1 | 0 | … |
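To illustrate the transformation, here is a minimal Python sketch that parses trace records like the ones above into per-process call sequences and then into binary bigram features. The field names follow the sample records; the file path and helper names are illustrative:

import json
from collections import defaultdict

def load_events(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def build_sequences(events):
    # Group events by PID, order them by time, and turn each into a plugin_Method token.
    sequences = defaultdict(list)
    for e in sorted(events, key=lambda e: float(e["unixtime"])):
        sequences[e["PID"]].append(f'{e["plugin"]}_{e["Method"]}')
    return sequences

def bigram_features(tokens):
    # Binary presence features over adjacent call pairs (2-grams).
    return {f"{a}*{b}": 1 for a, b in zip(tokens, tokens[1:])}

events = load_events("trace.jsonl")  # a file of JSON lines like those above
for pid, tokens in build_sequences(events).items():
    print(pid, "*".join(tokens))
    print(bigram_features(tokens))

From here, discarding uninformative n-grams could be done, for example, by training a tree ensemble and dropping features whose importance falls below a threshold, which matches the feature pruning described above.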
How the model's knowledge was accumulated, how this process changed, and why it is important to stop accumulating data in time
As mentioned above, the main requirements for data are representativeness, balance and regular updates. Let’s explain all three points in the context of behavioral analysis of malicious files:
Representativeness. The distribution of data by features should be close to the distribution in real life.
Balance. The initial training data comes labeled "legitimate" or "malicious", and this information is passed to the model; balance means that the number of malicious examples should be close to the number of clean ones.
Regular updates. This largely follows from representativeness: since trends in the field of malicious files are constantly changing, the model's knowledge must be updated regularly.
Taking into account all the above requirements, the following data accumulation process was built:
The data is divided into two types: the main data stream and reference examples. The reference examples are checked manually by experts, so the correctness of their markup is guaranteed; they are needed to validate the model and to control the training set by adding these reference examples to it. The main stream is labeled by rules and automated checks; it is needed to enrich the sample with diverse real-life examples.
All reference examples are immediately added to the training set.
In addition, some initial dataset from the stream is added to reach the required amount of training data. By the required amount we mean the amount at which the training sample is sufficiently complete (in terms of data diversity) and representative. Since reference examples are checked manually by experts, it is impossible to collect tens of thousands of them from references alone, hence the need to draw diverse data from the stream.
Periodically, the model is tested on new data from the stream.
Accuracy must be guaranteed first of all on the reference examples: if there are contradictions, preference is given to the reference data, which is kept in any case (see the sketch after this list).
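A minimal sketch of this assembly logic; conflict detection here is an exact match on feature vectors for brevity, while comparing by vector distance, as described later, is the more realistic option:

def assemble_training_set(reference, stream):
    # reference, stream: lists of (feature_vector_tuple, label) pairs.
    # All reference examples are kept unconditionally; stream examples
    # are dropped if they contradict the reference markup.
    reference_labels = {vec: label for vec, label in reference}
    training = list(reference)
    for vec, label in stream:
        ref_label = reference_labels.get(vec)
        if ref_label is not None and ref_label != label:
            continue  # contradiction: the reference data wins
        training.append((vec, label))
    return training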
Over time, a lot of data from the stream accumulated, and there was a need to move away from automated, error-driven accumulation in favor of a more controlled training sample:
The training sample accumulated at the current moment is frozen;
Data from the stream is now used only for testing the model; no stream instance is added to the training sample;
Updating the training sample is possible only if the set of reference examples is updated.
Thus, we were able to achieve the following:
We made sure that the trained and frozen model is sufficiently resistant to data drift;
We control each new example added to the training sample (reference examples are manually checked by experts);
We can track every change and guarantee accuracy on the reference dataset.
How to ensure model quality improves with each update
After the described data accumulation process, a logical question may arise: why are we so sure that each update improves the model?
The answer is, again, the reference sample. We consider it the most reliable, because its examples are manually checked and labeled by experts, and with each update we first of all verify that we still guarantee 100% accuracy on this sample. Testing in the wild confirms that accuracy is improving.
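A sketch of what such an acceptance gate for a model update could look like; the function and variable names are assumptions:

from sklearn.metrics import accuracy_score

def accept_update(candidate_model, X_reference, y_reference):
    # Promote the candidate model only if it keeps 100% accuracy
    # on the expert-verified reference sample.
    predictions = candidate_model.predict(X_reference)
    return accuracy_score(y_reference, predictions) == 1.0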
This is achieved by cleaning the training sample of data that contradicts the reference set. By contradictory data we mean examples accumulated from the stream that are sufficiently close, in vector distance, to traces from the reference sample but carry the opposite label.
Our experiments showed that such examples are outliers even with respect to the stream data: after we removed them from the training sample to increase accuracy on the reference sample, accuracy on the stream also increased.
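A minimal sketch of this cleaning step, assuming vectorized data as NumPy arrays; scikit-learn's NearestNeighbors handles the distance search, and the radius value is a tunable assumption:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def drop_conflicting(X_stream, y_stream, X_ref, y_ref, radius=0.5):
    # Drop stream examples that lie within `radius` of a reference trace
    # but carry the opposite label.
    y_stream, y_ref = np.asarray(y_stream), np.asarray(y_ref)
    nn = NearestNeighbors(n_neighbors=1).fit(X_ref)
    distances, indices = nn.kneighbors(X_stream)
    conflict = (distances[:, 0] <= radius) & (y_stream != y_ref[indices[:, 0]])
    return np.asarray(X_stream)[~conflict], y_stream[~conflict]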
How the ML approach and correlation-based behavioral detections complement each other
The ML model performed very well in combination with correlation-based behavioral detections. It is important to stress "in combination": the model's generalizing ability works well when the solution needs to be extended to detect similar, closely related incidents, but not when detection must follow a precise understanding of the rules and criteria of what constitutes malware.
Examples where the ML approach could genuinely extend the solution:
Anomalous chains of subprocesses. By itself, a large number of branching chains is a legitimate phenomenon. But the model notices anomalies in the number of nodes, the degree of nesting, and the repetition (or non-repetition) of specific process names, combinations that an analyst is unlikely to anticipate as malicious in advance.
Non-standard values for otherwise default call parameters. In most cases, analysts are interested in the significant parameters of the functions in which they look for malicious activity. The remaining parameters, roughly speaking, take default values and are of no particular interest. But at some point it turns out that instead of, say, five known default values, a sixth occurs. An analyst might never have guessed this was possible, but the model notices.
Atypical sequences of function calls. This is the case where no function individually does anything malicious, and even taken together they do not. But it so happens that this particular sequence does not occur in legitimate software. It would take a great deal of experience for an analyst to notice such a pattern on their own, yet the model notices (and more than one), solving the non-standard problem of classification by a feature that was never explicitly designed as an indicator of harmfulness.
Examples where signature-based behavioral analysis is important:
Using a specific component with a single call for a malicious action. The system uses hundreds of objects with varying frequency and in varying combinations. Catching the use of one of them against the background of millions of others is hardly possible: the granularity of the anomaly is simply too low.
Proactive detection based on a threat model. We decide that some action on some object in the system is unacceptable even once. The model may not recognize at first that this is a significant event, leaving a chance of an error or an uncertain verdict when classifying something similar.
Obfuscation of the sequence of actions. For example, we may know that a certain three or four actions must be performed in a certain order, and it does not matter what happens between them. If garbage actions are thrown in between the key ones, the model is confused and makes the wrong decision. At the same time, the dimensionality of the feature space does not allow us to account for such obfuscations by storing all combinations of call sequences rather than just their total counts, as the toy sketch below shows.
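A toy illustration of the problem (the call names are made up for the example): inserting junk calls between the key actions destroys every shared bigram, so presence-based 2-gram vectors of the two traces no longer overlap at all.

def bigrams(tokens):
    # The set of adjacent call pairs, as used in the feature vectors above.
    return {f"{a}*{b}" for a, b in zip(tokens, tokens[1:])}

clean = ["OpenProcess", "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"]
noisy = ["OpenProcess", "GetTickCount", "VirtualAllocEx", "Sleep",
         "WriteProcessMemory", "GetTickCount", "CreateRemoteThread"]

print(bigrams(clean) & bigrams(noisy))  # set() -- no shared bigram survives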