Intrusion Detection Using Machine Learning Technologies. Part 1


I recently ran several webinars on applying machine learning to information security, and now I want to share this topic with you in a short series of articles. In this first part, I will talk about intrusion detection systems and the use of machine learning in solving information security problems, and then begin the practical implementation of an intrusion detection system using machine learning models. The practical part of this article covers the data we will use, its analysis, and its preliminary preparation. The second part will describe training the models, analyzing how they perform, and the conclusions drawn from the results.

It is important to note that the example discussed in the practical part of this article is educational in nature and is intended to demonstrate the principles involved. Using this example in real projects requires additional configuration and adaptation to specific conditions.

One of the important components of an information security system is intrusion detection systems. Let's first dive a little into the theory and talk about what it is.

An intrusion detection system (IDS) is a software product or device designed to detect unauthorized and malicious activity on a computer network or on an individual host. There are several types of IDS:

  • Network-based IDS (NIDS) – analyzes network traffic.

  • Protocol-based IDS (PIDS) – inspects data transmitted over a specific protocol, for example HTTP/HTTPS.

  • Application Protocol-based IDS (APIDS) – monitors packets transmitted over a specific application-level protocol.

  • Host-based IDS (HIDS) – monitors the activity of an individual device.

  • Hybrid IDS – a combination of two or more of the above types.

Depending on its type, a system has a particular architecture and operating logic and performs its corresponding tasks. Of the types listed above, the two most common are network-based and host-based IDS; hybrid systems that combine two or more types are also frequently encountered.

In addition to division by type, IDSs are divided by operating principle. There are two principles on which the logic of their work can be based:

  • Anomaly-based intrusion detection – the IDS compares activity on a network or host against a model of correct, trusted behavior of the monitored elements and records deviations from it. This approach makes it possible to identify previously unseen threats.

  • Signature-based intrusion detection – the IDS compares the data being scanned against known attack signatures and raises a security alert on a match. This approach detects intrusions that rely on previously known penetration methods.
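The contrast between the two principles can be illustrated with a toy sketch. The payloads, signatures, baseline statistics, and threshold below are invented for illustration only, not taken from any real IDS:

```python
# Toy illustration of the two detection principles; all values are invented.

KNOWN_SIGNATURES = [b"' OR 1=1 --", b"/etc/passwd", b"\x90\x90\x90\x90"]

def signature_based(payload: bytes) -> bool:
    """Alert if the payload contains any known attack pattern."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)

def anomaly_based(requests_per_minute: float, baseline_mean: float,
                  baseline_std: float, threshold: float = 3.0) -> bool:
    """Alert if activity deviates too far from the learned normal profile."""
    z_score = abs(requests_per_minute - baseline_mean) / baseline_std
    return z_score > threshold

print(signature_based(b"GET /index.php?id=' OR 1=1 --"))  # True: known pattern
print(anomaly_based(950.0, baseline_mean=100.0, baseline_std=20.0))  # True: far from baseline
```

The signature check can only catch the patterns it already knows, while the anomaly check flags anything sufficiently unusual, including attacks never seen before.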

Let's move on to machine learning. First, let's talk about which types of learning are predominantly used depending on the detection principle the IDS relies on:

  • Anomaly-based intrusion detection – typically uses unsupervised learning, since a model of normal behavior can be built without labeled attack examples.

  • Signature-based intrusion detection – typically uses supervised learning, since models are trained on labeled examples of known attacks.

Other techniques can also be applied: ensemble learning, transfer learning (TL), and hyperparameter optimization.

Next, let's move on to practice and try to create an intrusion detection system using machine learning. We will train a number of models and look at their performance. After analyzing the effectiveness of the models, we will draw conclusions about the practical applicability both of the individual models and of machine-learning-based IDS in general.

The first step is to define the operating algorithm of the future system, from the initial data collection through to predictions and evaluation. The algorithm of an intrusion detection system based on machine learning includes the following steps:

  1. Data collection: At this stage, network traffic data is collected.

  2. Data preprocessing: The data is pre-processed to prepare it for machine learning analysis.

  3. Model training: The model is trained on normal network activity to learn to recognize normal network behavior.

  4. Anomaly detection: Once the model is trained, it is applied to new data to detect anomalous or malicious behavior on the network.

  5. Evaluation of results: The model's predictions are assessed against known labels using classification metrics.
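The steps above can be sketched as a minimal pipeline skeleton. This is not the article's implementation (which follows below); the function name, the choice of IsolationForest, and the contamination value are illustrative assumptions:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

def build_ids_pipeline(df: pd.DataFrame):
    # Steps 1-2. Data collection and preprocessing:
    # keep the numeric columns and scale them.
    numeric = df.select_dtypes(include="number")
    features = RobustScaler().fit_transform(numeric)

    # Step 3. Model training on (mostly) normal activity.
    model = IsolationForest(contamination=0.1, random_state=42)
    model.fit(features)

    # Step 4. Anomaly detection: -1 marks anomalies, 1 marks normal points.
    predictions = model.predict(features)

    # Step 5. Evaluation would compare `predictions` against known labels.
    return predictions
```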

Data collection

Let's move on to the first step and talk about the initial data. We will use the openly available “Network Intrusion Detection” dataset, which contains information about network traffic and is often used for classification tasks where one must determine whether network activity is normal or poses a potential security threat (an anomaly or intrusion). It includes fields such as:

  • Protocols (protocol_type): TCP, UDP, ICMP.

  • Services: ftp_data, http, smtp, domain, etc.

  • Connection flags (flag): SF, S0, REJ, RSTO, SH.
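Before charting, it helps to inspect these fields directly with value_counts(). The snippet below uses a tiny hand-made stand-in for the real dataset (in the article, data is loaded from the actual dataset, e.g. with pd.read_csv):

```python
import pandas as pd

# Tiny stand-in for the real dataset; the real data would come from pd.read_csv.
data = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp", "tcp"],
    "service": ["http", "domain", "ftp_data", "eco_i", "smtp"],
    "flag": ["SF", "SF", "S0", "SF", "REJ"],
    "class": ["normal", "normal", "anomaly", "anomaly", "normal"],
})

# value_counts() shows how each categorical field is distributed.
print(data["protocol_type"].value_counts())  # tcp appears 3 times here
print(data["class"].value_counts())          # 3 normal, 2 anomaly
```

The same calls on the full dataset give the proportions that the pie charts below visualize.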

Before moving on to the second step, we will construct two pie charts (Figure 1). The first chart shows the percentage of protocol types in the source data, and the second shows the percentage of network connections classified as attacks versus normal connections.

Figure 1 – Distribution of protocol types and of attack vs. normal connections

These charts support a preliminary analysis of the data. The first lets you draw conclusions about the types of connections (or any other characteristic of interest), while the second shows how evenly the connection classes are distributed, which is important for model training and plays a significant role in all subsequent work with the data. Below is the code for this step.

import pandas as pd
import matplotlib.pyplot as plt

def pie_plot(df, cols_list, rows, cols):
    """Draw one pie chart per column, showing its value distribution."""
    fig, axes = plt.subplots(rows, cols, figsize=(10, 5))
    for ax, col in zip(axes.ravel(), cols_list):
        counts = df[col].value_counts()
        ax.pie(counts, labels=counts.index, autopct="%1.0f%%",
               startangle=90, textprops={'fontsize': 15})
        ax.set_title(col, fontsize=15)
        ax.axis('equal')
    plt.tight_layout()
    plt.show()

pie_plot(data, ['protocol_type', 'class'], 1, 2)

Preliminary data preparation

Next, let's move on to the second step – preliminary data preparation. This step is divided into three stages: separating out the categorical features, scaling the numeric features, and encoding the categorical features.

For scaling, we will use the RobustScaler class from the preprocessing module of the Scikit-learn library in order to reduce the influence of outliers and make the data more uniform. This class standardizes features by removing the median and scaling the data according to the interquartile range. Below is a snippet of data before and after scaling. Using the src_bytes column as an example, you can see that the attribute values have changed and, informally speaking, have become “closer” to each other.

Original values:
   duration  src_bytes  dst_bytes
0         0        491          0
1         0        146          0
2         0          0          0
3         0        232       8153
4         0        199        420

Scaled values:
   duration  src_bytes  dst_bytes
0       0.0   1.602151   0.000000
1       0.0   0.365591   0.000000
2       0.0  -0.157706   0.000000
3       0.0   0.673835  15.375766
4       0.0   0.555556   0.792079
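The transformation RobustScaler applies to each column is (x − median) / IQR, where IQR is the interquartile range Q3 − Q1. A minimal sketch on an invented toy column, where the quartiles are easy to check by hand:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Invented toy column: median = 3, Q1 = 2, Q3 = 4, so IQR = 2.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Each value becomes (x - 3) / 2.
scaled = RobustScaler().fit_transform(x)
print(scaled.ravel())  # [-1. -0.5 0. 0.5 1.]
```

Because the median and IQR are insensitive to extreme values, a single huge src_bytes outlier shifts the scaling far less than it would with mean/standard-deviation standardization.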

It is also necessary to prepare the values of the target variable. To do this, we replace the data in the class column as follows: every value equal to normal (meaning the connection was normal and there was no attack) becomes 0, and every other value becomes 1.

Next, let's move on to encoding the categorical features. First, these features need to be separated from the rest of the dataset. Categorical features are those whose values represent categories or groups rather than numbers; they are characterized by a finite and usually small set of possible values. For encoding, we will use the get_dummies function from the pandas library, which converts categorical variables into a set of binary (dummy) variables. This is important when preparing data for machine learning models, since many algorithms require numeric inputs. As a result, each categorical feature is split into several separate columns, one per category. Below is a fragment of the data before and after encoding. Using the protocol_type attribute as an example, you can see that three columns were created, one for each protocol type, holding binary values.

Before encoding:
0        tcp
1        udp
2        tcp
3        tcp
4        tcp
        ...
25187    tcp
25188    tcp
25189    tcp
25190    tcp
25191    tcp
Name: protocol_type, Length: 25192, dtype: object

Columns after encoding:
       protocol_type_icmp  protocol_type_tcp  protocol_type_udp
0                   False               True              False
1                   False              False               True
2                   False               True              False
3                   False               True              False
4                   False               True              False
...                   ...                ...                ...
25187               False               True              False
25188               False               True              False
25189               False               True              False
25190               False               True              False
25191               False               True              False

Below is a code fragment that contains preliminary data preparation.

import pandas as pd
from sklearn.preprocessing import RobustScaler

def do_scl(df_num, cols):
    """Scale numeric columns with RobustScaler (median removal, IQR scaling)."""
    print("Original values:\n", df_num)

    scaler = RobustScaler()
    scaler_temp = scaler.fit_transform(df_num)
    std_df = pd.DataFrame(scaler_temp, columns=cols)

    print("\nScaled values:\n", std_df)

    return std_df

cat_cols = ['protocol_type', 'service', 'flag', 'class']

def process(dataframe):
    # Separate the numeric features and scale them.
    df_num = dataframe.drop(cat_cols, axis=1)
    num_cols = df_num.columns
    scaled_df = do_scl(df_num, num_cols)

    dataframe.drop(labels=num_cols, axis="columns", inplace=True)
    dataframe[num_cols] = scaled_df[num_cols]

    # Binarize the target: normal -> 0, any attack -> 1.
    dataframe.loc[dataframe['class'] == "normal", "class"] = 0
    dataframe.loc[dataframe['class'] != 0, "class"] = 1

    print("Before encoding:")
    print(dataframe['protocol_type'])

    # One-hot encode the remaining categorical features.
    dataframe = pd.get_dummies(dataframe, columns=['protocol_type', 'service', 'flag'])

    print("\nColumns after encoding:")
    print(dataframe.filter(regex='^protocol_type_'))

    return dataframe

scaled_train = process(data)

Next, still within the preliminary data preparation, let's deal with the target variable and the selection of values. The first step in this fragment is to create an array with the values of the target variable.

The target variable is a class attribute that contains the type of connection: normal or attack. Next, we convert the data types of the target variable to integers for the models to work correctly.

The next step is to split the data into training and test sets for classification. To do this, we use the train_test_split function from the Scikit-learn library, which divides arrays or data matrices into random training and test subsets. Passing 0.2 to the test_size parameter makes the test set 20 percent of the data, leaving 80 percent for training.

Below is a code fragment that implements the steps from the description above.

from sklearn.model_selection import train_test_split

# Target variable: 0 = normal, 1 = attack.
y = scaled_train['class'].values
y = y.astype('int')

# Feature matrix (assumed here to be all columns except the target).
x = scaled_train.drop('class', axis=1).values

# x_reduced is a dimensionality-reduced version of x (covered in part 2).
x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.2, random_state=42)
x_train_reduced, x_test_reduced, y_train_reduced, y_test_reduced = \
    train_test_split(x_reduced, y, test_size=0.2, random_state=42)

With the preliminary analysis and data preparation complete, the next steps are training the models and evaluating them. We will continue in the second part, where we will take a detailed look at various models and their characteristics, walk through the training and evaluation process, and draw conclusions about building an intrusion detection system with machine learning.
