Intrusion Detection Using Machine Learning Technologies. Part 2

Telegram channel: IT Talks.

In the first part of this article, I covered the theoretical foundations of intrusion detection systems and the use of machine learning in information security. I also described the data set, its analysis, and its preliminary preparation.

In the second part, I continue with the implementation of an intrusion detection system using machine learning. I look in detail at training the models, analyzing how they perform, and drawing conclusions from the results.

Note that the example in this article is educational and is intended to demonstrate the principles of operation. Using it in a real project requires additional tuning and adaptation to the specific environment.

We start with the theory of evaluating a model's effectiveness. After training, each model's performance is evaluated with metrics, and these metrics are built from four basic counts:

  • TP (True Positive). This number shows the number of cases when the classifier correctly assigned the object to the class in question.

  • TN (True Negative). This number shows the number of cases when the classifier correctly states that the object does not belong to the class in question.

  • FP (False Positive). This number shows the number of cases when the classifier incorrectly assigned an object to the class in question.

  • FN (False Negative). This number shows the number of times the classifier incorrectly states that an object does not belong to the class in question.
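As a minimal sketch of the four counts above, the snippet below tallies them for one class from two lists of labels. The function name `count_outcomes` and the sample labels are hypothetical and are not taken from the project code.

```python
def count_outcomes(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN), treating `positive` as the class in question."""
    tp = tn = fp = fn = 0
    for y, y_hat in zip(actual, predicted):
        if y_hat == positive and y == positive:
            tp += 1          # correctly assigned to the class
        elif y_hat != positive and y != positive:
            tn += 1          # correctly rejected
        elif y_hat == positive and y != positive:
            fp += 1          # wrongly assigned to the class
        else:
            fn += 1          # wrongly rejected
    return tp, tn, fp, fn


# Hypothetical labels: 1 = attack, 0 = normal traffic
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
print(count_outcomes(actual, predicted))  # (3, 3, 1, 1)
```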

Next, let's look at three simple metrics: accuracy, recall, and precision.

Accuracy shows the share of correct classifications among all classifications. Although it is intuitive and simple, it is one of the least informative classifier metrics, especially when the classes are imbalanced.

Acc = (TP + TN) / (TP + TN + FP + FN)

Recall (also called sensitivity or true positive rate, TPR) shows the ratio of correctly classified objects of a class to the total number of objects that actually belong to this class.

Recall = TP / (TP + FN)

Precision shows the proportion of correctly classified objects among all objects that the classifier assigned to this class.

Precision = TP / (TP + FP)
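The three formulas can be checked with a short calculation. This is a sketch with made-up counts (10 attacks among 1000 connections), chosen to show how accuracy can look excellent while recall and precision are poor.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    # Guard against a class with no actual members
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    # Guard against a classifier that never predicts the class
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts: only 2 of the 10 attacks are caught
tp, tn, fp, fn = 2, 985, 5, 8
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.3f}")  # accuracy  = 0.987
print(f"recall    = {recall(tp, fn):.3f}")            # recall    = 0.200
print(f"precision = {precision(tp, fp):.3f}")         # precision = 0.286
```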

In addition to metrics, we will use a confusion matrix to evaluate effectiveness. A confusion matrix is a table used to evaluate the performance of classification algorithms. It shows the number of correct and incorrect predictions made by the model and makes it easy to see where the errors occur and of what kind they are. The matrix is divided into four cells, one for each of the four counts discussed above, where y is the actual label and a(x) is the model's prediction:

| | y = 1 | y = 0 |
| --- | --- | --- |
| a(x) = 1 | True Positive (TP) | False Positive (FP) |
| a(x) = 0 | False Negative (FN) | True Negative (TN) |
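The layout above can be reproduced in a few lines. This is an illustrative sketch with hypothetical labels; the function `confusion_table` is not part of the project code.

```python
def confusion_table(actual, predicted):
    """2x2 table in the layout above: rows = prediction a(x), columns = actual y."""
    table = [[0, 0], [0, 0]]
    for y, a in zip(actual, predicted):
        row = 0 if a == 1 else 1  # a(x) = 1 is the first row
        col = 0 if y == 1 else 1  # y = 1 is the first column
        table[row][col] += 1
    return table


# Hypothetical labels: 1 = attack, 0 = normal
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]
print(confusion_table(actual, predicted))  # [[2, 1], [2, 3]]
```

Note that scikit-learn's `confusion_matrix` uses a different orientation: rows are actual labels, columns are predicted labels, and labels are sorted, so for 0/1 labels it returns `[[TN, FP], [FN, TP]]`.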

To build the confusion matrix, we first keep the actual labels of the target variable for the test data set and then predict the labels for the same data set with the model. The confusion matrix is calculated from the actual and predicted labels, a display object is created with the given class labels, and the plot is drawn. The function that computes the metrics and plots the confusion matrix is shown below:

import time

import matplotlib.pyplot as plt
from sklearn import metrics

# Metrics collected per model name:
# [train accuracy, test accuracy, train precision, test precision, train recall, test recall]
kernal_evals = {}


def evaluation(model, name, X_train, X_test, y_train, y_test):
    # Time one prediction pass over the test set
    start_time = time.time()
    test_pred = model.predict(X_test)
    execution_time = time.time() - start_time
    print("Time " + str(name) + " : %.3f sec" % execution_time)

    # Predict once per data set and reuse the results for every metric
    train_pred = model.predict(X_train)

    train_accuracy = metrics.accuracy_score(y_train, train_pred)
    test_accuracy = metrics.accuracy_score(y_test, test_pred)

    train_precision = metrics.precision_score(y_train, train_pred)
    test_precision = metrics.precision_score(y_test, test_pred)

    train_recall = metrics.recall_score(y_train, train_pred)
    test_recall = metrics.recall_score(y_test, test_pred)

    kernal_evals[str(name)] = [train_accuracy, test_accuracy, train_precision, test_precision, train_recall, test_recall]
    print(f"Training Accuracy {name} {train_accuracy*100}  Test Accuracy {name} {test_accuracy*100}")
    print(f"Training Precision {name} {train_precision*100}  Test Precision {name} {test_precision*100}")
    print(f"Training Recall {name} {train_recall*100}  Test Recall {name} {test_recall*100}")

    # Confusion matrix from the actual and predicted test labels.
    # The display labels assume the encoding 0 = normal, 1 = attack.
    confusion_matrix = metrics.confusion_matrix(y_test, test_pred)
    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix,
        display_labels=['normal', 'attack'])

    fig, ax = plt.subplots(figsize=(10, 10))
    ax.grid(False)
    ax.set_title(name, fontsize=15)
    cm_display.plot(ax=ax)
    plt.show()

Let's move on to training the models. We will cover logistic regression, k-nearest neighbors, the Gaussian Naive Bayes classifier, gradient boosting, random forest, and a neural network.

Logistic regression is a machine learning method used to solve the problem of binary classification, dividing data into two classes. It is based on the logistic function, which transforms a linear combination of features into the probability of belonging to one of the classes. The essence of the method is to find the optimal weights for each feature so that the model predicts the classes of new data as accurately as possible.
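To make the description concrete, here is a minimal sketch of how a trained logistic regression turns a linear combination of features into a class probability. The weights, bias, and feature values are hypothetical; in practice the weights are found during training.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical trained parameters and one sample with three features
weights = [0.8, -1.2, 2.0]
bias = -0.5
sample = [1.0, 0.5, 0.25]

p = predict_proba(weights, bias, sample)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default threshold
print(round(p, 2), label)     # about 0.55, so the sample is assigned to class 1
```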

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(x_train, y_train)
evaluation(lr, "Logistic Regression", x_train, x_test, y_train, y_test)

As a result, the following estimates were obtained:

Training Accuracy Logistic Regression 87.59986106286905

Test Accuracy Logistic Regression 88.80730303631672

Training Precision Logistic Regression 83.48096953178235

Test Precision Logistic Regression 84.8625629113434

Training Recall Logistic Regression 91.44806995094903

Test Recall Logistic Regression 92.68498942917547

Based on these results, the logistic regression model performs reasonably well, with an accuracy of 87.60% on the training data set and 88.81% on the test data set. The close match between the two suggests good generalization and no overfitting. Precision (83.48% on training and 84.86% on test) is noticeably lower than recall (91.45% on training and 92.68% on test), so the model captures most positive cases but also produces a fair number of false positives. Figure 1 shows the confusion matrix, which confirms that the model performs well but not perfectly, as the off-diagonal counts show.

Figure 1

Next, we will consider the k-nearest neighbors method. K-nearest neighbors (KNN) is a classification and regression algorithm based on the compactness hypothesis, which assumes that objects located close to each other in feature space have similar values of the target variable or belong to the same class.

The essence of the method is to classify or predict a value for a new object based on its nearest neighbors from the training data set.
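The idea can be sketched in plain Python as a majority vote among the k closest training points. This is an illustration with hypothetical two-feature data, not the scikit-learn implementation used in the article.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)  # Euclidean distance
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical data: 0 = normal (near the origin), 1 = attack (near (5, 5))
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (4.5, 5.0)))  # 1
print(knn_predict(points, labels, (0.5, 0.2)))  # 0
```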

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=20).fit(x_train, y_train)
evaluation(knn, "KNeighborsClassifier", x_train, x_test, y_train, y_test)

As a result, the following estimates were obtained:

Training Accuracy KNeighborsClassifier 98.0747283282886

Test Accuracy KNeighborsClassifier 98.0551696765231

Training Precision KNeighborsClassifier 98.38536060279871

Test Precision KNeighborsClassifier 98.25457641549595

Training Recall KNeighborsClassifier 97.46214544679036

Test Recall KNeighborsClassifier 97.58985200845666

The k-nearest neighbors method performs well, with an accuracy of 98.07% on the training data set and 98.06% on the test data set. High precision (98.39% on training and 98.25% on test) and recall (97.46% on training and 97.59% on test) show that the model both makes few false positive predictions and captures most positive cases. Together these results point to a reliable and balanced classifier. Figure 2 shows the confusion matrix, which leads to the same conclusion: the off-diagonal counts are very small, which also indicates a good model.

Figure 2

Next, we will consider the Gaussian Naive Bayes classifier. Naive Bayes is a simple probabilistic classifier based on Bayes' theorem with the "naive" assumption that the features are independent of each other. The essence of the method is to predict which class an object belongs to using a probabilistic model for each class.
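The following sketch shows the principle in plain Python: for each class it estimates a prior plus a mean and variance per feature, then picks the class with the highest log-probability. The function names and data are hypothetical, and the code is a simplification, not the scikit-learn `GaussianNB` implementation.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means, and feature variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # small term avoids zero variance
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior plus summed log-likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-densities are simply added
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical data with two features: 0 = normal, 1 = attack
X = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 0.5), (3.2, 0.4), (2.9, 0.6)]
y = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(nb_model, (1.1, 2.0)))  # 0
print(predict_gaussian_nb(nb_model, (3.1, 0.5)))  # 1
```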

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(x_train, y_train)
evaluation(gnb, "GaussianNB", x_train, x_test, y_train, y_test)

As a result, the following estimates were obtained:

Training Accuracy GaussianNB 92.62640797896094

Test Accuracy GaussianNB 93.84798571145069

Training Precision GaussianNB 92.63180639585133

Test Precision GaussianNB 94.23159707275074

Training Recall GaussianNB 91.42674344209853

Test Recall GaussianNB 92.55813953488372

The Gaussian Naive Bayes classifier shows moderate but consistent performance, with an accuracy of 92.63% on the training data set and 93.85% on the test data set. Precision (92.63% on training and 94.23% on test) and recall (91.43% on training and 92.56% on test) are close to each other and stable across both data sets.

Figure 3 shows the confusion matrix. The results are fairly good, but the off-diagonal counts are larger than for the previous model.

Figure 3

Next, let's look at gradient boosting. Gradient boosting is a machine learning technique for classification and regression problems that builds a prediction model in the form of an ensemble of weak predictive models, usually decision trees.

In the code we will use XGBoost (Extreme Gradient Boosting), an implementation of the gradient boosting algorithm. The essence of the method is to train an ensemble of weak models (usually decision trees) sequentially, where each new model corrects the errors of the previous ones.
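The sequential error correction can be sketched with one-split "stumps" on a single feature. This is a simplified illustration with squared error on hypothetical 0/1 targets: each stump is fitted to the residuals left by the previous ones. XGBoost itself uses full trees, a logistic objective for this task, and regularization, none of which is shown here.

```python
def fit_stump(xs, residuals):
    """Find the threshold on one feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_estimators=20, learning_rate=0.3):
    """Fit stumps one after another, each on the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


def boosted_predict(base, stumps, x, learning_rate=0.3):
    """Score for a new value: base prediction plus all scaled stump corrections."""
    return base + sum(learning_rate * (l if x <= t else r) for t, l, r in stumps)


# Hypothetical data: small feature values are normal (0), large ones are attacks (1)
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
base, stumps, fitted = boost(xs, ys)
print(round(boosted_predict(base, stumps, 2.5), 3), round(boosted_predict(base, stumps, 4.5), 3))
```

The learning rate must match between `boost` and `boosted_predict`; both default to the same hypothetical value of 0.3.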

import xgboost as xgb

xgb_classifier = xgb.XGBClassifier(objective="binary:logistic", n_estimators=100, random_state=42)
xgb_classifier.fit(x_train, y_train)
evaluation(xgb_classifier, "XGBoost", x_train, x_test, y_train, y_test)

This configuration solves a binary classification problem with the logistic objective (binary:logistic) and an ensemble of 100 trees (n_estimators=100).

As a result, the following estimates were obtained:

Training Accuracy XGBoost 99.9909347152389

Test Accuracy XGBoost 98.8809287557055

Training Precision XGBoost 99.91762638918734

Test Precision XGBoost 98.91536182818452

Training Recall XGBoost 99.99817263451826

Test Recall XGBoost 98.830866807611

Gradient boosting, represented here by the XGBoost model, shows impressive performance on both data sets: accuracy is 99.99% on training and 98.88% on test. Precision (99.92% on training and 98.92% on test) and recall (close to 100% on training and 98.83% on test) confirm that the model predicts positive cases effectively. However, near-perfect training scores can point to problems such as overfitting, class imbalance, or noise in the data. Although the current results show excellent classification quality, these factors should be kept in mind when interpreting the results and improving the model further.

Figure 4 shows the confusion matrix, which also reflects the high performance of the model: the off-diagonal counts are very small.

Figure 4

Let's move on to the random forest method. Random forest is a machine learning algorithm that uses a combination of multiple decision trees to solve classification or regression problems. The essence of the method is that each decision tree is built independently of the others; they are trained on different subsets of data and features. The results from all the trees are then combined for classification or regression.
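The two ingredients described above, independent trees trained on different subsets and a combined vote, can be sketched as follows. This is a deliberately reduced illustration on hypothetical data: each "tree" is a single split on one randomly chosen feature, whereas a real random forest grows full decision trees and samples features at every split.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """One-split 'tree': threshold at the feature mean, majority label on each side."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
    majority = Counter(labels).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else majority
    right_label = Counter(right).most_common(1)[0][0] if right else majority
    return feature, threshold, left_label, right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw as many rows as the data set has, with replacement
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))  # random feature for this tree
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    """Combine the trees by majority vote."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical data with two features: 0 = normal, 1 = attack
rows = [(0.8, 1.0), (1.0, 1.2), (1.2, 0.8), (4.8, 5.0), (5.0, 5.2), (5.2, 4.8)]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
print(forest_predict(forest, (1.1, 0.9)), forest_predict(forest, (5.1, 5.0)))
```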

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier().fit(x_train, y_train)
evaluation(rf, "RandomForestClassifier", x_train, x_test, y_train, y_test)

As a result, the following estimates were obtained:

Training Accuracy RandomForestClassifier 99.99167281423711

Test Accuracy RandomForestClassifier 98.78170271879341

Training Precision RandomForestClassifier 99.9914571898426

Test Precision RandomForestClassifier 98.83065198983911

Training Recall RandomForestClassifier 99.99572841246449

Test Recall RandomForestClassifier 98.70401691331924

The random forest model (RandomForestClassifier) also performs very well, with an accuracy of 99.99% on the training data set and 98.78% on the test data set. Precision (99.99% on training and 98.83% on test) and recall (close to 100% on training and 98.70% on test) indicate accurate and reliable predictions. As in the previous case, however, such high training scores may indicate overfitting, where the model memorizes the training data well but may be less effective on new data. This calls for careful analysis and possibly additional tuning to make sure the model is stable and generalizes.

Figure 5 shows the confusion matrix, which also shows that the model's performance is close to ideal.

Figure 5

The code linked at the end of the article also tries random forest with principal component analysis (PCA) applied during data preparation. I do not cover it in detail here because PCA did not improve the model's performance and turned out to be a poor fit for data preparation in my case. As an exercise, you can experiment with data preparation and try a different dimensionality reduction method.

Let's move on to neural networks. First, let's talk about what types of neural networks there are and how each type is used for intrusion detection.

  • Multilayer perceptrons can be used to detect anomalies and attack signatures.

  • Recurrent neural networks can be used to analyze network traffic and detect anomalies based on sequential patterns.

  • Deep neural networks can be used to process complex data and identify unusual or anomalous patterns in network traffic.

  • Generative adversarial networks (GANs) can be used to create realistic synthetic data and train the IDS model on more diverse attack patterns.

To solve the problem, I will use the first type, a multilayer perceptron, built with the Keras Sequential model: a linear stack of layers in which each layer has exactly one input and one output. Adam is used as the optimization algorithm.

If we talk about architecture, the neural network will consist of the following layers:

  • Fully connected layers with ReLU activation. Used to extract and transform features from input data.

  • Dropout layers. Help prevent overfitting by randomly disabling neurons during training.

  • Output layer with sigmoid activation. Outputs a value between 0 and 1, which is interpreted as the probability of belonging to the positive class.
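The effect of the ReLU and dropout layers on a vector of activations can be sketched in plain Python. The snippet assumes "inverted" dropout, where surviving values are rescaled during training so that nothing needs to change at inference time, which is how the Keras Dropout layer behaves. The function names and values are hypothetical.

```python
import random


def relu(values):
    """ReLU activation: negative values become zero."""
    return [max(0.0, v) for v in values]


def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` during training and rescale the rest."""
    if not training or rate == 0.0:
        return list(values)  # dropout is inactive at inference time
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(42)
activations = relu([0.7, -1.3, 2.0, 0.1, -0.4, 1.5])
print(activations)                                           # negatives replaced by 0.0
print(dropout(activations, rate=0.5, training=True, rng=rng))   # some values zeroed, others doubled
print(dropout(activations, rate=0.5, training=False, rng=rng))  # unchanged
```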

The implementation in code is presented below:

import tensorflow as tf
from tensorflow.keras import regularizers

# Multilayer perceptron: two hidden layers with L2 regularization and dropout,
# followed by a single sigmoid output for binary classification
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=64, activation='relu', input_shape=x_train.shape[1:],
                 kernel_regularizer=regularizers.l2(0.01),
                 bias_regularizer=regularizers.l2(0.01)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=32, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01),
                 bias_regularizer=regularizers.l2(0.01)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])

Let's look at the model summary in Figure 6. The first column shows the layer type, the second the shape of the layer's output, and the third the number of trainable parameters in the layer. Dense layers are fully connected layers, and they hold all of the model's parameters. Dropout layers have no parameters of their own because they only regularize the network: at each training step they randomly zero a fixed share of the previous layer's outputs (50% here), which helps prevent overfitting and improves the model's ability to generalize.

Figure 6
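The parameter counts in such a summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below applies this to the three Dense layers of the model. The number of input features is a hypothetical value, since the real one is `x_train.shape[1]` and depends on the prepared data set.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


n_features = 40  # hypothetical; replace with x_train.shape[1]
layers = [
    ("Dense(64)", n_features, 64),
    ("Dense(32)", 64, 32),
    ("Dense(1)", 32, 1),
]

total = 0
for name, n_inputs, n_units in layers:
    count = dense_params(n_inputs, n_units)
    total += count
    print(name, count)   # Dense(64) 2624, Dense(32) 2080, Dense(1) 33
print("Total", total)    # Total 4737 (Dropout layers add nothing)
```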

The neural network was trained for 100 epochs. The metrics for the first and last epochs are shown below:

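| Epoch | Accuracy (%) | Loss | Validation accuracy (%) |
| --- | --- | --- | --- |
| 1 | 78.35 | 11.7777 | 94.94 |
| 2 | 91.72 | 4.4146 | 95.30 |
| 3 | 92.78 | 1.1391 | 96.13 |
| 4 | 93.41 | 0.8637 | 96.19 |
| 5 | 93.77 | 0.7069 | 96.29 |
| 96 | 96.33 | 0.1778 | 97.30 |
| 97 | 96.34 | 0.2730 | 97.30 |
| 98 | 96.27 | 0.2695 | 97.18 |
| 99 | 96.08 | 0.2383 | 97.26 |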

Next, let's draw conclusions about the neural network from the per-epoch metrics:

  • Accuracy. At the beginning of training, accuracy was about 78.35%; by the end it reached 96.29% on the training data set and 96.98% on the validation data set.

  • Loss. At the beginning of training, the loss was high (11.78), but it decreased as the network trained, and by the end it was 0.1871 on the training set and 0.1404 on the validation set.

  • Validation accuracy. Validation accuracy was already high after the first epoch (94.94%, against 78.35% on the training set) and stayed slightly above the training accuracy, with the gap narrowing as training progressed. One likely reason is that dropout is active only during training.

  • Validation loss. Like the training loss, the validation loss was high at the beginning of training and decreased as training progressed, indicating that the model's ability to generalize improved.

The model was compiled and trained with the following code:

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=['accuracy'])
model.summary()

history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=100, verbose=1)
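For reference, the `binary_crossentropy` loss named above can be computed by hand as follows. This is a plain-Python sketch with hypothetical labels and probabilities. The loss that Keras reports during training also includes the L2 regularization penalties of the layers, which is probably part of the reason the first-epoch loss is so high.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels and sigmoid outputs
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.3]
print(round(binary_crossentropy(y_true, y_prob), 3))  # about 0.198
```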

Now let's sum up and look at the summary table, which shows the metrics and prediction time of each model. Note that I ran these experiments on a machine with an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz. Industrial-scale deployments, of course, use entirely different hardware.

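| Model | Accuracy, train (%) | Accuracy, test (%) | Precision, train (%) | Precision, test (%) | Recall, train (%) | Recall, test (%) | Time (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Logistic regression | 87.60 | 88.81 | 83.48 | 84.86 | 91.45 | 92.68 | 0.040 |
| Nearest neighbors | 98.07 | 98.06 | 98.39 | 98.25 | 97.46 | 97.59 | 0.899 |
| Naive Bayes classifier | 92.63 | 93.85 | 92.63 | 94.23 | 91.43 | 92.56 | 0.055 |
| XGBoost | 99.99 | 98.88 | 99.92 | 98.92 | 100.00 | 98.83 | 0.063 |
| Random forest | 99.99 | 98.78 | 99.99 | 98.83 | 100.00 | 98.70 | 0.095 |
| Neural network | 97.00 | 97.00 | 99.00 | 99.00 | 94.00 | 95.00 | 0.258 |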

Based on the final data, gradient boosting and random forest achieve the highest metrics. They are followed by k-nearest neighbors and the neural network, although in most cases a neural network gives the best results. Much depends on the quality of data preparation: I tested this implementation on another data set, and the results were quite different, which suggests that the data set used here needs more detailed analysis and preparation. Besides preparation, the volume of data, the model parameters, and the neural network architecture all have a strong influence.

Although the time differences between the models are small in this case, prediction time deserves attention. Intrusion detection systems usually operate at layers 3 and 4 of the OSI model, so in addition to detection quality, the following factors are important:

  • Real-time detection.

  • Reduced latency.

  • Efficient use of network resources.

  • Ability to scale.

  • Effective response to threats.

In addition, keep in mind that some models, neural networks in particular, can require a large amount of resources. When choosing a solution, it is therefore worth assessing all performance indicators together, including hardware requirements and prediction time.

Machine learning in cybersecurity can be used not only for online detection but also for post-hoc analysis of data, where the requirements for speed and resources are less strict.
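One simple way to bring prediction time into such a comparison is to measure the per-sample latency and throughput of a model's prediction function. The sketch below does this with a stand-in function; `measure_latency` and `dummy_predict` are hypothetical names, and in practice the callable would wrap a trained model.

```python
import time


def measure_latency(predict, samples, repeats=5):
    """Return (seconds per sample, samples per second) from the fastest of `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sample in samples:
            predict(sample)
        best = min(best, time.perf_counter() - start)
    per_sample = best / len(samples)
    throughput = 1.0 / per_sample if per_sample > 0 else float("inf")
    return per_sample, throughput


def dummy_predict(sample):
    """Stand-in for a real model: flags a sample when its feature sum is large."""
    return sum(sample) > 1.0


# Hypothetical traffic samples with three features each
samples = [(0.1 * i, 0.2, 0.3) for i in range(1000)]
per_sample, throughput = measure_latency(dummy_predict, samples)
print(f"{per_sample * 1e6:.2f} microseconds per sample, {throughput:.0f} samples per second")
```

Taking the fastest pass reduces the influence of background activity on the machine; averaging the passes is an equally reasonable choice.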

Link to the solution discussed in this series of articles: https://github.com/Oshurkova/IDSNetwork. The solution I mentioned above is similar to the current one, using a different dataset and a slightly different implementation: https://github.com/Oshurkova/ids_model.

I wish you good luck in learning machine learning for cybersecurity, intrusion detection systems and more.
