Security Operations Center efficiency: what parameters to look at?

When we talk about KPIs and efficiency, the question arises: what should a SOC track in its daily activities? At first glance the answer is obvious: first, SLA compliance; second, events flagged as suspected incidents. But event statistics can be viewed from many angles, and on top of that there are problems with event sources, anomalies, SIEM load, and so on: a whole mass of parameters. Which of them deserve a place on an analyst's dashboard is what we discuss below.

Tracking SLA fulfillment

The first thing that comes to mind is the degree of SLA fulfillment. This parameter needs to be tracked at every stage: registration and analysis of information security events, notifying the customer, resolving the incident, conducting the investigation, and so on. And here you need to see not only the average time (or, better, the median) but also the maximum and minimum values. Naturally, it should be possible to see SLA fulfillment for each individual incident, so you can analyze exactly when and why it gets violated.
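
For illustration, here is a minimal Python sketch of how such per-stage statistics could be computed for a dashboard; the stage names and handling times are hypothetical.

```python
from statistics import median

# Hypothetical handling times (minutes) per processing stage,
# as they might be exported from a ticketing system.
stage_times = {
    "registration": [4, 6, 3, 12, 5],
    "analysis":     [25, 40, 18, 95, 33],
    "notification": [7, 5, 9, 30, 6],
    "resolution":   [120, 340, 95, 60, 210],
}

for stage, times in stage_times.items():
    # The median is more robust to outliers than the mean,
    # while min/max expose the extremes worth investigating.
    print(f"{stage:>13}: median={median(times):>6.1f} "
          f"min={min(times):>4} max={max(times):>4}")
```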

Clearly, tracking time alone is not enough: you also need statistics on how the number of SLA-violating incidents changes over time. In other words, when onboarding a customer, we define an SLA not only for the service as a whole but also for each specific information security event (here we care about what kind of incident it is and at what moments its status changed). This lets us assign responsibility, pinpoint exactly where the process broke down, identify the root cause immediately, and put safeguards in place so that such "happiness" does not repeat in the future.
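
A rough sketch of that per-incident view, assuming the ticketing system records a status-change history; the statuses, timestamps, and per-stage SLA thresholds below are made up for illustration.

```python
from datetime import datetime

# Hypothetical per-stage SLA thresholds (minutes) and a status-change
# history for one incident, as a ticketing system might record it.
sla_minutes = {"registered": 10, "in_progress": 60, "notified": 15}

history = [
    ("registered",  datetime(2021, 3, 1, 10, 0)),
    ("in_progress", datetime(2021, 3, 1, 10, 7)),
    ("notified",    datetime(2021, 3, 1, 11, 55)),  # analysis ran long
    ("resolved",    datetime(2021, 3, 1, 12, 5)),
]

# Walk consecutive status pairs and flag the stage whose duration
# exceeded its SLA -- that is exactly where the process failed.
for (status, started), (_, ended) in zip(history, history[1:]):
    spent = (ended - started).total_seconds() / 60
    limit = sla_minutes.get(status)
    if limit is not None and spent > limit:
        print(f"SLA violated at stage '{status}': "
              f"{spent:.0f} min spent, limit {limit} min")
```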

Keeping statistics on information security events

Such statistics give an understanding of how exposed the company is to external and/or internal attacks and which objects those attacks target (which in turn can become an incentive to optimize the customer's internal information security processes). Here it is important for us to understand the following (a small aggregation sketch follows the list):

– what constitutes normal behavior of systems/users, and what deserves closer attention;
– how many IS events occur overall;
– how many of them we report to the customer as suspected incidents;
– which events are the most frequent;
– where they occur;
– which assets important to the customer are affected (sometimes a commercial SOC knows the customer's infrastructure better than the customer does);
– which units are involved in the events, both as victims and as sources;
– at what moments bursts in the number of events occur, why this happens, whether there is a pattern to it, and whether we can do something about it.
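
As promised, a small aggregation sketch (pure Python, with hypothetical event records) showing how a few of these breakdowns plus a crude burst check could be derived from a flat event export:

```python
from collections import Counter

# Hypothetical flattened event records: (type, affected unit, hour).
events = [
    ("bruteforce", "finance", 9), ("bruteforce", "finance", 9),
    ("bruteforce", "hr", 9), ("malware", "it", 9),
    ("phishing", "finance", 9), ("bruteforce", "finance", 9),
    ("malware", "it", 13), ("phishing", "hr", 17),
]

by_type = Counter(t for t, _, _ in events)
by_unit = Counter(u for _, u, _ in events)
by_hour = Counter(h for _, _, h in events)

print("most frequent types:", by_type.most_common(2))
print("most affected units:", by_unit.most_common(2))

# Crude burst detection: flag hours with more than double the mean volume.
mean = sum(by_hour.values()) / len(by_hour)
print("burst hours:", [h for h, n in by_hour.items() if n > 2 * mean])
```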

In addition, it is useful to understand what stage of its life cycle an information security event is currently in (registered, in progress, escalated, resolved, rejected, or investigated), how critical it is, which assets it affects, how much effort has already been spent on handling it, and how much is still needed.
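
One plausible way to model this life cycle in code, with a guard against status jumps the workflow does not allow; the statuses come from the list above, while the transition table is purely an assumption:

```python
from enum import Enum

class EventStatus(Enum):
    REGISTERED = "registered"
    IN_PROGRESS = "in progress"
    ESCALATED = "escalated"
    RESOLVED = "resolved"
    REJECTED = "rejected"
    INVESTIGATED = "investigated"

# One plausible set of allowed transitions; real SOC workflows differ.
TRANSITIONS = {
    EventStatus.REGISTERED:  {EventStatus.IN_PROGRESS, EventStatus.REJECTED},
    EventStatus.IN_PROGRESS: {EventStatus.ESCALATED, EventStatus.RESOLVED,
                              EventStatus.REJECTED},
    EventStatus.ESCALATED:   {EventStatus.RESOLVED},
    EventStatus.RESOLVED:    {EventStatus.INVESTIGATED},
}

def move(current: EventStatus, new: EventStatus) -> EventStatus:
    """Refuse status jumps the workflow does not allow."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new

print(move(EventStatus.REGISTERED, EventStatus.IN_PROGRESS).value)
```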

Separately, we look at statistics on events recognized as incidents. There are many dimensions along which to slice these statistics: event categories and statuses; external attack or internal incident; criticality; SLA; hosts or users affected; systems or units involved; sources from which the information security events were obtained, and so on.
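
A toy sketch of such slicing: any pair of fields yields a pivot a dashboard widget can render. The incident records and field names are hypothetical.

```python
from collections import Counter

# Hypothetical incidents carrying the slicing dimensions named above.
incidents = [
    {"category": "malware",    "criticality": "high",   "origin": "external"},
    {"category": "malware",    "criticality": "medium", "origin": "external"},
    {"category": "data_leak",  "criticality": "high",   "origin": "internal"},
    {"category": "bruteforce", "criticality": "low",    "origin": "external"},
]

# A pivot over (category, criticality), ready for a dashboard table.
pivot = Counter((i["category"], i["criticality"]) for i in incidents)
for (category, crit), count in sorted(pivot.items()):
    print(f"{category:<11} {crit:<7} {count}")

external = sum(i["origin"] == "external" for i in incidents)
print(f"external share: {external / len(incidents):.0%}")
```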

Tracking the customer's reaction

Here we track how long it takes the customer to respond to our notification of a possible information security incident, and we keep statistics on notifications that went unanswered. This is needed, first of all, to build effective interaction: for example, if the customer does not monitor our alerts very actively, then in the event of a critical incident we will use every available communication channel to reach them. In addition, some companies have internal KPIs for responding to SOC messages, and our statistics are in great demand there.
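
A minimal sketch of those two metrics, assuming we store the time each notification was sent and the time (if any) the customer replied; the timestamps are invented.

```python
from datetime import datetime, timedelta

# Hypothetical notifications: when we alerted the customer and when
# (if ever) the customer replied.
notifications = [
    (datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 10, 20)),
    (datetime(2021, 3, 1, 12, 0), None),                # never answered
    (datetime(2021, 3, 1, 15, 0), datetime(2021, 3, 1, 18, 45)),
]

answered = [replied - sent for sent, replied in notifications if replied]
unanswered = sum(1 for _, replied in notifications if replied is None)

avg_response = sum(answered, timedelta()) / len(answered)
print(f"average customer response time: {avg_response}")
print(f"notifications left unanswered: {unanswered}")
```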

Depending on the customer's verdict (the event is recognized as an incident, legitimate activity, or a false positive), we adjust the rules. That is, we learn which IS events are the norm and can be added to the exceptions, and which constitute a critical incident the customer must be notified about not only by e-mail but also through a faster channel, such as a phone call.
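
A deliberately simple sketch of this feedback loop: hypothetical verdict values mapped to hypothetical tuning actions.

```python
# Hypothetical mapping from customer verdicts to rule-tuning actions.
ACTIONS = {
    "false_positive": "add matching condition to rule exceptions",
    "legitimate":     "whitelist this activity for the asset",
    "incident":       "keep rule; escalate via phone, not just e-mail",
}

def tune(rule_id: str, verdict: str) -> str:
    """Return the tuning action a customer verdict implies."""
    action = ACTIONS.get(verdict)
    if action is None:
        raise ValueError(f"unknown verdict: {verdict}")
    return f"rule {rule_id}: {action}"

print(tune("R-1042", "false_positive"))
print(tune("R-1042", "incident"))
```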

What else?

For ourselves, we have identified a few more parameters that we consider important to track on dashboards. Let's take them in order.

1. Triggered correlation scenarios that form the final IS event.

If a scenario has stopped working for some reason, you need to notice it in time: that way we can spot malfunctions and fix them. There are also scenarios that never fired before, or fired extremely rarely, and then suddenly fired for the first time or began firing frequently. The reasons for such anomalies also need to be clarified. When this is shown explicitly on the dashboard, it is much easier to keep track of such moments. As a result, we can catch an incident that nothing seemed to foreshadow.
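
One possible way to surface both situations (a normally active rule that went silent, and a rule that fired for the first time), assuming we keep daily firing counts per scenario; the rule names and numbers are made up.

```python
# Hypothetical daily firing counts per correlation scenario for the
# last week (index -1 is today).
firings = {
    "R-phishing-link":  [12, 9, 14, 11, 10, 13, 12],
    "R-admin-at-night": [2, 1, 0, 2, 1, 1, 0],
    "R-vpn-bruteforce": [5, 4, 6, 5, 0, 0, 0],   # went silent
    "R-rare-proto":     [0, 0, 0, 0, 0, 0, 7],   # fired for the first time
}

for rule, counts in firings.items():
    history, today = counts[:-1], counts[-1]
    baseline = sum(history) / len(history)
    if baseline > 1 and all(c == 0 for c in counts[-3:]):
        print(f"{rule}: silent for 3 days (baseline {baseline:.1f}/day)")
    elif baseline == 0 and today > 0:
        print(f"{rule}: fired for the first time ({today} hits)")
```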

2. Anomalies in the arrival of information security events from sources.

Here we monitor not only the complete absence of events from a source but also automatically compare each period of "silence" with statistics on the source's "normal" behavior. If the pause is too long, it deserves to be highlighted on the dashboard, so that we can react quickly and restore the source in time. Otherwise, a false sense may arise that there are no events, meaning everything is fine and no one is attacking us. In fact, there can be many reasons for such silence, from an incorrect intervention by the customer's IT service in the infrastructure to the activities of an attacker who, for example, wrote malware specifically to bypass some of the conditions in the rule. Do not relax!
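
A sketch of such a silence check, assuming we keep timestamps of recent events per source: the current "pause" is compared against the 95th percentile of the source's normal inter-event gaps. All timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import quantiles

# Hypothetical timestamps of the last events received from one source.
arrivals = [datetime(2021, 3, 1, 9, 0) + timedelta(minutes=m)
            for m in (0, 3, 5, 9, 12, 16, 18, 22, 25, 27)]
now = datetime(2021, 3, 1, 11, 0)

# Baseline: typical gap between consecutive events from this source.
gaps = [(b - a).total_seconds()
        for a, b in zip(arrivals, arrivals[1:])]
p95 = quantiles(gaps, n=20)[-1]   # 95th percentile of normal gaps

silence = (now - arrivals[-1]).total_seconds()
if silence > 3 * p95:
    print(f"source silent for {silence/60:.0f} min "
          f"(normal p95 gap: {p95/60:.1f} min) -- check the source!")
```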

The dynamics of event arrival is also an important indicator. A curious thing happens: events seem to keep coming, but mysterious "dips" or "gaps" appear in the stream. It is better to notice such a picture on the chart in time, so as not to miss an incident later.
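
A minimal dip detector over a per-minute event counter, comparing each minute with the trailing average; the threshold and counts are illustrative (note the baseline drifts once a dip begins, which is acceptable for a sketch):

```python
# Hypothetical events-per-minute counts from one source: the stream
# never stops completely, but a suspicious dip appears in the middle.
eps = [52, 48, 55, 50, 49, 51, 12, 10, 14, 50, 53, 49]

window = 5
for i in range(window, len(eps)):
    baseline = sum(eps[i - window:i]) / window
    if eps[i] < 0.5 * baseline:      # current minute far below trend
        print(f"minute {i}: {eps[i]} events vs ~{baseline:.0f} expected -- dip")
```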

3. Data from a specific source.

By looking at dashboards for specific sources, we can compare the events they collect with what the SIEM considered incidents. Here we analyze which events were processed correctly and which need to be added to the correlation rules to identify incidents. We also pull raw statistics from the protection tools into the dashboard and compare how many attacks they blocked themselves and how many got through to the SOC. The dashboard should thus show the overall level (number and complexity) of attacks on the customer and how many of them require the customer's reaction: that is, how many attacks were attempted and how many actually reached the customer.
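
A sketch of that comparison, assuming the protection tools export daily "blocked" counters and we count what reached the SOC; the tool names and numbers are hypothetical.

```python
# Hypothetical daily counters: what the protection tools blocked
# themselves vs. what reached the SOC as events needing analysis.
stats = {
    "firewall":  {"blocked": 1840, "passed_to_soc": 12},
    "antivirus": {"blocked": 96,   "passed_to_soc": 4},
    "waf":       {"blocked": 310,  "passed_to_soc": 25},
}

for tool, s in stats.items():
    total = s["blocked"] + s["passed_to_soc"]
    print(f"{tool:<9} total={total:>5} blocked={s['blocked']/total:6.1%} "
          f"needs SOC reaction={s['passed_to_soc']}")
```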

4. SIEM load.

You can write many beautiful scenarios and rules in a SIEM system that work great. But at the same time they can load it heavily, and at some point may even bring it down entirely, making it hard for the monitoring lines to cope. Therefore, statistics on rule load should be shown explicitly on the dashboard; in this matter it is better to be over-vigilant than to miss something. We then watch and analyze how the load of each rule changes over time, and monitor how the SIEM load is affected by adjustments within a rule and by external changes (for example, changes in the customer's infrastructure, or the rule being rolled out for several customers). Based on this we draw conclusions: is such a rule needed at all (and in its current form in particular), or is it easier to disable it completely or rework it?
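
Neither the metric names nor the threshold below come from any particular SIEM; this is just one way to approximate per-rule load, assuming the SIEM exposes average execution time and run frequency per rule.

```python
# Hypothetical per-rule load figures a SIEM might expose:
# average execution time per run and runs per hour.
rules = [
    {"id": "R-010", "avg_ms": 4,   "runs_per_hour": 36000},
    {"id": "R-123", "avg_ms": 850, "runs_per_hour": 1200},   # heavy
    {"id": "R-200", "avg_ms": 15,  "runs_per_hour": 4000},
]

for r in rules:
    # Rough share of one CPU core this rule consumes continuously.
    load = r["avg_ms"] * r["runs_per_hour"] / 3_600_000
    flag = "  <-- candidate to rework or disable" if load > 0.2 else ""
    print(f"{r['id']}: load ~{load:.2f} core{flag}")
```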

5. The scale of attacks.

When you have many customers across the country or even the world, it makes sense to create a dashboard with a map on which you can track the scale of attacks. The map displays, almost in real time, information about emerging incidents: the coordinates of the attack's origin (source IP) and of its target (destination IP).

If there are too many vectors on the map, we are seeing mass attacks. Sometimes they all originate from one point but are directed at different targets; sometimes it is the other way around. For clarity, it is better to color these vectors differently depending on the criticality of the incident.
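
A sketch of the aggregation behind such a map: hypothetical incident vectors (documentation-range IPs) grouped to spot fan-out from a single source, with a color per criticality level for the rendering layer to use.

```python
from collections import Counter

# Hypothetical incident vectors: (source IP, destination IP, criticality).
vectors = [
    ("203.0.113.7",  "10.0.0.5",  "high"),
    ("203.0.113.7",  "10.0.0.9",  "high"),
    ("203.0.113.7",  "10.0.1.20", "medium"),
    ("198.51.100.2", "10.0.0.5",  "low"),
]

COLORS = {"high": "red", "medium": "orange", "low": "green"}

# One source hitting many targets (or vice versa) signals a mass attack.
fan_out = Counter(src for src, _, _ in vectors)
for src, n in fan_out.most_common():
    if n > 1:
        print(f"{src} attacks {n} targets -- possible mass attack")

# Records a map layer could render, colored by criticality.
for src, dst, crit in vectors:
    print(f"{src} -> {dst} [{COLORS[crit]}]")
```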

Elena Trescheva, lead analyst at Solar JSOC
