Security Operation Center KPIs: How We Came to Our Metrics System

I won’t write long, abstruse texts here about “how to build a KPI system for a SOC correctly”. I’ll just tell you how we ~~fought and searched~~ found our own methodology and how we now measure “how bad / good / safe / (underline as appropriate)” things are.

How it all began

Oddly enough, our first steps towards KPIs for Solar JSOC had nothing to do with centers for monitoring and responding to cyber attacks. “At the dawn of our youth,” we helped companies build systems for assessing the effectiveness of information security (ISMS 27001 and all that). The need for such systems arose in the market naturally: almost any information security department has to analyze large amounts of data from many different systems, day after day. Each of these systems has some kind of reporting of its own, of course, but when there are many of them it is very difficult to form a holistic picture of the state of information security and then report to management in a convenient format. The problem is compounded if the organization is geographically dispersed.

We helped customers build not just a set of KPIs / metrics, but a full-fledged analytical solution that aggregates data from information security systems. In essence, it is a visualization system in which you can quickly and easily see the heart of a problem and where it is located, so that the required decisions can be made promptly. In these projects we gained experience and concluded that such a system is genuinely convenient and useful. And also that the work of a SOC needs to be evaluated too.

Why evaluate the effectiveness of a SOC, especially an external one?

It’s simple: on the one hand, we want to understand how well we provide the service, and on the other, we want the most complete picture of the customer’s infrastructure, so that we see all the “blind spots” that were not covered by our audit and became risk factors for the service. Simply put, we want to understand whether or not we will see a given attack on the customer.

When we started working as a service provider, clients sometimes refused to connect specific sources that were necessary to identify an attack with 100% certainty. Then exactly that kind of attack would happen, we would not see it, and we would be blamed despite our initial warning.

Another example: we told the customer that, to identify an incident correctly and accurately, the sources had to be configured in a certain way, and we gave a list of these settings, but the customer did not carry out the work. The result was the same: a missed incident.

So we came to understand that it is important to make explicit, both for the customer and for ourselves, what exactly we see in their infrastructure, where our “blind spots” are, which attack vectors are used most often and in which areas, which IT assets are most susceptible to attacks, and how this can affect the business. For this, the visualization system must show the real situation and help analyze it, and not just create a “wow effect” for management (as is often the case).

KPIs for a SOC: what to measure and how?

First of all, you need to understand why you need this KPI / metrics system at all. Do you want to measure the performance of your information security department? To understand how well (or how poorly) your processes are performing? Or maybe just to show management “who’s doing a great job”? Or maybe department bonuses depend on KPI performance? Without understanding the goals of the evaluation, it is impossible to build a KPI system that really works.

Let’s say we have decided on the goals. Now the most interesting question arises: how do you actually measure any of this? You can’t walk into a SOC with a ruler; things are a little more complicated here. After all, a SOC is not only a SIEM as a system for collecting and correlating information security events, but also a huge variety of systems that allow the service to function correctly. There is an enormous amount of data inside a SOC, so there is a lot to evaluate.

Here we try to get away from subjective KPIs as much as possible, i.e. from metrics that cannot be measured automatically. For example, the metric “How bad everything is with us” is difficult to assess directly without involving a person (who will give a ~~crooked~~ not always correct opinion based on their own experience). But if we break this metric into smaller ones, those can be calculated from data supplied by technical tools. That is, we need to define what “everything is bad” means for us: we lack a specific information security system; antivirus is not deployed everywhere it is needed; specialists take a very long time to process incidents or requests; all our hosts have more than 10 critical vulnerabilities and no one fixes them; and so on. If all these small metrics are then combined into a single calculation, taking into account their weight coefficients for our business, we get the value of the metric “How bad everything is with us”. Moreover, we will be able to explain what it is based on and why a particular value means it is time to urgently solve serious problems in the organization of information security. And most importantly, we can always drill down into the details of this metric and understand which tasks have which priority.
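The roll-up described above can be sketched in a few lines of Python. This is only an illustration, not our production formula: the metric names, their values and the weights are made up for the example, and each small metric is assumed to be already normalized to a 0..1 scale where 1 means “as bad as it gets”.

```python
# Hypothetical small metrics, each normalized to 0..1 (1 = worst case).
# Names, values and weights are illustrative assumptions.
sub_metrics = {
    "hosts_without_antivirus": 0.20,           # share of hosts lacking antivirus
    "overdue_incidents": 0.50,                 # share of incidents handled too slowly
    "hosts_with_10plus_critical_vulns": 0.40,  # share of badly unpatched hosts
}
weights = {
    "hosts_without_antivirus": 3.0,
    "overdue_incidents": 5.0,
    "hosts_with_10plus_critical_vulns": 2.0,
}


def composite_score(metrics, weights):
    """Weighted average of normalized metrics; the result stays in 0..1."""
    total_weight = sum(weights[name] for name in metrics)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(metrics[name] * weights[name] for name in metrics) / total_weight


def contributions(metrics, weights):
    """Per-metric weighted contributions, largest first, for drill-down."""
    total_weight = sum(weights[name] for name in metrics)
    parts = {name: metrics[name] * weights[name] / total_weight for name in metrics}
    return sorted(parts.items(), key=lambda item: item[1], reverse=True)


how_bad = composite_score(sub_metrics, weights)
ranking = contributions(sub_metrics, weights)
```

The sorted contributions are what makes the number explainable: the first entry in `ranking` is the small metric that currently drags the overall value down the most.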

When building our KPI system, we adhere to the following principles:

– the KPI must be genuinely important for both the SOC and the customer;
– the indicator must be measurable, i.e. specific calculation formulas must be built and threshold values set;
– we must be able to influence the value of the indicator (i.e. metrics like “percentage of sunny days in a year” are not suitable for us).
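As a rough sketch of what “measurable” means in practice, an indicator can be stored together with its formula and its thresholds. The class, the field names, the sample KPI and the threshold values below are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Kpi:
    name: str
    formula: Callable[[dict], float]  # computes the value from raw data
    warn_below: float                 # values below this are "yellow"
    fail_below: float                 # values below this are "red"

    def evaluate(self, data: dict) -> Tuple[float, str]:
        """Apply the formula and map the value to a traffic-light status."""
        value = self.formula(data)
        if value < self.fail_below:
            return value, "red"
        if value < self.warn_below:
            return value, "yellow"
        return value, "green"


# Hypothetical KPI: share of required sources that are actually connected.
sources_connected = Kpi(
    name="Required sources connected, %",
    formula=lambda d: 100.0 * d["connected"] / d["required"],
    warn_below=95.0,
    fail_below=80.0,
)
value, status = sources_connected.evaluate({"connected": 46, "required": 50})
```

With 46 of 50 required sources connected, the value is 92% and lands in the “yellow” band under these assumed thresholds.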

We also came to the conclusion that the KPI system cannot be flat and must have at least three levels:

1) “Strategic”: KPIs that reflect the overall picture of progress towards the set goals;
2) “Investigation, analysis, identification of connections”: KPIs from which the first level is formed and which contribute to achieving the main goal;
3) “Monitoring and response”: the most basic KPIs, from which all the higher-level ones are built (we measure them automatically, based on data from information security systems and other sources).

Each indicator affects the one above it. Since this influence is not uniform, each indicator is assigned a weighting factor.
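One way to picture such a three-level structure is a small tree in which only the leaves are measured and every parent is a weighted average of its children. The sketch below assumes exactly that; the indicator names, weights and values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Indicator:
    name: str
    weight: float = 1.0            # influence on the parent indicator
    value: Optional[float] = None  # measured value, set only on leaves (0..100)
    children: List["Indicator"] = field(default_factory=list)

    def score(self) -> float:
        """Leaf: the measured value. Parent: weighted average of children."""
        if not self.children:
            if self.value is None:
                raise ValueError(f"leaf {self.name!r} has no measured value")
            return self.value
        total = sum(child.weight for child in self.children)
        return sum(child.score() * child.weight for child in self.children) / total


# Level 1 -> level 2 -> level 3 (measured automatically).
tree = Indicator("Service quality", children=[
    Indicator("Incident handling", weight=2.0, children=[
        Indicator("Incidents triaged on time, %", weight=3.0, value=90.0),
        Indicator("Notifications with asset context, %", weight=1.0, value=70.0),
    ]),
    Indicator("Source coverage", weight=1.0, children=[
        Indicator("Required sources connected, %", weight=1.0, value=80.0),
    ]),
])
overall = tree.score()
```

Because every parent is computed from its children, a poor top-level value can always be traced down to the specific third-level metrics behind it.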

Of course, the first thing we want to see at all times is how effective our service is for our customers. And, of course, this information must be timely. For this, we have developed (and continue to improve) a system of metrics that reflects the quality of work of each of our units: 1st and 2nd lines, service managers, analysts, response, administration, etc. For each of these areas there are about 10-15 KPIs, calculated from data in the systems the teams work in (whether requests are fulfilled on time, whether we respond quickly to the customer’s request, how sources are connected, and much more).
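A KPI such as “do we respond to the customer quickly” can be derived directly from ticket timestamps. The snippet below is a minimal sketch under assumed inputs: the ticket data is made up, and the 20-minute response target is an example value rather than an actual SLA.

```python
from datetime import datetime, timedelta

# Assumed response target, for illustration only.
RESPONSE_TARGET = timedelta(minutes=20)

# Hypothetical ticket export: (opened, first_response).
tickets = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 12)),
    (datetime(2024, 1, 10, 11, 30), datetime(2024, 1, 10, 12, 5)),
    (datetime(2024, 1, 11, 15, 0), datetime(2024, 1, 11, 15, 19)),
    (datetime(2024, 1, 12, 8, 45), datetime(2024, 1, 12, 8, 50)),
]


def on_time_share(tickets, target):
    """Percentage of tickets whose first response came within the target."""
    if not tickets:
        return None  # nothing to measure in this period
    on_time = sum(1 for opened, responded in tickets if responded - opened <= target)
    return 100.0 * on_time / len(tickets)


share = on_time_share(tickets, RESPONSE_TARGET)
```

Returning `None` for an empty period, rather than 0 or 100, keeps “no data” distinguishable from a genuinely bad or good result.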

SLA is good, but real quality of service is more important

It is important for us that the service coverage allows us to identify the maximum number of incidents and attacks, so that we are not blind kittens. We want to describe incidents at the customer in terms of their own IT assets rather than abstract IPs, so that our notifications do not boil down to “Mimikatz was found on the host 10.15.24.9” and do not force the customer to find out for themselves what kind of host it is, wasting time that is needed for response and recovery.
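To make the difference concrete, here is a small sketch of enriching a detection with asset context before it is sent. The inventory structure, the host name, its role and owner are invented for the example; a real inventory would come from the customer’s asset management systems.

```python
# Hypothetical asset inventory keyed by IP address.
ASSETS = {
    "10.15.24.9": {
        "name": "FIN-DB-01",
        "role": "finance database server",
        "owner": "Finance IT",
        "criticality": "high",
    },
}


def describe_alert(detection: str, ip: str, assets: dict) -> str:
    """Build a notification line that names the asset, not just its IP."""
    asset = assets.get(ip)
    if asset is None:
        # An unknown host is itself a coverage gap worth counting.
        return f"{detection} found on unknown host {ip} (not in asset inventory)"
    return (
        f"{detection} found on {asset['name']} ({asset['role']}, "
        f"criticality: {asset['criticality']}, owner: {asset['owner']}, IP {ip})"
    )


message = describe_alert("Mimikatz", "10.15.24.9", ASSETS)
```

The share of notifications that fall into the “unknown host” branch is a natural candidate for a coverage metric of its own.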

In other words, it is important for us to understand how well our SOC customers are protected. To do that, we need to determine how fully and in how much detail we “see” them:

• whether all significant sources are connected to us;
• how effectively the customer’s information security systems (which are also sources for our service) cover their infrastructure;
• whether all sources are configured as we recommend, and what the deviations are;
• whether all the necessary and sufficient scenarios for detecting attacks and incidents have been enabled for the customer;
• whether all connected sources send us events with the expected regularity;
• whether the customer reacts to all our notifications, and how promptly.
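Some of these checks are easy to automate. For instance, the regularity of events can be approximated by comparing each source’s last event time with the interval at which it is expected to report. The source names, timestamps and intervals below are assumptions for the sketch.

```python
from datetime import datetime, timedelta


def silent_sources(last_event_at, expected_interval, now):
    """Return sources whose last event is older than their expected interval."""
    return sorted(
        name
        for name, seen in last_event_at.items()
        if now - seen > expected_interval[name]
    )


# Hypothetical snapshot of when each source was last heard from.
now = datetime(2024, 1, 10, 12, 0)
last_event_at = {
    "firewall": now - timedelta(minutes=2),
    "domain-controller": now - timedelta(hours=3),
    "antivirus-server": now - timedelta(minutes=50),
}
# How often each source is expected to send at least one event (assumed).
expected_interval = {
    "firewall": timedelta(minutes=5),
    "domain-controller": timedelta(minutes=15),
    "antivirus-server": timedelta(hours=1),
}

silent = silent_sources(last_event_at, expected_interval, now)
regularity = 100.0 * (len(last_event_at) - len(silent)) / len(last_event_at)
```

Here `silent` lists the sources that have gone quiet, and `regularity` is the percentage of sources still reporting as expected, which can feed a higher-level coverage indicator.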

And also how scary it is to live inside this customer’s infrastructure, that is:

• how often the customer is attacked, how severe these attacks are (targeted or mass), and how skilled the attackers are;
• how effective the customer’s protection (processes and information security systems) is and how often it is updated;
• how critical the assets involved in incidents are, which assets attackers use most often, etc.

To calculate all such high-level indicators, you must first break them down into smaller ones, and those into even smaller ones, until we get to ~~zen~~ the level of small metrics that can be unambiguously calculated from data in the sources and our internal systems.

The simplest example: there is a high-level indicator “Effectiveness of information security processes”, consisting of smaller ones, such as “Degree of protection against malware”, “Degree of vulnerability management”, “Degree of protection from IS incidents”, “Efficiency of access control”, etc. There will be as many second-level metrics as there are information security processes implemented in the organization. But to calculate a second-level metric, you need to collect even finer metrics, for example, “The degree of coverage of the organization’s hosts by antivirus”, “The percentage of critical incidents with malware”, “The number of assets involved”, “The percentage of false positives”, “The level of cyber literacy of users”, “Percentage of hosts in an organization with disabled anti-virus protection”, “Percentage of hosts with outdated anti-virus databases”; you can go on and on. These third-level metrics can be collected automatically from information security tools and other systems, and the calculation can be done in the information security analytics system.
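As an illustration of the third level, here is how three of the antivirus metrics named above might be computed from a host inventory. The inventory format, the field names and the 7-day threshold for “outdated” databases are assumptions made for this sketch.

```python
# Hypothetical host inventory, e.g. as exported from an antivirus console.
hosts = [
    {"name": "ws-001", "av_installed": True, "av_enabled": True, "db_age_days": 1},
    {"name": "ws-002", "av_installed": True, "av_enabled": False, "db_age_days": 2},
    {"name": "ws-003", "av_installed": True, "av_enabled": True, "db_age_days": 30},
    {"name": "srv-01", "av_installed": False, "av_enabled": False, "db_age_days": None},
]
MAX_DB_AGE_DAYS = 7  # assumed threshold for "outdated" anti-virus databases


def percent(part, whole):
    """Percentage of part in whole; 0.0 when there is nothing to count."""
    return 100.0 * part / whole if whole else 0.0


with_av = [h for h in hosts if h["av_installed"]]

# All three metrics use the organization's full host count as the denominator.
coverage = percent(len(with_av), len(hosts))
disabled = percent(sum(1 for h in with_av if not h["av_enabled"]), len(hosts))
outdated = percent(
    sum(1 for h in with_av if h["db_age_days"] > MAX_DB_AGE_DAYS), len(hosts)
)
```

Values like these are the leaves from which a second-level metric such as “Degree of protection against malware” can then be assembled with weights.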

Creating KPIs and managing SOC performance is still a challenge both for the developers of these metrics and for the customer (and this is strictly a dance for two). But the game is worth the candle: as a result, you can assess the state of information security fully, centrally and quickly, find weaknesses, respond to incidents promptly and keep the information security system up to date.

If the topic turns out to be of interest, I’ll talk more about metrics in future articles. So if you want to hear about any specific aspects of measuring a SOC, write in the comments and I will try to answer all your questions.

Elena Trescheva, Lead Analyst at Solar JSOC