We deliberately did not build a multi-level linear escalation model of incident investigation along the lines of 1 -> 2 -> 3 and so on, probably simply because we could not figure out how to fit such a paradigm into an extremely short SLA for incident resolution (0.5–2 hours). A 30-minute window for analyzing an incident is far from uncommon, and such cases make up a significant share of the total flow.
That is, by and large, there was only one incident handling line: the first, which was also the last. It was expected to fully investigate the entire pool of typical incidents within the framework of a standard notification template. And this scheme worked quite well, until the number of customers increased. We then had to scale resources and, in parallel, meet the mutually exclusive requirements of different customers. Some said that they needed deeper investigations and that timing was secondary, while others wanted the fastest possible notification of incidents, even with incomplete information at the initial stage.
Young engineers fresh from their internships could not deliver the same quality and speed of investigation as their more experienced colleagues. In addition, for some customers we had accumulated a pool of incidents with higher requirements for the depth of investigation and for the delta neighborhood under consideration, and only some engineers could handle these.
This served as the starting point for forming an “In-Depth Investigation” team that worked with the common pool of incidents but focused on investigating complex cases. The selection of such cases, however, was highly non-deterministic. For these reasons, the team could not be called a full-fledged second line.
It is quite possible that we would have lived happily ever after in this paradigm had it not been for the arrival of a second SIEM. This huge and time-consuming project led us to create the second line: analyzing incidents from the new SIEM platform required balancing the load on the first line, which could not quickly learn to work in two different solutions. And of course, the need for “in-depth investigation” by the engineers of the new, now genuine, second line did not go away. But this function was still non-deterministic and required manual processing by the first line to make a decision on each ticket. This created exactly the overhead and increase in timings that we had tried to get away from.
An additional complication was the increased load on the first line, because incidents on the second SIEM were more complex to investigate. As a result, the engineers of the second line managed to resolve fewer incidents than in the days when there was only one line and a single set of tools.
Meanwhile, clients increasingly demanded early notification – so that information about a suspected incident would be sent to them immediately upon detection, and extended data obtained during the investigation and time-consuming downloads / statistics would be provided later, within the SLA.
All this, together with the realization that a very large proportion of incidents already required purely mechanical work, and the ideas that the line engineers already had, prompted the next step: the implementation of automatic alerts.
The following are included in the pool of incidents worthy of automatic alerts:
1) Incidents that do not require highly intelligent processing, that is, those whose investigation fits into the linear logic of SIEM (in this case, all the work of the engineer was, by and large, copy-paste of fields from SIEM into a standard notification). It is also important that incidents generate a low percentage of false positives.
2) Incidents with two phases of investigation – linear investigation, as in paragraph 1, and the construction of time-consuming manual reports.
3) Critical incidents, for which the customer would rather receive a high percentage of false positives than filtered information with a delay.
For incidents of the 1st type, the DirectAlert mechanism was implemented: all the information needed for response is aggregated and enriched automatically, and a notification is automatically generated and sent to the customer.
For incidents of types 2 and 3, the Direct & Mon mechanism was implemented: if an incident is suspected, an alert with the main metadata of the incident is automatically generated for the customer. After that, a ticket is created on the analytics line for further investigation and for preparing analytical reports and downloads.
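The split between the two mechanisms can be illustrated with a small sketch. This is not the actual Solar JSOC implementation: the `IncidentProfile` fields, the `Mechanism` enum, the action names, and the order of the checks are all hypothetical and only restate the three incident types listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Mechanism(Enum):
    MANUAL = "manual"                # ordinary investigation by an engineer
    DIRECT_ALERT = "DirectAlert"     # type 1
    DIRECT_AND_MON = "Direct & Mon"  # types 2 and 3


@dataclass
class IncidentProfile:
    """Hypothetical per-scenario flags mirroring the three incident types."""
    linear_logic: bool          # investigation fits the linear logic of the SIEM
    low_false_positives: bool   # the scenario rarely fires falsely
    needs_manual_reports: bool  # has a second phase with time-consuming reports
    critical: bool              # customer prefers speed over filtering


def choose_mechanism(p: IncidentProfile) -> Mechanism:
    # Type 3: critical incidents are alerted immediately, false positives accepted.
    if p.critical:
        return Mechanism.DIRECT_AND_MON
    # Type 2: linear first phase plus manual reports afterwards.
    if p.linear_logic and p.needs_manual_reports:
        return Mechanism.DIRECT_AND_MON
    # Type 1: purely linear investigation with few false positives.
    if p.linear_logic and p.low_false_positives:
        return Mechanism.DIRECT_ALERT
    return Mechanism.MANUAL


def actions_for(p: IncidentProfile) -> list[str]:
    """Return the (hypothetical) steps taken for an incident of this profile."""
    mechanism = choose_mechanism(p)
    if mechanism is Mechanism.DIRECT_ALERT:
        return ["send_enriched_notification"]
    if mechanism is Mechanism.DIRECT_AND_MON:
        return ["send_early_alert", "open_analyst_ticket"]
    return ["open_first_line_ticket"]
```

Whether a type 2 incident must also have a low false positive rate is not stated in the text, so the sketch does not require it.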
What this gave us:
• Satisfied customers who had advocated early warning, thanks to DirectAlerts
• With DirectAlerts, we also saved part of the resources of the first line and achieved “morale +30” (reducing the amount of routine is important, you must agree)
• Increased the depth of investigation for some incidents by routing them to the second processing line
• Built the second SIEM into the work of the analytics lines
• Made the growth path of an information security incident analyst more comfortable and understandable. After all, climbing the career ladder in small, understandable steps (1 -> 2 -> 3 and so on) is more comfortable and more accessible for most people than the leap into infinity that originally separated the analytics lines.
Naturally, we subsequently changed the principle of ticket routing to the second line, more on that below.
Looking for historical context: some more automation
Recalling the rakes we have stepped on, I would like to talk separately about the problem of preserving and restoring the historical context when investigating incidents. You must admit that an isolated incident without any background is one thing, and a case involving a host or account that recently appeared in other incidents is quite another. In the first case it is most likely just a violation of the company’s information security policies, often unintentional, whereas the second may, with a fairly high probability, indicate a deliberate attack on information systems. Here it is useful to highlight the entire chain of incidents both for the customer and for the TIER-4 service manager and analyst assigned to that customer. But how can this be done if the first incident in the chain was investigated by one engineer, the second by another (perhaps even from a different regional branch), and the last by a third engineer on the night shift after the shift change?
Even when the Solar JSOC structure was not so wide and incident investigation was concentrated in the Nizhny Novgorod division, we found that a certain general context of the line was lost during shift handover. Because of this, new incidents were investigated without taking previous ones into account: whether this account had been encountered recently, whether this host had shown up as a source of scanning, and so on.
Our customers’ penetration tests exposed this problem in all its glory. It was very disappointing when, having perfectly investigated several incidents during the day and several at night, we were unable to link them together. And all because we had no way to show the context of the related incidents to the night engineer, who was investigating what later turned out to be the continuation of the pentest.
The problem of losing context at shift changes appeared immediately after the launch of 24×7 operation, but for the time being it was not very acute. With the increase in the number of line engineers, and even more so with the emergence of new regional divisions, it demanded a prompt solution. The context began to be lost not only between shifts: as the team grows, the likelihood that every incident in a chain of non-isolated incidents falls into the hands of the same engineer, and that this engineer remembers the background during the investigation, tends to zero.
Out of habit, skipping the first of the “damn Russian questions”, we focused on the second – “what to do?”
When developing SIEM content, we try not to create overly long chains of rules, since our experience shows that they are inefficient at detecting incidents. Implementing complex scenarios that take a deep retrospective into account was considered unpromising, and it would not be possible to create scenarios for every possible combination anyway.
Nevertheless, this did not dampen our desire to look at incidents in retrospect and glue them into a kind of killchain. The gluing logic was implemented in our service desk and incident handling in Kayako.
The mechanism analyzes the retrospective of the customer’s incidents triggered in different time windows and, if it finds incidents whose key fields match those of the incident in question, displays their basic information and links to them in the ticket. The mechanism is recursive: the search is repeated on the key fields of the detected incidents, thus forming a tree of probable threat development chains. It is hard to grasp in words, but it looks simple and understandable; we will show this with examples.
Incidents with a historical context are called non-isolated incidents in our vocabulary. Incidents with no background are isolated.
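The recursive search over key fields can be sketched roughly as follows. This is a simplified illustration, not the logic actually running in the service desk: the `Incident` structure, the function names, and the use of a single look-back window (the text mentions several time windows) are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; `keys` holds key field values (hosts, accounts)."""
    id: str
    time: datetime
    keys: frozenset


def related(incident, history, window):
    """Earlier incidents within `window` that share at least one key field value."""
    return [
        h for h in history
        if h.id != incident.id
        and timedelta(0) <= incident.time - h.time <= window
        and incident.keys & h.keys
    ]


def is_isolated(incident, history, window):
    """An incident with no background in the look-back window is isolated."""
    return not related(incident, history, window)


def build_tree(incident, history, window, seen=None):
    """Recursively follow key fields of found incidents to build a chain tree.

    `seen` is shared across the whole recursion, so each incident appears
    in the tree at most once and the search always terminates.
    """
    if seen is None:
        seen = set()
    seen.add(incident.id)
    children = []
    for h in related(incident, history, window):
        if h.id in seen:
            continue
        children.append(build_tree(h, history, window, seen))
    return {"id": incident.id, "related": children}
```

In this sketch an incident that shares only an account with the current one can pull in an older incident that shares only a host with the intermediate one, which is the point of making the search recursive.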
The Killchain search engine has also become one of the basic criteria for balancing tickets between the first and second lines. An isolated incident whose key fields contain no critical hosts, networks, or accounts (in fact, there are many more criteria, but in this example we will consider only the main ones) will most likely not get the scoring needed to go to the second line, and will be processed on the first line without wasting resources on a detailed study of the incident’s delta neighborhood.
In the case of a non-isolated incident, the likelihood of it reaching the second line increases significantly. And if triggers within a short time window (less than a day) are found in the background of the current incident, then this incident is a 100% candidate for consideration on the second line, with a mandatory study of the background.
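A toy version of this routing rule might look like the sketch below. The weights, the threshold, and the `TicketContext` fields are invented for illustration; the real scoring uses many more criteria, as noted above. Only the two main signals from the text are modeled: critical assets in the key fields and the presence (and age) of related incidents.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# All numbers below are hypothetical and chosen only for the example.
SECOND_LINE_THRESHOLD = 50
CRITICAL_ASSET_SCORE = 50
NON_ISOLATED_SCORE = 30
SHORT_WINDOW = timedelta(days=1)


@dataclass
class TicketContext:
    touches_critical_asset: bool              # critical host, network, or account in key fields
    nearest_related_age: Optional[timedelta]  # age of the closest related incident, None if isolated


def route(ctx: TicketContext) -> str:
    """Return "first_line" or "second_line" for a ticket."""
    non_isolated = ctx.nearest_related_age is not None
    # Background found within a short window: unconditional second line.
    if non_isolated and ctx.nearest_related_age < SHORT_WINDOW:
        return "second_line"
    score = 0
    if ctx.touches_critical_asset:
        score += CRITICAL_ASSET_SCORE
    if non_isolated:
        score += NON_ISOLATED_SCORE
    return "second_line" if score >= SECOND_LINE_THRESHOLD else "first_line"
```

With these example weights, an older related incident alone raises the score but does not by itself send the ticket to the second line, while a related incident less than a day old always does.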
And of course, there is a filtering mechanism here, since not all links are equally useful. Developing and maintaining filters for each customer does require certain resources on the part of analysts, but the benefits that the toolkit for gluing chains of incidents brings are still many times higher than these costs.
Such is the happy “perestroika”.
Alexey Krivonogov, Deputy Director of Solar JSOC for Regional Network Development