SOC growth problems: how to take 100500 customer wishes into account and not go crazy

Once upon a time, the trees were big and Solar JSOC was small. All of its employees could be counted on the fingers of two hands and feet, so there was no question of uniform rules of the game. With a small number of customers we managed perfectly well to accommodate their various wishes for incident detection and alerting (in this company contractors are allowed to connect to the infrastructure remotely; in that one the administrator Vasya works at night and that is legitimate; here they expect instant messages about any suspected incident, there they are willing to wait as long as they get the maximum of analytics, and so on). But over time the number of customers grew noticeably, everyone's expectations of the SOC were different (and the sales team's promises about customizing the service were generous)… And all this alongside a growing number of our own specialists, each with their own understanding of what "good" looks like. To rein in the resulting entropy we could, of course, have kept churning out instructions for every occasion and sending all first-line engineers and analysts to super-memory courses, but that way lies madness. So we came up with a more effective way of working.

The pipeline and operating instructions

Let's start with the first line. The problem is that literally every customer has their own requirements for the notification process (which is perfectly normal). But when there are no longer 5 or 10 of these customers but 100, it is simply impossible to hand each engineer a stack of 100 instructions.

We might even have broken them down, if not for one "but", namely our conviction that instructions do not work. Well, of course they work, if they are clear and convenient. And you can easily write instructions like that by simply documenting current activity as is. This is how pedestrian paths are laid out in many countries: paving stones or asphalt go where people have already worn their own trails.

It is harder when you need instructions to convince people to stop walking where it is convenient for them. Quite possibly this works for pedantic Germans or diligent Japanese, but here… In all this time, colleagues from only one organization have told us that their instructions are followed consistently, regardless of how convenient or useful they are for the person carrying them out. To the question "What is the secret?" the answer was simple: "We are lucky – we have a disbat (a disciplinary battalion)."

So what is left for commercial companies that want their line personnel to follow instructions while staying within the law (torture, after all, is still prohibited)? For us the answer is this: either make it technically impossible to violate the instruction (continuing the path analogy – dig a deep ditch and put an alligator in it), or make the alternative path more attractive (keep it clean, add flower beds and benches, serve beer). Total monitoring of violations is also an option, but it only works when 100% of violations are followed by a penalty; otherwise people will simply take their chances.

If none of these options work and the instruction remains just an instruction… then you can equip it with an "auto-wake-up" function that fires at the right moment. It is not the engineer who should remember that somewhere there was an instruction for this case; the case itself should wave a flag in front of them: "I am special, the instructions for handling me are attached."

We run every innovation and change through these filters. As a result, a handful of basic tools took shape that solve 99% of all tasks – surprisingly, there are not that many of them:

1. An automatically applied notification profile. When a ticket with a suspected incident is opened, the notification recipients on the customer's side are filled in automatically, taking into account not only the key metadata of the incident but also its type and severity, the time of day and the day of the week. The same mechanism can redefine the incident's severity, SLA and processing priority. The engineer simply investigates the incident, fills in the notification template and clicks send; there is no need to think about which recipients it should go to. (A minimal sketch of such a profile selector is shown after this list.)

2. A scale with the "weight" of the suspected incident. In the ticketing system, a "weight" – the incident's real significance – is computed from many parameters. The thing is that under the contract, the criticality of an incident is determined only by its type. So a suspected compromise of an administrator account and of an ordinary user account have the same criticality under the contract and the SLA, although in reality this is obviously not the case. The same goes for an incident in a critical network segment versus a test zone. To avoid misunderstandings about incident criticality and unfounded claims against engineers, we introduced a simple rule: incidents are taken from the queue starting with the "heaviest". (A sketch of this kind of weighting also follows the list.)

3. Automatic calculation of the real time spent investigating an incident. This is mega-useful statistics that we use both in various automation mechanisms and for reconciling the resources specified in the contract with the resources actually spent.

4. Various kinds of "foolproofing". Not the brightest page in the life of a SOC with a large number of clients – not the brightest because these safeguards did not appear preventively but through our own sad experience, when, in the heat of a heavy load, engineers made mistakes that violated confidentiality. For example, they filled out a profile with information exported from another customer's SIEM (in our reality the number of SIEM consoles has long been counted in dozens, and every second customer has a user with an "ivanov" account and addressing in 172.20 …); or they built reports on the multi-tenant SIEM core without a customer attribute, or with a template that matched several customers at once.

At large volumes the human factor manifests itself in the most sophisticated forms; we simply try to prevent a repeat of what we have already run into. So now notifications are checked by parsers that look for all kinds of deviations: mentions in one customer's ticket of words characteristic of another customer; discrepancies between the notification's recipients and its content; the absence of the customer-name field in exports, or several customers mentioned in that field; and various other checks. (A sketch of one such check is shown after the list as well.)

5. All other instructions – how to investigate this type of incident, known signs of false positives, peculiarities of handling this incident for a particular customer, or, say, a reminder that at night this incident must be accompanied by a phone call – are automatically attached directly to the ticket in which the incident is investigated (and the reminder to call at night pops up only at night). So if a particular ticket deviates in any way from the general processing template, the information about it is right there in the ticket.
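To make item 1 more concrete, here is a minimal, purely illustrative sketch of how a notification profile could be picked automatically from the incident type, severity and time of day. The class, rule and address names are hypothetical and do not reflect the actual Solar JSOC ticketing system.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class NotificationRule:
    incident_types: set       # e.g. {"malware", "account_compromise"}
    min_severity: int         # rule applies at this severity or above
    hours: range              # local hours when the rule is active
    weekdays: set             # 0 = Monday ... 6 = Sunday
    recipients: list
    sla_minutes: int

def pick_profile(rules, incident_type, severity, now: datetime):
    """Return the recipients and SLA of the first matching rule."""
    for rule in rules:
        if (incident_type in rule.incident_types
                and severity >= rule.min_severity
                and now.hour in rule.hours
                and now.weekday() in rule.weekdays):
            return rule.recipients, rule.sla_minutes
    # Fallback profile if nothing more specific matches.
    return ["soc-duty@customer.example"], 240

# Example: at night a suspected admin compromise pages the CISO directly.
rules = [
    NotificationRule({"account_compromise"}, 3, range(0, 8), set(range(7)),
                     ["ciso@customer.example"], 30),
    NotificationRule({"malware", "account_compromise"}, 1, range(8, 24),
                     set(range(7)), ["soc@customer.example"], 120),
]
recipients, sla = pick_profile(rules, "account_compromise", 4,
                               datetime(2021, 3, 1, 2, 30))
```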
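And a similarly hypothetical sketch for item 2: the incident "weight" starts from the contractual criticality and is then adjusted by context such as the account type and the network segment. The factors and coefficients below are invented for illustration.

```python
CRITICALITY_BASE = {"low": 10, "medium": 30, "high": 60}

def incident_weight(contract_criticality, is_admin_account,
                    in_critical_segment, in_test_zone):
    """Combine the contractual criticality with contextual factors."""
    weight = CRITICALITY_BASE[contract_criticality]
    if is_admin_account:
        weight *= 2.0      # admin compromise outweighs a regular user
    if in_critical_segment:
        weight *= 1.5
    if in_test_zone:
        weight *= 0.3      # the test zone is downgraded
    return weight

# The first line then simply works the queue heaviest-first:
# queue.sort(key=lambda ticket: ticket.weight, reverse=True)
```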
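Finally, one of the "foolproofing" checks from item 4 could look roughly like this: before a notification leaves the system, its text is scanned for markers of other customers. The marker dictionary and patterns are, of course, made up.

```python
import re

# Per-customer markers: hostnames, domains, address ranges, account names.
CUSTOMER_MARKERS = {
    "customer_a": [r"\bcorp-a\.local\b", r"\b10\.10\.\d+\.\d+\b"],
    "customer_b": [r"\bcorp-b\.example\b", r"\b172\.20\.\d+\.\d+\b"],
}

def foreign_customer_hits(ticket_customer, notification_text):
    """Return markers of *other* customers found in the notification text."""
    hits = []
    for customer, patterns in CUSTOMER_MARKERS.items():
        if customer == ticket_customer:
            continue
        for pattern in patterns:
            if re.search(pattern, notification_text, re.IGNORECASE):
                hits.append((customer, pattern))
    return hits

# A non-empty result holds the notification and escalates it to a human
# instead of letting it go out automatically.
```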

And when you, as the process owner, have done everything to ensure its execution – all the necessary automation has worked, the "foolproofing" has triggered, the instructions have waved red flags in the ticket – and the engineer has still slept through it or deliberately ignored it, then you may replace the concept of "human factor" with the concept of "sabotage" and forget about the inadmissibility of torture: any sane judge will understand, recognize that you acted in the heat of passion, and acquit you.

Content development

Once upon a time there were only two warriors in this field, but very strong ones. It was obvious, though, that their resources were scarce and expansion was inevitable. Gradually new employees appeared whose job was to implement new threat-detection scenarios, and we ran into the "new broom" problem: each of the newcomers had their own vision of how to write correlation rules, how to work with exceptions, and so on. As a result, we ended up with a mixed bag of content with varying degrees of maturity, code quality and stability.

All of this surfaced a little later, as the number of customers grew, their infrastructures scaled up and, as a consequence, the load on the SIEM increased: we started having performance problems both with the content and with the SIEM itself. A need arose to formulate uniform rules of the game and to describe our approach to scenario development: best practices – what may and may not be done, how we implement typical detections, how to write profiling rules that work on a large stream of events, and so on.

In the ideal picture of the world we were striving for, the resulting scenario does not depend in any way on whose hands assembled it. It was not easy, but we managed: the content development methodology for both SIEM platforms is described in Confluence, and every new employee whose duties include content development studies it without fail. The direction in which this methodology evolves is set by the lead analysts working with each of the platforms.

As the number of customers and SIEM installations grew, the need for centralized content management arose. Roughly speaking: when an existing scenario is adjusted, we need to apply the patch quickly and easily to all SIEMs. Of course, along the way we ran into plenty of technical difficulties with content synchronization, but that is a topic for a separate article.

Besides the technology, another important problem emerged, related to the customization of the service. At the customers' request, analysts made changes to the scenarios (for example, excluding certain legitimate activity from them) – in such a situation there could be no question of centralized auto-synchronization, since the very first run would wipe out all these customizations.

This problem, too, was solved by formalizing the development methodology described above. Exceptions were moved into dedicated active lists and tabular lists whose contents are not synchronized between installations; pre-persistence rules and enrichment rules were introduced to account for the nuances of customer infrastructures, and so on. A master SIEM installation was set up to store the reference content, while all development is done on a dev stand that we are not afraid to let even "green" newcomers with shining eyes and no scars yet loose on. (A sketch of the synchronization logic that spares those exception lists is shown below.)
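As a rough illustration of that split (assumed object names, not the real SIEM API): reference correlation rules are overwritten from the master installation on every sync cycle, while the customer-specific exception lists are deliberately left alone.

```python
def sync_installation(reference_rules, local_rules, local_exception_lists):
    """Push rules from the master; never touch customer exception lists."""
    for name, body in reference_rules.items():
        local_rules[name] = body   # rules always follow the reference content
    # local_exception_lists is intentionally not modified: customer-specific
    # exclusions live there and must survive every synchronization cycle.
    return local_rules, local_exception_lists

# Example: the patched rule arrives, the local whitelist stays intact.
master = {"suspicious_rdp_login": "rule v2 (patched)"}
local_rules = {"suspicious_rdp_login": "rule v1"}
local_lists = {"rdp_whitelist": ["admin-vasya", "backup-svc"]}
sync_installation(master, local_rules, local_lists)
```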

To avoid duplication, overlap or outright pointlessness among the developed scenarios, a so-called pre-approval procedure was introduced. If an analyst wants to develop a new rule for detecting a particular threat, they first write a letter to a group of more experienced colleagues describing the goal of the scenario, the logic of its operation and the proposed implementation method. After a discussion of varying length and intensity, the initiator receives approval for development, with or without adjustments. This process, again, keeps the approach to development consistent, right down to the scenario naming conventions. Yes, this matters for a mature SOC too!

As the SOC continues to grow, more and more "islands of uncertainty" will inevitably appear, but now we know for sure: the apparent chaos can be put in order.

Alexey Krivonogov, Deputy Director of Solar JSOC for Regional Network Development
