Democratizing DevOps
At our recent GigaConf conference, we spent a lot of time discussing where DevOps is heading. That future is unthinkable without tools, so I want to talk about how we are implementing GitAIOps practices at Sber, what mistakes we made and learned from, and what conclusions we drew about adopting AI. Today everyone is talking about how AI will help developers, but few talk about how it can help DevOps engineers. Let's fix that.
My name is Yuri Sporynin, and I have been in IT for over 20 years. I started in development, building a processing system for Internet acquiring with my own hands. In 2016 I moved to Sber, where in 2018 we rolled the platform out in Sberbank Online. Today my responsibilities include the cluster of DevOps tools that we will partly cover here.
Tools in Sber
Today we have over 4,500 teams running over 300,000 builds in Jenkins every day. Their tasks include deploying applications to test, staging, production, and other environments. Many of these tasks take quite a long time: even a simple microservice can take over 12 minutes to deploy to OpenShift or Kubernetes. At our scale, that made it worth thinking about optimization and new practices.
How it all began
We now have about 100 Jenkins masters alone, but when we started implementing DevOps there were only a few of them, and they caused constant problems: by midday they would either crash or start to lag. We quickly got tired of this and began thinking about how to improve reliability.
At the same time, we asked the teams what other difficulties they were facing, so that we could introduce new practices. Many teams already had a deployment pipeline set up, but some didn't know where to get one, or who would support it and how ("go ask wherever you got the pipeline"). Some teams that had already picked a pipeline didn't know how to use it and came to us with questions. For example: why doesn't this curl command, which is supposed to upload a distribution to the Nexus repository, work? (The repository URL below is a placeholder.)

$> curl -o /dev/null \
        -F maven2.asset1=@example.zip \
        -F maven2.generate-pom=true \
        "https://nexus.example.com/service/rest/v1/components?repository=maven-releases"

$> curl: Failed to open example.zip

People didn't understand what the command was actually uploading: for it to work, the distribution had to be built first.
Sometimes the teams simply lacked the knowledge to solve a particular technical problem. For example, they figured out the build, but came back to us asking why the instructions "don't work" again:

$> command not found: -F

The multi-line command hadn't survived being pasted into the shell: the line continuation was lost, so the shell tried to run -F as a separate command. A seemingly simple task, yet even it caused problems.
We tried using LLMs to compose such instructions. The answers look sensible, but applying them head-on doesn't make life any easier.
Let me explain with a simple example. This is what the model generated in response to a Perl question (it's an old joke; don't run this code, because it simply deletes the root file system):

$> rm -rf /*
insufficient privileges
Two seemingly simple lines, but what they do has to be explained to anyone not familiar with Perl and Ansible. In short, we needed a new approach: something that would let teams, including more or less tech-savvy managers, roll out new releases quickly and easily. We decided that this should be GitOps.
GitOps
Let me briefly recap what it is. GitOps is based on a declarative description of the configuration rather than an imperative sequence of commands: we describe what we want to get, not what the system must do step by step. This configuration should live in immutable, versioned storage, for example Git, which is convenient to work with.
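To make the contrast concrete, here is a minimal sketch with a hypothetical service name. Instead of running an imperative command like kubectl scale deployment payment-service --replicas=3, the repository stores the desired state itself:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service                 # hypothetical service
spec:
  replicas: 3                           # desired state, not a command
  selector:
    matchLabels: {app: payment-service}
  template:
    metadata:
      labels: {app: payment-service}
    spec:
      containers:
        - name: app
          image: registry.example.com/payment-service:1.0.0   # placeholder image

The agent then has a single job: make the cluster match whatever the repository says.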
There should also always be an agent that continuously compares the state of the observed system with the state in Git, and a closed loop that applies changes made in the repository to the target system. It is a clear, convenient concept that fits well with the tools we already have.
What do we have in place to enable teams to deploy apps and services every day using this new approach?
There is DevOps Pipeline Manager – a management (orchestration) tool that works with all segments that are isolated from each other.
For Git, we have a so-called relayer that synchronizes container images and code between segments. Plus the gentleman's set: our favorite Jenkins, and a modified Argo CD that deploys to OpenShift and Kubernetes on the stands.
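For readers who haven't worked with Argo CD, its declarative entry point is an Application resource that points the agent at a directory in Git and at a target cluster. This is a minimal sketch with hypothetical names, not our production configuration:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/team/config.git   # assumed config repository
    targetRevision: master
    path: stands/prod/payment-service                  # per-stand directory, see below
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift on the cluster

The automated sync policy is the closed loop described above: whatever lands in the repository gets applied to the cluster, and manual drift is reverted.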
We conducted successful pilots, but encountered problems during implementation. In solving them, we gained the following experience.
Directory, not branch
First, the configuration should be stored not in branches but in directories of a single chosen branch. The thing is that branches are often owned by different teams or even individual developers, who all work at their own pace and make changes independently of each other. When all of that is merged back into one branch, problems often arise.
If you store the configuration in directories, you can build a flexible, transparent structure for any stand topology, there are no merge conflicts to resolve, and the production configuration can be checked in advance.
We usually keep a base folder with charts, for example Helm charts, and a per-stand directory that inherits from the base Application. This lets us work with both stand-dependent and stand-independent parameters, handled via kustomization.yaml, which Argo CD supports natively. Branches, meanwhile, help us check the resulting configuration.
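As an illustration, a per-stand overlay might look like this; the layout and file names are hypothetical, not our actual repository:

# Repository layout (one branch, per-stand directories):
#   base/           - stand-independent manifests and charts
#   stands/test/    - overlay for the test stand
#   stands/prod/    - overlay for the production stand
#
# stands/prod/kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # inherit everything stand-independent
patches:
  - path: replicas.yaml        # stand-dependent overrides
configMapGenerator:
  - name: app-config
    literals:
      - STAND_NAME=prod        # stand-dependent parameter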
When we need to deploy to production, we open a pull request, which is checked for compliance with all our practices: release management, cybersecurity, manifest usage, and so on. At this stage the pull request invariably picks up some comments. After approval, the changes are merged into the master branch, from which they can safely go to production.
Manifest assembly
It turned out that simply writing manifests for Argo CD to deploy was not enough for us. In small teams this approach works well, because making 5-10 manifests by hand is easy. But with dozens of services, hundreds of stands, and several independent installations of the entire system, writing manifests becomes a hard task. We had tools, of course, but we still had to untangle the accumulated Ansible scripts and Groovy logic.
Fortunately, there is a wonderful tool, Helmfile, which takes the same declarative approach. Its template engine has many instructions for inheriting and assembling values from different files to produce the configuration we need. We are still refining this process.
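A minimal sketch of what this looks like; release and file names are hypothetical:

# helmfile.yaml
environments:
  test:
    values:
      - envs/test/values.yaml      # stand-dependent parameters
  prod:
    values:
      - envs/prod/values.yaml
---
releases:
  - name: payment-service
    namespace: payments
    chart: ./charts/base           # shared base chart
    values:
      - values/common.yaml         # stand-independent defaults
      - envs/{{ .Environment.Name }}/payment-service.yaml

Running helmfile -e prod template then renders the final manifests, which can be committed to the per-stand directory for Argo CD to pick up.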
Changes
Not all of our changes can be described declaratively; some are imperative by nature. Secrets, for example. And when the database structure changes, a fork appears:
you can use a Kubernetes operator that, based on a declarative description of the configuration, loads the corresponding secrets from an external store into the namespace;
or use the Atlas operator, which can declaratively modify the database.
But not everyone, a DBA for example, will agree to have the database change online whenever we roll out a new version of the application. If an index hangs during a rebuild, that alone can bring the application to a complete and prolonged stop. So the choice between these options must be thought through very carefully at the level of the organization or a specific team.
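For the first option, a common off-the-shelf implementation is the External Secrets Operator; our in-house setup may differ, but a minimal sketch with hypothetical store and key names looks like this:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # assumed SecretStore pointing at external storage
    kind: SecretStore
  target:
    name: db-credentials         # the Secret created in the namespace
  data:
    - secretKey: password
      remoteRef:
        key: prod/app/db         # hypothetical path in the store
        property: password

The declarative resource lives in Git like everything else, while the secret values themselves stay out of the repository.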
Will AI help us?
AI is developing very quickly and already offers interesting possibilities for GitOps. For example, AI can check parameters: models trained on large numbers of configurations already cope with this quite well.
But this was not enough for us: we wanted to work with the context of a specific team, because we build very different applications with very different sets of parameters. Here the RAG (Retrieval-Augmented Generation) mechanism came to the rescue: you vectorize your knowledge base and pass the relevant fragments to the model along with the prompt. This is very convenient, because additional training is very expensive and resource-hungry, and not everyone can afford it. With RAG, you can query the model within your own context without any fine-tuning. The mechanism already works with our LLMs, GigaChat and GigaCode. Here are some examples.
Checking the relationship between parameters
There may be parameters in the system settings that are related to each other, and an error in one of them can affect the system as a whole. This is exactly the kind of inconsistency the model can find quickly.
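A hypothetical fragment (not one of our real configurations) with the sort of inter-parameter error the model flags: the JVM heap is larger than the container memory limit, so the pod will be OOM-killed under load.

resources:
  limits:
    memory: 2Gi                  # container is capped at 2 GiB
env:
  - name: JAVA_OPTS
    value: "-Xms512m -Xmx4g"     # but the heap alone may grow to 4 GiB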
You can also use RAG to parse complex parameters that are specified in the configuration.
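For example (again a hypothetical fragment), a single string value can pack several settings, each of which needs checking against the team's conventions: does the host match the stand, is SSL enabled for production, are the timeouts sane. With the team's context retrieved via RAG, the model can unpack and validate each part.

env:
  - name: DB_URL
    value: "jdbc:postgresql://db-prod:5432/payments?ssl=true&connectTimeout=10"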
Our practice shows that the human factor remains one of the causes of incidents, including large-scale ones. According to one published study, current machine-learning models achieve an F1 score of up to 0.79 when verifying configurations, and passing vectorized context to the model raises F1 by another 0.04.
There is another interesting approach: mixing examples of successful and unsuccessful configurations into the prompt, so-called few-shot prompting, which also improves the results. And here again the GitOps approach helps us, with its accumulated base of configurations that we can pass to the model along with the prompt.
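A sketch of what such a few-shot payload might contain; the structure and examples are hypothetical, assembled from the kind of known-good and known-bad configurations accumulated in Git:

# few-shot payload passed to the model together with the query
examples:
  - verdict: ok
    config: |
      containerPort: 8080
      service: {targetPort: 8080}
  - verdict: error               # service points at a port the container does not expose
    config: |
      containerPort: 8080
      service: {targetPort: 8081}
task: |
  Check the following configuration for the same class of errors.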
Results
Firstly, we managed to cut deployment time roughly fourfold: from 12 minutes down to 3-4, depending on the application.
The biggest achievement is that we stopped writing code in Groovy. People who don't know Groovy no longer have to dig into it when something breaks; all they touch now are YAML files.
Thirdly, we now have a common configuration database that everyone can work with. In a large company with many stands, collecting information about which versions are deployed where used to take forever: to migrate anything, you had to go to the cluster administrator, ask him to dump all the configurations, and roll them out again, hoping the cluster wouldn't fall apart along the way. Now all of that lives in Git.
The next change is the unification of stands. Administrators like to tweak something by hand in production from time to time, then forget what they changed and where, and when a new version rolls out, everything suddenly breaks. GitOps saves us from this.
The final result is our readiness for AI. We are seriously considering building a configuration-checking plugin based on an LLM and distributing it to the teams. Perhaps the market will be interested in it too.
And one more thing
The CUE language (CueLang) appeared relatively recently. It lets you write fairly compact definitions against which a configuration can be validated. Is it more convenient than an LLM? To build a validator in CUE, you have to understand the language and describe how each parameter should be checked (most likely you will need a developer for that). You feed these definitions to the system, it checks the configuration, and your problem is solved. You will never manage to encode every possible case, though, only some of them.
With machine learning, on the other hand, a maintenance administrator can simply write in plain Russian what configuration he wants, and then an analyst or manager can check whether everything is set up correctly. I wouldn't yet claim that LLMs will replace CUE, but in the future neural networks will probably be able to handle this task.