What you will hear at DevOops 2024
Sergey Bukharov
Dodo Engineering
Sergey will talk about how SLOs took shape at Dodo Engineering: where the team started, where it ended up, how it adapted textbook practices to reality, and what came of it. Don’t expect ready-made recipes: this is a talk about a long run of stepping on rakes, in which the speaker will share, above all, his experience and the mistakes made along the way.
Twenty-five again, or How to prevent the incident from happening again
Kirill Borisov
VK
We will look at the most popular methods of root cause analysis: 5 Whys, the fishbone diagram, and CAST, along with the subtleties of applying each. Kirill will compare the tools and give recommendations on choosing the right one for a specific situation. Using one incident as an example, he will derive the root causes with each of the listed methods and show which of them describes the reasons for the incident most fully.
Incident analysis should be based on the full set of root causes, looking for overlaps across different incidents. Kirill will give practical recommendations on how to approach this process.
Dealing with metastable failure states
Vadim Martynov
Yandex
Rate limiters, product degradation, server-side and client-side throttling, congestion control for databases, geo-distribution: these are the tools Vadim has used to protect against excess load and against sliding into metastable failure states. They are good and useful, but they have their drawbacks.
Vadim proposes looking at another solution, one that protects services and databases, does not require manual configuration, and helps use system resources properly.
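For readers unfamiliar with the first tool in that list, here is a minimal sketch of a token-bucket rate limiter. It is not taken from the talk; the class name `TokenBucket` and its parameters are hypothetical and chosen only for illustration. It also shows the manual tuning the abstract alludes to: `rate` and `capacity` have to be picked by hand.

```python
import time


class TokenBucket:
    """Illustrative token-bucket rate limiter (hypothetical, not from the talk)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second (hand-tuned)
        self.capacity = capacity  # maximum burst size (hand-tuned)
        self.clock = clock        # injectable clock, handy for tests
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service would call `allow()` before handling each request and reject or queue the request when it returns `False`.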
DevOps at the factory: expectations vs reality
Ilya Oleksiv
Sibur Digital
Mikhail Fufaev
Sibur Digital
One of the most important stages in setting up the processes of any IT company is creating and debugging the process of deploying its products. But the speakers work at a digital factory, and in an industrial company the deployment task is complicated by the fact that deployment environments are heterogeneous, independent, and often lack direct network connectivity. Is it even possible to create an efficient and sustainable release process when traditional DevOps practices run into such limitations?
The speakers will describe the difficult path to a simple and understandable deployment process.
How to distribute trillions of files
Konstantin Lebedev
Mayflower
When using GlusterFS as a DFS on volumes of more than 50 million files, Mayflower found that it could no longer maintain the cluster in a reasonable time. The team therefore revisited the choice of a modern distributed storage system, taking new requirements and technologies into account.
At first glance, SeaweedFS looked like a very attractive solution, since it is written in Go and its design follows the Warm BLOB storage approach. But it was not entirely clear how it would behave in production. Konstantin will tell you how it turned out.
Synchronization of production. Speed, reliability and simplicity of the DevOps artery
Vladimir Medin
Sber
Vladimir will tell you how Sber built a simple, reliable, distributed system on a hybrid tech stack that delivers gigabytes of distribution packages, Docker images and deployment scripts from the development segment to the production segment in a matter of minutes. The team managed to do this in such a way that users do not even think about the system's existence, although they previously performed dozens of routine operations and waited up to several days for the results of their work to reach the production environment.
The monitoring is green, but nothing works for users. How to monitor the client side
Daniel Khaliulin
T-Bank
There are legends that the phrase “everything works for me” (c) instantly alleviates the suffering of clients, and sometimes miraculously fixes failures. Be that as it may, monitoring only the server side is no longer enough to ensure the reliability of modern applications. As systems grow more complex, client-side monitoring is increasingly moving from “nice to have” to “must have”.
The talk will examine client-side monitoring. Daniel will explain which data is especially important to track and share the hard lessons T-Bank learned while building observability in its main mobile application, which serves more than 25 million unique customers per month.
Using HAProxy to load balance between locations
Maxim Kupriyanov
A talk on how to use HAProxy, the well-known open-source load balancer, to automatically redistribute load between several sites during sudden traffic surges.
Zero-downtime deployment and databases
Andrey Tsvetsikh
T-Bank, DevBrothers
Microservices have long been firmly established in our lives. They allow you to implement scalable and fault-tolerant solutions. But when deploying a new version to a cluster, errors sometimes occur related to updating the database.
Andrey will look at popular methods of deploying to a cluster. He will show typical problems that arise when updating a database and ways to solve them, and explain how updating NoSQL databases differs from updating traditional relational databases.
Culture
Mentoring as part of a DevOps culture
Tatiana Serdinova
TAGES
The modern economy is a knowledge economy. Increasing collaboration and data sharing among technical departments in a company is one of the key cultural principles of DevOps.
This is where mentoring comes to the rescue, and Tatiana will talk about it in detail. You will learn what mentoring is, who mentors are, and whom they mentor.
Combo fakaps, or the butterfly effect of fakaps
Grigory Koshelev
Kontur
Stories of investigating failures (“fakaps”) caused by chains of unlikely events combined with a scattering of harmless bugs.
Andrey Zarubin
Raiffeisen Bank
The purpose of the talk is to dispel the hype around SRE on the one hand and the conservatism around ITSM on the other. Andrey will talk about the principles of SRE and basic ITIL practices, how, in his opinion, they should be combined using DevOps CALMS, and what the industry is offering us now.
R&D platform. Chapter 1: Getting organized
Maxim Zalysin
Positive Technologies
As in life, before starting a big project you need to put things in order, and sometimes putting things in order is itself the first step towards results. In his talk, Maxim will describe how the DevOps team at Positive Technologies began moving towards an “R&D Platform”, taking requirements, expectations and reality into account.
Updating infrastructure dependencies without pain: secrets of our DevOps kitchen with Renovate
Vlada Zubareva
Mayflower
Like any DevOps team, Mayflower creates and maintains many Ansible roles, Terraform modules, and its own Docker containers. These components are actively used by various teams of the company to configure the infrastructure. However, updating versions and communicating changes between teams in a timely manner can be a major challenge.
Vlada will tell you how Mayflower organized the management of internal roles and modules, and how the Renovate tool helps automate and simplify the update process on a daily basis, ensuring the stability and consistency of the infrastructure.
We tried Platform Engineering. The prank was a success
Alexander Kozhemyakin
VK
A story about how platform development was approached from different angles. What are the pitfalls of a heterogeneous infrastructure? How do you learn, and teach others, to agree on technical solutions? Why build a platform at all?
The answers to these and other questions are in the talk.
How to build a Development Platform from scratch in a single company
Sergey Kiselev
MTS Web Services
Sergey and his colleagues are developing MTS Web Services (the new MTS Cloud) and working on building a unified development culture. The goal is a transparent and understandable architecture that reduces the time it takes to onboard new developers. They want to build a solid ecosystem of libraries and approaches that can be reused across all of the Cloud's development teams.
The story will be told from the Development Platform team's perspective and will cover building everything from scratch. We will talk about design documents (ADR) and how they are used. We will definitely touch on internal open source (innersource) and the cultural aspects of introducing it. In conclusion, we will discuss fighting boilerplate through code generation. All of this comes in the form of stories, since the new Cloud is being written right now.
Removing damage from development team resources
Alexander Krylov
Bimeister
We will discuss approaches to redistributing resources among the teams involved in the development cycle. Why do this at all? To free up the resources of some teams and grow the competencies of others by shifting their focus to their core activities.
Alexander will share the obstacles you may encounter when introducing or changing processes, the arguments that help overcome resistance, and the benefits you can get as a result.
The path from “IT standards” to “technical capabilities”
Evgeniy Kharchenko
Raiffeisen Bank
The story of how DevOps practices were introduced at Raiffeisen Bank, how they evolved from mandatory standards into an engineering culture, and how they eventually turned into “technical capabilities” with maturity levels, multiple criteria and automated checks, across an IT organization of 258 teams and about 3,700 IT specialists.
The talk touches on engineering culture and on motivating engineers and teams to develop in this direction, and it offers a solution to the problem of introducing and measuring technical practices in an enterprise.
Security
Vulnerabilities as data streams
Yulia Volkova
CodeScoring
A talk on how the world of vulnerabilities works from a data point of view. Yulia will talk about NVD, FSTEC, GitHub Advisory, OSV, security bulletins, and how they all live in a single (though not always) life cycle.
Why can't we just magically create one tool for all systems and languages? Why do different tools sometimes produce different results, and what do PURL and CPE have to do with it?
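To make the PURL versus CPE question concrete, here is a small sketch that parses both identifier formats. It is not from the talk: the helper names `parse_purl` and `parse_cpe23` are hypothetical, the parsing is simplified (for example, escaped colons in CPE strings are not handled), and the two identifiers in the usage note are only illustrative.

```python
from urllib.parse import unquote

# Attribute order of a CPE 2.3 formatted string, after the "cpe:2.3:" prefix.
CPE_FIELDS = ("part", "vendor", "product", "version", "update", "edition",
              "language", "sw_edition", "target_sw", "target_hw", "other")


def parse_purl(purl):
    """Simplified parser for pkg:type/namespace/name@version?qualifiers#subpath."""
    scheme, sep, rest = purl.partition(":")
    if scheme != "pkg" or not sep:
        raise ValueError(f"not a purl: {purl!r}")
    rest, _, subpath = rest.partition("#")
    rest, _, query = rest.partition("?")
    type_, _, path = rest.strip("/").partition("/")
    version = None
    if "@" in path:  # a literal "@" in names is percent-encoded, so this is the version
        path, version = path.rsplit("@", 1)
    namespace, _, name = path.rpartition("/")
    qualifiers = dict(p.split("=", 1) for p in query.split("&") if "=" in p)
    return {
        "type": type_.lower(),
        "namespace": unquote(namespace) or None,
        "name": unquote(name),
        "version": unquote(version) if version else None,
        "qualifiers": qualifiers,
        "subpath": subpath or None,
    }


def parse_cpe23(cpe):
    """Simplified parser for CPE 2.3 formatted strings (no escaped-colon support)."""
    parts = cpe.split(":")
    if parts[:2] != ["cpe", "2.3"] or len(parts) != 13:
        raise ValueError(f"not a CPE 2.3 string: {cpe!r}")
    return dict(zip(CPE_FIELDS, parts[2:]))
```

Parsing `pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1` and `cpe:2.3:a:apache:log4j:2.14.1:*:*:*:*:*:*:*` gives the package name `log4j-core` in one case and the product `log4j` in the other. A tool that matches on PURL and a tool that matches on CPE therefore have to bridge different naming schemes, which is one plausible source of diverging results.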
Lev Khakimov
MTS Web Services
Every year brings more networking solutions based on BPF and eBPF: the development of Cilium, Calico's move to eBPF, the emergence of Service Mesh solutions built on this technology. For most engineers, this has meant a transition from the classic network stack to a magical “black box”. In this talk we will lift the veil on the technology and see how popular networking solutions work.
Andrey Moiseev
MTS Web Services
It often happens that you are a one-person team at a company and still have to ensure the security of software development. In this talk, Andrey will walk through a basic pipeline for checking software security, using and customizing GitLab security templates as its foundation. We will look at how to quickly build a minimal DevSecOps pipeline, apply SCA, SAST and secret management practices, and think about what to do with it all next.
Features of certificate management in container environments
Anna Archer
Clearway Integration
To provide a secure communication channel and reliable authentication, certificates are needed. And failure to update at least one certificate in a timely manner can lead to serious failures. In container environments, where certificates can appear in the thousands per day, automation is essential.
In this talk, we will look at high-profile failures caused by certificate problems and at how we learned from the mistakes of others to manage millions of certificates without failures.
We patch flaws in application images before, during and after runtime
Anatoly Karpenko
Luntry
A common situation: all you have received is the image itself (from a vendor, a legacy system or open source). You scanned it and, “surprise, surprise”, it turned out not to comply with security best practices at all: a large number of vulnerabilities, misconfigurations, hard-coded secrets.
You will have to work with this image, yet the project source files and the Dockerfile are not available. This is sad! But we will make sure that the image is safe to use.
We will make changes at the level of the image itself, modifying layers with docker-squash, mint, etc. We will tune the runtime at the operating system and Kubernetes level: AppArmor, capabilities, privilege management and other “knobs”. We will also look at detecting anomalous container behavior at runtime with Falco and NeuVector.
GOSTBUSTERS. How to do static analysis now, after GOST R 71207-2024
Anton Tretyakov
PVS-Studio
In the first half of the 21st century, it turns out that not only ordinary jobs live in pipelines, but also… ghosts. Loaded clusters cannot withstand the onslaught of the supernatural.
References to the famous film aside, this talk is about GOST R 71207-2024. It will have a theoretical and a practical part: we will look at what is written in the document, and then at how that plays out in practice.
The main topics are:
How static analysis is defined in GOST.
Examples of code with errors according to GOST.
How to implement static analysis according to GOST.
An example of implementing static analysis according to GOST.
Gentle migration and adaptation of the project in the cloud
Anton Chernousov
Yandex Cloud
In this talk we will look at several successful migrations to the cloud and discuss the stages of migrating and adapting IT infrastructure in the cloud.
We will cover preparation, audit and the development of a migration plan, and discuss the roadmap. We will also touch on information security and on measures to ensure business continuity during migration.
Back to Basics. Certificates, TLS and mutual authentication of services
Anna Archer
Clearway Integration
Many people use certificates, whether by choice or because security requires it, but not everyone understands how certificates actually work. In this talk, we will look at the basics of how certificates work and at the cryptographic algorithms and protocols that use them. We will discuss how to avoid basic mistakes when setting up mutual authentication (mTLS) between containers.
Is it possible to access services securely?
Georg Gaal
AEnix
Alexey Fedulaev
MTS Web Services
What is Privileged Access Management (PAM), and what does secure access to various services look like? Is it necessary? What solutions are on the market now, and how do they compare? Why should you use one of them rather than an Ansible playbook that configures servers and users?
The talk will show how to do this well without spending your whole life on it or selling your soul to the devil.
DevExp
MS-DOS Shells: Beyond Norton Commander
Dmitry Moiseev
Kontur
For many, MS-DOS is still associated with a black background, the command line and incomprehensible commands, while the revolutionary macOS and Windows are associated with the advent of convenient user interfaces. But in reality, working under MS-DOS very quickly became convenient thanks to shells and file managers, the most famous of which is Norton Commander. The most famous, but not the only one! In this talk we will look at what else was interesting and unexpected among similar products on the market.
Platforms and other toys for adults
Vasily Kutsenko
Pochtatech
Building your own platform is a natural development of the DevOps culture. In his talk, Vasily will tell you how Pochtatech approached the development of its platform (spoiler: in two steps), what problems the platform should solve and how achievable those goals are.
Decomposing GitOps. How to Upgrade Your CIOps to GitOps with Minimal Effort
Oleg Voznesensky
VK Tech
Let's discuss the essence of the GitOps approach, its pitfalls, and make our own GitOps implementation from scratch using available tools.
Back to Basics: OOM Killer. Survival Basics
Alexey Tsykunov
Hilbert Team
In this talk, we will analyze how memory works in Linux and why OOM (Out Of Memory) situations occur. You will learn how the OOM Killer selects processes to terminate, how to avoid its “visit” and how to keep the system stable. We will also discuss how the OOM Killer is used in Kubernetes.
Our Never-Ending Journey of GitOps Transformation with Flux CD
Tung Nan Kwong
TalkHub
The talk covers how the speaker's company switched to GitOps over the years: the challenges faced, the important lessons learned and the plans for the future. Of course, the transition affected production workloads, but in the long run it was worth it.
Andrey Sukhorukov
Kaspersky
In pursuit of automation, we have stopped asking a number of questions that affect the business. This talk is a study designed to answer the question of how much a DevOps “head” really costs.
During the presentation, the “toxic tech director” will present a plausible case of “destroying” a competitor company, with calculations and worked-through scenarios of attacks on targeted engineers.
K8s
Java, Spring Boot and Kubernetes: how to speed up application startup and save cluster resources
Alexey Ignatov
SberTech
Java is a convenient language for developing business applications, and the Spring Boot framework remains popular with many developers. But the nature of Spring Boot and the JVM creates challenges in a Kubernetes environment: you have to choose between slow application startup and increased resource usage. The talk will explain how to speed up the startup of Java applications in Kubernetes and save cluster resources.
Alexander Shinkarev
Tourmaline Core
Without fear, we will launch locally… a microservice product that will be deployed to Kubernetes in production.
The talk will be useful for those who struggle with debugging and running microservices on their own computer. It presents a method that works for small and medium-sized products. We will discuss when it is appropriate, what restrictions and requirements it has, and what hard lessons the speaker's company learned along the way. We will connect the production deployment with the local one.
All the approaches and examples shown will be publicly available in repositories on GitHub. You can simply take them and start new projects on these rails.
4 Ways to Detect Node Failures in Kubernetes: Current Workload Recovery Strategies
Dmitry Rybalka
Cooper (ex-SberMarket)
The failure of a worker node in a Kubernetes cluster is always an unpredictable event, with varying impacts on the workload.
Dmitry will tell you how to make such situations not just less stressful, but also as manageable as possible.
The talk will cover:
How Kubernetes detects node failures, and what you can do to improve this process.
Node-problem-detector (NPD) and the possibilities for customizing it.
Alternatives to NPD: their strengths and weaknesses.
Failure-domain-aware workload placement strategies that minimize impact.
Maxim Chudnovsky
SberTech
Alexander Kozlov
SberTech
We will look at the Governance as Code approach: what solutions already exist, and how you can manage the configurations of a large number of microservices in a multi-cluster environment.
The talk is intended for practicing engineers who are familiar with cloud infrastructure and the concept of a Service Mesh.
Reproducible bare metal environments using Talos Linux and Cozystack
Georg Gaal
AEnix
A fascinating story about how AEnix came to Talos Linux and what the company gained from it.
They are developing Cozystack, an open-source platform for cloud providers that runs virtual machines, Kubernetes on Kubernetes, and managed services. The main platform for them is bare metal. Despite the fact that each server has distinctive features, the company strives to ensure the stability of the platform and each of its components.
Georg will share his experience: he will tell you exactly how it works, the problems encountered during development, and the solutions found.
Cloud technologies
Infrastructure from Code: the next stage of IaC development using the example of Serverless
Victor Kuzenny
Yandex Cloud
In this talk we will look in more detail at what IfC is, what its advantages and disadvantages are, how it differs from IaC and how it complements it. Using one of the frameworks and the Yandex Cloud serverless ecosystem as an example, we will see how IfC helps developers build Serverless applications faster and more efficiently.
Expanding the capabilities of the Cluster API: how to write your own infra provider and not go crazy
Ivan Gulakov
MTS Web Services
Ivan will tell you how he collected the best results while writing his infra-provider for managing hybrid infrastructure.
During the talk we will cover the following topics:
What is an infra provider from the inside?
Business issues and how bare metal turned into a hybrid.
Why dragging too much business logic into a provider is bad, or how to make your own small monolithic operator.
How hyped immutability came back to hit us on the forehead like a stepped-on rake.
Adventures with Envoy: how to build your Service Mesh and not step on a rake
Denis Zolotarev
Yandex Plus Fantech
Denis will tell you how Yandex is building a Service Mesh based on Envoy as the base layer of interservice interaction.
The team has come a long way from a small startup within Plus to the infrastructure level of the entire company. We will briefly cover the theory and standard architecture of a Service Mesh, but most of the time will go to solving practical problems with Envoy and to the non-obvious problems that may lie in wait along the way. The speaker will show code examples, graphs, and fatal errors in production, and explain how to protect your own projects from such errors.
Creation and management of infrastructure for developers. Terraform CDK
Anton Ermak
Independent expert
We will talk about doing Infrastructure as Code with the Terraform CDK. The talk will cover the general idea of this approach, where it applies, and its pros and cons. Using examples, we will build entire architectural infrastructure patterns and discuss how neatly they can be expressed in programming languages: through classes, objects and variables.
Other
How are enabling teams structured, how do they interact with others, and how can you avoid mistakes when forming them? The discussion will cover the first experience of launching enabling teams in well-known companies, the history of how such teams emerged, their composition, skills and roles, their differences from other teams, their activities and interactions, successful and unsuccessful cases, and development plans.
Try yourself as a speaker and talk about everything that worries you right at the conference.
Give a short presentation on any topic in any format. Each participant will have 20 minutes to share their stories. Sign up for a slot right on site!
Please note: only participants of the offline part of the conference can speak. There will be no video recording.
Conclusion
We've dealt with the reports – let's finally deal with the rest:
The conference has a non-standard format. The first day (November 6) is online only, while for November 12-13 each participant chooses: attend in person in St. Petersburg or connect remotely.
Of course, the conference is not limited to talks: there will probably be plenty of offline conversation between participants. But that cannot be described in a Habr post; it is all in your hands.
The remaining information about the conference (such as the schedule) is on the official website; tickets are available there as well.