Kubernetes and more

VK is one of the leading IT corporations in Russia; its services are used by about 95% of the RuNet audience. Our portfolio includes more than 200 projects, built and developed by a large number of teams. To solve product tasks, we use a broad stack of tools and technologies and work actively with Kubernetes. Our approaches and patterns for working with K8s often differ from typical solutions.

My name is Alexey Sharapov. I am the Head of Innovation and Infrastructure Development at VK. In this article, I want to talk about how VK works with technology, using Kubernetes as an example: our technical department, stack, key areas and integrations, and planned development vectors.

The article is based on my talk at VK Kubernetes Conf 2024, which is available as a recording.

Technical department and basic principles of development

There are about 6,000 users of VK's internal platforms and solutions. Most of them are specialists who develop and administer our projects and services (VKontakte, Odnoklassniki, Zen, Mail, RuStore, VK Pay and others). Many teams use broad tool stacks and consume significant capacity in their work.

To provide resources for each team's requests on time and in full, we are developing an internal cloud, including through the work of the technical department. The department consists of several areas:

  • Department of Unified Cloud, which is responsible for managing the platform as a whole and allocating the necessary resources;

  • Department of Network Technologies, whose tasks include ensuring the interaction of all systems and components of the network;

  • Big Data Department, which builds infrastructure for all cycles of work with Big Data;

  • Department of Information Technology, which is responsible for providing security and management services (e.g. authorization).

Together, these teams cover the main layer of Platform Engineering tasks.

At the same time, in order to cover all requests from teams and flexibly adapt to the specifics of each of them, we adhere to several basic principles in our work with Platform Engineering.

  1. Construction and optimization of Cloud-Agnostic architecture. The core components are built to work equally well on both bare-metal and cloud.

  2. Deep optimization and R&D. We research and implement the latest technologies, focusing on stability and performance. We pay special attention to compatibility, seamless migrations, and rational use of the budget.

  3. Developing our own solutions. Our own development teams and advanced programming skills of SRE engineers allow us to develop our own solutions where ready-made ones are lacking.

  4. Building up and applying internal expertise. The teams have expertise across the full technology cycle and apply it at different stages: from refinements at the core level of the system to building complex multi-stage pipelines.

That is, we aim to cover every way of interacting with the infrastructure and to give specialists equal freedom to apply different approaches in their projects.

Key Integrations

Approaches to development are changing. The model that is gaining priority is one in which developers work exclusively on their own projects and use platforms that other specialists prepare and administer for them.

This is consistent with our paradigm: we prefer to focus developers' efforts on achieving core business goals rather than on infrastructure tasks. Therefore, we actively use external integrations, especially in terms of:

  • information security and building DevSecOps;

  • development tools;

  • monitoring systems and the SOC (Security Operations Center);

  • network and data center management systems.

Among other things, for our Kubernetes installations we use external integrations to automate deployment and cluster lifecycle management. This lets us provide infrastructure for development quickly and, as a result, makes teams less dependent on approvals, limits the impact on the value delivery cycle, and shortens time-to-market.

We also use external services for logging, working with secrets, deployment, building pipelines and other processes for our tasks.

Of course, we have projects and scenarios that require specific tools and approaches, but we try to minimize them in order to reduce the “zoo of technologies” and lower the bus-factor risk that comes with highly specialized expertise.

There are two points worth noting here.

  • We strive to integrate all services and manage them from one point.

  • We work with multi-cluster solutions that combine geo-distributed clusters into a single network. This lets us use all data center resources effectively while keeping full control over the available resources and tracking their consumption.

Our K8s stack

When working with Kubernetes, we use a classic set of tools and solutions.

  • Kubernetes 1.29. We are working with the latest version of K8s, actively researching and implementing new features (for example, Static Provisioners).

  • Cilium. We use most of its new capabilities and build multi-cluster setups. We run our network solutions in a single stack, accumulating unique expertise in development and operation.

  • MultiDNS (CoreDNS modified by us). With its help, we build multi-cluster solutions that use different CNIs simultaneously (Calico, Cilium).

  • Gateway API. We are adopting it as the single method of traffic delivery. We build standard configurations for everyone, without being tied to a specific Ingress vendor.

  • Storage (OpenEBS, Maya). We deploy it in different configurations and are developing a hardware layout in the data center for fully synchronous replicas.

  • CoreLogs. Our own logging system, which collects and sanitizes logs from the entire cluster and processes them depending on the application and its type.

Vectors of development

The services of the VK group of companies are actively developing, and their audience is growing. This increases the complexity of development and administration and drives a regular revision of our approaches to building internal systems (including platform ones). Here we are not so much setting technology trends as identifying directions of development and changes in ideology.

We currently identify four main tracks for further development:

  1. Product approach to the platform.

  2. Formation of a common strategy.

  3. Transforming infrastructure into a product.

  4. Full Cycle Platform Engineering.

Product approach to the platform

VK has many units and, accordingly, teams with different competencies, development areas and requests. That is, we cannot always provide standard solutions and configurations, but must process the needs of each business unit independently.

Therefore, the best scenario for us is the transition to a product approach in development, in which all internal customers are effectively treated as external B2B users.

The concept also assumes that each element of the internal development platform becomes a product: capacity, network, metrics collection, etc. That is, internal teams should be treated like external users, right up to assigning a Product Owner to each product.

In addition to control over allocated resources and the ability to distribute them transparently, this also provides the ability to collect metrics from users and evaluate the efficiency and justification of the use of capacities and tools.

This means a complete departure from the model of “we get everything we want on demand and use it as we see fit.”

Formation of a common strategy

The IT services that our business units develop are used by millions of external users and thousands of internal users. This significantly raises the maturity requirements for each released product, including fault tolerance and security. The “let’s launch the service and then see” model does not work for us.

Therefore, one of the key development vectors for us is planning a general development strategy. To do this, we focus on several points.

  • Short-term and long-term planning. It is important for us that development is not just a “road”, but has interpretable goals, the results of which can be assessed, including from the point of view of business value. In turn, teams must be sure that they will have the necessary resources and tools at their disposal.

  • List of features, not requirements. We are moving to a model where the priority is a set of services available to everyone, rather than the “wants” of some individual team. That is, teams should develop projects taking into account what they can use, rather than writing code “in a vacuum” and then demanding resources and solutions to implement a product based on it.

  • Unit strategies are the foundation. Each external project of our business units is a separate product with its own target audience. Moreover, sometimes the target audiences of the products do not intersect or intersect minimally. Therefore, instead of imposing strict rules and requirements for development, we adhere to an approach in which each unit itself determines the strategy for development and movement towards achieving common goals for the entire VK group of companies.

Transforming infrastructure into a product

With the transition to a product approach, the entire infrastructure should also become a product. That is, hardware, network, and connectivity are transformed into a product that business units receive either in a certain (fixed) volume (for example, by CPU and RAM) or by time. This model of working with infrastructure allows for more efficient utilization of resources available in different data centers, increases the accuracy and transparency of planning, and also makes it possible to distribute teams across different data centers as needed. Teams, in turn, gain an understanding of how many resources they are guaranteed to have.

At the same time, the entire available capacity pool can be managed from one point.

Full Cycle Platform Engineering

One of the main trends for us is the transition to Full Cycle Platform Engineering, within which developers will not be distracted by “secondary” tasks, using almost ready-made and pre-configured solutions. This also implies that the platform team will be fully responsible for delivery, assembly, monitoring, metrics, logs, management and billing.

With the transition to such a paradigm, developers will be able to fully concentrate on solving their immediate problems, increase productivity, speed up the rollout of new features and, as a result, bring more benefit to the business.

At the same time, this increases the demands on the platform team and DevOps specialists, who in this concept are responsible for preparing the entire “package”, its administration, ensuring compatibility, availability, integrations, fault tolerance of the development platform itself and other aspects.

Our specifics of working with Kubernetes: tools and entities

In our work with Kubernetes, Kubernetes-Native services and Platform Engineering, we use several key approaches, tools and solutions.

Geoclusters

We are building a geographically distributed network of local clusters linked by global WAN channels. Accordingly, the entire development platform is “stretched” between available sites.

On the one hand, this implies a complex process of integrating all platform resources, ensuring their availability, and linking them into single pipelines and cycles. On the other hand, it brings a range of benefits, including:

  • multiple clusters are combined into a mesh, which increases fault tolerance and speeds up service recovery;

  • the load is distributed more evenly, so the failure of key components does not make the workload in the cluster unavailable;

  • CI/CD makes the deployment process transparent: there is no need to think about which cluster should get the next replica, it is enough to describe the deployment policies once.
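The idea of describing a deployment policy once and letting the platform pick clusters can be illustrated with a minimal sketch. It is not VK's implementation: the `Cluster` type, the `place_replicas` function, the `max_per_cluster` limit and the cluster names are all hypothetical, and the policy shown (spread replicas as evenly as possible) is only one of many options.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str        # hypothetical cluster identifier
    free_slots: int  # how many more replicas the cluster can accept


def place_replicas(replicas: int, clusters: list[Cluster], max_per_cluster: int) -> dict[str, int]:
    """Spread replicas across clusters, preferring the cluster with the fewest replicas."""
    placement = {c.name: 0 for c in clusters}
    free = {c.name: c.free_slots for c in clusters}
    for _ in range(replicas):
        candidates = [
            name for name in placement
            if free[name] > 0 and placement[name] < max_per_cluster
        ]
        if not candidates:
            raise RuntimeError("policy cannot be satisfied with the available capacity")
        # Fewest replicas first; ties go to the cluster with the most free capacity.
        target = min(candidates, key=lambda n: (placement[n], -free[n]))
        placement[target] += 1
        free[target] -= 1
    return placement


# Assumed example topology: three sites, one of them nearly full.
clusters = [Cluster("dc-a", 10), Cluster("dc-b", 2), Cluster("dc-c", 10)]
print(place_replicas(5, clusters, max_per_cluster=2))
```

With a policy like this, the team only states how many replicas it needs and how many may share one site; where each replica lands is decided by the platform.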

At the same time, we keep detailed billing in order to understand which resources each business unit consumes, in what volume and from which site, and what its IT landscape is built on as a whole.
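As a rough illustration of this kind of accounting, the sketch below aggregates usage records per business unit with a per-site breakdown. The record layout, unit and data center names, and the `summarize` function are assumptions made for the example; the article does not describe the real billing data model.

```python
from collections import defaultdict

# Hypothetical usage records: (business unit, data center, CPU core-hours, RAM GiB-hours).
usage = [
    ("unit-a", "dc-1", 120.0, 480.0),
    ("unit-a", "dc-2", 30.0, 96.0),
    ("unit-b", "dc-1", 75.5, 256.0),
]


def summarize(records):
    """Total consumption per business unit, plus CPU core-hours per data center."""
    report = defaultdict(
        lambda: {"cpu_core_hours": 0.0, "ram_gib_hours": 0.0, "cpu_by_dc": defaultdict(float)}
    )
    for unit, dc, cpu, ram in records:
        entry = report[unit]
        entry["cpu_core_hours"] += cpu
        entry["ram_gib_hours"] += ram
        entry["cpu_by_dc"][dc] += cpu
    return report


report = summarize(usage)
for unit, totals in sorted(report.items()):
    print(unit, totals["cpu_core_hours"], totals["ram_gib_hours"], dict(totals["cpu_by_dc"]))
```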

Log collection and delivery system

We have our own system for collecting and delivering logs. With its help, we can work stably and transparently with a large volume of records from different sources. The tool is fully adapted to our workloads and the specifics of our work. It includes a sanitization system, so that only cleaned logs are stored, and an advanced markup system, which significantly speeds up searches over raw data.
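To make the sanitization step more concrete, here is a minimal sketch of rule-based log cleaning. The rules, placeholders and the `sanitize` function are illustrative assumptions, not the rules used by the system described above; real rules would depend on the application and the log type.

```python
import re

# Illustrative masking rules (assumed for this example): each pattern is
# replaced with a neutral placeholder before the line is stored.
RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<email>"),
    (re.compile(r"(?i)\b(token|password|secret)=\S+"), r"\1=<redacted>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
]


def sanitize(line: str) -> str:
    """Apply every masking rule in order and return the cleaned line."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


print(sanitize("user=bob@example.com token=abc123 from 10.0.0.7"))
# prints: user=<email> token=<redacted> from <ip>
```

Keeping the rules in a plain list makes it easy to maintain different rule sets per application, which matches the idea of processing logs depending on their source.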

With this system we have achieved compression optimization and save 70% of disk space (compared to logs without compression).

What is especially important is that our system is faster than free open-source alternatives. According to our internal measurements, switching to it sped up reading and delivery of logs by 3.7 times, which is a substantial gain at our data scale.

Nanny

We have an internal tool called Nanny, which is responsible for automating cluster control across all sites. The service:

  • can troubleshoot hosts based on metrics and the type of load running;

  • uses a distributed network of agents to monitor each host and cluster, providing a single point of control over them.
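The article does not describe Nanny's internal logic, so the following is only a hypothetical sketch of what a decision based on host metrics and workload type might look like. The `HostReport` fields, the thresholds, the workload labels and the set of actions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    DRAIN = "drain"        # move workloads away before repair
    REBOOT = "reboot"
    ESCALATE = "escalate"  # hand the case over to an engineer


@dataclass
class HostReport:
    host: str
    workload: str          # assumed labels: "stateless" or "stateful"
    disk_errors: int
    heartbeats_missed: int


def decide(report: HostReport) -> Action:
    """Pick a remediation step from an agent's report (assumed rules)."""
    stateless = report.workload == "stateless"
    if report.heartbeats_missed >= 3:
        # An unreachable host is rebooted automatically only if nothing stateful runs on it.
        return Action.REBOOT if stateless else Action.ESCALATE
    if report.disk_errors > 0:
        # Failing disks: stateless hosts can be drained without human involvement.
        return Action.DRAIN if stateless else Action.ESCALATE
    return Action.NONE


print(decide(HostReport("node-1", "stateless", disk_errors=2, heartbeats_missed=0)))
```

The point of the sketch is the split between cases that can be handled automatically and cases that still need an engineer, which is what reduces the manual workload.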

This means that Nanny takes on a significant portion of engineers' tasks, letting them focus on more fundamental issues and reducing the overall workload: engineers can work on a regular schedule and are woken up at night much less often.

API

In our case, the API is a single entry point for all data and integrations. A whole set of processes is therefore built through the API, including:

  • end-to-end integration of all systems used;

  • collecting data and building convenient dashboards for users and administrators;

  • complete management of the system life cycle from one point;

  • division of roles – everyone sees and uses the tools they need.

Thus, the API is the “heart” of the User Space, to which new features and abstractions can be added if necessary.

A short afterword

The VK case is definitely atypical for the market as a whole (due to the resources available to us, the scale of services, expertise, and more). But it once again confirms that one of the main trends in IT is the transition to Platform Engineering, along with a paradigm shift in working with tools and technologies, including Kubernetes. This is a justified, evolutionary path that can potentially allow us to solve more complex problems with the available resources and, as a result, gain competitive advantages.
