From a “startup” to thousands of servers in a dozen data centers. How we chased the growth of Linux infrastructure

If your IT infrastructure is growing too fast, sooner or later you will face a choice: scale the support team linearly, or start automating. For a while we lived in the first paradigm, and then the long road to Infrastructure-as-Code began.

Of course, NSPK is not a startup, but that was the atmosphere in the company during its first years, and they were very interesting years. My name is Dmitry Kornyakov, and I have been supporting highly available Linux infrastructure for more than 10 years. I joined the NSPK team in January 2016 and, unfortunately, missed the company’s very beginning, but I arrived at a time of major changes.

Broadly speaking, our team supplies two products to the company. The first is infrastructure. Mail must be delivered, DNS must resolve, and domain controllers must let you onto servers that must not go down. The company’s IT landscape is huge! These are business- and mission-critical systems, and for some of them the availability requirement is 99.999%. The second product is the servers themselves, physical and virtual. Existing ones have to be monitored, and new ones regularly delivered to customers from many departments. In this article I want to focus on how we developed the infrastructure that is responsible for the server life cycle.

The beginning of the journey

At the beginning of the journey, our technology stack looked like this:
OS: CentOS 7
Domain controllers: FreeIPA
Automation: Ansible (+ Tower), Cobbler

All of this lived in 3 domains spread over several data centers. One data center hosted office systems and test environments; the rest hosted production.

At one point, creating a server looked like this:

The VM template contained CentOS minimal plus the bare essentials, such as a correct /etc/resolv.conf; everything else was applied through Ansible.

The CMDB was an Excel spreadsheet.

If the server was physical, then instead of cloning a virtual machine we installed the OS with Cobbler: the MAC addresses of the target server were added to the Cobbler config, the server received an IP address via DHCP, and then the OS was installed.

At first we even tried to do some configuration management in Cobbler itself. Over time, though, this started causing problems with porting configurations both to other data centers and to the Ansible code that prepared VMs.

At that time many of us saw Ansible as a convenient extension of Bash and made liberal use of constructs built on shell and sed. In short, Bashsible. The end result was that if a playbook failed on a server for some reason, it was easier to delete the server, fix the playbook, and run it again. In effect, there was no versioning of scripts and no portability of configurations.
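To make the difference concrete, here is a minimal sketch contrasting the two styles. The file paths, the NTP server name, the template name, and the handler are hypothetical and only illustrate the idea; they are not taken from our playbooks.

```yaml
# Hypothetical example: paths, server name, template and handler are illustrative.
- name: Bashsible versus a declarative task
  hosts: all
  become: true
  tasks:
    # "Bashsible": reports "changed" on every run and silently does
    # nothing if the pattern is not found in the file.
    - name: Patch chrony.conf with sed (anti-pattern)
      shell: sed -i 's/^server .*/server ntp1.example.com iburst/' /etc/chrony.conf

    # Declarative: the whole file is rendered from a template, so the
    # result is the same on every host and repeated runs change nothing.
    - name: Deploy chrony.conf from a template
      template:
        src: chrony.conf.j2
        dest: /etc/chrony.conf
        owner: root
        group: root
        mode: "0644"
      notify: Restart chronyd

  handlers:
    - name: Restart chronyd
      service:
        name: chronyd
        state: restarted
```

The second task describes the desired state of the file rather than the steps to get there, which is what makes it safe to re-run.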

For example, suppose we wanted to change some config on all servers:

Change the configuration on existing servers in one logical segment / data center. Sometimes this takes more than a day: availability requirements and the law of large numbers do not allow all changes to be applied at once. Some changes are also potentially destructive and require restarting something, from individual services to the OS itself.
Fix it in Ansible
Fix it in Cobbler
Repeat N times, once for each logical segment / data center
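The "not all at once" constraint in the first step can be expressed in Ansible itself. The sketch below is an illustration only: the inventory group, the role name, and the batch sizes are hypothetical, and this is not necessarily how our changes were rolled out.

```yaml
# Hypothetical play: group name, role name and batch sizes are illustrative.
- name: Apply a config change in small batches
  hosts: dc1_prod
  become: true
  serial:
    - 1        # a single canary host first
    - "10%"    # then batches of 10% of the hosts in the play
  max_fail_percentage: 0   # abort the rollout if any host in a batch fails
  roles:
    - role: logging
```

With `serial`, each batch finishes the whole play before the next batch starts, so a bad change stops at the canary instead of reaching the whole segment.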

For every change to go smoothly, many factors had to be taken into account, and changes happen constantly:

Refactoring of Ansible code and configuration files
Changes to internal best practices
Changes following the analysis of incidents / outages
Changes to security standards, both internal and external. For example, PCI DSS is regularly updated with new requirements.

Infrastructure growth and the start of the journey

The number of servers / logical domains / data centers grew, and with it the number of configuration errors. At some point we settled on three directions in which our configuration management needed to develop:

Automation. As far as possible, the human factor should be removed from repetitive operations.
Repeatability. Managing infrastructure is much easier when it is predictable. The configuration of servers and the tools used to prepare them should be the same everywhere. This also matters to product teams: after testing, an application must be guaranteed to land in a production environment configured the same way as the test one.
Simplicity and transparency of changes to configuration management.

All that remained was to add a few tools.

We chose GitLab CE as the code repository, not least because of its built-in CI/CD.

For storing secrets we chose HashiCorp Vault, in part for its excellent API.

For testing configurations and Ansible roles we chose Molecule + Testinfra. Tests run much faster if you plug Mitogen into Ansible. At the same time, we began writing our own CMDB and an orchestrator for automated deployment (in the diagram, above Cobbler), but that is a completely different story, which my colleague, the lead developer of these systems, will tell in the future.
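For readers who have not used Molecule, a scenario for a single role is described by a small YAML file. The sketch below is a minimal assumption-laden example: the container driver and the image name are hypothetical choices for illustration, not a description of our test environment.

```yaml
# molecule/default/molecule.yml (hypothetical scenario for one role)
dependency:
  name: galaxy
driver:
  name: docker          # assumption: a container driver; other drivers work too
platforms:
  - name: centos7
    image: centos:7     # illustrative image name
provisioner:
  name: ansible         # applies the role to the test instance
verifier:
  name: testinfra       # runs the Python tests from the scenario's tests/ directory
```

Running `molecule test` in the role directory then creates the instance, applies the role, runs the Testinfra checks, and destroys the instance.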

Our choice:

Molecule + Testinfra
Ansible + Tower + AWX
Server World + DITNET (in-house development)
Cobbler
GitLab + GitLab Runner
HashiCorp Vault

On the subject of Ansible roles: at first there was just one, and after several rounds of refactoring there were 17. I strongly recommend breaking the monolith into idempotent roles that can be run separately; you can also add tags. We divided the roles by functionality: network, logging, packages, hardware, molecule, etc. Overall we followed the strategy below. I do not claim it is the only right way, but it worked for us.
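As an illustration of roles that can be launched separately, a top-level playbook might look like the sketch below. The file name and the exact set of roles are hypothetical; the role names simply follow the functional split mentioned above.

```yaml
# site.yml (hypothetical): one role per function, each with its own tag
- name: Configure a base server
  hosts: all
  become: true
  roles:
    - role: network
      tags: [network]
    - role: logging
      tags: [logging]
    - role: packages
      tags: [packages]
    - role: hardware
      tags: [hardware]
```

With this layout, `ansible-playbook site.yml --tags logging` applies only the logging role, which is convenient when a change touches a single area.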

Copying servers from a “golden image” is evil!
The main drawbacks: you never know exactly what state the images are in right now, or whether every change has reached every image in every virtualization farm.
Keep use of the default configuration files to a minimum, and agree with other departments that you are responsible for the main system files. For example:
1. Leave /etc/sysctl.conf empty; settings should live only in /etc/sysctl.d/. Your defaults go in one file, application-specific settings in another.
2. Use override files to edit systemd units.
Template all configs and deploy them whole; wherever possible, no sed or its equivalents in playbooks.
Refactor the configuration management code:
1. Break tasks into logical entities and rewrite the monolith into roles
2. Use linters! ansible-lint, yamllint, etc.
3. Change the approach! No Bashsible. Describe the desired state of the system
For all Ansible roles, write tests in Molecule and generate reports once a day.
In our case, once the tests were ready (there are more than 100), they turned up about 70,000 errors. Fixing them took several months.
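The two points about sysctl.d and systemd overrides can be sketched as follows. The file names, the sysctl key, the unit, and the override contents are hypothetical examples chosen for illustration. On Ansible 2.10 and later the `sysctl` module is provided by the ansible.posix collection.

```yaml
# Hypothetical play: file names, sysctl key, unit and override are illustrative.
- name: System defaults via drop-in files
  hosts: all
  become: true
  tasks:
    - name: Keep base sysctl defaults in their own file under /etc/sysctl.d
      sysctl:
        name: net.ipv4.tcp_syncookies
        value: "1"
        sysctl_file: /etc/sysctl.d/10-base.conf
        state: present
        reload: true

    - name: Create the drop-in directory for the unit
      file:
        path: /etc/systemd/system/rsyslog.service.d
        state: directory
        mode: "0755"

    - name: Override unit settings without editing the packaged unit file
      copy:
        dest: /etc/systemd/system/rsyslog.service.d/override.conf
        content: |
          [Service]
          Restart=always
        mode: "0644"
      notify: Reload systemd

  handlers:
    - name: Reload systemd
      systemd:
        daemon_reload: true
```

Because each team writes to its own file, application-specific settings can sit in a separate drop-in without either side overwriting the other.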

Our implementation

So, the Ansible roles were ready, templated, and checked by linters. Git was even up and running everywhere. But the question of reliably delivering code to the different segments remained open. We decided to synchronize with scripts. It looks like this:

Once a change arrives, CI starts, a test server is created, and the roles are applied to it and tested with Molecule. If everything is OK, the code goes into the branch. But we do not apply the new code to existing servers automatically. This is a deliberate brake, required by the high availability of our systems. And when the infrastructure becomes huge, the law of large numbers comes into play: even if you are sure a change is harmless, it can lead to sad consequences.
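A pipeline of this shape could be sketched in GitLab CI as below. This is an assumption-heavy illustration, not our actual pipeline: the stage names, the container image, and the package list are hypothetical, and the Molecule job assumes the runner can reach whatever backend the Molecule driver needs.

```yaml
# .gitlab-ci.yml (hypothetical): stage names, image and packages are assumptions
stages:
  - lint
  - test

lint:
  stage: lint
  image: python:3.11
  script:
    - pip install ansible-lint yamllint
    - yamllint .
    - ansible-lint

molecule:
  stage: test
  image: python:3.11
  script:
    - pip install ansible molecule pytest-testinfra
    - molecule test   # create a test instance, apply the role, verify, destroy
```

Note that there is intentionally no deploy stage: in line with the paragraph above, passing tests does not push the change to existing servers.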

There are many ways to create servers as well. We ended up choosing custom Python scripts, and for CI we use Ansible:

- name: create1.yml - Create a VM from a template
  vmware_guest:
    hostname: "{{ datacenter }}.domain.ru"
    username: "{{ username_vc }}"
    password: "{{ password_vc }}"
    validate_certs: no
    cluster: "{{cluster}}"
    datacenter: "{{datacenter}}"
    name: "{{ name }}"
    state: poweredon
    folder: "/{{folder}}"
    template: "{{template}}"
    customization:
      hostname: "{{ name }}"
      domain: domain.ru
      dns_servers:
        - "{{ ipa1_dns }}"
        - "{{ ipa2_dns }}"
    networks:
      - name: "{{ network }}"
        type: static
        ip: "{{ip}}"
        netmask: "{{netmask}}"
        gateway: "{{gateway}}"
        wake_on_lan: True
        start_connected: True
        allow_guest_control: True
    wait_for_ip_address: yes
    disk:
      - size_gb: 1
        type: thin
        datastore: "{{datastore}}"
      - size_gb: 20
        type: thin
        datastore: "{{datastore}}"

This is where we have arrived; the system continues to live and evolve:

17 Ansible roles to configure a server. Each role is designed to solve a separate logical task (logging, auditing, user authorization, monitoring, etc.).
Role testing with Molecule + Testinfra.
In-house development: CMDB + orchestrator.
Server creation time of about 30 minutes, automated and almost independent of the task queue.
The same state and naming of infrastructure in all segments: playbooks, repositories, virtualization elements.
Daily checks of server state, with reports on deviations from the standard.

I hope my story will be useful to those who are at the beginning of the journey. What automation stack are you using?
