How we manage infrastructure and application lifecycle

What changed after the move from a technical point of view?

Not that much: before the move we ran on a motley set of servers; now we have switched to a blade platform with the Proxmox hypervisor running on top of it. Slightly more than 100 virtual machines are deployed on the hypervisor.

In total, we had at our disposal:

  • Server chassis: HP BladeSystem c9000;

  • Typical server configuration: 64 vCPU, 192 or 256 GB RAM;

  • Disk storage:

  • Network: 10 Gbps.

But hardware isn't everything – let's get back to infrastructure matters.

What lives on the VMs

For each developer, a small copy of the product is deployed in a separate namespace, which is why the cluster in the dev infrastructure is the larger one: more than 60 virtual machines and a total of more than 3,000 pods.

Not everything lives in Kubernetes, though: some services run directly on virtual machines. An example is stateful applications that require disk resources. In our case, Kubernetes is better suited to stateless applications, since running stateful workloads requires reliable cluster storage such as Ceph. We have few stateful applications, so maintaining such storage is not justified for us.

All services that run on virtual machines rather than in Kubernetes clusters are set up redundantly. In most cases we use the Keepalived daemon for failover: when one of the machines in the cluster becomes unreachable, the floating IP quickly moves to another node, so a failure costs us less than a second.

CarPrice infrastructure block diagram

The CarPrice infrastructure can be simplified as follows:

As the diagram shows, the CarPrice infrastructure is protected against attacks. Public traffic first goes to the servers of an external provider, which shields us from DDoS and bots and also scans the traffic for malicious requests. Already filtered, it is sent to our VMs running Nginx balancers. Nginx then distributes the traffic among the consumers: Kubernetes clusters, Docker services, and monoliths.

A little about deployment

As I’ve shared in this blog, we are CI/CD advocates. The pipeline and tools used for a deployment depend on which servers that deployment targets.

GitOps is a methodology built around using Git repositories to manage infrastructure and deploy application code. Essentially, it is a continuous synchronization process from Git that ensures our environment stays in the desired state.

Thanks to GitOps, we can automate the entire process of maintaining an application in a Kubernetes cluster: deployment, monitoring, deletion. As a result, the business receives a more stable product and an accurate understanding of its condition.

Argo CD is an open-source operator that provides continuous delivery for Kubernetes. Its job is to watch the Git repository: whatever changes we make to the manifests, it picks them up from Git, applies them to the Kubernetes cluster, and monitors the application's startup. At the end it sends a notification about the deployment status to the release chat.
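For illustration, an Argo CD Application manifest can look roughly like this; the application name, namespace, and repository URL below are placeholders invented for the example, not our actual setup:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app                # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    # Git repository with the Kubernetes manifests that Argo CD watches (placeholder URL)
    repoURL: https://git.example.internal/deploy/example-app.git
    targetRevision: main
    path: manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app
  syncPolicy:
    automated:
      prune: true       # delete resources that were removed from Git
      selfHeal: true    # revert manual changes made outside Git
```

With automated sync enabled, Argo CD keeps reconciling the cluster with what is declared in Git – the GitOps loop described above.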

How we work with secrets

Separately, it is worth describing how we work with secrets, internal and external, such as database credentials.

For a business, keeping secrets is a matter of information security. Yet even in large IT companies security is often neglected: secrets are not only exposed at the infrastructure level – they can sit right in the Git repository.

We tried to avoid these mistakes: our secrets are stored in Vault and are included in application manifests only at the time of deployment.

How the process of injecting a secret from Vault works:

In the manifests we store in Git, instead of an actual key or password we put a special instruction that references the secret's path in Vault, like this: “vault:secret/#”.

In Kubernetes there is a dedicated Bank-Vaults operator which, at deployment time, spots the reference to the Vault path in the manifest text. It follows that path and substitutes the actual value of the secret for the reference.

Apart from the operator, nothing in the deployment process can learn the secret. You also cannot see the secrets by logging into the container through kubectl: the variables holding them are visible only to the process with PID 1 and its descendants.
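A minimal sketch of what such a manifest fragment might look like with a Bank-Vaults mutating webhook; the application name, image, Vault address, and secret path are invented for the example:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                              # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # tells the Bank-Vaults webhook where Vault lives (placeholder address)
        vault.security.banzaicloud.io/vault-addr: "https://vault.internal:8200"
    spec:
      containers:
        - name: app
          image: registry.example.internal/example-app:latest
          env:
            - name: DB_PASSWORD
              # reference to the secret path in Vault; the real value is injected
              # at startup and is visible only to PID 1 and its children
              value: "vault:secret/data/example-app/db#password"
```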

Code review and monitoring

To check the code, we use a static analyzer, linters, and auto-tests:

SonarQube is a static analyzer: it checks that the developer has not made mistakes in the code and also finds vulnerabilities in it.

A linter is about code formatting. You can write code in a single line, or you can break it up so that it is convenient to work with. A linter verifies, without human intervention, that the code is written according to a specific set of rules.

In production we run smoke end-to-end tests, which simulate the actions of a real person, and unit tests, which help check that a developer's change has not broken a whole chain of methods. The plan is to scale auto-testing to more services – to all client and dealer applications.
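Purely as an illustration – the article does not describe the exact pipeline – these checks could be wired into CI stages roughly like this (GitLab-CI-style YAML; job names, commands, and the project key are placeholders):

```yaml
stages:
  - lint
  - test
  - analyze

lint:
  stage: lint
  script:
    - make lint                  # placeholder: run the project's linters

unit-tests:
  stage: test
  script:
    - make test                  # placeholder: run the unit tests

sonarqube-check:
  stage: analyze
  script:
    # sonar-scanner picks up the rest of its settings from sonar-project.properties
    - sonar-scanner -Dsonar.projectKey=example-app   # hypothetical project key
```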

So, we checked the code with a static analyzer and linters, deployed it to a cluster, ran the auto-tests – the application works. But it still needs monitoring: we have to keep an eye on VMs, physical servers, and metrics.

Standard solutions help us here.

To collect metrics, we use Prometheus. It works in federation mode: there are two instances – the first collects all metrics at a short scrape interval and stores them for a limited time, while the second pulls the metrics carrying the longterm label from the first and stores them for at least a year. Thus, by labeling the important metrics, we automatically keep them for a long time.
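A sketch of what the scrape configuration of the long-term instance can look like when it federates only labeled metrics; the target address and the exact label value are assumptions:

```yaml
# prometheus.yml on the long-term instance (simplified)
scrape_configs:
  - job_name: federate
    honor_labels: true            # keep the labels exactly as the first instance exposes them
    metrics_path: /federate
    params:
      'match[]':
        - '{longterm="true"}'     # pull only the metrics carrying the longterm label
    static_configs:
      - targets:
          - prometheus-shortterm:9090   # placeholder address of the first instance
```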

To visualize the metrics and build analytics on top of them, we use Grafana.

To collect logs, we use the classic ELK stack: Elasticsearch for storage and Logstash for log processing. Filebeat agents collect the logs directly from servers and from Kubernetes.
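A minimal Filebeat sketch for shipping container logs to Logstash; the log path and the Logstash address are placeholders:

```yaml
# filebeat.yml (simplified)
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log     # container logs on a Kubernetes node

output.logstash:
  hosts: ["logstash.internal:5044"]   # placeholder Logstash endpoint
```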

But we also use tools that are not as popular as Prometheus and ELK:

  • To monitor infrastructure incidents, we use Alerta. This is an application that collects all incidents, displays them on a single panel and additionally sends notifications to RocketChat or Telegram.

  • Sentry helps us collect incidents per application. Many of our services are integrated with it and send messages there when failures occur.

Sentry is an aggregator: it records every error, counts how many times it recurs, and can drill down into the reason it occurred. We also do release linking in Sentry: we send it information about each release so that future bugs can be tied to that release.
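Release linking can be done, for instance, with the sentry-cli tool from a CI job; this is only a sketch under that assumption – the job name and the version variable are placeholders, not our exact pipeline:

```yaml
sentry-release:
  stage: release
  script:
    # create a release in Sentry, attach the commits it contains, and mark it finished
    - sentry-cli releases new "$RELEASE_VERSION"
    - sentry-cli releases set-commits --auto "$RELEASE_VERSION"
    - sentry-cli releases finalize "$RELEASE_VERSION"
```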

A little about the results and plans

A new data center is a “new chapter” in the life of infrastructure (and sometimes new opportunities).

Here are a few changes we are planning:

  • Continue improving the dev environments and migrating them to Argo CD, so that not only the application containers but also the Kubernetes manifests match production.

  • Roll out our own Keycloak-based identity provider in the company and set up authorization against it for corporate and internal services.

  • Add CI/CD stages that scan code and Docker images for vulnerabilities.

  • Launch a full-fledged staging environment where functional tests of applications run before they are deployed to production.

  • Cover even more applications with autotests.

At the same time, we have always paid special attention to monitoring – and we don't plan to slow down: we want to keep optimizing how we work with incidents and their life cycle.

The only thing that is unlikely to change over time is that like-minded people and new cool specialists are always welcome in our team 🙂
