Backup Rabbit brains in case of nuclear wars


Don't worry about them, we took care of their backup

Someday your country will ban IaC and you will remember my backups…
© Jason Statham

Not so long ago we at the company ran into a small problem: RabbitMQ (hereinafter just the rabbit) fell over on the dev cluster. We revived it, but to restore users, queues and the rest from definitions.json, I had to run to a developer who, by pure luck, happened to dump those files from time to time. That was the first wake-up call.

The second call was DR (Disaster Recovery) – a scenario/exercise for bringing our product up in the cloud in an emergency, in case our physical data center is blown up or destroyed. At that point the need for backups of our rabbit became obvious, and we set about solving the problem.

Briefly, the ways to manage RabbitMQ configuration (creating users, queues, etc.) that are out there:

  • Manually: not our choice, of course (except rarely, and only on a test environment)

  • From the application: Some of our applications have the rights to create and work with queues

  • Ansible: a very good option – lets you automate quite a few things and restore the configuration. IaC.

  • StatefulSet in k8s

  • The k8s topology operator: more Kuber for the Kuber gods

In our case, on the test environment we run the rabbit in the k8s cluster as a StatefulSet. For tests that is enough – it comes up fast, it gets killed fast – everyone is happy.

The rabbits on staging and production are installed on dedicated machines and configured by Ansible, which lets you track the current settings in code, apply them quickly with a single command, and roll back if needed, because the Ansible code lives in a Git repository.

We have gone over the configuration options and outlined our specific case, which is quite production-ready; now let's start solving the backup problem. We'll begin with storage.

Where do we put the goods?

Cyber rabbits think hard

Our own hardware is, of course, not an option – we already agreed this is a DR task, so the hardware has burned down, write it off in accounting. So where do we put the JSON? In the cloud. Why?

  • Google's cloud, for one, is a distributed system that has long been battle-tested under high load; the probability of our configs burning up in it is "vanishingly small!"

Having settled where the rabbit settings will be stored, we can start preparing the environment and tools our backup will need.

In the beginning was the word and the word was Terraform

We need a bucket and access to it (a service account). Terraform lets you describe all of this in three files (for beauty; you can cram it into one) and deploy/destroy it with a single command in the terminal. The choice of tool is obvious, so let's prepare everything without further ado:

/* The bucket resource where we will store our backups */
resource "google_storage_bucket" "configs_backup" {
  project       = var.project
  name          = "configs-backup"
  location      = "EUROPE-WEST4"
  storage_class = "Standard"
  labels = {}
  uniform_bucket_level_access = false
  force_destroy = true
  versioning {
    enabled = true /* Backups run every day,
    so we enable versioning for more fine-grained restores */
  }
  lifecycle_rule {
    condition {
      num_newer_versions = 5
    }
    action {
      type = "Delete"
    }
  }
}

The storage is ready, but we also need access to it. Let's go ahead and create a service account with the necessary rights:

/* The service account itself */
resource "google_service_account" "configs_backup_sa" {
  account_id   = "configs-backup-sa"
  display_name = "Created by terraform configs-backup for control in configs_backup bucket"
  project = "${var.project}"
}

/* Grant it rights on the bucket */
resource "google_storage_bucket_iam_member" "configs_backup_sa" {
  bucket  = "${google_storage_bucket.configs_backup.name}"
  role    = "roles/storage.objectAdmin"
  member  = "serviceAccount:${google_service_account.configs_backup_sa.email}"
}

And a variable with the project ID, purely for beauty and for possible reuse in your other Terraform modules:

variable "project" {
  description = "Google Project to create resources in"
  type        = string
  default     = "<your_project_id>"
}

To check, we initialize the state and run plan, which will display what we are going to do based on this code:

terraform init
terraform plan
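
Roughly, the tail of the plan output looks like this (abbreviated; the exact formatting depends on your Terraform version):

  # google_service_account.configs_backup_sa will be created
  # google_storage_bucket.configs_backup will be created
  # google_storage_bucket_iam_member.configs_backup_sa will be created

Plan: 3 to add, 0 to change, 0 to destroy.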

The plan should list three objects: the bucket, the SA and the IAM member (a full example of the output is in the repository linked at the end). If it all matches, we fire up the infernal IaC machine and deploy the objects:

terraform apply

Done – you are adorable!

What’s next?

Naxt?

The storage is ready, we have access to it – what's next? We need the backup worker itself, the thing we built all this beauty for. What are our requirements for it?

  • It must be isolated from the host environment and its dependencies, to reduce the chance of interfering with the hosts' work, of data tampering, and so on.

The list is not huge, but it effectively narrows down the options. A knowing reader is by now slowly starting to see what I'm getting at and quietly hating yet another fan of a well-known tool.

Who will we pick for the role of this important cog in the system, the one saving us from losing the rabbit settings?…

The road to Kubernetes is paved with good intentions © Jacque Fresco

CronJob, my k8s dudes

Qua?

Why are we dragging our solutions into K8s again:

  • To annoy everyone

  • Containers/pods managed by K8s are isolated

  • You can pick an image containing only the tools you need and not burden the infrastructure with their dependencies.

  • A CronJob in Kuber leaves traces in the form of pod logs that help in debugging possible errors, and all of these Kubernetes objects are easily monitored and can be wired up with alerts that go straight to Slack

  • In my case, our product is mainly in clusters, including CronJobs, and I don’t want to complicate the system with anything else

Now that we have decided on our backuper, we need to figure out what it will need to do its job.

We need access to the bucket to put the backups there, and access to RabbitMQ:

  1. For access to the bucket, we have already agreed to use the SA, which means we need to get a key into our pod to log in via gcloud. We'll do that through a Volume and VolumeMounts of a secret, into which this key arrives from Vault

            volumeMounts:
            - mountPath: /var/secrets/google
              name: google-cloud-key
          volumes:
          - name: google-cloud-key
            secret:
              secretName: configbackuper-gcp-sa
  2. For access to RabbitMQ, we won't get clever: we create users in our clusters with just the rights needed to download the configuration (see the sketch right after this list).

    Note: the example uses our copy of the backuper, so there are two users – one for the internal-services rabbit and one for the client-facing rabbit

              envFrom:
              - secretRef:
                  name: job-rabbitconfigbackuper-rabbit-secret
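
If you don't have Vault at hand, here is a rough sketch of preparing everything manually. The user name, password and key-file path are placeholders I made up; the secret names and key names are assumed to match what the CronJob manifest below expects (RabbitMQ__Creds / RabbitMQ__Creds__External):

# 1. A dedicated rabbit user; the administrator tag is the blunt but simple way
#    to be allowed to export /api/definitions (run on the rabbit host)
rabbitmqctl add_user config-backuper '<strong_password>'
rabbitmqctl set_user_tags config-backuper administrator
rabbitmqctl set_permissions -p / config-backuper "" "" ".*"

# 2. A key for the Terraform-created SA and the secret that will be mounted into the pod
gcloud iam service-accounts keys create ./configs-backup-sa-key.json \
  --iam-account=configs-backup-sa@<your_project_id>.iam.gserviceaccount.com
kubectl create secret generic configbackuper-gcp-sa \
  --from-file=key.json=./configs-backup-sa-key.json

# 3. The secret with the rabbit credentials, as plain "user:password" strings
kubectl create secret generic job-rabbitconfigbackuper-rabbit-secret \
  --from-literal=RabbitMQ__Creds='config-backuper:<strong_password>' \
  --from-literal=RabbitMQ__Creds__External='config-backuper:<strong_password>'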

We've sorted out access (don't forget, of course, that the firewall has to let the Kuber cluster reach the rabbit hosts). Now we need the backup tools. Everything will be as simple as possible – curl to fetch the rabbits' definitions.json, gcloud to authorize in GCP, and gsutil to copy the backup into our bucket, where it will sit waiting for DR or the coming of Cthulhu.

Google's own cloud-sdk image immediately catches the eye. But don't rush to grab it! After starting a pull, I saw a terrifying figure – 3 GB! That clearly does not look like a utility for a simple backup. You could build the image yourself, but as a quick fix we dug around and noticed the same official image based on Alpine. The result is clearly better – about 900 MB. We'll stop there, though if you wish, you can assemble the image yourself.
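
You can check the difference yourself (sizes are approximate and drift between releases):

docker pull google/cloud-sdk:latest
docker pull google/cloud-sdk:alpine
docker images google/cloud-sdk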

The cron algorithm looks like this:

  1. Run set -e so that any error aborts the cron run and we don't push stale files or other garbage into the bucket.

  2. With curl we fetch definitions.json from the hosts and save it to local files.

  3. Log in to the project with gcloud.

  4. We throw backup files into the bucket using gsutil.
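
Before cramming it into YAML, here is the same sequence as a readable standalone script – a sketch that assumes the same env variables as the CronJob manifest below and backs up just the one internal rabbit, for brevity:

#!/bin/sh
set -e  # any failed step aborts the run, so we never upload a stale or partial file

# 1. Fetch definitions.json from the management API;
#    the auth parameter is base64 of "user:password" (see the note on echo -n below)
AUTH=$(echo -n "$RabbitMQ__Creds" | base64 -w 0)
curl -f "http://$Rabbit__Host:$Rabbit__Port$Rabbit__JSON__Query$AUTH" > "/tmp/$Environment.json"

# 2. Log in to GCP with the mounted service-account key
yes | gcloud auth login --cred-file="$GOOGLE_APPLICATION_CREDENTIALS"

# 3. Push the dump into the bucket
gsutil cp "/tmp/$Environment.json" "gs://configs-backup/rabbitmq/$Environment.json"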

At this point, we can already start writing the yaml itself, parts of which I showed above:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: job-rabbitconfigbackuper
  labels:
    helm.sh/chart: job-rabbitconfigbackuper-0.1.0
    app.kubernetes.io/name: job-rabbitconfigbackuper
    app.kubernetes.io/instance: release-name
    app: job-rabbitconfigbackuper
    project: habr
    type: job
    app.kubernetes.io/version: "latest"
    app.kubernetes.io/managed-by: Helm
spec:
  schedule: "@daily"
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 5
  successfulJobsHistoryLimit: 3
  startingDeadlineSeconds: 180
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            helm.sh/chart: job-rabbitconfigbackuper-0.1.0
            app.kubernetes.io/name: job-rabbitconfigbackuper
            app.kubernetes.io/instance: release-name
            app: job-rabbitconfigbackuper
            project: habr
            type: job
            app.kubernetes.io/version: "latest"
            app.kubernetes.io/managed-by: Helm
        spec:
          serviceAccountName: common-sa
          securityContext:
            runAsUser: 1000
            runAsGroup: 3000
            fsGroup: 2000
          restartPolicy: OnFailure
          containers:
          - name: job-rabbitconfigbackuper
            image: "google/cloud-sdk:alpine"
            command:
              - /bin/sh
            args:
              - -c
              - set -e; 
                curl http://$Rabbit__Host:$Rabbit__Port$Rabbit__JSON__Query$(echo -n $RabbitMQ__Creds | base64 -w 0) > /tmp/$Environment.json; 
                curl http://$Rabbit__External__Host:$Rabbit__Port$Rabbit__JSON__Query__External$(echo -n $RabbitMQ__Creds__External | base64 -w 0) > /tmp/$Environment\_external.json;
                yes | gcloud auth login --cred-file=$GOOGLE_APPLICATION_CREDENTIALS; 
                gsutil cp /tmp/$Environment.json gs://configs-backup/rabbitmq/$Environment.json; 
                gsutil cp /tmp/$Environment\_external.json gs://configs-backup/rabbitmq/$Environment\_external.json;
            env:
              - name: "Environment"
                value: "stage"
              - name: "GOOGLE_APPLICATION_CREDENTIALS"
                value: "/var/secrets/google/key.json"
              - name: "Rabbit__External__Host"
                value: "<your_external_rabbit_host>"
              - name: "Rabbit__Host"
                value: "<your_internal_rabbit_host>"
              - name: "Rabbit__JSON__Query"
                value: "/api/definitions?download=rabbit_<your_internal_rabbit_host>.json&auth="
              - name: "Rabbit__JSON__Query__External"
                value: "/api/definitions?download=rabbit_<your_external_rabbit_host>.json&auth="
              - name: "Rabbit__Port"
                value: "15672"
            envFrom:
            - secretRef:
                name: job-rabbitconfigbackuper-rabbit-secret
            volumeMounts:
            - mountPath: /var/secrets/google
              name: google-cloud-key
          volumes:
          - name: google-cloud-key
            secret:
              secretName: configbackuper-gcp-sa

The base64 encoding is there because the rabbit host accepts authorization through curl only in this form. This quirk was especially painful, because secrets in Kuber are themselves stored in base64, and we ended up in a situation where the rabbit host encoded one way, Kuber another, and I was debugging on my local machine in a third way.

Watch your hands!

  1. Our login:password pair comes from the Kuber secret, where in encoded form it looks like this:

  2. Once loaded into the pod it is decoded, and we get back the original login:password of our rabbit user:

Now we need to send it to the rabbit for authorization, encoded back into base64. We run the cherished command and… what?!

This clearly doesn't look like the original string from the secret, nor like the auth token from the download request. Maybe we're missing something? Let's decode the resulting string:

It’s definitely not our password. What to do?

$(echo -n $RabbitMQ__Creds | base64 -w 0)

In our case, this line fixed everything. But differences in encodings – along with flags and even quoting, which also affect encoding and decoding – can confuse you and break authorization at the start.

PS With ordinary manual encoding/decoding this problem is fixed with simple quotes, but in our backuper's case we operate on environment variables and secrets, so we have what we have.
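
A quick illustration on the command line (guest:guest is just an example pair):

$ CREDS='guest:guest'
$ echo $CREDS | base64             # the trailing newline from echo gets encoded too
Z3Vlc3Q6Z3Vlc3QK
$ echo -n "$CREDS" | base64 -w 0   # no newline, no line wrapping – what the rabbit expects
Z3Vlc3Q6Z3Vlc3Q=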

Also keep in mind that you have to work with files in a directory like /tmp/, otherwise the pod crashes with a permissions error on other directories (security, after all).

Deploy our CronJob to the cluster and voila – Mission complete, Boss!

Our files are in the bucket, they are versioned, and if the rabbits fall over, we simply redeploy them and import these files back (which, if desired, can also be automated).
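
For reference, a sketch of such a restore done by hand – pull the file from the bucket and push it back through the same management API (the user, password and host are placeholders):

# Grab the latest backup from the bucket
gsutil cp gs://configs-backup/rabbitmq/stage.json /tmp/stage.json

# Import it into the freshly redeployed rabbit via the management API
curl -f -u <user>:<password> \
  -H "Content-Type: application/json" \
  -X POST "http://<your_internal_rabbit_host>:15672/api/definitions" \
  -d @/tmp/stage.json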

Results and Happy End

The rabbits are safe, everyone is happy. Man does not live by IaC alone – back up everything you can; someday it will save you, and your colleagues will look at you as an infrastructure god.

I hope that if you read the article to the end, it turned out useful, or at least interesting. All the best, and happy automating!

PS We use Helm to deploy to Kuber, but the templates are self-written, so the article uses raw yaml generated by helm template.

PPS This is my first article on this resource, and my first article in general – I'll be glad to get any reasonable feedback.

And here is a link to the git with the sources and a small readme for deploying
