SSL Certificate Management: From Chaos on Hundreds of Servers to a Centralized Solution

What stands behind the words “Europe’s largest online school”? On the one hand, it means 1,000 lessons per hour, 10,000 teachers, and 100,000 students. For me, an infrastructure engineer, it also means 200+ servers, hundreds of services (micro and not so micro), and domain names from the 2nd to the 6th level. All of them need SSL and, accordingly, a certificate.

For the most part, we use Let’s Encrypt certificates. They are free, and issuance is fully automated. The catch is their short validity: only three months, so they have to be renewed often. We tried to automate this in various ways, but a lot of manual work remained and something always broke. A year ago we came up with a simple and reliable way to renew this pile of certificates, and we have not had to think about the problem since.

From one certificate on one server to hundreds in several data centers

Once upon a time there was only one server. On it lived a certbot, run from cron. Then that server could no longer cope with the load, so a second one appeared. And then more and more. Each had its own certificates with its own unique set of names, and renewal had to be configured on every one of them. In some places, during expansion, existing certificates were copied over but their renewal was forgotten.

To obtain a Let’s Encrypt certificate, you must prove that you control the domain names listed in it. This is usually done over HTTP: the certificate authority makes a request back to the server behind that name.

Here are a couple of typical difficulties we ran into as we grew:

  • Not all new servers were reachable from outside: some were moved behind an incoming-traffic load balancer and were no longer accessible from the Internet. Certificates had to be copied to them manually.
  • There were also servers with no HTTP at all: mail, databases, LDAP, or something stranger still. Certificates had to be copied to them manually as well.

In some places we had used self-signed certificates for quite a while, and this seemed like a good solution where authentication is not needed, for example for internal testing. To stop the browser from constantly warning about a “suspicious site”, you just add our root certificate to the trusted list, and the job is done. But later difficulties arose here too.

The trouble is that in BrowserStack, which our testers use, it is impossible to add a certificate to the trusted list, at least for iPad, Mac, and iPhone. So the testers had to put up with constant pop-up warnings about dangerous sites.

Search for a solution

Of course, the first step is monitoring, so that you learn about expiring certificates a little before they expire rather than after. Fine. We have monitoring, and now we know that certificates here and there will expire soon. And what do we do now?
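
As an illustration of such a check, here is a minimal Python sketch that reads the expiry date of the certificate a host serves and reports whether it is close. The 14-day threshold and the function names are assumptions made for this example; the article does not say which monitoring tool is actually used.

```python
import socket
import ssl
import time


def days_left(not_after: str, now: float) -> int:
    """Whole days between `now` (epoch seconds) and a certificate's notAfter.

    `not_after` is in the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - now) // 86400)


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and return the peer certificate's notAfter.

    The default context verifies the chain, so an already expired or
    self-signed certificate raises an ssl.SSLError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(host: str, threshold_days: int = 14) -> bool:
    """True if the certificate served by `host` expires within the threshold."""
    return days_left(fetch_not_after(host), time.time()) < threshold_days
```

Calling `expiring_soon("skyeng.ru")` from a scheduled job would be enough to raise an alert a couple of weeks ahead of time.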


Big Ear is an old bot that won’t ruin a certificate.

What about wildcard certificates? Sure! Let’s Encrypt already issues them. True, domain ownership then has to be confirmed through DNS. Our DNS lives in AWS Route53, so the AWS access credentials would have to be distributed to all servers, and copied to every new server as it appears.

Well, 3rd-level names are covered by the wildcard. But what about names of the 4th level and higher? We have many teams developing various services, and it is now customary to split frontend and backend. If the frontend gets a 3rd-level name like service.skyeng.ru, then the backend tends to be given a name like api.service.skyeng.ru. Hmm, maybe forbid them from doing this in future? Great idea! But what about the dozens of names that already exist? Should we rule with an iron fist and drive them all under one domain name, replacing all these names of various levels with URLs like skyeng.ru/service? Technically this is an option, but how long would it take? And how do we justify the need for it to the business? We have 30+ development teams; persuading everyone would take at least six months. And we would be creating a single point of failure. Like it or not, it is a controversial decision.

What other ideas are there? Maybe make one certificate that includes absolutely everything and install it on all servers? That might have solved our problems, but Let’s Encrypt allows only 100 names per certificate, and we already have more microservices alone than that.

And what about the testers? Nothing has been invented for them, and they keep complaining. Everything is nonsense except bees. Bees are nonsense too, but there are a lot of them. Each developer or tester is given a test server; we call these “testings”. Testings are not bees, but there are already over a hundred of them, and every project is deployed on each one. Every single one. So if production needs N certificates, each testing needs just as many. So far they are self-signed. It would be great to replace them with real ones…
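
Why the wildcard stops at the 3rd level can be shown with a short Python sketch of the matching rule: the `*` stands for exactly one label. The function name is made up for this example.

```python
def wildcard_covers(pattern: str, name: str) -> bool:
    """True if certificate name `pattern` matches DNS name `name`.

    A leading "*." stands for exactly one label, so "*.skyeng.ru"
    matches "service.skyeng.ru" but neither "api.service.skyeng.ru"
    nor the bare "skyeng.ru".
    """
    pattern, name = pattern.lower(), name.lower()
    if not pattern.startswith("*."):
        return pattern == name
    suffix = pattern[1:]  # ".skyeng.ru"
    if not name.endswith(suffix):
        return False
    first_label = name[: -len(suffix)]
    return first_label != "" and "." not in first_label


# Which of these names would still need their own certificate entry?
names = ["service.skyeng.ru", "api.service.skyeng.ru"]
uncovered = [n for n in names if not wildcard_covers("*.skyeng.ru", n)]
```

Here `uncovered` ends up holding only the 4th-level backend name, which is exactly the case described above.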

Two playbooks and one source of truth

The swan, the crayfish, and the pike will never get the cart moving. We need a single mission control center for our servers. In our case, this is Ansible. Certbot on every server is evil. Let all certificates be stored in one place. If someone somewhere needs a certificate, they come to this place and take the latest version off the shelf. And we make sure that the certificates in this store are always up to date.

The AWS access credentials also live in just one place. Questions such as how to set up the AWS CLI on a new server or who has access to Route53 simply disappear.

All required certificates are described in one file in Ansible in YAML format:


    # Wildcard names are quoted: an unquoted leading * starts a YAML alias.
    certificates:
      - common_name: skyeng.ru
        alt_names:
          - "*.skyeng.ru"
      - common_name: olympiad.skyeng.ru
        alt_names:
          - "*.olympiad.skyeng.ru"
          - api.content.olympiad.skyeng.ru
          - games.skyeng.ru
      - common_name: skyeng.tech
        alt_names:
          - "*.skyeng.tech"

      .  .  .
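
Since this one file is the source of truth, it can be worth sanity-checking before the playbook runs. The Python sketch below assumes the list has already been loaded into plain dictionaries; the checks and function names are illustrative and not part of the author's playbooks.

```python
MAX_NAMES_PER_CERT = 100  # the Let's Encrypt limit mentioned earlier


def cert_names(entry: dict) -> list[str]:
    """All names one certificate will carry: common_name first, then alt_names."""
    return [entry["common_name"], *entry.get("alt_names", [])]


def check_certificates(certificates: list[dict]) -> list[str]:
    """Return human-readable problems found in the list (empty if none)."""
    problems = []
    seen = set()
    for entry in certificates:
        cn = entry["common_name"]
        names = cert_names(entry)
        if cn in seen:
            problems.append(f"{cn}: duplicate common_name")
        seen.add(cn)
        if len(names) > MAX_NAMES_PER_CERT:
            problems.append(f"{cn}: {len(names)} names, limit is {MAX_NAMES_PER_CERT}")
        if len(set(names)) != len(names):
            problems.append(f"{cn}: a name is listed twice")
    return problems


# The same entries as in the YAML above, as Python data
certificates = [
    {"common_name": "skyeng.ru", "alt_names": ["*.skyeng.ru"]},
    {"common_name": "olympiad.skyeng.ru",
     "alt_names": ["*.olympiad.skyeng.ru", "api.content.olympiad.skyeng.ru",
                   "games.skyeng.ru"]},
    {"common_name": "skyeng.tech", "alt_names": ["*.skyeng.tech"]},
]
problems = check_certificates(certificates)
```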

One playbook runs periodically, walks through this list and does the hard work, essentially the same thing certbot does:

  • creates an account with the Let’s Encrypt certificate authority
  • generates a private key
  • generates a certificate signing request (CSR), in effect a certificate that is not yet signed
  • sends the signing request
  • receives a DNS challenge
  • puts the requested records into DNS
  • sends the signing request again
  • and, having finally received the signed certificate, puts it in the store.
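
The “receives a DNS challenge, puts records in DNS” steps can be made concrete. Under the ACME protocol (RFC 8555), the DNS-01 challenge asks for a TXT record at a well-known name, holding a hash derived from the challenge token and the account key. The Python sketch below computes that name and value. The token and thumbprint in the usage lines are placeholders, and the article does not show how the playbook itself performs this step.

```python
import base64
import hashlib


def challenge_record_name(domain: str) -> str:
    """Name of the TXT record that proves control of `domain`.

    For a wildcard such as "*.skyeng.ru" the "*." prefix is dropped.
    """
    if domain.startswith("*."):
        domain = domain[2:]
    return f"_acme-challenge.{domain}"


def challenge_record_value(token: str, account_thumbprint: str) -> str:
    """TXT value: unpadded base64url of SHA-256 over the key authorization."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Placeholder inputs: a real token comes from the CA, and the thumbprint
# is derived from the ACME account key.
record = challenge_record_name("*.skyeng.ru")
value = challenge_record_value("example-token", "example-thumbprint")
```

This is also why a wildcard and its base name can share one certificate: both are validated through the same `_acme-challenge` record name, each with its own TXT value.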

The playbook runs once a day. If it fails to renew some certificate for any reason, be it network problems or errors on the Let’s Encrypt side, that is not a problem: the certificate will be renewed on the next run.

Now, when a service host needs SSL, you go to this store and take a few files from it. This simple operation is what the second playbook performs. Which certificates a host needs is described in that host’s parameters, in inventories/host_vars/server.yml:
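
The reason a failed run is harmless can be sketched in a few lines of Python: renewal starts well before expiry, and one failure does not stop the rest of the list. The 30-day threshold and all names here are assumptions for illustration, not values taken from the playbook.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional

RENEW_BEFORE = timedelta(days=30)  # assumed threshold, not from the article


def needs_renewal(not_after: Optional[datetime], now: datetime) -> bool:
    """A certificate is due if it is missing or expires within RENEW_BEFORE."""
    return not_after is None or not_after - now < RENEW_BEFORE


def daily_run(
    expiry_by_name: dict[str, Optional[datetime]],
    renew: Callable[[str], None],
    now: datetime,
) -> list[str]:
    """Try to renew every due certificate; return the names that failed.

    A failure is recorded and skipped, so one bad certificate does not
    block the others, and the next daily run simply tries it again.
    """
    failed = []
    for name, not_after in expiry_by_name.items():
        if not needs_renewal(not_after, now):
            continue
        try:
            renew(name)
        except Exception:  # network error, CA-side error, and so on
            failed.append(name)
    return failed
```

With a 90-day certificate and a 30-day threshold, a certificate would have to fail every day for a month before it actually expired.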


    certificates:
      - common_name: skyeng.ru
        handler: reload nginx
      - common_name: crm.skyeng.ru

      .  .  .

If the files have changed, Ansible fires a handler, typically one that restarts Nginx (in our case this is the default action). In exactly the same way, you can obtain certificates from other CAs that use the ACME protocol.

Summary

  • We used to have many different configurations. Something constantly broke. I often had to log in to servers and figure out what had fallen off this time.
  • Now we have two playbooks, and everything is described in one place. Everything runs like clockwork. Life has become more boring.
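
The “copy, and notify only on change” behaviour that Ansible provides can be sketched in plain Python as follows. The store layout (one directory per common name) and the file name are assumptions made for this example; the article does not describe how the store is organized.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def _digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def install_file(src: Path, dest: Path) -> bool:
    """Copy src to dest only if the content differs; return True if copied."""
    if dest.exists() and _digest(src) == _digest(dest):
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return True


def deploy(store: Path, target: Path, common_name: str,
           handler: Callable[[], None]) -> bool:
    """Install all files for common_name; run handler once if any changed."""
    # Build the full list first so that every file is installed;
    # a bare any(...) over a generator would stop at the first change.
    results = [
        install_file(f, target / common_name / f.name)
        for f in sorted((store / common_name).iterdir())
    ]
    changed = any(results)
    if changed:
        handler()
    return changed


# Usage sketch against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    store, target = Path(tmp, "store"), Path(tmp, "etc-ssl")
    (store / "skyeng.ru").mkdir(parents=True)
    (store / "skyeng.ru" / "fullchain.pem").write_text("cert v1")
    reloads = []
    first = deploy(store, target, "skyeng.ru", lambda: reloads.append("reload nginx"))
    second = deploy(store, target, "skyeng.ru", lambda: reloads.append("reload nginx"))
```

The second call finds identical files and leaves the web server alone, which is what makes it safe to run the deployment playbook as often as you like.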

Testing

And what about the testers and their testings? Each developer or tester is given a personal test server, a testing. There are currently about 200 of them. They have names of the form test-y123.skyeng.link, where 123 is the testing number. Creating and removing testings is automated, and one of the steps is installing an SSL certificate. The certificate is generated in advance, with names built from a template:

    ssl_cert_pattern:
      - "*"
      - "*.auth"
      - "*.bill"

      .  .  .

About 30 names in total. So the certificate includes the names

    test-y123.skyeng.link
    *.test-y123.skyeng.link
    *.auth.test-y123.skyeng.link
    *.bill.test-y123.skyeng.link

etc.

When a developer or tester leaves, their testing is deleted. The certificate stays ready for reuse. It is all stored you-know-where and distributed to the hosts you-know-how.

P.S.

It may also be interesting to read on this topic, how Stack Overflow switched to HTTPS:

  • Hundreds of domains of different levels
  • Websockets
  • Lots of HTTP APIs (proxy issues)
  • Do everything and not drop performance

If you have any questions, write in the comments, I will be happy to answer.
