How we contributed an entire line to HashiCorp Vault

Hello! My name is Pyotr Zhuchkov, and I lead the secrets and configuration storage group in the Message Bus department at Ozon. We are responsible for supporting and developing the system for storing and using secrets, and we work closely with the information security team so that every service can handle secrets safely.

Our main tool for managing secrets is Vault. It is capable and well documented, so it is easy to get started with. Of course, launching Vault and connecting it to a single service is not at all the same as reliably and securely serving more than 6,000 services and other infrastructure systems on the platform. For us it is critical both to deliver data quickly and to store it securely.

For example, under normal conditions we handle 1,000-2,000 requests per second, with peaks of up to 7,000.

If you want to store secrets safely, or simply to dive into gRPC and Go, I think you will find this story interesting and useful for avoiding our mistakes.

Below I'll tell the story of how, during routine Vault maintenance, we managed to bring it down and then spent a lot of time and nerve cells getting it back into working order. To better understand the causes, let's first get acquainted with Vault and its design. If you already know what Vault is, feel free to skip straight to the action.

Background

At Ozon we use HashiCorp Vault to store various secrets: database credentials, API keys, certificates, service-to-service secrets. It has become a de facto standard for large companies that care about information security. And then this open-source product, with about 30,000 stars on GitHub and an active community behind it, broke for us with no apparent reason or explanation.

Below I will walk through all our steps, the pitfalls we hit, and how we found the problem and proposed a fix.

I’ll say right away that we use the community version. It is provided on terms that suit us.

A little about Vault. It is a tool/service for storing secrets such as database credentials, tokens, and other data that must not fall into the wrong hands. Vault offers a wide range of options for fine-grained access control. It also exposes an HTTP API, which makes it usable even from bash scripts. All data is kept encrypted, both in the backend and in RAM, and cannot simply be read out. Thanks to these interfaces, Vault can be integrated with various systems, and there are many ways to deliver secrets to your service: configs, environment variables, and so on.
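To give a feel for how simple client access is, here is a minimal sketch of reading a secret with the official Go client (github.com/hashicorp/vault/api). The mount and path are made up for illustration, a KV v1 engine is assumed, and VAULT_ADDR and VAULT_TOKEN are taken from the environment.

package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// VAULT_ADDR and VAULT_TOKEN are picked up from the environment.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical path on a KV v1 mount; any path your policy allows works the same way.
	secret, err := client.Logical().Read("secret-mount/group/project-2377212/db")
	if err != nil {
		log.Fatal(err)
	}
	if secret == nil {
		log.Fatal("no secret at this path")
	}
	fmt.Println(secret.Data["password"])
}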

Ozon has platform libraries that simplify development and increase the stability of projects. One of the features they provide is quick and secure access to secrets. The general scheme for delivering secrets is as follows:

  1. Pod starts in the K8s cluster.

  2. The pod receives a K8s service account token.

  3. The pod sends this token to Vault.

  4. Vault validates the token and, if all is well, exchanges it for a Vault token.

  5. With the new token, the service gains access to secrets.

Besides Kubernetes authorization, Vault supports many other auth methods, for example JWT, which is what we use for GitLab Runners.
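As a rough sketch of steps 2–4 above: the pod reads its service account token and exchanges it for a Vault token via the Kubernetes auth endpoint. The auth mount path and role name here are illustrative; the real ones depend on how the auth method is configured.

package main

import (
	"log"
	"os"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// Step 2: the pod's K8s service account token.
	jwt, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		log.Fatal(err)
	}

	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Steps 3-4: Vault validates the JWT and exchanges it for a Vault token.
	resp, err := client.Logical().Write("auth/kubernetes/login", map[string]interface{}{
		"role": "project-2377212", // role created for the service (illustrative)
		"jwt":  string(jwt),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Step 5: the new token is used for all further requests for secrets.
	client.SetToken(resp.Auth.ClientToken)
}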

HashiCorp Vault is designed so that secrets are well abstracted from the actual storage, which lets you keep the data in almost any key/value store. A full list of supported storage backends is available in the documentation.

We chose etcd as our storage backend. One of its important features is the ability to run as a highly available cluster. Our choice was based on our extensive experience with etcd, as well as on the results of our load tests:

Configuration | Connection type | Throughput | Response time (p98) | Response time (p50)
Raft          | 65% close       | 4777 RPS   | 180 ms              | 20 ms
Raft          | keepalive       | 6809 RPS   | 344 ms              | 2.79 ms
Raft          | close           | 4087 RPS   | 135 ms              | 36 ms
etcd          | 65% close       | 5200 RPS   | 94 ms               | 44 ms
etcd          | keepalive       | 8145 RPS   | 33 ms               | 11 ms
etcd          | close           | 4053 RPS   | 81 ms               | 39 ms

As you can see, etcd sustains the highest load with the lowest response times.

etcd is, as stated on the website:

A distributed, reliable key-value store for the most critical data of a distributed system. A three member etcd cluster finishes a request in less than one millisecond under light load, and can complete more than 30,000 requests per second under heavy load.

That is, it is a key/value database that uses Raft.

All Vault and etcd instances are distributed across our data centers for maximum fault tolerance. Clients, as a rule, use DNS balancing: the DNS server polls the Vault nodes and returns the address of the active (leader) node.

Vault has a built-in data protection mechanism: all secrets are stored encrypted both in the backend and in RAM, which maximises data security. At startup Vault is “sealed”, that is, not yet ready to serve secrets. To open it, you need to perform the unseal procedure: enter several key shares (produced by Shamir's secret sharing).

That is, the steps should be as follows:

  1. Start etcd.

  2. Start Vault.

  3. Enter the unseal keys one by one.

  4. Start working with Vault.

The keys are held by different employees who must keep them separately.
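For illustration, here is a minimal sketch of step 3 using the Go client; in practice each custodian runs `vault operator unseal` with their own share, and the shares below are placeholders.

package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder key shares; each one is normally entered by a different person.
	shares := []string{"share-1", "share-2", "share-3"}
	for _, s := range shares {
		status, err := client.Sys().Unseal(s)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("progress %d/%d, sealed: %v\n", status.Progress, status.T, status.Sealed)
		if !status.Sealed {
			break // enough shares entered, Vault is unsealed
		}
	}
}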

In general, Vault can “seal” itself due to various situations:

  • the storage leader has changed,

  • slow storage responses,

  • the Vault leader has changed.

To reduce the risk of Vault sealing itself and to speed up unsealing, you can use a scheme with another Vault – transit auto-unseal. That is, to unseal one Vault you need another Vault =) The second (transit) Vault usually carries no load; its only task is to unseal the main one.

This scheme allows the main Vault to be unsealed quickly without involving people (the key custodians).

But the transit Vault itself still has to be unsealed manually. It looks like this:

As a result, we can quickly start and unseal the main Vault, while its keys are stored securely in the transit Vault.

Basic Vault Entities

Before we get to the story itself, a little about how Vault works. It is quite flexible software that can do a lot and extends well via plugins; for example, many different authentication methods are available. Almost all communication inside Vault happens through gRPC messages, which makes it possible to integrate it with various systems. Below I'll cover a few important features of Vault, but not all of its capabilities – there is plenty of information about them on the Internet. If you are familiar with Vault's general design, feel free to skip this part.

Paths

Working with Vault means reading and writing data located at different paths, for example:

secret-mount/group/project-2377212/
  • secret-mount – the mount point; it can have different engines and therefore perform different operations, such as storing secrets or issuing certificates;

  • group/project-2377212/ – this is the path to our directory where secrets are stored.

Tokens

This is one of the key entities; a token is included in every request to Vault and is how Vault identifies who is calling.

The token contains a lot of different information, for example:

  • creation date,

  • lifetime – determined on the basis of leases,

  • access policies are a separate entity,

  • usage counter,

  • a flag indicating whether the token can be renewed,

  • etc.

There are several types of tokens:

  • service – the most commonly used tokens, have wide functionality;

  • batch – lighter tokens; essentially, they are encrypted BLOBs.

These are quite different tokens. To use them correctly, you need to understand how they will be used for authorization and access administration and pick the appropriate type. For example, batch tokens are self-contained, but, unlike service tokens, they cannot be revoked. Service tokens, in turn, are stored in Vault and take up space, which is why you absolutely must keep track of them: how many there are and how much storage they consume.
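As a rough sketch (the policy name and TTL are made up), this is how the two types are created with the Go client; the only difference is the Type field, but the operational consequences described above are very different.

package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Service token: stored in the backend, renewable and revocable.
	svc, err := client.Auth().Token().Create(&vault.TokenCreateRequest{
		Policies: []string{"project-2377212"},
		TTL:      "15m",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Batch token: an encrypted blob, not persisted in etcd and not revocable.
	batch, err := client.Auth().Token().Create(&vault.TokenCreateRequest{
		Policies: []string{"project-2377212"},
		TTL:      "15m",
		Type:     "batch",
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(svc.Auth.ClientToken, batch.Auth.ClientToken)
}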

Here is an example of the relationship between authorizations in Vault and keys in etcd:

You can see that when Vault is being actively accessed, many tokens are created (and, as we remember, tokens are part of authorization), and they are all stored in etcd.

Leases

A lease can be attached to many entities in Vault that have a lifetime; the lifetime of a token is one example. You can bind a lease to data, and the data will be deleted as soon as the lease expires. In other words, a lease is a TTL that can be attached to almost any entity. Leases can also be renewed and revoked.

Be sure to set a TTL on tokens; otherwise they will sit in etcd forever, and sooner or later that will cause a problem – as it did for us.
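A minimal sketch of working with a lease from the Go client, assuming the lease ID was taken from a previously issued Secret (its LeaseID field); the ID below is a placeholder.

package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder lease ID; normally taken from secret.LeaseID.
	leaseID := "auth/o-prod/login/h1234567890"

	// Extend the lease by an hour...
	if _, err := client.Sys().Renew(leaseID, 3600); err != nil {
		log.Fatal(err)
	}

	// ...or revoke it explicitly instead of waiting for the TTL to run out.
	if err := client.Sys().Revoke(leaseID); err != nil {
		log.Fatal(err)
	}
}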

Roles

These entities are somewhat similar to user groups. When authorizing in Vault you typically specify the role you want to authorize with, and the role carries various parameters – for example, the token lifetime (for which the leases described above are created), policies, authorization namespaces and more – that determine what settings the resulting token will receive.
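A sketch of creating a Kubernetes auth role with the Go client. The role name, namespace and policy are illustrative, and the parameter names follow the Kubernetes auth method's API; check them against your Vault version.

package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Tokens issued through this role will carry the policy below and a 15-minute lease.
	_, err = client.Logical().Write("auth/kubernetes/role/project-2377212", map[string]interface{}{
		"bound_service_account_names":      "project-2377212",
		"bound_service_account_namespaces": "prod",
		"token_policies":                   "project-2377212",
		"token_ttl":                        "15m",
	})
	if err != nil {
		log.Fatal(err)
	}
}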

Access Policies

Essentially, this is a set of rules that describe access to secrets. They look like this:

path "secret-mount/group/project-2377212/*" {
    capabilities = ["read", "list", "create", "update", "delete"]
}

So, to summarize, in order to give services differentiated access to secrets:

  1. We add access policies.

  2. We create roles that contain one or more policies and one or more time restrictions.

  3. We place the secrets under the paths those policies grant access to.

Profit!
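For completeness, a sketch of steps 1 and 3 with the Go client (step 2 was shown in the Roles section above). The policy and path are the same illustrative ones as before, and a KV v1 mount is assumed.

package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Step 1: add the access policy (same rules as the HCL example above).
	policy := `path "secret-mount/group/project-2377212/*" {
    capabilities = ["read", "list", "create", "update", "delete"]
}`
	if err := client.Sys().PutPolicy("project-2377212", policy); err != nil {
		log.Fatal(err)
	}

	// Step 3: place a secret under the path covered by that policy.
	_, err = client.Logical().Write("secret-mount/group/project-2377212/db", map[string]interface{}{
		"user":     "svc",
		"password": "not-a-real-password",
	})
	if err != nil {
		log.Fatal(err)
	}
}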

Using Kubernetes Pod Authorization in Vault

So, the standard process looks like this:

  1. A service is created. At this stage a role with a policy is also created; the role grants access to a specific path in Vault.

  2. Database credentials are also written to this path.

  3. When starting, the service, using the platform library, reads the secret:

    1. When starting, the service receives a K8s service account token.

    2. With this token and role (which was created when creating the service) it goes to Vault.

    3. Vault validates the service with K8s and, if all is well, issues a token that carries the role's policy granting access to the service's secrets.

    4. Using the last token, the service receives all the necessary secrets and starts working.

As you can see, many of these steps are handled by our platform library, which is part of our framework for building services. Because it is unified across languages, services behave the same way, which greatly helps us develop Ozon's infrastructure.

Maintenance

It all started with one routine maintenance job on our secret vault.

As I wrote above, if you do not specify TTL for tokens, then sooner or later the amount of stored data will increase significantly. This is what happened with us too.

We decided to change our token policy and reduce token lifetimes, since most services only need to read secrets at startup, which means long-lived tokens are unnecessary. We also decided to revoke all the infinite tokens that had been issued to infrastructure systems unable to authorize in the standard way.

In general, everything looked like a normal procedure that should not affect the operation of Vault.

So, first we needed to find the long-lived tokens. To do this, we found all the roles whose policies allowed long token TTLs (and, as we remember, tokens are issued for roles). A clarification is needed here: the tokens themselves cannot be listed, even by administrators, but you can find their accessors, which let you obtain all the necessary information about the tokens. This is done for security, so that even administrators cannot act on a user's behalf. More details are in the “Token management” documentation.
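A sketch of how such a scan can be done with the Go client, assuming an admin-level token: list the token accessors, look each one up, and flag tokens whose TTL is zero (infinite) or longer than a threshold. The threshold is arbitrary here.

package main

import (
	"fmt"
	"log"
	"time"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List all token accessors (requires an appropriately privileged token).
	resp, err := client.Logical().List("auth/token/accessors")
	if err != nil || resp == nil {
		log.Fatal(err)
	}

	for _, k := range resp.Data["keys"].([]interface{}) {
		accessor := k.(string)
		info, err := client.Auth().Token().LookupAccessor(accessor)
		if err != nil {
			continue
		}
		ttl, err := info.TokenTTL()
		if err != nil {
			continue
		}
		// A zero TTL here usually means a token that never expires.
		if ttl == 0 || ttl > 30*24*time.Hour {
			fmt.Println("long-lived token, accessor:", accessor)
			// Equivalent of `vault token revoke -accessor ...`:
			// _ = client.Auth().Token().RevokeAccessor(accessor)
		}
	}
}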

Then we launched the revocation procedure for those tokens:

vault token revoke -accessor ....

All tokens were successfully revoked – everything went well.

We also found roles where it was possible to create infinite tokens, and changed their policies so that tokens would not live indefinitely in the future.

As I wrote above, a token's lifetime is based on leases, so the revocation procedure is essentially an operation that marks leases as expired, namely moving them all into the directory:

/vault/sys/expire/id/auth/ 

We ended up moving about 400,000 records, or about 30% of all leases.

It's worth noting that Vault does not read this data from the database right away; it only accesses it when a client shows up with a token and the lease is checked. So right after the leases were revoked, everything was still fine.

Deleting data from etcd does not by itself shrink the volume on disk, so to reclaim space etcd provides the defrag operation, which defragments the data in the database. As you might guess, this is a rather expensive operation: while it runs, the node stops serving requests. It is similar to VACUUM FULL in PostgreSQL. If you run it on the cluster nodes one at a time, there should be no problems. But by mistake the operation was launched on all nodes at once.
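For reference, the safe way to do this is one node at a time, waiting for each to finish, which is what running `etcdctl defrag` per endpoint amounts to. A sketch with the etcd Go client (endpoints are placeholders, TLS setup omitted):

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"https://etcd-1:2379", "https://etcd-2:2379", "https://etcd-3:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Defragment strictly one node at a time: the node being defragmented
	// stops serving requests, so the rest of the cluster must stay available.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
		_, err := cli.Defragment(ctx, ep)
		cancel()
		if err != nil {
			log.Fatalf("defrag of %s failed: %v", ep, err)
		}
		log.Printf("defragmented %s", ep)
		time.Sleep(30 * time.Second) // let the cluster settle before the next node
	}
}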

This led to problems in Vault itself. Vault has a mechanism for switching from one storage node to another if the first one stops responding. But since all the nodes were busy with defragmentation, after failing to find a responsive node Vault sealed itself and became unavailable to clients. In general, this is standard behavior for Vault: to guarantee secure storage of secrets, it needs a stable connection to etcd.

We noticed the problems with Vault immediately, and the attempts to unseal it began, with sweat and tears… because no requests for secrets were being processed anymore. But Vault stubbornly insisted that it could not unseal and produced errors like this:

[ERROR] expiration: error restoring leases: error="failed to scan for leases: list failed at path \"\": rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2513015459 vs. 2147483647)

As I wrote above, we use a transit Vault, which performs the unsealing automatically, but in this case it did not help: the main Vault would not unseal. The logs showed that the unseal procedure was running, yet literally five seconds later Vault sealed itself again and all requests to it were blocked.

We started forming hypotheses and changing configuration, e.g. max_receive_size and max_send_size. Unfortunately, this brought no results. Meanwhile we found, in HashiCorp's repositories, reports of similar errors where the cause was a large number of tokens; those cases were resolved by increasing the maximum response size for the gRPC client.

In any case, the error makes it clear that when Vault starts, the leases do not fit somewhere. So we started looking for errors on the database side – in etcd.

etcd[20536]: read-only range request "key:\"/vault/sys/expire/id/\" range_end:\"/vault/sys/expire/id0\" " with result "range_response_count:1475648 size:2504927582" took too long (6.962914353s) to execute

After analyzing the problem, we suspected that after revoking a large number of tokens, /vault/sys/expire/id still contained entities that are deleted lazily rather than immediately: they are marked as expired in the database, and Vault deletes them later.

This directory stores leases for the different auth mounts, for example /vault/sys/expire/id/auth/o-dev/login for the dev environment, /vault/sys/expire/id/auth/o-stg/login for the staging environment, and so on.
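Incidentally, the size of such a directory can be checked without pulling any values. A sketch with the etcd Go client (the endpoint is a placeholder):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Count leases under one auth mount; WithCountOnly transfers no key/value data.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "/vault/sys/expire/id/auth/o-dev/login/",
		clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("leases under o-dev:", resp.Count)
}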

And then we decided to remove entries from the dev and stage environments.

After this, we were able to “unseal” Vault – and it was able to correctly give secrets to all services.

What was it?

Of course, after such an incident, we took a large number of actions to ensure that the situation did not repeat itself.

First we needed to understand what Vault actually does when we try to unseal it. Since the logs gave no exact answer as to where things were failing, we decided to trace it step by step.

We found the Restore function, which runs exactly when Vault starts. In it, the leases are collected:

m.logger.Debug("collecting leases")
existing, leaseCount, err := m.collectLeases()
if err != nil {
	return err
}

These leases are then processed and all the associated tokens are dealt with: Vault takes each lease and refreshes its associated tokens across several goroutines.

// Distribute the collected keys to the workers in a go routine
wg.Add(1)
go func() {
	defer wg.Done()
	i := 0
	for ns := range existing {
		for _, leaseID := range existing[ns] {
			i++
			if i%500 == 0 {
				m.logger.Debug("leases loading", "progress", i)
			}

			select {
			case <-quit:
				return
			case <-m.quitCh:
				return
			default:
				broker <- &lease{
					namespace: ns,
					id:        leaseID,
				}
			}
		}
	}
	// Close the broker, causing worker routines to exit
	close(broker)
}()

The restore process itself can be seen in the processRestore function. In short, each lease is loaded from storage again, individually, and put under watch.

After analyzing the code, it became clear that the problem was in collectLeases, because that is where we got the error. Inside the function there is an abstraction for working with different storage backends (in our case, as you remember, etcd). We needed to figure out what exactly was happening there and which operations led to the error.

Vault talks to etcd over gRPC. collectLeases calls the List function, which has a common interface for all backends; for etcd its implementation boils down to a GET request:

etcdctl get /vault/sys/expire/id/auth/ 

That is, the driver tries to pull every record under the given path, and, as I wrote above, there are more than 400,000 of them. This is what was preventing Vault from unsealing.

As you may know, gRPC calls can be configured, including the amount of data they carry. For example, in Go there is a limit on the number of bytes that can be received:

func MaxCallRecvMsgSize(bytes int) CallOption

and which can be sent:

func MaxCallSendMsgSize(bytes int) CallOption

Both options take an int, and the effective ceiling is math.MaxInt32 (the 2147483647 from the error above). In essence, at some point we try to pull too much data in a single request and fail.
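For reference, this is where those limits surface when creating an etcd client in Go: the clientv3 config exposes the same gRPC call options, and math.MaxInt32 is the hard ceiling mentioned above. A sketch with a placeholder endpoint, not Vault's actual backend code:

package main

import (
	"log"
	"math"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:          []string{"https://etcd-1:2379"},
		DialTimeout:        5 * time.Second,
		MaxCallRecvMsgSize: math.MaxInt32, // cannot go higher than this
		MaxCallSendMsgSize: math.MaxInt32,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	_ = cli // this client may now receive responses up to ~2 GiB
}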

What did we come up with?

Our first thought was to raise the byte limit and restart etcd. But you cannot go beyond the int limit, and pulling all the data in one request is not the right approach anyway.

Another option was pagination – simple and reliable; it would avoid one huge query. But, as I wrote above, Vault first loads the leases and then processes them in several goroutines, and during processing it fetches the data from the database again, one record at a time. In other words, double work and redundant requests.

It looked like we did not need to download all the data at all – only the keys. The easiest check was to see how other storage drivers behave. In PostgreSQL, MySQL and all the rest, only the keys are fetched, not the full data.

In etcd's case, the GET method has an option for exactly this:

--keys-only[=false]		Get only the keys

It fetches only the keys, which is all we need. In our tests the volume of data shrank by roughly 20 times, and we of course stopped hitting the MaxCallSendMsgSize limit.

It is strange that the folks at HashiCorp missed this. I suspect it is because early versions of etcd had no --keys-only option, and the driver was evolved from those earlier versions.

Since Vault is an open-source product, we decided to open an issue and propose the change in a pull request. It was a “very complex” one, in the List function:

-	resp, err := c.etcd.Get(ctx, prefix, clientv3.WithPrefix())
+	resp, err := c.etcd.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithKeysOnly())

After a short discussion the pull request was accepted and is now in master.

We decided not to implement pagination for now: the data volume would have to grow by another 20 times before we hit the same problem, which gives us some headroom – by rough estimates, four to five years. I think we will send another pull request eventually, but that will be a different story.

What have we learned?

Of course, no one is immune from mistakes, and no matter what resources we have, there is always a chance of running into a problem. Nevertheless, we took several measures to minimize the risks.

  • We added unsealing Vault from problematic backups to our disaster-recovery drills.

  • We added a daily procedure that automatically deploys Vault from backups, to confirm that a fresh backup restores without problems.

  • We reduced the amount of data in Vault fourfold by lowering token TTLs and revoking long-lived tokens.

  • We contributed the necessary line of code upstream.

In general, to avoid repeating such mistakes you need to load-test and dig into how the system and its surrounding components actually behave. Even the Vault developers themselves needed some time to understand the problem and accept the proposed change.
