Testing Vault stability using Gremlin

Chaos engineering is an approach to testing application resilience. Roughly speaking, we deliberately break something in a system to see how it will behave, and from this experiment we draw useful conclusions about reliability and vulnerabilities.

We translated an article on how to apply this approach to HashiCorp Vault, a secrets management system.

What is HashiCorp Vault

HashiCorp Vault is an identity-based secrets management and encryption system. A secret is anything you want to restrict access to, such as API encryption keys, passwords, and certificates.
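For illustration, here is a minimal sketch of storing and reading a secret from the Vault CLI, assuming a dev-mode server with the default KV secrets engine mounted at secret/ (the path and values are placeholders):

    # Store a database password as a secret (hypothetical path and value)
    vault kv put secret/myapp/db password="s3cr3t"

    # Read it back; access is governed by Vault policies and tokens
    vault kv get secret/myapp/db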

HashiCorp Vault Architecture

Vault supports a multi-server mode, where multiple Vault servers are run to ensure high availability. High availability (HA) mode is enabled automatically when Vault uses a storage backend that supports it.

When running in HA mode, Vault servers are in one of two states: standby or active. Only one instance is active at any time; all other instances are hot standbys. Only the active server processes requests, while the standby servers redirect all requests to the active one. If the active server is sealed, fails, or loses network connectivity, one of the standby servers becomes active. The Vault service can continue to operate as long as a majority of the servers (a quorum) remain online. Read more about standby nodes in the HashiCorp documentation.
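As a quick sketch of what this looks like in practice (assuming VAULT_ADDR points at one of the cluster nodes), vault status shows whether the node you are talking to is active or a standby:

    # Check the HA role of the node we are connected to
    vault status

    # Typical fields in the output (values will differ for your cluster):
    #   HA Enabled             true
    #   HA Mode                standby        (or "active" on the leader)
    #   Active Node Address    https://10.0.1.10:8200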

What is chaos engineering?

Chaos engineering is the practice of finding reliability risks in systems by deliberately introducing faults. It helps identify weaknesses in systems, services, and architecture before a real failure occurs. With it you can improve availability, lower mean time to resolution (MTTR) and mean time to detection (MTTD), reduce the number of bugs that reach the product, and reduce the number of outages. For teams that frequently conduct Chaos engineering experiments, availability can reach 99.9%.

When conducting Chaos engineering experiments, you:

  • increase system performance and stability;

  • identify blind spots in monitoring, observability, and alerting;

  • check the stability of the system in case of failure;

  • study how systems cope with various failures;

  • help the engineering team prepare for real failures;

  • improve the architecture for handling failures.

You can learn more about Chaos engineering practices and tools in Slurm's article or by watching the webinar.

Chaos engineering and Vault

Because Vault stores and processes sensitive application secrets, it can become a target for attackers. If all Vault instances fail, applications that receive secrets from Vault will be unable to function. Any breach or unavailability of Vault could result in serious damage to an organization's business, reputation, and finances. Here are the main types of threats for Vault:

  • code and configuration changes that affect application performance;

  • loss of the leader node;

  • loss of quorum in the Vault cluster;

  • unavailability of the main cluster;

  • high load on Vault clusters.

To mitigate these risks, teams need to test and verify the resilience of Vault. This is where Chaos engineering comes to the rescue. Let's walk through experiments using Gremlin, a Chaos engineering platform.

The goal of Chaos engineering

Despite the name, the goal of Chaos engineering is not to create chaos but to reduce it: ultimately, you identify and fix problems. Chaos engineering is not random or uncontrolled testing. It is a methodical approach, so all experiments should be planned and carefully thought through. You must have a good understanding of when and how to stop an experiment, and how to monitor health checks and the state of your systems.

Remember that Chaos engineering is not an alternative to unit tests, integration tests, or performance benchmarking. It complements them and can be carried out in parallel. For example, simultaneous experiments in Chaos engineering and performance tests can help identify problems that only arise under load. This increases the likelihood of detecting reliability problems that may arise during operation.

Five stages of Chaos engineering

A Chaos engineering experiment consists of five main stages:

  1. Creating a Hypothesis. A hypothesis is an educated guess about how your system will behave under certain conditions; in other words, it is the expected reaction to a certain type of failure. For example, if Vault loses the leader node in a cluster of three nodes, Vault must continue to respond to requests and another node must be elected leader. When forming a hypothesis, start small and focus on one part of your system. This will make it easier to test that specific part without affecting others.

  2. Definition of Steady State. The steady state of a system is its performance and behavior under normal conditions. Determine the metrics that best reflect the reliability of your system and track them under normal conditions. This is the baseline against which you will compare the results of the experiment. Examples of steady-state metrics include vault.core.handle_login_request and vault.core.handle_request; additional key metrics can be found here. An example of pulling these metrics from Vault's own telemetry endpoint is shown after this list.

  3. Creating and conducting an experiment. At this stage, you define the parameters of the experiment: how will you test your hypothesis? For example, when testing the response time of the Vault application, you can simulate a slow connection by injecting latency.

Here you also need to define the conditions under which you will abort the experiment. For example, if the Vault application's latency exceeds the experiment's thresholds, you should stop it immediately. Note that an aborted experiment does not equal a failed experiment; it simply means that you have identified a reliability risk.

Once you have defined your experiment and abort conditions, you can create and run the experiment using Gremlin.

  4. Track results. During the experiment, track the key metrics of your application, compare them to the baseline, and draw conclusions about the test results. For example, a blackhole experiment in your Vault cluster might cause CPU usage to climb rapidly and API response times to become unacceptably long, or the web application might start returning HTTP 500 errors to users instead of clear error messages. In both cases, these are undesirable results that need to be addressed.

  5. Make changes and improvements. After analyzing the results and comparing the metrics, fix the problem. Make the necessary changes to the application or system, deploy them, and then verify that they fixed the problem by repeating the process. This way you will gradually increase the stability of the system, which is more effective than trying to make large changes to the entire application at once.
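For example, one way to capture a baseline for the metrics mentioned above is to scrape Vault's own metrics endpoint in Prometheus format. This is a sketch that assumes telemetry with Prometheus retention is enabled in the Vault configuration and that VAULT_TOKEN has access to sys/metrics:

    # Scrape Vault's built-in metrics endpoint and keep the request counters
    curl --silent \
         --header "X-Vault-Token: $VAULT_TOKEN" \
         "$VAULT_ADDR/v1/sys/metrics?format=prometheus" \
      | grep -E "vault_core_handle_(login_)?request"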

Implementation

This section describes four experiments to test the Vault cluster. Before you can perform these experiments, you will need a running three-node Vault cluster, the Gremlin agent installed on the cluster nodes, and a monitoring tool such as Datadog.

Experiment 1: the impact of losing a leader node

In the first experiment, you will test whether Vault can continue to respond to requests if the leader node becomes unavailable. If the active server is sealed, fails, or loses network connectivity, one of the standby Vault servers becomes the active instance. You will use a blackhole experiment to block network traffic to and from the leader node, and then monitor the health of the cluster.

Hypothesis

If Vault loses a leader node in a three-node cluster, then Vault must continue to respond to requests and another node must become the leader.

Determining Steady State Using a Monitoring Tool

Our steady state is based on three metrics:

  • the sum of all requests processed by Vault;

  • vault.core.handle_login_request;

  • vault.core.handle_request.

The graph below shows that the total number of requests fluctuates around 20 thousand, while handle_login_request and handle_request fluctuate between 1 and 3.

Conducting an experiment

In this experiment, a blackhole experiment is run against the leader node for 300 seconds (5 minutes). Blackhole experiments block network traffic to and from a host and are great for simulating any number of network failures, including misconfigured firewalls, network hardware failures, and so on. Five minutes is enough time to measure the impact and observe Vault's response.

In the screenshot you can see the current status of the experiment in Gremlin:

Observation

In this experiment, Datadog is used to track metrics. The graphs below show that Vault continues to respond to requests with negligible impact on throughput. This means that a standby Vault node has become the leader:

You can verify this by checking the nodes in the cluster using the vault operator raft command:
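A representative example is shown below; the node names and addresses are illustrative, and your output will differ:

    # List the raft peers and their current roles
    vault operator raft list-peers

    # Example output after the original leader was blackholed:
    # Node       Address            State       Voter
    # ----       -------            -----       -----
    # vault-1    10.0.1.10:8201     follower    true
    # vault-2    10.0.1.11:8201     leader      true
    # vault-3    10.0.1.12:8201     follower    true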

Improving Cluster Resilience

Based on these results, no immediate changes are required, but there is room to expand the scope of this test. What happens if two nodes fail? Or all three? If this concerns your team, try repeating the experiment and taking out a few more nodes. You can also try scaling the cluster to four nodes instead of three and see how that changes the results. Don't forget that Gremlin has a Halt button to stop the current experiment if something unexpected happens. Keep your abort conditions in mind and don't be afraid to stop the experiment if they are met.

Experiment 2: Impact of Quorum Loss

The next experiment tests whether Vault can continue to respond to requests without quorum. To do this, a blackhole experiment will disconnect two of the three nodes from the network. In this scenario, Vault will not be able to add or remove a node or commit additional log entries. This guide from HashiCorp describes the steps required to restore the cluster, and this experiment will help test them.

Hypothesis

If Vault loses quorum, it will stop responding to requests, but by following the instructions we should be able to restore the cluster in a reasonable time.

Vault Steady State Definition

In this experiment, we will simply test whether Vault is responsive. To check, we will read a key from Vault, as in the example below.
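For example (the secret path and field below are placeholders for whatever test secret you store):

    # Write a test secret once, before the experiment starts
    vault kv put secret/chaos/test probe="ok"

    # During the experiment, read it back to confirm Vault is responsive
    vault kv get -field=probe secret/chaos/test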

Conducting an experiment

Use Gremlin to run another blackhole experiment, this time targeting two of the cluster's nodes.

Observation

Now that the nodes are down, the Vault cluster has lost quorum. Without quorum, reads and writes cannot occur within the cluster, and trying to read the same key returns an error.

Restoration and improvements

Follow this guide from HashiCorp to recover from the loss of two of the three Vault nodes. To do this, you convert the surviving node into a single-node cluster. It takes a few minutes to bring the cluster back online, but this works as a temporary measure.
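As a rough sketch of that procedure (the data directory path, node ID, and address are placeholders; follow the HashiCorp guide for the exact steps for your version), the surviving node is restarted with a peers.json file that lists only itself:

    # On the surviving node: stop Vault and write a peers.json that lists
    # only this node, so raft can recover with a single member
    sudo systemctl stop vault

    cat > /opt/vault/data/raft/peers.json <<'EOF'
    [
      {
        "id": "vault-1",
        "address": "10.0.1.10:8201",
        "non_voter": false
      }
    ]
    EOF

    # Restart and unseal; the node comes back as a single-node cluster
    sudo systemctl start vault
    vault operator unseal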

A long-term solution may be to implement a multi-data center deployment where you can replicate data. This will improve performance and provide disaster recovery (DR). HashiCorp recommends using DR clusters to avoid downtime and meet service level agreements (SLAs).

Experiment 3: Testing How Vault Handles Latency

The following experiment tests Vault's ability to handle high-latency, low-bandwidth network connections. You will inject latency on the leader node of your cluster and then monitor the request metrics. This will give you an idea of how latency affects Vault's functionality.

Hypothesis

Introducing a delay on the cluster leader node should not cause application timeouts or cluster failures.

Determining key performance indicators (KPIs) from the monitoring tool

This experiment uses the same Datadog metrics as the first experiment: vault.core.handle_login_request and vault.core.handle_request.

Conducting an experiment

This time, use Gremlin to add latency. Instead of running a single experiment, create a Scenario that runs multiple experiments in sequence: gradually increase the latency from 100 ms to 200 ms over 4 minutes, with 5-second breaks between experiments. (This post on Gremlin's blog explains how the latency attack works.)
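While the experiments run, a simple probe like the sketch below can record how response times change; it assumes VAULT_ADDR points at the cluster and uses Vault's /v1/sys/health endpoint:

    # Poll Vault's health endpoint every 5 seconds and log the latency
    while true; do
      curl --silent --output /dev/null \
           --write-out "$(date +%T)  %{time_total}s\n" \
           "$VAULT_ADDR/v1/sys/health"
      sleep 5
    done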

Observation

In our test, the experiment caused some increase in response time, especially at the 95th and 99th percentiles, but all requests were successful. More importantly, the cluster remained stable, as evidenced by the key metrics below:

Improving Cluster Resilience

To make the cluster even more resilient, add non-voting nodes to it. A non-voting node has all of Vault's data but has no influence on quorum calculations. Non-voting nodes can be used together with performance standby nodes to scale cluster reads, which is useful when a large volume of read operations is required. This way, if one or two nodes fail, additional standby nodes can step in and maintain performance.
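Note that non-voting nodes and performance standbys are Vault Enterprise features. As a sketch, an additional node can be joined to the raft cluster without voting rights like this (the leader address is a placeholder):

    # Join an extra node as a non-voter (-non-voter requires Vault Enterprise)
    vault operator raft join -non-voter https://vault-active.example.com:8200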

Experiment 4: Testing How Vault Handles Out of Memory

This final experiment tests Vault's ability to handle reads when memory is low.

Hypothesis

If memory is exhausted on the leader node of a Vault cluster, applications should switch to reading from the performance standby nodes, and performance should not suffer.

Defining indicators from a monitoring tool

For this experiment, the graphs below are built from telemetry collected directly from the Vault nodes, specifically the memory allocated and used by Vault.
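If you scrape Vault's metrics endpoint as in the earlier example (same assumptions about telemetry being enabled; the gauge names follow Vault's telemetry naming and may differ slightly by version), the runtime memory gauges are the ones to watch:

    # Pull just the Go runtime memory gauges from Vault's metrics endpoint
    curl --silent \
         --header "X-Vault-Token: $VAULT_TOKEN" \
         "$VAULT_ADDR/v1/sys/metrics?format=prometheus" \
      | grep -E "vault_runtime_(alloc|sys)_bytes"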

Conducting an experiment

Run a memory experiment: consume 99% of the memory on the leader node for 5 minutes. This pushes the leader to its limit and keeps it there until the experiment ends or is aborted.

Observation

In this example, the leader node continued to run, and although there were slight delays in response time, all requests were successful. This can be seen in the graph below. Thus, our cluster tolerates high memory load well.

Improving Cluster Resilience

As in the previous experiment, you can use non-voting nodes and standby nodes to add computing power to the cluster as needed. These nodes add additional memory, but do not affect quorum calculation.

How to build a culture of using Chaos engineering

Teams tend to think about reliability in terms of technology and systems, but there's another side to it—reliability starts with people. To start building a culture of reliability, you need to teach application developers, SRE engineers, incident responders, and other team members to think proactively about reliability.

In a culture of reliability, every member of the organization works to maximize the availability of services, processes and people, reduce the risk of downtime and achieve rapid response to incidents. This culture ultimately focuses on one goal: providing the best possible customer experience. Here are some general tips for building a culture of reliability:

1. Introduce other teams to the concept of Chaos engineering.

2. Show the value of Chaos engineering in your team (you can use the results of these experiments as evidence).

3. Encourage teams to focus on reliability early in the software development life cycle, not just at the end.

4. Create a culture within the team that encourages experimentation and learning rather than assigning blame for incidents.

5. Implement new tools and methods of Chaos engineering.

6. Use Chaos engineering to regularly test systems and processes, automate experiments, and conduct organized team reliability events (often called “Game Days”).


If you want to learn Chaos engineering practices, come to Slurm's Chaos engineering course. You'll learn how to formulate hypotheses and run experiments on your own using ChaosBlade and ChaosMesh, as well as other tools like Gremlin.
