Retry and Circuit Breaker in Kubernetes with Istio and Spring Boot
Every service mesh framework needs to handle failures in interservice communication, including timeouts and HTTP error codes. In this article I will show how to configure two such mechanisms in Istio: retries and the circuit breaker. We will analyze the interaction between two simple Spring Boot services deployed on Kubernetes. Instead of covering the basics, let's look at some more complex issues.
To demonstrate Istio with Spring Boot, I created a GitHub repository with two services: callme-service and caller-service.
Architecture
The system architecture is very similar to the one discussed in my previous article, “Service mesh on Kubernetes with Istio and Spring Boot”, but with some differences. We inject errors and delays not with Istio components, but directly in the source code of the service. Why? This way we can apply the rules to callme-service itself rather than on the client side. We will also run two pods of callme-service v2 to check how the circuit breaker works with multiple pods of the same Deployment.
This is what the architecture looks like:
Spring Boot Services
Let's start by implementing the services. callme-service exposes two endpoints that return its version and instance ID. A call to GET /ping-with-random-error returns an HTTP 504 error for roughly half of the requests, while GET /ping-with-random-delay responds with a random delay in the range of 0–3 s. Here is the implementation of the @RestController on the callme-service side:
@RestController
@RequestMapping("/callme")
public class CallmeController {

    private static final Logger LOGGER = LoggerFactory.getLogger(CallmeController.class);
    private static final String INSTANCE_ID = UUID.randomUUID().toString();

    private final Random random = new Random();

    @Autowired
    BuildProperties buildProperties;
    @Value("${VERSION}")
    private String version;

    @GetMapping("/ping-with-random-error")
    public ResponseEntity<String> pingWithRandomError() {
        int r = random.nextInt(100);
        if (r % 2 == 0) {
            LOGGER.info("Ping with random error: name={}, version={}, random={}, httpCode={}",
                    buildProperties.getName(), version, r, HttpStatus.GATEWAY_TIMEOUT);
            return new ResponseEntity<>("Surprise " + INSTANCE_ID + " " + version, HttpStatus.GATEWAY_TIMEOUT);
        } else {
            LOGGER.info("Ping with random error: name={}, version={}, random={}, httpCode={}",
                    buildProperties.getName(), version, r, HttpStatus.OK);
            return new ResponseEntity<>("I'm callme-service " + INSTANCE_ID + " " + version, HttpStatus.OK);
        }
    }

    @GetMapping("/ping-with-random-delay")
    public String pingWithRandomDelay() throws InterruptedException {
        int r = random.nextInt(3000);
        LOGGER.info("Ping with random delay: name={}, version={}, delay={}", buildProperties.getName(), version, r);
        Thread.sleep(r);
        return "I'm callme-service " + version;
    }
}
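The version field above is injected from a VERSION environment variable, so it has to be set in the pod spec. A minimal sketch of the relevant Deployment fragment (names assumed here, the actual manifest is in the repository) might look like:

```yaml
# Fragment of the callme-service v1 Deployment (names assumed):
spec:
  template:
    spec:
      containers:
        - name: callme-service
          env:
            - name: VERSION   # read by @Value("${VERSION}") in the controller
              value: "v1"
```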
caller-service also exposes two GET endpoints. It uses RestTemplate to call the corresponding GET endpoints of callme-service. The service also returns the caller-service version; it has a single Deployment, labeled version=v1.
@RestController
@RequestMapping("/caller")
public class CallerController {

    private static final Logger LOGGER = LoggerFactory.getLogger(CallerController.class);

    @Autowired
    BuildProperties buildProperties;
    @Autowired
    RestTemplate restTemplate;
    @Value("${VERSION}")
    private String version;

    @GetMapping("/ping-with-random-error")
    public ResponseEntity<String> pingWithRandomError() {
        LOGGER.info("Ping with random error: name={}, version={}", buildProperties.getName(), version);
        ResponseEntity<String> responseEntity =
                restTemplate.getForEntity("http://callme-service:8080/callme/ping-with-random-error", String.class);
        LOGGER.info("Calling: responseCode={}, response={}", responseEntity.getStatusCode(), responseEntity.getBody());
        return new ResponseEntity<>("I'm caller-service " + version + ". Calling... " + responseEntity.getBody(),
                responseEntity.getStatusCode());
    }

    @GetMapping("/ping-with-random-delay")
    public String pingWithRandomDelay() {
        LOGGER.info("Ping with random delay: name={}, version={}", buildProperties.getName(), version);
        String response = restTemplate.getForObject("http://callme-service:8080/callme/ping-with-random-delay", String.class);
        LOGGER.info("Calling: response={}", response);
        return "I'm caller-service " + version + ". Calling... " + response;
    }
}
Istio Retries Handling
The DestinationRule definition in Istio is the same as in my previous article: it creates two subsets for pods labeled version=v1 and version=v2. Retries and timeouts are configured in the VirtualService. We can set the number of retry attempts and the conditions under which they are performed (as a list of enum-style strings). The configuration below also sets a 3-second timeout for the whole request; both settings live inside the HTTPRoute object. We also need to set the timeout for a single attempt, which I set to 1 s. How does it work in practice? Let's look at a simple example:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: callme-service-destination
spec:
  host: callme-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
        - destination:
            host: callme-service
            subset: v2
          weight: 80
        - destination:
            host: callme-service
            subset: v1
          weight: 20
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: 5xx
      timeout: 3s
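For the subsets above to select any pods, the Deployment pod template must carry the matching version label. A sketch of the relevant fragment for the v2 Deployment (label names assumed to follow the usual convention):

```yaml
# Fragment of the callme-service v2 Deployment pod template (names assumed):
spec:
  template:
    metadata:
      labels:
        app: callme-service
        version: v2   # matched by the "v2" subset in the DestinationRule
```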
Before deploying the services, we need to raise the logging level. We can easily enable access logs in Istio; Envoy proxies will then log all incoming requests and outgoing responses. Analyzing these records is especially helpful for identifying retries.
$ istioctl manifest apply --set profile=default --set meshConfig.accessLogFile="/dev/stdout"
Let's send a test request to GET /caller/ping-with-random-delay, which calls the randomly delayed GET /callme/ping-with-random-delay endpoint of callme-service. Here are the request and the response:
Everything seems clear, but let's see what is going on under the hood. I have highlighted the sequence of retries. As you can see, Istio made two retries, because two calls took longer than the one second specified in perTryTimeout. Both of them were timed out by Istio, which can be seen in the access log. The third attempt was successful, because it took about 400 ms to process.
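As a back-of-envelope check (my own estimate, not a measurement from the test run): with the delay uniform in 0–3000 ms and a perTryTimeout of 1 s, each try times out with probability about 2/3, and with the 3 s overall timeout roughly three tries fit into the budget, so a request should still fail about 30% of the time:

```java
public class RetryTimeoutEstimate {
    public static void main(String[] args) {
        // Delay is uniform in [0, 3000) ms; a try fails if it exceeds the 1000 ms perTryTimeout
        double pTimeoutPerTry = 2000.0 / 3000.0;
        // With timeout: 3s and perTryTimeout: 1s, roughly three tries fit into the overall budget
        double pRequestFails = Math.pow(pTimeoutPerTry, 3);
        System.out.printf("P(single try times out) = %.3f%n", pTimeoutPerTry);
        System.out.printf("P(request fails despite retries) = %.3f%n", pRequestFails);
    }
}
```

This matches what we see in the log: most requests eventually succeed, but some still exhaust all attempts.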
Retries on timeouts are not the only capability of this mechanism in Istio. We can also trigger retries on any 5xx or 4xx response codes. With the VirtualService below it is much easier to test retries on error codes alone, because we don't need to configure any timeouts.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
        - destination:
            host: callme-service
            subset: v2
          weight: 80
        - destination:
            host: callme-service
            subset: v1
          weight: 20
      retries:
        attempts: 3
        retryOn: gateway-error,connect-failure,refused-stream
Let's call GET /caller/ping-with-random-error, which in turn calls the GET /callme/ping-with-random-error endpoint of callme-service. It returns HTTP 504 for about half of the incoming requests. Here are the request and the successful 200 OK response:
And here is the log showing what happened on the callme-service side. There were two retries, because the first two calls returned an error code.
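It is worth noting why retries help so much here. If roughly half of the calls fail and the failures are independent, then with attempts: 3 (one initial call plus up to three retries) the overall failure rate drops from 50% to about 6% — a rough estimate of mine, not a figure from the article:

```java
public class RetrySuccessEstimate {
    public static void main(String[] args) {
        double pError = 0.5;   // callme-service returns HTTP 504 for about half of the requests
        int retries = 3;       // retries.attempts in the VirtualService
        // The request fails only if the initial call and all retries fail
        double pRequestFails = Math.pow(pError, retries + 1);
        System.out.printf("P(request fails) = %.4f%n", pRequestFails);       // 0.5^4
        System.out.printf("P(request succeeds) = %.4f%n", 1 - pRequestFails);
    }
}
```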
Circuit Breaker in Istio
The circuit breaker is configured in the DestinationRule, using a TrafficPolicy. We are not using the retries from the previous example, so they need to be removed from the VirtualService definition. We should also disable any retry settings in the connectionPool inside TrafficPolicy. And now the most important part: to configure the circuit breaker in Istio, we use the OutlierDetection object. The circuit-breaking mechanism is based on consecutive errors returned by the target service. The number of errors can be set with the consecutive5xxErrors or consecutiveGatewayErrors property; they differ only in the set of errors they handle. consecutiveGatewayErrors handles only 502, 503 and 504, while consecutive5xxErrors applies to all 5xx codes. In the callme-service-destination configuration below, I set consecutive5xxErrors to 3. This means that after three consecutive errors, the pod is removed from load balancing for one minute (baseEjectionTime=1m). Since we are running two pods of callme-service version v2, we also need to override maxEjectionPercent to 100%; its default value of 10% is the maximum share of hosts in the load-balancing pool that can be ejected.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: callme-service-destination
spec:
  host: callme-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
        maxRetries: 0
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 1m
      maxEjectionPercent: 100
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
        - destination:
            host: callme-service
            subset: v2
          weight: 80
        - destination:
            host: callme-service
            subset: v1
          weight: 20
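To make the semantics of consecutive5xxErrors concrete, here is a plain-Java sketch of the ejection rule. It is a simplification of mine, not Envoy's actual implementation: the real outlier detection also sweeps hosts on every interval and returns them to the pool after baseEjectionTime.

```java
import java.util.HashMap;
import java.util.Map;

public class OutlierDetectionSketch {
    static final int CONSECUTIVE_5XX_ERRORS = 3;   // as in the DestinationRule above

    private final Map<String, Integer> consecutiveErrors = new HashMap<>();
    private final Map<String, Boolean> ejected = new HashMap<>();

    // Record a response code for a host; eject it after three consecutive 5xx responses
    void onResponse(String host, int httpStatus) {
        if (httpStatus >= 500) {
            int errors = consecutiveErrors.merge(host, 1, Integer::sum);
            if (errors >= CONSECUTIVE_5XX_ERRORS) {
                ejected.put(host, true);   // removed from load balancing for baseEjectionTime
            }
        } else {
            consecutiveErrors.put(host, 0);   // any success resets the counter
        }
    }

    boolean isEjected(String host) {
        return ejected.getOrDefault(host, false);
    }

    public static void main(String[] args) {
        OutlierDetectionSketch sketch = new OutlierDetectionSketch();
        sketch.onResponse("callme-v2-pod-1", 504);
        sketch.onResponse("callme-v2-pod-1", 200);   // success resets the counter
        sketch.onResponse("callme-v2-pod-1", 504);
        sketch.onResponse("callme-v2-pod-1", 504);
        System.out.println(sketch.isEjected("callme-v2-pod-1"));   // false: only two in a row
        sketch.onResponse("callme-v2-pod-1", 504);
        System.out.println(sketch.isEjected("callme-v2-pod-1"));   // true: three consecutive 5xx
    }
}
```

Note that a single successful response resets the counter, which is why the errors must be strictly consecutive to open the circuit.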
The fastest way to deploy both services is with Jib and Skaffold. First, go to the callme-service directory and run skaffold dev with the optional --port-forward parameter:
$ cd callme-service
$ skaffold dev --port-forward
Then do the same for caller-service:
$ cd caller-service
$ skaffold dev --port-forward
Before sending test requests, let's start a second pod of callme-service version v2, since the Deployment sets the replicas parameter to 1. To do this, run:
$ kubectl scale --replicas=2 deployment/callme-service-v2
Let's check the deployment status in Kubernetes: three Deployments, with two running pods of callme-service-v2.
Now we can test. Let's call GET /caller/ping-with-random-error of caller-service, which accesses the GET /callme/ping-with-random-error endpoint of callme-service. As a reminder, it returns an HTTP 504 error for half of the requests. I have already enabled port forwarding to port 8080 for caller-service, so the call looks like this:
curl http://localhost:8080/caller/ping-with-random-error
Let's analyze the responses. I have highlighted the error responses returned by the callme-service v2 pod with ID 98c068bb-8d02-4d2a-9999-23951bbed6ad. After three consecutive error responses from that pod, it was immediately removed from the load-balancing pool, and as a result all subsequent requests were sent to the second callme-service v2 pod, with ID 00653617-58e1-4d59-9e36-3f98f9d403b8. Of course, there is also a single callme-service v1 pod, which receives 20% of all requests from caller-service.
Let's see what happens if the only callme-service v1 pod returns three errors in a row; I have highlighted such responses in the screenshot. Since there is only one pod, there is nowhere to redirect the incoming traffic, so Istio returns HTTP 503 on the next request to callme-service v1. The same response is repeated for the next minute, because the circuit is still open.