Retry and Circuit Breaker in Kubernetes with Istio and Spring Boot

Every service mesh framework needs to be able to handle failures in communication between services, including timeouts and HTTP error codes. In this article I will show how to configure retries and a circuit breaker using Istio. We will analyze the communication between two simple Spring Boot services deployed on Kubernetes. But instead of covering the basics, let's look at some more advanced scenarios.

To demonstrate Istio with Spring Boot, I created a GitHub repository with two services: callme-service and caller-service.

Architecture

The system architecture is very similar to the one discussed in my previous article, “Service mesh on Kubernetes with Istio and Spring Boot”, but with some differences. We are not injecting errors or delays through Istio components, but directly in the source code of the services. Why? This way we can apply the rules to callme-service directly, rather than on the client side. We will also run two pods of callme-service v2 to check how the circuit breaker works with multiple pods of the same Deployment.
This is what the architecture looks like:

Spring Boot Services

Let's start with the implementation of the services. callme-service exposes two endpoints that return its version and instance ID. A call to GET /ping-with-random-error returns an HTTP 504 error for roughly half of the requests, and GET /ping-with-random-delay responds with a random delay in the range of 0 to 3 s. Here is the @RestController implementation on the callme-service side:

@RestController
@RequestMapping("/callme")
public class CallmeController {

    private static final Logger LOGGER = LoggerFactory.getLogger(CallmeController.class);
    private static final String INSTANCE_ID = UUID.randomUUID().toString();
    private Random random = new Random();

    @Autowired
    BuildProperties buildProperties;
    @Value("${VERSION}")
    private String version;

    @GetMapping("/ping-with-random-error")
    public ResponseEntity<String> pingWithRandomError() {
        int r = random.nextInt(100);
        if (r % 2 == 0) {
            LOGGER.info("Ping with random error: name={}, version={}, random={}, httpCode={}",
                    buildProperties.getName(), version, r, HttpStatus.GATEWAY_TIMEOUT);
            return new ResponseEntity<>("Surprise " + INSTANCE_ID + " " + version, HttpStatus.GATEWAY_TIMEOUT);
        } else {
            LOGGER.info("Ping with random error: name={}, version={}, random={}, httpCode={}",
                    buildProperties.getName(), version, r, HttpStatus.OK);
            return new ResponseEntity<>("I'm callme-service " + INSTANCE_ID + " " + version, HttpStatus.OK);
        }
    }

    @GetMapping("/ping-with-random-delay")
    public String pingWithRandomDelay() throws InterruptedException {
        int r = random.nextInt(3000);
        LOGGER.info("Ping with random delay: name={}, version={}, delay={}", buildProperties.getName(), version, r);
        Thread.sleep(r);
        return "I'm callme-service " + version;
    }

}

The caller-service also exposes two GET endpoints. It uses RestTemplate to call the corresponding GET endpoints of callme-service. The service also returns the caller-service version. It has only a single Deployment, labeled version=v1.

@RestController
@RequestMapping("/caller")
public class CallerController {

    private static final Logger LOGGER = LoggerFactory.getLogger(CallerController.class);

    @Autowired
    BuildProperties buildProperties;
    @Autowired
    RestTemplate restTemplate;
    @Value("${VERSION}")
    private String version;

    @GetMapping("/ping-with-random-error")
    public ResponseEntity<String> pingWithRandomError() {
        LOGGER.info("Ping with random error: name={}, version={}", buildProperties.getName(), version);
        ResponseEntity<String> responseEntity =
                restTemplate.getForEntity("http://callme-service:8080/callme/ping-with-random-error", String.class);
        LOGGER.info("Calling: responseCode={}, response={}", responseEntity.getStatusCode(), responseEntity.getBody());
        return new ResponseEntity<>("I'm caller-service " + version + ". Calling... " + responseEntity.getBody(), responseEntity.getStatusCode());
    }

    @GetMapping("/ping-with-random-delay")
    public String pingWithRandomDelay() {
        LOGGER.info("Ping with random delay: name={}, version={}", buildProperties.getName(), version);
        String response = restTemplate.getForObject("http://callme-service:8080/callme/ping-with-random-delay", String.class);
        LOGGER.info("Calling: response={}", response);
        return "I'm caller-service " + version + ". Calling... " + response;
    }

}
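
Note that Spring Boot does not auto-configure a RestTemplate bean, so the application has to declare one for the @Autowired injection above to work. A minimal sketch of such a configuration class (the class name here is illustrative, not necessarily what the repository uses):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class AppConfig {

    // Expose a RestTemplate bean so it can be injected into CallerController.
    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}
```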

Istio Retries Handling

The DestinationRule definition in Istio is the same as in my previous article: it creates two subsets for the pods labeled version=v1 and version=v2. Retries and timeouts can be configured in the VirtualService. We can set the number of retry attempts and the conditions under which they are executed (as a list of enum strings). The configuration below also sets a timeout of 3 seconds for the whole request. Both of these settings are available inside the HTTPRoute object. We also need to set the timeout for a single attempt, which I set to 1 s. How does it work in practice? Let's look at a simple example:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: callme-service-destination
spec:
  host: callme-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
      - destination:
          host: callme-service
          subset: v2
        weight: 80
      - destination:
          host: callme-service
          subset: v1
        weight: 20
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: 5xx
      timeout: 3s

Before deploying the services, we need to enable access logging. This is easy to do in Istio: the Envoy proxies will then print logs for all incoming requests and outgoing responses. Analyzing these records is especially helpful for identifying retries.

$ istioctl manifest apply --set profile=default --set meshConfig.accessLogFile="/dev/stdout"

Let's send a test request to GET /caller/ping-with-random-delay. It calls the GET /callme/ping-with-random-delay endpoint of callme-service, which responds with a random delay. Here are the request and the response:

Everything seems clear, but let's see what is going on under the hood. I have highlighted the sequence of retries. As you can see, Istio made two retries, because the first two calls took longer than the one second specified in perTryTimeout. Both of them were timed out by Istio, which can be seen in the access log. The third attempt was successful, because it took about 400 ms to process.
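
To build intuition for how perTryTimeout interacts with the retry budget and the overall timeout, here is a plain-Java sketch that approximates this behavior. This is a simplified model for illustration only, not Istio's actual implementation:

```java
import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class RetryDemo {

    // Approximates Istio's policy: an initial try plus up to `attempts` retries,
    // each bounded by perTryTimeout, the whole sequence bounded by overallTimeout.
    public static String callWithRetries(Supplier<String> call, int attempts,
                                         Duration perTryTimeout, Duration overallTimeout) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        long deadline = System.nanoTime() + overallTimeout.toNanos();
        try {
            for (int i = 0; i <= attempts && System.nanoTime() < deadline; i++) {
                Future<String> f = pool.submit(call::get);
                try {
                    return f.get(perTryTimeout.toMillis(), TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException | InterruptedException e) {
                    f.cancel(true); // this attempt failed or timed out; retry if budget remains
                }
            }
            throw new IllegalStateException("all attempts failed");
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated upstream: the first two calls take ~1.5 s, the third ~400 ms.
        int[] calls = {0};
        Supplier<String> upstream = () -> {
            int n = ++calls[0];
            try {
                Thread.sleep(n < 3 ? 1500 : 400);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            return "ok on try " + n;
        };
        System.out.println(callWithRetries(upstream, 3, Duration.ofSeconds(1), Duration.ofSeconds(3)));
    }
}
```

With these settings the first two tries are abandoned after one second each, and the third one succeeds well within the remaining 3-second budget, which matches what we saw in the access log.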

Retrying on timeouts is not the only capability of this mechanism in Istio. We can also retry on 5xx and 4xx response codes. Retries on error codes alone are much easier to test in a VirtualService, because we do not need to configure any timeouts.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
      - destination:
          host: callme-service
          subset: v2
        weight: 80
      - destination:
          host: callme-service
          subset: v1
        weight: 20
      retries:
        attempts: 3
        retryOn: gateway-error,connect-failure,refused-stream

Let's call GET /caller/ping-with-random-error, which in turn calls the GET /callme/ping-with-random-error endpoint of callme-service. It returns HTTP 504 for roughly half of the incoming requests. Here is the request and a successful response with code 200 OK:

And here is the log showing what is happening on the callme-service side. There were two retries, because the first two calls returned an error code.

Circuit breaker in Istio

The circuit breaker is configured in the DestinationRule object, through its TrafficPolicy. Since we will not use the retries from the previous example, they need to be removed from the VirtualService definition. Retries should also be disabled in the connectionPool section inside the TrafficPolicy. And now the most important part: to configure the circuit breaker in Istio, we use the OutlierDetection object. The circuit-breaking mechanism is based on consecutive errors returned by the called service. The number of errors can be set with the consecutive5xxErrors or consecutiveGatewayErrors property. They differ only in the set of errors they handle: consecutiveGatewayErrors counts only 502, 503 and 504, while consecutive5xxErrors applies to all 5xx codes. In the callme-service-destination configuration below, I set consecutive5xxErrors to 3. This means that after three consecutive errors, the pod is removed from load balancing for one minute (baseEjectionTime: 1m). Since we run two pods of callme-service v2, we also need to override the default value of maxEjectionPercent, which is 10%, to 100%. This property defines the maximum share of hosts in the load balancing pool that can be ejected.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: callme-service-destination
spec:
  host: callme-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
        maxRetries: 0
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 1m
      maxEjectionPercent: 100
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
      - destination:
          host: callme-service
          subset: v2
        weight: 80
      - destination:
          host: callme-service
          subset: v1
        weight: 20
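
The ejection logic described above can be modeled in a few lines of plain Java. This is a deliberately simplified sketch of outlier detection for a single host; the real Envoy implementation additionally scans hosts on an interval and grows the ejection time with each successive ejection:

```java
import java.time.Duration;
import java.time.Instant;

// Simplified model of Istio-style outlier detection for one host:
// after `consecutive5xxErrors` errors in a row, the host is ejected
// from load balancing for `baseEjectionTime`.
public class OutlierDetector {

    private final int consecutive5xxErrors;
    private final Duration baseEjectionTime;
    private int errorStreak = 0;
    private Instant ejectedUntil = Instant.MIN;

    public OutlierDetector(int consecutive5xxErrors, Duration baseEjectionTime) {
        this.consecutive5xxErrors = consecutive5xxErrors;
        this.baseEjectionTime = baseEjectionTime;
    }

    // Record the HTTP status returned by the host.
    public void record(int status, Instant now) {
        if (status >= 500) {
            errorStreak++;
            if (errorStreak >= consecutive5xxErrors) {
                ejectedUntil = now.plus(baseEjectionTime); // eject the host
                errorStreak = 0;
            }
        } else {
            errorStreak = 0; // a successful response resets the streak
        }
    }

    // Is the host currently eligible to receive traffic?
    public boolean available(Instant now) {
        return now.isAfter(ejectedUntil);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        OutlierDetector host = new OutlierDetector(3, Duration.ofMinutes(1));
        host.record(504, now);
        host.record(504, now);
        host.record(504, now);
        System.out.println("available after three 504s: " + host.available(now));
    }
}
```

With consecutive5xxErrors set to 3 and baseEjectionTime set to 1m, as in the DestinationRule above, three 504 responses in a row make the host unavailable until a minute has passed.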

The fastest way to deploy both services is with Jib and Skaffold. First, go to the callme-service directory and run the skaffold dev command with the optional --port-forward parameter:

$ cd callme-service
$ skaffold dev --port-forward

Then do the same for caller-service:

$ cd caller-service
$ skaffold dev --port-forward

Before sending test requests, let's start a second pod of callme-service v2, since the Deployment sets the replicas parameter to 1. To do this, run the following command:

$ kubectl scale --replicas=2 deployment/callme-service-v2

Let's check the status of the deployment in Kubernetes: there are three Deployments and two running pods of callme-service-v2.

Now we can start testing. Let's call the GET /caller/ping-with-random-error endpoint of caller-service, which accesses the GET /callme/ping-with-random-error endpoint of callme-service. As a reminder, it returns an HTTP 504 error for about half of the requests. I have already set up port forwarding to port 8080 for caller-service, so the call looks like this:

$ curl http://localhost:8080/caller/ping-with-random-error

Let's analyze the response. I have highlighted the error responses coming from the callme-service v2 pod with ID 98c068bb-8d02-4d2a-9999-23951bbed6ad. After three consecutive error responses from that pod, it was immediately removed from the load balancing pool, and as a result all subsequent requests were sent to the second callme-service v2 pod with ID 00653617-58e1-4d59-9e36-3f98f9d403b8. Of course, there is also a single callme-service v1 pod, which receives 20% of all requests from caller-service.

Let's see what happens if the single callme-service v1 pod returns three errors in a row. I have highlighted those responses in the screenshot. Since it is the only pod in its subset, there is nowhere else to redirect the incoming traffic, so Istio returns HTTP 503 for the next request sent to callme-service v1. The same response is repeated for the next minute, because the circuit is still open.
