Istio Tracing and Monitoring: Microservices and the Uncertainty Principle

The Heisenberg uncertainty principle states that the position and the speed (strictly, the momentum) of an object cannot both be measured precisely at the same time. Put loosely: if you know exactly how fast an object is moving, you cannot say where it is, and if you know exactly where it is, you cannot say how fast it is moving.

Microservices running on the Red Hat OpenShift platform (and therefore on Kubernetes) are a different matter: thanks to the right open-source software, they can report both their performance and their health at the same time. This does not refute old Heisenberg, of course, but it does remove the uncertainty from working with cloud applications. Istio makes it easy to set up tracing and monitoring for such applications so that everything stays under control.

Defining the terms

By tracing we mean logging system activity. That sounds rather general, but one of the main rules here is to dump trace data into suitable storage without worrying about how it is formatted; all the work of searching and analyzing the data is left to its consumer. Istio uses the Jaeger tracing system, which implements the OpenTracing data model.

Traces are the data that fully describes the progress of a request or unit of work from start to finish: for example, everything that happens from the moment a user presses a button on a web page to the moment the data comes back, including all the microservices involved. One trace fully describes (or models) the round trip of a request. In the Jaeger interface, traces are broken down into components along the time axis, much as a chain can be broken down into separate links. Instead of links, however, a trace consists of spans.

A span is the interval from the start of a unit of work to its completion. To continue the analogy, each span is a separate link in the chain. A span may have no child spans, or one or more of them. The top-level span (the root span) therefore has the same total duration as the trace it belongs to.
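The relationship between a trace, its root span and its child spans can be sketched in a few lines of Java. This is only an illustrative model, not Jaeger's or OpenTracing's actual API: the class name SpanSketch, its fields and the timings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified span: a named unit of work with start and end times in microseconds. */
final class SpanSketch {
    final String operation;
    final long startMicros;
    final long endMicros;
    final List<SpanSketch> children = new ArrayList<>();

    SpanSketch(String operation, long startMicros, long endMicros) {
        if (endMicros < startMicros) {
            throw new IllegalArgumentException("a span cannot end before it starts");
        }
        this.operation = operation;
        this.startMicros = startMicros;
        this.endMicros = endMicros;
    }

    /** Adds a child span (a nested unit of work) and returns it so that calls can be chained. */
    SpanSketch addChild(String operation, long startMicros, long endMicros) {
        SpanSketch child = new SpanSketch(operation, startMicros, endMicros);
        children.add(child);
        return child;
    }

    long durationMicros() {
        return endMicros - startMicros;
    }

    /** Number of spans in the subtree rooted at this span, including this span itself. */
    int spanCount() {
        int count = 1;
        for (SpanSketch child : children) {
            count += child.spanCount();
        }
        return count;
    }

    public static void main(String[] args) {
        // Illustrative timings only: the root span covers the whole request.
        SpanSketch root = new SpanSketch("customer", 0, 6_990);
        SpanSketch preference = root.addChild("preference", 1_000, 6_800);
        preference.addChild("recommendation", 6_000, 6_690);
        System.out.println(root.spanCount() + " spans, root lasts " + root.durationMicros() + " us");
    }
}
```

Because every child span in this sketch is nested inside its parent, the duration of the root span is also the duration of the whole trace.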

Monitoring is simply the act of watching your system, whether with your own eyes through a UI or with automation tools. Monitoring is driven by trace data. In Istio, monitoring is implemented with Prometheus, which comes with its own UI. Prometheus also supports automated monitoring through alerts and the Alertmanager.

Leaving a trail

For tracing to work, the application must create a collection of spans. These must then be exported to Jaeger so that it, in turn, can build a visual representation of the trace. Among other things, each span records the name of the operation and the timestamps of its start and end. Spans are tied together by forwarding Jaeger-specific HTTP request headers from incoming requests to outgoing requests. Depending on the programming language, this may require a small change to the application source code. The following is an example of Java code (using the Spring Boot framework) that adds B3 (Zipkin-style) headers to your request in a Spring configuration class:

The following header settings are used:
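As a minimal sketch of this kind of header forwarding, the plain Java below copies the tracing headers from an incoming request to an outgoing one. It assumes the header set that Istio has documented for trace propagation (x-request-id, the x-b3-* family and x-ot-span-context); the class and method names are hypothetical, and a real Spring Boot application would do this inside an interceptor or configuration class rather than on raw maps.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper that selects the tracing headers to copy onto an outgoing request. */
final class TraceHeaderForwarder {
    // Assumption: the propagation headers listed in Istio's distributed tracing documentation.
    static final List<String> TRACE_HEADERS = Arrays.asList(
            "x-request-id",
            "x-b3-traceid",
            "x-b3-spanid",
            "x-b3-parentspanid",
            "x-b3-sampled",
            "x-b3-flags",
            "x-ot-span-context");

    /** Returns only the tracing headers of the incoming request, with lower-cased names. */
    static Map<String, String> headersToForward(Map<String, String> incoming) {
        // HTTP header names are case-insensitive, so normalize them before the lookup.
        Map<String, String> normalized = new HashMap<>();
        for (Map.Entry<String, String> header : incoming.entrySet()) {
            normalized.put(header.getKey().toLowerCase(Locale.ROOT), header.getValue());
        }
        Map<String, String> outgoing = new LinkedHashMap<>();
        for (String name : TRACE_HEADERS) {
            String value = normalized.get(name);
            if (value != null) {
                outgoing.put(name, value);
            }
        }
        return outgoing;
    }

    public static void main(String[] args) {
        Map<String, String> incoming = new HashMap<>();
        incoming.put("X-B3-TraceId", "463ac35c9f6413ad");
        incoming.put("X-B3-SpanId", "a2fb4a1d1a96d312");
        incoming.put("Cookie", "session=not-forwarded");
        System.out.println(headersToForward(incoming));
    }
}
```

Only the tracing headers are copied; everything else on the incoming request (cookies, for instance) is deliberately left out of the outgoing call.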

If you use Java, you can leave the code untouched: just add a few lines to the Maven POM file and set some environment variables. Here are the lines to add to the pom.xml file to bring in the Jaeger Tracer Resolver:

And the corresponding environment variables are set in the Dockerfile:

That’s it, now it’s all set up, and our microservices will start generating trace data.

The big picture

Istio includes a simple dashboard based on Grafana. Once everything is configured and running on the Red Hat OpenShift PaaS platform (in our example, Red Hat OpenShift and Kubernetes are deployed with minishift), the dashboard is opened with the following command:

open "$(minishift openshift service grafana -u)/d/1/istio-dashboard?refresh=5s&orgId=1"

The Grafana dashboard lets you assess the state of the system at a glance. A fragment of it is shown in the figure below:

Here you can see that the customer microservice calls the preference v1 microservice, which in turn calls the recommendation v1 and v2 microservices. The Grafana dashboard has a Dashboard Row block with high-level metrics such as the total number of requests (Global Request Volume), the percentage of successful requests (success rates), and 4xx errors. In addition, there is a Service Mesh view with graphs for each service, and a Services Row block with detailed information on each container of each service.

Digging deeper

With tracing properly configured, Istio lets you dig into system performance right out of the box. In the Jaeger UI you can view traces, see how far and how deep they go, and visually locate performance bottlenecks. When running Red Hat OpenShift on minishift, launch the Jaeger UI with the following command:

minishift openshift service jaeger-query --in-browser

What can be said about the trace on this screen:

  • It is divided into 7 spans.
  • The total execution time is 6.99 ms.
  • The microservice recommendation, which is the last in the chain, takes 0.69 ms.

Diagrams like this let you quickly spot the situation where the performance of the whole system suffers because of a single poorly performing service.

Now let’s make things harder and run two instances of the recommendation v2 microservice using the oc scale --replicas=2 deployment/recommendation-v2 command. Here are the pods we have after that:

If we now switch back to Jaeger and expand the span for the recommendation service, we can see which pod each request is routed to. This makes it easy to pinpoint slowdowns down to a specific pod. The field to look at is node_id:

Where and how everything goes

Now we go to the Prometheus interface and, as expected, see that requests are split between the second and first versions of the recommendation service in a 2:1 ratio, exactly matching the number of running pods. Moreover, this graph changes dynamically as pods are scaled up and down, which is especially useful for canary deployments (we will look at this deployment scheme in more detail next time).

It’s only the beginning

In truth, today we have only scratched the surface of a wealth of useful information about Jaeger, Grafana and Prometheus. That was our goal: to point you in the right direction and show what Istio makes possible.

And remember, all of this is already built into Istio. With certain programming languages (for example, Java) and frameworks (for example, Spring Boot), all of it can be done without touching the application code at all. Yes, the code will need slight changes if you use other languages, notably Node.js or C#. But since traceability is one of the prerequisites for building reliable cloud systems, you would have to edit the code anyway, with or without Istio. So why not spend the effort more profitably?

If only to be able to answer the questions “where?” and “how fast?” with 100% certainty, every time.

Chaos Engineering in Istio: It Was Meant to Be That Way

The ability to break things helps to ensure that they do not break

Software testing is not only hard but also important. Testing for correctness (for example, whether a function returns the right result) is one thing; testing over an unreliable network is a completely different task (it is often assumed that the network always works without failures, which is the first of the eight fallacies of distributed computing). One of the difficulties is how to simulate failures in the system, or introduce them deliberately, which is known as fault injection. This can be done by modifying the source code of the application itself. But then you are no longer testing your original code, only a version of it that deliberately simulates failures. As a result, you risk getting caught in a deadly embrace with fault injection and running into heisenbugs: failures that disappear when you try to observe them.

Now we will show how Istio helps you deal with these difficulties in no time.

How it looks when everything is fine

Consider the following scenario: we have two pods for our recommendation microservice, which we took from the Istio tutorial. One pod is labeled v1 and the other v2. As you can see, everything works fine for now:

(By the way, the number on the right is simply a call counter for each pod.)

But that is not what we are after, is it? Well, let’s try to break everything without touching the source code at all.

Disrupting the microservice

Below is the YAML file for an Istio routing rule that makes half of the requests fail (server error 503):

Note that we explicitly specify that error 503 should be returned in half of the cases.

And here is a screenshot of the curl command running in a loop after we activate this rule to simulate failures. As you can see, half of the requests return error 503, regardless of which pod, v1 or v2, they go to:

To restore normal operation, it is enough to delete this rule; in our case, with the istioctl delete routerule recommendation-503 -n tutorial command. Here, tutorial is the name of the Red Hat OpenShift project in which our Istio tutorial runs.

Introducing artificial delays

Artificial 503 errors help test the system for fault tolerance, but the ability to anticipate and handle delays should impress you even more. In real life, delays happen more often than failures, and a slow microservice is a poison that makes the entire system suffer. Thanks to Istio, you can test your delay-handling code without changing it at all. To begin with, we will show how to do this with artificially introduced network delays.

Please note that after such testing, you may need (or want) to refine your code. The good news is, in this case you will act proactively, not reactively. That is how the development cycle should be built: coding-testing-feedback-coding-testing …

This is what the rule looks like … Although you know what? Istio is so simple, and this yaml file is so clear that everything in this example speaks for itself, just take a look:

In half of the cases we will get a 7-second delay. This is not at all the same as inserting a sleep call into the source code, because Istio really does hold the request back for 7 seconds. Since Istio supports Jaeger tracing, this delay shows up clearly in the Jaeger UI, as the screenshot below shows. Note the long request in the upper right corner of the diagram, which lasts 7.02 seconds:

This scenario lets you test your code under network latency. And, of course, deleting the rule removes the artificial delay. Once again, we did all of this without touching the source code.

Don’t back down and don’t give up

Another Istio feature that is useful for chaos engineering is retrying calls to a service a specified number of times. The point is not to give up when the first request ends with error 503; perhaps we will get lucky on the umpteenth attempt. Maybe the service just went down briefly for one reason or another. Yes, that reason should be dug up and eliminated. But that comes later; for now, let’s try to keep the system working.

So, we want the service to return a 503 error from time to time, after which Istio will try to contact it again. For that we clearly need a way to generate error 503 without touching the code itself …

But wait! We just did that.

This file makes the recommendation-v2 service return a 503 error in half of the cases:

Obviously, some of the requests will fail:

And now we will use Istio’s Retry feature:

This routing rule does three retries with a two-second interval and should reduce (and ideally completely remove from the radar) 503 errors:

To summarize: first, we made Istio generate a 503 error for half of the requests. Second, the same Istio makes three attempts to reconnect to the service when a 503 error occurs. As a result, everything works just fine. Using the Retry feature, we kept our promise not to back down and not to give up.
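To get a feel for why retries help so much, here is a small Java simulation. It rests on an idealized assumption that is not taken from the Istio configuration: every attempt, including each retry, fails independently with probability 0.5. Under that assumption, a request fails only if the first attempt and all three retries fail, which happens with probability 0.5^4 = 6.25%. The class name, method and seed are hypothetical, and in a real mesh the result depends on where the fault is injected relative to the retry logic.

```java
import java.util.Random;

/** Hypothetical simulation of failure injection combined with retries. */
final class RetrySimulation {
    /**
     * Counts requests that still fail when every attempt fails independently with
     * probability failureRate and each request gets up to `retries` extra attempts.
     */
    static int failedRequests(int requests, double failureRate, int retries, long seed) {
        Random random = new Random(seed); // fixed seed keeps the run repeatable
        int failed = 0;
        for (int i = 0; i < requests; i++) {
            boolean succeeded = false;
            // One initial attempt plus up to `retries` retries, stopping at the first success.
            for (int attempt = 0; attempt <= retries && !succeeded; attempt++) {
                succeeded = random.nextDouble() >= failureRate;
            }
            if (!succeeded) {
                failed++;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        int requests = 10_000;
        System.out.println("failed without retries: " + failedRequests(requests, 0.5, 0, 42L));
        System.out.println("failed with 3 retries:  " + failedRequests(requests, 0.5, 3, 42L));
    }
}
```

With these inputs, the count without retries should land near half of the requests, and the count with three retries near one sixteenth of them.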

And yes, we did it again without touching the code at all. All we needed were two Istio routing rules:

How not to let the user down, or why seven do not wait for one

Now let’s turn the situation around and consider a scenario in which we refuse to back down or give up only for a fixed amount of time. After that, we simply stop trying to process the request, so that nobody is kept waiting on one slow service. In other words, instead of defending a lost position, we fall back to a reserve line, so as not to let the site user down and leave them languishing in ignorance.

In Istio, you can set a request timeout. If the service exceeds this timeout, error 504 (Gateway Timeout) is returned; again, all of this is done through the Istio configuration. We will, however, have to add a sleep call to the service source code (and then, of course, rebuild and redeploy it) to simulate a slow service. Alas, there is no other way.

So, we inserted a three-second sleep into the recommendation v2 service code, rebuilt the image and redeployed the container. Now we add a timeout using the following Istio routing rule:

The screenshot above shows that we give the recommendation service one second to respond before a 504 error is returned. After applying this routing rule (and adding the three-second sleep to the recommendation v2 service code), we get this:

Once again, the timeout can be set without touching the source code. An added bonus is that you can now modify your code to react to a timeout, and easily test those improvements using Istio.
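The effect of a request timeout on a slow service can be imitated in-process with a few lines of Java. This is only an analogy for what the mesh does at the network level, not Istio's implementation: the class, the method names and the status mapping are hypothetical, and the three-second sleep and one-second timeout from the text are scaled down to 300 ms and 100 ms so that the example runs quickly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical in-process analogy of a request timeout in front of a slow service. */
final class TimeoutSketch {
    static final int OK = 200;
    static final int INTERNAL_ERROR = 500;
    static final int GATEWAY_TIMEOUT = 504;

    /** Calls the service and reports 504 if no answer arrives within timeoutMillis. */
    static int callWithTimeout(Supplier<Integer> service, long timeoutMillis) {
        CompletableFuture<Integer> pending = CompletableFuture.supplyAsync(service);
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Stop waiting; the slow task itself may keep running in the background.
            pending.cancel(true);
            return GATEWAY_TIMEOUT;
        } catch (ExecutionException e) {
            return INTERNAL_ERROR; // the service itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the service", e);
        }
    }

    /** A stand-in for a service that sleeps before answering with 200. */
    static Supplier<Integer> slowService(long sleepMillis) {
        return () -> {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return OK;
        };
    }

    public static void main(String[] args) {
        System.out.println("slow service: " + callWithTimeout(slowService(300), 100));
        System.out.println("fast service: " + callWithTimeout(slowService(10), 1_000));
    }
}
```

The caller gets an answer after at most the timeout, whatever the slow service does, which is exactly what lets the calling code switch to a fallback.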

And now all together

Creating a little chaos with Istio is a great way to test your code and the reliability of your system as a whole. The fallback, bulkhead and circuit breaker patterns, artificial failures and delays, and retries and timeouts are all very useful when building fault-tolerant cloud systems. Combined with Kubernetes and Red Hat OpenShift, these tools help you face the future with confidence.
