An introduction to the distributed tracing pattern

When it comes to health monitoring, a distributed architecture brings its own set of problems. You may be dealing with dozens, if not hundreds, of microservices, each of which may have been built by a different development team.

When working with a large-scale system, it is very important to monitor its key metrics and the health of its applications, and to collect enough data to track down and fix problems quickly. Running in a cloud environment such as AWS, Google Cloud, or Azure exacerbates the problem: the dynamic nature of the infrastructure (scaling up and down, ephemeral machines, dynamic IPs, etc.) makes problems harder to detect, isolate, and fix.

The basis of observability:

  • Metrics – application metrics, host / system metrics, network metrics, etc.

  • Logs – logs of applications and supporting infrastructure

  • Traces – tracking the passage of a request through a distributed system

In this article, I will focus on two aspects of observability: logs (only generated by the application) and traces.

Logs

Using a centralized logging system to collect and visualize logs has become common practice in distributed architectures. Logs generated by the various microservices are automatically shipped to a centralized store for storage and analysis. Here are some popular centralized logging solutions for applications:

  • Splunk

  • Datadog

  • Logstash

  • Fluentd

Logs provide useful information about events that have occurred in your application. These can be INFO-level logs or error logs with detailed exception stack traces.

Correlation of logs between microservices

A centralized logging system in a large distributed application can ingest gigabytes of data per hour, if not more. Since a request can traverse multiple microservices, one way to retrieve all the logs associated with it is to assign a unique identifier (id) to each request.

In most cases, this can be the userId associated with the request, or a UUID generated at the first entry point into the application. The identifier is attached to every log message and passed from one microservice to the next in a request header (unless it is already part of the request being processed). You can then query the logging system by requestId or userId to find all the logs associated with a request across multiple services.
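
The idea can be sketched in a few lines of plain Java. This is only an illustration under assumptions: the header name X-Request-Id and the class CorrelationIds are hypothetical, and headers are modelled as a simple Map rather than a real HTTP request (real header lookups are case-insensitive).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Sketch of request-ID handling; names here are illustrative, not from any framework. */
final class CorrelationIds {
    static final String HEADER = "X-Request-Id"; // hypothetical header name

    private CorrelationIds() {}

    /** Reuses the caller's ID if present, otherwise generates a UUID at the entry point. */
    static String resolve(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(HEADER);
        return (existing == null || existing.isBlank())
                ? UUID.randomUUID().toString()
                : existing;
    }

    /** Copies the outgoing headers and adds the ID so the next service receives it. */
    static Map<String, String> withRequestId(Map<String, String> outgoingHeaders, String requestId) {
        Map<String, String> headers = new HashMap<>(outgoingHeaders);
        headers.put(HEADER, requestId);
        return headers;
    }

    /** Prefixes a log message with the ID so the logging system can filter on it. */
    static String tag(String requestId, String message) {
        return "[requestId=" + requestId + "] " + message;
    }
}
```

A service would call resolve once per incoming request, use tag (or its logging framework's equivalent) for every log line, and call withRequestId before each downstream call.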

Figure 1. Centralized logging.

Below are some examples of how to tag your logs with the necessary information in Java using RequestFilters.

Figure 2: Log4J2 configuration and sample log

Figure 3: Query Filters by UUID or UserId

Tracing

Tracing is a technique that allows you to profile and monitor applications while they are running. Traces provide useful information such as:

  1. The path of the request through a distributed system.

  2. The latency of the request at each hop / call (for example, from one service to another).

Below is an example trace for a request that interacts with two microservices (an ad auction service and an ad integrator service).

Figure 4. Trace

In the above example, the data was captured and rendered with Datadog. There are several other ways to capture traces, which I will discuss in the next section.

Trace components

A trace is a tree structure: a parent trace with child spans. The trace of a request covers several services and is broken down into smaller fragments, called spans, one per operation / function. For example, a span can cover a call from one microservice to another. Within a single microservice there can be several spans (depending on how many layers of classes / functions or dependent microservices are called to serve the request).
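
To make the structure concrete, here is a minimal sketch of a span tree in plain Java. The Span and TraceTree classes and their field names are assumptions made for illustration; real tracers record more data (tags, errors, sampling flags).

```java
import java.util.ArrayList;
import java.util.List;

/** One timed operation in a trace; parentSpanId is null for the root span. */
final class Span {
    final String traceId;
    final String spanId;
    final String parentSpanId;
    final String operation;
    final long startMillis;
    final long endMillis;

    Span(String traceId, String spanId, String parentSpanId,
         String operation, long startMillis, long endMillis) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.operation = operation;
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }

    long durationMillis() {
        return endMillis - startMillis;
    }
}

final class TraceTree {
    private TraceTree() {}

    /** Returns the direct children of the given span. */
    static List<Span> childrenOf(Span parent, List<Span> spans) {
        List<Span> children = new ArrayList<>();
        for (Span s : spans) {
            if (parent.spanId.equals(s.parentSpanId)) {
                children.add(s);
            }
        }
        return children;
    }

    /** Renders the tree as indented lines, one span per line with its duration. */
    static String render(Span root, List<Span> spans) {
        StringBuilder out = new StringBuilder();
        render(root, spans, 0, out);
        return out.toString();
    }

    private static void render(Span span, List<Span> spans, int depth, StringBuilder out) {
        out.append("  ".repeat(depth))
           .append(span.operation)
           .append(" (").append(span.durationMillis()).append(" ms)\n");
        for (Span child : childrenOf(span, spans)) {
            render(child, spans, depth + 1, out);
        }
    }
}
```

Rendering a root span with one child gives one line per span, indented by depth, which is essentially what trace visualizations show as a waterfall.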

Tracing relies on creating a unique identifier for each request at the entry point and propagating it to downstream systems as the trace context in the request headers. This allows you to link different trace information from multiple services in one place for analysis and visualization.
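
A simplified sketch of this propagation is shown below. It uses the B3 header names (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId); the TraceContext class itself is hypothetical, headers are modelled as a plain Map, and details such as sampling flags are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of a trace context carried between services in request headers. */
final class TraceContext {
    final String traceId;
    final String spanId;
    final String parentSpanId; // null at the entry point

    TraceContext(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    /** A random 64-bit ID rendered as 16 lower-case hex characters. */
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    /** Continues the caller's trace if a trace ID was sent, otherwise starts a new trace. */
    static TraceContext fromIncoming(Map<String, String> headers) {
        String traceId = headers.get("X-B3-TraceId");
        if (traceId == null) {
            return new TraceContext(newId(), newId(), null);
        }
        // Same trace, new span: the caller's span becomes this span's parent.
        return new TraceContext(traceId, newId(), headers.get("X-B3-SpanId"));
    }

    /** Headers to attach to calls made to downstream services. */
    Map<String, String> toOutgoingHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-B3-TraceId", traceId);
        headers.put("X-B3-SpanId", spanId);
        if (parentSpanId != null) {
            headers.put("X-B3-ParentSpanId", parentSpanId);
        }
        return headers;
    }
}
```

Every service along the path ends up with the same traceId, which is what lets the tracing backend stitch the spans from different services into one trace.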

Correlation of logs and traces

As a result, we can filter the logs by userId or another unique identifier (for example, a generated UUID) and track the performance / behavior of an individual request through its traces. It would be nice if we could tie these together and match logs and traces for a specific request.

Having such a correlation between logs and traces allows you to:

  1. Compare performance metrics directly to logs.

  2. Send an ad hoc request to the system for troubleshooting.

  3. Run synthetic transactions against the system at different points in time, compare current traces with historical ones, and automatically collect the system logs associated with those requests.
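
One way to get this correlation is to stamp every log line with the ID of the active trace. The sketch below does this with java.util.logging and a ThreadLocal; CurrentTrace and TraceAwareFormatter are hypothetical names, and the approach assumes the log record is formatted on the same thread that handles the request.

```java
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

/** Holds the trace ID of the request being handled on the current thread. */
final class CurrentTrace {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    private CurrentTrace() {}

    static void set(String traceId) { TRACE_ID.set(traceId); }

    static String get() { return TRACE_ID.get(); }

    /** Call when the request finishes so pooled threads do not leak IDs. */
    static void clear() { TRACE_ID.remove(); }
}

/** Writes the active trace ID into each log line so logs can be joined to traces. */
final class TraceAwareFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        String traceId = CurrentTrace.get();
        return String.format("%s traceId=%s %s\n",
                record.getLevel(),
                traceId == null ? "-" : traceId,
                formatMessage(record));
    }
}
```

With the formatter installed on a handler, searching the centralized logs for a trace ID taken from a slow trace returns exactly the log lines written while that request was being served.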

Implementing Distributed Tracing Using Log and Trace Correlation

APPROACH # 1: INSTRUMENTATION WITH 3RD PARTY SOLUTIONS LIKE DATADOG

Link: Datadog APM

With this approach, we instrument the services of a distributed system with Datadog APM (application performance monitoring). Datadog traces 100% of requests and can also collect the logs generated by your applications.

Datadog essentially takes care of centralized logging and of collecting trace information. It generates unique trace IDs and automatically propagates them to all instrumented downstream microservices. The only thing required of us is to attach the Datadog trace ID to our logs, which gives us the correlation between logs and traces.

Figure 6: Instrumentation of the Application with DataDog

Figure 7: Correlation of Logs and Traces in DataDog

APPROACH # 2: ZIPKIN, SPRING CLOUD SLEUTH WITH SPRING BOOT

Link: Zipkin, Spring Cloud Sleuth

Benefits:

  1. Full integration with Spring Boot

  2. Ease of use

  3. Traces can be visualized using the Zipkin user interface.

  4. Supports OpenTracing standards through external libraries.

  5. Supports log correlation through the Log4j2 MDC context.

Disadvantages:

  1. There is no built-in solution for automatically collecting trace-related logs. We have to ship the logs to Elasticsearch ourselves and search them by the trace IDs generated by Spring Cloud Sleuth (such as the X-B3-TraceId header).

Design:

Figure 8: Zipkin, Spring Cloud Sleuth and Spring Boot.

APPROACH # 3: AWS X-RAY

Link: AWS X-Ray

Benefits:

  1. Supports all AWS resources natively, which is great if your distributed services are deployed and running on AWS

  2. AWS load balancers automatically generate request IDs for every incoming request, relieving the application of this concern. (Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-request-tracing.html)

  3. Allows you to trace all the way from the API gateway to the load balancer, service, and other dependent AWS resources.

  4. Implements log correlation using the logs stored in CloudWatch Logs
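
For the request IDs generated by the load balancer (benefit 2), the application only has to read the X-Amzn-Trace-Id header, whose value is a semicolon-separated list of fields such as Root and Self. Below is a small parsing sketch; the AmznTraceHeader class is a hypothetical helper, not part of any AWS SDK, and the sample values in the usage note are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a parser for the value of the X-Amzn-Trace-Id header. */
final class AmznTraceHeader {
    private AmznTraceHeader() {}

    /** Splits "Key=value;Key=value" into its fields, skipping malformed parts. */
    static Map<String, String> parse(String headerValue) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : headerValue.split(";")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                fields.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
            }
        }
        return fields;
    }

    /** Returns the Root field, or null if the header does not carry one. */
    static String rootTraceId(String headerValue) {
        return parse(headerValue).get("Root");
    }
}
```

For a header value such as "Self=1-67891234-12456789abcdef012345678;Root=1-67891233-abcdef012345678912345678", rootTraceId returns the Root field, which can then be written into every log line of the request.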

Disadvantages:

  1. CloudWatch Logs can get very expensive with large log volumes

APPROACH # 4: JAEGER

Link: Jaeger

Benefits:

  1. Supports OpenTracing by default

  2. Has libraries that work with Spring

  3. Supports the Jaeger Agent, which can be deployed alongside the application to forward traces and logs.

Disadvantages:

  1. From the point of view of infrastructure maintenance and configuration, it is quite complex.

CONCLUSION

Logs and traces are certainly useful on their own. But when they are linked together through correlation, they become a powerful tool for accelerating problem resolution in a production environment, while giving DevOps teams insight into the health, performance, and behavior of distributed systems. As you saw above, there are several ways to implement this solution. The choice is yours 🙂


This translation of the article was prepared ahead of the start of the course “Architecture and Design Patterns”.

