An introduction to the distributed tracing pattern
When it comes to health and monitoring, a distributed architecture brings its own set of problems. You may be dealing with dozens, if not hundreds, of microservices, each of which may have been built by a different development team.
When working with a large-scale system, it is very important to monitor its key metrics and the health of its applications, and to collect enough data to track down and fix problems quickly. Running in a cloud environment such as AWS, Google Cloud, or Azure makes this harder: the dynamic nature of the infrastructure (scaling, short-lived machines, dynamic IPs, etc.) makes problems more difficult to detect, isolate, and fix.
The foundations of observability:
Metrics – application metrics, host / system metrics, network metrics, etc.
Logs – logs of applications and supporting infrastructure
Traces – tracking the passage of a request through a distributed system
In this article, I will focus on two aspects of observability: logs (only those generated by the application) and traces.
Logs
Collecting logs in a centralized logging system, where they can be searched and visualized, has become common practice in distributed architectures. Logs generated by the various microservices are shipped automatically to a central store for storage and analysis. Here are some of the popular centralized logging solutions for applications:
Logs provide useful information about events that have occurred in your application. These can be INFO-level logs or error logs with detailed exception stack traces.
Correlation of logs between microservices
A centralized logging system in a large distributed application can ingest gigabytes of data per hour, if not more. Because a request can traverse multiple microservices, one way to retrieve all the logs associated with it is to assign a unique identifier (ID) to each request.
In most cases, this can be the userId associated with the request, or a UUID generated at the first entry point into the application. The identifier is attached to every log message and passed from one microservice to the next in a request header (when it is not already part of the request being passed along). You can then query the logging system by requestId or userId to find all the logs associated with a request across multiple services.
Below are some examples of how to tag your logs with the necessary information in Java using RequestFilters.
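The same tagging idea can be sketched in Python rather than Java, using only the standard library. This is a minimal illustration under stated assumptions, not a production filter: the header name X-Request-Id, the logger name, and the helper functions begin_request and outgoing_headers are all hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical header name; any name that all services agree on would work.
REQUEST_ID_HEADER = "X-Request-Id"

# Holds the identifier of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copies the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def begin_request(headers):
    """Reuse the caller's ID if present, otherwise create one at the entry point."""
    request_id = headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id


def outgoing_headers():
    """Headers to attach to calls made to downstream services."""
    return {REQUEST_ID_HEADER: request_id_var.get()}


# Log to an in-memory buffer so the example is self-contained.
log_output = io.StringIO()
handler = logging.StreamHandler(log_output)
handler.setFormatter(
    logging.Formatter("%(levelname)s [requestId=%(request_id)s] %(message)s")
)
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Entry-point service: no incoming ID, so one is generated.
first_id = begin_request({})
logger.info("order received")
downstream = outgoing_headers()

# Downstream service: the ID arrives in the header and is reused.
second_id = begin_request(downstream)
logger.info("payment authorized")

print(log_output.getvalue())
```

Both log lines carry the same requestId, so a single query on that value in the logging system would return the messages from both services.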
Tracing
Tracing is a technique that allows you to profile and monitor applications while they are running. Traces provide useful information such as:
The path of the request through a distributed system.
The latency of the request at each hop / call (for example, from one service to another).
Below is an example trace for a request that interacts with two microservices (an ad auction service and an ad integrator service).
In the above example, the data was captured and rendered with Datadog. There are several other ways to capture traces, which I will discuss in the next section.
Trace components
A trace is a tree structure made up of a root (parent) span and child spans. The trace of a request covers several services and is broken down into smaller fragments, one per operation / function, called spans. For example, a span can cover a call from one microservice to another. Within one microservice, there can be several spans (depending on how many levels of classes / functions or dependent microservices are called to service the request).
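To make the tree structure concrete, here is a minimal Python sketch. The Span class, its field names, the span names, and the timings are all hypothetical and do not reflect any particular tracer's data model; the service names follow the ad auction / ad integrator example mentioned earlier.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of the same request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: int
    end_ms: int
    children: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def build_tree(spans):
    """Link each span to its parent and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    return root


def render(span, depth=0):
    """Return the tree as indented lines, one per span, with its latency."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms)"]
    for child in sorted(span.children, key=lambda c: c.start_ms):
        lines.extend(render(child, depth + 1))
    return lines


# Made-up spans for a single request crossing two services.
spans = [
    Span("t1", "a", None, "GET /ads", 0, 120),
    Span("t1", "b", "a", "ad-auction: run_auction", 5, 110),
    Span("t1", "c", "b", "ad-integrator: fetch_bids", 20, 90),
]

tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

Each level of indentation is one hop deeper into the call chain, and the per-span duration is the latency attributed to that hop.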
Tracing relies on creating a unique identifier for each request at the entry point and propagating it to downstream systems as the trace context in the request headers. This allows you to link different trace information from multiple services in one place for analysis and visualization.
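As a rough illustration of this propagation, the sketch below passes a trace context between two services using the B3 header names used by Zipkin (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId). The extract and inject helpers are hypothetical and heavily simplified: real tracers also handle sampling flags, timing, and reporting.

```python
import secrets


def new_id():
    """Return a random 64-bit identifier as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def extract(headers):
    """Read the trace context from incoming headers, or start a new trace."""
    trace_id = headers.get("X-B3-TraceId")
    if trace_id is None:
        # Entry point: no context was sent, so this request starts a trace.
        return {"trace_id": new_id(), "span_id": new_id(), "parent_id": None}
    return {
        "trace_id": trace_id,
        "span_id": headers["X-B3-SpanId"],
        "parent_id": headers.get("X-B3-ParentSpanId"),
    }


def inject(context):
    """Build headers for an outgoing call: same trace, new child span."""
    return {
        "X-B3-TraceId": context["trace_id"],
        "X-B3-SpanId": new_id(),
        "X-B3-ParentSpanId": context["span_id"],
    }


# Service A receives a request without any trace headers.
edge_context = extract({})

# Service A calls service B and attaches the trace context.
headers_to_b = inject(edge_context)

# Service B reads the context from the incoming headers.
b_context = extract(headers_to_b)

print(edge_context)
print(b_context)
```

Because the trace ID is identical in both services and each span records its parent, a backend that collects the spans can reassemble them into one trace.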
Correlation of logs and traces
So far, we can filter logs by userId or another unique identifier (for example, a generated UUID), and we can follow the performance / behavior of an individual request through its trace. It would be nice to tie the two together and match logs and traces for a specific request.
Having such a correlation between logs and traces lets you:
Compare performance metrics directly to logs.
Send an ad hoc request to the system for troubleshooting.
Run synthetic transactions against the system at different points in time, compare current traces with historical ones, and automatically collect the system logs associated with these requests.
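A toy Python sketch of what this correlation buys you: once log entries carry the same trace ID as the traces, the logs for a slow request can be pulled up directly. The trace IDs, durations, and log messages below are invented for illustration, and the in-memory lists stand in for a real tracing backend and logging system.

```python
# Total duration per trace, as a tracing backend might report it (made-up data).
trace_durations_ms = {
    "t-100": 65,
    "t-200": 900,
    "t-300": 72,
}

# Log entries tagged with the trace ID of the request that produced them.
logs = [
    {"trace_id": "t-100", "level": "INFO", "message": "auction completed"},
    {"trace_id": "t-200", "level": "ERROR", "message": "bid provider timed out"},
    {"trace_id": "t-300", "level": "INFO", "message": "auction completed"},
    {"trace_id": "t-200", "level": "WARN", "message": "retrying with cached bids"},
]


def logs_for_trace(entries, trace_id):
    """Return the messages logged while handling the given trace, in order."""
    return [e["message"] for e in entries if e["trace_id"] == trace_id]


# Find the slowest request, then jump straight to its logs.
slow_id = max(trace_durations_ms, key=trace_durations_ms.get)
slow_logs = logs_for_trace(logs, slow_id)

print(slow_id, slow_logs)
```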
Implementing Distributed Tracing Using Log and Trace Correlation
APPROACH # 1: Instrumentation with 3rd party solutions like DATADOG
Link: Datadog APM
With this approach, we instrument the services of a distributed system with Datadog APM (application performance monitoring). Datadog traces 100% of requests and can also collect the logs generated by your applications.
Datadog essentially takes care of centralized logging and of collecting trace information. It generates unique trace IDs and automatically propagates them to all instrumented downstream microservices. All that is required of us is to attach the Datadog trace ID to the logs, which gives us the correlation of logs and traces.
APPROACH # 2: ZIPKIN, SPRING CLOUD SLEUTH WITH SPRING BOOT
Link:
Benefits:
Full integration with Spring Boot
Ease of use
Traces can be visualized using the Zipkin user interface.
Supports OpenTracing standards through external libraries.
Supports log correlation through Log4j2 MDC contexts.
Disadvantages:
There is no built-in solution for automatically collecting trace-related logs. We have to send the logs to Elasticsearch ourselves and search them using the trace IDs generated by Spring Cloud Sleuth (carried in the X-B3-TraceId header).
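As a sketch of what that search might look like, the Python snippet below builds an Elasticsearch query body that selects all log entries for one trace ID. The field names traceId and @timestamp are assumptions about how the logs were indexed (traceId is assumed to be a keyword field), and the trace ID value is a made-up example; the snippet only constructs the JSON body and does not call a cluster.

```python
import json


def trace_logs_query(trace_id, size=100):
    """Build a query body that finds log entries for one trace, oldest first.

    Assumes each log document has a keyword field 'traceId' and a
    '@timestamp' field; adjust both to match the actual index mapping.
    """
    return {
        "size": size,
        "query": {"term": {"traceId": {"value": trace_id}}},
        "sort": [{"@timestamp": {"order": "asc"}}],
    }


body = trace_logs_query("463ac35c9f6413ad")
print(json.dumps(body, indent=2))
```

The resulting body can be sent to the index's search endpoint with whatever Elasticsearch client the team already uses.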
Design:
APPROACH # 3: AWS X-RAY
Link: AWS X-Ray
Benefits:
Supports all AWS resources natively, which is great if your distributed services are deployed and running on AWS.
AWS load balancers automatically generate a request ID for every incoming request, relieving the application of this concern. (Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-request-tracing.html)
Allows you to trace all the way from the API gateway to the load balancer, service, and other dependent AWS resources.
Implements log correlation through CloudWatch Logs.
Disadvantages:
CloudWatch Logs can get very expensive with large log volumes.
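The request ID that the load balancer generates reaches the application in the X-Amzn-Trace-Id header, as semicolon-separated key=value fields such as Root and Self. The following Python sketch shows one way an application could read it in order to tag its logs; the function name parse_amzn_trace_id is hypothetical, and the header value is a sample in the documented format.

```python
def parse_amzn_trace_id(value):
    """Split an X-Amzn-Trace-Id header value into its key=value fields."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:  # skip malformed fragments that have no '='
            fields[key] = val
    return fields


# Sample header value with both a Self and a Root field.
header = (
    "Self=1-67891234-12456789abcdef012345678;"
    "Root=1-67891233-abcdef012345678912345678"
)

fields = parse_amzn_trace_id(header)
root = fields["Root"]

# The Root field has the form version-time-id; the middle part is the
# request's epoch time in seconds, written in hexadecimal.
version, hex_time, unique_id = root.split("-")
epoch_seconds = int(hex_time, 16)

print(root, epoch_seconds)
```

Logging the Root value with every message gives the same request-level key in the logs that appears in the load balancer's tracing data.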
APPROACH # 4: JAEGER
Link: Jaeger
Benefits:
Supports OpenTracing by default
Has libraries that work with Spring
Supports the Jaeger Agent, which can be installed as a tool for forwarding traces and logs.
Disadvantages:
It is quite complex to configure and maintain as infrastructure.
CONCLUSION
Logs and traces are certainly useful on their own. But when they are linked together through correlation, they become a powerful tool for speeding up problem resolution in production, while giving DevOps teams insight into the health, performance, and behavior of distributed systems. As you saw above, there are several ways to implement this solution. The choice is yours 🙂