How We Implemented Tracetest to Improve Observability in EDA

Observability is the ability of a system to provide a complete picture of its internal state based on the data the system itself emits. In the context of a microservice architecture, and especially an event-driven architecture (EDA), observability rests on three key components:

  • Logs — text records about the system’s operation, containing data about events, errors, and other significant occurrences.

  • Metrics — numerical data that allows you to monitor the performance and health of your system in real time.

  • Traces — information about the paths of requests and events through various components of the system.

EDA specifics

In EDA, interactions between components are carried out via asynchronous messages (events). This creates the following difficulties:

  • Asynchrony – difficulties in tracking the sequence of events and their dependencies.

  • Dynamism — frequent changes in the system configuration and topology of interactions between microservices.

  • Variety of events – various types and formats of messages, which complicates their monitoring and analysis.

The Role of Observability

Let's imagine that we have a system of 16 microservices, each of which processes its own specific events. Interaction between these services can occur through message brokers such as Kafka or RabbitMQ.

Our microservices:

  • User interface

  • Payment processing service

  • Fraud Checking Service

  • Notification service

  • Reporting service

  • And others.

A hypothetical purchase goes through the following process (a small code sketch of step 1 follows the steps):

Step 1: User interface dispatches the “OrderPlaced” event.

Step 2: Payment processing service receives the “OrderPlaced” event and processes the payment.

Step 3: Fraud Checking Service receives the “PaymentProcessed” event and performs a check.

Step 4: Notification service receives the “FraudCheckCompleted” event and sends a notification to the user.

Step 5: Reporting service receives all events and updates reports.
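
To make this concrete, here is a minimal sketch of step 1, assuming kafkajs and the OpenTelemetry API; the topic name and payload fields are illustrative, not our real schema:

// Sketch: publish an "OrderPlaced" event with the current trace context
// injected into the Kafka message headers, so downstream consumers can
// continue the same trace. Topic and payload names are assumptions.
import { Kafka } from 'kafkajs';
import { context, propagation, trace } from '@opentelemetry/api';

const kafka = new Kafka({ clientId: 'user-interface', brokers: ['kafka:9092'] });
const producer = kafka.producer();
const tracer = trace.getTracer('user-interface');

export async function placeOrder(orderId: string, amount: number): Promise<void> {
  await tracer.startActiveSpan('send OrderPlaced', async (span) => {
    try {
      // Inject the active trace context (trace-id, span-id) into a carrier
      // that travels with the event as Kafka message headers.
      const headers: Record<string, string> = {};
      propagation.inject(context.active(), headers);

      await producer.connect(); // in real code, connect once at startup
      await producer.send({
        topic: 'order-events',
        messages: [
          { key: orderId, value: JSON.stringify({ type: 'OrderPlaced', orderId, amount }), headers },
        ],
      });
    } finally {
      span.end();
    }
  });
}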

Observability helps us throughout the described process, by means of logs, metrics, and traces.

Logs

Centralized logging: collecting all logs in one place to simplify analysis. For example, we collect them via OpenTelemetry and then view them in Jaeger.

Event correlation: Using unique identifiers (trace IDs) to track a chain of events.
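
To show what such correlation looks like in practice, here is a minimal sketch (assuming the OpenTelemetry API and plain JSON logging to stdout) in which every log line carries the trace ID of the currently active span:

// Sketch: attach the active trace ID to every structured log line, so logs
// stored in OpenSearch can be correlated with traces shown in Jaeger.
import { trace } from '@opentelemetry/api';

function logEvent(message: string, extra: Record<string, unknown> = {}): void {
  const spanContext = trace.getActiveSpan()?.spanContext();
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    message,
    trace_id: spanContext?.traceId, // the same ID Jaeger shows for the trace
    span_id: spanContext?.spanId,
    ...extra,
  }));
}

// e.g. logEvent('OrderPlaced received', { orderId: '52' });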

Metrics

Performance Monitoring: Collect metrics such as payment processing time, OrderPlaced event rate, etc.

Bottleneck Analysis: Identify microservices that may be a bottleneck in the system by analyzing their response times.
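
As an illustration, recording a metric such as payment processing time with the OpenTelemetry metrics API might look like the following sketch; the metric name and the surrounding function are assumptions, not our real code:

// Sketch: record payment processing time as a histogram so latency
// percentiles can be watched and the bottleneck service identified.
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('payment-processing-service');
const paymentDuration = meter.createHistogram('payment.processing.duration', {
  unit: 'ms',
  description: 'Time spent processing a payment after OrderPlaced is consumed',
});

export async function processPayment(orderId: string): Promise<void> {
  const start = Date.now();
  try {
    // ... actual payment logic ...
  } finally {
    paymentDuration.record(Date.now() - start);
  }
}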

Traces

Detailed traces: Visualize the event path through all microservices, from “OrderPlaced” to “FraudCheckCompleted”.

Debugging Problems: Easily identify problem points and delays at various stages of event processing.
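
For completeness, here is the consumer side of the same chain, a minimal sketch assuming kafkajs and the OpenTelemetry API: the service restores the trace context from the message headers, so the whole path from “OrderPlaced” onward shows up as one trace.

// Sketch: the payment service consumes "OrderPlaced", restores the trace
// context from the Kafka headers and creates a child span, continuing the
// trace started by the producer sketch above. Names are assumptions.
import { Kafka } from 'kafkajs';
import { context, propagation, trace } from '@opentelemetry/api';

const kafka = new Kafka({ clientId: 'payment-processing-service', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'payment-processing' });
const tracer = trace.getTracer('payment-processing-service');

export async function start(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topics: ['order-events'] });
  await consumer.run({
    eachMessage: async ({ message }) => {
      // Convert Kafka headers (string | Buffer) into a plain string carrier.
      const carrier: Record<string, string> = {};
      for (const [key, value] of Object.entries(message.headers ?? {})) {
        if (value !== undefined) carrier[key] = value.toString();
      }
      // Continue the trace of the producer that sent the event.
      const parentContext = propagation.extract(context.active(), carrier);
      await context.with(parentContext, () =>
        tracer.startActiveSpan('consume OrderPlaced', async (span) => {
          try {
            // ... process payment, then emit "PaymentProcessed" ...
          } finally {
            span.end();
          }
        }),
      );
    },
  });
}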

The Critical Importance of Observability

As you can see from the example above, Observability in EDA is a critical element for several reasons:

  • Improving reliability: the ability to quickly identify and resolve problems.

  • Improving performance: identifying and eliminating bottlenecks.

  • Flexibility and scalability: ease of adaptation to changes and scaling of the system.

  • Reduced debugging time: quickly find the causes of errors and failures.

Observability in the context of EDA provides complete transparency of the system. This is especially relevant for complex systems with many microservices, where any bottleneck or problem can significantly affect the overall performance and reliability of the system.

Our system

We had 16 microservices connected by Kafka, a monolithic database, and target databases for each microservice (PostgreSQL). Logging was based on OpenTelemetry, using Jaeger. There were autotests on Jest and Playwright, and before releases we ran tests locally using k6.
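
For context, trace export in a setup like this typically boils down to initializing the OpenTelemetry SDK in each microservice. Below is a minimal sketch assuming the Node.js OTLP exporter and auto-instrumentations; the service name and collector endpoint are illustrative, not our exact configuration.

// Sketch: initialize OpenTelemetry in a Node.js microservice and export
// traces over OTLP to a collector / Jaeger backend.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'payment-processing-service',          // illustrative name
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',       // illustrative endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();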

Schematic representation of our system

Problem

Our goal was to increase the speed of feature delivery. The logic is simple: the faster a company builds and launches a product, the sooner it starts to benefit from it. To do that, we needed metrics on whether a release build worked correctly (and, when bugs were found, the ability to localize the error as precisely as possible for the quickest fix) in order to reduce the time spent on regression testing.

To solve this problem we could, of course, have written a huge number of tests using a combination of Jest + Allure and Playwright (more tests for the God of Tests), but we needed a less resource-intensive solution. We also had to take into account that we already had some tests that we did not want to throw away.

Then we decided to study how colleagues on the global market approach testing and what “best practices” exist. To our surprise, it turned out that there are practically no ready-made solutions (we would be glad if you share your experience in the comments).

Solution

A relatively new open-source tool called Tracetest.io suited our needs perfectly.

With it, you can reuse existing tests (Cypress, Playwright, k6, Postman, and others), view the entire flow from the web frontend to the backend through the trace captured on each test run, and cover the whole system with a single set of tests.

According to their claims, the time to create tests was reduced by 98%: from 12 hours to 15 minutes. We understood right away that such miracles should not be expected, but we could not pass it by.

And so began a fascinating 20-minute adventure called “R&D”.

Implementing Tracetest

So what kind of beast is this Tracetest?

Given the scale of the system described earlier, manual testing would require involving the entire department, and that scenario obviously did not suit us. That is why we chose Tracetest, an open-source tool that automates testing and monitoring of microservices using traces from OpenTelemetry. With it, you can quickly cover all microservices with tests and automate routine tasks.

To implement the tool, we needed only two QA engineers, and these tests were supported by specialists who mainly do manual testing (after a short onboarding).

We suspect that after reading this part, many of you immediately thought: “Is it really that simple?!”

Let's look at the process in more detail.

The first thing we did was download the logs of all microservices for the previous day. They were stored as JSON in OpenSearch, which allowed us to work with their keys and values. Based on this, we wrote a script that generates tests in YAML format for further use in Tracetest.

We will not dwell on this topic in detail, as it would require a separate article. However, you can see a clear example in the official documentation.
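
Purely to illustrate the idea (this is not our actual script), such a generator might look roughly like the sketch below; the log field names and the js-yaml dependency are assumptions:

// Sketch: turn yesterday's JSON logs (exported from OpenSearch) into
// Tracetest test definitions. Field names (spanName, httpStatus, durationMs)
// are hypothetical; a real script would map your own log schema.
import * as fs from 'fs';
import * as yaml from 'js-yaml';

interface LogEntry {
  spanName: string;
  httpStatus?: number;
  durationMs?: number;
}

function buildTest(service: string, triggerUrl: string, entries: LogEntry[]): string {
  const specs = entries.map((e) => ({
    selector: `span[name = "${e.spanName}"]`,
    assertions: [
      ...(e.httpStatus ? [`attr:http.status_code = ${e.httpStatus}`] : []),
      ...(e.durationMs ? [`attr:tracetest.span.duration <= ${e.durationMs}ms`] : []),
    ],
  }));

  return yaml.dump({
    type: 'Test',
    spec: {
      name: `Generated - ${service}`,
      trigger: { type: 'http', httpRequest: { url: triggerUrl, method: 'POST' } },
      specs,
    },
  });
}

// e.g. fs.writeFileSync('payment-service.test.yaml',
//   buildTest('payment-service', 'http://payment.internal/pay', logEntries));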

Example test

Below is an example of the Test entity itself in Tracetest, along with the checks (test specs) that we planned to cover:

type: Test
spec:
  name: DEMO Import - Import an Entity
  description: "Import an entity"
  trigger:
    type: http
    httpRequest:
      url: http://demo-api.demo/entity/import
      method: POST
      headers:
      - key: Content-Type
        value: application/json
      body: '{ "id": 52 }'
  specs:
    - selector: span[name = "POST /entity/import"]
      assertions:
        - attr:tracetest.span.duration <= 500ms
        - attr:http.status_code = 200
    - selector: span[name = "send message to queue"]
      assertions:
        - attr:messaging.message.payload contains 52
    - selector: span[name = "consume message from queue"]:last
      assertions:
        - attr:messaging.message.payload contains 52
    - selector: span[name = "consume message from queue"]:last span[name = "import entity  from externalapi"]
      assertions:
        - attr:http.status_code = 200
    - selector: span[name = "consume message from queue"]:last span[name = "save entity
        on database"]
      assertions:
        - attr:db.repository.operation = "create"
        - attr:tracetest.span.duration <= 500ms
  outputs:
    - name: ENTITY_ID
      selector: span[name = "POST /entity/import"]
      value: attr:http.response.body | json_path '.id'

We generated the test files and then integrated their execution into CI/CD for all 16 microservices (more than 500 tests with over 20,000 checks). This may sound intimidating on paper, but it is not: Tracetest integrates easily into GitLab pipelines, which simplifies things considerably.

Example of the output of our pipelines in the Tracetest UI

Our checks were divided into four categories:

  • SQL Queries (Text and Query Type)

  • REST requests (headers, body and status)

  • All switches (true/false)

  • Other

This was all great, but we also had the task of integrating with Jest. Of course, Tracetest didn't have native integration with Jest…

We needed some API to communicate between Tracetest and Jest.

Integration with Jest

We came to the conclusion that we needed to take the x-request-id from requests that had already been made, then find the corresponding trace in Jaeger and extract from it the trace-id, which acts as a trigger in Tracetest. This was implemented as a custom matcher, “under the hood” of which all of this happened. Thus, we successfully embedded Tracetest checks as additional assertions in our existing Jest tests.

Integration with Jest

Now about what exactly was happening “under the hood”:

Using swagger-typescript-api, we generated a set of API clients (JS classes with our API methods and all the interfaces we use) from the OpenAPI 3.0 Swagger documentation. Then we wired Jest and Tracetest together using Axios.

Now we had Jest integration tests running that contained expectations using our custom matcher. As a result, when the Tracetest check passed, Jest reported the “passed” status; in case of an error it reported “failed”, and the error could then be investigated in Tracetest to pinpoint the problem location.
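
To give a feel for the approach, here is a minimal sketch of such a matcher; the Jaeger query parameters, the Tracetest endpoint, and the payload shapes are assumptions and will differ depending on your versions and setup:

// Sketch of the idea behind the custom matcher (not our production code).
// It takes the x-request-id of a request that has already been made, finds
// the corresponding trace in Jaeger, and asks Tracetest to evaluate a test
// against it. Endpoints, query parameters and payloads are assumptions.
import axios from 'axios';

expect.extend({
  async toPassTracetest(requestId: string, testId: string) {
    // 1. Find the trace whose spans carry our x-request-id (Jaeger query API).
    const jaeger = await axios.get('http://jaeger-query:16686/api/traces', {
      params: {
        service: 'payment-processing-service', // illustrative service name
        tags: JSON.stringify({ 'http.request_id': requestId }),
      },
    });
    const traceId = jaeger.data?.data?.[0]?.traceID;
    if (!traceId) {
      return { pass: false, message: () => `no trace found for x-request-id ${requestId}` };
    }

    // 2. Trigger a Tracetest run for that trace (payload shape is an
    //    assumption); a real matcher would poll the run until it finishes.
    const run = await axios.post(`http://tracetest:11633/api/tests/${testId}/run`, {
      variables: [{ key: 'TRACE_ID', value: traceId }],
    });

    const passed = run.data?.result?.allPassed === true;
    return {
      pass: passed,
      message: () =>
        passed ? 'Tracetest checks passed' : `Tracetest checks failed, see run ${run.data?.id ?? '?'}`,
    };
  },
});

// Usage inside an existing Jest test (after declaring the matcher typings):
// await expect(response.headers['x-request-id']).toPassTracetest('<tracetest-test-id>');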

Conclusion

Why did we choose Tracetest:

  1. Open-source solution. It gives the team technological independence, no mandatory fee for the right to use the product, and the ability to make feature requests to Tracetest via an issue on GitHub.

  2. Based on international standards (such as OpenTelemetry, which is valuable in our scenario).

  3. Availability of integrations with tools that we already use within the project (Playwright, k6).

  4. Lack of suitable alternatives.

  5. This is a fresh product that doesn't require you to deal with outdated modules and integrations.

The implementation of Tracetest helped us significantly improve observability and automate the testing process. Regression testing time was reduced and the speed of delivering new features increased. Where previously we shipped one release per month (or even less often), within six months we were able to deliver 4 release candidates to production per month (on average, we could test 2 release candidates per two-week sprint). Tracetest proved itself in terms of flexibility and scalability.
