Antipatterns of event-driven architecture

Hello again! Ahead of the start of the “Software Architect” course, we have prepared a translation of another interesting article.


Microservice architecture has grown in popularity over the past few years. There are many resources that teach you how to implement it correctly, but quite often people talk about it as if it were a silver bullet. There are many arguments against microservices, but the most significant is that this type of architecture brings additional complexity, the level of which depends on how you manage the relationships between your services and teams. You can find plenty of literature explaining why microservices may not be the best choice in your case.

We at letgo migrated from a monolith to microservices to meet our scalability needs, and we quickly saw the positive effect on how our teams work. Used correctly, microservices have given us several advantages:

  • Application scalability: in our experience, the most painful part of scaling an application is its architecture. Microservices imply modularity of both the code and the infrastructure (databases, etc.). In a correctly implemented microservice architecture, each service owns its infrastructure. For example, only the Users service can read from and write to the users database.
  • Organization scalability: microservices help solve organizational problems and make it possible to manage a code base that is changed by several teams. Splitting the code base prevents conflicts between changes. In our experience, large teams do not scale efficiently, so as we divided the engineers into small teams, we also split our code base into small components. In addition, working in small teams fosters ownership.

Event-driven architecture

Not all microservice architectures are event-driven. Some people favor synchronous communication between services over HTTP (REST, gRPC, etc.). At letgo, we try to avoid this pattern and connect our services asynchronously through domain events. Here are our reasons:

  • Improving scalability and resilience: dividing a large system into smaller subsystems lets you contain the impact of failures. For example, a DDoS attack or a traffic spike on one service will not affect the others. If your services communicate synchronously, a DDoS attack on one service will also hurt the services that depend on it; in that case the services are too tightly coupled. For us, the key to scalability and resilience is to draw hard boundaries between services and to control how they communicate across those boundaries. The idea is the same as bulkheads: vertical walls in a ship’s hull that create watertight compartments, which contain the water if the hull is breached or leaks.

  • Decoupling: changing one service should not affect the others. We believe that practices such as synchronizing the deployment of several services are a smell, because they add operational complexity. There are several ways to reduce this complexity, for example API versioning. In our experience, however, using domain events as the public contract of a service helps to model its domain so that changes in one service do not affect the others. The user entity in our Users service does not have to be the same as the user entity in the Chat service.

For these reasons, we at letgo try to stick to asynchronous communication between services and use synchronous communication only in exceptional cases, such as the MVP of a feature. We do this because we want each service to build its own entities from the domain events that other services publish to our message bus.

In our opinion, success or failure in adopting a microservice architecture depends on how you deal with its inherent complexity and on how your services interact with each other. Splitting the code without also making the communication between services asynchronous will turn your application into a distributed monolith.

Event-driven architecture at letgo

Today I want to share an example of how we use domain events and asynchronous communication at letgo. Our User entity exists in many services, but it is created and edited in the Users service. The database of the Users service stores a lot of data, such as name, email address, avatar, country, etc. Our Chat service also has a concept of a user, but it does not need all the data that the User entity has in the Users service. The list of conversations shows only the user’s name, avatar and ID (a link to the profile). We say that Chat holds only a projection of the User entity, which contains partial data. In fact, in Chat we do not talk about users at all; we call them “talkers”. This projection belongs to the Chat service and is built from the events that Chat receives from the Users service.

We do the same with listings. The Products service stores n pictures for each listing, but the conversation list shows only the main one, so our projection from Products to Chat needs one picture instead of n.
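To make the idea of a projection more concrete, here is a minimal Python sketch of how a consumer in Chat might build its “talker” records from Users events. It is an illustration under assumptions, not letgo’s actual code: the class names, the user_updated event type and the avatar_url attribute are hypothetical, and the event layout follows the common structure described later in this article.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Talker:
    """Chat's partial view of a user: only what the conversation list needs."""
    user_id: str
    name: str
    avatar_url: str


class TalkerProjection:
    """In-memory stand-in for the storage that Chat owns (hypothetical)."""

    def __init__(self) -> None:
        self._talkers: Dict[str, Talker] = {}

    def handle(self, event: dict) -> None:
        data = event["data"]
        attrs = data["attributes"]
        # "user_updated" and "avatar_url" are assumed names for illustration.
        if data["type"] in ("user_registered", "user_updated"):
            # Keep only the fields Chat cares about; email, country, etc.
            # stay in the Users service.
            self._talkers[attrs["id"]] = Talker(
                user_id=attrs["id"],
                name=attrs["user_name"],
                avatar_url=attrs.get("avatar_url", ""),
            )

    def get(self, user_id: str) -> Optional[Talker]:
        return self._talkers.get(user_id)
```

The projection deliberately ignores every attribute that Chat does not display, so the Users service can add or change other fields without affecting Chat.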


The conversation list view in our chat, showing which backend service provides each piece of information.

If you look at the conversation list again, you will see that almost none of the data we show is created by the Chat service, yet all of it belongs to Chat, because the User and Product projections are owned by Chat. There is a trade-off between the availability and the consistency of projections, which we will not discuss in this article; I will only say that it is clearly easier to scale many small databases than one large one.


Simplified view of letgo architecture

Antipatterns

Some solutions that seemed intuitive turned out to be mistakes. Here is a list of the most important antipatterns we have encountered in our domain event architecture.

1. Fat events

We try to make our domain events as small as possible without losing domain value. We had to be careful when refactoring legacy code bases with large entities and moving them to an event-driven architecture. Such entities can lead to fat events, but since our domain events have become a public contract, we need to keep them as simple as possible. In this case, refactoring is best approached from the outside in: first we design our events using the event storming technique, and then we refactor the service code to fit those events.

We should also be careful with the “product and user” problem: many systems have product and user entities, these entities tend to attract all the logic, and as a result every domain event ends up associated with them.

2. Events as intentions

A domain event, by definition, is something that has already happened. If you publish a message to the message bus to request that something happen in another service, you are most likely issuing an asynchronous command rather than publishing a domain event. As a rule, we name domain events in the past tense: user_registered, product_published, etc. The less one service knows about the others, the better. Using events as commands couples services and increases the likelihood that a change in one service will affect the others.

3. Lack of language-independent serialization or compression

The serialization and compression of your domain events should not depend on a programming language. You should not even need to know which language the consumer services are written in. That is why we cannot use, for example, the native Java or PHP serializers. Give your team time to discuss and choose a serializer, because changing it later will be difficult and time-consuming. We at letgo use JSON, but there are many other serialization formats with good performance.

4. Lack of standard structure

When we started porting the letgo backend to an event-oriented architecture, we agreed on a common structure for domain events. It looks something like this:

{
  "data": {
    "id": [uuid], // event id
    "type": "user_registered",
    "attributes": {
      "id": [uuid], // aggregate/entity id, in this case user_id
      "user_name": "John Doe",
      …
    }
  },
  "meta": {
    "created_at": timestamp, // when was the event created?
    "host": "users-service", // where was the event created?
    …
  }
}

Having a common structure for our domain events allows us to quickly integrate services and implement some libraries with abstractions.
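As an illustration of the kind of small abstraction a common structure makes possible, here is a hedged Python sketch of a helper that wraps any payload in the envelope shown above. The function name build_event is hypothetical, and since the article does not specify the format of created_at, the sketch assumes Unix time in seconds.

```python
import json
import time
import uuid


def build_event(event_type: str, aggregate_id: str, attributes: dict, host: str) -> dict:
    """Wrap a payload in the common envelope (hypothetical helper)."""
    return {
        "data": {
            "id": str(uuid.uuid4()),  # event id
            "type": event_type,
            # The aggregate id is written last so the payload cannot override it.
            "attributes": {**attributes, "id": aggregate_id},
        },
        "meta": {
            # Assumption: Unix time in seconds; the source only says "timestamp".
            "created_at": int(time.time()),
            "host": host,
        },
    }


if __name__ == "__main__":
    event = build_event(
        "user_registered", str(uuid.uuid4()), {"user_name": "John Doe"}, "users-service"
    )
    print(json.dumps(event, indent=2))
```

Because every event has the same outer shape, consumers can route on data.type and log meta fields without knowing anything about the payload.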

5. Lack of schema validation

When serializing events, we at letgo ran into problems with programming languages that lack strong typing.


{
  "null_value_one": null, // thank god
  "null_value_two": "null",
  "null_value_three": ""
}

A solid testing culture that covers the serialization of our events, together with an understanding of how the serialization library works, helps to cope with this. We at letgo are moving to Avro and the Confluent Schema Registry, which give us a single source of truth for the structure of our domain events and help us avoid errors of this type, as well as outdated documentation.
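The sketch below shows, in plain Python, the kind of check that catches the three “null” variants above before an event is published. It is a hand-rolled stand-in for illustration only and does not use Avro or the Schema Registry; the schema dictionary, the validate_attributes function and the avatar_url field are hypothetical.

```python
from typing import Dict, List

# Hypothetical schema: field name -> accepted Python type(s).
USER_REGISTERED_SCHEMA: Dict[str, object] = {
    "id": str,
    "user_name": str,
    "avatar_url": (str, type(None)),  # nullable field
}


def validate_attributes(attributes: dict, schema: dict) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for field, expected in schema.items():
        if field not in attributes:
            errors.append(f"{field}: missing")
            continue
        value = attributes[field]
        if not isinstance(value, expected):
            errors.append(f"{field}: unexpected type {type(value).__name__}")
        elif isinstance(value, str) and value.strip().lower() in ("", "null"):
            # Catches "null" and "" sent where a real null was intended.
            errors.append(f"{field}: empty or 'null' string instead of null")
    return errors
```

A real schema system does much more (schema evolution, compatibility checks), and this naive rule would also reject a legitimate string value of "null", so treat it only as a picture of where validation sits in the publishing path.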

6. Anemic domain events

As I said before, and as the name implies, domain events must carry value at the domain level. Just as we try to avoid inconsistent states in our entities, we must avoid them in our domain events. Let’s illustrate this with an example: a product in our system has a geolocation with latitude and longitude, which are stored in two separate fields of the products table in the Products service. Any product can be “moved”, so we need domain events to represent this update. Previously, we had two events for this, product_latitude_updated and product_longitude_updated, which did not make much sense unless you were a rook on a chessboard. Events such as product_location_updated or product_moved carry more meaning here.


A rook is a chess piece, formerly also called the tower. A rook can move only vertically or horizontally, across any number of unoccupied squares.

7. Lack of debugging tools

We at letgo produce thousands of domain events per minute. All these events are an extremely useful resource for understanding what is happening in our system, recording user activity, or even reconstructing the state of the system at a specific point in time using event sourcing. To make good use of this resource, we need tools to inspect and debug our events. Queries like “show me all the events generated by John Doe in the last 3 hours” can also be useful for detecting fraud. For these purposes, we have built some tools on top of Elasticsearch, Kibana and S3.

8. Lack of event monitoring

We can use domain events to check the health of the system. When we deploy something (which happens several times a day, depending on the service), we need tools to quickly verify that everything works. For example, if we deploy a new version of the Products service to production and see a 20% drop in the number of product_published events, it is safe to say that we broke something. We currently use InfluxDB, Grafana, and Prometheus for this, relying on derivative functions. If you remember your math classes, the derivative of a function f(x) at a point x is the slope of the tangent line to the graph of the function at that point. If you have a function that describes the publication rate of a specific domain event and take its derivative, you will see the peaks of that function and can set alerts on them. With these alerts you can avoid rules such as “warn me if we publish fewer than 200 events per second for 5 minutes” and focus on significant changes in the publication rate.
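The derivative-based idea can be sketched with a discrete approximation: take the publication counts per time bucket, compute the change between consecutive buckets, and alert on large relative jumps rather than on an absolute threshold. The Python below is a simplified illustration only; it does not use InfluxDB, Grafana or Prometheus, and the function names and the 20% threshold are assumptions.

```python
from typing import List, Tuple


def rate_changes(counts: List[float]) -> List[float]:
    """Discrete derivative: change in event count between consecutive buckets."""
    return [cur - prev for prev, cur in zip(counts, counts[1:])]


def find_alerts(counts: List[float], threshold: float = 0.2) -> List[Tuple[int, float]]:
    """Return (bucket index, relative change) for every jump >= threshold."""
    alerts = []
    for index, (prev, cur) in enumerate(zip(counts, counts[1:]), start=1):
        if prev == 0:
            continue  # relative change is undefined for an empty bucket
        relative = (cur - prev) / prev
        if abs(relative) >= threshold:
            alerts.append((index, relative))
    return alerts


if __name__ == "__main__":
    # Hypothetical product_published counts per minute around a deploy.
    per_minute = [1000, 1010, 990, 780, 800]
    print(rate_changes(per_minute))
    print(find_alerts(per_minute))
```

Because the rule looks at the change and not at the level, the same alert works for a service that publishes ten events per minute and for one that publishes ten thousand.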


Something strange happened here … Or maybe it’s just a marketing campaign

9. The hope that everything will be fine

We try to build resilient systems and reduce the cost of recovering them. Besides infrastructure issues and human error, one of the most common problems in an event-driven architecture is the loss of events. We need a plan for restoring the correct state of the system by reprocessing all the events that were lost. Our strategy rests on two points:

  • Storing all events: we should be able to run operations such as “reprocess all the events that occurred yesterday”, so we need an event store that holds every event. At letgo, this is more a task for the Data team than for the Backend team.
  • Consumer idempotency: our consumers should be able to handle the same event more than once without corrupting their internal state and without generating a flood of errors. This can happen when you recover from a failure and reprocess events, or because the message bus delivers a message more than once. In our opinion, idempotency is the cheapest solution to this problem. Imagine that our service listens to the user_registered event from the Users service because we need a user projection, and that we store it in a MySQL table with user_id as the primary key. If we do not check for existence before inserting when we process user_registered, we may run into many duplicate-key errors. Even if we do check before inserting, we can still get an error because of replication lag between the MySQL master and its replica (about 30 ms on average). Since these projections can be represented as key-value records, we are moving them to DynamoDB. Even when you aim for idempotency, there are use cases, such as incrementing or decrementing a counter, where building an idempotent consumer is very hard. Depending on how critical the use case is at the domain level, you should decide how much failure and inconsistency you can tolerate, and whether the cost of a deduplication system pays off.
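One common way to make such a consumer idempotent is to replace “check, then insert” with a single upsert, so that processing the same event twice leaves the same row behind. The sketch below uses Python’s built-in sqlite3 module purely as a stand-in for MySQL or DynamoDB; the table name, the function names and the use of INSERT OR REPLACE are assumptions for illustration, not letgo’s implementation.

```python
import sqlite3


def create_store() -> sqlite3.Connection:
    """In-memory table standing in for the projection storage."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE talkers (user_id TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    return conn


def handle_user_registered(conn: sqlite3.Connection, event: dict) -> None:
    attrs = event["data"]["attributes"]
    # Upsert in one statement: no separate existence check, so a redelivered
    # event overwrites the row with identical data instead of failing.
    conn.execute(
        "INSERT OR REPLACE INTO talkers (user_id, name) VALUES (?, ?)",
        (attrs["id"], attrs["user_name"]),
    )
    conn.commit()


if __name__ == "__main__":
    store = create_store()
    event = {"data": {"type": "user_registered",
                      "attributes": {"id": "u-1", "user_name": "John Doe"}}}
    handle_user_registered(store, event)
    handle_user_registered(store, event)  # duplicate delivery
    print(store.execute("SELECT COUNT(*) FROM talkers").fetchone()[0])
```

This works because writing the same key-value record twice is naturally idempotent. It does not help with the counter case mentioned above, where each delivery changes the result and some form of deduplication by event id would be needed.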

10. Lack of domain event documentation

Our domain events have become the public interface of all our backend systems. Just as we document our REST APIs, we need to document our domain events. Anyone in the organization should be able to find up-to-date documentation for every domain event published by every service. If we use schemas to validate domain events, those schemas can also serve as documentation.

11. Resistance to consuming your own events

You are allowed, and even encouraged, to consume your own domain events to build projections in your system, for example projections optimized for reading. Some teams resisted this idea because they thought of consumption only in terms of other services’ events.

See you on the course!
