How to manage a distributed system without attracting the attention of orderlies

Hello! My name is Alexander Popov, and I am the tech lead of the 05.ru marketplace team. Right now we are working on the marketplace backend and a few other services for the Dagestan market.

When developing the server side of the marketplace, we decided from the start to build it as a distributed architecture. This article is about how the chaos of distributed systems led us to choreography, and why we are now marching towards orchestration. We are still in the middle of implementing the changes, but I decided to write about them now so that next time I can publish a second article with the results; then we'll see whether expectations match reality. There will be a lot of diagrams in this article and no cats, but hang in there.

Let's define what a distributed system is

In a simplified (for clarity, to the point of absurdity) diagram, it looks like this:


We have a client (it could be a browser, a mobile application or something else), and the client sends a request to the system. Inside the system something is requested from somewhere and redirected somewhere else; some kind of chaos and bacchanalia takes place. As a result, the assembled response is returned to the client. From somewhere.

Now I’ll open the black box and show, using our example, how everything can be arranged there.

From opening to middlegame

All legends about development usually begin with the phrase “this is how it happened historically…”, and we are no exception.

It so happened historically that the 05.ru marketplace backend consists of several services plus a single entry point for the whole system, which decides where each request should be routed. There are also a separate authorization service and a cache. Each service has its own database, and all of them share a common bus. For a long time this looked quite logical, natural and manageable.

This is what the initial version of our distributed system looks like in the diagram:

Most importantly, the marketplace backend was built by two developers who knew each other well and not only understood one another, but also worked out shared conventions.

In 2022 the project began to develop rapidly. The analytics department joined us and rolled out new requirements, and the development team grew. The explosive growth predictably began to breed inconsistencies.

The first problems looked like this

Naturally, the problems came from the constantly growing number of services. Services listen to the bus and publish to it, work with databases and caches, and call one another. All of this starts to look like spaghetti, and keeping track of the product becomes harder and harder.

On top of that came problems in communicating with analysts and testers. They struggle to follow the logic of what happens in the backend, and the inevitable appearance of new services only aggravates the problem.

Another difficulty is the tons of documentation. Developers don't really want to write it, and other teams don't really want to read it. Two pages might still get read, but twenty pages stand almost no chance: laziness wins. A lose-lose situation, everyone loses.

Onboarding new employees doesn't go well either. You immediately have to explain the logic of this chaotic structure to them, and they have to somehow start finding their way around it.

Then there are the organizational problems, for example the lack of clear criteria for developing new functionality. When developers are given a scenario to implement, it is very hard to determine where exactly that scenario should live, and, most importantly, how to verify it later. As a result the testing department carries a heavy load: testers, no less than developers, need to understand what to check and where, and they too have to be immersed in the context of what is going on.

As a result, we end up with a very long lead time between a new idea and the deployment of the feature that implements it. And in the credits we see rather demotivated employees who constantly have to keep this entire system in their heads, document it, and answer a million questions about where, what and how it works. Have you seen the film "The Mist", based on the Stephen King novel? Watch it.

In search of a magic pill

There are two fundamentally different approaches to managing distributed systems – choreography and orchestration. Each has pros and cons.

Choreography

This is an approach in which the participants exchange events without centralized control. Some input data or request arrives at Service 1, and an action is performed. An event announcing the completion of that action is then sent to the bus. Other services asynchronously pick up the event, do their own work, and send events about its completion back to the bus. Then, for example, Service 1 picks up the completion events of the other services and produces the resulting response.
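To make the pattern concrete, here is a minimal sketch of choreography in PHP (the language of our backend). The EventBus interface, the topic name and both services are hypothetical, invented purely for illustration; the point is that the publisher knows nothing about its listeners.

```php
<?php
// Hypothetical bus interface and services, for illustration only.
interface EventBus
{
    public function publish(string $topic, array $payload): void;
    public function subscribe(string $topic, callable $handler): void;
}

final class OrderService
{
    public function __construct(private EventBus $bus) {}

    public function createOrder(int $orderId): void
    {
        // ...persist the order in this service's own database...
        // Then announce the fact. Who listens and what they do is not our concern.
        $this->bus->publish('order.created', ['orderId' => $orderId]);
    }
}

final class NotificationService
{
    public function __construct(private EventBus $bus) {}

    public function listen(): void
    {
        // Reacts on its own; OrderService has no idea this handler exists.
        $this->bus->subscribe('order.created', function (array $payload): void {
            // ...send a push notification about order $payload['orderId']...
        });
    }
}
```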

Pros

The main one is minimal coupling between services. Each service has essentially no idea whether anyone is listening to it or who does what with its events. Their only point of contact is the contracts the events follow. It is easy to plug new services into this chain or remove existing ones.

Another big plus is good performance, especially where steps can run in parallel.

Cons

The main one is that the system is hard to control. There is no single point from which you can manage a process from start to finish; the scenario is smeared across all the services. As a result, debugging the code is quite complex.

Finding the source of an error, say at the last stage, can take an indefinite amount of time, because you have to trace every event and every service: what it sent and what it listened to.

Orchestration

In this approach there is a single orchestrator that manages the entire scenario step by step. It receives requests, decides what to send where and what to request from whom, then collects all the data and returns the answer.
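For contrast, the same toy example orchestration-style: a single component calls the services directly and assembles the response. The service client interfaces are again hypothetical.

```php
<?php
// Hypothetical service clients; in reality these would be HTTP/gRPC wrappers.
interface OrderServiceClient { public function create(array $data): int; }
interface NotificationServiceClient { public function orderCreated(int $orderId): void; }

final class OrderOrchestrator
{
    public function __construct(
        private OrderServiceClient $orders,
        private NotificationServiceClient $notifications,
    ) {}

    public function createOrder(array $data): array
    {
        // Step-by-step control: the orchestrator decides what to call and when...
        $orderId = $this->orders->create($data);
        $this->notifications->orderCreated($orderId);

        // ...and it is the one that produces the final response.
        return ['orderId' => $orderId, 'status' => 'created'];
    }
}
```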

The big, bold and fairly obvious plus is visual, centralized management of the entire scenario: it is described in one place.

The disadvantages are a consequence of the main advantage

Since we have a single entry and exit point, that same point is also a potential point of failure for the whole system. If the orchestrator goes down, the entire system goes down.

Choreography behaves much better when one of the services fails or errors occur, because the system as a whole keeps working.

The second drawback is the slightly tighter coupling between services, because here the orchestrator's work depends directly on the services' responses.

It's not like that with us. Wait a minute though…

You have already seen how our system is implemented:

Well, you've already seen it

The current design of our system is closer to choreography, a bit chaotic and with some modifications, but still: there is a bus, events are listened to, and events are sent to the bus.

In general, the services operate independently of each other. But, as I mentioned at the beginning, the number of services kept growing, and communicating with the analytics and testing departments became increasingly difficult. So we realized that choreography did not suit us, and we began to look towards orchestration.

And the first question is what to choose as the orchestrator. Building something homemade would be a bad idea in our case: we have neither the experience nor the qualifications for that kind of development, since it is lower-level software than the services we write. The probability of producing a non-working toy instead of an orchestrator tends to one.

We chose Temporal, and now I’ll tell you why.

Let me attach right away some links to materials that explain, in simple terms, the basic concepts and approaches to working with distributed systems on top of Temporal:

Fault tolerant workflow orchestration on PHP

Orchestrate it! We describe complex business processes in PHP

Orchestration and Murphy's Law: Handling Business Process Errors

Under the spoiler I wrote a little about Temporal

The idea of Temporal is to execute scenarios step by step while saving state. The code runs in steps, and its state is saved after each one: a step executes, the state is saved; the next step executes, the state is saved again. In case of a failure, an error, or even a technical outage, this makes it possible to continue executing the scenario from exactly the point where it stopped. Each step is effectively an atomic operation: once it has executed, the state is considered changed, and that operation will not be performed again.

There are two main concepts in Temporal. A workflow is a scenario written in a programming language; an activity is an atomic operation that either executes completely or not at all. An activity can be a database request, some computation, or, as in our case, a call to a service. There is also the notion of a signal: an external request that can change the state of a scenario (workflow). For example, a workflow can wait for a specific signal before continuing, or a specific signal can interrupt the workflow.
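Here is a minimal sketch of all three concepts with the Temporal PHP SDK. The domain (an order waiting for approval) and all class and method names are illustrative, not our production code:

```php
<?php
use Carbon\CarbonInterval;
use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;
use Temporal\Activity\ActivityOptions;
use Temporal\Workflow;
use Temporal\Workflow\SignalMethod;
use Temporal\Workflow\WorkflowInterface;
use Temporal\Workflow\WorkflowMethod;

#[ActivityInterface]
interface PaymentActivities
{
    // An activity: an atomic operation, e.g. a call to one of our services.
    #[ActivityMethod]
    public function chargeCustomer(string $orderId): void;
}

#[WorkflowInterface]
class OrderWorkflow
{
    private bool $approved = false;

    // A signal: an external request that changes the workflow state.
    #[SignalMethod]
    public function approve(): void
    {
        $this->approved = true;
    }

    #[WorkflowMethod]
    public function run(string $orderId)
    {
        $payments = Workflow::newActivityStub(
            PaymentActivities::class,
            ActivityOptions::new()->withStartToCloseTimeout(CarbonInterval::minutes(5)),
        );

        // The workflow can wait for the signal indefinitely: its state is
        // saved by Temporal, not kept in the PHP process.
        yield Workflow::await(fn(): bool => $this->approved);

        // Once this step completes, it will never run again: on replay,
        // Temporal substitutes the saved result instead of re-executing it.
        yield $payments->chargeCustomer($orderId);
    }
}
```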

The main advantage of Temporal is its phenomenal fault tolerance, achieved through the event sourcing approach, that is, state saving. Add to this almost unlimited horizontal scalability: a workflow itself does not store any data, only the scenario and its states. You can run any number of workflow processes and distribute the load between them with an external balancer.

These two advantages compensate for the main disadvantage of orchestration, its bottleneck – the point of failure.

I am not afraid to state that Temporal is the most reliable place in the entire system, because even during external failures the executing scenario will not disappear anywhere and will not throw any critical error. Temporal will survive relatively easily even a "cleaning lady incident", that is, a plug pulled out of the socket. Once the problem is fixed, the scenario will continue executing from where it left off.

Another big advantage is the set of ready-made SDKs for different programming languages: PHP, Go, Java, TypeScript, Python, take your pick. Temporal is a visibly developing product with a fairly active community, including a Russian-speaking one. There is somewhere to look for answers, ask questions, and discuss.

Dumb services – smart orchestrator

We want to use all the advantages of the orchestrator to solve problems in the marketplace backend.

This is what the new architecture looks like (we haven’t implemented it yet)

The new architecture has services plus Temporal as the orchestrator, with workflows and activities. To read data you can go to a service directly, bypassing the orchestrator: data can be fetched easily and quickly, without any aggregators. Changing, adding or deleting data happens only through the orchestrator.

Essentially the scheme is the same (a single entry point, the gateway, plus authorization), but between the gateway and the services there is now always a façade that controls the order in which requests are executed and is responsible for data consistency across services. The execution of a scenario can stretch over an indefinite period (not just minutes or hours, but even months and years): a workflow can stay running for an unlimited amount of time.
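On the façade side, starting such a long-lived scenario might look roughly like this: a sketch with the PHP SDK client, where the address, the task queue and the OrderWorkflow from the earlier sketch are all illustrative.

```php
<?php
use Temporal\Client\GRPC\ServiceClient;
use Temporal\Client\WorkflowClient;
use Temporal\Client\WorkflowOptions;

$client = WorkflowClient::create(ServiceClient::create('temporal:7233'));

$workflow = $client->newWorkflowStub(
    OrderWorkflow::class,
    WorkflowOptions::new()->withTaskQueue('facade'),
);

// start() returns at once; the workflow itself may then live for months.
// Its state is stored by Temporal, not by this PHP process.
$run = $client->start($workflow, 'order-42');
```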

The bus is still part of the scheme, but it now plays a more auxiliary role, handling non-critical, asynchronous events.

And most importantly, the services do not communicate with each other and do not listen to the bus. All business logic lives exclusively in the façade. Services store entities, are responsible for adding, deleting and changing them, and serve requested data as fast as possible. Services validate all incoming data and are responsible for data consistency within themselves. Services are not responsible for anything outside their own entities.

At the initial stage, access rights to data will be managed by a separate service, though we plan to move this to an external system later. Services also report all changes to their entities to the bus, to the event storage, and to the logs.

How the orchestrator will make friends of us and the analysts

The services replicate almost exactly the models provided by the analytics department, along with the specified constraints, relations and access rights. They are just a wrapper around the data.

As you can easily see, with this approach almost all of the service code will be identical. And if so, it is logical to move the common service code into an external shared library, a service skeleton, or Service Template; each specific service then essentially becomes a description of the entities that come from the analytics department.

The big bonus is immediately obvious. Our services correspond almost exactly to the data coming from analytics. They are easy to test, easy to validate, and very fast and easy to develop. A skeleton written once lets you develop a new service in just a couple of days, and with good code generation even in a couple of hours.
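To show the idea, here is what an entity description in such a skeleton might look like. This is pure fantasy on the Service Template theme: the Entity and Validate attributes are invented for illustration and are not an existing library.

```php
<?php
// Invented attributes standing in for the skeleton's declarative API.
#[Attribute(Attribute::TARGET_CLASS)]
final class Entity
{
    public function __construct(public string $table) {}
}

#[Attribute(Attribute::TARGET_PROPERTY)]
final class Validate
{
    public function __construct(
        public ?int $min = null,
        public ?int $max = null,
        public ?int $maxLength = null,
    ) {}
}

// A "service" is then little more than this description: the skeleton
// would generate storage, CRUD endpoints, validation and change events.
#[Entity(table: 'reviews')]
final class Review
{
    public int $id;

    #[Validate(min: 1, max: 5)]
    public int $rating;

    #[Validate(maxLength: 5000)]
    public string $text;
}
```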

Before and after

I will show the changes using a specific example, but one from the future: we will implement it in the first quarter of 2024, and I will report the results in a new article.

What does the case look like now?

It is the calculation of the average product and store rating based on user ratings.

Ratings are left by customers along with their reviews. A review can be left both for a product and for a store.

We have a moderator who approves a specific review. The review service then updates its data and sends a response to the moderator. At the same time, the review service sends an event to the bus saying that this specific review has been published.

This event is listened to by the catalog service, which stores products, and by the merchant service, which stores stores. On receiving the event, each of these services requests all reviews for its entity from the review service, recalculates the average rating of the product or store, and updates its own data. The scenario works and is understandable, but it has a number of shortcomings.

First, the entire case is spread across three services: review approval happens in one service, product ratings are calculated in a second, and store ratings in a third.

Second, we have no centralized error handling. If, for example, an error occurred while requesting all of a product's reviews, we can only learn about it from the catalog service's own logs, or from the system logs into which all errors are poured. Naturally, that is not very convenient. Most importantly, in that case the average product rating will simply not be updated. For product ratings specifically this is not critical, but cases vary, and right now the approach is roughly the same everywhere.

Third, a bus event can in principle be lost if, for example, the catalog service was unavailable at the moment the event was sent, for whatever reason: a deployment was in progress or a server node crashed. We would simply lose that event. Again, for ratings this is not critical, but the problem remains.

How should such a scenario look in the new architecture?

The entire scenario is described in one façade. Essentially it does roughly the same thing, but in a different order.

As before, first a request comes from the moderator to publish a specific review. The scenario is launched and the publication step runs: a request to the review service.

The review service responds that the publication was successful.

After this, the façade can respond to the moderator that everything is fine.

And after sending the response, it continues updating the average ratings in the catalog and merchant services: it requests all reviews for the specific entity, calculates the average, and updates the corresponding entity.
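As a sketch, the whole scenario could be a single workflow living in the façade. The activity interface wraps the calls to our three services; all names and payloads are illustrative:

```php
<?php
use Carbon\CarbonInterval;
use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;
use Temporal\Activity\ActivityOptions;
use Temporal\Workflow;
use Temporal\Workflow\WorkflowInterface;
use Temporal\Workflow\WorkflowMethod;

#[ActivityInterface]
interface ReviewActivities
{
    #[ActivityMethod]
    public function publishReview(int $reviewId): array;       // review service

    #[ActivityMethod]
    public function recalcProductRating(int $productId): void; // catalog service

    #[ActivityMethod]
    public function recalcStoreRating(int $storeId): void;     // merchant service
}

#[WorkflowInterface]
class ApproveReviewWorkflow
{
    #[WorkflowMethod]
    public function approve(int $reviewId)
    {
        $acts = Workflow::newActivityStub(
            ReviewActivities::class,
            ActivityOptions::new()->withStartToCloseTimeout(CarbonInterval::minute()),
        );

        // Step 1: publish the review in the review service.
        $review = yield $acts->publishReview($reviewId);

        // Steps 2 and 3: update the averages. If either step fails, the
        // workflow resumes from the failed step, not from the beginning.
        yield $acts->recalcProductRating($review['productId']);
        yield $acts->recalcStoreRating($review['storeId']);
    }
}
```

The façade can answer the moderator as soon as the first step completes, while the rating updates continue in the background under Temporal's control.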

What is the advantage

The most obvious thing is that the entire scenario is described in one place. It's easy to add, easy to fix, easy to check.

The second significant difference is that we can handle errors in this scenario in almost any way we like.

If an error occurs, for example, at the last step (when updating the average rating itself), there are several options: you can roll back the entire review approval, or configure the retry policy so that the update keeps being retried until a certain deadline; you can also always send a notification when an error occurs.
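The retry option could look something like this sketch with the SDK's RetryOptions; the specific limits are made up and would be tuned per scenario:

```php
<?php
use Carbon\CarbonInterval;
use Temporal\Activity\ActivityOptions;
use Temporal\Common\RetryOptions;

$options = ActivityOptions::new()
    ->withStartToCloseTimeout(CarbonInterval::minute())
    ->withRetryOptions(
        RetryOptions::new()
            // Keep retrying the rating update with exponential backoff...
            ->withInitialInterval(CarbonInterval::seconds(5))
            ->withBackoffCoefficient(2.0)
            ->withMaximumInterval(CarbonInterval::hour())
            // ...but give up after 100 attempts and let the workflow decide:
            // roll back the approval or just send a notification.
            ->withMaximumAttempts(100),
    );
```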

And most importantly, the error will be easy to debug, because Temporal, on top of its phenomenal fault tolerance, writes an unreal number of extremely detailed, structured logs. An error is very easy to track down and fix. With a correctly configured retry policy, once the error is fixed (say, in the catalog service), you won't have to do anything else at all: the scenario itself will send the necessary request when it sees the service is working again, and will complete successfully.

Once again, the idea I want to get across is that a correctly written and configured scenario will be executed no matter what: when an error occurs, and in case of a technical failure, whether on the side of the façade itself or on the side of one of the services. A workflow can wait for a crash or error to be fixed for a practically unlimited time, and it guarantees that every step is executed and completed successfully.

There must be a catch, but where?

What will we get as a result when we fully implement this entire architecture on the marketplace backend?

The most important thing is fast, high-quality development of new services directly from analytics artifacts

That is, services will be developed strictly from the models and objects that analytics provides. User scenarios will likewise be implemented directly from analytics' use cases.

A low entry threshold for new employees creating and modifying services

All a new employee needs to learn is the Service Template: how it works and how to correctly describe entities, validation rules, and so on.

A new employee does not need to learn any new technologies or the relationships between different services. They receive an artifact from analytics, take it, and describe it in Service Template terms.

We will be able to develop both services and user scenarios completely independently

Moreover, this can be done by different development teams, and even in different languages. Temporal has SDKs for many languages, so nobody is forced to write scenarios strictly in one language.

Each service can be written by its own team, by a single developer, or outsourced, in whatever language is convenient to them. All that matters is that the result satisfies the agreed contracts.

There can be several façades; in fact there should be, because gathering all user scenarios in one place is also not a good idea. Some scenarios run very rarely and consume little time and few resources; others, on the contrary, are heavily loaded and resource-intensive; still others may pull in exotic dependencies. And in general, when there are many scenarios, it becomes hard to find the right one if they all live in one place.

We will get an easily validated system

That is, neither the analytics department nor the testing department will have to gather information bit by bit from each service's documentation, somehow reconcile it, and constantly clarify, ask, find out and supplement their knowledge.

There is one user scenario drawn in a use-case diagram, and the system's behavior can be checked strictly against that scenario. It is also precisely there that errors and shortcomings can be pointed out if a specific scenario misbehaves: the corresponding workflow is opened, edited, and finished off.

We get rid of the routine work of writing documentation

At the same time, we can be confident that each service corresponds to what analytics gave us, and that each scenario corresponds exactly to what is drawn in the use-case diagrams.

You have to pay for convenience

If anyone remembers, the main advantage of choreography is its good performance. The same cannot be said about orchestration in general and Temporal in particular.

Temporal pays for its phenomenal fault tolerance and detailed logs with speed and hardware resources. That is exactly why, in the diagram of the new architecture, direct requests to services bypassing Temporal come first.

Although Temporal is very fault-tolerant, it is slower than a scenario written in a vanilla programming language or run over a bus, because of the overhead of both logging and state saving.

You can argue, but it seems to me that the overhead of extra storage space and some slowdown when data changes is still better than the cost of demotivated employees, endless calls and clarifications, departments drifting out of sync, and data inconsistency errors across services.

It would be great if you praised or criticized my plans for moving to the new architecture in the comments. We are only now beginning to implement orchestration, and your experience will help us avoid mistakes down the road.
