Kafka for the youngest developers, analysts and testers

A few years ago there was a wave of Kafka hype. Everyone wanted to use Kafka, not always understanding what exactly they needed it for. And today many still bring Kafka into their projects, often expecting that merely using it will make everything better.

On the one hand, this may be a good thing – such impulses stimulate the industry. But it is still better to understand what you are doing, otherwise you can only make the project worse. In this article, I address developers, analysts and testers who have not yet encountered Kafka at work. I will help you understand why, in a microservice environment, many teams don't just talk over REST but also use this tool – what exactly Kafka does and when it makes sense to use it.

The Problem of Two Generals

This model neatly describes where all the pain of microservices comes from, so it is partly related to Kafka as well. Let's start with it.

There is a lot of talk about this concept, but I recently read an article where the author seemed to have no idea that such a law of the universe existed and was trying to come up with a solution that cannot exist. So I'll start with the basics.

Let's say we have two generals on the field, each with his own army. There is an enemy between them. The generals want to attack him from both sides. But they need to agree on what time to attack and do it simultaneously, because otherwise they will simply be killed one by one.

Each general has his own messengers. The first general sends a messenger to the second to inform him that the attack begins tomorrow at 8 a.m. But the enemy is not asleep. He can intercept the messenger before he reaches the second general.

If the messenger arrives without incident, the second general has received the message. But he needs to send a messenger with a reply that the information has been received and the attack will take place. The first general is aware of the situation and is unlikely to launch an attack without this confirmation. However, as in the first case, the enemy can intercept the messenger and then the first general will never know that his information has been received.

The reply messenger may well reach the first general. But for the attack to take place, the first general now needs to send a confirmation of his own (the second general has no desire to charge into hell alone). It turns out the messenger has to run not just there and back, but there once more, so that everyone can be sure their messages have been delivered.

Link to a detailed description of the problem.

There is no way out of this situation – the restriction cannot be objectively bypassed. If the messenger does not return, all we can do is send the next one, and then the next one, until one of them returns and confirms that everyone is ready to attack.

In microservices, everything happens exactly the same, only the role of generals is played by these same microservices, and between them there is a hostile network – the Internet or a local area network, whatever.

Let's say we need to write off 100 rubles from a certain Vasya's card using a payment service. We access it from our application via REST. We can be sure of the result only if all confirmations have arrived (there are responses). The answer can be “yes, the money has been written off” or “no, it hasn't been written off because there is no money.” And this is a good scenario.

And there are also many possible situations when there is no answer.

Firstly, we may not be able to get through to the payment service with our request (network failures or the service on the other side is unavailable). The payment service could have received our message but not written off the money because it crashed while processing the transaction.

The service could also have written off the money from the client but not responded to us, due to network failures or because our own application was down – the write-off happened, but there was no one to receive the result.

And our own application could also crash after the write-off was completed, but before a note about it was made.

In the last two situations, we wrote off the money, but we don't know about it on the side of our application – i.e. there is still no result. In general, this situation is unsolvable. As with the two generals, we can be sure of the result only if we clearly received an answer that everything is OK.

The Hard Way from Simple REST

In all the unsuccessful scenarios, we are left to keep poking the payment service to find out what happened. But REST retries are not a great solution either. A typical case: the network blinked, and then it turns out we made 10 write-offs.

This case can be treated with idempotency: on the payment service side we come up with a mechanism that does not allow the same request to write off money twice. For example, we can include a request ID in the message and, if it repeats, simply ignore the request. Note that it is the server receiving the message that must be able to live with duplicates.
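To make the idea concrete, here is a minimal sketch of such server-side deduplication. The request ID field, the class and method names are hypothetical, and a real service would keep the IDs in a database with a unique constraint rather than in memory:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical payment-service handler that ignores duplicate requests.
public class PaymentHandler {

    // Stand-in for a persistent store (e.g. a table with a unique constraint on the request ID).
    private final Set<String> processedRequests = ConcurrentHashMap.newKeySet();

    public void handleWithdrawal(String requestId, long amountKopecks) {
        // add() returns false if this ID was already seen: the request is a retry,
        // so we skip it instead of writing the money off a second time.
        if (!processedRequests.add(requestId)) {
            return; // duplicate – already processed
        }
        withdraw(amountKopecks);
    }

    private void withdraw(long amountKopecks) {
        // the actual debit logic lives here
    }
}
```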

The next problem is performance. Let's say the payment service is down and we keep retrying. We have to keep a thread occupied for this the whole time. In essence, we are DDoSing the network with these retries.

All that remains is to think up additional techniques – for example, if the first retries do not help, repeat the request after 30 seconds, then after 2 minutes, then after 10… Such a widening of the retry interval is called backoff. It may also be worth pausing retries entirely while the service is unavailable (this is a circuit breaker): just ping it once an hour to find out whether the service has recovered, and resume the write-off attempts only after the payment service comes back to life.
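As a rough sketch of what such retries with growing pauses might look like (the delays and the `Callable` wrapper are purely illustrative – in practice libraries like Resilience4j or Spring Retry give you backoff and a circuit breaker out of the box):

```java
import java.util.concurrent.Callable;

// Naive retry helper with exponential backoff: 30 s, 2 min, 8 min, ...
public class RetryWithBackoff {

    public static <T> T callWithBackoff(Callable<T> call, int maxAttempts) throws Exception {
        long delayMs = 30_000; // first pause between attempts
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                Thread.sleep(delayMs);
                delayMs *= 4; // widen the interval: 30 s -> 2 min -> 8 min -> ...
            }
        }
    }
}
```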

The next problem is thread starvation. While the requests we are retrying hang in memory, threads stay busy. If new requests to the same service arrive and hang as well, sooner or later we run out of threads. The result is resource exhaustion (up to an OOM), and our service stops responding for reasons of its own.

If at this point our service crashes and then comes back up, then no one will remember what was in all those frozen threads. It is clear that some requests were processed, but which ones… In other words, we lose all context. We cannot continue working, we need to start all over again.

To deal with this problem, it makes sense not to hold threads, but to organize something like a queue – that is, to retry from an intermediate store. It is convenient to have a background thread that retries these messages.
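A sketch of this "retry from an intermediate store" idea is below; the `OutboxDao` and `PaymentClient` interfaces are hypothetical stand-ins for the table of unconfirmed requests and the REST client to the payment service:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Background worker that periodically re-sends requests stored in a local table,
// instead of keeping a live thread per in-flight request.
public class BackgroundRetrier {

    interface OutboxDao {                         // hypothetical persistence layer
        List<String> loadPending(int limit);      // requests not yet confirmed
        void markDelivered(String requestId);
    }

    interface PaymentClient {                     // hypothetical REST client
        boolean trySend(String requestId);        // idempotent call, safe to repeat
    }

    private final OutboxDao dao;
    private final PaymentClient client;

    BackgroundRetrier(OutboxDao dao, PaymentClient client) {
        this.dao = dao;
        this.client = client;
    }

    void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            for (String requestId : dao.loadPending(100)) {
                if (client.trySend(requestId)) {
                    dao.markDelivered(requestId);   // stop retrying this one
                }
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```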

So with REST we travel a long way from a simple call to a rather involved system that delivers messages at least once (more than once is possible, and that's fine) and is eventually consistent – not consistent right now, but consistent at some point later.

And then the issue of load balancing arises. If the payment service crashed and then recovered, everyone who had been trying to reach it will start flooding it with the requests hanging in their queues. But perhaps it is only capable of processing one request per second. So we need to go easier on it and balance the load, otherwise the payment service will simply have to drop some of the requests with a rate limiter.

Somewhere at this stage the idea arises that, since we already have some kind of persistent queue, why not take Kafka. In essence, it solves the same problems, it is just located outside our service rather than inside. So we can simply replace the previous queue-based scheme with this one:

We will simply put messages from our application into Kafka, and the payment service will pick them up from there itself.

Unfortunately, Kafka will not solve these problems qualitatively, only quantitatively. In the end, if the queue is unavailable, then we find ourselves in a situation similar to the story with retries. If Kafka is unavailable, then the service can only shrug its shoulders and suspend work. That is why they try to guarantee very high availability for the queue. And by ensuring high availability of Kafka, we increase the availability of services, decoupling them from direct requests to each other.

Thus, we come to Kafka when we understand that even with REST everything is really complicated and there are a lot of questions that Kafka helps to solve. It simplifies things on the application side – it is easier to drop a message into Kafka than to build our own queue machinery. And it provides features that REST cannot – for example, load balancing on the client side.

We also get normal decoupling. We don't care at all whether the service is currently available or not – we just throw everything into Kafka and the system becomes asynchronous and a little more stable. We move away from the anti-pattern of fragile architecture.

Anti-pattern of Fragile Architecture

Let's say we have 50 microservices that synchronously call each other via REST. If we turn off one microservice, the call chain will sooner or later reach it, and it will not be able to process the request right then. Effectively, all the services go down at once.

If we replace REST with Kafka, we will not radically improve the situation – one of the microservices is still down. But we will definitely not make it worse. The other services are available and can at least try to respond from a cache or in some other way (and at least they do not generate parasitic retry traffic).

Fragile architecture here is a hypothetical example. Replacing everything with Kafka is not always advisable, since queues complicate maintenance. You lose a message and cannot even figure out which microservice failed to send it, and to which queue. You have to build distributed tracing to see how requests flow, deploy monitoring, and so on.

Overall, the industry consensus, as I see it, is: if you can avoid microservices, avoid them. The same goes for Kafka. If you don't know why you need Kafka, you probably don't need it. And it seems not everyone understands the complexity they are introducing when they add Kafka to a project – it turns into a cargo cult, switching to Kafka just for the sake of the tool itself.

It is important to assess the complexity of your system and the cost of an error for the business. There are very simple systems with a low cost of error – a client came, you failed to serve him, you lost 10 cents. Such risks can be ignored. It is only important not to find yourself gradually writing your own Kafka at some point. If the cost of an error is 10 thousand dollars and a lawsuit, the case takes a more serious turn. Then it makes sense to protect yourself from this error in every possible way, including by increasing complexity.

Why is Kafka so popular?

There are actually many message brokers. But it seems that even those on RabbitMQ are slowly moving to Kafka, because it is a good open-source solution that can be flexibly configured for different use cases. It is as if at some point everyone simply believed in the concept behind Kafka and started using it everywhere.

Kafka is doing well in terms of performance. You can add partitions for the same topic and the performance will grow almost linearly. Millions of messages can go through it, the only question is whether you are ready to pay for the servers.

How Kafka Works

For ease of understanding, I have drawn the following diagram:

The workflow typically starts with a producer writing a message to Kafka via TCP/IP, making a blocking synchronous call to the cluster – to one of the servers that is currently the leader (in this sense, the situation is very similar to REST).

There are always several servers inside Kafka. One will be the Leader, the others will be replicas. In fact, they duplicate each other's data. We work (communicate) with the leader, and the replicas copy messages from it and are ready to take over if the leader goes down. This provides a good guarantee of availability.

As in the two generals problem, the producer waits for Kafka to confirm that the message has been received. Although the code makes it look like each message is sent instantly, the producer actually sends them in batches, accumulating them in an intermediate buffer. The first setting worth talking about is the send mode:

  • Synchronous. The producer sends a message and waits for a response. With synchronous sending, we can guarantee that the order in which messages are sent is followed.

  • Asynchronous. The message is sent, but the producer does not wait for confirmation of delivery and continues sending subsequent messages. Thus, the sending order may be disrupted.

By default, many use asynchronous sending.

Another producer setting is the confirmation requirement, which controls the balance between speed and reliability:

  • 0 – Kafka cluster doesn't have to tell us whether it received the message. This is the fastest option.

  • 1 – Leader – the server in the cluster we are contacting – must acknowledge receipt.

  • All – the Leader sends the confirmation only after all (in-sync) replicas have received the message. This is needed in case the Leader suddenly dies. In practice this is close to a 100% guarantee that the message will be received and not lost, but it is also the slowest way to send.

The choice of confirmation mode (or the lack of one) is another trade-off.
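Here is what the send mode and the acks setting might look like with the standard Java client; the broker address, topic name and messages are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSendModes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // acks: "0" – no confirmation, "1" – leader only, "all" – leader plus in-sync replicas
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "Synchronous" send: block on the returned Future until the broker acknowledges,
            // which preserves the sending order.
            producer.send(new ProducerRecord<>("payments", "debit 100 RUB")).get();

            // Asynchronous send: continue immediately, learn the result later in a callback.
            producer.send(new ProducerRecord<>("payments", "debit 200 RUB"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // delivery failed or timed out
                        }
                    });
        }
    }
}
```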

To send messages, the following data is required (a configuration sketch follows the list):

  • Addresses of all Kafka cluster servers, as they are specified in the config. If one replica goes down, you can connect to another and work with it.

  • Authentication scheme in this Kafka cluster (login/password, certificate or something else). It is possible to configure it so that sending to Kafka is anonymous, but serious guys usually do authentication.

  • Topic name. There are often conventions for topic names. Changing these names is difficult and painful, so it is better to come up with normal ones right away, so that you don't have to rename them later.

  • Maximum message size. Usually it is 1 MB. It can be increased, but it seems that it is better not to try to play with it. For large messages, you can use S3, throwing only a link to the file into Kafka. This is what happens on my current project – we have huge gigabyte JSONs compressed using a columnar format and put into S3. A link to these compressed files is added to the Kafka message.
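A sketch of how these items map onto producer configuration in the Java client; the addresses, credentials and limits are placeholders, and the authentication block assumes SASL/SCRAM over TLS simply as one possible scheme:

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.config.SaslConfigs;

// What a producer needs to know before it can send anything (all values are placeholders).
public class ProducerConnectionConfig {

    static Properties build() {
        Properties props = new Properties();
        // Addresses of the cluster servers – several, so one dead broker is not fatal.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092");
        // Authentication scheme – here login/password via SASL/SCRAM over TLS, as an example.
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-256");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"payments-app\" password=\"secret\";");
        // Maximum size of a single produce request (roughly the default ~1 MB).
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1_048_576);
        // The topic name itself is not part of the connection config –
        // it goes into every ProducerRecord when sending.
        return props;
    }
}
```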

Inside the server there is a topic, which is divided into partitions. In essence, partitions are our queues. In them, messages are stored as a log, which you can move back and forth – read sequentially or jump to the right place to read something again.

The number of partitions is configurable. You can create a topic with one partition, but then consumers will only be able to read it in a single thread (a single instance). Different services can read one partition in parallel only if they are in different consumer groups. Within one group, several instances of the same service cannot read from one partition simultaneously, i.e. scaling is capped. If we need load balancing on the consumer side, we need to create the appropriate number of partitions.

The number of partitions of an existing topic can be increased very easily. But it is most likely impossible to reduce this number. It is almost easier to abandon one topic and create another one if suddenly the number of partitions needs to be reduced.
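Increasing the partition count can be done through the admin API; a minimal sketch, with the broker address and topic name as placeholders:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

// Growing the partition count of an existing topic (the reverse – shrinking – is not supported).
public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Raise the hypothetical "payments" topic to 12 partitions.
            admin.createPartitions(Map.of("payments", NewPartitions.increaseTo(12)))
                 .all()
                 .get();
        }
    }
}
```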

Sometimes the order of reading messages is important – they must be read strictly in the same order in which they were sent. In this case, they must be sent to one partition using the correct key. User ID can be used as a key. It is important to understand all these subtleties at the stage of system design.
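A minimal sketch of sending with a key (the topic name and key are hypothetical): records with the same key are hashed to the same partition, which is what preserves per-user ordering.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Routing by key: messages with the same key always land in the same partition.
public class KeyedSend {
    static void sendUserEvent(KafkaProducer<String, String> producer,
                              String userId, String event) {
        // userId is the key – Kafka hashes it to choose the partition,
        // so all of this user's events keep their relative order.
        producer.send(new ProducerRecord<>("payments", userId, event));
    }
}
```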

I mentioned above that Kafka writes messages to an infinite log. In theory, we can never delete anything from this log. But it will grow and grow, so it is worth setting up a policy for deleting messages that have already been processed. Unfortunately, Kafka logic does not allow us to “know” whether a message has been read or not (Kafka stores an offset in the log, but this does not mean that all previous messages have been read, since the offset can be shifted manually).

For each topic, you can set a maximum size or a time period after which older messages can be deleted. And this setting is a pain in the ass, because a message could have been written but not yet read, even if quite a lot of time has passed. It is important to understand what limits we set and whether they are enough. For example, if we set a limit on volume but have a very intensive flow of messages (millions per day), we will hit the topic size limit too quickly, and messages will start being erased before they are even processed. It is better to base the approach here on business logic: how critical is it for the business if data older than a certain age is lost? Is the business ready to pay for storing old messages?
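A sketch of creating a topic with explicit retention limits via the admin API; the topic name, sizes and periods are made-up numbers that just illustrate the two kinds of limits:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

// Topic where messages are deleted after 7 days or once a partition exceeds ~1 GiB,
// whichever happens first.
public class TopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("payments", 6, (short) 3) // name, partitions, replication
                    .configs(Map.of(
                            "retention.ms", "604800000",       // 7 days
                            "retention.bytes", "1073741824")); // ~1 GiB per partition
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```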

There is also an interesting nuance in the message deletion settings related to European projects. They have GDPR, which can lead to the opposite situation – when storing messages with personal data for a long time is bad.

By default, consumers do not receive messages from Kafka via TCP/IP continuously, but once every 5 seconds. This period is configurable. By default, messages are also transmitted in batches of 500. The consumer must also regularly notify Kafka that it is alive. In one of my projects, both receiving messages and this healthcheck were performed in one thread. And we processed the received messages for so long that we did not have time to send a healthcheck – Kafka rebalanced and disabled the supposedly non-working consumer. And there are many similar nuances here, including those related to the offset commit in the partition. I recommend looking into it.
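A basic consumer loop with the settings mentioned above; the group ID, topic and broker address are placeholders, and the numbers simply mirror the defaults being discussed:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// If processing a batch takes longer than max.poll.interval.ms, the group coordinator
// decides this consumer is dead and rebalances its partitions away.
public class PaymentsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-service");       // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);         // batch size (the default)
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000); // allowed processing time
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);     // commit offsets explicitly

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                // poll() must be called regularly; take too long and the group rebalances.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // keep this fast, or lower max.poll.records
                }
                consumer.commitSync(); // commit the offset only after the batch is processed
            }
        }
    }

    static void process(String message) { /* business logic */ }
}
```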

Cautions

If Kafka is introduced to a project, it's best to set aside time to learn more about it. Kafka can cause a lot of crazy things to happen. If there's something that can go wrong, it will go wrong sooner or later. So you need to understand what you're doing.

The advantage and disadvantage of Kafka is its flexibility. You can set up a lot of custom scenarios. But it is important to understand that everything here is like in Linux – you have to set it up. You can't just take it and use it for your own purposes, because there is a high probability that something will work wrong, and at first you will not even suspect it. You will find out later, through pain, mistakes and losses.

It's one thing to set up log collection via Kafka. In that case we are collecting a lot of data that we are not afraid of losing. Payments via Kafka in fintech are a different case entirely: messages cannot be lost at all, you cannot debit a payment twice or miss a required time window. These are completely different cases, and within them Kafka must be approached differently.

Kafka has a lot of settings that need to be tweaked. You need to get familiar with them. Or better yet, read a couple of books, because a hypothetical “Hello world” can be cobbled together in an hour. But if you need to understand how it works, you’ll have to spend at least a couple of weeks on initial training.

The article was written hot on the heels of the training on distributed transactions from Dmitry Litvin (@Captain_Jack).

P.S. We publish our articles on several RuNet platforms. Subscribe to our page on VK or to our Telegram channel to stay up to date with all publications and other news from Maxilect.
