We recently hosted the webinar "Fully Screwed Up: Fails with Apache Kafka". In it, Vsevolod Sevostyanov, Engineering Manager at HelloFresh, shared some fails from his own practice and explained how to walk the thin ice of Kafka skillfully and level up your backend. For those who missed it, or who prefer reading to watching, we have prepared a text version.
Fail #0: Kafka is harder than it looks
Apache Kafka is presented as a super simple tool, which is why many engineers never fully understand how it works. For example, there is a misconception that Kafka is so easy to set up that it could never cause problems. In reality, there is a lot of computer science under the hood.
There are topics: containers that store messages united by a subject or format. To scale within Kafka, topics are divided into partitions, the units of message storage. And a common problem is that messages get distributed essentially at random: some end up in Partition 1, some in Partition 2, some in Partition 3, and so on. As a result, services attached to Kafka start reading data from topics in the wrong order.
For Kafka, partitioning is normal. But I have often seen teams go to great lengths to avoid it, creating just one partition and configuring everything the wrong way. There is an easier path: use a partition key, which lets you keep messages that depend on each other in the order in which you want them delivered.
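As a sketch of the idea (Kafka's real default partitioner uses murmur2 hashing; Python's built-in `hash()` stands in for it here), a partition key maps every message with the same key to the same partition:

```python
# Sketch of Kafka's partition-key idea. The real default partitioner uses
# murmur2 hashing; Python's built-in hash() stands in for it here.
def partition_for(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions

# The two events for order-42 depend on each other; the shared key guarantees
# they land in the same partition, and within a partition order is preserved.
events = [("order-42", "created"), ("order-7", "created"), ("order-42", "paid")]
partitions = {}
for key, payload in events:
    partitions.setdefault(partition_for(key, 3), []).append((key, payload))
```

Within one partition Kafka preserves append order, so a consumer reading it will always see "created" before "paid" for order-42, no matter which partition the key hashes to.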
Fail #1: We created 150 topics, but we had to replace Kafka with Rabbit
I work for a meal-kit company: a service that delivers the ingredients and recipes for home cooking. We have a tool for creating recipes: recipes are authored in it, the data is written to Kafka and then sent onward. The tool should not have to know who consumes this data.
Problem: Kafka is listened to by the planner that composes the menu for the week; the Data Warehouse, which is responsible for data processing; the Predictor, which estimates how many products we will have this week and how many we need to order for the next; and a bunch of other services. They all need different data. For example, the menu planner wants only finished recipes, while the Predictor wants recipes that have been ordered at least once.
Initially, we tried to solve the problem by creating many different topics, one for each consumer. This naturally led to Kafka grinding to a halt. Kafka cannot look into a message's content to figure out where and to which consumer it should be delivered. Everything works on a subscription model: you say, "I want to receive a newspaper every day," and every day you get the whole newspaper with all the news, not just the stories that interest you.
So for each topic we had exactly one consumer, and the logic that should run on the consumer side (each consumer filters out what interests it) was performed on the Recipe Dev side instead. This was a complete mess: what we had was not a decoupled system, but a system that simply knows about the existence of all of its components and shuttles messages between them, using Kafka as a data bus or a persistence mechanism. That is quite expensive and inefficient.
Solution: we replaced Kafka with RabbitMQ, which can determine from a routing key which channel a given message should go to and who should receive it. Consumers, in turn, subscribe to updates using that key.
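A toy model of what RabbitMQ gives out of the box (the class and names here are invented for illustration): a direct exchange delivers a message only to the queues bound with its routing key, so each consumer sees only what it subscribed to:

```python
# Toy model of a RabbitMQ direct exchange (class and names invented for
# illustration): a message is delivered only to queues bound to its routing key.
class DirectExchange:
    def __init__(self):
        self.bindings = {}  # routing key -> list of bound queues

    def bind(self, routing_key, queue):
        self.bindings.setdefault(routing_key, []).append(queue)

    def publish(self, routing_key, message):
        for queue in self.bindings.get(routing_key, []):
            queue.append(message)

planner, predictor = [], []
exchange = DirectExchange()
exchange.bind("recipe.ready", planner)      # the planner wants finished recipes
exchange.bind("recipe.ordered", predictor)  # the Predictor wants ordered ones
exchange.publish("recipe.ready", {"recipe_id": 1})    # only the planner sees this
exchange.publish("recipe.ordered", {"recipe_id": 2})  # only the Predictor sees this
```

With Kafka, this per-consumer filtering has to live in the consumers themselves; with RabbitMQ, the broker does the routing.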
We were also under pressure from the infrastructure side: the system was replicated many times, with a multi-region deployment and about 30 to 40 instances. At that scale, RabbitMQ came out cheaper than Kafka.
Fail #2: Leader not available
Problem: there is a consumer that consumes data. You happily send data to Kafka, then notice that the consumer is not consuming anything. Kafka is running, the consumer is running, but nothing is happening.
One of the most common reasons for this is consumer rebalancing. Kafka uses the mechanism we already discussed above: topics are divided into partitions. Say you add the first consumer. There is the notion of a "group leader"; in this case it listens and sequentially assigns all partitions of the given topic to that consumer:
If a second consumer is added for this topic, then P3 and P4 will be attached to it:
This process takes some time, because the second consumer must inform the group leader that it is available; the group leader then processes this information and comes back with an answer about which partitions the consumer should consume. The point is that if your software is written incorrectly, or if, say, you have Java or Spring Boot under the hood, which take a long time to start, such rebalancing will happen regularly.
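The assignment described above can be sketched roughly like this (a simplified range-style assignor; Kafka's real assignment strategies are more involved, but the shape of the result is the same):

```python
# Simplified range-style partition assignment: partitions are split into
# contiguous ranges across the sorted list of live consumers.
def assign(partitions, consumers):
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    result, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = per + (1 if i < extra else 0)
        result[consumer] = partitions[start:start + count]
        start += count
    return result

# One consumer takes everything; adding a second moves P3 and P4 to it.
assign(["P1", "P2", "P3", "P4"], ["consumer-1"])
# -> {'consumer-1': ['P1', 'P2', 'P3', 'P4']}
assign(["P1", "P2", "P3", "P4"], ["consumer-1", "consumer-2"])
# -> {'consumer-1': ['P1', 'P2'], 'consumer-2': ['P3', 'P4']}
```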
What happened in our case: the first consumer connected and started listening to the topic. Then the second and third consumers started, sent their rebalancing signals, and connected. But at some point the third consumer announced that it was dropping off. Except it only said so: in reality it had not gone anywhere and kept on working.
It seemed mystical until we discovered the wonderful world of the max.poll.interval.ms and session.timeout.ms parameters.
The thing is that Kafka is a pull-based system: it does not push anything to anyone. And to distribute a topic's partitions evenly across consumers, it needs to know how many consumers there are in total. For this, all consumers send heartbeats to Kafka: periodic messages saying "I exist and am ready to receive data." If no heartbeat arrives within session.timeout.ms, the consumer is considered dead; separately, the gap between two poll() calls must not exceed max.poll.interval.ms.
It's pretty obvious that we shouldn't exceed the session timeout. What is far less obvious is that a consumer is considered dead not only when its heartbeats stop, but also when it takes a message, starts processing it, and fails to finish (and poll again) within max.poll.interval.ms.
You need to be able to work with the values of max.poll.interval.ms and session.timeout.ms. Their defaults are finite: roughly 10 seconds for the session timeout (45 seconds in newer clients) and 5 minutes for max.poll.interval.ms. Exceed the session timeout and Kafka says: "Didn't make it, you're disconnected." The same goes for message processing: if you don't manage to process a batch within the allotted interval, you have a problem.
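For illustration, here is a hypothetical consumer configuration in the style of a Python Kafka client (the broker address and group name are placeholders) showing how these knobs relate to one another:

```python
# Hypothetical consumer settings in the style of a Python Kafka client;
# the address and group name are placeholders. The invariants that matter:
# heartbeat.interval.ms is well under session.timeout.ms, and
# max.poll.interval.ms exceeds the worst-case time to process one poll()'s
# worth of messages.
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "recipe-consumers",         # placeholder group name
    "session.timeout.ms": 10_000,     # no heartbeat for 10 s => consumer is dead
    "heartbeat.interval.ms": 3_000,   # heartbeats sent well within the timeout
    "max.poll.interval.ms": 300_000,  # poll() must be called at least every 5 min
}
```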
Solution (our first attempt): we increased max.poll.interval.ms to 2 hours and session.timeout.ms to 5 hours. The scheme we ended up with: a consumer arrived, loaded the data into itself, and sat on it for about two hours. As a result, the topics were blocked and nothing could be read from them.
We then decided to lower max.poll.interval.ms and leave the session timeout as it was. For this we needed group.instance.id, which lets Kafka, when the first consumer drops off, wait out the session timeout until a consumer with the same id comes back up and takes over the partitions.
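A sketch of static membership (group.instance.id is a real consumer setting; the surrounding values are placeholders):

```python
# Sketch of static membership. group.instance.id is a real consumer setting;
# the other values are placeholders. With a stable id, a restart that finishes
# within session.timeout.ms does not trigger a full rebalance: the returning
# instance simply reclaims its old partitions.
static_member_config = {
    "bootstrap.servers": "localhost:9092",     # placeholder
    "group.id": "recipe-consumers",            # placeholder
    "group.instance.id": "recipe-consumer-1",  # must be unique per instance
    "session.timeout.ms": 30_000,  # how long Kafka waits for this member to return
}
```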
group.instance.id is a cool thing, but it must be used carefully, otherwise you can end up in a situation where a consumer sits in a timeout for three hours, nobody reacts, and one of the partitions idles while all the others move forward.
Fail #3: TTL
Problem: our second consumer dropped off, hung for a certain amount of time, and then started taking data from the topic again. And then we discovered that we had lost some of the recipes.
The reason: the TTL had kicked in. Although Kafka can be used as a database, it is not one. Kafka is a log, where messages are appended one after another and read back in the same sequence.
Kafka is designed for high throughput and persists its messages to disk. From time to time it runs a cleanup operation and deletes the oldest log segments according to the retention policy (this is distinct from log compaction, which keeps only the latest record per key). By default, Kafka's retention, its TTL, is one week. People often forget about this and then discover there is no data older than 7 days in the log.
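For reference, these are the topic-level settings involved (the keys are real Kafka topic configs; the values are illustrative):

```python
# Topic-level settings that control how long records survive. The keys are
# real Kafka topic configs; the values here are illustrative.
topic_config = {
    # 7 days, Kafka's default retention: records older than this are deleted.
    "retention.ms": str(7 * 24 * 60 * 60 * 1000),
    # "delete" drops old segments; "compact" instead keeps the newest
    # record per key, which is a different mechanism entirely.
    "cleanup.policy": "delete",
}
```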
Solution: use max.poll.interval.ms sensibly, set a reasonable session.timeout.ms, and watch consumer health so that consumers never lag behind the retention window. Monitoring is strongly advisable.
A few words about Kafka in the end
The main problem is that Kafka is perceived, and therefore configured, incorrectly. It looks very simple (RabbitMQ, by comparison, is many times more complicated), but inside there is a huge number of knobs, and it is very important to learn how to use them correctly. Do not take this article as a recommendation against working with Kafka. On the contrary, Kafka is a great tool. We have simply tried to give an honest review and talk about the difficulties you may encounter.
For those who want to learn how to work with Kafka and avoid fails
We have a video course, "Apache Kafka for Developers". It will quickly take you to a new level of proficiency with the tool and help you:
reduce time for work tasks with Kafka;
simplify the work with microservices;
understand common design patterns;
make the application more fault-tolerant;
gain experience developing applications that use Kafka.
We also have a course, "Apache Kafka Database". It is aimed at system administrators, but architects and developers will find it useful too. It covers Kafka comprehensively and gives an understanding of the place it occupies in the life of an organization.