Kafka and potatoes

As a systems analyst, I have repeatedly run into the same problem in classes with students, in interviews, and in assessments: people misunderstand the basic principles and the value of message queue services. They don't understand how a queue works or what it is needed for. Before IT I was lucky enough to serve in the army for almost 10 years, so over time an example emerged by itself that has proved helpful even to people far from IT.

So, the input data: a military canteen, and we are the system designers (commanders). Besides us, the task involves a cellar full of potatoes and the need to peel them. For this we have an almost unlimited number of not very smart performers: soldiers who can be entrusted with the work. By assumption, soldiers, like information systems, cannot and should not make decisions themselves; they only do what we have instructed them to do.

Stage 1. Separation of areas of responsibility of services

At the first stage, we assign one soldier to each of two work areas: the first brings potatoes from the cellar, and the second peels what the first brought and throws it into the cooking pot. Let's start the process and see what happens.

If the soldiers were IT specialists, they would of course immediately recognize a typical synchronous interaction between services. And you and I clearly see the problem: the soldier who peels the potatoes is the bottleneck. He processes requests too slowly, so the porter has time for a smoke break while waiting and sometimes even forgets about the request.
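
The synchronous hand-off can be sketched in a few lines. This is a toy model, not real service code: the function names and the time costs (1 unit to carry, 5 units to peel) are assumptions chosen only to make the porter's idle time visible.

```python
def peel(potato: str) -> str:
    """The peeler: the slow downstream service."""
    return f"peeled {potato}"


def carry_and_wait(potatoes: list[str], carry_cost: int, peel_cost: int) -> dict:
    """Synchronous hand-off: the porter carries one potato, then waits for peel().

    Costs are abstract time units (hypothetical numbers), so the result is deterministic.
    """
    clock = 0
    porter_idle = 0
    pot = []
    for potato in potatoes:
        clock += carry_cost            # porter walks to the cellar and back
        pot.append(peel(potato))       # blocking call: the porter stands and waits
        clock += peel_cost
        porter_idle += peel_cost       # all peeling time is idle time for the porter
    return {"total_time": clock, "porter_idle": porter_idle, "pot": pot}


result = carry_and_wait(["potato-1", "potato-2", "potato-3"], carry_cost=1, peel_cost=5)
print(result["total_time"], result["porter_idle"])
```

With these made-up numbers the porter spends most of the total time waiting, which is exactly the smoke break from the story.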

Stage 2. Scaling

Seeing the problem, the first thing that comes to an inexperienced commander's mind is “We need to throw more people at it” (horizontal scaling) and, perhaps, to somehow make the peeler work faster: “Give the peeler a kick and, finally, a sharp knife” (vertical scaling).

We do this, and now the porter becomes the bottleneck. We try to optimize the porter in the same way and give him more resources, but since all potatoes are different and all people work at different paces, the work cannot be balanced. Someone is always standing idle.

In IT terms, we now have several optimized instances of the application running in parallel. A proxy server has also been added: a sergeant-balancer, who tracks which of the peelers is free to take potatoes and also makes sure that rotten ones are not passed on (primary validation of requests falls within the proxy's responsibility). The problem has become less obvious and worker downtime has decreased, but it has not gone away completely. In addition, the system now has many application instances that need to be monitored, fed, and maintained, so monitoring becomes more difficult.
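
A rough sketch of the sergeant-balancer follows. The "hand it to whoever frees up first" rule, the `Peeler` class, and the `rotten` flag are all assumptions made for illustration; a real proxy may balance and validate quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class Peeler:
    name: str
    busy_until: int = 0                       # time at which this peeler is free again
    done: list = field(default_factory=list)  # ids of potatoes this peeler handled


def dispatch(potatoes, peelers, peel_cost=3):
    """Sergeant-balancer: validate each potato, then give it to the least busy peeler."""
    rejected = []
    for arrival_time, potato in enumerate(potatoes):
        if potato.get("rotten"):              # primary validation happens at the proxy
            rejected.append(potato["id"])
            continue
        peeler = min(peelers, key=lambda p: p.busy_until)
        start = max(arrival_time, peeler.busy_until)
        peeler.busy_until = start + peel_cost
        peeler.done.append(potato["id"])
    return rejected


peeler_a, peeler_b = Peeler("A"), Peeler("B")
incoming = [{"id": 1}, {"id": 2, "rotten": True}, {"id": 3}, {"id": 4}]
rejected = dispatch(incoming, [peeler_a, peeler_b])
print(rejected, peeler_a.done, peeler_b.done)
```

Note that the sergeant has to know the state of every peeler at all times, which is the monitoring burden described above.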

Stage 3. Queue service

The smartest of the commanders finds an old rusty basin and places it between the porters and the peelers. The interaction model changes: now the porters do not wait for anything and simply put the potatoes they bring into the basin; a freed-up peeler does not wait for anything either and simply takes new potatoes from the basin. The sergeant turns from a work distributor into a monitoring system whose task is to make sure the system's throughput is enough to peel all the potatoes by lunchtime.

As you have probably guessed, the role of the queue service is played by that same basin. The messages in the queue are potatoes. The porters are now called by the smart word “producers”, and the peelers are “consumers”.
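
The basin model can be tried out with Python's standard library alone. This is a minimal in-memory sketch, not Kafka: `queue.Queue` plays the basin, threads play the soldiers, and the `STOP` sentinel and the numbers of porters and peelers are assumptions for the demo.

```python
import queue
import threading

basin = queue.Queue()            # the basin: a thread-safe FIFO buffer
pot = []                         # peeled potatoes end up here
pot_lock = threading.Lock()
STOP = object()                  # sentinel telling a consumer to go home


def producer(potato_ids):
    for potato_id in potato_ids:
        basin.put(potato_id)     # drop it in the basin and go back, no waiting


def consumer():
    while True:
        potato_id = basin.get()  # take the next potato as soon as one is available
        if potato_id is STOP:
            break
        with pot_lock:
            pot.append(f"peeled-{potato_id}")


consumers = [threading.Thread(target=consumer) for _ in range(3)]
producers = [
    threading.Thread(target=producer, args=(range(start, start + 5),))
    for start in (0, 5)
]
for t in consumers + producers:
    t.start()
for t in producers:
    t.join()                     # all potatoes are now in the basin or already taken
for _ in consumers:
    basin.put(STOP)              # one sentinel per consumer, queued after the potatoes
for t in consumers:
    t.join()

print(len(pot))
```

Neither side calls the other directly; the only thing they share is the basin. One difference from Kafka is worth keeping in mind: `queue.Queue` removes an item when it is taken, whereas Kafka keeps it.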

Now let’s talk in IT language about how we can further optimize the process so that it goes faster:

  • We increase the message size: porters start carrying potatoes from the cellar not one at a time but by the bucket. The buzzword for this is “array”.

  • We give the peeler an electric peeling machine into which he can load a whole bucket, or even three at once. This is “batch processing”.
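
Both optimizations can be sketched together. The bucket size of 3 and the machine capacity of 2 buckets are arbitrary numbers picked for the example, and the helper names are hypothetical.

```python
from itertools import islice


def buckets(potatoes, bucket_size):
    """Producer side: group single potatoes into buckets (one message = one bucket)."""
    it = iter(potatoes)
    while bucket := list(islice(it, bucket_size)):
        yield bucket


def peel_batch(load):
    """Consumer side: the machine peels everything it is loaded with in one run."""
    return [f"peeled-{p}" for p in load]


messages = list(buckets(range(7), bucket_size=3))

BUCKETS_PER_RUN = 2              # how many buckets fit into the machine at once
machine_runs = 0
pot = []
for i in range(0, len(messages), BUCKETS_PER_RUN):
    load = [p for bucket in messages[i:i + BUCKETS_PER_RUN] for p in bucket]
    pot.extend(peel_batch(load))
    machine_runs += 1

print(messages, machine_runs)
```

Seven potatoes travel as three messages and are peeled in two machine runs instead of seven separate hand-offs.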

Stage 4. Risk analysis

What could go wrong:

  • You and I are in the army. If the soldiers leave the potatoes in the basin and go to bed, in the morning they will find an empty basin from which all the potatoes have been stolen. This is called “retention time”. In real life, it is used to keep the basin from overflowing, because processed messages in Kafka do not disappear anywhere; they are simply marked as processed. This lets you, for example, process a message again after an error if necessary.

  • The basin may tip over. To guard against this, the potatoes are divided among several basins. These are called “partitions”. If one basin is overturned, work does not stop while the potatoes are being collected back into it.

  • The basin may run empty or overflow. The number of potatoes waiting in the basin is called “consumer lag”. If it grows too much, the basin may begin to overflow and lose messages (the potatoes fall on the floor; the smart term is “retention bytes”). We need to monitor this and reassign peelers or release porters in advance.
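
The retention and lag ideas above can be illustrated with a toy append-only log for a single basin. This is not the Kafka client API: the `Basin` class and its methods are hypothetical, and a simple message-count cap stands in for the time- and size-based retention limits.

```python
class Basin:
    """Toy append-only log for one partition (an illustration, not Kafka itself)."""

    def __init__(self, retention_messages):
        self.retention_messages = retention_messages  # stand-in for a retention limit
        self.messages = []       # retained messages, oldest first
        self.base_offset = 0     # offset of messages[0]
        self.committed = 0       # next offset the consumer will read

    @property
    def end_offset(self):
        return self.base_offset + len(self.messages)

    def produce(self, potato):
        self.messages.append(potato)
        overflow = len(self.messages) - self.retention_messages
        if overflow > 0:         # retention drops the oldest messages, read or not
            del self.messages[:overflow]
            self.base_offset += overflow

    def consume(self):
        offset = max(self.committed, self.base_offset)  # skip what was already dropped
        if offset >= self.end_offset:
            return None
        potato = self.messages[offset - self.base_offset]
        self.committed = offset + 1   # mark as processed; nothing is deleted here
        return potato

    def lag(self):
        return self.end_offset - self.committed

    def rewind(self, offset):
        self.committed = max(offset, self.base_offset)


basin = Basin(retention_messages=3)
basin.produce("p0")
basin.produce("p1")
first = basin.consume()          # reads p0, which stays in the log
lag_after_first = basin.lag()    # p1 is still waiting
basin.rewind(0)
replayed = basin.consume()       # the same p0 again, e.g. after an error
for potato in ("p2", "p3", "p4"):
    basin.produce(potato)        # pushes the log past the cap: p0 and p1 are dropped
next_potato = basin.consume()    # p1 was lost before anyone peeled it
print(first, replayed, next_potato, basin.lag())
```

The last step is the potato on the floor: the consumer fell behind, retention removed `p1`, and the next read jumps straight to `p2`.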

Stage 5. What are the possibilities?

Some smart commander may demand that potatoes be taken out of the basin selectively, for example, by color.

  • We must remember that the queue service is not a database: whatever is in it is received in the order the queue hands it out.

  • If you really want to, you can set up partitioning into different basins depending on the color of the potatoes.

  • You can arrange for different groups of people to peel potatoes from different basins. This is called a “consumer group”. For example, if blue potatoes are to be fried and white potatoes boiled, different consumer groups can process the messages differently.
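
Partitioning by color and per-group processing might look roughly like this. Everything here is an in-memory illustration under stated assumptions: the explicit `COLOR_TO_BASIN` table is a hypothetical custom partitioning rule, and the `ConsumerGroup` class only imitates the idea that each group keeps its own read position.

```python
from collections import defaultdict

COLOR_TO_BASIN = {"blue": 0, "white": 1}   # hypothetical custom partitioning rule
basins = defaultdict(list)                 # basin number -> ordered potatoes


def produce(potato_id, color):
    """Route each potato to a basin chosen by its color (the message key)."""
    basins[COLOR_TO_BASIN[color]].append((potato_id, color))


class ConsumerGroup:
    """A group with its own read position per basin and its own way of processing."""

    def __init__(self, name, assigned_basins, handler):
        self.name = name
        self.assigned_basins = assigned_basins
        self.handler = handler
        self.offsets = defaultdict(int)    # basin number -> next index to read

    def poll(self):
        results = []
        for b in self.assigned_basins:
            for potato_id, _color in basins[b][self.offsets[b]:]:
                results.append(self.handler(potato_id))
            self.offsets[b] = len(basins[b])   # remember how far this group has read
        return results


for potato_id, color in [(1, "blue"), (2, "white"), (3, "blue")]:
    produce(potato_id, color)

fryers = ConsumerGroup("fryers", [0], lambda pid: f"fried-{pid}")
boilers = ConsumerGroup("boilers", [1], lambda pid: f"boiled-{pid}")
fried = fryers.poll()
boiled = boilers.poll()
print(fried, boiled)
```

The basin still cannot answer "give me only the blue ones"; the selection happens when the potato is put in, by choosing the basin, and each group then simply reads its basin in order.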
