Achieving high throughput without sacrificing latency

Latency and Throughput

When discussing performance, we often hear the terms latency and throughput used to describe the characteristics of a software component.
We can give the following interpretation to these terms:

Latency is a measure of the time it takes to complete one action. For example, this may be the time required to react to a change in the price of a financial instrument, where that change may affect a decision to buy or sell. Or it could be the time it takes for a component that controls some external device to respond to a change in the state of that device (for example, a change in temperature reported by a thermostat).

The concept of latency also applies outside IT. Imagine you are visiting your favorite fast food restaurant. Here, latency is the time between placing your order, paying for it, and receiving your food. Obviously, the lower the latency, the better.


Throughput is a measure of how much work can be done in a given period of time, for example how many transactions can be processed per second. We have all seen systems that struggle to cope with high request loads, such as websites that become unresponsive when the number of requests suddenly spikes: at the start or end of the business day, or when tickets for a very popular concert go on sale.

In the context of a fast food restaurant, throughput measures the number of customers served in a given period of time, and obviously the more customers the better.
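Returning to software, the minimal sketch below makes the two definitions concrete: it times a batch of requests and reports the average latency per request and the overall throughput. The handleRequest method is a hypothetical stand-in for whatever work the component actually does.

```java
import java.util.concurrent.TimeUnit;

public class LatencyThroughputDemo {

    // Hypothetical stand-in for the work being measured; in a real system this
    // would be handling a request, reacting to a price change, etc.
    private static void handleRequest() {
        // Simulate roughly 100 microseconds of work with a busy wait.
        long end = System.nanoTime() + TimeUnit.MICROSECONDS.toNanos(100);
        while (System.nanoTime() < end) {
            // spin
        }
    }

    public static void main(String[] args) {
        final int requests = 10_000;
        long totalNanos = 0;

        long runStart = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            long start = System.nanoTime();
            handleRequest();
            totalNanos += System.nanoTime() - start;   // latency of one request
        }
        long runNanos = System.nanoTime() - runStart;

        System.out.printf("average latency: %d us%n",
                TimeUnit.NANOSECONDS.toMicros(totalNanos / requests));
        System.out.printf("throughput: %.0f requests/s%n",
                requests / (runNanos / 1_000_000_000.0));
    }
}
```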


Scaling to improve throughput

The usual proposal for increasing the throughput of a component is to parallelize the system so that more than one task can be processed at the same time. Within a single component instance, this can be done by introducing multiple threads of execution, each of which handles one request.

The argument is that the processing of a single task is likely to involve several pauses while the task waits for something to happen (perhaps a read or write operation, or an interaction with another component). When one thread is blocked in this way, another thread can carry on with a different task.
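As a simple illustration, the sketch below uses a fixed pool of worker threads fed from a shared queue; the request handling is hypothetical and simulated with a short sleep, so while one thread is paused, the other workers keep draining the queue.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiThreadedComponent {

    public static void main(String[] args) throws InterruptedException {
        // A pool of worker threads; the executor's internal queue acts as the
        // shared work queue that incoming requests are placed on.
        ExecutorService workers = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 20; i++) {
            final int requestId = i;
            workers.submit(() -> {
                try {
                    // Simulate a pause while the task waits on I/O or another
                    // component; this thread is blocked, but the other workers
                    // carry on with queued requests.
                    Thread.sleep(50);
                    System.out.println("handled request " + requestId
                            + " on " + Thread.currentThread().getName());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        workers.shutdown();
        workers.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```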


We can also deploy multiple instances of a component across a cluster of different systems, an approach that is often cited as a justification for a cloud-native application architecture.

Modern cloud infrastructure allows you to dynamically create new instances of components on different system platforms when load growth requires it, and turn off these instances when the peak load has passed to minimize additional costs.
The general term for this approach is “scaling” or “horizontal scaling”.

Scaling issues

Scaling seems an attractive way to increase system throughput. However, its benefits come at a price, and it is important to understand what that price is. The basic approach is to make several independent instances of a component appear from the outside as if only one instance exists, sometimes called a "single system image" approach. While this may seem attractive, creating and managing such an abstraction introduces complexity that can seriously slow down individual tasks.

1. Some form of routing is needed to distribute tasks among the available instances of the component. Routing can be implicit, when a single instance uses multiple threads, or explicit, in the cluster approach. The implicit approach uses some form of shared work queue to collect incoming requests; as the threads running the component become free, they pick up the next request from the queue.
In the cluster approach, an explicitly implemented router component decides which instance each request should be sent to. The decision is made according to some algorithm; the simplest just cycles through the available instances in turn (usually called "round-robin"), but other options are possible, for example routing to the instance expected to process the request fastest, or always routing requests from the same source to the same instance. A minimal round-robin sketch appears after this list.
All of these options come with their own challenges. Access to the shared work queue must be synchronized, which can increase request latency. A separate router component allows more sophisticated decisions about which instance should receive a request, but making that decision itself adds latency, not to mention introducing an additional network hop before the message is delivered.

2. It is likely that state will need to be shared between instances. The outcome of a request may depend on data that is held in the component and that accumulates over time. An example would be a shopping cart, where requests add or remove items before checkout.

In a single instance, this shared state is held in memory, visible to all of the threads executing within the component. However, access to this shared memory must be synchronized to keep the data consistent.
With the cluster approach, there are different ways to address this. State specific to a single client can be kept in one particular instance, but then every request that needs that state must be directed to that instance. This is essentially the HTTP "session" model, and implementing it requires support at the router level.
Alternatively, a separate database component can hold the information required by multiple instances. Any instance can then process a given request, but at the cost of reading and writing that data in the database as part of handling the request. In extreme cases, the database can become a point of contention between instances, slowing request processing further, and if a modern distributed database is used, we may also run into consistency issues between its instances.

All this introduces additional complexity into the request processing cycle.
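As a concrete illustration of the explicit routing described in point 1, here is a minimal, in-process sketch of a round-robin router. The handler instances are hypothetical placeholders; in a real cluster the router would dispatch each request over the network to a separate process.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Distributes requests across component instances in strict rotation.
// "Instance" here is just a request handler; in a cluster it would be a
// network endpoint reached via an extra hop.
public class RoundRobinRouter<R> {

    private final List<Consumer<R>> instances;
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinRouter(List<Consumer<R>> instances) {
        this.instances = List.copyOf(instances);
    }

    public void route(R request) {
        // Even this simple policy needs an atomic counter shared by all callers.
        int index = (int) (counter.getAndIncrement() % instances.size());
        instances.get(index).accept(request);
    }

    public static void main(String[] args) {
        RoundRobinRouter<String> router = new RoundRobinRouter<>(List.of(
                r -> System.out.println("instance A handled " + r),
                r -> System.out.println("instance B handled " + r),
                r -> System.out.println("instance C handled " + r)));

        for (int i = 0; i < 6; i++) {
            router.route("request-" + i);
        }
    }
}
```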


Increase throughput by reducing latency

In many cases, the throughput of a component can be increased without scaling at all. The approach rests on a simple observation: if we can reduce the time needed to process a single request, then we can process more requests in a given amount of time.


Low-latency Java software, such as that produced by Chronicle Software, is built using well-established best practices that minimize, and ideally eliminate, the pauses that often interrupt normal Java execution.

The most obvious of these are the "stop the world" pauses that occur when garbage collection halts all application threads. A great deal of work has gone into minimizing pauses of this type, with considerable success, yet they can still occur non-deterministically. Low-latency Java techniques aim to eliminate such pauses entirely by keeping the number of objects created on the Java heap low enough that even minor garbage collections are not triggered. This topic is discussed in more detail in this article.
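The sketch below illustrates the general idea of keeping allocation low by reusing a mutable buffer on the hot path rather than creating a new object per request. The price-formatting scenario is purely illustrative and not taken from the article referenced above.

```java
// Reuse a mutable object on the hot path instead of allocating a new one per
// request, so the heap grows slowly enough that garbage collections stay rare.
public class ReusedBufferExample {

    // Allocated once and reused; not thread-safe, so confine it to one thread
    // (or wrap it in a ThreadLocal in a multi-threaded component).
    private final StringBuilder buffer = new StringBuilder(256);

    public String formatPrice(String symbol, long priceInCents) {
        buffer.setLength(0);                    // reset instead of allocating
        buffer.append(symbol)
              .append(' ')
              .append(priceInCents / 100)
              .append('.');
        long cents = priceInCents % 100;
        if (cents < 10) {
            buffer.append('0');
        }
        buffer.append(cents);
        return buffer.toString();               // one small allocation remains
    }

    public static void main(String[] args) {
        ReusedBufferExample example = new ReusedBufferExample();
        System.out.println(example.formatPrice("AAPL", 18_742));  // AAPL 187.42
    }
}
```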

Perhaps the most significant drawback of introducing multiple threads into a Java component is blocking synchronization of access to shared resources, in particular shared memory. Locks of various types protect mutable state from concurrent access that could corrupt it, but most are based on the idea that if the lock cannot be acquired, the requesting thread must block until the lock becomes available.
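A typical example of this traditional, blocking style is a counter guarded by a ReentrantLock: a thread that loses the race for the lock is parked by the scheduler, at the cost of a context switch, until the lock is released.

```java
import java.util.concurrent.locks.ReentrantLock;

// Classic blocking synchronization: a thread that cannot acquire the lock is
// descheduled by the operating system until the lock becomes available.
public class BlockingCounter {

    private final ReentrantLock lock = new ReentrantLock();
    private long count;

    public void increment() {
        lock.lock();          // may block and trigger a context switch
        try {
            count++;
        } finally {
            lock.unlock();
        }
    }

    public long get() {
        lock.lock();
        try {
            return count;
        } finally {
            lock.unlock();
        }
    }
}
```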

The advent of multi-core processor architectures now allows a thread to spin, repeatedly trying to acquire a lock, without being context-switched off its core as happens with blocking synchronization. Perhaps surprisingly, synchronizing in this non-blocking way is noticeably faster than the traditional approach, since the per-core hardware caches do not need to be flushed and reloaded after a context switch.
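For comparison, here is a minimal spin lock built on an atomic compare-and-set. Instead of parking, a waiting thread simply retries, hinting to the CPU via Thread.onSpinWait, which avoids the context switch at the cost of burning cycles while it waits. This is a generic sketch of the technique, not Chronicle's implementation, and it is only appropriate when the critical section is very short.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// A simple spin lock: the thread stays on its core and keeps its caches warm
// while it retries the compare-and-set until it wins.
public class SpinLockCounter {

    private final AtomicBoolean locked = new AtomicBoolean(false);
    private long count;

    public void increment() {
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait();   // hint to the CPU that we are busy-waiting
        }
        try {
            count++;
        } finally {
            locked.set(false);
        }
    }

    public long get() {
        // Reads also go through the lock so updates to count are visible.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait();
        }
        try {
            return count;
        } finally {
            locked.set(false);
        }
    }
}
```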

Importantly, the net effect of these approaches is to avoid much of the additional complexity that comes with horizontal scaling.


Conclusion

Low-latency programming techniques are designed to keep the processor core as busy as possible, operating at its maximum capacity and completing work as quickly as possible. This in turn optimizes the throughput of the component.

At Chronicle, we have created a set of products and libraries that implement these ideas and are used by financial institutions around the world. For example, Chronicle Queue is an open-source Java library that supports very high-speed interprocess communication over shared memory. Chronicle Queue supports throughput of millions of messages per second with sub-microsecond latency.
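For illustration, the snippet below shows the shape of a basic Chronicle Queue interaction, based on the library's published examples: one appender writes a text message to a memory-mapped queue directory and a tailer reads it back. The "queue-data" directory name is arbitrary, and exact method names can vary between versions, so treat this as a sketch rather than definitive usage.

```java
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;

public class ChronicleQueueHello {

    public static void main(String[] args) {
        // Messages are persisted to memory-mapped files in this directory,
        // which is how separate processes can share them.
        try (ChronicleQueue queue = ChronicleQueue.singleBuilder("queue-data").build()) {
            ExcerptAppender appender = queue.acquireAppender();
            appender.writeText("Hello, Chronicle Queue");

            ExcerptTailer tailer = queue.createTailer();
            System.out.println(tailer.readText());
        }
    }
}
```

In a real deployment, the writer and reader would typically be separate processes opening the same queue directory, which is what makes the shared-memory transport useful for interprocess communication.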
