Bending pause times to your will with Generational ZGC

Netflix migrated from G1 to Generational ZGC starting with JDK 21 due to the significant benefits of concurrent garbage collection.

Team Spring AIO has prepared a translation of the article in which the streaming service's engineers discuss the expected and unexpected advantages of Generational ZGC.


The latest LTS version of the JDK introduced a generational mode for the Z Garbage Collector (ZGC).

Now more than half of our streaming video services are running on JDK 21 with Generational ZGC, so we want to share our experience and results. If you are interested in how Netflix uses Java, we recommend watching Paul Bakker's talk How Netflix Really Uses Java.

Reducing the tail of the latency distribution

In our services powered by gRPC and the DGS Framework, garbage collection pauses are a significant source of tail latency. This is especially visible to gRPC clients and servers, where request cancellations caused by timeouts interact with reliability features such as retries, hedging, and fallbacks. Each of these errors is a canceled request that results in a retry, so reducing pauses also reduces overall service traffic:

Errors per second. Previous week in white, current cancellation rate in purple. ZGC was brought online on the service cluster on November 16th.

Removing this pause “noise” also lets us identify the real sources of latency along the request path that would otherwise be hidden in it, since pause spikes can be quite significant:

Peak GC pauses by reason, for the same service cluster as above. Yes, ZGC-related pauses are indeed typically less than a millisecond long.

Comment from the editors of Spring AIO

The Netflix team ran into such pauses because G1 performs compaction rather than concurrent sweeping, as the CMS (Concurrent Mark Sweep) collector did. Compaction avoids the severe heap fragmentation that plagues CMS after several collections. However, G1 has a different problem: compaction happens during a Stop-the-World (STW) pause. This problem is solved in more modern collectors such as ZGC and Shenandoah, which compact concurrently with the application.
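
For context, here is a minimal sketch of how each of these collectors is selected on the JVM command line; app.jar is a placeholder. Note that CMS was removed in JDK 14, and on JDK 21 Generational ZGC is opt-in via an extra flag (it became ZGC's default mode in JDK 23):

java -XX:+UseG1GC -jar app.jar                     # G1, the JDK default
java -XX:+UseZGC -XX:+ZGenerational -jar app.jar   # Generational ZGC on JDK 21
java -XX:+UseShenandoahGC -jar app.jar             # Shenandoah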

Efficiency

While initial test results for ZGC were promising, we expected that adopting it would require tradeoffs: slightly lower application throughput due to load and store barriers, work performed in thread-local handshakes, and the garbage collector competing with the application for resources. We considered this an acceptable price for eliminating pauses, whose absence would provide more benefit than the potential performance penalty.

Comment from the editors of Spring AIO

ZGC competes with the application for computing resources, which reduces throughput. In other words, while ZGC is collecting, application threads may get only about 70% of the CPU time, with the remainder consumed by ZGC's own threads. Generational ZGC is better in this respect, but not radically so. You can read more here: https://inside.java/2023/11/28/gen-zgc-explainer/

In fact, we found that for our services and architecture no such tradeoff occurred. Under the same CPU load, ZGC improved both average and P99 (99th percentile) latency, while CPU utilization stayed the same as with G1 or improved.

Consistency in request rate, request structure, response time, and allocation rate across our services certainly helps ZGC. However, we have also found that ZGC handles less predictable workloads well (with exceptions, which we discuss below).

Easy to operate

Service owners often contact us about excessively long pauses and ask for help tuning them away. We have several frameworks that periodically refresh large amounts of data in the application's memory to avoid external service calls for efficiency. These periodic refreshes can catch G1 off guard, producing pauses significantly longer than the default pause-time target.
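
As an illustration only, here is a hypothetical sketch (not Netflix's actual framework code) of the kind of periodic in-memory refresh pattern described above; every refresh materializes a large batch of long-lived objects at once:

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a scheduled task rebuilds a large dataset and swaps
// it in atomically. Each refresh promotes a burst of long-lived objects,
// the pattern that caught G1's pause-time heuristics off guard.
public class MetadataCache {
    private final AtomicReference<Map<String, String>> snapshot =
            new AtomicReference<>(Map.of());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(
                () -> snapshot.set(loadFromUpstream()), 0, 15, TimeUnit.MINUTES);
    }

    public Map<String, String> current() {
        return snapshot.get();
    }

    private Map<String, String> loadFromUpstream() {
        // Placeholder: a real refresh would materialize hundreds of MB here.
        return Map.of("example-key", "example-value");
    }
}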

The main reason we had not adopted ZGC previously was the large amount of such long-lived data in the application's memory. In the worst case we evaluated, non-generational ZGC used 36% more CPU than G1 under the same workload. With the switch to Generational ZGC, the comparison flipped to nearly a 10% advantage for ZGC.

Half of all services needed for video streaming use our Hollow library for in-memory metadata. Eliminating the pause issues allowed us to abandon memory-saving techniques such as array pooling, which freed up hundreds of megabytes of memory.

Ease of use also comes from ZGC's default heuristics and settings: no explicit tuning was required to achieve these results. Allocation stalls are rare, usually coincide with sharp spikes in allocation rate, and are shorter than the average pauses we observed with G1.
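
If you want to check this on your own service, ZGC reports these as “Allocation Stall” events in the unified GC log. A minimal sketch, assuming a JDK 21 service (gc.log and app.jar are placeholders):

java -Xlog:gc*:file=gc.log:time,uptime,level,tags \
     -XX:+UseZGC -XX:+ZGenerational -jar app.jar
grep "Allocation Stall" gc.log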

Overheads

We expected that losing compressed references on heaps smaller than 32 GB, because colored pointers require 64-bit object pointers, would be an important factor when choosing a garbage collector.

However, it turns out that while this matters for stop-the-world collectors, it does not for ZGC. Even on small heaps, the increased allocation rate is offset by ZGC's efficiency and improved performance. Special thanks to Erik Österlund of Oracle for explaining the less obvious benefits of colored pointers for concurrent garbage collectors; thanks to him, we appreciate ZGC even more than we expected to.

In most cases, ZGC also consistently provides more available memory to the application.

Used and available heap space after each GC cycle, for the same service cluster shown above

ZGC has a fixed overhead of 3% of the heap size (for a 16 GB heap, roughly an extra 500 MB), so it requires more system memory than G1. Still, with a few exceptions, we did not need to lower the maximum heap size to make room; the exceptions were services with high system memory requirements.

With ZGC, reference processing happens only during major collections. We paid particular attention to the release of direct byte buffers, but have not seen any impact so far. This difference in reference handling did cause a performance issue with JSON thread dump support, but that was an unusual situation: a framework was accidentally creating an unused ExecutorService instance for each request.
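
To illustrate that last pitfall, here is a hypothetical reconstruction of the anti-pattern (the class and method names are ours, not the framework's):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RequestHandler {
    public String handle(String request) {
        // BUG: a fresh ExecutorService per request, never used and never shut
        // down. In recent JDKs its cleanup is driven by a Cleaner (a phantom
        // reference), and since ZGC processes references only during major
        // collections, unused instances pile up between major cycles.
        ExecutorService leaked = Executors.newSingleThreadExecutor();
        return "handled: " + request;
    }
    // Fix: share one executor per handler, or shut a per-request executor
    // down explicitly in a try/finally block.
}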

Transparent Huge Pages

Even if you don't use ZGC, you should probably use Huge Pages, and the most convenient way to do this is to use Transparent Huge Pages.

ZGC uses shared memory for the heap, and many Linux distributions set the shmem_enabled parameter to never by default, which silently prevents ZGC from using huge pages via the -XX:+UseTransparentHugePages flag.

In one of our services, changing only the shmem_enabled parameter from “never” to “advise” significantly reduced CPU utilization:

Deployment with transition from 4K to 2M page sizes. The gap in the graph is the immutable deployment process, which temporarily doubles the cluster capacity.

Our standard configuration includes:

  • Setting the minimum and maximum heap size to the same values.

  • Setting the flags -XX:+UseTransparentHugePages and -XX:+AlwaysPreTouch.

  • Using the following configuration for Transparent Huge Pages:

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo advise | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
echo 1 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
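
Putting this configuration together, a launch might look like the following sketch (the 8 GB heap size and app.jar are placeholders, not our production values):

java -Xms8g -Xmx8g \
     -XX:+UseZGC -XX:+ZGenerational \
     -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch \
     -jar app.jar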

What tasks are not suitable for ZGC?

There is no perfect garbage collector. Each one balances collection throughput, application latency, and resource usage depending on its purpose.

The workloads where G1 outperformed ZGC tended to be throughput-oriented, with spiky allocation patterns and long-running tasks that hold objects for extended periods.

One example was a service with sharp bursts of allocation and a large number of long-lived objects, a scenario that suited G1's pause-time goal and old-region collection strategy. G1 could avoid unnecessary work in its collection cycles, while ZGC handled this case less well.

The move to ZGC by default gave application owners an opportunity to rethink their choice of garbage collector. For batch/precompute jobs that had used G1 by default, a throughput-oriented multi-threaded collector (such as the parallel collector) can give better results. In one large precompute workload, we saw a 6-8% improvement in throughput, shaving an hour off the run time compared to G1.

Try it yourself!

If we hadn't re-evaluated our assumptions and expectations, we might have missed one of the most significant changes we've made to our setup in the last ten years. We encourage you to try Generational ZGC for yourself. It may turn out to be as pleasant a surprise for you as it was for us.

