100% CPU utilization: my fault?

The story of a JDK bug, and a related development mistake, that resulted in an abnormal 100% CPU load. What went wrong, what could have been done better, and who was ultimately to blame?

Sirens howl

A couple of years ago, I received a bug report stating that I had caused 100% CPU utilization in a virtual machine in a situation where the processor should have been at rest. That made me curious, because I try hard to avoid patterns like while (true) and for (;;).

The support engineer who wrote the report attached proof:

$ top
  PID USER      PR  NI    VIRT    RES  %CPU  %MEM     TIME+ S COMMAND
    1 root      20   0 1832.5m  20.6m 100.7   1.0   4:11.04 S java Main
   25 root      20   0    1.6m   1.1m   0.0   0.1   0:00.04 S sh
   38 root      20   0    1.9m   1.2m   0.0   0.1   0:00.00 R  `- top

Indeed, the JVM was using 100% of the CPU. But how did he trace it back to me?

How was I found?

This bug was tracked down by one of our software developers. First, using

top -H -p ‹pid›

with the pid taken from the previous top output, he obtained the list of thread IDs belonging to the Java process:

$ top -H -p 1
  PID USER      PR  NI    VIRT    RES  %CPU  %MEM     TIME+ S COMMAND
    1 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.02 S java Main
    7 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.32 S  `- java Main
    8 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.00 S  `- java Main
    9 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.00 S  `- java Main
   10 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.01 S  `- java Main
   11 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.00 S  `- java Main
   12 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.00 S  `- java Main
   13 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.00 S  `- java Main
   14 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.03 S  `- java Main
   15 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.05 S  `- java Main
   16 root      20   0 1832.5m  20.6m   0.0   1.0   0:00.00 S  `- java Main
   17 root      20   0 1832.5m  20.6m   0.7   1.0   0:00.42 S  `- java Main
   18 root      20   0 1832.5m  20.6m  99.9   1.0   3:08.69 R  `- java Main

The thread with ID 18 was consuming 100% of the CPU.

By converting this ID to hexadecimal, he got 12:

$ printf "%x\n" 18
12

He then dumped the threads to stdout with

kill -s QUIT ‹pid›

and searched the dump for the line containing nid=0x12 to reach that thread's stack trace.

$ kill -s QUIT 1
...
"joseph-mate-pool-1-thread-1" #8 prio=5 os_prio=0 tid=0x0000558708d94000 nid=0x12 runnable [0x00007f23e77fd000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.poll(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
...

Damn it! That is exactly my thread. It's a good thing I had supplied a thread factory that names its threads, to make debugging easier for other developers. Apparently, something I had scheduled on the ScheduledThreadPoolExecutor was completely occupying the CPU.
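
For reference, naming the pool's threads is what made this stack trace attributable in the first place. Below is a minimal sketch of such a factory; the class name NamedThreadFactory and the exact prefix are illustrative, not the actual production code.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public final class NamedThreadFactoryExample {

  // Gives every thread a recognizable name so that thread dumps and
  // top -H output can be traced back to their owner.
  static final class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger(1);

    NamedThreadFactory(String prefix) {
      this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable r) {
      return new Thread(r, prefix + "-thread-" + counter.getAndIncrement());
    }
  }

  public static void main(String[] args) {
    // Threads created by this pool show up as "joseph-mate-pool-1-thread-1" and so on.
    ScheduledExecutorService executorService =
        Executors.newScheduledThreadPool(1, new NamedThreadFactory("joseph-mate-pool-1"));
    executorService.schedule(
        () -> System.out.println("Hello from " + Thread.currentThread().getName()),
        0, TimeUnit.SECONDS);
    executorService.shutdown();
  }
}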

Is it my fault?

First, I decided to look at what I had scheduled in that thread pool. I went through the code and, to my surprise, found nothing that could plausibly cause increased CPU consumption.

executorService.scheduleWithFixedDelay(
    () -> doSomethingReallyCheap(),
    0, // initial delay
    10, // delay between runs
    TimeUnit.MINUTES
);

This task ran every 10 minutes and was nowhere near CPU-bound. There were other, similar blocks of code scheduled at application startup, but none of them were CPU-heavy either. Moreover, I had deliberately chosen scheduleWithFixedDelay in case an execution unexpectedly took longer than 10 minutes, so that the thread pool would not pile up extra work before the previous instance of the task had completed.
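
To make that distinction concrete, here is a small sketch contrasting the two scheduling modes. The pool and the doSomethingReallyCheap() stand-in mirror the snippet above, not the real production code: scheduleWithFixedDelay waits the full delay after each run finishes, whereas scheduleAtFixedRate aims at a fixed period, so overrunning executions start late and back to back (though never concurrently).

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class SchedulingModes {

  public static void main(String[] args) {
    ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1);

    // Fixed delay: the next run starts 10 minutes AFTER the previous run completes,
    // so a slow run simply pushes the schedule back instead of piling up work.
    executorService.scheduleWithFixedDelay(
        () -> doSomethingReallyCheap(), 0, 10, TimeUnit.MINUTES);

    // Fixed rate: runs are targeted every 10 minutes from the start time; if one run
    // overruns, later runs start late and back to back, but never overlap.
    executorService.scheduleAtFixedRate(
        () -> doSomethingReallyCheap(), 0, 10, TimeUnit.MINUTES);
  }

  // Stand-in for the real task, which is not shown in this article.
  private static void doSomethingReallyCheap() {
    System.out.println("tick");
  }
}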

Here my guesswork ended, and I asked a colleague to double-check me. He assessed the situation and agreed: it did not look like a problem with doSomethingReallyCheap(). However, on closer inspection of the stack trace, he noticed that the thread was not inside doSomethingReallyCheap() at all; everything was stuck in DelayedWorkQueue.poll(ScheduledThreadPoolExecutor.java:809)! We decided to turn to the Internet for help.

Exonerated!

Googling for “ScheduledThreadPoolExecutor 100% CPU”, I found a StackOverflow post describing the same symptoms: “ScheduledExecutorService consumes 100% CPU when corePoolSize = 0”. The author had exactly the same line of code as mine:

ScheduledExecutorService executorService = Executors.newScheduledThreadPool(0);

I had done this on purpose, because I did not want a permanently occupied thread for the sake of something that runs only once every 10 minutes, even though the documentation warns against setting the pool size to zero:

Additionally, it is almost never a good idea to set corePoolSize to zero or use allowCoreThreadTimeOut because this may leave the pool without threads to handle tasks once they become eligible to run.

But I did not give this much weight, since waiting for the pool to create a new thread did not bother me: for a task that runs every 10 minutes, that extra latency is negligible.

corePoolSize is the number of threads to keep in the pool even when they are idle, unless allowCoreThreadTimeOut is set. So I expected the pool to keep threads around only while there was work for them to do.

In the end, to get around this unexpected behavior, I switched to corePoolSize=1. It frustrated me that there would now always be one live thread, but that was the best I could do in the time available.
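
Concretely, the workaround amounts to changing a single argument at pool construction; a sketch mirroring the snippet above:

ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1); // was 0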

Nevertheless, this strange behavior still puzzled me. Had I misunderstood the JavaDoc? Scrolling down the thread, I was relieved to read the top answer:

This is a known bug: JDK-8129861 (https://bugs.openjdk.java.net/browse/JDK-8129861), which was fixed in JDK 9.

Exonerated!

Why!?

And yet I could not let it go. How can corePoolSize=0 make the JDK consume 100% of the CPU? The details of that bug report did not clarify things:

Looks like yet another zero-core-threads bug that was fixed in the big jsr166 integration into JDK 9.

All of this suggests that ScheduledThreadPoolExecutor saw large-scale changes in JDK 9. It also became clear that the developers were unlikely to fix this bug in JDK 8: if you pull the same trick on even the latest JDK 8, the problem will most likely reproduce.

It looked like there was no way around digging into the source code. Let's start the investigation at the top of the stack from the dump obtained earlier: ScheduledThreadPoolExecutor$DelayedWorkQueue.poll(ScheduledThreadPoolExecutor.java:809). By running a minimal reproducible sample in IntelliJ and setting a breakpoint in DelayedWorkQueue.poll, it is possible to catch this endless loop in the act. Walking a few frames up the stack, we arrive at ThreadPoolExecutor.getTask, which looks like this:

private Runnable getTask() {
    boolean timedOut = false; // Did the last poll() time out?
    for (;;) {
        ...
        int wc = workerCountOf(c);
        // Are workers subject to culling?
        boolean timed = allowCoreThreadTimeOut || wc > corePoolSize;

        if ((wc > maximumPoolSize || (timed && timedOut))
            && (wc > 1 || workQueue.isEmpty())) {
            if (compareAndDecrementWorkerCount(c))
                return null;
            continue;
        }

        try {
            Runnable r = timed ?
                workQueue.poll(keepAliveTime, TimeUnit.NANOSECONDS) :
                workQueue.take();
            if (r != null)
                return r;
            timedOut = true;
        ...
    }
}

  1. Note that keepAliveTime here is 0.
  2. The number of workers (wc) is 1, since there is one live thread.
  3. corePoolSize is 0.
  4. This means that timed = true.
  5. Next, we check whether the worker can be culled.
  6. There are 2 conditions that can prevent this:
    1. wc > 1: but in our case there is only 1 worker.
    2. workQueue.isEmpty(): the queue cannot be empty, because the scheduled task is always sitting in it.
  7. So we end up polling workQueue with a timeout of keepAliveTime = 0!
  8. That poll returns immediately with null, because the task is not due for another 10 minutes.
  9. As a result, we are back at the top of the for (;;) loop, and nothing has changed since the previous iteration, so the thread spins forever (see the sketch below).
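
To see why a zero timeout turns this loop into a busy spin, here is a minimal standalone sketch. It uses the public java.util.concurrent.DelayQueue rather than the JDK-internal DelayedWorkQueue, but the relevant poll-with-timeout semantics are the same: with a zero timeout, poll returns null immediately while the head of the queue is not yet due, so a loop shaped like getTask never blocks.

import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public final class ZeroTimeoutSpin {

  // A queue entry that only becomes due ten minutes from now,
  // standing in for the periodic task sitting in the work queue.
  static final class FarFutureTask implements Delayed {
    final long dueAtNanos = System.nanoTime() + TimeUnit.MINUTES.toNanos(10);

    @Override
    public long getDelay(TimeUnit unit) {
      return unit.convert(dueAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
      return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
  }

  public static void main(String[] args) throws InterruptedException {
    DelayQueue<FarFutureTask> queue = new DelayQueue<>();
    queue.add(new FarFutureTask());

    long spins = 0;
    long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(1);
    while (System.nanoTime() < deadline) {
      // keepAliveTime == 0, just like getTask() with corePoolSize = 0:
      // poll returns null immediately because the head is not yet due.
      FarFutureTask task = queue.poll(0, TimeUnit.NANOSECONDS);
      if (task == null) {
        spins++; // nothing blocked, so we simply burn CPU
      }
    }
    System.out.println("Iterations in one second: " + spins);
  }
}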

Conversely, if corePoolSize is equal to 1, then timed = false, and the code calls workQueue.take() instead, which blocks until a task is actually due. This explains why setting corePoolSize=1 works around the problem.

You could try to fix this by changing the condition wc > 1 to wc > 0, and also replacing the isEmpty() check with a check for whether anything needs to run in the near future. However, that runs into a fundamental problem with the design: the current worker thread is the very thing polling for new work, so we cannot simply relax the condition for culling a worker to wc > 0. I should have heeded the JavaDoc warning from the start and not used corePoolSize=0.

To work within the limits of this JDK version, one could probably arrange for a new thread to be spawned shortly before the scheduled task's deadline and let the current thread exit. But there is still a lot to figure out there; for example, who would create that thread when the scheduled task actually becomes due?

What could have been done better?

Read the manual: I could not have known up front that corePoolSize=0 would not work, but I could have read the JavaDoc more carefully and heeded its warning. Although even if I had read that line, it probably would not have stopped me, because in my mind this option looked perfectly acceptable.

Conduct tests: the frustrating part is that the problem reproduces every single time, and a 100% CPU load should be easy to spot; it would have been enough to watch the processor during testing. I only do that when a change is expected to affect performance, and I did not foresee that this change would. Should I check CPU usage for every change? Or perhaps there could be an automated test that flags significant differences in CPU usage between code changes (a rough sketch follows).
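
One way such a check might look (a rough sketch, not something we actually had): let the scheduler sit idle for a short window and then assert that the JVM's threads burned almost no CPU time during that window, using ThreadMXBean. On a JDK 8 build with this bug the assertion fails; on JDK 9+ it passes.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class IdleCpuCheck {

  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService executorService = Executors.newScheduledThreadPool(0);
    executorService.scheduleWithFixedDelay(
        () -> System.out.println("Hello World!"), 0, 10, TimeUnit.MINUTES);

    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    long before = totalCpuNanos(threads);
    Thread.sleep(5_000); // the pool should be completely idle during this window
    long after = totalCpuNanos(threads);

    long cpuMillis = TimeUnit.NANOSECONDS.toMillis(after - before);
    System.out.println("CPU time used while idle: " + cpuMillis + " ms");
    if (cpuMillis > 500) { // generous threshold; an idle JVM should stay far below it
      throw new AssertionError("Scheduler burned CPU while idle: " + cpuMillis + " ms");
    }
    executorService.shutdownNow();
  }

  // Sums CPU time across all live threads; getThreadCpuTime returns -1
  // when the measurement is unavailable, so such values are skipped.
  private static long totalCpuNanos(ThreadMXBean threads) {
    long total = 0;
    for (long id : threads.getAllThreadIds()) {
      long cpu = threads.getThreadCpuTime(id);
      if (cpu > 0) {
        total += cpu;
      }
    }
    return total;
  }
}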

Upgrade: if I had updated the JDK in a timely manner, this issue, which dates back to 2018 or 2019, would not have arisen. JDK 11 had already been released around that time, and this bug was no longer present in it.

Don't optimize: I should not have been optimizing the thread pool in the first place. Two comments pointed out my mistake quite clearly:

You are guilty of one of the greatest crimes that can be committed in computer science: premature optimization. A single permanently live thread consumes a tiny amount of resources. Any attempt to optimize it away was pointless at best and dangerous at worst, as you found out for yourself. – /u/SirSagramore

The lesson is that optimizing prematurely, without supporting facts and numbers, is rarely a good idea. It is just one OS thread, and the OS probably knows better than the developer how to manage resources in the long run. – /u/wot-teh-phuck

Try it yourself

If you want to experiment yourself, compile and run the code below on JDK 8. At the time of writing, the bug still reproduces there, since it was only fixed in JDK 9.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class Main {

  public static void main(String [] args) throws Exception {
    ScheduledExecutorService executorService = Executors.newScheduledThreadPool(0);
    executorService.scheduleWithFixedDelay(
        () -> System.out.println("Hello World!"),
        0, // initial delay
        10, // delay between runs
        TimeUnit.MINUTES
    );
    while (true) {
      Thread.sleep(10000);
    }
  }

}

Purely as an additional check, I also ran this code on JDK 9. Sure enough, the CPU load stayed low.

$ top
  PID USER      PR  NI    VIRT    RES  %CPU  %MEM     TIME+ S COMMAND
    1 root      20   0 3008.5m  29.1m   0.0   1.5   0:00.76 S java Main
   27 root      20   0    1.6m   0.0m   0.0   0.0   0:00.02 S sh
   34 root      20   0    7.9m   1.3m   0.0   0.1   0:00.01 R  `- top

Final thoughts

It was reassuring that the main fault did not lie with me. Still, I had a hand in the problem, because I could have caught it myself had I monitored the CPU during testing. Luckily, we have strong engineers who caught the bug before release.
