100% CPU utilization: my fault?
The story of a JDK bug, and of the development mistake around it, that resulted in an abnormal 100% CPU load. What went wrong, what could have been done better, and who was ultimately to blame?
Thanks (added later):
- Omar Elrefaei for his PR, which corrected the formatting of this document.
- /u/thorn-harvestar, /u/philipTheDev and /u/vips7L: for identifying another fundamental cause – I shouldn't still be using JDK 8.
- /u/SirSagramore and /u/wot-teh-phuck: for proving it really was my fault :(. Premature optimization is the root of all evil.
Sirens howl
A couple of years ago, I received a bug report stating that I had caused 100% CPU utilization in a virtual machine in a situation where the processor should have been idle. This made me curious, since I try to avoid patterns like while(true) and for(;;).
The support engineer who filed the report attached proof:
$ top
PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
1 root 20 0 1832.5m 20.6m 100.7 1.0 4:11.04 S java Main
25 root 20 0 1.6m 1.1m 0.0 0.1 0:00.04 S sh
38 root 20 0 1.9m 1.2m 0.0 0.1 0:00.00 R `- top
Indeed, the JVM was using 100% of the CPU. But how was the issue traced back to me?
How was I found?
This bug was tracked down by one of our developers. First, using top -H -p ‹pid› with the pid taken from the earlier top output, he obtained the list of thread IDs belonging to the Java process:
$ top -H -p 1
PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
1 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.02 S java Main
7 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.32 S `- java Main
8 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.00 S `- java Main
9 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.00 S `- java Main
10 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.01 S `- java Main
11 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.00 S `- java Main
12 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.00 S `- java Main
13 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.00 S `- java Main
14 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.03 S `- java Main
15 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.05 S `- java Main
16 root 20 0 1832.5m 20.6m 0.0 1.0 0:00.00 S `- java Main
17 root 20 0 1832.5m 20.6m 0.7 1.0 0:00.42 S `- java Main
18 root 20 0 1832.5m 20.6m 99.9 1.0 3:08.69 R `- java Main
The thread with ID 18 was consuming 100% of the CPU. Converting this ID to hexadecimal gave 12.
$ printf "%x\n" 18
12
He then triggered a thread dump to stdout with kill -s QUIT ‹pid› and searched the output for the line containing nid=0x12 to find that thread's stack trace.
$ kill -s QUIT 1
...
"joseph-mate-pool-1-thread-1" #8 prio=5 os_prio=0 tid=0x0000558708d94000 nid=0x12 runnable [0x00007f23e77fd000]
java.lang.Thread.State: RUNNABLE
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.poll(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
...
Damn it! That is exactly my thread pool. Fortunately, I had provided an appropriately named thread factory, which made debugging easier for other developers. It looked like something I had submitted to the ScheduledThreadPoolExecutor was completely occupying the CPU.
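A thread factory like the one that produced the "joseph-mate-pool-1-thread-1" name in the dump can be sketched as follows. This is an illustrative reconstruction, not the actual production code; the class and variable names are made up, only the name prefix comes from the dump above:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class NamedFactoryDemo {
    public static void main(String[] args) throws Exception {
        AtomicInteger counter = new AtomicInteger(1);
        // A named factory: every thread it creates carries a recognizable
        // prefix, so it stands out in `top -H` output and in thread dumps.
        ThreadFactory named = runnable ->
                new Thread(runnable, "joseph-mate-pool-1-thread-" + counter.getAndIncrement());

        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1, named);
        // Print the worker thread's name to show the factory took effect.
        pool.submit(() -> System.out.println(Thread.currentThread().getName())).get();
        pool.shutdown();
    }
}
```

Without a custom factory, the dump would have shown the generic "pool-1-thread-1", and tracing the pool back to its owner would have taken far longer.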
Is it my fault?
First, I decided to look at what I had scheduled on the thread pool. I went through the code and, to my surprise, found nothing that would suggest increased CPU consumption.
executorService.scheduleWithFixedDelay(
() -> doSomethingReallyCheap(),
0, // initial delay
10, // delay
TimeUnit.MINUTES
);
This task ran every 10 minutes and was not CPU-intensive. There were other similar blocks scheduled at application startup, but none of them were CPU-heavy either. Moreover, I had deliberately chosen scheduleWithFixedDelay in case an execution suddenly took longer than 10 minutes, so that the pool would not pile up redundant work before the previous run finished. Here my guesswork ended, and I asked a colleague to double-check me.
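The no-pile-up guarantee of scheduleWithFixedDelay can be observed directly. The sketch below (timings shrunk from minutes to milliseconds, and all names made up for the demo) measures the gap between the end of one run and the start of the next; with fixed delay the gap is never shorter than the configured delay, even though each run takes longer than the delay itself:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class FixedDelayDemo {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicLong lastEndNanos = new AtomicLong(-1);        // end time of the previous run
        AtomicLong minGapMs = new AtomicLong(Long.MAX_VALUE); // smallest observed gap

        // Each run takes ~150 ms while the configured delay is only 50 ms.
        // With scheduleWithFixedDelay the 50 ms countdown starts only AFTER
        // a run completes, so the next run never crowds the previous one.
        pool.scheduleWithFixedDelay(() -> {
            long start = System.nanoTime();
            long prevEnd = lastEndNanos.get();
            if (prevEnd >= 0) {
                long gapMs = (start - prevEnd) / 1_000_000;
                minGapMs.getAndUpdate(g -> Math.min(g, gapMs));
            }
            try {
                Thread.sleep(150);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            lastEndNanos.set(System.nanoTime());
        }, 0, 50, TimeUnit.MILLISECONDS);

        Thread.sleep(700); // let a few runs complete
        pool.shutdownNow();
        pool.awaitTermination(5, TimeUnit.SECONDS);

        // 45 rather than 50 only to allow for clock granularity.
        System.out.println("gapAtLeastDelay=" + (minGapMs.get() >= 45));
    }
}
```

With scheduleAtFixedRate and the same timings, missed periods would instead be made up back-to-back as soon as the overrunning task finished.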
He assessed the situation and agreed with me: the problem did not seem to be in doSomethingReallyCheap(). However, on closer inspection of the stack trace, he noticed the time was not being spent in doSomethingReallyCheap() at all. Everything was stuck in DelayedWorkQueue.poll(ScheduledThreadPoolExecutor.java:809)! It was time to turn to the Internet for help.
Vindicated!
Googling for "ScheduledThreadPoolExecutor 100% CPU", I found a StackOverflow post describing similar symptoms: "ScheduledExecutorService consumes 100% CPU when corePoolSize = 0".
The author had exactly the same line of code as mine:
ScheduledExecutorService executorService = Executors.newScheduledThreadPool(0);
I did this on purpose: I didn't want a thread continuously consuming resources for the sake of something that runs only once every 10 minutes. The documentation even warns against setting the pool size to zero:

Additionally, it is almost never a good idea to set corePoolSize to zero or use allowCoreThreadTimeOut because this may leave the pool without threads to handle tasks once they become eligible to run.
But I did not attach much importance to this, since waiting for the pool to create a new thread did not bother me; the 10-minute schedule was hardly demanding. And corePoolSize is "the number of threads to keep in the pool, even if they are idle, unless allowCoreThreadTimeOut is set" – so I expected the pool to keep threads around only temporarily.
In the end, to get around this unexpected behavior, I switched to corePoolSize=1. It bothered me that there would always be one live thread, but it was the best I could do in the time available.
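The workaround was essentially a one-character change: create the pool with one core thread instead of zero. A minimal sketch of it (the task body and the latch are stand-ins added purely so the demo runs and exits cleanly; they are not from the original code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Workaround {
    public static void main(String[] args) throws Exception {
        // corePoolSize = 1: one thread stays alive even when idle, so
        // getTask() blocks in workQueue.take() instead of spinning on a
        // zero-timeout poll (the JDK-8129861 behavior on JDK 8).
        ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1);

        CountDownLatch firstRun = new CountDownLatch(1);
        executorService.scheduleWithFixedDelay(
                firstRun::countDown, // stand-in for the real cheap task
                0,  // initial delay
                10, // delay between runs
                TimeUnit.MINUTES
        );

        firstRun.await(); // wait for the first execution
        System.out.println("first run completed");
        executorService.shutdownNow();
    }
}
```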
Nevertheless, this strange behavior puzzled me greatly. Had I misunderstood the JavaDocs? Scrolling down the thread, I was relieved to read the accepted answer:
This is a known bug: JDK-8129861 (https://bugs.openjdk.java.net/browse/JDK-8129861), which was fixed in JDK 9.
Vindicated!
Why!?
And yet I couldn't let it go. How can corePoolSize=0 make the JDK consume 100% of the CPU? The bug's description didn't clarify much:
Looks like one more zero-core-threads bug that was fixed by the big jsr166 integration into jdk9.
All of this suggested that large-scale changes to ScheduledThreadPoolExecutor were planned for JDK 9, and that the developers were unlikely to backport a fix to JDK 8. In other words, if you pull the same trick on the latest JDK 8, the problem will most likely still reproduce.
It seemed I couldn't avoid digging into the source code. Let's start the investigation at the top of the stack from the dump obtained earlier: ScheduledThreadPoolExecutor$DelayedWorkQueue.poll(ScheduledThreadPoolExecutor.java:809).
By running a minimal reproducible sample in IntelliJ with a breakpoint set at DelayedWorkQueue.poll, we can find the endless loop. Walking a few frames up the stack, we reach ThreadPoolExecutor.getTask, which looks like this:
private Runnable getTask() {
boolean timedOut = false; // Did the last poll() time out?
for (;;) {
...
int wc = workerCountOf(c);
// Are workers subject to culling?
boolean timed = allowCoreThreadTimeOut || wc > corePoolSize;
if ((wc > maximumPoolSize || (timed && timedOut))
&& (wc > 1 || workQueue.isEmpty())) {
if (compareAndDecrementWorkerCount(c))
return null;
continue;
}
try {
Runnable r = timed ?
workQueue.poll(keepAliveTime, TimeUnit.NANOSECONDS) :
workQueue.take();
if (r != null)
return r;
timedOut = true;
...
}
}
- Note that here keepAliveTime equals 0.
- The worker count (wc) equals 1, since there is one live thread, while corePoolSize equals 0.
- Therefore timed = true.
- Next, we check whether the worker can be culled. There are two conditions that could prevent this:
  - wc > 1 – but in this case we have only 1 worker;
  - workQueue.isEmpty() – but the queue cannot be empty, since a scheduled task always sits in it.
- This means we poll workQueue with keepAliveTime = timeout = 0!
- In turn, this means the poll returns immediately.
- As a result, we land back at the top of the for (;;) loop, and none of the conditions changes from one iteration to the next.
Conversely, if corePoolSize equals 1, then timed = false, and the call instead goes to workQueue.take(), which blocks. This explains why setting corePoolSize=1 works around the problem.
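The busy spin boils down to the semantics of a timed poll with a zero timeout. The sketch below uses a plain java.util.concurrent.DelayQueue rather than the JDK-internal DelayedWorkQueue (the timed-poll semantics are the same), with a single made-up element that is not due for another 10 minutes:

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class ZeroTimeoutPoll {
    // A stand-in for the scheduled task at the head of the work queue:
    // it only becomes due 10 minutes from now.
    static final class NotDueYet implements Delayed {
        final long dueAtNanos = System.nanoTime() + TimeUnit.MINUTES.toNanos(10);

        @Override public long getDelay(TimeUnit unit) {
            return unit.convert(dueAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    public static void main(String[] args) throws Exception {
        DelayQueue<NotDueYet> queue = new DelayQueue<>();
        queue.add(new NotDueYet());

        // This mirrors getTask() when timed == true and keepAliveTime == 0:
        // the poll does not block until the head element is due; it returns
        // null immediately, and the caller loops straight back around.
        long start = System.nanoTime();
        NotDueYet task = queue.poll(0, TimeUnit.NANOSECONDS);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("task=" + task + " returnedImmediately=" + (elapsedMs < 1000));
    }
}
```

Wrap that poll in a loop whose exit conditions never change, as getTask() does, and you get exactly the 100% CPU spin from the bug report.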
It might seem you could fix this by changing the condition wc > 1 to wc > 0, and changing the isEmpty() check to ask whether anything needs to run in the near future. However, that runs into a deeper structural problem: the current worker thread is the very thread that polls for the availability of tasks, so if the last worker were culled, no thread would be left to notice when the scheduled task comes due. I should have heeded the JavaDoc warning from the start and never used corePoolSize=0.
To fix this limitation in that version of the JDK, one could perhaps spawn a new thread shortly before the scheduled task's deadline and let the current thread exit. But that raises plenty of questions of its own; for example, which thread would then create the new one when a scheduled task needs to run?
What could have been done better?
Read the manual: I could not have known offhand that corePoolSize=0 wouldn't work, but I could have read the JavaDocs more closely and heeded the warning. Although even if I had read that line, it probably wouldn't have stopped me, because in my mind the option seemed perfectly acceptable.
Conduct tests: it is telling that the problem reproduced every time, and a 100% CPU load should be easy to detect – monitoring the processor during testing would have been enough. I only did that when a change specifically targeted performance, and did not foresee that this change would affect it. Should CPU usage be checked for every change? Or perhaps an automated test could flag significant differences in CPU usage between code revisions?
Update the JDK: had I updated the JDK in a timely manner, this issue, which dates back to 2018 or 2019, would not have arisen at all. Around that time JDK 11 came out, and the bug was no longer present in it.
Don't optimize prematurely: I shouldn't have tried to optimize the thread pool at all. Two comments made this point plainly:
You are guilty of one of the greatest crimes that can be committed in computer science – premature optimization. Keeping one thread alive consumes a tiny amount of resources. Any attempt to optimize it away was pointless at best and dangerous at worst. And now you have seen that for yourself. – /u/SirSagramore
I learned that optimizing prematurely, without supporting facts and numbers, is rarely a good idea. This is just one OS thread, and the OS probably knows better than the developer how to manage resources in the long run. – /u/wot-teh-phuck
Try it yourself
If you want to experiment yourself, compile and run the code below on JDK 8. At the time of writing, the bug still reproduces there, as it was only fixed in JDK 9.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
public final class Main {
public static void main(String[] args) throws Exception {
ScheduledExecutorService executorService = Executors.newScheduledThreadPool(0);
executorService.scheduleWithFixedDelay(
() -> System.out.println("Hello World!"),
0, // initial delay
10, // delay
TimeUnit.MINUTES
);
while (true) {
Thread.sleep(10000);
}
}
}
Purely as a sanity check, I ran the same code on JDK 9. Sure enough, the processor load was low.
$ top
PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
1 root 20 0 3008.5m 29.1m 0.0 1.5 0:00.76 S java Main
27 root 20 0 1.6m 0.0m 0.0 0.0 0:00.02 S sh
34 root 20 0 7.9m 1.3m 0.0 0.1 0:00.01 R `- top
Final thoughts
I was reassured that the main fault was not mine. Still, I was part of the problem, since I could have caught it myself had I monitored the CPU during testing. Luckily, we have strong engineers who caught the bug before release.