Hey there! Are you new to the JVM? Did you just run your Java programs but never cared what the JVM does with your code under the hood? Want to learn about JVM internals, how to (not) write a Java benchmark test or are you simply curious about JVM performance? Then keep on reading!
Join me on my journey into the world of Java microbenchmarking, Just-in-Time compilers and other JVM internals.
This blog post will be about JVM Warm-Up Performance.
Warm-up is usually defined as the number of iterations the JVM needs to increase the speed of method execution between the first and the n-th invocation by applying JIT compiler optimizations to the bytecode.
A range of papers on JVM warm-up has been published within the last five years, and the 20 most relevant results all mention that warm-up has to be considered (and also how). However, only one study actually puts a focus on JVM warm-up performance, by designing a new JVM named HotTub. What this study does not cover is a comparison of warm-up among existing JVMs.
So I was curious to find out more about JVM warm-up performance. I wanted to know if some JVMs can warm up code faster than others and, if so, which one would be the fastest. For sure I couldn't test all the JVMs out there, so I decided to go for the tried and tested OpenJDK HotSpot VM, the polyglot GraalVM and the enterprise JVM OpenJ9.
This blog post cannot (and will not) provide a detailed introduction to JVM internals. However, I will briefly cover some aspects that are relevant for understanding this article. If something seems unclear to you or you would like to deepen your knowledge, I recommend further reading, e.g. Scott Oaks' Java Performance, or watching some of these videos. Whatever you do, don't lose yourself in too many open browser tabs 😉. You've been warned!
Memory in the Java heap is managed in generations (memory pools holding objects of different ages). Garbage Collection (GC) occurs in each generation when that generation fills up. The Java heap is basically divided into a young and an old generation (the permanent generation was removed in JDK 8, now there is metaspace):
The Epsilon GC handles memory allocation but does not implement any actual memory reclamation mechanism. Once the available Java heap is exhausted, the JVM will shut down.
The HotSpot VM contains two JIT compilers: C1 (also called client compiler) and C2 (also called server compiler).
C1 is designed to run faster and produce less optimized code, while C2 takes a little more time to run but produces better-optimized code.
In the default configuration, the JVM marks a method as hot after a certain code block has been executed more than 10 000 times. Only then does the C2 server compiler start to compile the hot code. Once ready, the C1-compiled code is replaced by the C2-compiled code (C2-compiled code is sometimes referred to as level 4 compilation). C1 optimizations (level 1-3 compilations) kick in a lot earlier. The first time code is executed, it is only interpreted by the JVM. To learn more about the concept of Tiered Compilation, check out the book I recommended or watch the first few minutes of the mentioned video.
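You can get a rough feel for this warm-up effect yourself by timing a hot method before and after crossing the invocation threshold. Below is a minimal, self-contained sketch (class and method names are mine, and the printed timings will of course vary by machine and JVM):

```java
// WarmUpDemo.java -- a rough illustration of JIT warm-up (not a proper benchmark!)
public class WarmUpDemo {

    // A small, hot method: sum of squares below n
    static long hot(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += (long) i * i;
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long r = hot(10_000);
        long cold = System.nanoTime() - t0;                // first call: interpreted

        for (int i = 0; i < 20_000; i++) r += hot(10_000); // cross the compilation threshold

        long t1 = System.nanoTime();
        r += hot(10_000);
        long warm = System.nanoTime() - t1;                // now most likely JIT-compiled

        // consume r so the computation cannot be eliminated as dead code
        System.out.println("result=" + r + " cold=" + cold + "ns warm=" + warm + "ns");
    }
}
```

Running it with -XX:+PrintCompilation shows the compilation tiers as they kick in. Of course, naive timing like this suffers from exactly the pitfalls JMH was built to avoid, which is why the actual measurements in this post use JMH.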
Writing a good benchmark test is not easy. There are two fault categories: conceptual flaws when designing a microbenchmark and contextual effects when running it. Let's neglect the contextual effects for now. I'll cover that when I write about the test environment. Conceptual flaws however can mostly be avoided by using frameworks that prevent many issues by design or at least provide methods to circumvent common pitfalls. The following chapter will introduce such a tool.
The Java Microbenchmark Harness (JMH) is a tool that was created with the intention of helping developers avoid common pitfalls when writing and executing Java benchmarks.
JMH provides four different benchmark modes: (i) Throughput, (ii) AverageTime, (iii) SampleTime and (iv) SingleShotTime.
While (i) measures the number of operations per second and (ii) & (iii) give information about average execution time, lower and upper bounds, etc., the (iv) SingleShotTime mode measures the duration for a single benchmark method execution. This is useful to test how things perform under a cold start.
At this point I will introduce a terminology based on terms used by JMH to adequately describe the way a benchmark measurement is carried out:
The number of forks and iterations can be configured either directly in the code, using the respective annotations, or by providing them through flags on the command line (which would overrule the annotation configuration):
-i 21000 -f 20 -to 360m would instruct JMH to run the benchmark test in 20 independent forks with 21 000 iterations each. The -to flag sets the timeout to 360 minutes; if the current fork doesn't finish within that time frame, it is aborted. With the -wi flag you could adjust the warm-up iterations. However, since I wanted to measure the warm-up itself, I set the number of warm-up iterations to 0 inside the benchmark test code.
The -rf option is used to specify the output format (e.g. -rf json) and should be combined with the -rff option to specify the result file name.
To view a complete list of JMH command line options invoke the following command:
$ java -jar target/benchmarks.jar -h
If you have several benchmark tests in your project but want JMH to execute only a single one, you can do so by referencing it via its class name, e.g.:
$ java -jar target/benchmarks.jar MyBenchmark
Since the community agrees that writing a good benchmark test isn't easy, one might ask:
Why make the effort of writing a benchmark instead of using an existing benchmark suite? Well, I looked for one in the first place, but it quickly turned out that existing benchmark suites like SPECjvm2008 or the DaCapo benchmark suite do not fit my research's use case. Neither were the outputs and measurement units in a format suitable for further analysis of the collected data, nor were all benchmark tests working with the targeted Java version 11. In addition, both suites were already quite outdated. SPECjvm2008 was released in 2008 and the DaCapo benchmark suite (first released in 2009) issued its last maintenance release almost two years ago (eight months before the release of JDK 11). I should mention that there seems to be a 2019 snapshot of the DaCapo benchmark suite available which is built with JDK 11, but I still didn't consider DaCapo to be useful in my case.
Under the given circumstances, writing a new benchmark test for evaluating the warm-up performance of Java 11 JVMs was the only reasonable alternative.
I first tried to reuse existing benchmark tests from SPECjvm2008 (compression and crypto benchmarks) and bootstrap them using JMH. I managed to get the benchmark tests running, but finally failed when the benchmark execution either exhausted the available heap (6 GB) or did not finish in a reasonable time window.
This tweet from August 2020 shows my first attempt at writing my own Java microbenchmark test. The code generates 100 random integers and divides them into a list of odd and a list of even integers. Then both lists are sorted and printed to stdout. Soon I learned about the importance of consuming the program's outcome. Otherwise the JVM would detect that the result of the operation is never used and simply not execute the whole method. This is called Dead Code Elimination. Your benchmark would report having finished in (almost) zero time, since it didn't do anything. To avoid this pitfall, you can implement the method to have a return value, print the processed objects to stdout or use JMH's Blackhole to consume the objects of interest.
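For illustration, here is a self-contained re-creation of that first experiment, written so that the result is returned (and thus consumed) instead of being thrown away. Names and details are my reconstruction, not the exact tweeted code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class OddEvenDemo {

    // Split the input into odd and even numbers, sort both lists,
    // and RETURN them -- a returned value cannot be dead-code-eliminated.
    static List<List<Integer>> partitionAndSort(List<Integer> nums) {
        List<Integer> odd = new ArrayList<>();
        List<Integer> even = new ArrayList<>();
        for (int n : nums) (n % 2 == 0 ? even : odd).add(n);
        Collections.sort(odd);
        Collections.sort(even);
        return List.of(odd, even);
    }

    public static void main(String[] args) {
        Random random = new Random(42); // fixed seed for reproducibility
        List<Integer> nums = new ArrayList<>();
        for (int i = 0; i < 100; i++) nums.add(random.nextInt(1000));
        // Printing also counts as consuming the result
        System.out.println(partitionAndSort(nums));
    }
}
```

Inside a JMH benchmark you would simply declare partitionAndSort's result as the @Benchmark method's return value, and JMH consumes it for you.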
For further analysis on the collected data I created some Jupyter notebooks in which I could generate plots from the data. I also wrote a small script to transform the raw output from JMH into a consumable format.
Besides the benchmark that sorts odd and even integers into lists, I experimented with other implementations doing summations or counting some bottles of beer on the wall.
In the end, I found that a backtracking algorithm solving a Sudoku would be a suitable benchmark test, as it is a fine-grained, work-based benchmark. The algorithm is not too simple, so the JVM will not eliminate parts of the benchmark code right away, but it is limited enough to be used for precise microbenchmarking.
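The actual benchmark test lives in the repository mentioned in a moment; a condensed sketch of the kind of backtracking solver involved (simplified, and without the JMH harness around it) could look like this:

```java
public class SudokuSolver {

    // Fill the first empty cell with a valid digit and recurse; backtrack on failure.
    static boolean solve(int[][] board) {
        for (int r = 0; r < 9; r++)
            for (int c = 0; c < 9; c++)
                if (board[r][c] == 0) {
                    for (int v = 1; v <= 9; v++)
                        if (valid(board, r, c, v)) {
                            board[r][c] = v;
                            if (solve(board)) return true;
                            board[r][c] = 0; // undo and try the next digit
                        }
                    return false; // no digit fits in this cell
                }
        return true; // no empty cell left: solved
    }

    // Check row, column and 3x3 box constraints for placing v at (r, c)
    static boolean valid(int[][] b, int r, int c, int v) {
        for (int i = 0; i < 9; i++)
            if (b[r][i] == v || b[i][c] == v
                    || b[r / 3 * 3 + i / 3][c / 3 * 3 + i % 3] == v)
                return false;
        return true;
    }

    public static void main(String[] args) {
        int[][] board = new int[9][9]; // empty grid: always solvable
        System.out.println("solved: " + solve(board));
    }
}
```

The recursion with trial placements and undo steps gives the JIT compiler plenty of small, hot methods to optimize, which is exactly what a warm-up measurement needs.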
The benchmark test can be found on GitHub.
The JMH version used is 1.23.
Enough theory, let's run some code!
The benchmark test was executed several times in a virtual environment hosted on a server in Germany (low latency to my location) which has the following specs:
For each JVM, the Sudoku benchmark test was executed with 21 000 iterations over 20 independent forks. All measurements are recorded without discarding any of the data early on. This means that warm-up iterations are included in the measurement results. For evaluation of warm-up speed, having this data is crucial.
I repeated the measurements on different days and at different times of day, to ensure that they were not distorted by other operations running on the shared virtual environment during one of the benchmark executions.
In order to have comparable benchmark results among the JVMs I tried to avoid interferences wherever possible. Therefore, I provided the JVMs with a bunch of runtime flags:
$ java11-hotspot -XX:+HeapDumpOnOutOfMemoryError \
    -jar target/benchmarks.jar Backtracking \
    -jvmArgs "-Xms5g -Xmx5g -XX:+HeapDumpOnOutOfMemoryError -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:+AlwaysPreTouch" \
    -rf json -rff jmh-result-hotspot-backtracking-sudoku.json \
    -i 21000 -f 20 -to 360m \
    | tee output-hotspot-backtracking-sudoku-$(date +'%FT%H-%M').log
$ java11-graalvm -XX:+HeapDumpOnOutOfMemoryError \
    -jar target/benchmarks.jar Backtracking \
    -jvmArgs "-Xms5g -Xmx5g -XX:+HeapDumpOnOutOfMemoryError -XX:+UnlockExperimentalVMOptions -Dlibgraal.MaxNewSize=6442450944 -Dlibgraal.PrintGC=true -XX:+AlwaysPreTouch" \
    -rf json -rff jmh-result-graalvm-backtracking-sudoku.json \
    -i 21000 -f 20 -to 360m \
    | tee output-graalvm-backtracking-sudoku-$(date +'%FT%H-%M').log
$ java11-openj9 -XX:+HeapDumpOnOutOfMemoryError \
    -jar target/benchmarks.jar Backtracking \
    -jvmArgs "-Xms5g -Xmx5g -XX:+HeapDumpOnOutOfMemoryError -Xgcpolicy:nogc" \
    -rf json -rff jmh-result-openj9-backtracking-sudoku.json \
    -i 21000 -f 20 -to 360m \
    | tee output-openj9-backtracking-sudoku-$(date +'%FT%H-%M').log
java11-hotspot, java11-graalvm and java11-openj9 are aliases I configured for each JVM.
The provided runtime flags partially differ in their syntax between the JVMs, and not all configurations are available on every JVM. However, by providing a smart combination of different JVM-specific flags, a similar runtime behaviour can be achieved. In particular, I decided to entirely disable Garbage Collection and to prefetch memory from the operating system to avoid requesting physical memory during runtime. By coupling the -XX:+AlwaysPreTouch option with setting the same value for -Xms and -Xmx, all memory is committed on startup, which avoids latency spikes when the memory is finally used. For further explanations, see the appendix below.
Besides the runtime flags mentioned, the JVMs were all used in their default configuration (no adjustments regarding compilers, etc.), as a real user would do. The assumption is that users don't care about configuring their JVM; they just throw their Java code at it and want it to run with acceptable performance. For GraalVM this means that the GraalVM compiler mode will be libgraal, which is the default mode of operation.
The graphs of the warm-up charts visualize the iterations' median values among all forks of all runs of the tested JVMs. Thus, I'll call these graphs the median curve. The data points of the first iteration are excluded from the visualization as they include the time taken for lazy class loading. A graph's shade indicates the interquartile range from Q1 (25th percentile) to Q3 (75th percentile). The interquartile range serves to represent the scattering of the different forks' individual data points at any given time slice. Thus, I call it the scatter shade. Median and quartiles are used instead of mean and standard deviation, as they are robust against outliers and skewed distributions. The x-axis is labeled with the number of iterations and the y-axis with the time per operation in nanoseconds.
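Computing these statistics is straightforward. The sketch below shows one common convention (linear interpolation between ranks, which is my choice here; the notebooks may use a different one) for deriving the median and quartiles of one iteration's timings across forks:

```java
import java.util.Arrays;

public class Quartiles {

    // Percentile p (0.0..1.0) of xs, using linear interpolation between
    // the two nearest ranks on a sorted copy of the data.
    static double percentile(double[] xs, double p) {
        double[] s = xs.clone();
        Arrays.sort(s);
        double idx = p * (s.length - 1);
        int lo = (int) Math.floor(idx);
        int hi = (int) Math.ceil(idx);
        return s[lo] + (idx - lo) * (s[hi] - s[lo]);
    }

    public static void main(String[] args) {
        // Made-up ns/op values of one iteration across five forks
        double[] forkTimes = {410, 402, 498, 415, 420};
        System.out.println("median=" + percentile(forkTimes, 0.5)
                + " Q1=" + percentile(forkTimes, 0.25)
                + " Q3=" + percentile(forkTimes, 0.75));
    }
}
```

Applying this per iteration index, across all forks of all runs, yields exactly the median curve and the Q1/Q3 bounds of the scatter shade.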
A full list of charts can be found in the appendix.
A single run of the Sudoku benchmark consisting of 20 forks usually finished within less than 10 minutes. One fork with 21 000 iterations took around 45 seconds for HotSpot and GraalVM. For OpenJ9 it took a little longer (about 1 minute 20 seconds).
These are the average execution times for a single iteration of the benchmark test:
|  | HotSpot VM | GraalVM | OpenJ9 | OpenJ9 AOT |
| --- | --- | --- | --- | --- |
| Avg. Execution Time (in milliseconds) | 0.4142 ms | 0.4087 ms | 0.4983 ms | 0.4501 ms |
The histogram chart provides visual information about the distribution of measured durations per benchmark test execution among all data points. The calculation of average execution times (means) is based on the same measurement data as the warm-up charts, minus the first 250 warm-up iterations. In order to remove outliers and produce clearer diagrams, data points below the 5th and above the 95th percentile were cut off.
Between iterations 6 400 and 6 800, GraalVM shows a bump [view chart]. I did not dig into details here (I didn't have a good profiler at hand). However, it would definitely be interesting to investigate this anomaly: especially since the warm-up curves of HotSpot and GraalVM share a lot of similarities, this bump marks a notable difference between them.
OpenJ9 also has a much steeper negative slope over the first iterations compared to HotSpot or GraalVM. This suggests that the OpenJ9 JVM performs faster in method warm-up than the other two.
In order to gain some insights into JIT compiler optimizations that are applied on the benchmark code, I used JITWatch . JITWatch is a log analyser and visualizer for JVM JIT compilers. It uses compiler logs in XML format which are created by the HotSpot VM and GraalVM when the following runtime flags are provided:
-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=jit-compilation.log
For OpenJ9, the settings look like this:
The JITWatch compilation timeline shows the compilation events for the solve(int) method of my Sudoku benchmark [view]. The compilation timelines do not necessarily depict all compilation events on the graph, but the information can also be retrieved from the compilation list [view].
While JITWatch is a useful tool to visualize the actions of the JIT compiler, I encountered a discrepancy between the compilations shown by JITWatch and the JIT compiler actions logged to the terminal when providing the runtime flags -XX:+PrintCompilation -XX:+PrintInlining (with -XX:+UnlockDiagnosticVMOptions enabled). The terminal log output showed several inlining operations taking place already during the first iterations of the benchmark execution. These inlining operations also fit the warm-up charts and are probably the main driver for the fast decline in the warm-up graphs right at the beginning of the measurements. The difference between the XML compilation log file used by JITWatch and the compilation log output on the terminal can be explained by a limitation in the -XX:+LogCompilation option: inlining decisions made by the C1 compiler are not included in the XML log file.
I also spent some CPU cycles on comparing the behaviour of AOT to JIT compiled code. I conducted this experiment with OpenJ9 since the AOT compiled code is still executed in a JVM. By contrast, GraalVM AOT compiled code would result in a native binary, which could not be benchmarked with JMH anymore.
In order to leverage the OpenJ9 AOT compiler, one must provide -Xshareclasses:nonpersistent,verboseAOT as an additional runtime flag. This enables class data sharing. The first time the benchmark is executed, OpenJ9 compiles AOT code and stores it beyond the JVM's lifetime. The next time the benchmark code is executed, OpenJ9 directly uses the AOT-compiled code.
Also in comparison to HotSpot and GraalVM, OpenJ9 in AOT mode performs faster during the first iterations. However, the JIT compilers of HotSpot and GraalVM catch up with their opponent after approx. 250 iterations and clearly overtake it after approx. 700 iterations [view chart]. This demonstrates that as long as the JVM with the JIT compiler is warming up, merely interpreting bytecode, the JVM running AOT-compiled code performs faster. But as soon as the JIT compiler optimizations kick in, AOT code cannot really compete with the performance of JIT-compiled code.
Let's sum it up: AOT compilation of Java code avoids the warm-up phase at the cost of not fully optimized code later on. Thus, it will always be an individual decision whether a Java program is executed with an AOT or JIT compiler.
This blog post got quite lengthy. I hope you still found it an interesting read, and maybe I've inspired you to get started with your own JVM research. Personally, I learned a lot during the last few months and I think my journey has just started. It feels like I've only seen the tip of the iceberg. There's so much more to discover in the world of JVMs!
When benchmarking the AOT mode of OpenJ9, I wondered about the default configuration of the JVM. OpenJ9 advertises a performance advantage over the HotSpot VM. These performance gains come from making use of OpenJ9's AOT compilation and shared classes cache features. For me, this raised the question of why the features used in the advertised performance benchmarks are not enabled by default.
My study could only partially validate these advantages. On the one hand, the test scope of my study did not cover all KPIs listed on the OpenJ9 website. On the other hand, my experiments were based on microbenchmarks, while the OpenJ9 project made its comparisons with a macrobenchmark: they ran the DayTrader7 benchmark using JMeter to perform a load test. So the test setups totally diverge from one another, making it impossible to falsify the performance statements on the OpenJ9 website.
I would have expected the Graal compiler to execute the benchmark test faster than HotSpot's C2 compiler does. As shown in the JITWatch compilation timeline, the top-level JIT compilers C2 and Graal were only activated after a few thousand iterations. At that point, the level 4 compilations no longer had a significant impact, as the lower tiers' compilations had already produced highly optimized code. Thus, HotSpot's and GraalVM's average execution times only differ by 5.5 μs. Another explanation for the similarity of their warm-up and average execution time results is probably that GraalVM in version 20 still runs on the HotSpot VM and only adds the level 4 Graal compiler.
Recently, GraalVM 21.0 was released, with a JVM implementation named espresso which is fully written in Java (Java on Truffle). I'm curious to see this JVM's performance next to the GraalVM based on HotSpot VM. However, it seems like I have to wait a little for that, since the
current raw performance of Java on Truffle isn't representative of what it will be capable of in the near future. The peak performance is several times lower than running the same code in the usual JIT mode. The warmup also hasn't been optimized yet.  But the next minor releases promise to improve things there, step by step.
It also looks like
Java on Truffle can bring the JIT compiler and the dynamic Java runtime to an ahead-of-time compiled binary.  This makes me very excited about the future of this JVM's performance!
Runtime flags
Data: JVM Benchmark Results