Commit 57f04d6

The Profiler Performance Engineering (#26)
1 parent 83bb335 commit 57f04d6

6 files changed

Lines changed: 198 additions & 1 deletion

File tree

.gitignore

Lines changed: 2 additions & 1 deletion
```
@@ -1,3 +1,4 @@
 .hugo_build.lock
 public/
-resources/
+resources/
+.idea
```
Lines changed: 6 additions & 0 deletions
```
---
title: Diego Lovison
photo: ''
twitter: diegolovison
---
Senior Software Engineer - Application Services Performance Team
```
Lines changed: 190 additions & 0 deletions
---
title: "Profiler Performance Engineering: A Story of What It Does and Doesn't Show"
date: 2025-12-04T00:00:00Z
categories: ['performance', 'methodology']
summary: 'Learn how to properly benchmark an application'
image: 'logo.png'
related: ['']
authors:
- Diego Lovison
---

Modern software systems are increasingly complex, and ensuring their performance under real-world conditions is critical to delivering reliable and scalable applications. Traditional performance testing often focuses on high-level metrics such as average response time or throughput. While these metrics are useful, they can obscure deeper system inefficiencies and bottlenecks. To uncover these hidden issues, a more granular and methodical approach is required: one that examines the behavior of individual system components under stress.

This article introduces a performance engineering workflow that integrates profiling techniques with the https://www.brendangregg.com/usemethod.html[USE Method] (Utilization, Saturation, and Errors) to diagnose and resolve performance issues. By combining performance testing tools like Hyperfoil with low-level profilers such as https://github.com/async-profiler/async-profiler[async-profiler] and https://man7.org/linux/man-pages/man1/perf-stat.1.html[perf], we demonstrate how to identify CPU stalls, cache misses, and poor memory access patterns. Through a real-world benchmarking scenario, we show how profiling data can guide code optimizations and system configuration changes that lead to measurable improvements in Instructions Per Cycle (IPC) and overall system responsiveness.

== Software Development Life Cycle

A *software developer* implements features based on defined requirements, such as creating multiple endpoints to solve a specific business problem. Once development is complete, the *performance engineering* team gathers SLAs from stakeholders and designs performance tests that reflect real business use cases. These tests typically measure metrics like average response time. For each release that affects the business logic, the performance tests are rerun to detect any regressions. If a regression is found, the team receives feedback to address it.

There is nothing wrong with this approach, but we can go even further.

=== Personas

*Software Developer*: A professional who designs, builds, tests, and maintains software applications or systems.

*Performance Engineering*: Ensures that a software system meets performance requirements under expected workloads. This involves creating and maintaining performance tests, using tools like Hyperfoil and web-based scenarios, to simulate real-world usage. The results provide valuable feedback to the team. If the system passes these tests, the product is considered ready for General Availability (GA).

*Profiler Performance Engineering*: Analyzes performance test results by profiling the source code to uncover system bottlenecks. The process typically begins by identifying which resource (CPU, memory, disk, or network) the team has chosen to stress, guiding the analysis toward the root cause of any performance degradation.

=== Java Developer Persona Belt

* Software Developer: Eclipse IDE, IBM Semeru JDK
* Performance Engineering: Hyperfoil and a web-based application
* Profiler Performance Engineering: async-profiler, jfrconv, perf, sar

== The USE Method

According to Brendan Gregg, the **U**tilization, **S**aturation, and **E**rrors (USE) Method is a methodology for analyzing the performance of any system. It directs the construction of a checklist which, for server analysis, can be used to quickly identify resource bottlenecks or errors. It begins by posing questions and then seeking answers, instead of beginning with given metrics (partial answers) and trying to work backwards.

=== Terminology

* *resource*: all physical server functional components (CPUs, disks, busses, ...)
* *utilization*: the average time that the resource was busy servicing work
* *saturation*: the degree to which the resource has extra work which it can't service, often queued
* *errors*: the count of error events

The metrics are usually expressed in the following terms:

* *utilization*: as a percent over a time interval, e.g., "one disk is running at 90% utilization".
* *saturation*: as a queue length, e.g., "the CPUs have an average run queue length of four".
* *errors*: scalar counts, e.g., "this network interface has had fifty late collisions".

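In code, one row of such a checklist can be modeled as a small value type. This is an illustrative sketch only: the `UseEntry` and `needsAttention` names are invented here (and it assumes Java 16+ for records), not part of the USE Method itself.

```java
// Hypothetical sketch of one USE-checklist row; names are illustrative.
record UseEntry(String resource, double utilizationPct, int queueLength, long errorCount) {
    // A resource deserves attention if it is highly utilized,
    // saturated (work is queuing), or reporting errors.
    boolean needsAttention(double utilizationThresholdPct) {
        return utilizationPct >= utilizationThresholdPct
                || queueLength > 0
                || errorCount > 0;
    }
}

public class UseChecklist {
    public static void main(String[] args) {
        // Phrased like the examples above: "one disk is running at 90% utilization",
        // "the CPUs have an average run queue length of four".
        var disk = new UseEntry("disk", 90.0, 0, 0);
        var cpu = new UseEntry("cpu", 34.0, 4, 0);
        System.out.println(disk.needsAttention(80.0)); // flagged: high utilization
        System.out.println(cpu.needsAttention(80.0));  // flagged: saturated queue
    }
}
```

The point of the checklist is exactly this order of questions: utilization, then saturation, then errors, for every resource.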
== The benchmark

The benchmark scenario consists of three main components: a load generator, a System Under Test (SUT) running on an application server, and a database. The load generator initiates requests to various SUT endpoints, which in turn perform operations such as inserts, updates, and selects on the database. The benchmark is divided into two phases: warmup and steady state. During the warmup phase, the system gradually ramps up the number of requests over a defined period, aiming for a smooth transition into the steady state. This controlled increase helps stabilize the system before reaching the target injection rate for sustained testing.

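The warmup ramp described above can be sketched as a linear interpolation toward the target injection rate. This is an illustrative sketch, not Hyperfoil's actual API; the class and method names are invented here.

```java
public class WarmupRamp {
    // Linearly ramp the injection rate from 0 to targetRate over warmupSeconds,
    // then hold it constant during steady state.
    static double injectionRate(double targetRate, double warmupSeconds, double elapsedSeconds) {
        if (elapsedSeconds >= warmupSeconds) {
            return targetRate; // steady state: sustained target rate
        }
        return targetRate * (elapsedSeconds / warmupSeconds); // smooth ramp-up
    }

    public static void main(String[] args) {
        System.out.println(injectionRate(1000.0, 60.0, 30.0));  // halfway through warmup: 500.0
        System.out.println(injectionRate(1000.0, 60.0, 300.0)); // steady state: 1000.0
    }
}
```

A smooth ramp like this gives JIT compilation, connection pools, and caches time to settle before measurements begin.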

== Hands-on

In this example, we'll examine a real-world scenario that occurred while benchmarking an application.

=== SUT CPU analysis

We can start by looking at the `perf stat` output for the SUT's application PID. `perf stat` is a powerful Linux command-line tool that gives you high-level statistics about a program's performance. It uses hardware performance counters in your CPU to measure things like instructions executed, CPU cycles, cache misses, and branch mispredictions. It's a great first step to understanding your application's behavior at a low level.

The result for the performance testing is the following.

```
Performance counter stats for process id '3724985':

    3,707,235.65 msec task-clock    # 5.296 CPUs utilized
```

This metric indicates that 5.2 CPU cores are being utilized. For this test, we have a constraint of only 15 CPU cores available. Therefore, 5.2 ÷ 15 equals approximately 34%, meaning the CPU is busy 34% of the time. This suggests that the SUT is not a highly utilized resource. Since the SUT CPU utilization was only at 34%, we had plenty of headroom to experiment. We decided to drive a higher load by doubling the injection rate of the loader to observe the impact. However, this is not guaranteed to succeed, as other bottlenecks might limit the outcome. In our case, the loader could sustain the 2x increase in injection rate, and the new `perf stat` output for the SUT is:

```
Performance counter stats for process id '3854032':

        8,798,820.34 msec task-clock          # 12.570 CPUs utilized
          72,884,739      context-switches    # 8.283 K/sec
          30,504,245      cpu-migrations      # 3.467 K/sec
             192,855      page-faults         # 21.918 /sec
  20,804,314,839,811      cycles              # 2.364 GHz
  13,372,592,699,670      instructions        # 0.64 insn per cycle
   2,535,826,054,792      branches            # 288.201 M/sec
     148,654,216,727      branch-misses       # 5.86% of all branches

       700.009883946 seconds time elapsed
```

Our CPU resource showed 83% utilization over a 700-second interval, with the same error count observed when CPU utilization was previously at 34%. This metric provides a strong starting point for deeper investigation. The next metric that we can analyze is Instructions Per Cycle (IPC). The current value is 0.64, which is low: it means that the CPU is stalled and not computing.

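Both headline numbers can be re-derived from the raw counters above. The helper names in this sketch are invented for illustration; the counter values are copied from the `perf stat` output.

```java
public class PerfStatMath {
    // Utilization: CPUs used by the process divided by the CPUs available to it.
    static double utilizationPct(double cpusUtilized, double cpusAvailable) {
        return 100.0 * cpusUtilized / cpusAvailable;
    }

    // IPC: retired instructions divided by CPU cycles. Well below 1.0 on a
    // modern superscalar core usually means the pipeline is stalling on memory,
    // not doing useful work.
    static double ipc(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        // Counters copied from the second `perf stat` run above.
        long instructions = 13_372_592_699_670L;
        long cycles = 20_804_314_839_811L;
        System.out.printf("utilization = %.1f%%%n", utilizationPct(12.570, 15.0)); // 83.8%
        System.out.printf("IPC = %.2f%n", ipc(instructions, cycles));              // 0.64
    }
}
```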

The CPU is https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html[stalled], not "stressed." It's spending most of its time waiting for data instead of doing useful work. To investigate the root cause, we can use async-profiler and monitor the cache-miss event. The output of the profiler can be a flame graph like the following.

image::cache-miss-event.png[Interaction]

==== The Analysis: Visualizing the Bottleneck

Upon analyzing the `cache-misses` flame graph, a clear pattern emerges. The function triggering the most CPU stalls is `java.util.HashMap$KeyIterator.next`.

When we correlate this visual data with the source code, we see that the application is iterating over a large `HashSet` (which uses a `HashMap` internally). While this appears to be a standard operation in Java, the hardware reality is quite different.

==== "Pointer Chasing"

The performance penalty here stems from **memory layout** and **data dependency**.

Iterating over a `HashMap` is effectively a **pointer chasing** operation. Unlike an `ArrayList`, where elements are stored contiguously in memory (allowing the CPU to calculate the next address mathematically), a `HashMap` stores data in scattered nodes. To move from one element to the next, the CPU must:

1. **Fetch the current node** from memory.
2. **Read the pointer** inside that node to find the address of the next node.
3. **Fetch the next node** from that new address.

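The three steps above can be made concrete with a toy comparison (illustrative only, not the benchmarked application's actual code): summing a contiguous `int[]` involves no pointer dereferences between elements, while iterating a `HashSet` walks node objects scattered across the heap.

```java
import java.util.HashSet;
import java.util.Set;

public class PointerChasingDemo {
    // Contiguous layout: the address of element i+1 is computable from element i,
    // so the hardware prefetcher can stream the array through the cache.
    static long sumArray(int[] values) {
        long sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Scattered layout: each HashMap node must be loaded before the iterator
    // even knows where the next node lives -- the data dependency described above.
    static long sumSet(Set<Integer> values) {
        long sum = 0;
        for (int v : values) sum += v; // each step chases a pointer (and unboxes)
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] array = new int[n];
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) { array[i] = i; set.add(i); }
        // Same answer, very different memory-access pattern under the hood.
        System.out.println(sumArray(array) == sumSet(set)); // true
    }
}
```

Under a `cache-misses` profile, the `sumSet` loop is the kind of code that shows up wide, even though it "does" no more arithmetic than the array loop.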
==== The Consequence: Breaking Instruction-Level Parallelism

This structure creates a critical **data dependency**: the CPU *cannot* know *where* to look for the next item until the current item has been fully retrieved.

This dependency chain breaks the CPU's ability to use **Instruction-Level Parallelism (ILP)**. Modern CPUs attempt to "speculatively load" data by fetching future elements while processing current ones. However, because the memory addresses in a `HashMap` are effectively random and dependent on the previous node, the hardware prefetcher is rendered useless.

Instead of processing data in parallel, the CPU pipeline is forced to **stall** completely between every element, waiting hundreds of cycles for data to arrive from main RAM. This explains the high cache-miss rate and the low IPC (Instructions Per Cycle) observed in our `perf` stats.


==== Validating the Hypothesis: Correlating Cycles with Misses

However, observing a high number of cache misses is not enough to declare a verdict. A function could generate millions of cache misses but execute so rarely that it impacts only 1% of the total runtime. To be certain this is our bottleneck, we must correlate the misses (the event) with the cycles (the time).

image::cycles.png[Interaction]

We compared the `cache-misses` flame graph against the `cycles` flame graph, and the correlation was:

* In the cycles profile: the `HashMap` iteration, `Integer.intValue`, and `String.equals` logic dominated the graph, accounting for the vast majority of CPU time.
* In the cache-misses profile: these exact same methods appeared as the widest bars.

This overlap confirms the diagnosis: the application is not just "busy"; it is stalled. The high CPU usage seen in our original `top` output wasn't due to complex calculation, but rather the CPU burning cycles while waiting for data to arrive from memory to satisfy the `HashMap`'s pointer-chasing requirements.

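The correlation step can be automated in spirit: given per-method sample counts from the two profiles, check whether the methods that dominate cycles also dominate cache misses. This is an illustrative sketch with invented names and hypothetical sample counts; flame-graph tools don't expose exactly this API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ProfileCorrelation {
    // Return the top-k method names by sample count in a profile.
    static List<String> topMethods(Map<String, Long> samples, int k) {
        return samples.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        // Hypothetical counts, in the spirit of the two profiles discussed above.
        Map<String, Long> cycles = Map.of(
                "HashMap$KeyIterator.next", 9_000L,
                "Integer.intValue", 4_000L,
                "String.equals", 3_500L,
                "Other.work", 500L);
        Map<String, Long> cacheMisses = Map.of(
                "HashMap$KeyIterator.next", 12_000L,
                "Integer.intValue", 2_000L,
                "String.equals", 1_800L,
                "Other.work", 100L);
        // If the same methods top both profiles, the hot code is hot *because* it stalls.
        boolean correlated = topMethods(cycles, 3).containsAll(topMethods(cacheMisses, 3));
        System.out.println(correlated); // true
    }
}
```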
== Acknowledgments

All this work was guided and supervised by Francesco Nigro.