Conversation
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
jasperpotts
left a comment
Looking at the high error bars, and knowing how hard it is to get these benchmarks to measure what you actually intend to measure, I have pasted a benchmark I wrote below. It is better (though not perfect) and might be helpful as an example.
- Pre-compute random inputs in @setup(Level.Trial) and index into them with a counter, keeping Random out of the hot loop
- Return values from @Benchmark methods or explicitly consume them with blackhole.consume()
- Use @param for controlled variation instead of random-per-invocation
- Reserve Level.Invocation for genuinely minimal per-call setup
- Use adequate fork/warmup/measurement counts (@Fork(2)+, @Warmup(iterations = 3)+, @Measurement(iterations = 5)+)
// SPDX-License-Identifier: Apache-2.0
package com.hedera.pbj.integration.jmh.varint;

import com.hedera.pbj.runtime.io.buffer.BufferedData;
import java.io.IOException;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

/**
 * Single varint read/write per invocation (no batching).
 * Measures per-call overhead precisely with @Param for byte size.
 */
@SuppressWarnings("unused")
@State(Scope.Benchmark)
@Fork(2)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public class SingleVarIntBench {

    private static final int NUM_VALUES = 1024;

    @Param({"1", "2", "3", "4", "5", "8", "10"})
    public int numOfBytes;

    private long[] values;
    private byte[][] encodedValues;
    private int index;
    private BufferedData writeBuffer;
    private BufferedData readBuffer;

    @Setup(Level.Trial)
    public void setup() throws IOException {
        Random random = new Random(9387498731984L);
        values = new long[NUM_VALUES];
        encodedValues = new byte[NUM_VALUES][];
        final long minValue;
        final long maxValue;
        switch (numOfBytes) {
            case 1 -> {
                minValue = 0L;
                maxValue = (1L << 7) - 1;
            }
            case 2 -> {
                minValue = 1L << 7;
                maxValue = (1L << 14) - 1;
            }
            case 3 -> {
                minValue = 1L << 14;
                maxValue = (1L << 21) - 1;
            }
            case 4 -> {
                minValue = 1L << 21;
                maxValue = (1L << 28) - 1;
            }
            case 5 -> {
                minValue = 1L << 28;
                maxValue = (1L << 35) - 1;
            }
            case 8 -> {
                minValue = 1L << 49;
                maxValue = (1L << 56) - 1;
            }
            case 10 -> {
                // negative values need 10 bytes
                minValue = Long.MIN_VALUE;
                maxValue = -1L;
            }
            default -> {
                minValue = 0L;
                maxValue = 127L;
            }
        }
        BufferedData tempBuf = BufferedData.allocate(16);
        for (int i = 0; i < NUM_VALUES; i++) {
            if (numOfBytes == 10) {
                values[i] = random.nextLong(Long.MIN_VALUE, 0);
            } else {
                values[i] = random.nextLong(minValue, maxValue + 1);
            }
            // Encode to get the byte representation for reading benchmarks
            tempBuf.reset();
            tempBuf.writeVarLong(values[i], false);
            int len = (int) tempBuf.position();
            encodedValues[i] = new byte[len];
            for (int j = 0; j < len; j++) {
                encodedValues[i][j] = tempBuf.getByte(j);
            }
        }
        writeBuffer = BufferedData.allocate(16);
        readBuffer = BufferedData.allocate(16);
    }

    @Setup(Level.Invocation)
    public void setupInvocation() {
        index = (index + 1) & (NUM_VALUES - 1);
        // Prepare read buffer with the next encoded value
        readBuffer.reset();
        byte[] encoded = encodedValues[index];
        for (byte b : encoded) {
            readBuffer.writeByte(b);
        }
        readBuffer.resetPosition();
    }

    @Benchmark
    public long readVarLong() throws IOException {
        return readBuffer.readVarLong(false);
    }

    @Benchmark
    public void writeVarLong(Blackhole bh) throws IOException {
        writeBuffer.reset();
        writeBuffer.writeVarLong(values[index], false);
        bh.consume(writeBuffer);
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(SingleVarIntBench.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}
Claude's detailed analysis of this PR's benchmark
This benchmark has several critical flaws that mean it is not validly measuring what it claims to measure.
CRITICAL: Return values are never consumed — dead code elimination risk
Every single-value benchmark accepts a Blackhole blackhole parameter but never uses it. The return values from getArrayByteNoChecks, getInt, getLong, etc. are all silently discarded:
public void getArrayByteNoChecks_Old(final BenchState state, final Blackhole blackhole) {
    for (int i = 0; i < INVOCATIONS; i++) {
        OldUnsafeUtils.getArrayByteNoChecks(ARRAY, state.random.nextInt(ARRAY.length));
        // ^^^ return value thrown away, blackhole never touched
    }
}
The JIT is free to eliminate the entire call as dead code if it can prove the method is side-effect-free. Now, sun.misc.Unsafe.getByte() is treated as an intrinsic with memory semantics, so it probably survives DCE in the old implementation. But the new implementation (presumably using MemorySegment / FFM API) may get different treatment by the JIT — creating an asymmetric optimization between old and new that silently invalidates the comparison. This is the single most important fix: every read needs blackhole.consume(result).
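A related way to make the reads DCE-proof, besides calling blackhole.consume() on each one, is to fold every result into an accumulator and return it from the @Benchmark method, since JMH implicitly blackholes returned values. A minimal plain-Java sketch of the pattern (the readByte stub below is a hypothetical stand-in for OldUnsafeUtils.getArrayByteNoChecks; all names are illustrative, not from the PR):

```java
public class ConsumeReadsSketch {
    static final int INVOCATIONS = 1024;
    static final byte[] ARRAY = new byte[1 << 20];

    // Hypothetical stand-in for OldUnsafeUtils.getArrayByteNoChecks
    static byte readByte(final byte[] array, final int offset) {
        return array[offset];
    }

    // Pattern for a JMH @Benchmark method body: every read feeds the
    // accumulator, and the accumulator is returned, so JMH blackholes it
    // and the JIT cannot prove the reads are dead code.
    static long readAll(final int[] offsets) {
        long acc = 0;
        for (int i = 0; i < INVOCATIONS; i++) {
            acc += readByte(ARRAY, offsets[i]);
        }
        return acc;
    }
}
```

In the real benchmark, readAll would be the @Benchmark method and offsets would come from trial-level setup.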
CRITICAL: You're measuring Random.nextInt(), not memory access
Every single-value benchmark follows this pattern:
for (int i = 0; i < INVOCATIONS; i++) {
    OldUnsafeUtils.getArrayByteNoChecks(ARRAY, state.random.nextInt(ARRAY.length));
}
Random.nextInt() involves a CAS on an internal AtomicLong seed — it's significantly more expensive than a single unchecked memory read. This perfectly explains why all single-value benchmarks report ~124–126 ops/µs regardless of access type (array, heap buffer, direct buffer, int, long). You're seeing the throughput ceiling of java.util.Random, not of the unsafe memory operations. The actual memory access is lost in the noise.
If the goal is to vary the offset to prevent constant folding, a better approach is to use a pre-generated int[] of random offsets created in @Setup(Level.Trial) and index into it with a simple counter.
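That counter-driven pattern can be sketched in a few lines. This is only an illustration of the technique, with hypothetical sizes and names; in a JMH benchmark the static initializer's work would live in a @Setup(Level.Trial) method:

```java
import java.util.Random;

public class OffsetPoolSketch {
    static final int BUFFER_SIZE = 1 << 20;  // size of the indexed data (hypothetical)
    static final int NUM_OFFSETS = 1024;     // power of two so a mask replaces modulo
    static final int[] OFFSETS = new int[NUM_OFFSETS];
    static int cursor;

    static {
        // Done once up front: Random never runs in the hot loop
        final Random random = new Random(42);
        for (int i = 0; i < NUM_OFFSETS; i++) {
            OFFSETS[i] = random.nextInt(BUFFER_SIZE);
        }
    }

    // Hot-loop cost is one array load plus a mask: far cheaper than
    // Random.nextInt(), yet unpredictable enough to block constant folding
    static int nextOffset() {
        return OFFSETS[cursor++ & (NUM_OFFSETS - 1)];
    }
}
```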
CRITICAL: @Setup(Level.Invocation) is doing ~8MB of work per invocation
The randomize() method fills and copies four 1MB buffers on every single invocation. The JMH docs have a big warning about Level.Invocation:
WARNING: This is the most dangerous level to use.
For the single-value benchmarks (measured in µs), the setup cost is on the same order of magnitude as the benchmark itself, which means JMH's timing compensation becomes unreliable. And critically, there's no reason to re-randomize the buffers every invocation — the content of the buffers doesn't matter for measuring access speed. A single @Setup(Level.Trial) randomization would be correct and not distort measurements.
Significant: @Fork(1) and @Warmup(iterations = 1)
One fork means no protection against JIT profile pollution between benchmarks. One warmup iteration may not be enough for the JIT to reach a steady state, especially for methods that use different intrinsic paths (Unsafe vs. FFM). This directly explains the enormous error bars on the bulk copy results (some at 76% of the score). Should be at minimum @Fork(3) and @Warmup(iterations = 3).
Moderate: Static mutable buffers shared across benchmarks
ARRAY, HEAP_BUFFER, DIRECT_BUFFER, DIRECT_BUFFER_2 are all static and mutated by the bulk copy benchmarks (getHeapBufferToArray writes to ARRAY, getDirectBufferToDirectBuffer writes to DIRECT_BUFFER_2). Since all benchmarks share these within a fork, cache state from one benchmark leaks into the next. These should be @State instance fields, not static.
Moderate: Bulk copy benchmarks have wildly variable work per invocation
final int length = state.random.nextInt(SIZE - offset);
The copy length is uniformly distributed from 0 to ~1MB. This means some invocations copy 0 bytes and some copy 1MB. The variance in work per invocation is enormous, which directly explains the massive error bars (±99 ops/ms on a score of ~175). If you want to benchmark representative PBJ usage, pick a fixed set of realistic copy sizes. If you want to benchmark across a range, use @Param with specific lengths.
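To make the contrast concrete, here is a sketch of driving bulk copies from a small fixed set of lengths; the sizes are hypothetical, not taken from the PR, and in JMH each length would be a @Param value so it gets its own score and error bar:

```java
public class FixedCopySketch {
    // In JMH this would be: @Param({"16", "256", "4096", "65536"}) public int copyLength;
    static final int[] COPY_LENGTHS = {16, 256, 4096, 65536}; // hypothetical sizes
    static final byte[] SRC = new byte[1 << 20];
    static final byte[] DST = new byte[1 << 20];

    // Every invocation copies exactly copyLength bytes, so per-invocation
    // work is constant and the reported variance reflects the copy itself,
    // not a 0-to-1MB lottery.
    static void copy(final int copyLength) {
        System.arraycopy(SRC, 0, DST, 0, copyLength);
    }
}
```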
Summary of what the benchmark is actually measuring:
| Benchmark group | What it claims to measure | What it actually measures |
|---|---|---|
| Single byte reads | Unsafe byte access throughput | Random.nextInt() throughput + possibly DCE'd reads |
| getInt / getLong | Unsafe multi-byte access | Random.nextInt() throughput + possibly DCE'd reads |
| Bulk copies | Memory copy throughput | Mixture of Random overhead, 0-to-1MB variable copies, and 8MB setup per invocation |
Recommended fixes:
- `blackhole.consume()` every return value
- Replace `Random` with a pre-computed offset array indexed by a counter, or use `ThreadLocalRandom` (cheaper) with results consumed
- Change `@Setup(Level.Invocation)` → `@Setup(Level.Trial)`
- `@Fork(3)`, `@Warmup(iterations = 5)`, `@Measurement(iterations = 5)`
- Move buffers from `static` to `@State` instance fields
- Use fixed or `@Param`-controlled copy lengths for bulk benchmarks
As written, I would not trust these results to validate the safety of the Unsafe → FFM migration in PBJ 0.15.0.
@jasperpotts: Thanks for sharing the feedback. In your example benchmark in the above comment, I see you're also using … In general, this particular functionality seems to be exceptionally difficult to benchmark. And you indeed pointed to a few aspects of the benchmark implementation that were incorrect, so I'm fixing them in the latest revision. Some overall comments regarding the feedback:
Now comments about specific issues listed in the feedback:
I'll push a commit with changes and update the results in the description shortly.
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
jasperpotts
left a comment
Thanks for fixing. For future reference on the point you said:
Static mutable buffers - I see the point, and I even applied the recommendation, but I'm not fully convinced. One reason is that the benchmark methods now have to access the buffers indirectly through a state reference rather than a single static field, which adds a few extra cycles to each call; for methods that otherwise just read a single byte, that might in fact be significant. But I did it anyway.
You can use the test class itself as state, avoiding the separate state object altogether. Then it is just a local field lookup, which should be super cheap.
Description:
Benchmarking the new, Java 25-friendly UnsafeUtils introduced in #771, comparing them with the old, sun.misc.Unsafe-based implementations:
New results on 4/7:
Old results on 4/6:
Related issue(s):
Fixes #774
Notes for reviewer:
.
Checklist