
perf: benchmark UnsafeUtils#775

Merged
anthony-swirldslabs merged 4 commits into main from 774-benchUnsafeUtils
Apr 7, 2026

Conversation

@anthony-swirldslabs
Contributor

@anthony-swirldslabs anthony-swirldslabs commented Apr 6, 2026

Description:
Benchmarking the new, Java 25-friendly UnsafeUtils introduced in #771, comparing them with the old, sun.misc.Unsafe-based implementations:

New results on 4/7:

Benchmark                                      (littleEndian)   Mode  Cnt     Score     Error   Units
UnsafeBench.getArrayByteNoChecks_New                      N/A  thrpt   15  9414.853 ±  74.711  ops/us
UnsafeBench.getArrayByteNoChecks_Old                      N/A  thrpt   15  9468.357 ±  20.356  ops/us
UnsafeBench.getDirectBufferByteNoChecks_New               N/A  thrpt   15  9458.915 ±  14.492  ops/us
UnsafeBench.getDirectBufferByteNoChecks_Old               N/A  thrpt   15  9462.491 ±  28.118  ops/us
UnsafeBench.getDirectBufferToArray_New                    N/A  thrpt   15  7169.200 ±  76.704  ops/ms
UnsafeBench.getDirectBufferToArray_Old                    N/A  thrpt   15  7079.295 ±  17.110  ops/ms
UnsafeBench.getDirectBufferToDirectBuffer_New             N/A  thrpt   15  7173.239 ±  11.251  ops/ms
UnsafeBench.getDirectBufferToDirectBuffer_Old             N/A  thrpt   15  5214.054 ±  15.676  ops/ms
UnsafeBench.getHeapBufferByteNoChecks_New                 N/A  thrpt   15  9207.595 ±  19.634  ops/us
UnsafeBench.getHeapBufferByteNoChecks_Old                 N/A  thrpt   15  9180.665 ±  28.933  ops/us
UnsafeBench.getHeapBufferToArray_New                      N/A  thrpt   15  5064.402 ±  17.589  ops/ms
UnsafeBench.getHeapBufferToArray_Old                      N/A  thrpt   15  5071.612 ±  11.497  ops/ms
UnsafeBench.getInt_New                                  false  thrpt   15  9014.501 ±  16.564  ops/us
UnsafeBench.getInt_New                                   true  thrpt   15  9085.640 ±  31.148  ops/us
UnsafeBench.getInt_Old                                  false  thrpt   15  9037.718 ±  26.492  ops/us
UnsafeBench.getInt_Old                                   true  thrpt   15  9093.505 ±  25.302  ops/us
UnsafeBench.getLong_New                                 false  thrpt   15  8859.154 ±  19.220  ops/us
UnsafeBench.getLong_New                                  true  thrpt   15  8931.391 ± 161.337  ops/us
UnsafeBench.getLong_Old                                 false  thrpt   15  9013.217 ±  86.528  ops/us
UnsafeBench.getLong_Old                                  true  thrpt   15  9154.096 ±  71.015  ops/us
UnsafeBench.putByteArrayToDirectBuffer_New                N/A  thrpt   15  7255.625 ±  51.167  ops/ms
UnsafeBench.putByteArrayToDirectBuffer_Old                N/A  thrpt   15  7267.593 ±  63.579  ops/ms

Old results on 4/6:

Benchmark                                      (littleEndian)   Mode  Cnt    Score     Error   Units
UnsafeBench.getArrayByteNoChecks_New                      N/A  thrpt    3  124.272 ±  16.153  ops/us
UnsafeBench.getArrayByteNoChecks_Old                      N/A  thrpt    3  124.710 ±   1.233  ops/us
UnsafeBench.getDirectBufferByteNoChecks_New               N/A  thrpt    3  119.038 ±  23.852  ops/us
UnsafeBench.getDirectBufferByteNoChecks_Old               N/A  thrpt    3  124.915 ±  11.187  ops/us
UnsafeBench.getDirectBufferToArray_New                    N/A  thrpt    3  178.670 ±  99.688  ops/ms
UnsafeBench.getDirectBufferToArray_Old                    N/A  thrpt    3  175.581 ±  99.743  ops/ms
UnsafeBench.getDirectBufferToDirectBuffer_New             N/A  thrpt    3  182.226 ±  99.270  ops/ms
UnsafeBench.getDirectBufferToDirectBuffer_Old             N/A  thrpt    3  172.294 ±  14.220  ops/ms
UnsafeBench.getHeapBufferByteNoChecks_New                 N/A  thrpt    3  120.113 ±  33.606  ops/us
UnsafeBench.getHeapBufferByteNoChecks_Old                 N/A  thrpt    3  120.560 ±  17.918  ops/us
UnsafeBench.getHeapBufferToArray_New                      N/A  thrpt    3  179.992 ± 136.978  ops/ms
UnsafeBench.getHeapBufferToArray_Old                      N/A  thrpt    3  173.067 ±  13.386  ops/ms
UnsafeBench.getInt_New                                  false  thrpt    3  125.294 ±   4.574  ops/us
UnsafeBench.getInt_New                                   true  thrpt    3  125.715 ±   4.893  ops/us
UnsafeBench.getInt_Old                                  false  thrpt    3  125.893 ±   2.500  ops/us
UnsafeBench.getInt_Old                                   true  thrpt    3  125.990 ±   1.283  ops/us
UnsafeBench.getLong_New                                 false  thrpt    3  126.270 ±   2.030  ops/us
UnsafeBench.getLong_New                                  true  thrpt    3  126.009 ±   7.585  ops/us
UnsafeBench.getLong_Old                                 false  thrpt    3  125.690 ±   1.211  ops/us
UnsafeBench.getLong_Old                                  true  thrpt    3  125.981 ±   4.034  ops/us
UnsafeBench.putByteArrayToDirectBuffer_New                N/A  thrpt    3  172.712 ±   5.654  ops/ms
UnsafeBench.putByteArrayToDirectBuffer_Old                N/A  thrpt    3  188.883 ±  31.014  ops/ms

Related issue(s):

Fixes #774

Notes for reviewer:
.

Checklist

  • Documented (Code comments, README, etc.)
  • Tested (unit, integration, etc.)

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
@anthony-swirldslabs anthony-swirldslabs self-assigned this Apr 6, 2026
@anthony-swirldslabs anthony-swirldslabs requested review from a team as code owners April 6, 2026 22:36
@github-actions

github-actions bot commented Apr 6, 2026

JUnit Test Report

   79 files  ±0     79 suites  ±0   3m 23s ⏱️ ±0s
1 354 tests ±0  1 350 ✅ ±0   4 💤 ±0  0 ❌ ±0 
7 236 runs  ±0  7 216 ✅ ±0  20 💤 ±0  0 ❌ ±0 

Results for commit b33625d. ± Comparison against base commit 81e2c58.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Apr 6, 2026

Integration Test Report

    419 files  ±0      419 suites  ±0   23m 54s ⏱️ ±0s
114 982 tests ±0  114 982 ✅ ±0  0 💤 ±0  0 ❌ ±0 
115 224 runs  ±0  115 224 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit b33625d. ± Comparison against base commit 81e2c58.

♻️ This comment has been updated with latest results.

Member

@jasperpotts jasperpotts left a comment


Looking at the high error bars, and knowing how hard it is to get these benchmarks to actually measure what you want, I have pasted below a benchmark I wrote. It is better (though not perfect) and might be helpful as an example.

  • Pre-compute random inputs in @Setup(Level.Trial) and index into them with a counter, keeping Random out of the hot loop
  • Return values from @Benchmark methods or explicitly blackhole.consume() them
  • Use @Param for controlled variation instead of random-per-invocation
  • Reserve Level.Invocation for genuinely minimal per-call setup
  • Use adequate fork/warmup/measurement counts (@Fork(2)+, @Warmup(iterations = 3)+, @Measurement(iterations = 5)+)
// SPDX-License-Identifier: Apache-2.0
package com.hedera.pbj.integration.jmh.varint;

import com.hedera.pbj.runtime.io.buffer.BufferedData;
import java.io.IOException;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

/**
 * Single varint read/write per invocation (no batching).
 * Measures per-call overhead precisely with @Param for byte size.
 */
@SuppressWarnings("unused")
@State(Scope.Benchmark)
@Fork(2)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public class SingleVarIntBench {

    @Param({"1", "2", "3", "4", "5", "8", "10"})
    public int numOfBytes;

    private long[] values;
    private byte[][] encodedValues;
    private int index;

    private static final int NUM_VALUES = 1024;

    private BufferedData writeBuffer;
    private BufferedData readBuffer;

    @Setup(Level.Trial)
    public void setup() throws IOException {
        Random random = new Random(9387498731984L);
        values = new long[NUM_VALUES];
        encodedValues = new byte[NUM_VALUES][];

        final long minValue;
        final long maxValue;
        switch (numOfBytes) {
            case 1 -> {
                minValue = 0L;
                maxValue = (1L << 7) - 1;
            }
            case 2 -> {
                minValue = 1L << 7;
                maxValue = (1L << 14) - 1;
            }
            case 3 -> {
                minValue = 1L << 14;
                maxValue = (1L << 21) - 1;
            }
            case 4 -> {
                minValue = 1L << 21;
                maxValue = (1L << 28) - 1;
            }
            case 5 -> {
                minValue = 1L << 28;
                maxValue = (1L << 35) - 1;
            }
            case 8 -> {
                minValue = 1L << 49;
                maxValue = (1L << 56) - 1;
            }
            case 10 -> {
                minValue = Long.MIN_VALUE;
                maxValue = -1L;
            } // negative values need 10 bytes
            default -> {
                minValue = 0L;
                maxValue = 127L;
            }
        }

        BufferedData tempBuf = BufferedData.allocate(16);
        for (int i = 0; i < NUM_VALUES; i++) {
            if (numOfBytes == 10) {
                values[i] = random.nextLong(Long.MIN_VALUE, 0);
            } else {
                values[i] = random.nextLong(minValue, maxValue + 1);
            }
            // Encode to get the byte representation for reading benchmarks
            tempBuf.reset();
            tempBuf.writeVarLong(values[i], false);
            int len = (int) tempBuf.position();
            encodedValues[i] = new byte[len];
            for (int j = 0; j < len; j++) {
                encodedValues[i][j] = tempBuf.getByte(j);
            }
        }

        writeBuffer = BufferedData.allocate(16);
        readBuffer = BufferedData.allocate(16);
    }

    @Setup(Level.Invocation)
    public void setupInvocation() {
        index = (index + 1) & (NUM_VALUES - 1);
        // Prepare read buffer with the next encoded value
        readBuffer.reset();
        byte[] encoded = encodedValues[index];
        for (byte b : encoded) {
            readBuffer.writeByte(b);
        }
        readBuffer.resetPosition();
    }

    @Benchmark
    public long readVarLong() throws IOException {
        return readBuffer.readVarLong(false);
    }

    @Benchmark
    public void writeVarLong(Blackhole bh) throws IOException {
        writeBuffer.reset();
        writeBuffer.writeVarLong(values[index], false);
        bh.consume(writeBuffer);
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(SingleVarIntBench.class.getSimpleName())
                .build();

        new Runner(opt).run();
    }
}

Claude's detailed analysis of this PR's benchmark

This benchmark has several critical flaws that mean it is not validly measuring what it claims to measure.


CRITICAL: Return values are never consumed — dead code elimination risk

Every single-value benchmark accepts a Blackhole blackhole parameter but never uses it. The return values from getArrayByteNoChecks, getInt, getLong, etc. are all silently discarded:

public void getArrayByteNoChecks_Old(final BenchState state, final Blackhole blackhole) {
    for (int i = 0; i < INVOCATIONS; i++) {
        OldUnsafeUtils.getArrayByteNoChecks(ARRAY, state.random.nextInt(ARRAY.length));
        // ^^^ return value thrown away, blackhole never touched
    }
}

The JIT is free to eliminate the entire call as dead code if it can prove the method is side-effect-free. Now, sun.misc.Unsafe.getByte() is treated as an intrinsic with memory semantics, so it probably survives DCE in the old implementation. But the new implementation (presumably using MemorySegment / FFM API) may get different treatment by the JIT — creating an asymmetric optimization between old and new that silently invalidates the comparison. This is the single most important fix: every read needs blackhole.consume(result).
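The consume-every-result pattern can be sketched without JMH scaffolding (all names below are illustrative, not the PR's code): fold each read into a sink the caller observes — the same role `blackhole.consume(result)` or a returned value plays in a JMH benchmark.

```java
import java.util.Random;

public class ConsumeReadsSketch {

    // Every read feeds the sink, so the JIT cannot prove the loop side-effect-free.
    // In JMH the equivalent is `blackhole.consume(result)` or returning the value.
    static long readLoopConsumed(byte[] array, int[] offsets) {
        long sink = 0;
        for (int off : offsets) {
            sink += array[off];
        }
        return sink; // observable result => survives dead-code elimination
    }

    public static void main(String[] args) {
        Random random = new Random(42); // seeded for reproducibility
        byte[] array = new byte[1 << 20];
        random.nextBytes(array);
        int[] offsets = new int[1024];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = random.nextInt(array.length);
        }
        // Same inputs => same checksum: the reads are real, not optimized away.
        System.out.println(readLoopConsumed(array, offsets) == readLoopConsumed(array, offsets));
    }
}
```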


CRITICAL: You're measuring Random.nextInt(), not memory access

Every single-value benchmark follows this pattern:

for (int i = 0; i < INVOCATIONS; i++) {
    OldUnsafeUtils.getArrayByteNoChecks(ARRAY, state.random.nextInt(ARRAY.length));
}

Random.nextInt() involves a CAS on an internal AtomicLong seed — it's significantly more expensive than a single unchecked memory read. This perfectly explains why all single-value benchmarks report ~124–126 ops/µs regardless of access type (array, heap buffer, direct buffer, int, long). You're seeing the throughput ceiling of java.util.Random, not of the unsafe memory operations. The actual memory access is lost in the noise.

If the goal is to vary the offset to prevent constant folding, a better approach is to use a pre-generated int[] of random offsets created in @Setup(Level.Trial) and index into it with a simple counter.
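That pattern might look like the sketch below (class and field names are illustrative): the offsets are drawn once from a seeded Random at setup time, and the hot path is reduced to an increment, a mask, and an array load.

```java
import java.util.Random;

public class PrecomputedOffsets {
    static final int NUM_OFFSETS = 1024; // power of two, so `& (N-1)` acts as modulo

    final int[] offsets = new int[NUM_OFFSETS];
    int index;

    // In JMH this fill would live in @Setup(Level.Trial), outside the measurement.
    PrecomputedOffsets(int bound, long seed) {
        Random random = new Random(seed); // seeded => reproducible offset stream
        for (int i = 0; i < NUM_OFFSETS; i++) {
            offsets[i] = random.nextInt(bound);
        }
    }

    // Hot-loop cost: one add, one AND, one array load. No CAS, no Random.
    int nextOffset() {
        index = (index + 1) & (NUM_OFFSETS - 1);
        return offsets[index];
    }

    public static void main(String[] args) {
        PrecomputedOffsets p = new PrecomputedOffsets(1 << 20, 12345L);
        boolean inRange = true;
        for (int i = 0; i < 4096; i++) { // counter wraps; the stream simply repeats
            int off = p.nextOffset();
            inRange &= (off >= 0 && off < (1 << 20));
        }
        System.out.println(inRange);
    }
}
```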


CRITICAL: @Setup(Level.Invocation) is doing ~8MB of work per invocation

The randomize() method fills and copies four 1MB buffers on every single invocation. The JMH docs have a big warning about Level.Invocation:

WARNING: This is the most dangerous level to use.

For the single-value benchmarks (measured in µs), the setup cost is on the same order of magnitude as the benchmark itself, which means JMH's timing compensation becomes unreliable. And critically, there's no reason to re-randomize the buffers every invocation — the content of the buffers doesn't matter for measuring access speed. A single @Setup(Level.Trial) randomization would be correct and not distort measurements.


Significant: @Fork(1) and @Warmup(iterations = 1)

One fork means no protection against JIT profile pollution between benchmarks. One warmup iteration may not be enough for the JIT to reach a steady state, especially for methods that use different intrinsic paths (Unsafe vs. FFM). This directly explains the enormous error bars on the bulk copy results (some at 76% of the score). Should be at minimum @Fork(3) and @Warmup(iterations = 3).


Moderate: Static mutable buffers shared across benchmarks

ARRAY, HEAP_BUFFER, DIRECT_BUFFER, DIRECT_BUFFER_2 are all static and mutated by the bulk copy benchmarks (getHeapBufferToArray writes to ARRAY, getDirectBufferToDirectBuffer writes to DIRECT_BUFFER_2). Since all benchmarks share these within a fork, cache state from one benchmark leaks into the next. These should be @State instance fields, not static.
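A minimal sketch of the hazard (illustrative names, not the PR's benchmark classes): when one "benchmark" mutates a static array that another reads, the reader's data — and cache footprint — silently depends on execution order within the fork.

```java
public class SharedStateHazard {
    static final byte[] ARRAY = new byte[16]; // shared and mutable: the problem

    // A read benchmark: its result depends on whatever ran before it.
    static long readChecksum() {
        long sum = 0;
        for (byte b : ARRAY) sum += b;
        return sum;
    }

    // A bulk-copy-style benchmark scribbling into the shared target.
    static void bulkCopyBenchmark() {
        java.util.Arrays.fill(ARRAY, (byte) 1);
    }

    public static void main(String[] args) {
        long before = readChecksum();
        bulkCopyBenchmark();
        long after = readChecksum();
        // The "read" benchmark now observes different data than before.
        System.out.println(before + " " + after);
    }
}
```

With @State instance fields, JMH constructs fresh state per trial instead, so no such order dependence can leak between benchmarks.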


Moderate: Bulk copy benchmarks have wildly variable work per invocation

final int length = state.random.nextInt(SIZE - offset);

The copy length is uniformly distributed from 0 to ~1MB. This means some invocations copy 0 bytes and some copy 1MB. The variance in work per invocation is enormous, which directly explains the massive error bars (±99 ops/ms on a score of ~175). If you want to benchmark representative PBJ usage, pick a fixed set of realistic copy sizes. If you want to benchmark across a range, use @Param with specific lengths.
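The fixed-length variant can be sketched as follows (SIZE, LENGTH, and the copy itself are illustrative stand-ins for the benchmark's buffers): every invocation moves the same number of bytes, so per-invocation work is constant and the error bars reflect only measurement noise.

```java
import java.util.Random;

public class FixedLengthCopy {
    static final int SIZE = 1 << 20; // 1 MB buffers, as in the benchmark
    static final int LENGTH = 4096;  // fixed, representative copy size

    public static void main(String[] args) {
        Random random = new Random(42);
        byte[] src = new byte[SIZE];
        byte[] dst = new byte[SIZE];
        random.nextBytes(src);

        // The offset may still vary per invocation, but the amount of work never does.
        int offset = random.nextInt(SIZE - LENGTH);
        System.arraycopy(src, offset, dst, 0, LENGTH);

        boolean match = true;
        for (int i = 0; i < LENGTH; i++) {
            match &= (dst[i] == src[offset + i]);
        }
        System.out.println(match + " " + LENGTH);
    }
}
```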


Summary of what the benchmark is actually measuring:

| Benchmark group | What it claims to measure | What it actually measures |
| --- | --- | --- |
| Single byte reads | Unsafe byte access throughput | Random.nextInt() throughput + possibly DCE'd reads |
| getInt / getLong | Unsafe multi-byte access | Random.nextInt() throughput + possibly DCE'd reads |
| Bulk copies | Memory copy throughput | Mixture of Random overhead, 0-to-1MB variable copies, and 8MB setup per invocation |

Recommended fixes:

  1. blackhole.consume() every return value
  2. Replace Random with a pre-computed offset array indexed by a counter, or use ThreadLocalRandom (cheaper) with results consumed
  3. Change @Setup(Level.Invocation) to @Setup(Level.Trial)
  4. @Fork(3), @Warmup(iterations = 5), @Measurement(iterations = 5)
  5. Move buffers from static to @State instance fields
  6. Use fixed or @Param-controlled copy lengths for bulk benchmarks

As written, I would not trust these results to validate the safety of the Unsafe → FFM migration in PBJ 0.15.0.

imalygin
imalygin previously approved these changes Apr 7, 2026
@anthony-swirldslabs
Contributor Author

@jasperpotts : Thanks for sharing the feedback. In your example benchmark in the above comment, I see you're also using @Setup(Level.Invocation), yet you keep scolding me for using it in every benchmark PR. :) Anyway, the root cause of that is cut'n'paste, and I filed #777 to fix it. Also, I don't see many particularly relevant things in your example.

In general, this particular functionality seems to be exceptionally difficult to benchmark. And you indeed pointed to a few aspects of the benchmark implementation that were incorrect, so I'm fixing them in the latest revision. Some overall comments regarding the feedback:

  1. The benchmark DOES NOT claim to measure Unsafe byte access throughput, Unsafe multi-byte access, or even memory copy throughput. That has never been the intent or purpose of this benchmark, as it is out of scope of the original fix that replaced the UnsafeUtils method implementations. The benchmark specifically tries to measure the performance of calls to the UnsafeUtils methods, comparing them with calls to the same methods in the old implementation. Anything else would change the performance profile of the implementation and is hence totally outside of the current fix at hand.
  2. The major difficulty with measuring the performance of these methods comes from the fact that some methods do very little. As an example, the getArrayByteNoChecks method reads a single byte from an array. Unless the compiler or JIT inlines the implementation, any code (application, PBJ, or this very benchmark) would spend far more cycles performing a call to the UnsafeUtils method than reading the actual byte. Further, even if this very call is eliminated (e.g. by inlining), the code would still pay for calling either the Unsafe method in the old implementation or the VarHandle method in the new implementation. These calls consume more CPU cycles than reading a single byte from a memory address in any case. This is one of the reasons why this particular benchmark doesn't claim what the feedback claims it claims :) There is, of course, room for optimization if we decide to move the core VarHandle (aka former Unsafe) calls directly into the code where we want to read one byte, and this could indeed improve the performance. However, doing this is totally outside the scope of the original fix, and therefore outside the scope of this particular benchmark.

Now comments about specific issues listed in the feedback:

  1. I indeed missed using the Blackhole. This is now fixed.
  2. Measuring Random.nextInt performance - this is a good point that I didn't fully realize previously. I eliminated usages of Random from the measurement methods.
  3. Level.Invocation is fixed as well - see a note at the very top of my reply.
  4. Iterations/forks - I increased those.
  5. Static mutable buffers - I see the point, and I even applied the recommendation. But I'm not fully convinced about this. One of the reasons is that the benchmark methods now have to access the buffers indirectly through a state reference rather than a single static field. This does add a few extra cycles to what the methods have to do, which, in the case of methods that otherwise just read a single byte, might in fact be significant. But I did it anyway.
  6. Bulk copies - I use a constant LENGTH now for all sub-array transfers.

I'll push a commit with changes and update the results in the description shortly.

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
Member

@jasperpotts jasperpotts left a comment


Thanks for fixing. For future reference on the point you said:

Static mutable buffers - I see the point, and I even applied the recommendation. But I'm not fully convinced about this. One of the reasons is that the benchmark methods now have to access them indirectly through a state reference rather than refer to a single static field - this does add a few extra cycles to what the methods have to do, which, in case of methods that otherwise just read a single byte, might in fact be significant. But I did it anyway.

You can use the test class itself as state, avoiding the state object altogether. Then it is just a local field lookup, which should be super cheap.

@anthony-swirldslabs anthony-swirldslabs merged commit 91572fb into main Apr 7, 2026
16 checks passed
@anthony-swirldslabs anthony-swirldslabs deleted the 774-benchUnsafeUtils branch April 7, 2026 21:35

Development

Successfully merging this pull request may close these issues.

Benchmark new UnsafeUtils
