
perf: benchmark UnsafeUtils#775

Merged
anthony-swirldslabs merged 4 commits into main from 774-benchUnsafeUtils
Apr 7, 2026

Conversation

@anthony-swirldslabs
Contributor

@anthony-swirldslabs anthony-swirldslabs commented Apr 6, 2026

Description:
Benchmarking the new, Java 25-friendly UnsafeUtils introduced in #771, comparing them with the old, sun.misc.Unsafe-based implementations:

New results on 4/7:

Benchmark                                      (littleEndian)   Mode  Cnt     Score     Error   Units
UnsafeBench.getArrayByteNoChecks_New                      N/A  thrpt   15  9414.853 ±  74.711  ops/us
UnsafeBench.getArrayByteNoChecks_Old                      N/A  thrpt   15  9468.357 ±  20.356  ops/us
UnsafeBench.getDirectBufferByteNoChecks_New               N/A  thrpt   15  9458.915 ±  14.492  ops/us
UnsafeBench.getDirectBufferByteNoChecks_Old               N/A  thrpt   15  9462.491 ±  28.118  ops/us
UnsafeBench.getDirectBufferToArray_New                    N/A  thrpt   15  7169.200 ±  76.704  ops/ms
UnsafeBench.getDirectBufferToArray_Old                    N/A  thrpt   15  7079.295 ±  17.110  ops/ms
UnsafeBench.getDirectBufferToDirectBuffer_New             N/A  thrpt   15  7173.239 ±  11.251  ops/ms
UnsafeBench.getDirectBufferToDirectBuffer_Old             N/A  thrpt   15  5214.054 ±  15.676  ops/ms
UnsafeBench.getHeapBufferByteNoChecks_New                 N/A  thrpt   15  9207.595 ±  19.634  ops/us
UnsafeBench.getHeapBufferByteNoChecks_Old                 N/A  thrpt   15  9180.665 ±  28.933  ops/us
UnsafeBench.getHeapBufferToArray_New                      N/A  thrpt   15  5064.402 ±  17.589  ops/ms
UnsafeBench.getHeapBufferToArray_Old                      N/A  thrpt   15  5071.612 ±  11.497  ops/ms
UnsafeBench.getInt_New                                  false  thrpt   15  9014.501 ±  16.564  ops/us
UnsafeBench.getInt_New                                   true  thrpt   15  9085.640 ±  31.148  ops/us
UnsafeBench.getInt_Old                                  false  thrpt   15  9037.718 ±  26.492  ops/us
UnsafeBench.getInt_Old                                   true  thrpt   15  9093.505 ±  25.302  ops/us
UnsafeBench.getLong_New                                 false  thrpt   15  8859.154 ±  19.220  ops/us
UnsafeBench.getLong_New                                  true  thrpt   15  8931.391 ± 161.337  ops/us
UnsafeBench.getLong_Old                                 false  thrpt   15  9013.217 ±  86.528  ops/us
UnsafeBench.getLong_Old                                  true  thrpt   15  9154.096 ±  71.015  ops/us
UnsafeBench.putByteArrayToDirectBuffer_New                N/A  thrpt   15  7255.625 ±  51.167  ops/ms
UnsafeBench.putByteArrayToDirectBuffer_Old                N/A  thrpt   15  7267.593 ±  63.579  ops/ms

Old results on 4/6:

Benchmark                                      (littleEndian)   Mode  Cnt    Score     Error   Units
UnsafeBench.getArrayByteNoChecks_New                      N/A  thrpt    3  124.272 ±  16.153  ops/us
UnsafeBench.getArrayByteNoChecks_Old                      N/A  thrpt    3  124.710 ±   1.233  ops/us
UnsafeBench.getDirectBufferByteNoChecks_New               N/A  thrpt    3  119.038 ±  23.852  ops/us
UnsafeBench.getDirectBufferByteNoChecks_Old               N/A  thrpt    3  124.915 ±  11.187  ops/us
UnsafeBench.getDirectBufferToArray_New                    N/A  thrpt    3  178.670 ±  99.688  ops/ms
UnsafeBench.getDirectBufferToArray_Old                    N/A  thrpt    3  175.581 ±  99.743  ops/ms
UnsafeBench.getDirectBufferToDirectBuffer_New             N/A  thrpt    3  182.226 ±  99.270  ops/ms
UnsafeBench.getDirectBufferToDirectBuffer_Old             N/A  thrpt    3  172.294 ±  14.220  ops/ms
UnsafeBench.getHeapBufferByteNoChecks_New                 N/A  thrpt    3  120.113 ±  33.606  ops/us
UnsafeBench.getHeapBufferByteNoChecks_Old                 N/A  thrpt    3  120.560 ±  17.918  ops/us
UnsafeBench.getHeapBufferToArray_New                      N/A  thrpt    3  179.992 ± 136.978  ops/ms
UnsafeBench.getHeapBufferToArray_Old                      N/A  thrpt    3  173.067 ±  13.386  ops/ms
UnsafeBench.getInt_New                                  false  thrpt    3  125.294 ±   4.574  ops/us
UnsafeBench.getInt_New                                   true  thrpt    3  125.715 ±   4.893  ops/us
UnsafeBench.getInt_Old                                  false  thrpt    3  125.893 ±   2.500  ops/us
UnsafeBench.getInt_Old                                   true  thrpt    3  125.990 ±   1.283  ops/us
UnsafeBench.getLong_New                                 false  thrpt    3  126.270 ±   2.030  ops/us
UnsafeBench.getLong_New                                  true  thrpt    3  126.009 ±   7.585  ops/us
UnsafeBench.getLong_Old                                 false  thrpt    3  125.690 ±   1.211  ops/us
UnsafeBench.getLong_Old                                  true  thrpt    3  125.981 ±   4.034  ops/us
UnsafeBench.putByteArrayToDirectBuffer_New                N/A  thrpt    3  172.712 ±   5.654  ops/ms
UnsafeBench.putByteArrayToDirectBuffer_Old                N/A  thrpt    3  188.883 ±  31.014  ops/ms

Related issue(s):

Fixes #774

Notes for reviewer:
.

Checklist

  • Documented (Code comments, README, etc.)
  • Tested (unit, integration, etc.)

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
@anthony-swirldslabs anthony-swirldslabs self-assigned this Apr 6, 2026
@anthony-swirldslabs anthony-swirldslabs requested review from a team as code owners April 6, 2026 22:36
@github-actions

github-actions bot commented Apr 6, 2026

JUnit Test Report

   79 files  ±0     79 suites  ±0   3m 23s ⏱️ ±0s
1 354 tests ±0  1 350 ✅ ±0   4 💤 ±0  0 ❌ ±0 
7 236 runs  ±0  7 216 ✅ ±0  20 💤 ±0  0 ❌ ±0 

Results for commit b33625d. ± Comparison against base commit 81e2c58.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Apr 6, 2026

Integration Test Report

    419 files  ±0      419 suites  ±0   23m 54s ⏱️ ±0s
114 982 tests ±0  114 982 ✅ ±0  0 💤 ±0  0 ❌ ±0 
115 224 runs  ±0  115 224 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit b33625d. ± Comparison against base commit 81e2c58.

♻️ This comment has been updated with latest results.

Member

@jasperpotts jasperpotts left a comment


Looking at the high error bars, and knowing how hard it is to get these benchmarks to actually measure what you want, I have pasted below a benchmark I wrote. It is better (though not perfect) and might be helpful as an example.

  • Pre-compute random inputs in @Setup(Level.Trial) and index into them with a counter, keeping Random out of the hot loop
  • Return values from @Benchmark methods or explicitly blackhole.consume() them
  • Use @Param for controlled variation instead of random-per-invocation
  • Reserve Level.Invocation for genuinely minimal per-call setup
  • Use adequate fork/warmup/measurement counts (@Fork(2)+, @Warmup(iterations = 3)+, @Measurement(iterations = 5)+)
// SPDX-License-Identifier: Apache-2.0
package com.hedera.pbj.integration.jmh.varint;

import com.hedera.pbj.runtime.io.buffer.BufferedData;
import java.io.IOException;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

/**
 * Single varint read/write per invocation (no batching).
 * Measures per-call overhead precisely with @Param for byte size.
 */
@SuppressWarnings("unused")
@State(Scope.Benchmark)
@Fork(2)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public class SingleVarIntBench {

    @Param({"1", "2", "3", "4", "5", "8", "10"})
    public int numOfBytes;

    private long[] values;
    private byte[][] encodedValues;
    private int index;

    private static final int NUM_VALUES = 1024;

    private BufferedData writeBuffer;
    private BufferedData readBuffer;

    @Setup(Level.Trial)
    public void setup() throws IOException {
        Random random = new Random(9387498731984L);
        values = new long[NUM_VALUES];
        encodedValues = new byte[NUM_VALUES][];

        final long minValue;
        final long maxValue;
        switch (numOfBytes) {
            case 1 -> {
                minValue = 0L;
                maxValue = (1L << 7) - 1;
            }
            case 2 -> {
                minValue = 1L << 7;
                maxValue = (1L << 14) - 1;
            }
            case 3 -> {
                minValue = 1L << 14;
                maxValue = (1L << 21) - 1;
            }
            case 4 -> {
                minValue = 1L << 21;
                maxValue = (1L << 28) - 1;
            }
            case 5 -> {
                minValue = 1L << 28;
                maxValue = (1L << 35) - 1;
            }
            case 8 -> {
                minValue = 1L << 49;
                maxValue = (1L << 56) - 1;
            }
            case 10 -> {
                minValue = Long.MIN_VALUE;
                maxValue = -1L;
            } // negative values need 10 bytes
            default -> {
                minValue = 0L;
                maxValue = 127L;
            }
        }

        BufferedData tempBuf = BufferedData.allocate(16);
        for (int i = 0; i < NUM_VALUES; i++) {
            if (numOfBytes == 10) {
                values[i] = random.nextLong(Long.MIN_VALUE, 0);
            } else {
                values[i] = random.nextLong(minValue, maxValue + 1);
            }
            // Encode to get the byte representation for reading benchmarks
            tempBuf.reset();
            tempBuf.writeVarLong(values[i], false);
            int len = (int) tempBuf.position();
            encodedValues[i] = new byte[len];
            for (int j = 0; j < len; j++) {
                encodedValues[i][j] = tempBuf.getByte(j);
            }
        }

        writeBuffer = BufferedData.allocate(16);
        readBuffer = BufferedData.allocate(16);
    }

    @Setup(Level.Invocation)
    public void setupInvocation() {
        index = (index + 1) & (NUM_VALUES - 1);
        // Prepare read buffer with the next encoded value
        readBuffer.reset();
        byte[] encoded = encodedValues[index];
        for (byte b : encoded) {
            readBuffer.writeByte(b);
        }
        readBuffer.resetPosition();
    }

    @Benchmark
    public long readVarLong() throws IOException {
        return readBuffer.readVarLong(false);
    }

    @Benchmark
    public void writeVarLong(Blackhole bh) throws IOException {
        writeBuffer.reset();
        writeBuffer.writeVarLong(values[index], false);
        bh.consume(writeBuffer);
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(SingleVarIntBench.class.getSimpleName())
                .build();

        new Runner(opt).run();
    }
}

Claude's detailed analysis of this PR's benchmark

This benchmark has several critical flaws that mean it is not validly measuring what it claims to measure.


CRITICAL: Return values are never consumed — dead code elimination risk

Every single-value benchmark accepts a Blackhole blackhole parameter but never uses it. The return values from getArrayByteNoChecks, getInt, getLong, etc. are all silently discarded:

public void getArrayByteNoChecks_Old(final BenchState state, final Blackhole blackhole) {
    for (int i = 0; i < INVOCATIONS; i++) {
        OldUnsafeUtils.getArrayByteNoChecks(ARRAY, state.random.nextInt(ARRAY.length));
        // ^^^ return value thrown away, blackhole never touched
    }
}

The JIT is free to eliminate the entire call as dead code if it can prove the method is side-effect-free. Now, sun.misc.Unsafe.getByte() is treated as an intrinsic with memory semantics, so it probably survives DCE in the old implementation. But the new implementation (presumably using MemorySegment / FFM API) may get different treatment by the JIT — creating an asymmetric optimization between old and new that silently invalidates the comparison. This is the single most important fix: every read needs blackhole.consume(result).
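The consume-every-result pattern can be sketched without JMH scaffolding (all names below are illustrative, not the PR's code): fold each read into a sink the caller observes — the same role `blackhole.consume(result)` or a returned value plays in a JMH benchmark.

```java
import java.util.Random;

public class ConsumeReadsSketch {

    // Every read feeds the sink, so the JIT cannot prove the loop side-effect-free.
    // In JMH the equivalent is `blackhole.consume(result)` or returning the value.
    static long readLoopConsumed(byte[] array, int[] offsets) {
        long sink = 0;
        for (int off : offsets) {
            sink += array[off];
        }
        return sink; // observable result => survives dead-code elimination
    }

    public static void main(String[] args) {
        Random random = new Random(42); // seeded for reproducibility
        byte[] array = new byte[1 << 20];
        random.nextBytes(array);
        int[] offsets = new int[1024];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = random.nextInt(array.length);
        }
        // Same inputs => same checksum: the reads are real, not optimized away.
        System.out.println(readLoopConsumed(array, offsets) == readLoopConsumed(array, offsets));
    }
}
```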


CRITICAL: You're measuring Random.nextInt(), not memory access

Every single-value benchmark follows this pattern:

for (int i = 0; i < INVOCATIONS; i++) {
    OldUnsafeUtils.getArrayByteNoChecks(ARRAY, state.random.nextInt(ARRAY.length));
}

Random.nextInt() involves a CAS on an internal AtomicLong seed — it's significantly more expensive than a single unchecked memory read. This perfectly explains why all single-value benchmarks report ~124–126 ops/µs regardless of access type (array, heap buffer, direct buffer, int, long). You're seeing the throughput ceiling of java.util.Random, not of the unsafe memory operations. The actual memory access is lost in the noise.

If the goal is to vary the offset to prevent constant folding, a better approach is to use a pre-generated int[] of random offsets created in @Setup(Level.Trial) and index into it with a simple counter.
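That pattern might look like the sketch below (class and field names are illustrative): the offsets are drawn once from a seeded Random at setup time, and the hot path is reduced to an increment, a mask, and an array load.

```java
import java.util.Random;

public class PrecomputedOffsets {
    static final int NUM_OFFSETS = 1024; // power of two, so `& (N-1)` acts as modulo

    final int[] offsets = new int[NUM_OFFSETS];
    int index;

    // In JMH this fill would live in @Setup(Level.Trial), outside the measurement.
    PrecomputedOffsets(int bound, long seed) {
        Random random = new Random(seed); // seeded => reproducible offset stream
        for (int i = 0; i < NUM_OFFSETS; i++) {
            offsets[i] = random.nextInt(bound);
        }
    }

    // Hot-loop cost: one add, one AND, one array load. No CAS, no Random.
    int nextOffset() {
        index = (index + 1) & (NUM_OFFSETS - 1);
        return offsets[index];
    }

    public static void main(String[] args) {
        PrecomputedOffsets p = new PrecomputedOffsets(1 << 20, 12345L);
        boolean inRange = true;
        for (int i = 0; i < 4096; i++) { // counter wraps; the stream simply repeats
            int off = p.nextOffset();
            inRange &= (off >= 0 && off < (1 << 20));
        }
        System.out.println(inRange);
    }
}
```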


CRITICAL: @Setup(Level.Invocation) is doing ~8MB of work per invocation

The randomize() method fills and copies four 1MB buffers on every single invocation. The JMH docs have a big warning about Level.Invocation:

WARNING: This is the most dangerous level to use.

For the single-value benchmarks (measured in µs), the setup cost is on the same order of magnitude as the benchmark itself, which means JMH's timing compensation becomes unreliable. And critically, there's no reason to re-randomize the buffers every invocation — the content of the buffers doesn't matter for measuring access speed. A single @Setup(Level.Trial) randomization would be correct and not distort measurements.


Significant: @Fork(1) and @Warmup(iterations = 1)

One fork means no protection against JIT profile pollution between benchmarks. One warmup iteration may not be enough for the JIT to reach a steady state, especially for methods that use different intrinsic paths (Unsafe vs. FFM). This directly explains the enormous error bars on the bulk copy results (some at 76% of the score). Should be at minimum @Fork(3) and @Warmup(iterations = 3).


Moderate: Static mutable buffers shared across benchmarks

ARRAY, HEAP_BUFFER, DIRECT_BUFFER, DIRECT_BUFFER_2 are all static and mutated by the bulk copy benchmarks (getHeapBufferToArray writes to ARRAY, getDirectBufferToDirectBuffer writes to DIRECT_BUFFER_2). Since all benchmarks share these within a fork, cache state from one benchmark leaks into the next. These should be @State instance fields, not static.
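A minimal sketch of the hazard (illustrative names, not the PR's benchmark classes): when one "benchmark" mutates a static array that another reads, the reader's data — and cache footprint — silently depends on execution order within the fork.

```java
public class SharedStateHazard {
    static final byte[] ARRAY = new byte[16]; // shared and mutable: the problem

    // A read benchmark: its result depends on whatever ran before it.
    static long readChecksum() {
        long sum = 0;
        for (byte b : ARRAY) sum += b;
        return sum;
    }

    // A bulk-copy-style benchmark scribbling into the shared target.
    static void bulkCopyBenchmark() {
        java.util.Arrays.fill(ARRAY, (byte) 1);
    }

    public static void main(String[] args) {
        long before = readChecksum();
        bulkCopyBenchmark();
        long after = readChecksum();
        // The "read" benchmark now observes different data than before.
        System.out.println(before + " " + after);
    }
}
```

With @State instance fields, JMH constructs fresh state per trial instead, so no such order dependence can leak between benchmarks.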


Moderate: Bulk copy benchmarks have wildly variable work per invocation

final int length = state.random.nextInt(SIZE - offset);

The copy length is uniformly distributed from 0 to ~1MB. This means some invocations copy 0 bytes and some copy 1MB. The variance in work per invocation is enormous, which directly explains the massive error bars (±99 ops/ms on a score of ~175). If you want to benchmark representative PBJ usage, pick a fixed set of realistic copy sizes. If you want to benchmark across a range, use @Param with specific lengths.
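The fixed-length variant can be sketched as follows (SIZE, LENGTH, and the copy itself are illustrative stand-ins for the benchmark's buffers): every invocation moves the same number of bytes, so per-invocation work is constant and the error bars reflect only measurement noise.

```java
import java.util.Random;

public class FixedLengthCopy {
    static final int SIZE = 1 << 20; // 1 MB buffers, as in the benchmark
    static final int LENGTH = 4096;  // fixed, representative copy size

    public static void main(String[] args) {
        Random random = new Random(42);
        byte[] src = new byte[SIZE];
        byte[] dst = new byte[SIZE];
        random.nextBytes(src);

        // The offset may still vary per invocation, but the amount of work never does.
        int offset = random.nextInt(SIZE - LENGTH);
        System.arraycopy(src, offset, dst, 0, LENGTH);

        boolean match = true;
        for (int i = 0; i < LENGTH; i++) {
            match &= (dst[i] == src[offset + i]);
        }
        System.out.println(match + " " + LENGTH);
    }
}
```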


Summary of what the benchmark is actually measuring:

| Benchmark group | What it claims to measure | What it actually measures |
| --- | --- | --- |
| Single byte reads | Unsafe byte access throughput | Random.nextInt() throughput + possibly DCE'd reads |
| getInt / getLong | Unsafe multi-byte access | Random.nextInt() throughput + possibly DCE'd reads |
| Bulk copies | Memory copy throughput | Mixture of Random overhead, 0-to-1MB variable copies, and 8MB setup per invocation |

Recommended fixes:

  1. blackhole.consume() every return value
  2. Replace Random with a pre-computed offset array indexed by a counter, or use ThreadLocalRandom (cheaper) with results consumed
  3. Change @Setup(Level.Invocation) to @Setup(Level.Trial)
  4. @Fork(3), @Warmup(iterations = 5), @Measurement(iterations = 5)
  5. Move buffers from static to @State instance fields
  6. Use fixed or @Param-controlled copy lengths for bulk benchmarks

As written, I would not trust these results to validate the safety of the Unsafe → FFM migration in PBJ 0.15.0.

imalygin
imalygin previously approved these changes Apr 7, 2026
@anthony-swirldslabs
Contributor Author

@jasperpotts : Thanks for sharing the feedback. In your example benchmark in the above comment, I see you're also using @Setup(Level.Invocation), yet you keep scolding me for using it in every benchmark PR. :) Anyway, the root cause of that is cut'n'paste, and I filed #777 to fix it. Also, I don't see many particularly relevant things in your example.

In general, this particular functionality seems to be exceptionally difficult to benchmark. And you indeed pointed to a few aspects of the benchmark implementation that were incorrect, so I'm fixing them in the latest revision. Some overall comments regarding the feedback:

  1. The benchmark DOES NOT claim to measure Unsafe byte access throughput, Unsafe multi-byte access, or even memory copy throughput. That has never been the intent or purpose of this benchmark, as it is out of scope of the original fix that replaced the UnsafeUtils method implementations. The benchmark specifically tries to measure the performance of calls to the UnsafeUtils methods, comparing them with calls to the same methods in the old implementation. Anything else would change the performance profile of the implementation and is hence totally outside of the current fix at hand.
  2. The major difficulty with measuring the performance of these methods comes from the fact that some methods do very little. As an example, the getArrayByteNoChecks method reads a single byte from an array. Unless the compiler or JIT inlines the implementation, any code (application, PBJ, or this very benchmark) would spend far more cycles performing a call to the UnsafeUtils method than reading the actual byte. Further, even if this very call is eliminated (e.g. by inlining), the code would still pay for calling either the Unsafe method in the old implementation or the VarHandle method in the new implementation. These calls consume more CPU cycles than reading a single byte from a memory address in any case. This is one of the reasons why this particular benchmark doesn't claim what the feedback claims it claims :) There is, of course, room for optimization if we decide to move the core VarHandle (aka former Unsafe) calls directly into the code where we want to read one byte, and this could indeed improve the performance. However, doing this is totally outside the scope of the original fix, and therefore outside the scope of this particular benchmark.

Now comments about specific issues listed in the feedback:

  1. I indeed missed using the Blackhole. This is now fixed.
  2. Measuring Random.nextInt performance - this is a good point that I didn't fully realize previously. I eliminated usages of Random from the measurement methods.
  3. Level.Invocation is fixed as well - see a note at the very top of my reply.
  4. Iterations/forks - I increased those.
  5. Static mutable buffers - I see the point, and I even applied the recommendation. But I'm not fully convinced about this. One of the reasons is that the benchmark methods now have to access the buffers indirectly through a state reference rather than a single static field. This does add a few extra cycles to what the methods have to do, which, in the case of methods that otherwise just read a single byte, might in fact be significant. But I did it anyway.
  6. Bulk copies - I use a constant LENGTH now for all sub-array transfers.

I'll push a commit with changes and update the results in the description shortly.

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
Member

@jasperpotts jasperpotts left a comment


Thanks for fixing. For future reference on the point you said:

Static mutable buffers - I see the point, and I even applied the recommendation. But I'm not fully convinced about this. One of the reasons is that the benchmark methods now have to access them indirectly through a state reference rather than refer to a single static field - this does add a few extra cycles to what the methods have to do, which, in case of methods that otherwise just read a single byte, might in fact be significant. But I did it anyway.

You can use the test class itself as state, avoiding the state object altogether. Then it is just a local field lookup, which should be super cheap.

@anthony-swirldslabs anthony-swirldslabs merged commit 91572fb into main Apr 7, 2026
16 checks passed
@anthony-swirldslabs anthony-swirldslabs deleted the 774-benchUnsafeUtils branch April 7, 2026 21:35

Development

Successfully merging this pull request may close these issues.

Benchmark new UnsafeUtils
