Skip to content

fix: base64DecodeBytes unsigned byte values + improve JMH benchmark config#705

Draft
He-Pin wants to merge 1 commit intodatabricks:masterfrom
He-Pin:fix/base64-unsigned-jmh-config
Draft

fix: base64DecodeBytes unsigned byte values + improve JMH benchmark config#705
He-Pin wants to merge 1 commit intodatabricks:masterfrom
He-Pin:fix/base64-unsigned-jmh-config

Conversation

@He-Pin
Copy link
Copy Markdown
Contributor

@He-Pin He-Pin commented Apr 6, 2026

Motivation

std.base64DecodeBytes returns incorrect values for bytes >= 128. Java's byte type is signed (-128..127), so decoding base64 data containing high bytes (e.g., 0x80 = 128) produces negative numbers instead of the expected unsigned values (0-255). This violates the Jsonnet specification.

Additionally, the JMH RegressionBenchmark uses only 1 warmup and 1 measurement iteration, producing noisy results with wide confidence intervals that make it difficult to detect real performance changes.

Key Design Decision

  • Apply the standard Java idiom & 0xff to convert signed bytes to unsigned integers, consistent with the existing encodeUTF8 implementation in StringModule.scala
  • Increase JMH iterations to 3 warmup + 3 measurement for statistically meaningful results while keeping total runtime reasonable (~20 min for 35 benchmarks)

Modification

Bug Fix: EncodingModule.scala

  • Apply & 0xff mask when converting decoded bytes to Val.Num, ensuring values are in the unsigned range 0-255
  • Replace .map() with explicit while-loop for consistency with other encoding functions

Benchmark Config: RegressionBenchmark.scala

  • @Warmup(iterations = 3, time = 2) (was iterations = 1)
  • @Measurement(iterations = 3) (was iterations = 1)

Regression Test: base64DecodeBytes_unsigned.jsonnet

  • Tests boundary values: 0 (0x00), 128 (0x80), 255 (0xFF)
  • Tests round-trip: base64DecodeBytes(base64([0, 127, 128, 200, 255])) == [0, 127, 128, 200, 255]

Benchmark Results

JMH (3 iterations, back-to-back runs)

No performance impact — this is a correctness fix. The base64DecodeBytes benchmark is unchanged within CI:

Benchmark Before After Δ%
base64DecodeBytes 8.915 ± 1.362 ms/op 9.234 ± 1.372 ms/op +3.6% (within CI)
base64_byte_array 1.373 ± 0.148 ms/op 1.441 ± 0.137 ms/op +5.0% (within CI)
base64 0.760 ± 0.066 ms/op 0.752 ± 0.041 ms/op -1.1% (within CI)
base64Decode 0.571 ± 0.031 ms/op 0.571 ± 0.139 ms/op +0.0% (within CI)

All confidence intervals overlap — no statistically significant change.

Analysis

The bug manifests when decoding base64 data containing bytes in the range 128-255. For example:

  • std.base64DecodeBytes("gIA=") should return [128, 128] but returned [-128, -128]
  • std.base64DecodeBytes("/w==") should return [255] but returned [-1]

The fix is the standard Java unsigned byte conversion: byte & 0xff promotes the signed byte to an int with zero-extension, producing the correct unsigned value.

References

  • Upstream jit branch commits: b833428e, af4832f2
  • Consistent with existing pattern in StringModule.scala encodeUTF8: bytes(i) & 0xff

Result

  • ✅ All tests pass (11 suites)
  • ✅ No performance regression
  • ✅ Cross-model review: GPT-5.4 + Opus 4.5 — no blockers
  • ✅ New regression test covers boundary values and round-trip

…onfig

Fix std.base64DecodeBytes to correctly return unsigned byte values (0-255)
instead of signed Java byte values (-128..127). Java's byte type is signed,
so without masking with 0xff, bytes >= 128 (e.g., 0x80 = 128) would appear
as negative numbers (e.g., -128), violating the Jsonnet specification.

Also increase JMH RegressionBenchmark iterations from 1 to 3 for both
warmup and measurement phases, providing more stable and reliable benchmark
results with proper confidence intervals.

Changes:
- EncodingModule.scala: Apply `& 0xff` mask in base64DecodeBytes to convert
  signed Java bytes to unsigned integers, matching Jsonnet byte semantics
- RegressionBenchmark.scala: Increase warmup/measurement iterations to 3
- Add regression test: base64DecodeBytes_unsigned.jsonnet with boundary
  values (0, 127, 128, 255) and encode/decode round-trip verification

Upstream reference: jit branch commits b833428, af4832f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant