Skip to content

Latest commit

 

History

History
230 lines (173 loc) · 16.3 KB

File metadata and controls

230 lines (173 loc) · 16.3 KB

InferNode Performance Benchmarks

Test Platforms

Platform CPU Cores RAM OS
AMD64 Linux AMD Ryzen 7 H 255 16 21 GB Linux 6.14.0 (x86_64)
AMD64 Windows AMD Ryzen 7 255 16 28 GB Windows 11 Pro 10.0.26100
ARM64 macOS Apple M4 10 (4P+6E) 32 GB macOS 15.4 (Darwin 24.6.0)
ARM64 Linux ARM Cortex-A78AE 12 64 GB Linux 5.15.148-tegra (Jetson AGX Orin)

JIT Compiler Performance

Both AMD64 and ARM64 JIT compilers translate Dis VM bytecode to native machine code at module load time. The AMD64 JIT (comp-amd64.c) targets x86-64 with System V ABI on Linux/macOS and Windows x64 ABI on Windows. The ARM64 JIT (comp-arm64.c) targets ARMv8-A with AAPCS64 ABI. On macOS, JIT code buffers use mmap(MAP_JIT) with pthread_jit_write_protect_np() for W^X compliance; Linux uses mmap(MAP_ANON); Windows uses VirtualAlloc(PAGE_READWRITE) with VirtualProtect(PAGE_EXECUTE_READ) for W^X.

Cross-Platform Summary

Two benchmark suites measure JIT speedup: v1 (6 compute-intensive benchmarks) and v2 (26 benchmarks across 9 categories including function calls and list operations which have lower JIT gains).

Platform v1 Interp v1 JIT v1 Speedup v2 Interp v2 JIT v2 Speedup
AMD64 Linux 21,255 ms 1,500 ms 14.2x 1,504 ms 263 ms 5.7x
AMD64 Windows 19,115 ms 1,437 ms 13.3x 1,464 ms 261 ms 5.6x
ARM64 macOS 16,697 ms 1,735 ms 9.6x 1,086 ms 413 ms 2.6x
ARM64 Linux 38,320 ms 4,615 ms 8.3x 2,743 ms 938 ms 2.9x

The AMD64 JIT achieves the highest speedup ratios (14.2x on v1) due to efficient x86-64 instruction encoding for the Dis VM's register-based bytecode. AMD64 Windows matches Linux within 3% on JIT absolute performance (1,437 ms vs 1,500 ms on v1) despite different ABIs and W^X mechanisms, confirming the Windows x64 ABI adaptation (callee-saved RSI/RDI, shadow space) introduces no measurable overhead. In absolute terms, AMD64 and Apple M4 JIT performance are comparable (~1,500 ms vs 1,735 ms on v1) despite different architectures. The Jetson Cortex-A78AE is roughly 2.5x slower in absolute terms but achieves similar JIT-over-interpreter ratios, confirming the ARM64 JIT generates efficient code on both microarchitectures.

The v1-to-v2 speedup reduction reflects benchmark composition: v2 includes function calls (recursive Fibonacci, mutual recursion) where the JIT must still pay runtime overhead for frame allocation, type checking, and garbage collector interaction. v1 is dominated by tight loops where eliminating interpreter dispatch yields the greatest gains.

v1 Per-Benchmark Results (best-of-3)

Benchmark AMD64 JIT AMD64 Interp Speedup M4 JIT M4 Interp Speedup Jetson JIT Jetson Interp Speedup
Integer Arithmetic 23 ms 466 ms 20.3x 29 ms 354 ms 12.2x 119 ms 856 ms 7.2x
Array Access 1,131 ms 18,284 ms 16.2x 1,317 ms 14,496 ms 11.0x 3,666 ms 33,708 ms 9.2x
Function Calls 1 ms 20 ms 20.0x 1 ms 18 ms 18.0x 4 ms 38 ms 9.5x
Fibonacci 243 ms 777 ms 3.2x 220 ms 786 ms 3.6x 633 ms 1,671 ms 2.6x
Sieve 5 ms 74 ms 14.8x 5 ms 51 ms 10.2x 16 ms 136 ms 8.5x
Nested Loops 65 ms 1,152 ms 17.7x 54 ms 990 ms 18.3x 169 ms 1,911 ms 11.3x

Fibonacci shows the lowest speedup across all platforms (2.6-3.6x) because recursive function calls involve frame allocation, module pointer validation, and type checking at each call site — operations the JIT cannot eliminate.

v2 Category Aggregates (AMD64 Linux, best-of-3)

Category JIT (ms) Interp (ms) Speedup
Branch & Control 2 72 36.0x
Memory Access 5 111 22.2x
Integer ALU 10 139 13.9x
Mixed Workloads 25 382 15.3x
Byte Ops 7 88 12.6x
Type Conversions 6 51 8.5x
Big (64-bit) 18 135 7.5x
List Ops 4 23 5.8x
Function Calls 164 455 2.8x

Branch and control flow operations see the largest speedup (36x) because the interpreter's dispatch loop overhead is most pronounced for simple, fast instructions. Function calls remain the bottleneck (2.8x) due to non-eliminable runtime overhead.

Cross-Language Comparison

Six benchmarks (Integer Arithmetic, Array Access, Function Calls, Fibonacci, Sieve, Nested Loops) ported to C, Go, Java, Python, and Limbo with matched parameters and 64-bit integer types.

ARM64 macOS (Apple M4, best-of-3)

Benchmark C -O2 C -O0 Go Java Limbo JIT Limbo Interp Python
Integer Arithmetic 10 ms 44 ms 14 ms 11 ms 25 ms 279 ms 2,882 ms
Array Access 70 ms 567 ms 263 ms 252 ms 1,039 ms 10,208 ms 10,382 ms
Function Calls 0 ms 1 ms 0 ms 0 ms 1 ms 13 ms 60 ms
Fibonacci 0 ms 28 ms 16 ms 9 ms 210 ms 615 ms 554 ms
Sieve 1 ms 4 ms 2 ms 1 ms 5 ms 45 ms 24 ms
Nested Loops 0 ms 32 ms 15 ms 14 ms 51 ms 717 ms 1,136 ms
Total 81 ms 676 ms 310 ms 292 ms 1,331 ms 11,877 ms 15,038 ms

ARM64 Linux (Jetson AGX Orin, best-of-3)

No Java toolchain on this platform.

Benchmark C -O2 C -O0 Go Python 3.12 Limbo JIT Limbo Interp
Integer Arithmetic 39 ms 105 ms 40 ms 9,379 ms 122 ms 856 ms
Array Access 518 ms 2,892 ms 523 ms 39,996 ms 3,655 ms 33,259 ms
Function Calls 0 ms 3 ms 1 ms 163 ms 5 ms 38 ms
Fibonacci 26 ms 49 ms 36 ms 1,522 ms 627 ms 1,710 ms
Sieve 5 ms 13 ms 4 ms 74 ms 16 ms 136 ms
Nested Loops 0 ms 136 ms 32 ms 4,317 ms 169 ms 1,887 ms
Total 588 ms 3,198 ms 637 ms 55,451 ms 4,594 ms 37,886 ms

Relative Performance — Apple M4 (total time)

Contestant Total vs C -O2 vs C -O0
C -O2 81 ms 1.0x 8.3x faster
Java (HotSpot) 292 ms 3.6x slower 2.3x faster
Go 310 ms 3.8x slower 2.2x faster
C -O0 676 ms 8.3x slower 1.0x
Limbo JIT 1,331 ms 16.4x slower 2.0x slower
Limbo Interpreter 11,877 ms 147x slower 17.6x slower
Python 3.11 15,038 ms 186x slower 22.2x slower

Relative Performance — Jetson AGX Orin (total time)

Contestant Total vs C -O0
C -O2 588 ms 5.4x faster
Go 637 ms 5.0x faster
C -O0 3,198 ms 1.0x
Limbo JIT 4,594 ms 1.4x slower
Limbo Interpreter 37,886 ms 11.8x slower
Python 3.12 55,451 ms 17.3x slower

Limbo JIT reaches 69% of unoptimized C throughput on the Jetson — closer to native performance than on the M4, reflecting the Cortex-A78AE's narrower execution pipelines where the JIT's simpler code generation is less of a disadvantage.

Analysis

Limbo JIT vs native languages. On the M4, JIT-compiled Limbo is 16x slower than optimized C and 2x slower than unoptimized C. On the Jetson, the gap narrows to 1.4x slower than C -O0 — the simpler Cortex-A78AE pipelines penalize the JIT's unoptimized code less than the M4's wide out-of-order core. The remaining gap reflects fundamental Dis VM constraints: memory-to-memory architecture (no register file), garbage collector invariants, and mandatory bounds checking on every array access.

Limbo JIT vs managed languages. Java HotSpot (3.6x faster on M4) and Go (3.8x faster on M4, 5.0x on Jetson) outperform Limbo JIT. Both benefit from decades of optimization work, profile-guided compilation (Java), and register-allocated intermediate representations. The Dis JIT is a single-pass translator with no optimization passes.

Limbo JIT vs interpreter. The JIT provides an 8.9x speedup over the Dis interpreter on the M4 and 8.2x on the Jetson, consistent with the v1 benchmark results. This is the JIT's primary value proposition: making compute-bound Limbo code practical without rewriting in a native language.

Limbo vs Python. JIT-compiled Limbo is 11.3x faster than CPython 3.11 (M4) and 12.1x faster than CPython 3.12 (Jetson). Even the Dis interpreter matches or beats Python on array-heavy workloads where Python's per-element overhead dominates.

Where Limbo JIT excels. Integer arithmetic (25 ms vs 279 ms interpreter = 11x), nested loops (51 ms vs 717 ms = 14x), and sieve (5 ms vs 45 ms = 9x) show the strongest JIT gains — tight loops with simple operations where eliminating interpreter dispatch overhead matters most.

Where Limbo JIT struggles. Recursive Fibonacci (210 ms JIT vs 9 ms Java) highlights the cost of Dis frame allocation. Each recursive call allocates a new frame, checks module pointers, and validates types. Java's HotSpot inlines these calls; the Dis JIT cannot, because frame layout is determined at compile time by the Limbo compiler, not the JIT.

Go-on-Dis Benchmark Suite

16 benchmarks comparing Native Go, Go-on-Dis (via godis compiler), and hand-written Limbo across 5 execution modes: Native Go, Go-on-Dis JIT, Go-on-Dis Interpreter, Limbo JIT, and Limbo Interpreter. Each benchmark runs 1 warmup + 5 timed iterations; mean and stddev reported.

ARM64 Linux (Jetson AGX Orin, Cortex-A78AE, 64 GB, Linux 5.15.148-tegra)

Go 1.23.4 linux/arm64. Dis VM arena 512 MB.

Summary Table (mean ± stddev, ms)

Benchmark Native Go GoDis JIT GoDis Interp Limbo JIT Limbo Interp
fib 50 ± 11 547 ± 6 2666 ± 11 676 ± 5 1971 ± 12
sieve 19 ± 2 85 ± 3 863 ± 10 55 ± 7 415 ± 2
qsort 29 ± 3 171 ± 7 1537 ± 11 128 ± 1 755 ± 9
strcat 388 ± 12 OOM 117 ± 12 11 ± 0 25 ± 2
matrix 36 ± 4 245 ± 11 2278 ± 6 173 ± 10 1543 ± 2
channel 11 ± 1 6 ± 1 32 ± 6 4 ± 2 10 ± 2
nbody 10 ± 2 220 ± 3 590 ± 12 187 ± 6 377 ± 10
spawn 21 ± 1 374 ± 67 328 ± 59 351 ± 39 337 ± 92
bsearch 59 ± 4 269 ± 7 2250 ± 12 205 ± 1 998 ± 3
closure 34 ± 5 332 ± 9 1697 ± 10 40 ± 2 344 ± 6
interface 91 ± 7 152 ± 4 741 ± 6 344 ± 3 614 ± 6
map_ops 2 ± 1 43 ± 2 91 ± 8 10 ± 3 21 ± 1
binary_trees 187 ± 13 529 ± 7 1035 ± 8 472 ± 6 793 ± 46
spectral_norm 48 ± 4 720 ± 4 3894 ± 9 295 ± 6 1901 ± 3
fannkuch 68 ± 9 413 ± 10 4187 ± 3 274 ± 3 2515 ± 4
mandelbrot 71 ± 11 295 ± 3 2976 ± 10 258 ± 12 1771 ± 4

JIT Speedup (JIT time / Interpreter time, lower = JIT faster)

Benchmark GoDis JIT/Interp Limbo JIT/Interp
fib 0.21 0.34
sieve 0.10 0.13
qsort 0.11 0.17
matrix 0.11 0.11
channel 0.17 0.45
nbody 0.37 0.50
spawn 1.14 1.04
bsearch 0.12 0.21
closure 0.20 0.12
interface 0.21 0.56
map_ops 0.47 0.49
binary_trees 0.51 0.59
spectral_norm 0.19 0.16
fannkuch 0.10 0.11
mandelbrot 0.10 0.15

Speedup vs Go-on-Dis Interpreter (higher = faster)

Benchmark Native Go GoDis JIT Limbo JIT Limbo Interp
fib 53.7x 4.9x 3.9x 1.4x
sieve 45.9x 10.2x 15.7x 2.1x
qsort 52.3x 9.0x 12.0x 2.0x
matrix 62.6x 9.3x 13.2x 1.5x
channel 2.8x 5.7x 7.3x 3.3x
nbody 61.5x 2.7x 3.2x 1.6x
spawn 15.5x 0.9x 0.9x 1.0x
bsearch 38.1x 8.4x 11.0x 2.3x
closure 49.9x 5.1x 42.9x 4.9x
interface 8.2x 4.9x 2.2x 1.2x
map_ops 50.3x 2.1x 8.9x 4.4x
binary_trees 5.5x 2.0x 2.2x 1.3x
spectral_norm 81.1x 5.4x 13.2x 2.0x
fannkuch 61.8x 10.1x 15.3x 1.7x
mandelbrot 41.9x 10.1x 11.5x 1.7x

Analysis

Go-on-Dis JIT vs Interpreter. The JIT provides 2x-10x speedup on compute-bound benchmarks. Median JIT/Interp ratio is 0.17 (roughly 6x speedup). The strongest gains are on tight-loop benchmarks: sieve (0.10), fannkuch (0.10), mandelbrot (0.10), qsort (0.11), matrix (0.11). The weakest gains are on allocation-heavy workloads: binary_trees (0.51), map_ops (0.47), nbody (0.37). Spawn shows no JIT benefit (1.14) because it is dominated by thread scheduling overhead.

Go-on-Dis vs Limbo (same VM). Hand-written Limbo JIT consistently outperforms Go-on-Dis JIT by 1.1x-8.5x. The godis compiler translates Go SSA to Dis bytecode, but the generated code is less optimal than what the Limbo compiler produces: more temporaries, less register reuse, and no peephole optimization. The gap is smallest on interface dispatch (Go-on-Dis 152 ms vs Limbo 344 ms — Go-on-Dis is actually 2.3x faster here) because godis maps Go interfaces directly to Dis ADT dispatch, which is efficient. The closure benchmark shows a large gap (332 ms vs 40 ms = 8.3x) because godis uses a dispatch chain for dynamic closures rather than Limbo's native fn references.

Go-on-Dis vs Native Go. Native Go is 3x-82x faster than Go-on-Dis JIT (median ~10x). The gap is smallest on channel operations (11 ms native vs 6 ms Go-on-Dis JIT — the Dis VM's channel implementation is actually faster) and interface dispatch (91 ms vs 152 ms = 1.7x). The gap is largest on floating-point (spectral_norm: 48 ms vs 720 ms = 15x) and compute-intensive loops (fib: 50 ms vs 547 ms = 11x) where Native Go's register allocator and optimization passes dominate.

Channel operations. Notably, Go-on-Dis JIT (6 ms) beats Native Go (11 ms) on channel benchmarks. The Dis VM's channel implementation (from the Inferno kernel) uses lightweight thread scheduling that outperforms Go's goroutine scheduler for simple send/receive patterns. Limbo JIT is even faster (4 ms).

strcat OOM. Go-on-Dis JIT exhausted the 512 MB Dis arena on string concatenation. The godis compiler generates intermediate string allocations that the Dis GC cannot collect fast enough. Go-on-Dis interpreter (117 ms) and Limbo (11 ms JIT) handle this fine — Limbo's string implementation is more memory-efficient.

spawn (goroutine creation). All Dis modes cluster around 330-374 ms with high variance (±39-92 ms), far slower than Native Go (21 ms). Dis thread creation involves frame allocation, module loading, and scheduler enqueue — heavier than Go's goroutine spawn.

Methodology

  • Protocol: Best-of-N reported (N=3 or N=4 depending on suite) for v1/v2/cross-language. Mean ± stddev over 5 timed runs (1 warmup discarded) for Go-on-Dis suite. System idle during runs.
  • JIT benchmarks: appl/cmd/jitbench.b (v1, 6 benchmarks), appl/cmd/jitbench2.b (v2, 26 benchmarks). Run via emu -c0 (interpreter) and emu -c1 (JIT).
  • Cross-language: benchmarks/run-comparison.sh. Same algorithms with matched parameters and 64-bit integers. C compiled with cc (Apple Clang on macOS, GCC on Linux). Go, Java HotSpot (where available), CPython. Run on Apple M4 and Jetson AGX Orin.
  • Go-on-Dis: benchmarks/run.sh. 16 benchmarks in Go (compiled via godis), Limbo, and Native Go. 5 execution modes. Statistics computed per benchmark/mode.
  • Correctness: 181/181 JIT correctness tests pass on Linux and macOS; 216/216 on Windows. Benchmark result values match between JIT and interpreter on all platforms.
  • Variation: JIT run-to-run variance <5% on macOS and Windows, <1% on Linux. Interpreter variance <5% on all platforms. Windows system timer resolution (~15.6 ms) limits per-benchmark precision for fast tests; totals and longer benchmarks are unaffected.

Detailed Results

Per-platform breakdowns with all individual runs and v2 per-benchmark data:

Benchmark source code and runner scripts: benchmarks/.