Reproducible benchmarks for backend programming languages.
Real work each language does vs the C baseline (= 1.00×), as the differential I(n₂)−I(n₁) that cancels startup + JIT. Lower is better (less work than C). Geomean across all 19 axes; green cells beat or tie C. Full method below.
Leaderboard (sorted by overall geomean)
| # | Language | Overall (vs C) | Fastest axis | Slowest axis |
|---|---|---|---|---|
| 1 | C (baseline) | 1.00× | — | — |
| 2 | Rust | 1.04× | message-ring 0.28× | k-nucleotide 2.73× |
| 3 | Go | 1.75× | binary-trees 1.09× | message-ring 11.5× |
| 4 | C# | 1.81× | binary-trees 0.45× | message-ring 35.3× |
| 5 | Swift | 2.41× | blur 0.56× | message-ring 52.2× |
| 6 | Scala | 2.82× | binary-trees 0.28× | message-ring 55.1× |
| 7 | Kotlin | 2.99× | binary-trees 0.28× | message-ring 59.6× |
| 8 | Elixir | 22.6× | binary-trees 0.30× | polymorphism 136× |
| 9 | PHP | 29.9× | binary-trees 5.75× | sha256 98.0× |
| 10 | Ruby | 70.0× | binary-trees 10.3× | sha256 278× |
| 11 | Python | 109× | binary-trees 11.2× | sha256 601× |
| 12 | Perl | 146× | binary-trees 19.0× | sha256 701× |
| 13 | COBOL | 461× | fannkuch 26.8× | sha256 223k×* |
* extrapolated from small probes (negligible-startup runtimes only), not measured full-size; still counted in the geomean.
How to read the matrix. It shows user-space instruction work vs C (a proxy for algorithmic efficiency), not wall-clock speed. It is a reliable rank for native and interpreter languages and ISA-robust for them; the JVM cells (Kotlin, Scala) are ISA-specific (their ratio roughly doubles from arm64 to x86_64). For concurrency read the scaling track and the concurrency study. The metric's empirical validity is measured in docs/metric-validity.md.
Lang Lab measures, rigorously and reproducibly, how much CPU work it costs to run the same algorithm across backend programming languages. It doesn't chase the realism of a web framework (that's TechEmpower) or the syntactic breadth of Rosetta Code. Its single differentiator is reproducibility and automatic maintenance.
⚠️ Not a human-language lab. Lang Lab measures programming languages (Python, Go, Rust, …), it doesn't teach you to conjugate verbs.
For each language we run the same benchmark implementation inside a version-pinned Docker image, count the CPU instructions executed (not wall-clock time) with a deterministic emulator, and publish the history, all on free CI, refreshed only when a language ships a new version.
A complementary scaling track additionally measures wall-clock parallel speedup (T1/TP) on
multiple cores for the parallelizable benchmarks (see
below).
Timing seconds on shared CI runners gives ±20-30 % noise: variable hardware, noisy
neighbours, unpredictable turbo. Instead we count the number of instructions executed,
which for single-threaded code is identical on every run regardless of machine load:
the same technique rustc-perf uses to track the
Rust compiler's performance on CI.
Honest trade-off. Instructions ≠ wall-clock time. This metric is excellent for comparing algorithmic efficiency and detecting regressions across versions, but it does not capture real latency, parallelism, memory bandwidth, or the per-instruction cost (which varies by ISA and microarchitecture). It is a measure of computational work, not absolute speed. For parallelism specifically, the complementary scaling track measures wall-clock multicore speedup. Empirically the metric predicts wall-clock order well and magnitude to roughly a 1.3× factor for compute-bound code (worse for JIT and syscall-bound code); its single most dependable use is regression detection within one language across versions. The full validity characterization is in docs/metric-validity.md.
The project measures two different things, with two different reproducibility contracts. Read one track's number through its own contract, never the other's.
| Track | Measures | Backend | Reproducibility contract |
|---|---|---|---|
| Instruction (headline) | user-space instruction work vs C (a proxy for algorithmic efficiency, not speed) | qemu-user + insn plugin |
deterministic given (ISA, qemu, plugin): bit-exact for native/interpreters, median-of-N for JIT/GC |
| Scaling | wall-clock parallel speedup T1/TP on 1/2/4 cores |
native, no qemu | a ratio (machine speed cancels), not bit-reproducible; stable to ±0.03 on shared CI |
The scaling track and the message-ring wall-clock companion exist precisely because the instruction
metric cannot see parallelism or syscall cost (its limits are quantified in
docs/metric-validity.md).
Chosen to cover every backend runtime archetype, not just the popular ones, so the methodology meets the hard runtimes early.
| Archetype | Languages |
|---|---|
| Native (no GC) | Rust, Swift, C (1.0× baseline), COBOL (GnuCOBOL→native) |
| Compiled + concurrent GC | Go |
| Interpreter | Python, Perl, PHP, Ruby |
| VM with JIT + GC | Kotlin (JVM), Scala (JVM), C# (CLR) |
| Actor VM (BEAM) | Elixir |
Measuring instructions uniformly and comparably across all those runtimes was the core problem of the project, and it took some work.
Backend: QEMU user-mode + the TCG insn plugin. It emulates at the instruction level and
counts the guest's instructions. Deterministic and, within one ISA, directly comparable.
(valgrind/cachegrind was rejected: it segfaults on Go and measures the wrong process for
launcher runtimes.)
The lesson. For a while it looked like qemu couldn't emulate the complex runtimes
(CPython, Perl, PHP, JVM, BEAM); they all exited silently. The real cause was a harness bug,
not emulation: qemu-user doesn't PATH-resolve a bare command name. Natives were invoked
by absolute path and worked; interpreters/VMs by bare command (python, java) and failed.
Resolving argv[0] to an absolute path fixed it, and all 12 measure cleanly. Elixir needed
one extra step (its elixir→erl→beam.smp shell wrappers aren't ELF binaries, so we capture
and run beam.smp directly). "No output" ≠ "can't emulate": verify the cause.
Rigor rules:
- Fixed ISA (arm64 locally, x86_64 in CI): qemu counts the guest's instructions; different ISAs aren't comparable.
- Metric = differential
I(n₂) − I(n₁), normalized to C = 1.0× (lower is better: less work than C): cancels startup and JIT, isolating the algorithm's real work. - Jittery runtimes (Go, C#, JVM): pinned to single-thread + GC off (
runtimeEnv) and reported as median of N; pure natives are bit-exact.
See the fannkuch study for the full results and methodology.
endoflife.date ──► version-watch (weekly) ──► PR bumping versions.lock.json
│ merge triggers…
▼
benchmark ──► version-pinned Docker image per language
└► qemu+insn → results/<date>.json (instruction track)
scaling ──► the same images, run natively (no qemu)
└► wall-clock T1/TP at 1/2/4 cores → results/scaling/ (scaling track)
Everything runs on free GitHub-hosted runners: no self-hosted runners, no own VM, no
security exposure. The result history lives versioned in results/.
Requires Docker.
scripts/bench-local.sh # every language × every benchmark
scripts/bench-local.sh rust # one language, all benchmarks
BENCH=binary-trees scripts/bench-local.sh rust # one language, one benchmarkEach run prints one JSON line with the differential and per-size medians.
A small suite, chosen so each benchmark stresses a different, orthogonal axis of the runtime. A language fast at one can be slow at another, which is exactly why one micro-benchmark isn't enough.
| Benchmark | Axis: what it measures | Study |
|---|---|---|
| fannkuch-redux | Integer compute: permutations & array reversals, no allocation | benchmarks/fannkuch |
| binary-trees | Allocation + GC churn: build & walk real heap nodes, no arenas | benchmarks/binary-trees |
| mandelbrot | Floating-point: a tight IEEE-754 double inner loop, almost no memory traffic |
benchmarks/mandelbrot |
| k-nucleotide | Hash maps: count k-mer frequencies of a DNA string via the std dictionary | benchmarks/k-nucleotide |
| reverse-complement | String/byte processing: reverse-complement a DNA buffer + hash it, char by char | benchmarks/reverse-complement |
| sort-search | Classic algorithms: hand-written quicksort + binary search over a mutable array | benchmarks/sort-search |
| dijkstra | Graphs: shortest paths with a hand-written binary-heap priority queue | benchmarks/dijkstra |
| blur | Image / 2D stencil: a 3×3 Gaussian convolution over a generated grayscale image | benchmarks/blur |
| k-means | Machine learning: integer Lloyd's clustering (nearest-centroid + floor-mean update) | benchmarks/k-means |
| sha256 | Bit manipulation / crypto: hand-written SHA-256, applied iteratively | benchmarks/sha256 |
| lz77 | Compression: hand-written LZ77 sliding-window longest-match search | benchmarks/lz77 |
| vm | Interpreter dispatch: a stack bytecode VM running a fixed program | benchmarks/vm |
| bigint | Multi-precision: hand-rolled base-2³² factorial with carry propagation | benchmarks/bigint |
| tak | Function-call / recursion overhead: naive triple-recursive Takeuchi, no memory | benchmarks/tak |
| polymorphism | Dynamic dispatch: megamorphic virtual calls resolved at runtime (K=6 types) | benchmarks/polymorphism |
| gemm | AI/ML — quantized int8 matrix multiply: the dominant tensor inference kernel, cache-pressure inner loop | benchmarks/gemm |
| viterbi | AI/ML — HMM/CRF sequence decoding: integer max-plus DP trellis + back-pointer trace | benchmarks/viterbi |
| gbdt | AI/ML — gradient-boosted tree ensemble inference: data-dependent branchy tree traversal | benchmarks/gbdt |
| message-ring | Concurrency overhead: per-handoff cost of a language's cooperative message-passing primitive (a 32-worker ring); N/A for Perl and COBOL | benchmarks/message-ring |
Every benchmark has a reference checksum that all implementations must reproduce bit for bit: proof that they all do exactly the same work.
Adding a benchmark is adding source files, not rewiring the harness. A
benchmarks/<name>/spec.json carries the two sizes + checksums; each image compiles every
benchmark via a glob, and a per-language run template in languages.json (with a {b}
placeholder) tells the driver how to launch it. Run one benchmark locally with
BENCH=binary-trees scripts/bench-local.sh <lang>.
The instruction track above is deliberately single-threaded: the qemu insn plugin sums guest
instructions across all cores, so it is blind to parallel speedup (a GIL-bound thread program
even looks like it scales). A separate, complementary scaling track answers the other
question: given a benchmark whose algorithm is parallelizable, how well does each language let
you use multiple cores?
It reports the wall-clock speedup T1/TP (time at 1 core over time at P cores) of the
compute region only, run natively (no qemu) at 1, 2 and 4 cores; higher is better (the
ideal is the core count). Taking a ratio cancels
machine-speed noise, so it stays stable on shared CI runners (validated to ±0.03). This track is
not bit-reproducible like the instruction track; it is reported as a ratio. The full fairness
rulebook (decomposition, partition, no-shared-write rules, per-language primitives) lives in
docs/scaling-track.md.
Five embarrassingly-parallel axes are measured across every language that has a concurrency primitive (all but COBOL), each using its idiomatic real-parallel primitive (pthreads, goroutines, fork/processes, BEAM Tasks, JVM/CLR thread pools):
| Benchmark | Speedup charts |
|---|---|
| gemm | vs cores · bars |
| mandelbrot | vs cores · bars |
| blur | vs cores · bars |
| k-means | vs cores · bars |
| gbdt | vs cores · bars |
Most native and VM runtimes land close to the ideal (near 4× on 4 cores). The GIL/GVL languages
(Python, Ruby) reach it via processes; the thread variant flatlining at ~1.0× is the GIL made
visible, not a bug. A dedicated CI workflow (.github/workflows/scaling.yml) reruns the track on
x86_64 and commits the results under results/scaling/.
For a combined, cross-track view of concurrency per language (primitive overhead + parallel scalability + the GIL/GVL line, with an explicit account of what is and is not measurable without dedicated hardware), see the concurrency study.
languages/<lang>/Dockerfile version-pinned image (compiles every benchmark)
languages/<lang>/<bench>.* benchmark implementation (fannkuch.*, binary-trees.*, …)
languages/_base/ shared qemu + insn-plugin base image
scripts/measure.sh measures and emits the JSON (inside each image)
scripts/measure-scaling.sh scaling track: wall-clock T1/TP, native, no qemu
scripts/bench-local.sh build + measure benchmarks × languages locally
scripts/check-versions.mjs version watcher (Node, no dependencies)
scripts/make_charts.py SVG chart generator (no dependencies)
scripts/make_scaling_charts.py scaling-track speedup chart generator
scripts/make_tables.py splices each study's results table into its README
scripts/make_matrix.py master language × benchmark matrix (the README hero)
scripts/make_ring_wall_chart.py message-ring wall-clock per-hop chart
languages.json registry: endoflife slug, build-arg, runtimeKind, runtimeEnv, run
versions.lock.json pinned version per language
scaling-config.json scaling track: par-run templates, per-class sizes, JIT warmup
benchmarks/<name>/spec.json sizes + reference checksums for that benchmark
benchmarks/<name>/README.md algorithm spec, fairness rules, study
languages/<lang>/<bench>-par.* parallel variant for the scaling track (5 axes)
results/ instruction-track result history (versioned in git)
results/scaling/ scaling-track results (wall-clock T1/TP per language)
docs/charts/ generated SVG charts (instruction + scaling)
docs/scaling-track.md scaling track: fairness rulebook
docs/concurrency-study.md cross-track concurrency study (per language)
docs/metric-validity.md empirical validity of the instruction metric (calibration + cross-ISA)
v0: 12 languages + a C baseline measured uniformly under qemu+insn across a nineteen-benchmark
suite (fannkuch, binary-trees, mandelbrot, k-nucleotide, reverse-complement, sort-search, dijkstra,
blur, k-means, sha256, lz77, vm, bigint, tak, polymorphism, gemm, viterbi, gbdt, message-ring: integer / allocation / floating-point / hash-map / string /
algorithms / graphs / image / ML / bit-manipulation / compression / interpreter-dispatch /
multi-precision / call-overhead / dynamic-dispatch / quantized-matmul / sequence-DP / tree-ensemble /
concurrency-overhead; message-ring is N/A for Perl and COBOL, which have no cooperative primitive),
the measurement engine characterized empirically, CI pipelines defined. A complementary wall-clock
scaling track adds multicore speedup (T1/TP) for the five parallelizable axes, and a
cross-track concurrency study (docs/concurrency-study.md) ties the two together. Local, in
development.
MIT.