Lang Lab

Reproducible benchmarks for backend programming languages.

Real work each language does vs the C baseline (= 1.00×), as the differential I(n₂)−I(n₁) that cancels startup + JIT. Lower is better (less work than C). Geomean across all 19 axes; green cells beat or tie C. Full method below.

Leaderboard (sorted by overall geomean)

#	Language	Overall (vs C)	Fastest axis	Slowest axis
1	C (baseline)	1.00×	—	—
2	Rust	1.04×	message-ring 0.28×	k-nucleotide 2.73×
3	Go	1.75×	binary-trees 1.09×	message-ring 11.5×
4	C#	1.81×	binary-trees 0.45×	message-ring 35.3×
5	Swift	2.41×	blur 0.56×	message-ring 52.2×
6	Scala	2.82×	binary-trees 0.28×	message-ring 55.1×
7	Kotlin	2.99×	binary-trees 0.28×	message-ring 59.6×
8	Elixir	22.6×	binary-trees 0.30×	polymorphism 136×
9	PHP	29.9×	binary-trees 5.75×	sha256 98.0×
10	Ruby	70.0×	binary-trees 10.3×	sha256 278×
11	Python	109×	binary-trees 11.2×	sha256 601×
12	Perl	146×	binary-trees 19.0×	sha256 701×
13	COBOL	461×	fannkuch 26.8×	sha256 223k×*

* extrapolated from small probes (negligible-startup runtimes only), not measured full-size; still counted in the geomean.

How to read the matrix. It shows user-space instruction work vs C (a proxy for algorithmic efficiency), not wall-clock speed. It is a reliable rank for native and interpreter languages and ISA-robust for them; the JVM cells (Kotlin, Scala) are ISA-specific (their ratio roughly doubles from arm64 to x86_64). For concurrency read the scaling track and the concurrency study. The metric's empirical validity is measured in docs/metric-validity.md.

Lang Lab measures, rigorously and reproducibly, how much CPU work it costs to run the same algorithm across backend programming languages. It doesn't chase the realism of a web framework (that's TechEmpower) or the syntactic breadth of Rosetta Code. Its single differentiator is reproducibility and automatic maintenance.

⚠️ Not a human-language lab. Lang Lab measures programming languages (Python, Go, Rust, …), it doesn't teach you to conjugate verbs.

The idea in one sentence

For each language we run the same benchmark implementation inside a version-pinned Docker image, count the CPU instructions executed (not wall-clock time) with a deterministic emulator, and publish the history, all on free CI, refreshed only when a language ships a new version.

A complementary scaling track additionally measures wall-clock parallel speedup (T1/TP) on multiple cores for the parallelizable benchmarks (see below).

Why instructions, not seconds

Timing seconds on shared CI runners gives ±20-30 % noise: variable hardware, noisy neighbours, unpredictable turbo. Instead we count the number of instructions executed, which for single-threaded code is identical on every run regardless of machine load: the same technique rustc-perf uses to track the Rust compiler's performance on CI.

Honest trade-off. Instructions ≠ wall-clock time. This metric is excellent for comparing algorithmic efficiency and detecting regressions across versions, but it does not capture real latency, parallelism, memory bandwidth, or the per-instruction cost (which varies by ISA and microarchitecture). It is a measure of computational work, not absolute speed. For parallelism specifically, the complementary scaling track measures wall-clock multicore speedup. Empirically the metric predicts wall-clock order well and magnitude to roughly a 1.3× factor for compute-bound code (worse for JIT and syscall-bound code); its single most dependable use is regression detection within one language across versions. The full validity characterization is in docs/metric-validity.md.

Two tracks, two contracts

The project measures two different things, with two different reproducibility contracts. Read one track's number through its own contract, never the other's.

Track	Measures	Backend	Reproducibility contract
Instruction (headline)	user-space instruction work vs C (a proxy for algorithmic efficiency, not speed)	qemu-user + `insn` plugin	deterministic given (ISA, qemu, plugin): bit-exact for native/interpreters, median-of-N for JIT/GC
Scaling	wall-clock parallel speedup `T1/TP` on 1/2/4 cores	native, no qemu	a ratio (machine speed cancels), not bit-reproducible; stable to ±0.03 on shared CI

The scaling track and the message-ring wall-clock companion exist precisely because the instruction metric cannot see parallelism or syscall cost (its limits are quantified in docs/metric-validity.md).

Languages (12 + a C baseline)

Chosen to cover every backend runtime archetype, not just the popular ones, so the methodology meets the hard runtimes early.

Archetype	Languages
Native (no GC)	Rust, Swift, C (1.0× baseline), COBOL (GnuCOBOL→native)
Compiled + concurrent GC	Go
Interpreter	Python, Perl, PHP, Ruby
VM with JIT + GC	Kotlin (JVM), Scala (JVM), C# (CLR)
Actor VM (BEAM)	Elixir

The measurement engine

Measuring instructions uniformly and comparably across all those runtimes was the core problem of the project, and it took some work.

Backend: QEMU user-mode + the TCG insn plugin. It emulates at the instruction level and counts the guest's instructions. Deterministic and, within one ISA, directly comparable. (valgrind/cachegrind was rejected: it segfaults on Go and measures the wrong process for launcher runtimes.)

The lesson. For a while it looked like qemu couldn't emulate the complex runtimes (CPython, Perl, PHP, JVM, BEAM); they all exited silently. The real cause was a harness bug, not emulation: qemu-user doesn't PATH-resolve a bare command name. Natives were invoked by absolute path and worked; interpreters/VMs by bare command (python, java) and failed. Resolving argv[0] to an absolute path fixed it, and all 12 measure cleanly. Elixir needed one extra step (its elixir→erl→beam.smp shell wrappers aren't ELF binaries, so we capture and run beam.smp directly). "No output" ≠ "can't emulate": verify the cause.

Rigor rules:

Fixed ISA (arm64 locally, x86_64 in CI): qemu counts the guest's instructions; different ISAs aren't comparable.
Metric = differential I(n₂) − I(n₁), normalized to C = 1.0× (lower is better: less work than C): cancels startup and JIT, isolating the algorithm's real work.
Jittery runtimes (Go, C#, JVM): pinned to single-thread + GC off (runtimeEnv) and reported as median of N; pure natives are bit-exact.

See the fannkuch study for the full results and methodology.

How it works (three pipelines, free CI)

endoflife.date ──► version-watch (weekly)  ──► PR bumping versions.lock.json
                                                     │ merge triggers…
                                                     ▼
                 benchmark ──► version-pinned Docker image per language
                               └► qemu+insn → results/<date>.json   (instruction track)

                 scaling   ──► the same images, run natively (no qemu)
                               └► wall-clock T1/TP at 1/2/4 cores → results/scaling/   (scaling track)

Everything runs on free GitHub-hosted runners: no self-hosted runners, no own VM, no security exposure. The result history lives versioned in results/.

Run it locally

Requires Docker.

scripts/bench-local.sh                          # every language × every benchmark
scripts/bench-local.sh rust                      # one language, all benchmarks
BENCH=binary-trees scripts/bench-local.sh rust   # one language, one benchmark

Each run prints one JSON line with the differential and per-size medians.

Benchmarks

A small suite, chosen so each benchmark stresses a different, orthogonal axis of the runtime. A language fast at one can be slow at another, which is exactly why one micro-benchmark isn't enough.

Benchmark	Axis: what it measures	Study
fannkuch-redux	Integer compute: permutations & array reversals, no allocation	benchmarks/fannkuch
binary-trees	Allocation + GC churn: build & walk real heap nodes, no arenas	benchmarks/binary-trees
mandelbrot	Floating-point: a tight IEEE-754 `double` inner loop, almost no memory traffic	benchmarks/mandelbrot
k-nucleotide	Hash maps: count k-mer frequencies of a DNA string via the std dictionary	benchmarks/k-nucleotide
reverse-complement	String/byte processing: reverse-complement a DNA buffer + hash it, char by char	benchmarks/reverse-complement
sort-search	Classic algorithms: hand-written quicksort + binary search over a mutable array	benchmarks/sort-search
dijkstra	Graphs: shortest paths with a hand-written binary-heap priority queue	benchmarks/dijkstra
blur	Image / 2D stencil: a 3×3 Gaussian convolution over a generated grayscale image	benchmarks/blur
k-means	Machine learning: integer Lloyd's clustering (nearest-centroid + floor-mean update)	benchmarks/k-means
sha256	Bit manipulation / crypto: hand-written SHA-256, applied iteratively	benchmarks/sha256
lz77	Compression: hand-written LZ77 sliding-window longest-match search	benchmarks/lz77
vm	Interpreter dispatch: a stack bytecode VM running a fixed program	benchmarks/vm
bigint	Multi-precision: hand-rolled base-2³² factorial with carry propagation	benchmarks/bigint
tak	Function-call / recursion overhead: naive triple-recursive Takeuchi, no memory	benchmarks/tak
polymorphism	Dynamic dispatch: megamorphic virtual calls resolved at runtime (K=6 types)	benchmarks/polymorphism
gemm	AI/ML — quantized int8 matrix multiply: the dominant tensor inference kernel, cache-pressure inner loop	benchmarks/gemm
viterbi	AI/ML — HMM/CRF sequence decoding: integer max-plus DP trellis + back-pointer trace	benchmarks/viterbi
gbdt	AI/ML — gradient-boosted tree ensemble inference: data-dependent branchy tree traversal	benchmarks/gbdt
message-ring	Concurrency overhead: per-handoff cost of a language's cooperative message-passing primitive (a 32-worker ring); N/A for Perl and COBOL	benchmarks/message-ring

Every benchmark has a reference checksum that all implementations must reproduce bit for bit: proof that they all do exactly the same work.

Adding a benchmark is adding source files, not rewiring the harness. A benchmarks/<name>/spec.json carries the two sizes + checksums; each image compiles every benchmark via a glob, and a per-language run template in languages.json (with a {b} placeholder) tells the driver how to launch it. Run one benchmark locally with BENCH=binary-trees scripts/bench-local.sh <lang>.

Scaling track (wall-clock parallel speedup)

The instruction track above is deliberately single-threaded: the qemu insn plugin sums guest instructions across all cores, so it is blind to parallel speedup (a GIL-bound thread program even looks like it scales). A separate, complementary scaling track answers the other question: given a benchmark whose algorithm is parallelizable, how well does each language let you use multiple cores?

It reports the wall-clock speedup T1/TP (time at 1 core over time at P cores) of the compute region only, run natively (no qemu) at 1, 2 and 4 cores; higher is better (the ideal is the core count). Taking a ratio cancels machine-speed noise, so it stays stable on shared CI runners (validated to ±0.03). This track is not bit-reproducible like the instruction track; it is reported as a ratio. The full fairness rulebook (decomposition, partition, no-shared-write rules, per-language primitives) lives in docs/scaling-track.md.

Five embarrassingly-parallel axes are measured across every language that has a concurrency primitive (all but COBOL), each using its idiomatic real-parallel primitive (pthreads, goroutines, fork/processes, BEAM Tasks, JVM/CLR thread pools):

Benchmark	Speedup charts
gemm	vs cores · bars
mandelbrot	vs cores · bars
blur	vs cores · bars
k-means	vs cores · bars
gbdt	vs cores · bars

Most native and VM runtimes land close to the ideal (near 4× on 4 cores). The GIL/GVL languages (Python, Ruby) reach it via processes; the thread variant flatlining at ~1.0× is the GIL made visible, not a bug. A dedicated CI workflow (.github/workflows/scaling.yml) reruns the track on x86_64 and commits the results under results/scaling/.

For a combined, cross-track view of concurrency per language (primitive overhead + parallel scalability + the GIL/GVL line, with an explicit account of what is and is not measurable without dedicated hardware), see the concurrency study.

Structure

languages/<lang>/Dockerfile      version-pinned image (compiles every benchmark)
languages/<lang>/<bench>.*       benchmark implementation (fannkuch.*, binary-trees.*, …)
languages/_base/                 shared qemu + insn-plugin base image
scripts/measure.sh               measures and emits the JSON (inside each image)
scripts/measure-scaling.sh       scaling track: wall-clock T1/TP, native, no qemu
scripts/bench-local.sh           build + measure benchmarks × languages locally
scripts/check-versions.mjs       version watcher (Node, no dependencies)
scripts/make_charts.py           SVG chart generator (no dependencies)
scripts/make_scaling_charts.py   scaling-track speedup chart generator
scripts/make_tables.py           splices each study's results table into its README
scripts/make_matrix.py           master language × benchmark matrix (the README hero)
scripts/make_ring_wall_chart.py  message-ring wall-clock per-hop chart
languages.json                   registry: endoflife slug, build-arg, runtimeKind, runtimeEnv, run
versions.lock.json               pinned version per language
scaling-config.json              scaling track: par-run templates, per-class sizes, JIT warmup
benchmarks/<name>/spec.json      sizes + reference checksums for that benchmark
benchmarks/<name>/README.md      algorithm spec, fairness rules, study
languages/<lang>/<bench>-par.*   parallel variant for the scaling track (5 axes)
results/                         instruction-track result history (versioned in git)
results/scaling/                 scaling-track results (wall-clock T1/TP per language)
docs/charts/                     generated SVG charts (instruction + scaling)
docs/scaling-track.md            scaling track: fairness rulebook
docs/concurrency-study.md        cross-track concurrency study (per language)
docs/metric-validity.md          empirical validity of the instruction metric (calibration + cross-ISA)

Status

v0: 12 languages + a C baseline measured uniformly under qemu+insn across a nineteen-benchmark suite (fannkuch, binary-trees, mandelbrot, k-nucleotide, reverse-complement, sort-search, dijkstra, blur, k-means, sha256, lz77, vm, bigint, tak, polymorphism, gemm, viterbi, gbdt, message-ring: integer / allocation / floating-point / hash-map / string / algorithms / graphs / image / ML / bit-manipulation / compression / interpreter-dispatch / multi-precision / call-overhead / dynamic-dispatch / quantized-matmul / sequence-DP / tree-ensemble / concurrency-overhead; message-ring is N/A for Perl and COBOL, which have no cooperative primitive), the measurement engine characterized empirically, CI pipelines defined. A complementary wall-clock scaling track adds multicore speedup (T1/TP) for the five parallelizable axes, and a cross-track concurrency study (docs/concurrency-study.md) ties the two together. Local, in development.

License

MIT.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lang Lab

The idea in one sentence

Why instructions, not seconds

Two tracks, two contracts

Languages (12 + a C baseline)

The measurement engine

How it works (three pipelines, free CI)

Run it locally

Benchmarks

Scaling track (wall-clock parallel speedup)

Structure

Status

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
docs		docs
languages		languages
results		results
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
languages.json		languages.json
package.json		package.json
scaling-config.json		scaling-config.json
versions.lock.json		versions.lock.json

Folders and files

Latest commit

History

Repository files navigation

Lang Lab

The idea in one sentence

Why instructions, not seconds

Two tracks, two contracts

Languages (12 + a C baseline)

The measurement engine

How it works (three pipelines, free CI)

Run it locally

Benchmarks

Scaling track (wall-clock parallel speedup)

Structure

Status

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages