Releases: nl2shell/bench
v0.3.0 — Nl2shellAgent + Terminal World Model first results
v0.2 measured a frontier baseline on TerminalBench-2.0 (claude-haiku-4-5: 30% pass on the 10-task sample, $1.63, 21 m). v0.3 closes the actual ask: let nl2shell:0.8b attempt the tasks itself as a Harbor BaseAgent, and report the honest gap.
What's new
- `Nl2shellAgent` (`src/nl2shell/agent.py`): a Harbor `BaseAgent` that drives a TerminalBench task by:
  - rendering `instruction + observation history` into a flat NL prompt
  - asking Ollama for one bash command
  - executing it via `environment.exec`
  - appending `(cmd, rc, stdout, stderr)` to the history
  - looping until `max_steps`, an empty/unsafe command, or a stop-marker.
- Observation modes: `full | command | none` for ablation: how much state does the model actually use?
- Configs: `configs/harbor_nl2shell_smoke.yaml` (single agent, sample) and `configs/harbor_nl2shell_obs_ablation.yaml` (3 agents × sample for the ablation).
- `--config` passthrough in `evals/run_terminal_bench.py` so YAML/JSON Harbor configs work directly: `nl2shell-eval-tb --config configs/harbor_nl2shell_smoke.yaml`.
- 8 new tests: prompt rendering, observation-mode redaction, unsafe-command blocking, max-steps cap, Ollama health check. 28/28 passing.
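The agent loop described in the first bullet can be sketched roughly as below. This is a minimal illustration, not the code in `src/nl2shell/agent.py`: the `Step` type, the `ask_model` and `exec_cmd` callables, the denylist, and the `DONE` stop-marker are all hypothetical stand-ins, and the meaning given to the three observation modes (`full` = commands plus results, `command` = commands only, `none` = no history) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One executed command and what came back (hypothetical record type)."""
    cmd: str
    rc: int
    stdout: str
    stderr: str


# Illustrative values only; the real denylist and stop-marker are not given in these notes.
UNSAFE_PREFIXES = ("rm -rf /", "mkfs", ":(){")
STOP_MARKER = "DONE"


def render_prompt(instruction: str, history: List[Step], mode: str = "full") -> str:
    """Flatten the task and (optionally redacted) history into one NL prompt."""
    lines = [f"Task: {instruction}"]
    if mode != "none":
        for step in history:
            lines.append(f"$ {step.cmd}")
            if mode == "full":
                # Only 'full' mode exposes return codes and output to the model.
                lines.append(f"[rc={step.rc}] {step.stdout.strip()} {step.stderr.strip()}".strip())
    lines.append("Next command:")
    return "\n".join(lines)


def is_unsafe(cmd: str) -> bool:
    """Very rough prefix check, standing in for the real unsafe-command filter."""
    return any(cmd.strip().startswith(prefix) for prefix in UNSAFE_PREFIXES)


def run_episode(
    instruction: str,
    ask_model: Callable[[str], str],               # stand-in for the Ollama call
    exec_cmd: Callable[[str], Tuple[int, str, str]],  # stand-in for environment.exec
    max_steps: int = 8,
    mode: str = "full",
) -> List[Step]:
    """Ask for one command per step until a stop condition is hit."""
    history: List[Step] = []
    for _ in range(max_steps):
        cmd = ask_model(render_prompt(instruction, history, mode)).strip()
        if not cmd or cmd == STOP_MARKER or is_unsafe(cmd):
            break
        rc, out, err = exec_cmd(cmd)
        history.append(Step(cmd, rc, out, err))
    return history
```

Passing the model and the executor in as plain callables keeps the loop testable without Ollama or a sandbox, which is presumably how checks such as the max-steps cap and unsafe-command blocking can run offline.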
Terminal World Model — first complete pipeline
The companion ~/terminal-world-model workspace ran end-to-end for the first time in this session, and all 12 gate items pass (`complete=true`):
- State-LoRA training (108 train / 12 val) → eval_loss 0.745, 66 s on RTX 5060 Ti
- Baseline Harbor smoke on `adaptive-rejection-sampler` (Daytona) → reward 0.0, 6 cmds, fails on missing R
- State-adapter Harbor smoke → reward 0.0, but knows `library(ars)` (rc=0 vs baseline rc=127): a real behavioural delta from the LoRA
- Observation-history ablation (`none` / `command` / `full`):
  - `none` & `command`: model loops on the same broken command (cannot course-correct)
  - `full`: tries 3 different invocations before looping, so observations matter even when reward stays at 0.
Attached: terminal_world_model.pdf (full paper), evidence-report.md (gate state), ablation-comparison.md (per-mode metrics).
Why this matters
These numbers are 0.0 across the board on one task — that's the honest signal. Single-shot NL→bash translators trained on nl2shell-training-v3 (which excludes TerminalBench tasks, tests, and solutions) lack the agentic loop-breaking behaviour that frontier models like claude-haiku-4-5 have. The gap between v0.2 (30%) and v0.3 (0%) on the same Daytona sandbox is exactly the headroom a Terminal World Model layer is supposed to close. v0.4 will run Nl2shellAgent against the full 89-task TB2 and start the v4 dataset targeted at the 18 starved concept tags (multi-turn observation, missing-tool recognition, file-system state).
Reproduce
```bash
pip install -e .   # adds the nl2shell-eval-tb CLI
ollama pull hf.co/AryaYT/nl2shell-0.8b
ollama cp hf.co/AryaYT/nl2shell-0.8b nl2shell:0.8b
# Local Docker
nl2shell-eval-tb --config configs/harbor_nl2shell_smoke.yaml
# Or via Daytona (flip the env field in the YAML; needs DAYTONA_API_KEY)
```
v0.2.0 — TerminalBench-2.0 baseline + docker eval-in-a-box
Highlights
- TerminalBench-2.0 via Harbor is wired up end-to-end. New `nl2shell-eval-tb` CLI shells out to `harbor run` and ingests per-trial outputs into our scoreboard format.
- Claude-haiku-4-5 baseline on `terminal-bench-sample@2.0` via Daytona cloud sandboxes:
| metric | value |
|---|---|
| pass rate | 30.0% (3/10) |
| mean reward | 0.300 |
| wall clock | 21m 02s |
| total cost | $1.63 |
| timeouts | 2 (qemu-alpine-ssh, qemu-startup) |
Passed: fix-code-vulnerability, log-summary-date-ranges, sqlite-with-gcov.
- Reproducible docker eval-in-a-box: `docker compose -f docker/docker-compose.yml up` brings up Ollama + the bench harness on any machine. Mounts `/var/run/docker.sock` so Harbor can spawn sibling task containers.
- `Nl2shellAgent` Harbor adapter is on the v0.3 roadmap: that's where nl2shell:0.8b runs as the agent itself and we get the honest TB2 number for our model.
Eval-only reminder
TerminalBench is eval-only material. Outputs from these runs are not folded into model training, DPO, GRPO, or preference data.
Reproduce
```bash
git clone https://github.com/nl2shell/bench.git && cd bench
pip install -e .
export ANTHROPIC_API_KEY=...
export DAYTONA_API_KEY=...
nl2shell-eval-tb --dataset terminal-bench-sample@2.0 \
--agent claude-code --model anthropic/claude-haiku-4-5 \
--env daytona -n 4
```
See `docs/benchmarks.md` for details and `docker/README.md` for the compose-up flow.
Full changelog: https://github.com/nl2shell/bench/blob/main/CHANGELOG.md
v0.1.0 — multi-metric eval harness, first headline benchmark
Changelog
All notable changes to nl2shell/bench are recorded here. Format follows Keep a Changelog; versioning is Semantic Versioning.
Unreleased
Planned for v0.2
- Filter multi-line shell scripts from the v3 holdout (currently ~3% of cases have gold containing newlines or shebangs that a single-shot model can't reproduce).
- Wire up the TerminalBench (Harbor) adapter. Skeleton exists at `evals/benchmarks/terminal_bench.py`.
- Stop-token tuning on the Ollama client (currently relies only on the `num_predict=96` cap).
0.1.0 — 2026-05-24
First release. Standalone evaluation harness for the nl2shell:0.8b model.
Added
- `src/nl2shell/concepts.py`: 28 closed-vocabulary regex tags spanning redirection (8), pipes/chaining (5), process control (4), substitution (5), globbing/quoting (4), and tool idioms (2). Plus `operators()` for verbatim symbol extraction, `core_command()` for first-token analysis, `difficulty()` bucketing 1–4.
- `src/nl2shell/label.py`: pure-function row labeller that drops the chat-template field and emits `{nl, bash, source, core_command, operators, postprocessors, concepts, difficulty, unsafe_quoting_detected, hash}`.
- `src/nl2shell/audit.py`: applies the labeller to a HuggingFace dataset (default: `AryaYT/nl2shell-training-v3`), writes labelled JSONL + a per-tag gap report. Marks tags as `STARVED` (<5%), `lean` (<15%), or `ok` (≥15%).
- `src/nl2shell/client.py`: minimal Ollama HTTP client with deterministic options (`temperature=0`, `seed=42`, `num_predict=96`). Health check and first-command-line extraction.
- `src/nl2shell/metrics.py`: per-case scorers: `exact_match`, `template_match` (literal-aware tokenisation), `parses` (`bash -n`), `shellcheck_clean` (when shellcheck installed), `concept_overlap` (Jaccard of v3 concept tags).
- `scripts/build_holdout.py`: stratified sampler that balances the holdout across STARVED tags so per-tag accuracy is informative. Deterministic seeded shuffle.
- `evals/run.py`: end-to-end runner that ingests a benchmark, queries Ollama, scores, aggregates per-tag/per-difficulty/per-source, writes `summary.json` + `per_case.jsonl` + `report.md`.
- `evals/benchmarks/v3_holdout.py`: adapter yielding cases from `data/holdout/holdout.jsonl`.
- `evals/benchmarks/terminal_bench.py`: scaffold + design notes; not yet runnable.
- `tests/`: 16 unit tests covering tagger and metrics.
- CI-ready: `pyproject.toml` with hatchling backend, ruff + pytest dev extras.
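As a rough illustration of the deterministic client described above, the sketch below builds a request for Ollama's `/api/generate` endpoint with the three options listed and keeps only the first usable line of the reply. It is not the code in `src/nl2shell/client.py`: the function names are hypothetical, the default URL assumes a local Ollama on its standard port, and the rule of skipping blank lines and Markdown fence lines is an assumption about what "first-command-line extraction" means.

```python
import json
import urllib.request

# Assumes a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Request body with the deterministic options named in the changelog."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {"temperature": 0, "seed": 42, "num_predict": 96},
    }


def first_command_line(text: str) -> str:
    """Return the first non-empty line that is not a Markdown fence (assumed rule)."""
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("```"):
            return line
    return ""


def generate(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 30.0) -> str:
    """POST the payload and extract a single command from the 'response' field."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return first_command_line(body.get("response", ""))
```

With a server running, `generate("nl2shell:0.8b", "list files modified today")` would return one line of shell. Because `num_predict` is the only length cap here, a reply that runs on past the first command is simply truncated by the line extraction.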
Headline benchmark — v3-holdout (n=426)
Wall time: 99s on the local box. Median latency: 0.21s/case. p95: 0.41s.
| metric | value |
|---|---|
| exact_match | 3.3% |
| template_match | 4.0% |
| parses (bash -n) | 96.7% |
| concept-tag Jaccard | 0.310 |
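The concept-tag Jaccard row can be read as set overlap between the tags found in the predicted command and those found in the gold command. A minimal sketch follows, assuming tags arrive as plain string collections; the tag names in the usage example are made up, and treating two empty tag sets as a perfect match is an assumption, not confirmed behaviour of `concept_overlap`.

```python
from typing import Iterable


def concept_overlap(pred_tags: Iterable[str], gold_tags: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two tag sets."""
    pred, gold = set(pred_tags), set(gold_tags)
    if not pred and not gold:
        # Assumption: neither command uses any tagged concept, so they agree.
        return 1.0
    return len(pred & gold) / len(pred | gold)


# Hypothetical tags: one shared tag out of three distinct tags overall.
score = concept_overlap({"pipe", "redirect_out"}, {"pipe", "glob"})
print(round(score, 3))  # 0.333
```

A mean of 0.310 next to a 3.3% exact-match rate suggests the model often reaches for some of the right constructs without reproducing the gold command verbatim.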
Per difficulty bucket:
| difficulty | n | exact |
|---|---|---|
| 1 (no operators) | 32 | 21.9% |
| 2 (one operator) | 90 | 5.6% |
| 3 (2–3 operators) | 182 | 1.1% |
| 4 (4+ operators) | 122 | 0.0% |
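The bucket labels in the table above suggest a simple mapping from operator count to difficulty, and the per-bucket rates can be recombined as a sanity check against the headline exact-match figure. The sketch below assumes the count of operators is already available; it is not the `difficulty()` in `src/nl2shell/concepts.py`, and the constant and helper names are hypothetical.

```python
from typing import Dict, Tuple


def difficulty(n_operators: int) -> int:
    """Map an operator count to the 1-4 buckets used in the table."""
    if n_operators < 0:
        raise ValueError("operator count cannot be negative")
    if n_operators == 0:
        return 1
    if n_operators == 1:
        return 2
    if n_operators <= 3:
        return 3
    return 4


# (n cases, exact-match rate) per bucket, copied from the table.
BUCKETS: Dict[int, Tuple[int, float]] = {
    1: (32, 0.219),
    2: (90, 0.056),
    3: (182, 0.011),
    4: (122, 0.0),
}


def weighted_exact_match(buckets: Dict[int, Tuple[int, float]]) -> float:
    """Case-weighted mean of the per-bucket exact-match rates."""
    total = sum(n for n, _ in buckets.values())
    return sum(n * rate for n, rate in buckets.values()) / total


print(sum(n for n, _ in BUCKETS.values()))               # 426
print(round(weighted_exact_match(BUCKETS) * 100, 1))      # 3.3
```

The bucket sizes sum to the holdout size (n=426) and the weighted rate rounds to the reported 3.3%, so the per-difficulty breakdown is consistent with the headline number.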
See results/v3-holdout__20260524T101822Z/report.md for the full per-tag breakdown.
Known limitations
- Exact-match is harsh: gold is canonical-bash-from-noisy-sources; the model often produces more idiomatic shell (e.g. `find . -name "*bar"` vs `find -name *bar`).
- Some gold cases are multi-line shell scripts. A single-shot translator can't reproduce them. To be filtered in v0.2.
- ProgramBench is documented as out-of-scope (whole-program reconstruction; not a NL→bash task).