Releases: nl2shell/bench

v0.3.0 — Nl2shellAgent + Terminal World Model first results

24 May 13:05

v0.2 measured a frontier baseline on TerminalBench-2.0 (claude-haiku-4-5: 30% pass rate on the 10-task sample, $1.63, 21 min wall clock). v0.3 addresses the actual ask: let nl2shell:0.8b attempt the tasks itself as a Harbor BaseAgent, and report the honest gap.

What's new

  • Nl2shellAgent (src/nl2shell/agent.py) — a Harbor BaseAgent that drives a TerminalBench task by:
    1. rendering instruction + observation history into a flat NL prompt
    2. asking Ollama for one bash command
    3. executing it via environment.exec
    4. appending (cmd, rc, stdout, stderr) to the history
    5. looping until max_steps, an empty/unsafe command, or a stop-marker.
  • Observation modes: full | command | none, for ablation: how much state does the model actually use?
  • Configs: configs/harbor_nl2shell_smoke.yaml (single agent, sample) and configs/harbor_nl2shell_obs_ablation.yaml (3 agents × sample for the ablation).
  • --config passthrough in evals/run_terminal_bench.py so YAML/JSON Harbor configs work directly: nl2shell-eval-tb --config configs/harbor_nl2shell_smoke.yaml.
  • 8 new tests — prompt rendering, observation-mode redaction, unsafe-command blocking, max-steps cap, Ollama health check. 28/28 passing.
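
The five-step loop listed for Nl2shellAgent can be sketched as follows. This is a minimal illustration, not the code in src/nl2shell/agent.py: the names `LoopState`, `render_prompt`, `run_loop`, `STOP_MARKER`, and `UNSAFE_PREFIXES` are hypothetical, the stop-marker value and the blocklist are assumptions, and the model and sandbox are passed in as plain callables so the sketch runs without Ollama or Harbor.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical names throughout; the real agent's signatures are not shown
# in these release notes.
ExecResult = Tuple[int, str, str]  # (rc, stdout, stderr)
STOP_MARKER = "DONE"  # assumed stop-marker value
UNSAFE_PREFIXES = ("rm -rf /", "mkfs", "shutdown")  # illustrative blocklist only


@dataclass
class LoopState:
    instruction: str
    history: List[Tuple[str, int, str, str]] = field(default_factory=list)


def render_prompt(state: LoopState) -> str:
    """Step 1: flatten instruction + observation history into one NL prompt."""
    lines = [f"Task: {state.instruction}"]
    for cmd, rc, out, err in state.history:
        lines.append(f"$ {cmd}\nrc={rc}\nstdout={out}\nstderr={err}")
    lines.append("Next command:")
    return "\n".join(lines)


def is_unsafe(cmd: str) -> bool:
    """Crude prefix check standing in for the real unsafe-command filter."""
    return cmd.strip().startswith(UNSAFE_PREFIXES)


def run_loop(
    instruction: str,
    ask_model: Callable[[str], str],
    execute: Callable[[str], ExecResult],
    max_steps: int = 8,
) -> LoopState:
    state = LoopState(instruction)
    for _ in range(max_steps):  # step 5: hard cap on iterations
        cmd = ask_model(render_prompt(state)).strip()  # step 2: one command
        if not cmd or cmd == STOP_MARKER or is_unsafe(cmd):
            break  # step 5: empty, stop-marker, or unsafe command ends the run
        rc, out, err = execute(cmd)  # step 3: run it in the environment
        state.history.append((cmd, rc, out, err))  # step 4: record observation
    return state
```

With a scripted model that replies `ls`, `cat notes.txt`, then `DONE`, `run_loop` records two history entries and stops at the marker without executing anything further.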

Terminal World Model — first complete pipeline

The companion ~/terminal-world-model workspace ran end-to-end for the first time in this session, and all 12 gate items pass (`complete=true`):

  • State-LoRA training (108 train / 12 val) → eval_loss 0.745, 66 s on RTX 5060 Ti
  • Baseline Harbor smoke on adaptive-rejection-sampler (Daytona) → reward 0.0, 6 cmds, fails on missing R
  • State-adapter Harbor smoke → reward 0.0, but knows library(ars) (rc=0 vs baseline rc=127) — real behavioural delta from the LoRA
  • Observation-history ablation (none / command / full):
    • none & command: model loops on the same broken command (cannot course-correct)
    • full: tries 3 different invocations before looping — observations matter, even when reward stays at 0.
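
One plausible reading of the three observation modes compared above is sketched here. It assumes that `none` hides all history, `command` shows only prior commands, and `full` shows commands with return code, stdout, and stderr; the function name `redact_history` and the rendered line format are hypothetical, and the sample history entry is made up for illustration.

```python
from typing import List, Tuple

Step = Tuple[str, int, str, str]  # (cmd, rc, stdout, stderr)


def redact_history(history: List[Step], mode: str) -> List[str]:
    """Render prior steps for the prompt under an assumed observation mode."""
    if mode == "none":
        # The model sees only the task instruction on every step.
        return []
    if mode == "command":
        # The model sees what it ran, but not what happened.
        return [f"$ {cmd}" for cmd, _rc, _out, _err in history]
    if mode == "full":
        # The model sees the command and its complete outcome.
        return [
            f"$ {cmd}\nrc={rc}\nstdout={out}\nstderr={err}"
            for cmd, rc, out, err in history
        ]
    raise ValueError(f"unknown observation mode: {mode!r}")


# Made-up history entry, not taken from a real run.
history = [("make test", 2, "", "make: *** No rule to make target 'test'.")]
for mode in ("none", "command", "full"):
    print(mode, redact_history(history, mode))
```

Under this reading, only `full` exposes the failing return code and stderr, which is consistent with it being the only mode in which the model tried different invocations.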

Attached: terminal_world_model.pdf (full paper), evidence-report.md (gate state), ablation-comparison.md (per-mode metrics).

Why this matters

These numbers are 0.0 across the board on one task, and that is the honest signal. Single-shot NL→bash translators trained on nl2shell-training-v3 (which excludes TerminalBench tasks, tests, and solutions) lack the agentic loop-breaking behaviour that frontier models like claude-haiku-4-5 have. The gap between v0.2 (30% on the 10-task sample) and v0.3 (0% on a single task) on the same Daytona sandbox is the headroom a Terminal World Model layer is supposed to close. v0.4 will run Nl2shellAgent against the full 89-task TB2 and start the v4 dataset targeted at the 18 starved concept tags (multi-turn observation, missing-tool recognition, file-system state).

Reproduce

pip install -e .  # adds the nl2shell-eval-tb CLI
ollama pull hf.co/AryaYT/nl2shell-0.8b
ollama cp hf.co/AryaYT/nl2shell-0.8b nl2shell:0.8b

# Local Docker
nl2shell-eval-tb --config configs/harbor_nl2shell_smoke.yaml

# Or via Daytona (flip the env field in the YAML; needs DAYTONA_API_KEY)

v0.2.0 — TerminalBench-2.0 baseline + docker eval-in-a-box

24 May 11:25

Highlights

  • TerminalBench-2.0 via Harbor is wired up end-to-end. New nl2shell-eval-tb CLI shells out to harbor run and ingests per-trial outputs into our scoreboard format.
  • Claude-haiku-4-5 baseline on terminal-bench-sample@2.0 via Daytona cloud sandboxes:

| metric | value |
| --- | --- |
| pass rate | 30.0% (3/10) |
| mean reward | 0.300 |
| wall clock | 21m 02s |
| total cost | $1.63 |
| timeouts | 2 (qemu-alpine-ssh, qemu-startup) |

Passed: fix-code-vulnerability, log-summary-date-ranges, sqlite-with-gcov.

  • Reproducible docker eval-in-a-box: docker compose -f docker/docker-compose.yml up brings up Ollama + the bench harness on any machine. Mounts /var/run/docker.sock so Harbor can spawn sibling task containers.
  • Nl2shellAgent Harbor adapter is on the v0.3 roadmap — that's where nl2shell:0.8b runs as the agent itself and we get the honest TB2 number for our model.

Eval-only reminder

TerminalBench is eval-only material. Outputs from these runs are not folded into model training, DPO, GRPO, or preference data.

Reproduce

```bash
git clone https://github.com/nl2shell/bench.git && cd bench
pip install -e .

export ANTHROPIC_API_KEY=...
export DAYTONA_API_KEY=...
nl2shell-eval-tb --dataset terminal-bench-sample@2.0 \
--agent claude-code --model anthropic/claude-haiku-4-5 \
--env daytona -n 4
```

See `docs/benchmarks.md` for details and `docker/README.md` for the compose-up flow.

Full changelog: https://github.com/nl2shell/bench/blob/main/CHANGELOG.md

v0.1.0 — multi-metric eval harness, first headline benchmark

24 May 10:28

Changelog

All notable changes to nl2shell/bench are recorded here. Format follows Keep a Changelog; versioning is Semantic Versioning.

Unreleased

Planned for v0.2

  • Filter multi-line shell scripts from the v3 holdout (currently ~3% of cases have gold containing newlines or shebangs that a single-shot model can't reproduce).
  • Wire up the TerminalBench (Harbor) adapter. Skeleton exists at evals/benchmarks/terminal_bench.py.
  • Stop-token tuning on the Ollama client (currently relies only on num_predict=96 cap).
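
For the stop-token item above, a request body for Ollama's generate endpoint might be built as sketched below. The option keys `temperature`, `seed`, `num_predict`, and `stop` are standard Ollama generation options and the numeric values are the ones this changelog states for the client; the function names and the example stop string are hypothetical, and nothing here is taken from src/nl2shell/client.py.

```python
import json
from typing import Dict, List, Optional


def build_generate_payload(
    prompt: str,
    model: str = "nl2shell:0.8b",
    stop: Optional[List[str]] = None,
) -> Dict:
    """Build a JSON body for POST /api/generate with deterministic options."""
    options = {"temperature": 0, "seed": 42, "num_predict": 96}
    if stop:
        # Hypothetical tuning: end generation at the first stop string
        # instead of relying only on the num_predict cap.
        options["stop"] = list(stop)
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def first_command_line(text: str) -> str:
    """Return the first non-empty line of a model response, stripped."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


payload = build_generate_payload("list files by size", stop=["\n"])
print(json.dumps(payload, indent=2))
```

Stopping on a newline is only one candidate; it would also cut off legitimate multi-line commands, so the choice of stop strings is a trade-off rather than a given.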

0.1.0 — 2026-05-24

First release. Standalone evaluation harness for the nl2shell:0.8b model.

Added

  • src/nl2shell/concepts.py — 28 closed-vocabulary regex tags spanning redirection (8), pipes/chaining (5), process control (4), substitution (5), globbing/quoting (4), and tool idioms (2). Plus operators() for verbatim symbol extraction, core_command() for first-token analysis, difficulty() bucketing 1–4.
  • src/nl2shell/label.py — pure-function row labeller that drops the chat-template field and emits {nl, bash, source, core_command, operators, postprocessors, concepts, difficulty, unsafe_quoting_detected, hash}.
  • src/nl2shell/audit.py — applies the labeller to a HuggingFace dataset (default: AryaYT/nl2shell-training-v3), writes labelled JSONL + a per-tag gap report. Marks tags as STARVED (<5%), lean (<15%), or ok (≥15%).
  • src/nl2shell/client.py — minimal Ollama HTTP client with deterministic options (temperature=0, seed=42, num_predict=96). Health check and first-command-line extraction.
  • src/nl2shell/metrics.py — per-case scorers: exact_match, template_match (literal-aware tokenisation), parses (bash -n), shellcheck_clean (when shellcheck installed), concept_overlap (Jaccard of v3 concept tags).
  • scripts/build_holdout.py — stratified sampler that balances the holdout across STARVED tags so per-tag accuracy is informative. Deterministic seeded shuffle.
  • evals/run.py — end-to-end runner that ingests a benchmark, queries Ollama, scores, aggregates per-tag/per-difficulty/per-source, writes summary.json + per_case.jsonl + report.md.
  • evals/benchmarks/v3_holdout.py — adapter yielding cases from data/holdout/holdout.jsonl.
  • evals/benchmarks/terminal_bench.py — scaffold + design notes; not yet runnable.
  • tests/ — 16 unit tests covering tagger and metrics.
  • CI-ready: pyproject.toml with hatchling backend, ruff + pytest dev extras.
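
The STARVED / lean / ok thresholds used by the audit step can be illustrated with a small sketch. It assumes a tag's share is the fraction of rows carrying that tag; the function names `tag_status` and `gap_report` and the example tag names are hypothetical, and this is not the code in src/nl2shell/audit.py.

```python
from collections import Counter
from typing import Dict, Iterable, List


def tag_status(share: float) -> str:
    """Map a tag's share of rows to the status labels from the changelog."""
    if share < 0.05:
        return "STARVED"
    if share < 0.15:
        return "lean"
    return "ok"


def gap_report(rows: Iterable[List[str]], vocabulary: List[str]) -> Dict[str, str]:
    """Classify every tag in the closed vocabulary, including unseen ones."""
    rows = list(rows)
    # Count each tag at most once per row.
    counts = Counter(tag for tags in rows for tag in set(tags))
    total = len(rows) or 1  # avoid division by zero on an empty dataset
    return {tag: tag_status(counts[tag] / total) for tag in vocabulary}


# Made-up rows: 20 in total, with hypothetical tag names.
rows = [["pipe"]] * 10 + [["heredoc"]] * 2 + [[]] * 8
print(gap_report(rows, ["pipe", "heredoc", "subshell"]))
# {'pipe': 'ok', 'heredoc': 'lean', 'subshell': 'STARVED'}
```

Iterating over the vocabulary rather than the observed tags matters here: a tag that never appears still shows up in the report as STARVED.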

Headline benchmark — v3-holdout (n=426)

Wall time: 99s on the local box. Median latency: 0.21s/case. p95: 0.41s.

| metric | value |
| --- | --- |
| exact_match | 3.3% |
| template_match | 4.0% |
| parses (bash -n) | 96.7% |
| concept-tag Jaccard | 0.310 |
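
Two of the metrics in this table are simple enough to sketch. The version below assumes that concept overlap is the Jaccard index over two tag sets and that the parse check pipes the command to `bash -n`, which checks syntax without executing anything. The handling of two empty tag sets and of a missing `bash` binary are assumptions, and the function bodies are not taken from src/nl2shell/metrics.py.

```python
import shutil
import subprocess
from typing import Iterable, Optional


def concept_overlap(gold_tags: Iterable[str], pred_tags: Iterable[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over concept tags."""
    gold, pred = set(gold_tags), set(pred_tags)
    if not gold and not pred:
        return 1.0  # assumption: two tag-free commands count as full overlap
    return len(gold & pred) / len(gold | pred)


def parses(command: str) -> Optional[bool]:
    """True if `bash -n` accepts the command; None if bash is not installed."""
    bash = shutil.which("bash")
    if bash is None:
        return None
    proc = subprocess.run(
        [bash, "-n"],  # -n: read commands and check syntax, do not execute
        input=command,
        text=True,
        capture_output=True,
        timeout=10,
    )
    return proc.returncode == 0


print(concept_overlap(["pipe", "redirect"], ["pipe"]))  # 0.5
print(parses("ls | wc -l"))
```

A high parse rate next to a low exact-match rate, as in the table, is what one would expect if most outputs are valid shell that simply differs from the gold string.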

Per difficulty bucket:

| difficulty | n | exact_match |
| --- | --- | --- |
| 1 (no operators) | 32 | 21.9% |
| 2 (one operator) | 90 | 5.6% |
| 3 (2–3 operators) | 182 | 1.1% |
| 4 (4+ operators) | 122 | 0.0% |
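
The bucket boundaries in this table translate directly into a small function. In the sketch below, the operator regex is a deliberately naive stand-in (it ignores quoting and covers only a handful of symbols), and both function bodies are assumptions rather than the code in src/nl2shell/concepts.py.

```python
import re
from typing import List

# Naive operator pattern: longer tokens come first so "||" is not read as
# two "|". It does not respect quotes, so it is only good for illustration.
_OPERATOR_RE = re.compile(r"2>&1|\|\||&&|>>|<<|[|;&<>]")


def operators(command: str) -> List[str]:
    """Extract shell operator symbols verbatim, in order of appearance."""
    return _OPERATOR_RE.findall(command)


def difficulty(n_operators: int) -> int:
    """Bucket 1-4 by operator count, matching the table's row labels."""
    if n_operators <= 0:
        return 1  # no operators
    if n_operators == 1:
        return 2  # one operator
    if n_operators <= 3:
        return 3  # 2-3 operators
    return 4  # 4+ operators


for cmd in ("ls", "ls | wc -l", "cat a | sort | uniq > out", "a && b || c | d >> e"):
    ops = operators(cmd)
    print(difficulty(len(ops)), ops, cmd)
```

The four bucket sizes in the table (32 + 90 + 182 + 122) sum to the holdout size of 426, so every case falls into exactly one bucket.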

See results/v3-holdout__20260524T101822Z/report.md for the full per-tag breakdown.

Known limitations

  • Exact-match is harsh: gold is canonical-bash-from-noisy-sources; the model often produces more idiomatic shell (e.g. find . -name "*bar" vs find -name *bar).
  • Some gold cases are multi-line shell scripts. A single-shot translator can't reproduce them. To be filtered in v0.2.
  • ProgramBench is documented as out-of-scope (whole-program reconstruction; not a NL→bash task).
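
The multi-line gold cases mentioned above could be screened out with a filter along these lines. It assumes each holdout case is a dict whose gold command sits under the `bash` key (the field name the labeller emits) and treats any embedded newline or leading shebang as multi-line; the function names are hypothetical and the sample cases are made up.

```python
from typing import Dict, Iterable, List


def is_single_line(gold: str) -> bool:
    """True if the gold command has no embedded newline and no shebang."""
    text = gold.strip()
    return "\n" not in text and not text.startswith("#!")


def filter_holdout(cases: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only cases a single-shot translator could reproduce verbatim."""
    return [case for case in cases if is_single_line(case["bash"])]


# Made-up cases for illustration.
cases = [
    {"nl": "count lines", "bash": "wc -l file.txt"},
    {"nl": "backup script", "bash": "#!/bin/bash\ntar czf b.tgz ."},
    {"nl": "two steps", "bash": "cd /tmp\nls"},
]
print([case["nl"] for case in filter_holdout(cases)])  # ['count lines']
```

Stripping before the check keeps a trailing newline in the gold from disqualifying an otherwise single-line command.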