24 May 13:05

aryayt

1a08138

v0.3.0 — Nl2shellAgent + Terminal World Model first results

v0.2 measured a frontier baseline on TerminalBench-2.0 (claude-haiku-4-5: 30% pass rate on the 10-task sample, $1.63, 21m 02s). v0.3 closes the actual ask: let nl2shell:0.8b attempt the tasks itself as a Harbor BaseAgent, and report the honest gap.

What's new

- Nl2shellAgent (`src/nl2shell/agent.py`): a Harbor BaseAgent that drives a TerminalBench task by:
  1. rendering instruction + observation history into a flat NL prompt
  2. asking Ollama for one bash command
  3. executing it via `environment.exec`
  4. appending `(cmd, rc, stdout, stderr)` to the history
  5. looping until `max_steps`, an empty/unsafe command, or a stop-marker.
- Observation modes: `full | command | none` for ablation, to measure how much state the model actually uses.
- Configs: `configs/harbor_nl2shell_smoke.yaml` (single agent, sample) and `configs/harbor_nl2shell_obs_ablation.yaml` (3 agents × sample for the ablation).
- `--config` passthrough in `evals/run_terminal_bench.py` so YAML/JSON Harbor configs work directly: `nl2shell-eval-tb --config configs/harbor_nl2shell_smoke.yaml`.
- 8 new tests: prompt rendering, observation-mode redaction, unsafe-command blocking, max-steps cap, Ollama health check. 28/28 passing.
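
The five-step loop and the three observation modes can be sketched roughly as follows. This is a minimal illustration, not the code in `src/nl2shell/agent.py`: `ask_model` and `exec_cmd` are hypothetical callables standing in for the Ollama call and `environment.exec`, and the prompt layout, `STOP_MARKER`, and the toy `UNSAFE_PREFIXES` blocklist are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    cmd: str
    rc: int
    stdout: str
    stderr: str


STOP_MARKER = "DONE"             # hypothetical stop-marker
UNSAFE_PREFIXES = ("rm -rf /",)  # hypothetical, deliberately tiny blocklist


def render_prompt(instruction: str, history: List[Step], mode: str = "full") -> str:
    """Flatten the instruction and (optionally redacted) history into one prompt."""
    lines = [f"Task: {instruction}"]
    if mode != "none":                      # "none": the model sees no history at all
        for step in history:
            lines.append(f"$ {step.cmd}")   # "command": past commands only
            if mode == "full":              # "full": commands plus their results
                lines.append(
                    f"rc={step.rc} stdout={step.stdout!r} stderr={step.stderr!r}"
                )
    lines.append("Next command:")
    return "\n".join(lines)


def is_unsafe(cmd: str) -> bool:
    return any(cmd.startswith(prefix) for prefix in UNSAFE_PREFIXES)


def run_agent(
    instruction: str,
    ask_model: Callable[[str], str],
    exec_cmd: Callable[[str], Tuple[int, str, str]],
    max_steps: int = 8,
    mode: str = "full",
) -> List[Step]:
    """Ask for one command per step until a stop condition is hit."""
    history: List[Step] = []
    for _ in range(max_steps):
        cmd = ask_model(render_prompt(instruction, history, mode)).strip()
        if not cmd or is_unsafe(cmd) or cmd == STOP_MARKER:
            break
        rc, out, err = exec_cmd(cmd)
        history.append(Step(cmd, rc, out, err))
    return history
```

Passing the model and the executor in as callables keeps the loop testable without Ollama or a sandbox, which is one plausible way to cover the cases the new tests list (redaction, unsafe-command blocking, max-steps cap).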

Terminal World Model — first complete pipeline

The companion ~/terminal-world-model workspace ran end-to-end for the first time in this session, and all 12 gate items pass (`complete=true`):

- State-LoRA training (108 train / 12 val) → eval_loss 0.745, 66 s on RTX 5060 Ti
- Baseline Harbor smoke on adaptive-rejection-sampler (Daytona) → reward 0.0, 6 cmds, fails on missing R
- State-adapter Harbor smoke → reward 0.0, but knows `library(ars)` (rc=0 vs baseline rc=127): a real behavioural delta from the LoRA
- Observation-history ablation (none / command / full):
  - none & command: model loops on the same broken command (cannot course-correct)
  - full: tries 3 different invocations before looping. Observations matter, even when reward stays at 0.
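
One way to put a number on "tries N different invocations before looping" is to count distinct commands up to the first exact repeat. The helper below is a hypothetical sketch of such a metric; it is not taken from ablation-comparison.md, and the whitespace normalisation is an assumption.

```python
from typing import List


def distinct_before_repeat(cmds: List[str]) -> int:
    """Number of distinct commands issued before the first exact repeat."""
    seen = set()
    for cmd in cmds:
        key = " ".join(cmd.split())  # collapse whitespace so trivial variants match
        if key in seen:
            break
        seen.add(key)
    return len(seen)
```

Under this definition a trajectory that repeats its first command immediately scores 1, and one that tries three different commands before repeating any of them scores 3.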

Attached: terminal_world_model.pdf (full paper), evidence-report.md (gate state), ablation-comparison.md (per-mode metrics).

Why this matters

Reward is 0.0 across the board, and on a single task; that is the honest signal. Single-shot NL→bash translators trained on nl2shell-training-v3 (which excludes TerminalBench tasks, tests, and solutions) lack the agentic loop-breaking behaviour that frontier models like claude-haiku-4-5 have. The gap between v0.2 (30% on the 10-task sample) and v0.3 (0% on the one task run so far), both on the Daytona sandbox, is the headroom a Terminal World Model layer is supposed to close; a like-for-like comparison needs the same task set. v0.4 will run Nl2shellAgent against the full 89-task TB2 and start the v4 dataset targeted at the 18 starved concept tags (multi-turn observation, missing-tool recognition, file-system state).

Reproduce

```bash
pip install -e .  # adds the nl2shell-eval-tb CLI
ollama pull hf.co/AryaYT/nl2shell-0.8b
ollama cp hf.co/AryaYT/nl2shell-0.8b nl2shell:0.8b

# Local Docker
nl2shell-eval-tb --config configs/harbor_nl2shell_smoke.yaml

# Or via Daytona (flip the env field in the YAML; needs DAYTONA_API_KEY)
```

🤖 Generated with Claude Code

24 May 11:25

aryayt

v0.2.0

a16d0a0

v0.2.0 — TerminalBench-2.0 baseline + docker eval-in-a-box

Highlights

TerminalBench-2.0 via Harbor is wired up end-to-end. New nl2shell-eval-tb CLI shells out to harbor run and ingests per-trial outputs into our scoreboard format.
Claude-haiku-4-5 baseline on terminal-bench-sample@2.0 via Daytona cloud sandboxes:

| metric | value |
| --- | --- |
| pass rate | 30.0% (3/10) |
| mean reward | 0.300 |
| wall clock | 21m 02s |
| total cost | $1.63 |
| timeouts | 2 (`qemu-alpine-ssh`, `qemu-startup`) |

Passed: fix-code-vulnerability, log-summary-date-ranges, sqlite-with-gcov.
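
The pass rate and mean reward above can be recomputed from per-trial rewards. The sketch below assumes binary rewards with a pass at reward 1.0 and timeouts scored as 0.0, which is consistent with the table (3/10 and 0.300) but is not confirmed here; `summarize` is a hypothetical helper, not the scoreboard code.

```python
from typing import Dict, List


def summarize(rewards: List[float], pass_threshold: float = 1.0) -> Dict[str, float]:
    """Aggregate per-trial rewards into pass rate and mean reward."""
    if not rewards:
        raise ValueError("no trials to summarize")
    n = len(rewards)
    passed = sum(1 for r in rewards if r >= pass_threshold)
    return {
        "n": n,
        "passed": passed,
        "pass_rate": passed / n,
        "mean_reward": sum(rewards) / n,
    }


# Ten trials: three passes, seven failures (timeouts included as 0.0, by assumption).
summary = summarize([1.0] * 3 + [0.0] * 7)
```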

Reproducible docker eval-in-a-box — docker compose -f docker/docker-compose.yml up brings up Ollama + the bench harness on any machine. Mounts /var/run/docker.sock so Harbor can spawn sibling task containers.
Nl2shellAgent Harbor adapter is on the v0.3 roadmap — that's where nl2shell:0.8b runs as the agent itself and we get the honest TB2 number for our model.

Eval-only reminder

TerminalBench is eval-only material. Outputs from these runs are not folded into model training, DPO, GRPO, or preference data.

Reproduce

```bash
git clone https://github.com/nl2shell/bench.git && cd bench
pip install -e .

export ANTHROPIC_API_KEY=...
export DAYTONA_API_KEY=...
nl2shell-eval-tb --dataset terminal-bench-sample@2.0 \
--agent claude-code --model anthropic/claude-haiku-4-5 \
--env daytona -n 4
```

See `docs/benchmarks.md` for details and `docker/README.md` for the compose-up flow.

Full changelog: https://github.com/nl2shell/bench/blob/main/CHANGELOG.md

24 May 10:28

aryayt

v0.1.0

0fb0bf0

v0.1.0 — multi-metric eval harness, first headline benchmark

Changelog

All notable changes to nl2shell/bench are recorded here. Format follows Keep a Changelog; versioning is Semantic Versioning.

Unreleased

Planned for v0.2

- Filter multi-line shell scripts from the v3 holdout (currently ~3% of cases have gold containing newlines or shebangs that a single-shot model can't reproduce).
- Wire up the TerminalBench (Harbor) adapter. Skeleton exists at `evals/benchmarks/terminal_bench.py`.
- Stop-token tuning on the Ollama client (currently relies only on the `num_predict=96` cap).
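
For context on the stop-token item, here is a rough sketch of what a deterministic Ollama request could look like with an optional stop list added. The `temperature`, `seed`, `num_predict`, and `stop` keys are standard Ollama generation options; the function names, the choice of stop strings, and the first-line extraction rule are assumptions, not the contents of `src/nl2shell/client.py`.

```python
from typing import List, Optional


def build_generate_payload(
    prompt: str,
    model: str = "nl2shell:0.8b",
    stop: Optional[List[str]] = None,
) -> dict:
    """Build a JSON body for Ollama's /api/generate with deterministic options."""
    options = {"temperature": 0, "seed": 42, "num_predict": 96}
    if stop:
        options["stop"] = stop  # e.g. ["\n"]; which strings to use is untested here
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def first_command_line(text: str) -> str:
    """Return the first non-empty line that is not a markdown fence marker."""
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("```"):
            return line
    return ""
```

With only the `num_predict` cap, the model can keep generating after the command ends and the client has to trim the output afterwards; a stop list would end generation at the boundary instead.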

0.1.0 — 2026-05-24

First release. Standalone evaluation harness for the nl2shell:0.8b model.

Added

- `src/nl2shell/concepts.py`: 28 closed-vocabulary regex tags spanning redirection (8), pipes/chaining (5), process control (4), substitution (5), globbing/quoting (4), and tool idioms (2). Plus `operators()` for verbatim symbol extraction, `core_command()` for first-token analysis, `difficulty()` bucketing 1–4.
- `src/nl2shell/label.py`: pure-function row labeller that drops the chat-template field and emits `{nl, bash, source, core_command, operators, postprocessors, concepts, difficulty, unsafe_quoting_detected, hash}`.
- `src/nl2shell/audit.py`: applies the labeller to a HuggingFace dataset (default: `AryaYT/nl2shell-training-v3`), writes labelled JSONL + a per-tag gap report. Marks tags as STARVED (<5%), lean (<15%), or ok (≥15%).
- `src/nl2shell/client.py`: minimal Ollama HTTP client with deterministic options (`temperature=0`, `seed=42`, `num_predict=96`). Health check and first-command-line extraction.
- `src/nl2shell/metrics.py`: per-case scorers: `exact_match`, `template_match` (literal-aware tokenisation), `parses` (`bash -n`), `shellcheck_clean` (when shellcheck installed), `concept_overlap` (Jaccard of v3 concept tags).
- `scripts/build_holdout.py`: stratified sampler that balances the holdout across STARVED tags so per-tag accuracy is informative. Deterministic seeded shuffle.
- `evals/run.py`: end-to-end runner that ingests a benchmark, queries Ollama, scores, aggregates per-tag/per-difficulty/per-source, writes `summary.json` + `per_case.jsonl` + `report.md`.
- `evals/benchmarks/v3_holdout.py`: adapter yielding cases from `data/holdout/holdout.jsonl`.
- `evals/benchmarks/terminal_bench.py`: scaffold + design notes; not yet runnable.
- `tests/`: 16 unit tests covering tagger and metrics.
- CI-ready: `pyproject.toml` with hatchling backend, ruff + pytest dev extras.
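
The STARVED / lean / ok thresholds from the audit step can be illustrated with a small sketch. It assumes the percentage is the share of dataset rows carrying a tag at least once; `gap_report` and the tag names in the usage note are hypothetical and do not come from `src/nl2shell/audit.py`.

```python
from collections import Counter
from typing import Dict, Iterable, List


def gap_report(rows: Iterable[List[str]], vocabulary: List[str]) -> Dict[str, str]:
    """Label each tag by the share of rows that carry it."""
    rows = list(rows)
    # Count each tag once per row, even if a labeller emitted it twice.
    counts = Counter(tag for tags in rows for tag in set(tags))
    report = {}
    for tag in vocabulary:
        share = counts[tag] / len(rows) if rows else 0.0
        if share < 0.05:
            report[tag] = "STARVED"
        elif share < 0.15:
            report[tag] = "lean"
        else:
            report[tag] = "ok"
    return report
```

For example, with 20 rows where a hypothetical `pipe` tag appears in 10, `redirect_out` in 2, and `subshell` in none, the report marks them ok, lean, and STARVED respectively. Iterating over the full vocabulary rather than the observed tags is what lets a tag with zero occurrences show up as STARVED instead of disappearing from the report.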

Headline benchmark — v3-holdout (n=426)

Wall time: 99s on the local box. Median latency: 0.21s/case. p95: 0.41s.

| metric | value |
| --- | --- |
| exact_match | 3.3% |
| template_match | 4.0% |
| parses (`bash -n`) | 96.7% |
| concept-tag Jaccard | 0.310 |
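
The concept-tag Jaccard score is the size of the intersection of predicted and gold tag sets divided by the size of their union. A minimal sketch follows; the handling of two empty sets (scored 1.0 here) is an assumption, and the real `concept_overlap` may treat that case differently.

```python
from typing import Iterable


def concept_overlap(pred_tags: Iterable[str], gold_tags: Iterable[str]) -> float:
    """Jaccard similarity between predicted and gold concept-tag sets."""
    pred, gold = set(pred_tags), set(gold_tags)
    if not pred and not gold:
        return 1.0  # assumption: two tag-free commands agree perfectly
    return len(pred & gold) / len(pred | gold)
```

Unlike exact match, this gives partial credit: a prediction sharing one of three distinct tags with the gold command scores 1/3 rather than 0.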

Per difficulty bucket:

| difficulty | n | exact |
| --- | --- | --- |
| 1 (no operators) | 32 | 21.9% |
| 2 (one operator) | 90 | 5.6% |
| 3 (2–3 operators) | 182 | 1.1% |
| 4 (4+ operators) | 122 | 0.0% |
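
The bucket boundaries in the table, and the way the per-bucket rates roll up into the headline exact_match figure, can be checked with a short sketch. The `difficulty` function below maps an operator count to a bucket following the table's labels; it is an illustration, not the implementation in `src/nl2shell/concepts.py`.

```python
def difficulty(n_operators: int) -> int:
    """Map an operator count to the 1-4 difficulty bucket used in the table."""
    if n_operators < 0:
        raise ValueError("operator count cannot be negative")
    if n_operators == 0:
        return 1
    if n_operators == 1:
        return 2
    if n_operators <= 3:
        return 3
    return 4


# (n, exact-match %) per bucket, copied from the table above.
buckets = {1: (32, 21.9), 2: (90, 5.6), 3: (182, 1.1), 4: (122, 0.0)}

total = sum(n for n, _ in buckets.values())
# Recover whole-case hit counts from the rounded percentages.
hits = sum(round(n * pct / 100) for n, pct in buckets.values())
overall_pct = 100 * hits / total
```

The bucket sizes sum to the holdout size of 426, and the recovered hit counts (7 + 5 + 2 + 0 = 14) give 14/426 ≈ 3.3%, which agrees with the headline exact_match row.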

See results/v3-holdout__20260524T101822Z/report.md for the full per-tag breakdown.

Known limitations

- Exact-match is harsh: gold is canonical-bash-from-noisy-sources; the model often produces more idiomatic shell (e.g. `find . -name "*bar"` vs `find -name *bar`).
- Some gold cases are multi-line shell scripts. A single-shot translator can't reproduce them. To be filtered in v0.2.
- ProgramBench is documented as out-of-scope (whole-program reconstruction; not a NL→bash task).

Releases: nl2shell/bench