feat(eval): reimplement bashkit-eval on the mira framework by chaliy · Pull Request #2129 · everruns/bashkit

chaliy · 2026-06-27T00:54:09Z

What

Reimplements bashkit-eval as a mira eval
study (replacing the hand-rolled harness), refreshes the model matrix, hardens
the LLM providers against quota/billing hangs, and lands a fresh full results run.

Why

mira gives us the model matrix, scheduling, retries, resume, and
JSON/JUnit/Markdown/HTML reporting for free, so bashkit only has to supply the
subject under test and the scoring. The old runner.rs/report.rs machinery
goes away.

How

- mira integration (src/mira_study.rs): each JSONL task → a mira Sample;
  bashkit's agent loop over a persistent VFS → a mira Subject; the deterministic
  expectation checks → a mira Scorer. The provider stack (Anthropic Messages,
  OpenAI Chat, OpenAI Responses) and the agent loop are reused. Depends only on
  mira-eval 0.2 (crates.io), using the in-process Subject path; no mira-everruns.
- Snapshot scoring (src/snapshot.rs): a mira Scorer only sees
  &Sample + &Transcript, so after each run the subject walks the VFS into
  transcript.files and records a Snapshot (tool stdout/stderr/exit + dir set)
  in transcript.metadata. The checks (src/checks.rs) are ported verbatim from
  the old scorer.rs, so the pass/fail semantics are byte-for-byte the same.
- Three evals: bashkit_bash (58 tasks, tagged by category),
  bashkit_smoke (3), bashkit_scripting (scripted vs baseline via a mode
  axis). Datasets unchanged, embedded via include_str!.
- Provider hardening (src/provider/): shared HTTP client with connect (15s)
  + total (300s) timeouts; central is_retryable_error() that retries only
  transient errors (rate-limit 429s / 5xx / Anthropic 529) and fast-fails on
  permanent ones: insufficient_quota, billing limits, auth (401/403), and other
  4xx client errors. This fixes an observed multi-minute hang when an OpenAI
  account was out of quota.
- Removed runner.rs / report.rs / scorer.rs (+ scripting variants); main.rs
  now serves the study to the mira host over stdio. Added mira.toml (results
  dir + documented knobs). Updated specs/eval.md, specs/architecture.md,
  specs/performance-results.md, AGENTS.md, both READMEs, and the justfile.
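The snapshot-scoring idea above can be illustrated with a small standalone sketch. It assumes nothing about mira's real API: `Transcript`, `Snapshot`, `Check`, `check_passes`, and `score` are hypothetical stand-ins defined here purely for illustration, and the actual types in src/snapshot.rs and src/checks.rs may look quite different. The point is only that once post-run VFS state is copied into the transcript, scoring becomes a pure function of that data.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the post-run state a scorer can see.
/// These are NOT mira's real types; all field names are assumptions.
#[derive(Debug, Default)]
pub struct Transcript {
    /// Path -> contents, captured by walking the VFS after the run.
    pub files: BTreeMap<String, String>,
    /// Tool output plus the set of directories that exist after the run.
    pub snapshot: Snapshot,
}

#[derive(Debug, Default)]
pub struct Snapshot {
    pub stdout: String,
    pub stderr: String,
    pub exit_code: i32,
    pub dirs: BTreeSet<String>,
}

/// Hypothetical expectation kinds; each is evaluated deterministically.
#[derive(Debug)]
pub enum Check {
    ExitCode(i32),
    StdoutContains(String),
    FileExists(String),
    FileContains { path: String, needle: String },
    DirExists(String),
}

/// Evaluate one expectation against the recorded state only.
pub fn check_passes(check: &Check, t: &Transcript) -> bool {
    match check {
        Check::ExitCode(code) => t.snapshot.exit_code == *code,
        Check::StdoutContains(s) => t.snapshot.stdout.contains(s.as_str()),
        Check::FileExists(p) => t.files.contains_key(p),
        Check::FileContains { path, needle } => t
            .files
            .get(path)
            .map_or(false, |body| body.contains(needle.as_str())),
        Check::DirExists(d) => t.snapshot.dirs.contains(d),
    }
}

/// Returns (passed, total) for one sample's list of expectations.
pub fn score(checks: &[Check], t: &Transcript) -> (usize, usize) {
    let passed = checks.iter().filter(|c| check_passes(c, t)).count();
    (passed, checks.len())
}
```

Because the scorer in this sketch never touches a live filesystem, the same transcript always yields the same result, which is the property that lets checks be ported without changing pass/fail behavior.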

Results — full `bashkit_bash` (58 tasks), run 2026-06-27 on mira

Model	Score	Passed	Tool-call success	Duration
Claude Opus 4.8	95%	55/58	96%	12.8 min
Claude Haiku 4.5	95%	55/58	94%	7.4 min
GPT-5.3-Codex	93%	54/58	85%	12.1 min
GPT-5.5	88%	51/58	90%	8.2 min
Claude Sonnet 4.6	84%	49/58	93%	19.9 min

Saved under crates/bashkit-eval/results/mira/. bashkit_smoke is 9/9 across the
three Anthropic models.
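As a quick cross-check of the table, the Score column is consistent with passed/total rounded to the nearest whole percent. The sketch below assumes that rounding rule (the PR does not state how mira computes the figure), and `score_percent` is a hypothetical helper name.

```rust
/// Hypothetical helper: pass rate as a whole-number percentage,
/// rounded to nearest. The rounding rule is an assumption that
/// happens to match the table above.
pub fn score_percent(passed: u32, total: u32) -> u32 {
    ((passed as f64 / total as f64) * 100.0).round() as u32
}

/// Sum of passed cases across several runs of the same task set.
pub fn total_passed(passed_per_model: &[u32]) -> u32 {
    passed_per_model.iter().sum()
}
```

For example, `score_percent(55, 58)` gives 95 and `score_percent(49, 58)` gives 84, matching the Opus/Haiku and Sonnet rows.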

Tests / verification

- 27 lib tests green (check semantics, sample loading, eval builders, retry
  classifier, incl. the insufficient_quota-not-retryable case).
- Live end-to-end: bashkit_smoke and full bashkit_bash across all 5 models.
- cargo clippy -p bashkit-eval clean; rustfmt --check clean.

bashkit gaps the eval surfaced (filed, not fixed here)

- #2127 (bug(ls): -d / --directory not implemented): ls does not support
  -d / --directory.
- #2128 (bug(vfs): user home directory not created for username(...)):
  username(...) doesn't provision /home/<user>, so writes to $HOME fail
  (root cause of file_path_organizer failing on all 5 models).

⚠️ Before merge (needs a github-git-capable environment)

This was developed in a sandbox whose egress blocks GitHub git, so the
monty/jiter git deps can't be resolved here. Two CI-side items remain:

- Regenerate Cargo.lock to include mira-eval (and transitives
  mira-macros, inventory). A normal cargo build / just build does this.
- Add cargo vet entries/exemptions for those new crates (the
  cargo vet --locked CI gate).

Everything else (code, docs, results, tests via a scoped local build) is done.

Generated by Claude Code

Port the LLM eval harness onto the mira framework (github.com/everruns/mira). bashkit's agent loop becomes a mira Subject, each JSONL task a Sample, and the deterministic expectation checks a Scorer; mira owns the model matrix, scheduling, retries, resume, and reporting (JSON/JUnit/Markdown/HTML).

- New modules: snapshot.rs (walks the VFS + tool trace into the Transcript so a Scorer can inspect post-run state), checks.rs (expectation checks ported verbatim from the old scorer.rs, same pass/fail semantics, with unit tests), mira_study.rs (samples, subjects, the bashkit_expectations scorer, and #[eval] builders).
- Remove the hand-rolled runner/report/scorer and their scripting variants; main.rs now serves the study to the `mira` host over stdio.
- Datasets unchanged, embedded via include_str!. Both eval types ported: bash agent eval (bashkit_bash / bashkit_smoke) and the scripting-tool eval (bashkit_scripting, scripted vs baseline via a `mode` axis).
- Depend on mira-eval 0.2 (crates.io, in-process Subject path only, no mira-everruns); drop clap.
- Update justfile recipes, specs/eval.md, crate + top-level READMEs, architecture.md, and performance-results.md.

Historical pre-mira results under crates/bashkit-eval/results/ are kept as an archive and remain the /benches eval input until the site is re-wired to mira's output format (follow-up).

Update the default model matrix: Opus 4.7 -> 4.8, and use the `claude-haiku-4-5` alias instead of the dated `claude-haiku-4-5-20251001`.

Point mira's saved run folders at crates/bashkit-eval/results/mira/ so runs persist with the repo. Include the first saved run: bashkit_smoke across the updated Anthropic targets (opus-4-8, haiku-4-5, sonnet-4-6), 9/9 cases passed (OpenAI targets skipped — no key).

Full 58-task bashkit_bash run via the mira host. Results: opus-4-8 55/58, haiku-4-5 55/58, sonnet-4-6 49/58 (159 passed / 174 scored; OpenAI + codex targets skipped, no key).

The OpenAI account returning 429 insufficient_quota on every call caused runs to hang in a retry-backoff loop. Fixes:

- Add a shared HTTP client with connect (15s) + total (300s) timeouts; the providers previously used reqwest::Client::new() with no timeout, so a stalled socket could hang a run indefinitely.
- Classify retryability centrally (is_retryable_error): retry only transient errors (rate-limit 429s, 5xx, Anthropic 529); fast-fail on permanent ones (insufficient_quota / billing_hard_limit_reached and auth 401/403) and on other 4xx client errors (e.g. bad model id).
- Apply across all three providers (anthropic, openai, openai_responses).
- Unit tests for the classifier; document the guard in mira.toml + spec.
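The retry classification described above might look roughly like the following. This is a minimal sketch, not the code in src/provider/: the function name `is_retryable_error` comes from the PR, but its signature here (HTTP status plus raw error body), the substring matching on the body, and the `call_with_retry` wrapper are all assumptions made for illustration.

```rust
/// Sketch of a central retry classifier. The name mirrors the PR's
/// `is_retryable_error`; the signature and the body-matching strategy
/// are assumptions, not the actual implementation.
pub fn is_retryable_error(status: u16, body: &str) -> bool {
    // Quota/billing failures can arrive as HTTP 429, so the error body
    // is inspected before the status code.
    const PERMANENT_MARKERS: [&str; 2] =
        ["insufficient_quota", "billing_hard_limit_reached"];
    if PERMANENT_MARKERS.iter().any(|m| body.contains(*m)) {
        return false;
    }
    match status {
        429 => true,       // genuine rate limiting
        500..=599 => true, // server errors, including Anthropic's 529
        _ => false,        // 401/403 auth and other 4xx client errors
    }
}

/// Hypothetical wrapper: retry a call up to `max_attempts` times, but
/// return immediately when the error is classified as permanent.
pub fn call_with_retry<T>(
    max_attempts: u32,
    mut call: impl FnMut() -> Result<T, (u16, String)>,
) -> Result<T, (u16, String)> {
    let mut attempt = 1;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err((status, body)) => {
                if attempt >= max_attempts || !is_retryable_error(status, &body) {
                    return Err((status, body));
                }
                // A real client would sleep with backoff here.
                attempt += 1;
            }
        }
    }
}
```

Checking the body before the status is the design choice that matters for the hang described in this commit: a classifier that looks only at the 429 status would treat an out-of-quota account as rate-limited and keep backing off.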

Full 58-task bashkit_bash run for the OpenAI targets, now that credits are restored and the providers fast-fail on quota/billing instead of hanging: gpt-5.5 51/58, gpt-5.3-codex 54/58 (105 passed / 116 scored; Anthropic targets skipped — separate run).

- Update top-level README and crate README with the 2026-06-27 mira run matrix (Opus 4.8 / Haiku 4.5 55/58, GPT-5.3-Codex 54/58, GPT-5.5 51/58, Sonnet 4.6 49/58), incl. per-model tokens / tool-call success / duration.
- Note the bashkit gaps the eval surfaced (ls -d #2127, $HOME dir #2128).
- Fix --targets examples everywhere: mira takes exact labels, not globs (README, crate README, justfile, specs/eval.md, source comments).
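To illustrate the exact-label point, here is a hedged sketch of what exact matching implies for target selection. `select_targets` is a hypothetical helper, not mira's API, and the label strings used in the usage note are taken from the run summaries in this PR; the exact form mira expects on the command line may differ.

```rust
use std::collections::BTreeSet;

/// Resolve requested target labels against the configured matrix by exact
/// string equality. A pattern such as "claude-*" is reported as an unknown
/// label rather than expanded. (Hypothetical helper, not mira's API.)
pub fn select_targets<'a>(
    available: &[&'a str],
    requested: &[&str],
) -> Result<Vec<&'a str>, Vec<String>> {
    let wanted: BTreeSet<&str> = requested.iter().copied().collect();
    let known: BTreeSet<&str> = available.iter().copied().collect();

    // Anything requested that is not literally in the matrix is an error.
    let unknown: Vec<String> = wanted
        .difference(&known)
        .map(|label| label.to_string())
        .collect();
    if !unknown.is_empty() {
        return Err(unknown);
    }

    // Preserve the matrix order for the selected targets.
    Ok(available
        .iter()
        .copied()
        .filter(|label| wanted.contains(*label))
        .collect())
}
```

With a matrix of `["opus-4-8", "haiku-4-5", "sonnet-4-6"]`, requesting `["sonnet-4-6", "opus-4-8"]` selects those two in matrix order, while requesting `["claude-*"]` yields an error listing `claude-*` as unknown.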

cloudflare-workers-and-pages · 2026-06-27T00:54:35Z

Deploying with Cloudflare Workers

Status	Name	Latest Commit	Updated (UTC)
✅ Deployment successful!	bashkit	`2ae0c6f`	Jun 27 2026, 01:03 AM

Add the mira-eval / mira-macros / inventory entries to Cargo.lock (exact crates.io checksums) and update bashkit-eval's locked deps (drop clap, add mira-eval). Add safe-to-deploy cargo-vet exemptions for the three new crates (inventory's rustversion dep is already exempt). Resolves the two CI prerequisites for the mira eval migration.

chaliy added 7 commits June 27, 2026 00:51

chore(eval): bump default eval targets

ab151b4

chore(eval): save bashkit_bash run (3 Anthropic models)

702561f

chaliy merged commit 87b8bc3 into main Jun 27, 2026
35 checks passed

chaliy deleted the claude/evals-migrate-mira-dhrrpz branch June 27, 2026 03:45

chaliy merged 8 commits into main from claude/evals-migrate-mira-dhrrpz
