feat(eval): reimplement bashkit-eval on the mira framework #2129
Merged
Port the LLM eval harness onto the mira framework (github.com/everruns/mira). bashkit's agent loop becomes a mira Subject, each JSONL task a Sample, and the deterministic expectation checks a Scorer; mira owns the model matrix, scheduling, retries, resume, and reporting (JSON/JUnit/Markdown/HTML).

- New modules: snapshot.rs (walks the VFS + tool trace into the Transcript so a Scorer can inspect post-run state), checks.rs (expectation checks ported verbatim from the old scorer.rs, same pass/fail semantics, with unit tests), mira_study.rs (samples, subjects, the bashkit_expectations scorer, and #[eval] builders).
- Remove the hand-rolled runner/report/scorer and their scripting variants; main.rs now serves the study to the `mira` host over stdio.
- Datasets unchanged, embedded via include_str!. Both eval types ported: bash agent eval (bashkit_bash / bashkit_smoke) and the scripting-tool eval (bashkit_scripting, scripted vs baseline via a `mode` axis).
- Depend on mira-eval 0.2 (crates.io, in-process Subject path only, no mira-everruns); drop clap.
- Update justfile recipes, specs/eval.md, crate + top-level READMEs, architecture.md, and performance-results.md. Historical pre-mira results under crates/bashkit-eval/results/ are kept as an archive and remain the /benches eval input until the site is re-wired to mira's output format (follow-up).
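To make the snapshot-then-score idea concrete, here is a minimal sketch in plain Rust. It does not use mira's actual types or the real `checks.rs` code: `Snapshot`, `Expectation`, and `score` are hypothetical names, the four expectation kinds are illustrative, and the "every expectation must pass" rule is an assumption rather than something this PR text states.

```rust
use std::collections::BTreeMap;

/// Hypothetical post-run state captured for scoring (not mira's Transcript).
#[derive(Debug, Default)]
pub struct Snapshot {
    /// Path -> file contents, as collected by walking the VFS after the run.
    pub files: BTreeMap<String, String>,
    /// Concatenated tool stdout.
    pub stdout: String,
    /// Exit code of the last tool call.
    pub exit_code: i32,
}

/// Hypothetical expectation kinds; the real set in checks.rs is not shown here.
#[derive(Debug)]
pub enum Expectation {
    FileExists(String),
    FileContains { path: String, needle: String },
    StdoutContains(String),
    ExitCode(i32),
}

impl Expectation {
    /// Deterministic check against the captured state only.
    pub fn passes(&self, snap: &Snapshot) -> bool {
        match self {
            Expectation::FileExists(path) => snap.files.contains_key(path),
            Expectation::FileContains { path, needle } => snap
                .files
                .get(path)
                .map_or(false, |body| body.contains(needle.as_str())),
            Expectation::StdoutContains(needle) => snap.stdout.contains(needle.as_str()),
            Expectation::ExitCode(code) => snap.exit_code == *code,
        }
    }
}

/// Assumed rule: a sample passes only if every expectation passes.
pub fn score(expectations: &[Expectation], snap: &Snapshot) -> bool {
    expectations.iter().all(|e| e.passes(snap))
}

fn main() {
    let mut snap = Snapshot::default();
    snap.files.insert("/out/report.txt".to_string(), "total: 3\n".to_string());
    snap.stdout = "done".to_string();

    let expectations = vec![
        Expectation::FileContains {
            path: "/out/report.txt".to_string(),
            needle: "total: 3".to_string(),
        },
        Expectation::ExitCode(0),
    ];
    println!("sample passed: {}", score(&expectations, &snap));
}
```

Because the scorer in this sketch reads only the snapshot, it stays deterministic and can be unit-tested without running a model.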
Update the default model matrix: Opus 4.7 -> 4.8, and use the `claude-haiku-4-5` alias instead of the dated `claude-haiku-4-5-20251001`.
Point mira's saved run folders at crates/bashkit-eval/results/mira/ so runs persist with the repo. Include the first saved run: bashkit_smoke across the updated Anthropic targets (opus-4-8, haiku-4-5, sonnet-4-6), 9/9 cases passed (OpenAI targets skipped — no key).
Full 58-task bashkit_bash run via the mira host. Results: opus-4-8 55/58, haiku-4-5 55/58, sonnet-4-6 49/58 (159 passed / 174 scored; OpenAI + codex targets skipped, no key).
The OpenAI account returning 429 insufficient_quota on every call caused runs to hang in a retry-backoff loop. Fixes:

- Add a shared HTTP client with connect (15s) + total (300s) timeouts; the providers previously used reqwest::Client::new() with no timeout, so a stalled socket could hang a run indefinitely.
- Classify retryability centrally (is_retryable_error): retry only transient errors (rate-limit 429s, 5xx, Anthropic 529); fast-fail on permanent ones (insufficient_quota / billing_hard_limit_reached and auth 401/403) and on 4xx client errors (e.g. bad model id).
- Apply across all three providers (anthropic, openai, openai_responses).
- Unit tests for the classifier; document the guard in mira.toml + spec.
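The classification rule described above can be sketched as a small pure function. This is an illustration under assumptions, not the code from this PR: the `ProviderError` struct and the exact signature of `is_retryable_error` are hypothetical, and the real function may inspect the response differently.

```rust
/// Hypothetical, simplified view of a provider error: the HTTP status plus
/// the provider's machine-readable error code, if one was returned.
#[derive(Debug, Clone, Copy)]
pub struct ProviderError<'a> {
    pub status: u16,
    pub code: Option<&'a str>,
}

/// Returns true only for errors that a later retry could plausibly fix.
pub fn is_retryable_error(err: &ProviderError) -> bool {
    // Quota and billing failures can arrive as HTTP 429, but retrying them
    // never succeeds, so they are checked before the status code.
    if matches!(
        err.code,
        Some("insufficient_quota") | Some("billing_hard_limit_reached")
    ) {
        return false;
    }
    match err.status {
        429 => true,       // genuine rate limiting
        500..=599 => true, // server errors, including Anthropic 529
        _ => false,        // 401/403 auth and other 4xx: fail fast
    }
}

fn main() {
    let cases = [
        ProviderError { status: 429, code: None },
        ProviderError { status: 429, code: Some("insufficient_quota") },
        ProviderError { status: 529, code: None },
        ProviderError { status: 401, code: None },
    ];
    for case in &cases {
        println!("{:?} -> retryable: {}", case, is_retryable_error(case));
    }
}
```

Checking the error code before the status is the important ordering here: a 429 alone is ambiguous between "slow down" and "out of quota".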
Full 58-task bashkit_bash run for the OpenAI targets, now that credits are restored and the providers fast-fail on quota/billing instead of hanging: gpt-5.5 51/58, gpt-5.3-codex 54/58 (105 passed / 116 scored; Anthropic targets skipped — separate run).
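The tallies in the two full-run notes can be cross-checked with a little arithmetic. The sketch below uses only the per-model counts quoted in those notes; the helper names `totals` and `pass_rate` are made up for illustration and are not part of bashkit-eval or mira.

```rust
/// (model label, passed, scored), copied from the Anthropic run note.
const ANTHROPIC: [(&str, u32, u32); 3] = [
    ("opus-4-8", 55, 58),
    ("haiku-4-5", 55, 58),
    ("sonnet-4-6", 49, 58),
];

/// (model label, passed, scored), copied from the OpenAI run note.
const OPENAI: [(&str, u32, u32); 2] = [("gpt-5.5", 51, 58), ("gpt-5.3-codex", 54, 58)];

/// Sums passed and scored counts over a set of per-model rows.
pub fn totals(rows: &[(&str, u32, u32)]) -> (u32, u32) {
    rows.iter()
        .fold((0, 0), |(p, s), &(_, passed, scored)| (p + passed, s + scored))
}

/// Pass rate as a percentage; 0.0 when nothing was scored.
pub fn pass_rate(passed: u32, scored: u32) -> f64 {
    if scored == 0 {
        0.0
    } else {
        100.0 * f64::from(passed) / f64::from(scored)
    }
}

fn main() {
    for (name, rows) in [("anthropic", &ANTHROPIC[..]), ("openai", &OPENAI[..])] {
        let (passed, scored) = totals(rows);
        println!("{name}: {passed}/{scored} ({:.1}%)", pass_rate(passed, scored));
    }
}
```

The sums reproduce the figures in the notes: 159 of 174 for the Anthropic targets and 105 of 116 for the OpenAI targets, or 264 of 290 across all five models.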
- Update top-level README and crate README with the 2026-06-27 mira run matrix (Opus 4.8 / Haiku 4.5 55/58, GPT-5.3-Codex 54/58, GPT-5.5 51/58, Sonnet 4.6 49/58), incl. per-model tokens / tool-call success / duration.
- Note the bashkit gaps the eval surfaced (ls -d #2127, $HOME dir #2128).
- Fix --targets examples everywhere: mira takes exact labels, not globs (README, crate README, justfile, specs/eval.md, source comments).
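The exact-label point is easy to get wrong, so here is a small sketch of what exact matching means in practice. It is not mira's implementation: `select_targets` is a hypothetical helper, the labels are borrowed from the run notes purely as examples, and returning an error for an unknown label is an assumption about desirable behaviour.

```rust
/// Selects targets by exact label. A glob-looking string such as "haiku-*"
/// is treated as a literal label and therefore matches nothing.
pub fn select_targets<'a>(
    available: &[&'a str],
    requested: &[&str],
) -> Result<Vec<&'a str>, String> {
    let mut selected = Vec::new();
    for want in requested {
        match available.iter().find(|label| **label == *want) {
            Some(label) => selected.push(*label),
            None => return Err(format!("unknown target label: {want}")),
        }
    }
    Ok(selected)
}

fn main() {
    let available = ["opus-4-8", "haiku-4-5", "sonnet-4-6"];
    println!("{:?}", select_targets(&available, &["haiku-4-5"]));
    println!("{:?}", select_targets(&available, &["haiku-*"]));
}
```

Under this scheme each model has to be listed by its full label, which is why glob-style examples in the docs needed correcting.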
Deployment status

| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ✅ Deployment successful | bashkit | 2ae0c6f | Jun 27 2026, 01:03 AM |
Add the mira-eval / mira-macros / inventory entries to Cargo.lock (exact crates.io checksums) and update bashkit-eval's locked deps (drop clap, add mira-eval). Add safe-to-deploy cargo-vet exemptions for the three new crates (inventory's rustversion dep is already exempt). Resolves the two CI prerequisites for the mira eval migration.
What
Reimplements `bashkit-eval` as a mira eval study (replacing the hand-rolled harness), refreshes the model matrix, hardens the LLM providers against quota/billing hangs, and lands a fresh full results run.
Why
mira gives us the model matrix, scheduling, retries, resume, and JSON/JUnit/Markdown/HTML reporting for free, so bashkit only has to supply the subject under test and the scoring. The old `runner.rs`/`report.rs` machinery goes away.
How
- `src/mira_study.rs`: each JSONL task → a mira `Sample`; bashkit's agent loop over a persistent VFS → a mira `Subject`; the deterministic expectation checks → a mira `Scorer`. The provider stack (Anthropic Messages, OpenAI Chat, OpenAI Responses) and the agent loop are reused. Depends only on `mira-eval 0.2` (crates.io): in-process Subject path, no `mira-everruns`.
- `src/snapshot.rs`: a mira `Scorer` only sees `&Sample` + `&Transcript`, so after each run the subject walks the VFS into `transcript.files` and records a `Snapshot` (tool stdout/stderr/exit + dir set) in `transcript.metadata`. The checks (`src/checks.rs`) are ported verbatim from the old `scorer.rs`, with byte-for-byte the same pass/fail semantics.
- `bashkit_bash` (58 tasks, tagged by category), `bashkit_smoke` (3), `bashkit_scripting` (scripted vs baseline via a `mode` axis). Datasets unchanged, embedded via `include_str!`.
- `src/provider/`: shared HTTP client with connect (15s) and total (300s) timeouts, plus a central `is_retryable_error()` that retries only transient errors (rate-limit 429s / 5xx / Anthropic 529) and fast-fails on permanent ones: `insufficient_quota`, billing limits, auth (401/403), 4xx. This fixes an observed multi-minute hang when an OpenAI account was out of quota.
- Removed `runner.rs`/`report.rs`/`scorer.rs` (+ scripting variants); `main.rs` now serves the study to the `mira` host over stdio. Added `mira.toml` (results dir + documented knobs). Updated `specs/eval.md`, `specs/architecture.md`, `specs/performance-results.md`, `AGENTS.md`, both READMEs, and the justfile.

Results: full `bashkit_bash` (58 tasks), run 2026-06-27 on mira

| Model | Passed |
|---|---|
| Opus 4.8 | 55/58 |
| Haiku 4.5 | 55/58 |
| GPT-5.3-Codex | 54/58 |
| GPT-5.5 | 51/58 |
| Sonnet 4.6 | 49/58 |

Saved under `crates/bashkit-eval/results/mira/`. `bashkit_smoke` is 9/9 across the three Anthropic models.
Tests / verification
- Unit tests for the checks and the retry classifier (incl. the `insufficient_quota`-not-retryable case).
- Ran `bashkit_smoke` and the full `bashkit_bash` across all 5 models.
- `cargo clippy -p bashkit-eval` clean; `rustfmt --check` clean.

bashkit gaps the eval surfaced (filed, not fixed here)

- #2127: `ls -d`/`--directory` not implemented.
- #2128: `username(...)` doesn't provision `/home/<user>`, so writes to `$HOME` fail (root cause of `file_path_organizer` failing on all 5 models).

This was developed in a sandbox whose egress blocks GitHub git, so the `monty`/`jiter` git deps can't be resolved here. Two CI-side items remain:

- Update `Cargo.lock` to include `mira-eval` (and transitives `mira-macros`, `inventory`). A normal `cargo build` / `just build` does this.
- Add `cargo vet` entries/exemptions for those new crates (the `cargo vet --locked` CI gate).

Everything else (code, docs, results, tests via a scoped local build) is done.
Generated by Claude Code