feat(eval): reimplement bashkit-eval on the mira framework #2129

Merged
chaliy merged 8 commits into main from claude/evals-migrate-mira-dhrrpz
Jun 27, 2026
Conversation

@chaliy chaliy commented Jun 27, 2026

Contributor

What

Reimplements bashkit-eval as a mira eval
study (replacing the hand-rolled harness), refreshes the model matrix, hardens
the LLM providers against quota/billing hangs, and lands a fresh full results run.

Why

mira gives us the model matrix, scheduling, retries, resume, and
JSON/JUnit/Markdown/HTML reporting for free, so bashkit only has to supply the
subject under test and the scoring. The old runner.rs/report.rs machinery
goes away.

How

  • mira integration (src/mira_study.rs): each JSONL task → a mira Sample;
    bashkit's agent loop over a persistent VFS → a mira Subject; the deterministic
    expectation checks → a mira Scorer. The provider stack (Anthropic Messages,
    OpenAI Chat, OpenAI Responses) and the agent loop are reused. Depends only on
    mira-eval 0.2 (crates.io) — in-process Subject path, no mira-everruns.
  • Snapshot scoring (src/snapshot.rs): a mira Scorer only sees
    &Sample + &Transcript, so after each run the subject walks the VFS into
    transcript.files and records a Snapshot (tool stdout/stderr/exit + dir set)
    in transcript.metadata. The checks (src/checks.rs) are ported verbatim from
    the old scorer.rs — byte-for-byte the same pass/fail semantics.
  • Three evals: bashkit_bash (58 tasks, tagged by category),
    bashkit_smoke (3), bashkit_scripting (scripted vs baseline via a mode
    axis). Datasets unchanged, embedded via include_str!.
  • Provider hardening (src/provider/): shared HTTP client with connect (15s) +
    total (300s) timeouts; central is_retryable_error() that retries only
    transient errors (rate-limit 429s / 5xx / Anthropic 529) and fast-fails on
    permanent ones: insufficient_quota, billing limits, auth (401/403), 4xx.
    This fixes an observed multi-minute hang when an OpenAI account was out of quota.
  • Removed runner.rs / report.rs / scorer.rs (+ scripting variants); main.rs
    now serves the study to the mira host over stdio. Added mira.toml (results
    dir + documented knobs). Updated specs/eval.md, specs/architecture.md,
    specs/performance-results.md, AGENTS.md, both READMEs, and the justfile.
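The retry classification described in the provider-hardening bullet can be sketched roughly as follows. This is an illustration only, not the code in src/provider/: the `ApiError` struct and its fields are hypothetical, and the real `is_retryable_error()` may take different inputs. Only the rules (retry 429 / 5xx / 529, fast-fail on quota, billing, auth, and other 4xx) come from the description above.

```rust
/// Hypothetical error shape for illustration; the real provider error type may differ.
#[derive(Debug, Clone)]
pub struct ApiError {
    /// HTTP status code returned by the provider.
    pub status: u16,
    /// Provider error code from the response body, if any (e.g. "insufficient_quota").
    pub code: Option<String>,
}

/// Returns true only for transient failures that are worth retrying.
pub fn is_retryable_error(err: &ApiError) -> bool {
    // Permanent account problems fast-fail even when they arrive as HTTP 429.
    if let Some(code) = err.code.as_deref() {
        if matches!(code, "insufficient_quota" | "billing_hard_limit_reached") {
            return false;
        }
    }
    match err.status {
        429 => true,       // plain rate limit
        500..=599 => true, // server errors, which also covers Anthropic 529
        _ => false,        // auth (401/403) and other 4xx client errors
    }
}
```

Checking the error code before the status is the point of the sketch: an out-of-quota response carries status 429, so a status-only classifier would keep retrying it.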

Results — full bashkit_bash (58 tasks), run 2026-06-27 on mira

| Model | Score | Passed | Tool-call success | Duration |
|---|---|---|---|---|
| Claude Opus 4.8 | 95% | 55/58 | 96% | 12.8 min |
| Claude Haiku 4.5 | 95% | 55/58 | 94% | 7.4 min |
| GPT-5.3-Codex | 93% | 54/58 | 85% | 12.1 min |
| GPT-5.5 | 88% | 51/58 | 90% | 8.2 min |
| Claude Sonnet 4.6 | 84% | 49/58 | 93% | 19.9 min |

Saved under crates/bashkit-eval/results/mira/. bashkit_smoke is 9/9 across the
three Anthropic models.
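The Score column is consistent with passed/58 rounded to the nearest whole percent. A small sketch of that arithmetic, assuming this is how the score is derived (mira may compute it differently, for example as a mean of per-sample scores); `score_pct` is a hypothetical helper, not part of bashkit-eval:

```rust
/// Rounds passed/total to the nearest whole percent.
/// Hypothetical helper for illustration only.
pub fn score_pct(passed: u32, total: u32) -> u32 {
    assert!(total > 0, "total must be non-zero");
    ((passed as f64 / total as f64) * 100.0).round() as u32
}
```

For example, 49 of 58 is about 84.5%, which rounds to the 84% reported for Claude Sonnet 4.6.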

Tests / verification

  • 27 lib tests green (check semantics, sample loading, eval builders, retry
    classifier — incl. the insufficient_quota-not-retryable case).
  • Live end-to-end: bashkit_smoke and full bashkit_bash across all 5 models.
  • cargo clippy -p bashkit-eval clean; rustfmt --check clean.

bashkit gaps the eval surfaced (filed, not fixed here)

  • ls -d (#2127)
  • $HOME dir (#2128)

⚠️ Before merge (needs a github-git-capable environment)

This was developed in a sandbox whose egress blocks GitHub git, so the
monty/jiter git deps can't be resolved here. Two CI-side items remain:

  1. Regenerate Cargo.lock to include mira-eval (and transitives
    mira-macros, inventory). A normal cargo build/just build does this.
  2. cargo vet entries/exemptions for those new crates (the cargo vet --locked CI gate).

Everything else (code, docs, results, tests via a scoped local build) is done.


Generated by Claude Code

chaliy added 7 commits June 27, 2026 00:51
Port the LLM eval harness onto the mira framework
(github.com/everruns/mira). bashkit's agent loop becomes a mira
Subject, each JSONL task a Sample, and the deterministic expectation
checks a Scorer; mira owns the model matrix, scheduling, retries,
resume, and reporting (JSON/JUnit/Markdown/HTML).

- New modules: snapshot.rs (walks the VFS + tool trace into the
  Transcript so a Scorer can inspect post-run state), checks.rs
  (expectation checks ported verbatim from the old scorer.rs, same
  pass/fail semantics, with unit tests), mira_study.rs (samples,
  subjects, the bashkit_expectations scorer, and #[eval] builders).
- Remove the hand-rolled runner/report/scorer and their scripting
  variants; main.rs now serves the study to the `mira` host over stdio.
- Datasets unchanged, embedded via include_str!. Both eval types
  ported: bash agent eval (bashkit_bash / bashkit_smoke) and the
  scripting-tool eval (bashkit_scripting, scripted vs baseline via a
  `mode` axis).
- Depend on mira-eval 0.2 (crates.io, in-process Subject path only, no
  mira-everruns); drop clap.
- Update justfile recipes, specs/eval.md, crate + top-level READMEs,
  architecture.md, and performance-results.md.
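The snapshot-then-check flow from the snapshot.rs / checks.rs bullet could look roughly like the sketch below. Every type here (`ToolCall`, `Snapshot`, `Check`) and both functions are hypothetical stand-ins: they are neither mira-eval's API nor the actual checks in checks.rs. The sketch only shows the idea that a scorer works purely from recorded post-run state (files, tool output, directory set) rather than from a live VFS.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One recorded tool invocation (hypothetical shape).
#[derive(Debug, Clone)]
pub struct ToolCall {
    pub stdout: String,
    pub stderr: String,
    pub exit_code: i32,
}

/// Post-run state captured by the subject (hypothetical shape).
#[derive(Debug, Clone, Default)]
pub struct Snapshot {
    pub tool_calls: Vec<ToolCall>,
    pub dirs: BTreeSet<String>,
}

/// A few illustrative deterministic checks; the real set is not shown here.
#[derive(Debug, Clone)]
pub enum Check {
    FileContains { path: String, needle: String },
    DirExists(String),
    StdoutContains(String),
    LastExitCode(i32),
}

/// Evaluates one check against the recorded files and snapshot.
pub fn check_passes(check: &Check, files: &BTreeMap<String, String>, snap: &Snapshot) -> bool {
    match check {
        Check::FileContains { path, needle } => files
            .get(path)
            .map_or(false, |body| body.contains(needle.as_str())),
        Check::DirExists(dir) => snap.dirs.contains(dir),
        Check::StdoutContains(needle) => snap
            .tool_calls
            .iter()
            .any(|call| call.stdout.contains(needle.as_str())),
        Check::LastExitCode(code) => snap
            .tool_calls
            .last()
            .map_or(false, |call| call.exit_code == *code),
    }
}

/// A sample passes only if every one of its checks passes.
pub fn all_checks_pass(checks: &[Check], files: &BTreeMap<String, String>, snap: &Snapshot) -> bool {
    checks.iter().all(|check| check_passes(check, files, snap))
}
```

Because the functions take only recorded data, they can be unit-tested without running an agent loop.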

Historical pre-mira results under crates/bashkit-eval/results/ are kept
as an archive and remain the /benches eval input until the site is
re-wired to mira's output format (follow-up).

Update the default model matrix: Opus 4.7 -> 4.8, and use the
`claude-haiku-4-5` alias instead of the dated `claude-haiku-4-5-20251001`.

Point mira's saved run folders at crates/bashkit-eval/results/mira/ so
runs persist with the repo. Include the first saved run: bashkit_smoke
across the updated Anthropic targets (opus-4-8, haiku-4-5, sonnet-4-6),
9/9 cases passed (OpenAI targets skipped: no key).

Full 58-task bashkit_bash run via the mira host. Results:
opus-4-8 55/58, haiku-4-5 55/58, sonnet-4-6 49/58
(159 passed / 174 scored; OpenAI + codex targets skipped, no key).

The OpenAI account returning 429 insufficient_quota on every call caused
runs to hang in a retry-backoff loop. Fixes:

- Add a shared HTTP client with connect (15s) + total (300s) timeouts;
  the providers previously used reqwest::Client::new() with no timeout, so
  a stalled socket could hang a run indefinitely.
- Classify retryability centrally (is_retryable_error): retry only
  transient errors (rate-limit 429s, 5xx, Anthropic 529); fast-fail on
  permanent ones — insufficient_quota / billing_hard_limit_reached and
  auth (401/403) — and on 4xx client errors (e.g. bad model id).
- Apply across all three providers (anthropic, openai, openai_responses).
- Unit tests for the classifier; document the guard in mira.toml + spec.

Full 58-task bashkit_bash run for the OpenAI targets, now that credits are
restored and the providers fast-fail on quota/billing instead of hanging:
gpt-5.5 51/58, gpt-5.3-codex 54/58 (105 passed / 116 scored; Anthropic
targets skipped, separate run).

- Update top-level README and crate README with the 2026-06-27 mira run
  matrix (Opus 4.8 / Haiku 4.5 55/58, GPT-5.3-Codex 54/58, GPT-5.5 51/58,
  Sonnet 4.6 49/58), incl. per-model tokens / tool-call success / duration.
- Note the bashkit gaps the eval surfaced (ls -d #2127, $HOME dir #2128).
- Fix --targets examples everywhere: mira takes exact labels, not globs
  (README, crate README, justfile, specs/eval.md, source comments).

cloudflare-workers-and-pages Bot commented Jun 27, 2026

Deploying with Cloudflare Workers

| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ✅ Deployment successful! | bashkit | 2ae0c6f | Jun 27 2026, 01:03 AM |

Add the mira-eval / mira-macros / inventory entries to Cargo.lock (exact
crates.io checksums) and update bashkit-eval's locked deps (drop clap, add
mira-eval). Add safe-to-deploy cargo-vet exemptions for the three new crates
(inventory's rustversion dep is already exempt).

Resolves the two CI prerequisites for the mira eval migration.
@chaliy chaliy merged commit 87b8bc3 into main Jun 27, 2026
35 checks passed
@chaliy chaliy deleted the claude/evals-migrate-mira-dhrrpz branch June 27, 2026 03:45