benchflow-ai · xdotli · Jun 13, 2026 · Jun 11, 2026 · Jun 11, 2026 · Jun 11, 2026
diff --git a/.claude/dev-docs/0.3-plan.md b/.claude/dev-docs/0.3-plan.md
diff --git a/.claude/dev-docs/labs.md b/.claude/dev-docs/labs.md
@@ -2,87 +2,15 @@
 
 Runnable, Docker-heavy experiments that exercise the full benchflow SDK end-to-end. Labs are distinct from unit tests (real Docker, no mocking) and from docs (executable, with expected output). Each lab is self-contained with its own README and orchestrator script.
 
-Labs live under [`labs/`](../labs/).
+> **Historical (0.2.x-era).** These labs are archived under [`docs/labs/`](../../docs/labs/). They compare benchflow 0.2.0 against 0.2.1/0.2.2 and are kept as cited security evidence; the hardening they validate still ships. The public write-up is [`docs/sandbox-hardening.md`](../../docs/sandbox-hardening.md).
 
-| Lab                                                         | Question summary                                                                 | Benchflow versions | API key needed               |
-| ----------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------ | ---------------------------- |
-| [benchjack-sandbox-hardening](#benchjack-sandbox-hardening) | Does 0.2.1 block BenchJack exploits that succeed under 0.2.0                     | 0.2.0 vs 0.2.1     | No                           |
-| [reward-hack-matrix](#reward-hack-matrix)                   | Do the same exploits succeed on real benchmark tasks, and does 0.2.2 block them? | 0.2.0 vs 0.2.2     | Optional (`DAYTONA_API_KEY`) |
+| Lab | Question summary | Benchflow versions | API key needed |
+| --- | --- | --- | --- |
+| [`benchjack-sandbox-hardening`](../../docs/labs/benchjack-sandbox-hardening/) | Does 0.2.1 block BenchJack exploits that succeed under 0.2.0? | 0.2.0 vs 0.2.1 | No |
+| [`reward-hack-matrix`](../../docs/labs/reward-hack-matrix/) | Do the same exploits succeed on real benchmark tasks, and does 0.2.2 block them? | 0.2.0 vs 0.2.2 | Optional (`DAYTONA_API_KEY`) |
 
----
-
-## benchjack-sandbox-hardening
-
-**Question:** Does sandbox hardening in benchflow 0.2.1 block BenchJack-style exploits that succeed under 0.2.0?
-
-**Location:** [`labs/benchjack-sandbox-hardening/`](../labs/benchjack-sandbox-hardening/)
-
-**Prerequisites:**
-
-- Docker daemon
-- Python 3.12+
-- `uv` on PATH
-- Network access to PyPI
-- No API keys required (uses the `oracle` agent)
-
-**Run:**
-
-```sh
-python3 labs/benchjack-sandbox-hardening/run_comparison.py
-```
-
-- `--clean` — delete `.venvs/` and `.jobs/` before running
-- First run is ~5 min (Docker builds + pip installs); subsequent runs use cached `.venvs/` (~1 min)
-
-**Key takeaways:**
-
-- Three exploit patterns (P1 conftest-hook, P2 answer-lookup, P7 pth-injection) flip reward from 0.0 → 1.0 against benchflow 0.2.0 and are blocked under 0.2.1 (reward stays 0.0).
-- Defenses are layered: `chmod 700` on `/tests` and `/solution`, non-root `sandbox_user`, and pre-verify conftest cleanup.
-- The `oracle` agent executes `solution/solve.sh` directly — deterministic and free of API costs. Swap `agent="oracle"` for `agent="claude-agent-acp"` in `_attack_runner.py` to test with a real LLM.
-
-**Related:** `comparison.ipynb` — narrative deep-dive into P1; run `run_comparison.py` first, then open with:
-
-```sh
-uv run --with jupyter jupyter notebook labs/benchjack-sandbox-hardening/comparison.ipynb
-```
-
----
-
-## reward-hack-matrix
-
-**Question:** Do the same BenchJack exploits succeed on real production benchmark tasks, and does benchflow 0.2.2's hardening block them there too?
-
-**Location:** [`labs/reward-hack-matrix/`](../labs/reward-hack-matrix/)
-
-**Prerequisites:**
-
-- `DAYTONA_API_KEY` (default) or Docker daemon (pass `--env docker`)
-- Python 3.12+
-- `uv` on PATH
-- Network access to PyPI and GitHub
-- Corpora must be cloned first:
-  ```sh
-  cd labs/reward-hack-matrix && ./fetch_corpora.sh
-  ```
-
-**Run:**
-
-```sh
-python labs/reward-hack-matrix/run_matrix.py
-```
-
-- `--cells "P1@swebench-verified/astropy__astropy-12907"` — run a single cell
-- `--sweep` — enumerate all tasks across all three corpora
-- `--clean` — remove `.venvs/`, `.jobs/`, and `.cells/`
-
-**Key takeaways:**
-
-- One tailored exploit per benchmark (P1 conftest-hook for swebench-verified, P7 pth-injection for skillsbench, P7 path-trojan for terminal-bench-2) achieves reward 1.0 against 0.2.0 and is blocked to 0.0 under 0.2.2.
-- Each benchmark has a single structural weak point; the lab demonstrates these are closed by the same layered defenses as the synthetic lab, not by benchmark-specific patches.
-- Independently corroborated by Berkeley RDI and BrachioLab (Penn) findings published concurrently in April 2026.
-
----
+Each lab's README documents its prerequisites, the one-command repro, and key takeaways. See [`docs/sandbox-hardening.md`](../../docs/sandbox-hardening.md) for the narrative and results tables.
 
 ## See also
 
-- [`.dev-docs/harden-sandbox.md`](../.dev-docs/harden-sandbox.md) — full seven-pattern BenchJack threat model and hardening audit
+- [`harden-sandbox.md`](./harden-sandbox.md) — full seven-pattern BenchJack threat model and hardening audit
diff --git a/.claude/launch.json b/.claude/launch.json
@@ -1,11 +1,4 @@
 {
   "version": "0.0.1",
-  "configurations": [
-    {
-      "name": "dashboard",
-      "runtimeExecutable": "python3",
-      "runtimeArgs": ["dashboard/serve.py"],
-      "port": 8777
-    }
-  ]
+  "configurations": []
 }
diff --git a/.claude/skills/benchflow/SKILL.md b/.claude/skills/benchflow/SKILL.md
@@ -25,7 +25,7 @@ Arguments passed: `$ARGUMENTS`
 1. Check if benchflow is installed: `uv tool list | grep benchflow`
 2. Check if API keys are set (GEMINI_API_KEY, ANTHROPIC_API_KEY, etc.)
 3. Check available agents: `bench agent list`
-4. Show recent eval results if any exist in `evaluations/` or `jobs/`
+4. Show recent eval results if any exist under `jobs/` (the default `--jobs-dir`)
 5. Point to next action based on state
 
 ### `run <task-path>` — run a single task
@@ -94,12 +94,13 @@ max_retries: 1
 ### `metrics <jobs-dir>` — analyze results
 
 ```bash
-bench eval list jobs/
+bench eval metrics jobs/      # aggregate pass-rate / tokens / cost (add --json to pipe)
+bench eval list jobs/         # per-rollout table
 ```
 
 ### `view <rollout-dir>` — view a trajectory
 
-Results are in `evaluations/<eval-name>/<rollout-name>/` or `jobs/<job-name>/<rollout-name>/`:
+Results land under `jobs/<job-name>/<rollout-name>/` (the default `--jobs-dir` is `jobs/`):
 ```
 rollout-dir/
 ├── result.json              # rewards, agent, timing
@@ -114,20 +115,38 @@ rollout-dir/
 ### `create-task` — create a new benchmark task
 
 ```bash
-bench tasks init my-task
-bench tasks init my-task --no-pytest --no-solution
+bench tasks init my-task                       # native task.md format (default)
+bench tasks init my-task --no-pytest --no-oracle
+bench tasks check tasks/my-task                # structural validation
 ```
 
-Quick structure:
+Quick structure (native `task.md` format, the default):
 ```
 my-task/
-├── task.toml          # timeouts, resources, metadata
-├── instruction.md     # what the agent should do
+├── task.md            # YAML frontmatter (config) + prompt body
 ├── environment/
 │   └── Dockerfile     # sandbox setup
-├── tests/
-│   └── test.sh        # verifier -> writes to /logs/verifier/reward.txt
-└── solution/          # optional reference solution
+├── verifier/
+│   ├── test.sh        # verifier entrypoint -> writes /logs/verifier/reward.txt
+│   └── test_outputs.py
+└── oracle/            # optional reference solution (solve.sh)
+```
+
+`--format legacy` instead scaffolds the older split layout (`task.toml` +
+`instruction.md` + `tests/` + `solution/`).
+
+### `skills` — discover and evaluate agent skills
+
+```bash
+bench skills list                                   # discover skills on disk
+bench skills eval skills/citation-management \
+  --agent claude-agent-acp                          # score a skill against its evals/evals.json
+```
+
+### `hub` — check external-environment-hub compatibility
+
+```bash
+bench hub check          # inventory/structurally-check representative Harbor-registry tasks
 ```
 
 ### `agents` — list available agents
@@ -155,15 +174,19 @@ The underlying agent's install, env vars, credentials, and skill paths are prese
 
 ### `compare` — multi-agent comparison
 
+Compare by running one config per agent (the `agent:` key lives in each YAML)
+and printing the aggregate scores:
 ```python
 import asyncio
 from benchflow.evaluation import Evaluation
 
 async def main():
-    for agent_name in ["claude-agent-acp", "gemini", "opencode"]:
-        eval_obj = Evaluation.from_yaml("benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml")
-        result = await eval_obj.run()
-        print(f"{agent_name}: {result.passed}/{result.total} ({result.score:.1%})")
+    for config_path in [
+        "benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml",
+        "benchmarks/harvey-lab/harvey-lab-harness-parity.yaml",
+    ]:
+        result = await Evaluation.from_yaml(config_path).run()
+        print(f"{config_path}: {result.passed}/{result.total} ({result.score:.1%})")
 
 asyncio.run(main())
 ```
@@ -173,7 +196,10 @@ asyncio.run(main())
 ## Setup
 
 ```bash
-uv tool install benchflow    # or: uv sync --extra dev --locked (from source)
+# 0.6 is pre-release — not yet on PyPI. Install the RC wheel from GitHub releases:
+uv tool install --prerelease allow \
+  'benchflow @ https://github.com/benchflow-ai/benchflow/releases/download/0.6.0-rc.6/benchflow-0.6.0rc6-py3-none-any.whl'
+# (replace rc.6 with the newest 0.6.0-rc.* release; or from source: uv sync --extra dev --locked)
 export GEMINI_API_KEY=...     # or ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.
 export DAYTONA_API_KEY=...    # for cloud sandboxes
 ```
@@ -204,10 +230,13 @@ bench eval create \
   --agent claude-agent-acp \
   --sandbox daytona \
   --skills-dir skills/ \
+  --skill-mode with-skill \
   --agent-env BENCHFLOW_SKILL_NUDGE=name
 ```
 
-Skills are uploaded to `/skills/` in the sandbox and symlinked to agent-specific paths.
+`--skill-mode with-skill` is required whenever you pass `--skills-dir` (omitting
+it errors). Skills are uploaded to `/skills/` in the sandbox and symlinked to
+agent-specific paths.
 
 ## Tips
 
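Reviewer sketch for the `create-task` hunk above: the native layout it documents can be sanity-checked with a few lines of standard-library Python. This is not what `bench tasks check` does internally. The `REQUIRED` tuple and the `missing_files` helper are hypothetical names, and treating `task.md`, `environment/Dockerfile`, and `verifier/test.sh` as the required set is an assumption read off the directory tree in the diff (`oracle/` is marked optional there).

```python
from pathlib import Path
import tempfile

# Assumption: these three files are the non-optional part of the native layout
# shown in the diff; `oracle/` is documented as optional and is not checked.
REQUIRED = ("task.md", "environment/Dockerfile", "verifier/test.sh")


def missing_files(task_dir: Path) -> list[str]:
    """Return the required relative paths that do not exist under task_dir."""
    return [rel for rel in REQUIRED if not (task_dir / rel).is_file()]


# Build a deliberately incomplete task directory and report what is missing.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "my-task"
    (task / "environment").mkdir(parents=True)
    (task / "task.md").write_text("---\n---\nprompt body\n")
    (task / "environment" / "Dockerfile").write_text("FROM python:3.12-slim\n")
    print(missing_files(task))  # verifier/test.sh was never created
```

A check like this only confirms that the files exist. It says nothing about whether the YAML frontmatter in `task.md` is valid or whether `test.sh` writes a reward.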

diff --git a/.claude/skills/branch-review/SKILL.md b/.claude/skills/branch-review/SKILL.md
@@ -38,7 +38,7 @@ Route changed files by path:
 
 - **tests** → `/test-review`: `tests/**/test_*.py`, `tests/**/*_test.py`
 - **src** → `/code-cleanup`: `src/benchflow/**/*.py`
-- **docs** → `/docs-review`: `*.md`, `docs/**/*`, `.dev-docs/**/*`, `src/benchflow/**/*.md`
+- **docs** → `/docs-review`: `*.md`, `docs/**/*`, `.claude/dev-docs/**/*`, `src/benchflow/**/*.md`
 
 **Skip silently** (no routing, no findings, no warning): `uv.lock`, `.venv/**`, `__pycache__/**`, `*.egg-info/**`, `dist/**`, `build/**`, `.pytest_cache/**`, generated files.
 

diff --git a/.claude/skills/docs-review/SKILL.md b/.claude/skills/docs-review/SKILL.md
@@ -32,17 +32,13 @@ The user may say `/docs-review` with an optional argument:
 
 ### Light-touch (checks 1, 2, 6 only — drift, stale refs, link integrity)
 
-- `.dev-docs/sdk-reference.md` — internal SDK surface; verify class/function
-  names + signatures still resolve in `src/benchflow/`.
-- `.dev-docs/harden-sandbox.md` — sandbox hardening notes; verify referenced
-  files / knobs / env vars still exist.
-- `.dev-docs/tested-agents.md` — matrix of agent × model × provider; verify
-  names still appear in `agents/registry.py` and `agents/providers.py`.
+- `.claude/dev-docs/harden-sandbox.md` — sandbox hardening notes; verify
+  referenced files / knobs / env vars still exist.
+- `.claude/dev-docs/tested-agents.md` — matrix of agent × model × provider;
+  verify names still appear in `agents/registry.py` and `agents/providers.py`.
 
 ### Skipped entirely
 
-- `.dev-docs/sdk-refactor-notes.md` — dated refactor record (April 2026);
-  historical, status language is expected. Do not flag or edit.
 - Anything matching `*-notes.md`, `*-archive.md`.
 - `.smoke-jobs/`, `trajectories/`, `examples/`, `fixtures/` — generated or
   sample output, not documentation.
@@ -64,7 +60,7 @@ entries. Cross-check:
   mentioned in docs, grep `src/benchflow/agents/registry.py` and
   `src/benchflow/agents/providers.py`. A name in docs but not in the
   registry dict → stale; a name in the registry but not documented where
-  expected (`docs/architecture.md` matrix, `.dev-docs/tested-agents.md`)
+  expected (`docs/architecture.md` matrix, `.claude/dev-docs/tested-agents.md`)
   → gap.
 - Env vars mentioned in docs (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`,
   `GROQ_API_KEY`, `BENCHFLOW_*`, etc.) — still referenced in
@@ -94,11 +90,11 @@ Grep for implementation-tracking words:
 
 For each hit, ask: is this describing the *design* (stays true) or
 *in-flight work* (rots)? In-flight language belongs in commit messages,
-PR descriptions, or `.dev-docs/*-notes.md`, not user-facing reference
+PR descriptions, or `.claude/dev-docs/*-notes.md`, not user-facing reference
 docs.
 
-**Suppress for `.dev-docs/*-notes.md`** — dated refactor notes legitimately
-carry status language.
+**Suppress for `.claude/dev-docs/*-notes.md`** — dated refactor notes
+legitimately carry status language.
 
 ### 4. Duplication
 
@@ -109,11 +105,10 @@ for benchflow:
 - **SDK Run Phases** (SETUP → START → AGENT → VERIFY) — should live in
   `architecture.md`; others should link.
 - **Registry examples** — one copy in `architecture.md` + one in
-  `task-authoring.md` or `.dev-docs/sdk-reference.md` is OK if they
-  illustrate distinct use cases; two near-identical `register_agent(...)`
-  blocks is not.
+  `task-authoring.md` is OK if they illustrate distinct use cases; two
+  near-identical `register_agent(...)` blocks is not.
 - **Agent × Model × Provider matrix** — live in
-  `.dev-docs/tested-agents.md`; `architecture.md` should link, not
+  `.claude/dev-docs/tested-agents.md`; `architecture.md` should link, not
   duplicate.
 - **Env var reference** — should live in `docs/getting-started.md` or
   `docs/cli-reference.md`; not re-listed in README.
@@ -164,7 +159,7 @@ All markdown links resolve:
   "how benchflow works" overview — link to architecture for that.
 - **docs/getting-started.md / docs/labs.md**: tutorial tone; design
   rationale belongs elsewhere.
-- **.dev-docs/**: internal — can carry status language, refactor
+- **.claude/dev-docs/**: internal — can carry status language, refactor
   histories, signature tables.
 
 ## Execution
@@ -207,7 +202,7 @@ For a full review:
   prose quality or tone.
 - **Don't grow scope.** If a check isn't in the seven above, don't add it
   mid-review. File a suggestion in the punch list instead.
-- **Don't touch archives or refactor notes.** `.dev-docs/*-notes.md`
+- **Don't touch archives or refactor notes.** `.claude/dev-docs/*-notes.md`
   legitimately carry status language and reflect state at the time they
   were written; don't normalize them.
 - **Don't flag registry drift without reading the registry dict.**
@@ -229,7 +224,7 @@ Stale:
 - docs/architecture.md:39 — "Phase 1: SETUP (host)" numbering implies sequential work-in-progress; phases are always-on
 - docs/task-authoring.md:88 — "TODO: document verifier timeout knob"
 - docs/getting-started.md:121 — example task ID "demo-fizzbuzz" renamed to "examples-fizzbuzz"
-- .dev-docs/tested-agents.md:14 — lists claude-code-acp; registry only has claude-agent-acp
+- .claude/dev-docs/tested-agents.md:14 — lists claude-code-acp; registry only has claude-agent-acp
 
 Polish:
 - README.md:98-134 — full src/ tree duplicates docs/architecture.md:12-38
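
Reviewer sketch for the `.dev-docs/` to `.claude/dev-docs/` rename in the docs-review diff above: a rename like this is easy to leave half-finished, so a small scan for the old prefix can help confirm that no references were missed. The `find_stale_refs` helper and the sample text are hypothetical and use only the standard library. The pattern relies on the new path having no dot directly before `dev-docs/`.

```python
import re

# The old prefix is `.dev-docs/`; the new one is `.claude/dev-docs/`, which has
# a slash (not a dot) before `dev-docs/`, so it never matches this pattern.
# The lookbehind avoids matching names such as `foo.dev-docs/`.
STALE_DEV_DOCS = re.compile(r"(?<!\w)\.dev-docs/")


def find_stale_refs(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that still mention the old path."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if STALE_DEV_DOCS.search(line)
    ]


sample = "\n".join([
    "- `.claude/dev-docs/harden-sandbox.md` is the new location",
    "- `.dev-docs/tested-agents.md` still uses the old prefix",
])
print(find_stale_refs(sample))
```

Run over the repository's Markdown files after the change, a scan like this would be expected to return nothing. Any hit is either a missed rename or a deliberately historical reference.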

diff --git a/.claude/skills/launch-prep/SKILL.md b/.claude/skills/launch-prep/SKILL.md
@@ -46,7 +46,7 @@ Run A → B → C → D serially (each skill spawns its own subagents; stacking
 them saturates the pool).
 
 **A — `/docs-review`** (full). Covers `README.md`, `docs/*.md`, `AGENTS.md`,
-and the light-touch `.dev-docs/` set. Captures drift vs. code, stale refs,
+and the light-touch `.claude/dev-docs/` set. Captures drift vs. code, stale refs,
 link integrity, registry alignment. Supersedes the old ad-hoc docs pass.
 
 **B — labs/ (ad-hoc subagent)** — `/docs-review` skips labs. Spawn one
@@ -90,10 +90,36 @@ If `ruff format` changed files: `git diff --name-only`, then `git add <those fil
 
 ```bash
 source .env 2>/dev/null || true
-.venv/bin/python -m pytest -m live tests/test_smoke.py -v
+.venv/bin/python -m pytest -m live tests/test_smoke.py -v -ra \
+  --junitxml=/tmp/smoke.xml
+# A skipped live smoke is NOT green — exit 0 on a run that never executed
+# would false-green the e2e gate. pytest puts tests/skipped on the nested
+# <testsuite> elements, so sum over them and fail unless tests ran and none skipped.
+.venv/bin/python -c 'import sys,xml.etree.ElementTree as ET; \
+  es=list(ET.parse("/tmp/smoke.xml").getroot().iter("testsuite")); \
+  t=sum(int(e.get("tests",0)) for e in es); \
+  s=sum(int(e.get("skipped",0)) for e in es); \
+  sys.exit(0) if t and not s else (print("LIVE SMOKE SKIPPED OR NOT RUN — gate is RED, not green") or sys.exit(1))'
 ```
 
-If Docker is unavailable, warn and ask to skip or abort — do not skip silently. Expected: `test_hello_world_smoke` passes with reward > 0 and non-empty trajectory.
+The live smoke `skipif`s when Docker is down or the chosen model has no
+credential, and pytest exits `0` on a skip. The JUnit check above turns that
+silent pass into a hard failure so a skipped smoke cannot false-green the gate.
+Expected: `test_hello_world_smoke` passes with reward > 0 and non-empty
+trajectory.
+
+If the only credential on the machine is non-Anthropic, point the smoke at an
+agent/model it can authenticate instead of skipping (proven combo:
+openhands + deepseek):
+
+```bash
+export BENCHFLOW_SMOKE_AGENT=openhands
+export BENCHFLOW_SMOKE_MODEL=deepseek/deepseek-chat
+export DEEPSEEK_API_KEY=...  DEEPSEEK_BASE_URL=https://api.deepseek.com
+```
+
+`BENCHFLOW_SMOKE_AGENT` and `BENCHFLOW_SMOKE_MODEL` must be set together; the
+skip reason names the exact missing credential for whichever model is selected.
 
 ---