BenchFlow uses a resource-verb pattern: bench <resource> <verb>.
bench --version
bench agentis agent management only.bench agent listandbench agent showoperate on registered AI agents (Claude Code, Gemini CLI, Codex, OpenHands, …) — the programs that solve tasks. Onboarding a third-party benchmark (scaffold → drive → parity-gate abenchmarks/<name>/adoption) is a separate workflow underbench eval adopt(init→convert→verify). The legacybench agent create|run|verifystill work as hidden deprecated aliases through 0.6, printing a one-line notice; they are removed in 0.7.
List all registered agents with their protocol and native/default auth
requirements. Provider-prefixed models may use provider-specific credentials;
Azure Foundry models use AZURE_API_KEY plus AZURE_API_ENDPOINT.
bench agent listShow details for a specific agent, including native/default auth and a note about provider-specific credentials.
bench agent show geminiBring a third-party benchmark into the environment framework: scaffold a
benchmarks/<name>/ package, drive the codex CONVERT.md conversion, then
parity-gate it (init → convert → verify). These commands were previously
bench agent create|run|verify, which still work as hidden deprecated aliases
through 0.6 (they print a one-line notice and are removed in 0.7). See
Benchmark adoption for the full walkthrough.
Scaffold benchmarks/<name>/ for a new benchmark adoption. The layout mirrors
the reference benchmark benchmarks/programbench/ and the contract in
benchmarks/CONVERT.md: it writes
benchflow.py (converter), main.py, parity_test.py, run_<name>.py,
<name>.yaml, benchmark.yaml, parity_experiment.json (status template),
README.md, and __init__.py. It is fail-closed: the slug is validated
(lowercase, leading letter, single internal hyphens, max 64 chars) and the
command refuses to overwrite an existing benchmark directory.
bench eval adopt init my-bench
bench eval adopt init my-bench --benchmarks-dir ./benchmarks| Flag | Default | Description |
|---|---|---|
--benchmarks-dir |
repo benchmarks/ |
Target benchmarks/ directory |
Drive the CONVERT.md adoption workflow by launching the host codex CLI.
The command assembles the adoption context (the source, the target
benchmarks/<name>/ path, the adoption skills, and the embedded
benchmarks/CONVERT.md guide) and runs codex exec against the repo root to
drive the conversion toward a benchmarks/<name>/ pull request. It is
fail-closed on credentials: codex needs OPENAI_API_KEY (or CODEX_API_KEY)
in the environment, or a ~/.codex/auth.json from codex login, otherwise the
command exits before assembling any context. Use --dry-run to print the exact
launch command without running it (no credentials required). When --name is
omitted the slug is derived from the source basename.
# Print the codex launch command without running it
bench eval adopt convert https://github.com/org/some-benchmark --dry-run
# Launch the host codex driver against a local source
bench eval adopt convert ./vendor/some-benchmark --name my-bench --model o3| Flag | Default | Description |
|---|---|---|
--name |
derived from source | Benchmark slug (default: from source basename) |
--model |
codex default | Model for the codex driver |
--dry-run |
false |
Print the launch command, do not run |
--codex-bin |
codex |
Host codex binary |
-c, --codex-config |
— | Codex config override as key=value, passed through to codex as -c key=value; repeatable. Use it to work around host ~/.codex/config.toml drift without editing the file — e.g. -c service_tier=flex when an installed codex version rejects a stale value. |
Run the parity gate for an adopted benchmark and emit a confidence verdict. It
reads benchmarks/<name>/parity_experiment.json and scores two layers: a
deterministic conversion-faithfulness floor (every compared criterion's
converted verdict must match the original's verdict on identical inputs) and a
statistical reward-distribution layer (every legacy-vs-converted reward delta
must sit within --tolerance). The gate is parity-only — a faithful conversion
reproduces the original's behavior, including any reward-hackability the source
has; it never "improves" or sanitizes the source. The verdict is one of
parity-confirmed, parity-divergent, or insufficient-evidence (no recorded
comparisons). On any non-confirmed verdict the command exits non-zero and emits
a draft GitHub issue body for human support — printed to stdout, or written to
--issue-out. The draft is never filed automatically. Pass --roundtrip-task
to also run the structural round-trip conformance check on a concrete task
directory.
By default the gate scores the recorded parity_experiment.json — fast, but
it trusts an artifact the conversion produced about itself. Pass --rerun to
independently re-execute parity_test.py --mode side-by-side and score its
fresh output instead. --rerun is fail-closed: a missing/failing parity_test.py,
a timeout, or output that is not in the scoreable parity_experiment.json shape
all exit non-zero (rather than silently reporting insufficient-evidence).
bench eval adopt verify my-bench
bench eval adopt verify my-bench --tolerance 0.05 --issue-out divergence.md
bench eval adopt verify my-bench --roundtrip-task benchmarks/my-bench/tasks/example
bench eval adopt verify my-bench --rerun # re-run parity_test.py, score fresh output| Flag | Default | Description |
|---|---|---|
--benchmarks-dir |
repo benchmarks/ |
Target benchmarks/ directory |
--tolerance |
0.02 |
Max abs reward delta (statistical layer) |
--issue-out |
— | Write the divergence issue draft to this path instead of stdout |
--roundtrip-task |
— | Also run the structural round-trip check on this task dir |
--rerun |
false |
Re-execute parity_test.py --mode side-by-side and score its fresh output instead of the recorded parity_experiment.json |
Create and run an evaluation. Use it for YAML configs and batch runs; it also accepts a single task directory.
# From YAML config
bench eval create --config benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml
# From remote repo (fast Daytona batch; token usage may be unavailable)
bench eval create \
--source-repo benchflow-ai/skillsbench \
--source-path tasks \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--concurrency 64 \
--sandbox-setup-timeout 300
# From remote repo with required token usage telemetry
bench eval create \
--source-repo benchflow-ai/skillsbench \
--source-path tasks \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--usage-tracking required \
--concurrency 16 \
--sandbox-setup-timeout 300
# From local directory
bench eval create --tasks-dir ./tasks --agent gemini --model gemini-3.1-flash-lite-preview
# From a hosted PrimeIntellect / Verifiers environment
bench eval create \
--source-env primeintellect/general-agent \
--source-env-version 0.1.1 \
--source-env-arg task=calendar_scheduling_t0 \
--agent gemini \
--model google/gemini-2.5-flash-lite
# Single task with mounted skills and the recommended skill nudge
bench eval create \
--tasks-dir tasks/pdf-fix \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--skill-mode with-skill \
--agent-env BENCHFLOW_SKILL_NUDGE=name
# Pinned registry dataset: resolves skillsbench@1.1, verifies task digests,
# and stamps dataset identity into every result.json/config.json
bench eval create -d skillsbench@1.1 --agent gemini --model gemini-3.1-flash-lite-preview| Flag | Default | Description |
|---|---|---|
--config |
— | YAML config file |
--tasks-dir |
— | Local task dir (single task with task.toml, or parent of many) |
-d, --dataset |
— | Registry dataset to run as <name>@<version> (e.g. skillsbench@1.1). Resolves the pinned snapshot from the registry, clones tasks at their pinned commit, verifies each task's sha256 content digest, and checks the dataset's bench_version range against the installed benchflow. Each result.json/config.json is stamped with dataset_name, dataset_version, and the task's task_digest. |
--registry |
skillsbench registry | Dataset registry JSON URL or local file. Only valid with --dataset. |
--source-repo |
— | Remote repo as org/repo (e.g. benchflow-ai/skillsbench) |
--source-path |
— | Subpath within the repo (e.g. tasks) |
--source-ref |
— | Branch or tag to clone (e.g. main) |
--source-env |
— | Hosted environment source (e.g. primeintellect/general-agent) |
--source-env-version |
— | Hosted environment version |
--source-env-arg |
— | Hosted environment argument as KEY=VALUE; repeatable |
--source-env-num-examples |
1 |
Number of hosted environment examples |
--source-env-rollouts-per-example |
1 |
Rollouts per hosted environment example |
--source-env-max-tokens |
1024 |
Max tokens for hosted environment model calls |
--source-env-temperature |
0.0 |
Temperature for hosted environment model calls |
--source-env-sampling-arg |
— | Verifiers sampling argument as KEY=VALUE; repeatable (for example reasoning_effort=minimal) |
--agent |
claude-agent-acp |
Agent name |
--model |
Agent default | Model ID |
--reasoning-effort |
— | Agent reasoning/thinking effort when the agent exposes one (e.g. max) |
--sandbox |
docker |
Sandbox: docker, daytona, or modal |
--usage-tracking |
auto |
Token usage telemetry policy: auto, required, or off |
--environment-manifest |
— | Path to an Environment-plane manifest (environment.toml); applied to every rollout in the batch |
--prompt |
instruction.md |
Prompt to send to the agent; repeatable for multi-prompt runs |
--concurrency |
4 |
Max concurrent tasks (batch mode only) |
--build-concurrency |
--concurrency |
Max concurrent docker image builds; set lower (e.g. 8) when --concurrency is high to avoid overwhelming the docker daemon |
--worker-concurrency |
— | Run batch eval through isolated worker subprocesses, each with at most this many concurrent tasks; --concurrency remains the aggregate target |
--worker-retries |
1 |
Retry a crashed worker shard this many times, resuming its jobs dir |
--worker-start-stagger-sec |
1.0 |
Seconds to stagger worker starts to avoid Daytona connection storms |
--agent-idle-timeout |
(built-in default) | Abort ACP prompts after this many idle seconds; 0 disables idle detection |
--jobs-dir |
jobs |
Output directory |
--sandbox-user |
agent |
Sandbox user (null for root) |
--sandbox-setup-timeout |
120 |
Timeout in seconds for sandbox user setup |
--skills-dir |
— | Advanced custom skills directory; valid only with --skill-mode with-skill. Omit it to use each task's environment/skills. |
--skill-mode |
no-skill |
Skill mode: no-skill, with-skill, or self-gen |
--skill-creator-dir |
— | Path to a skill-creator directory (or a skills root containing it); used when --skill-mode self-gen |
--self-gen-no-internet |
false |
Disable web tools for the self-generated skill run |
--agent-env |
— | Agent environment variable as KEY=VALUE; repeatable |
--include |
— | Only run these task names; repeatable (e.g. --include jax-computing-basics --include data-to-d3) |
--exclude |
— | Skip these task names; repeatable (e.g. --exclude quantum-numerical-simulation) |
--loop-strategy |
— | Wrap each rollout in a loop, e.g. verify-retry:k=3,feedback=names or self-review:k=3 (omit for single-shot) |
--ignore-bench-version |
false |
With --dataset, skip the dataset's bench_version compatibility gate |
When mounting skills, the recommended docs default is
--agent-env BENCHFLOW_SKILL_NUDGE=name. See
Architecture: skill loading for how
with-skill mode is registered with each agent and how the nudge modes differ.
Daytona batch runs collect provider token/cost telemetry by default with a
sandbox-local LiteLLM gateway. Use --usage-tracking required when missing telemetry
should fail the rollout, or --usage-tracking off for recovery runs that should
leave provider traffic untouched.
--source-env is for external hosted environment hubs. The first supported
runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity
(env_uid, hub_url), installs the versioned package into an isolated local
virtual environment, and runs vf-eval. --sandbox remains the BenchFlow task
sandbox selector for local/repo task sources; Verifiers source environments own
their own harness and sandbox behavior. --model is passed to the Verifiers
model endpoint; use a model id available to that provider. Provider-specific
sampling options are not inferred; pass them explicitly with
--source-env-sampling-arg.
List completed evaluations from a jobs directory.
bench eval list jobs/Collect and display metrics (pass/fail/score, memory score, tool calls, duration)
from a jobs directory. Use --json for machine-readable output.
bench eval metrics jobs/
bench eval metrics jobs/ --jsonServe a trial trajectory viewer in the browser for a rollout or job directory.
bench eval view jobs/run/task__abc123
bench eval view jobs/ --port 9000List skills discovered under the default skills roots (or --dir).
bench skills list
bench skills list --dir ./skillsEvaluate a skill against its evals.json test cases.
bench skills eval skills/my-skill/ \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytonaScaffold a new benchmark task.
bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/
bench tasks init my-new-task --format legacy| Flag | Default | Description |
|---|---|---|
--format |
task-md |
Task format: task-md (native single-document) or legacy (split task.toml + instruction.md layout) |
Validate a task directory (task.md or legacy task.toml + instruction.md, environment/Dockerfile, verifier/ or legacy tests/).
bench tasks check tasks/my-taskWith --level, validation runs at a chosen depth: schema, structural,
runtime-capability, publication-grade, acceptance, or acceptance-live.
Acceptance-level errors such as
acceptance validation requires benchflow.evidence mapping refer to the
benchflow.evidence schema documented in the "Assets, Provenance, And
Evidence" section of docs/task-standard.md.
Convert a legacy task.toml + instruction.md task into the unified
task.md format. By default the legacy files are kept alongside the new
task.md.
bench tasks migrate tasks/my-task
bench tasks migrate tasks/my-task --overwrite --remove-legacy| Flag | Default | Description |
|---|---|---|
--overwrite |
false |
Replace an existing task.md |
--remove-legacy |
false |
Delete split files and promote tests/solution aliases after task.md is verified |
Expand minimal task.md authoring profiles into the canonical task.md
form. Prints the normalized document to stdout unless told otherwise.
bench tasks normalize tasks/my-task
bench tasks normalize tasks/my-task --write
bench tasks normalize tasks/my-task -o normalized-task.md| Flag | Default | Description |
|---|---|---|
--output, -o |
— | Write normalized task.md to this path instead of stdout |
--write |
false |
Replace task.md in place with the normalized canonical form |
Export a task.md task to a Harbor/Pier-compatible split layout, with a
compatibility loss report written to compatibility/export-report.json in
the export directory.
bench tasks export tasks/my-task out/my-task-split
bench tasks export tasks/my-task --report-only
bench tasks export tasks/my-task out/my-task-split --target pier --overwriteArguments: TASK_DIR (task directory to export) and optional OUTPUT_DIR
(destination split-layout directory; may be omitted with --report-only).
| Flag | Default | Description |
|---|---|---|
--target |
harbor |
Compatibility target: harbor or pier |
--overwrite |
false |
Replace an existing export directory |
--report-only |
false |
Print the compatibility loss report without writing files |
Compute the content digest that pins a task's files, independent of git — the
sha256 the dataset registry keys on (matches the digests bench eval create -d
verifies and the task_digest stamped into every result.json). Recognizes
both legacy task.toml tasks and native task.md tasks. Given a single task
directory it prints the digest; given a directory of tasks it prints one
<name> <digest> line per task. Output goes to stdout via echo (not Rich), so
it is safe to pipe into machine-readable tooling.
bench tasks digest tasks/my-task # -> sha256:<hex>
bench tasks digest tasks/ # one "<name> sha256:<hex>" line per taskArguments: PATH (a task directory, or a directory of task directories).
Generate benchmark task directories from real agent traces.
bench tasks generate --from-local --project my-repo --limit 5
bench tasks generate --from-file session.jsonl --dry-run
bench tasks generate --from-hf opentraces-test --limit 50| Flag | Default | Description |
|---|---|---|
--from-local |
— | Generate from local Claude Code sessions |
--from-file |
— | Generate from a JSONL trace file |
--from-hf |
— | Generate from a HuggingFace dataset ID or alias |
--output |
tasks |
Output directory for generated tasks |
--projects-dir |
~/.claude/projects/ |
Claude Code projects directory |
--project |
— | Filter local sessions by project path substring |
--format |
auto |
Trace format override |
--split |
train |
HuggingFace dataset split |
--max-rows |
100 |
Max rows to download from HuggingFace |
--limit |
20 |
Max traces to process |
--min-steps |
2 |
Minimum steps per trace |
--outcome |
— | Filter by outcome: success, failure, unknown |
--author |
benchflow-traces |
Author name for generated task metadata |
--task-format |
task-md |
Generated task package format: task-md or legacy |
--dry-run |
false |
Preview traces without generating tasks |
List known HuggingFace trace datasets. The aliases listed here can be passed
to bench tasks generate --from-hf.
bench tasks list-sourcesLocal sandbox lifecycle: provision a task on a docker/daytona/modal backend, list active sandboxes, and reap stale ones.
Create an environment object from a task directory. This validates environment construction but does not start the sandbox.
bench sandbox create tasks/my-task --sandbox daytonaList active local (Daytona) sandboxes.
bench sandbox listClean up orphaned Daytona sandboxes. By default this deletes sandboxes older
than 24 hours; use --dry-run to preview what would be deleted.
bench sandbox cleanup --dry-run --max-age 1440Daytona-backed evals also reap orphaned sandboxes automatically at run start
(failure states such as BUILD_FAILED are reaped sooner than healthy ones, and
an idle-activity guard means concurrent live runs are never reaped). Set
BENCHFLOW_DAYTONA_AUTO_REAP to any of 0/false/no/off (case-insensitive)
to disable that automatic pass and rely on the manual command above.
bench environment is a hidden deprecated alias group, removed in 0.7. The
local lifecycle moved to bench sandbox (create/list/cleanup)
and hosted-provider browsing to bench hub env. The old
bench environment create|list|cleanup and show|inspect (plus list --provider/--hub) still work, each printing a one-line stderr notice.
External environment hubs: compatibility checks (check) and browsing a hosted
provider's environments (env).
Read-only browsing of a hosted provider's environments (PrimeIntellect
"Environments"). To run one, use bench eval create --source-env.
bench hub env list --provider primeintellect --owner primeintellect --search general-agent --limit 5
bench hub env show primeintellect/general-agent --version 0.1.1
bench hub env inspect primeintellect/general-agent --version 0.1.1 --path README.mdInventory or structurally check representative tasks from an environment hub's registry. Defaults to an inventory pass against the public Harbor registry JSON.
# Inventory the public Harbor hub registry
bench hub check
# Structural check, two tasks per dataset, JSONL output
bench hub check --level check --tasks-per-dataset 2 --out hub.jsonl| Flag | Default | Description |
|---|---|---|
--registry |
Harbor public registry URL | Harbor registry JSON URL or local file |
--tasks-per-dataset |
2 |
Representative tasks selected per dataset |
--level |
inventory |
Compatibility level: inventory or check |
--out |
— | Optional JSONL output path |
--cache-dir |
.cache/hub/harbor |
Cache directory for sparse clones |
--limit |
— | Optional cap on selected task refs |
source:
repo: benchflow-ai/skillsbench
path: tasks
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: gemini
model: gemini-3.1-flash-lite-preview
skill_mode: with-skill
skills_dir: shared-skills/
agent_env:
BENCHFLOW_SKILL_NUDGE: name
max_retries: 2Use the Python API for multi-scene experiments. bench eval create --config is for
batch job configs; scene configs are loaded with benchflow._utils.yaml_loader or built
directly in Python.
task_dir: tasks/my-task
environment: daytona
sandbox_setup_timeout: 300
scenes:
- name: skill-gen
roles:
- name: creator
agent: gemini
model: gemini-3.1-flash-lite-preview
turns:
- role: creator
prompt: "Analyze the task and write a skill document to /app/generated-skill.md"
- name: solve
roles:
- name: solver
agent: gemini
model: gemini-3.1-flash-lite-preview
turns:
- role: solverResume a previous, unfinished (timed-out) openhands run to completion via
record-replay. Standalone — it does not touch the normal run path. See
Continuing timed-out runs for the full guide.
bench continue path/to/original/run-folder --tasks-dir path/to/tasksKey options: --model (override the live-continuation model; defaults to the
original run's model), --timeout, --output, --require-timeout,
--strict-divergence, --replay-only (rebuild via replay and stop at the
cut-point — no live model or API key needed), and --proxy-mode (replay
proxy placement: auto, host, or sandbox; default auto uses
sandbox-local replay for Daytona/Modal and host replay for Docker).
Continue all timed-out OpenHands runs found under a directory tree. Discovers
run folders (config.json + trajectory/llm_trajectory.jsonl) recursively,
continues each, and prints a JSON batch summary (exits 1 if any continuation
failed).
bench continue-batch path/to/jobs-root --tasks-dir path/to/tasks| Flag | Default | Description |
|---|---|---|
--tasks-dir |
— | Directory holding task sources; required unless the recorded task path exists |
--model |
original run's model | Override the live-continuation model |
--timeout |
— | Wall-clock budget per continuation |
--output |
— | Output jobs dir for continued runs |
--concurrency |
100 |
Maximum number of continuation runs in flight |
--limit |
— | Limit discovered timeout folders |
--strict-divergence |
false |
Abort a run if replay leaves the original rails |
--proxy-mode |
auto |
Replay proxy placement: auto, host, or sandbox |