fix(dogfood): four harness-lane bugs surfaced by deepseek-v4-pro/deepagents run by xdotli · Pull Request #757 · benchflow-ai/benchflow

xdotli · 2026-06-14T03:17:07Z

What

A real-model dogfood (deepseek-v4-pro, deepagents agent, 50 real SkillsBench tasks on Daytona) surfaced four bugs in the benchflow harness lane; the tasks themselves were blameless. This PR fixes all four, and the fixes were verified and re-dogfooded.

| # | Bug | Fix |
| --- | --- | --- |
| 1 | deepagents install fails on base images shipping Python <3.11 (`No matching distribution found for deepagents`, 3/50) | Provision a uv-pinned Python 3.12 venv (mirrors the OpenHands install). Adds an `import deepagents` verification tail so a failed install surfaces as `rc!=0` instead of being masked by the shim-deploy's trailing `chmod +x`. |
| 2 | Idle watchdog false-fires at 600s on long-but-productive turns (9 `idle_timeout`s); trajectory lost on cancel | Stream the LangGraph trajectory (`agent.stream`, `updates` mode), emitting ACP updates per step instead of a blocking `invoke()`. Tool-call ids deduped across chunks. |
| 3 | `summary.json` records $0 cost despite 41.4M deepseek tokens | Populate `MODEL_COST_PER_TOKEN` with deepseek-v4 entries (clearly-labeled v3-class PLACEHOLDERS to verify), because deepseek-v4 is absent from LiteLLM's built-in price table. |
| 4 | Rootfs conftest-purge `find /` exceeds a 10s timeout on slow network-backed FS → false-errors valid verifiers | Prune `/proc /sys /dev` from the find, and give the walk one owned timeout (`VERIFIER_SETUP_TIMEOUT_SEC`) shared by the scoring path (`harden_before_verify`) and the soft-verify path. Kept fail-closed. |

Verification

Full suite green: 4071 passed, 49 skipped; ruff check / ruff format --check / ty check all clean.
Adds a streaming/dedup unit test for the shim; updates the deepagents-install and conftest-purge assertions.
Structural-quality (thermo-nuclear) review: APPROVE — two initial blockers (install-failure masking; the timeout bump reaching only one of two call sites + a scattered literal) were fixed and re-verified.
#4 keeps the conftest-purge fail-closed (a reward-hacking defense); pruning only skips descent into virtual filesystems where a planted conftest can't affect pytest collection.

Notes

The deepseek-v4 prices in #3 are PLACEHOLDERS (v3-class) — flagged in-code to verify against published pricing before relying on absolute cost figures.

Note

Medium Risk
Touches verifier hardening timeouts and rootfs find (fail-closed anti-tamper) plus agent streaming behavior; deepseek prices are unverified placeholders.

Overview
Fixes four harness issues found during deepseek-v4-pro / deepagents dogfood on slow Daytona sandboxes.

deepagents install now provisions a uv-managed Python 3.12 venv (the package needs ≥3.11; old task images ship 3.6/3.8) and ends with an `import deepagents` check, so a failed `uv pip install` fails the install instead of being masked by the shim deploy.
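To illustrate why the verification tail matters, here is a minimal sketch. It assumes the install steps are joined with `;`, so the shell reports the exit code of whichever command runs last. The venv path, the shim path, and the helper names are hypothetical and not taken from the benchflow source; `uv venv --python` and `uv pip install --python` are the only real commands used.

```python
import subprocess

VENV = "/opt/deepagents-venv"            # hypothetical venv location
SHIM = "/usr/local/bin/deepagents-shim"  # hypothetical shim path


def join_with_verification(steps: list[str], verify: str) -> str:
    """Join steps with ';' and append a final check that decides the exit code."""
    return "; ".join([*steps, verify])


def deepagents_install_cmd() -> str:
    """Build an install script whose last command is the import check."""
    steps = [
        f"uv venv --python 3.12 {VENV}",
        f"uv pip install --python {VENV}/bin/python deepagents",
        f"chmod +x {SHIM}",  # without a tail, this command's rc would win
    ]
    return join_with_verification(steps, f"{VENV}/bin/python -c 'import deepagents'")


def exit_code(script: str) -> int:
    """Run a script through sh and return its exit code."""
    return subprocess.run(["sh", "-c", script], check=False).returncode
```

With this shape, a script such as `false; true` exits 0 and hides the failure, while appending a failing check as the last command makes the whole script exit non-zero.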

The deepagents ACP shim streams LangGraph (`agent.stream`, `updates` mode) and emits tool/text updates per step so the 600s idle watchdog stays fed on long model turns; tool-call ids are deduped when chunks re-emit state.
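The per-step emission with tool-call dedup could look roughly like the sketch below. It is not the shim's actual code: messages are simplified to plain dicts (LangGraph yields message objects), the emitted update shape is made up for illustration, and `relay_updates` is a hypothetical name. The only thing taken from LangGraph is that `updates` mode yields one mapping of node name to state update per step.

```python
from typing import Any, Callable, Iterable


def relay_updates(
    chunks: Iterable[dict[str, Any]],
    emit: Callable[[dict[str, Any]], None],
) -> int:
    """Forward each streamed step as it arrives; return how many updates were emitted."""
    seen_tool_call_ids: set[str] = set()
    emitted = 0
    for chunk in chunks:  # one chunk per graph step
        for node_name, update in chunk.items():
            if not isinstance(update, dict):
                continue  # skip anything that is not a state update
            for message in update.get("messages", []):
                for call in message.get("tool_calls", []):
                    call_id = call["id"]
                    if call_id in seen_tool_call_ids:
                        continue  # this call was already reported by an earlier chunk
                    seen_tool_call_ids.add(call_id)
                    emit({"kind": "tool_call", "id": call_id,
                          "name": call.get("name"), "node": node_name})
                    emitted += 1
                text = message.get("content")
                if text:
                    emit({"kind": "text", "text": text, "node": node_name})
                    emitted += 1
    return emitted
```

Because `emit` is called inside the loop, a watchdog that counts updates sees activity after every step rather than only when the whole run returns.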

LiteLLM custom pricing adds deepseek-v4-pro/flash placeholders so runs no longer record $0 cost (v4 is absent from LiteLLM's built-in price table).
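A per-token price lookup of this kind might be sketched as follows. The numbers are invented purely for illustration and are not DeepSeek prices. The dict layout is an assumption about `MODEL_COST_PER_TOKEN`; the key names follow LiteLLM's `input_cost_per_token` / `output_cost_per_token` convention. Returning `None` for an unknown model is a design choice of this sketch, not something the PR states.

```python
from typing import Optional

# Illustrative values only: NOT real or verified DeepSeek pricing.
MODEL_COST_PER_TOKEN: dict[str, dict[str, float]] = {
    "deepseek-v4-pro": {
        "input_cost_per_token": 1e-06,
        "output_cost_per_token": 2e-06,
    },
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Optional[float]:
    """Return the estimated cost in USD, or None when the model has no price entry."""
    prices = MODEL_COST_PER_TOKEN.get(model)
    if prices is None:
        return None  # unknown price stays distinguishable from a genuine $0
    return (
        prompt_tokens * prices["input_cost_per_token"]
        + completion_tokens * prices["output_cost_per_token"]
    )
```

Keeping "no price entry" separate from "zero cost" would make a missing table entry visible in a summary instead of silently reporting $0.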

The verifier's conftest purge (`find /`) now prunes `/proc`, `/sys`, and `/dev` and uses a shared `VERIFIER_SETUP_TIMEOUT_SEC` (60s) on both the soft-verify and `harden_before_verify` paths, replacing scattered 10s limits that false-failed on network-backed FS.
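One way to picture a single owned timeout shared by both call sites is the sketch below. Only the constant name and its 60s value come from the summary above; `run_setup_step`, `VerifierSetupError`, and the two `*_sketch` wrappers are hypothetical stand-ins. Fail-closed is modeled here as raising on timeout rather than continuing.

```python
import subprocess
import sys

VERIFIER_SETUP_TIMEOUT_SEC = 60  # one owned value instead of scattered literals


class VerifierSetupError(RuntimeError):
    """Raised when a setup step cannot finish; the caller must not proceed to scoring."""


def run_setup_step(cmd: list[str], timeout: float = VERIFIER_SETUP_TIMEOUT_SEC):
    """Run one setup command; a timeout fails closed by raising."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout, check=False)
    except subprocess.TimeoutExpired as exc:
        raise VerifierSetupError(f"setup step exceeded {timeout}s: {cmd!r}") from exc


def harden_before_verify_sketch(cmd: list[str]):
    """Scoring path: uses the shared default timeout."""
    return run_setup_step(cmd)


def soft_verify_sketch(cmd: list[str]):
    """Soft-verify path: uses the same shared default timeout."""
    return run_setup_step(cmd)


if __name__ == "__main__":
    # A trivial command that finishes well inside the budget.
    print(harden_before_verify_sketch([sys.executable, "-c", "pass"]).returncode)
```

Routing both paths through one helper means a later change to the budget cannot reach only one of the two call sites.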

Reviewed by Cursor Bugbot for commit eee4f8e. Bugbot is set up for automated code reviews on this repo.

…agents run

A real-model dogfood (deepseek-v4-pro, deepagents agent, 50 real SkillsBench tasks on Daytona) surfaced four bugs in the benchflow harness lane (the tasks themselves were blameless).

Fixes:

1. deepagents install: provision a uv-pinned Python 3.12 venv instead of a system-python venv. deepagents requires Python >=3.11, but some task base images ship Python 3.6/3.8, so the system venv made pip report "No matching distribution found for deepagents" (3/50 install failures). Mirrors the OpenHands install pattern. Adds an `import deepagents` verification tail so a failed install surfaces as rc!=0 instead of being masked by the shim-deploy's trailing chmod.
2. deepagents shim: stream the LangGraph trajectory (agent.stream, updates mode) instead of a blocking invoke() + post-hoc emit. The blocking path emitted zero ACP updates until it returned, so benchflow's idle watchdog (which counts ACP updates) false-fired at 600s on long-but-productive turns (9 idle_timeouts), and lost the trajectory on cancel. Tool-call ids are deduped across chunks.
3. deepseek-v4 cost capture: populate MODEL_COST_PER_TOKEN with deepseek-v4 entries (clearly-labeled v3-class PLACEHOLDERS to verify), since deepseek-v4 is absent from LiteLLM's built-in price table and a run otherwise records $0.
4. verifier crash: prune /proc /sys /dev from the rootfs conftest-purge find so it stays under budget on slow network-backed FS (Daytona), and give the walk one owned timeout (VERIFIER_SETUP_TIMEOUT_SEC) shared by both the scoring path (harden_before_verify) and the soft-verify path. Previously the unpruned find exceeded a 10s timeout and false-errored valid verifiers. Kept fail-closed.

Tests: full suite green (4071 passed); adds a streaming/dedup test for the shim and updates the deepagents-install and conftest-purge assertions.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eee4f8eec9

chatgpt-codex-connector · 2026-06-14T03:20:06Z

+            "find / -path /proc -prune -o -path /sys -prune -o -path /dev -prune -o "
+            "-name conftest.py "
            "-not -path '/verifier/*' -not -path '/tests/*' "
            "-delete 2>/dev/null"


Keep conftest cleanup compatible with -delete

When cleanup_conftests is enabled (the default), this find expression no longer removes any conftest.py: I checked GNU find and -delete automatically enables -depth, which makes -prune invalid/ineffective and causes the command to exit before deleting matches. Because stderr is redirected and _build_cleanup_cmd() joins the remaining cleanup steps with ;, the failure is hidden and agent-planted conftests outside /tests can survive into verification, corrupting scored runs.
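GNU find documents that `-delete` turns on `-depth` and that `-prune` has no effect under `-depth`. One possible way to keep the pruning is to swap `-delete` for `-exec rm -f {} +`, which does not imply `-depth`. The sketch below is an assumption-laden illustration, not the fix that landed in this PR: `build_purge_argv` and `demo` are hypothetical names, and the demo runs against a temporary directory rather than `/`.

```python
import subprocess
import tempfile
from pathlib import Path


def build_purge_argv(root: str) -> list[str]:
    """Build a find argv that prunes virtual filesystems and removes stray conftest.py files."""
    base = root.rstrip("/")  # "" when root is "/"
    start = base or "/"
    return [
        "find", start,
        # Skip descent into virtual filesystems entirely.
        "(", "-path", f"{base}/proc", "-o", "-path", f"{base}/sys",
        "-o", "-path", f"{base}/dev", ")", "-prune",
        "-o",
        "-name", "conftest.py",
        "!", "-path", f"{base}/verifier/*",
        "!", "-path", f"{base}/tests/*",
        # -exec rm does not imply -depth, so -prune above stays effective.
        "-exec", "rm", "-f", "{}", "+",
    ]


def demo() -> dict[str, bool]:
    """Create a small tree, run the purge, and report which conftest.py files survive."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel in ("proc", "tests", "app"):
            (root / rel).mkdir()
            (root / rel / "conftest.py").write_text("")
        subprocess.run(build_purge_argv(tmp), check=True)
        return {rel: (root / rel / "conftest.py").exists()
                for rel in ("proc", "tests", "app")}
```

Passing the command as an argv list avoids shell quoting of the parentheses and globs, and running it with `check=True` means a failing find is not silently ignored.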

cursor · 2026-06-14T03:33:10Z

Bugbot couldn't run - usage limit reached

chatgpt-codex-connector Bot reviewed Jun 14, 2026

xdotli merged commit 6ba75db into main Jun 14, 2026
4 checks passed

xdotli deleted the fix/deepseek-dogfood-issues branch June 14, 2026 04:10
