fix(dogfood): four harness-lane bugs surfaced by deepseek-v4-pro/deepagents run #757
…agents run

A real-model dogfood (deepseek-v4-pro, deepagents agent, 50 real SkillsBench tasks on Daytona) surfaced four bugs in the benchflow harness lane (the tasks themselves were blameless).

Fixes:

1. deepagents install: provision a uv-pinned Python 3.12 venv instead of a system-python venv. deepagents requires Python >=3.11, but some task base images ship Python 3.6/3.8, so the system venv made pip report "No matching distribution found for deepagents" (3/50 install failures). Mirrors the OpenHands install pattern. Adds an `import deepagents` verification tail so a failed install surfaces as rc!=0 instead of being masked by the shim-deploy's trailing chmod.
2. deepagents shim: stream the LangGraph trajectory (agent.stream, updates mode) instead of a blocking invoke() + post-hoc emit. The blocking path emitted zero ACP updates until it returned, so benchflow's idle watchdog (which counts ACP updates) false-fired at 600s on long-but-productive turns (9 idle_timeouts), and lost the trajectory on cancel. Tool-call ids are deduped across chunks.
3. deepseek-v4 cost capture: populate MODEL_COST_PER_TOKEN with deepseek-v4 entries (clearly-labeled v3-class PLACEHOLDERS to verify), since deepseek-v4 is absent from LiteLLM's built-in price table and a run otherwise records $0.
4. verifier crash: prune /proc /sys /dev from the rootfs conftest-purge find so it stays under budget on slow network-backed FS (Daytona), and give the walk one owned timeout (VERIFIER_SETUP_TIMEOUT_SEC) shared by both the scoring path (harden_before_verify) and the soft-verify path. Previously the unpruned find exceeded a 10s timeout and false-errored valid verifiers. Kept fail-closed.

Tests: full suite green (4071 passed); adds a streaming/dedup test for the shim and updates the deepagents-install and conftest-purge assertions.
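The streaming-plus-dedup behavior described in fix 2 can be sketched roughly as follows. This is a minimal illustration, not the shim's actual code: `relay_updates`, the `emit` callback, and the plain-dict message shape are all hypothetical stand-ins. In a real LangGraph run the chunks would come from `agent.stream(..., stream_mode="updates")` and the messages would be message objects rather than dicts.

```python
from typing import Any, Callable, Iterable


def relay_updates(
    chunks: Iterable[dict[str, Any]],
    emit: Callable[[dict[str, Any]], None],
) -> int:
    """Forward each streamed chunk as it arrives; return the number of events emitted.

    Assumed chunk shape (hypothetical): {node_name: {"messages": [message, ...]}},
    where each message is a dict with optional "content" and "tool_calls" keys.
    """
    seen_tool_call_ids: set[str] = set()
    emitted = 0
    for chunk in chunks:
        for _node_name, update in chunk.items():
            for message in (update or {}).get("messages", []):
                for call in message.get("tool_calls", []):
                    # A later chunk may re-emit state that contains an
                    # already-seen tool call; forward each id only once.
                    if call["id"] in seen_tool_call_ids:
                        continue
                    seen_tool_call_ids.add(call["id"])
                    emit({"kind": "tool_call", "id": call["id"], "name": call["name"]})
                    emitted += 1
                text = message.get("content")
                if text:
                    emit({"kind": "text", "text": text})
                    emitted += 1
    return emitted
```

Because every chunk produces events as soon as it is seen, a watchdog that counts updates keeps receiving them during a long turn, and whatever was emitted before a cancel is already recorded.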
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eee4f8eec9
"find / -path /proc -prune -o -path /sys -prune -o -path /dev -prune -o "
"-name conftest.py "
"-not -path '/verifier/*' -not -path '/tests/*' "
"-delete 2>/dev/null"
Keep conftest cleanup compatible with -delete
When cleanup_conftests is enabled (the default), this find expression no longer removes any conftest.py: I checked GNU find and -delete automatically enables -depth, which makes -prune invalid/ineffective and causes the command to exit before deleting matches. Because stderr is redirected and _build_cleanup_cmd() joins the remaining cleanup steps with ;, the failure is hidden and agent-planted conftests outside /tests can survive into verification, corrupting scored runs.
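One possible way to keep the pruning while avoiding the `-delete`/`-depth` interaction is to remove matches with `-exec rm -f {} +` instead. The sketch below only illustrates that idea and is not confirmed as the fix adopted in this PR; `build_conftest_purge_cmd`, `PRUNED_DIRS`, and `KEEP_GLOBS` are hypothetical names.

```python
import shlex

# Hypothetical constants mirroring the paths in the reviewed command.
PRUNED_DIRS = ("/proc", "/sys", "/dev")
KEEP_GLOBS = ("/verifier/*", "/tests/*")


def build_conftest_purge_cmd(root: str = "/") -> str:
    """Build a find command that prunes virtual filesystems and removes conftest.py.

    -exec does not imply -depth, so -prune keeps working, unlike with -delete.
    """
    prune = " -o ".join(f"-path {shlex.quote(d)}" for d in PRUNED_DIRS)
    keep = " ".join(f"-not -path {shlex.quote(g)}" for g in KEEP_GLOBS)
    return (
        f"find {shlex.quote(root)} \\( {prune} \\) -prune -o "
        f"-name conftest.py {keep} "
        "-exec rm -f {} + 2>/dev/null"
    )
```

The `2>/dev/null` redirect is kept from the original command, so a caller that needs to notice a failed purge would still have to check the exit status.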
What
A real-model dogfood (deepseek-v4-pro, deepagents agent, 50 real SkillsBench tasks on Daytona) surfaced four bugs in the benchflow harness lane — the tasks themselves were blameless. This PR fixes all four, with the fixes verified + re-dogfooded.
| # | Bug | Fix |
|---|-----|-----|
| 1 | deepagents install fails on task images with an old system Python (`No matching distribution found for deepagents`, 3/50) | Provision a uv-pinned Python 3.12 venv instead of a system-python venv. Add an `import deepagents` verification tail so a failed install surfaces as `rc!=0` instead of being masked by the shim-deploy's trailing `chmod +x`. |
| 2 | Idle watchdog false-fires at 600s on long-but-productive turns (9 `idle_timeout`s); trajectory lost on cancel | Stream the LangGraph trajectory (`agent.stream`, `updates` mode), emitting ACP updates per step instead of a blocking `invoke()`. Tool-call ids deduped across chunks. |
| 3 | `summary.json` records $0 cost despite 41.4M deepseek tokens | Populate `MODEL_COST_PER_TOKEN` with deepseek-v4 entries (clearly-labeled v3-class PLACEHOLDERS to verify); deepseek-v4 is absent from LiteLLM's built-in price table. |
| 4 | Verifier crash: the rootfs conftest-purge `find /` exceeds a 10s timeout on slow network-backed FS and false-errors valid verifiers | Prune `/proc /sys /dev` from the find, and give the walk one owned timeout (`VERIFIER_SETUP_TIMEOUT_SEC`) shared by the scoring path (`harden_before_verify`) and the soft-verify path. Kept fail-closed. |

Verification
- `ruff check` / `ruff format --check` / `ty check` all clean.
- Fix #4 keeps the conftest-purge fail-closed (a reward-hacking defense); pruning only skips descent into virtual filesystems where a planted conftest can't affect pytest collection.

Notes
- The deepseek-v4 prices in #3 are PLACEHOLDERS (v3-class), flagged in-code to verify against published pricing before relying on absolute cost figures.

Note
Medium Risk
Touches verifier hardening timeouts and rootfs find (fail-closed anti-tamper) plus agent streaming behavior; deepseek prices are unverified placeholders.
Overview
Fixes four harness issues found during deepseek-v4-pro / deepagents dogfood on slow Daytona sandboxes.
deepagents install now provisions a uv-managed Python 3.12 venv (package needs ≥3.11; old task images ship 3.6/3.8) and ends with an `import deepagents` check so a failed `uv pip install` fails install instead of masking behind shim deploy.

The deepagents ACP shim streams LangGraph (`agent.stream`, `updates`) and emits tool/text updates per step so the 600s idle watchdog stays fed on long model turns; tool-call ids are deduped when chunks re-emit state.

LiteLLM custom pricing adds deepseek-v4-pro/flash placeholders so runs no longer record $0 cost (v4 absent from built-in table).

Verifier conftest purge `find /` now prunes `/proc`, `/sys`, `/dev` and uses a shared `VERIFIER_SETUP_TIMEOUT_SEC` (60s) on both soft-verify and `harden_before_verify`, replacing scattered 10s limits that false-failed on network-backed FS.

Reviewed by Cursor Bugbot for commit eee4f8e.
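The "one owned timeout, fail-closed" idea from the verifier fix can be sketched as below. Only the constant name `VERIFIER_SETUP_TIMEOUT_SEC` and its 60s value come from the summary above. `run_setup_step` and the two `*_sketch` callers are hypothetical, and the command is a harmless stand-in for the real purge.

```python
import subprocess
import sys

# Single owned budget, shared by every caller instead of scattered literals.
VERIFIER_SETUP_TIMEOUT_SEC = 60

# Stand-in for the real setup command (hypothetical).
SETUP_CMD = [sys.executable, "-c", "pass"]


def run_setup_step(cmd: list[str], timeout: float = VERIFIER_SETUP_TIMEOUT_SEC) -> bool:
    """Run one setup command; report failure on timeout or non-zero exit (fail-closed)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


def harden_before_verify_sketch() -> bool:
    # Scoring path: relies on the shared default timeout.
    return run_setup_step(SETUP_CMD)


def soft_verify_sketch() -> bool:
    # Soft-verify path: same helper, same budget.
    return run_setup_step(SETUP_CMD)
```

Routing both paths through one helper means a later change to the budget happens in a single place, and a timeout is treated as a failed setup rather than silently passing.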