fix(dogfood): four harness-lane bugs surfaced by deepseek-v4-pro/deepagents run #757

Merged
xdotli merged 1 commit into main from fix/deepseek-dogfood-issues on Jun 14, 2026

@xdotli (Member) commented Jun 14, 2026

What

A real-model dogfood (deepseek-v4-pro, deepagents agent, 50 real SkillsBench tasks on Daytona) surfaced four bugs in the benchflow harness lane; the tasks themselves were blameless. This PR fixes all four. Each fix was verified and then re-dogfooded.

| # | Bug | Fix |
|---|-----|-----|
| 1 | deepagents install fails on base images shipping Python <3.11 (`No matching distribution found for deepagents`, 3/50) | Provision a uv-pinned Python 3.12 venv (mirrors the OpenHands install). Adds an `import deepagents` verification tail so a failed install surfaces as rc!=0 instead of being masked by the shim-deploy's trailing `chmod +x`. |
| 2 | Idle watchdog false-fires at 600s on long-but-productive turns (9 idle_timeouts); trajectory lost on cancel | Stream the LangGraph trajectory (`agent.stream`, updates mode), emitting ACP updates per step instead of a blocking `invoke()`. Tool-call ids deduped across chunks. |
| 3 | summary.json records $0 cost despite 41.4M deepseek tokens | Populate `MODEL_COST_PER_TOKEN` with deepseek-v4 entries (clearly-labeled v3-class PLACEHOLDERS to verify), because deepseek-v4 is absent from LiteLLM's built-in price table. |
| 4 | Rootfs conftest-purge `find /` exceeds a 10s timeout on slow network-backed FS, which false-errors valid verifiers | Prune /proc /sys /dev from the find, and give the walk one owned timeout (`VERIFIER_SETUP_TIMEOUT_SEC`) shared by the scoring path (`harden_before_verify`) and the soft-verify path. Kept fail-closed. |

Verification

  • Full suite green: 4071 passed, 49 skipped; ruff check / ruff format --check / ty check all clean.
  • Adds a streaming/dedup unit test for the shim; updates the deepagents-install and conftest-purge assertions.
  • Structural-quality (thermo-nuclear) review: APPROVE — two initial blockers (install-failure masking; the timeout bump reaching only one of two call sites + a scattered literal) were fixed and re-verified.
  • #4 keeps the conftest-purge fail-closed (a reward-hacking defense); pruning only skips descent into virtual filesystems where a planted conftest can't affect pytest collection.

Notes

  • The deepseek-v4 prices in #3 are PLACEHOLDERS (v3-class) — flagged in-code to verify against published pricing before relying on absolute cost figures.

Note

Medium Risk
Touches verifier hardening timeouts and rootfs find (fail-closed anti-tamper) plus agent streaming behavior; deepseek prices are unverified placeholders.

Overview
Fixes four harness issues found during deepseek-v4-pro / deepagents dogfood on slow Daytona sandboxes.

deepagents install now provisions a uv-managed Python 3.12 venv (package needs ≥3.11; old task images ship 3.6/3.8) and ends with an import deepagents check so a failed uv pip install fails install instead of masking behind shim deploy.
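
The install sequence described above can be sketched as one fail-fast shell command. This is a minimal illustration, not the PR's actual installer: the function name `build_deepagents_install_cmd` and the venv path `/opt/deepagents-venv` are hypothetical, and the sketch assumes `uv` is already on the image's PATH.

```python
import shlex


def build_deepagents_install_cmd(venv_dir: str = "/opt/deepagents-venv") -> str:
    """Build a shell command that installs deepagents into a uv-managed 3.12 venv.

    The venv path is a hypothetical default chosen for illustration.
    """
    python_bin = f"{venv_dir}/bin/python"
    steps = [
        # Let uv provision Python 3.12 instead of relying on the image's system Python.
        ["uv", "venv", "--python", "3.12", venv_dir],
        # Install into that venv explicitly.
        ["uv", "pip", "install", "--python", python_bin, "deepagents"],
        # Verification tail: a broken install makes the whole command exit non-zero.
        [python_bin, "-c", "import deepagents"],
    ]
    # "&&" short-circuits on the first failure, so no later step can mask it.
    return " && ".join(shlex.join(step) for step in steps)
```

Because the import check is the last link in the `&&` chain, the command's exit status reflects whether the package is importable. Any follow-up step, such as deploying the shim, would need to run after this chain has succeeded rather than being joined with `;`.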

The deepagents ACP shim streams LangGraph (agent.stream, updates) and emits tool/text updates per step so the 600s idle watchdog stays fed on long model turns; tool-call ids are deduped when chunks re-emit state.
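
The per-step emission and tool-call dedup can be sketched as a small generator. This is an assumption-laden stand-in rather than the shim's code: chunks are modeled as plain dicts keyed by node name (the general shape of LangGraph's updates mode), messages are plain dicts instead of message objects, and `iter_acp_updates` plus the emitted update shape are hypothetical names.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def iter_acp_updates(chunks: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one update per new text or tool call, as soon as each chunk arrives."""
    seen_tool_call_ids: Set[str] = set()
    for chunk in chunks:
        for node_name, update in chunk.items():
            for message in (update or {}).get("messages", []):
                text = message.get("content")
                if text:
                    yield {"type": "text", "node": node_name, "text": text}
                for call in message.get("tool_calls", []):
                    call_id = call["id"]
                    # A later chunk may re-emit state that includes an earlier
                    # tool call; emit each id only once.
                    if call_id in seen_tool_call_ids:
                        continue
                    seen_tool_call_ids.add(call_id)
                    yield {"type": "tool_call", "id": call_id, "name": call["name"]}


# Hypothetical chunks: the second one repeats tool call "call_1".
example_chunks = [
    {"agent": {"messages": [{"content": "", "tool_calls": [{"id": "call_1", "name": "ls"}]}]}},
    {"agent": {"messages": [
        {"content": "done", "tool_calls": [
            {"id": "call_1", "name": "ls"},
            {"id": "call_2", "name": "cat"},
        ]},
    ]}},
]
emitted = list(iter_acp_updates(example_chunks))
```

Since the generator yields while the stream is still being consumed, a watchdog that counts updates sees activity during a long turn, and whatever was emitted before a cancel is already delivered.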

LiteLLM custom pricing adds deepseek-v4-pro/flash placeholders so runs no longer record $0 cost (v4 absent from built-in table).
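
A rough sketch of how a per-token price table turns token counts into a cost figure follows. The table layout, the `estimate_cost_usd` helper, and especially the numbers are assumptions for illustration only; they are not the PR's placeholder values and not published deepseek prices.

```python
from typing import Dict, Optional

# Illustrative numbers only: NOT real deepseek-v4 pricing.
MODEL_COST_PER_TOKEN: Dict[str, Dict[str, float]] = {
    "deepseek-v4-pro": {
        "input_cost_per_token": 1e-6,
        "output_cost_per_token": 2e-6,
    },
}


def estimate_cost_usd(
    model: str, prompt_tokens: int, completion_tokens: int
) -> Optional[float]:
    """Return the estimated cost, or None when the model has no price entry."""
    prices = MODEL_COST_PER_TOKEN.get(model)
    if prices is None:
        # Returning None keeps "price unknown" distinguishable from "cost is $0".
        return None
    return (
        prompt_tokens * prices["input_cost_per_token"]
        + completion_tokens * prices["output_cost_per_token"]
    )
```

Returning `None` for an unknown model is a design choice of this sketch: it makes a missing price entry visible instead of silently recording a zero cost.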

Verifier conftest purge find / now prunes /proc, /sys, /dev and uses a shared VERIFIER_SETUP_TIMEOUT_SEC (60s) on both soft-verify and harden_before_verify, replacing scattered 10s limits that false-failed on network-backed FS.
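
The idea of one owned timeout shared by both call sites can be sketched as follows. Only `VERIFIER_SETUP_TIMEOUT_SEC` (60s) and the name `harden_before_verify` come from the PR text; the bodies, `run_setup_step`, `soft_verify_setup`, and `VerifierSetupError` are hypothetical.

```python
import subprocess
import sys
from typing import Sequence

# Single source of truth for the setup budget (60s per the PR summary).
VERIFIER_SETUP_TIMEOUT_SEC = 60


class VerifierSetupError(RuntimeError):
    """Raised when a setup step times out or fails (fail-closed)."""


def run_setup_step(
    argv: Sequence[str], timeout: float = VERIFIER_SETUP_TIMEOUT_SEC
) -> None:
    try:
        proc = subprocess.run(list(argv), capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        raise VerifierSetupError(f"setup step timed out after {timeout}s") from exc
    if proc.returncode != 0:
        raise VerifierSetupError(f"setup step exited with rc={proc.returncode}")


def harden_before_verify(purge_argv: Sequence[str]) -> None:
    # Scoring path: uses the shared default timeout.
    run_setup_step(purge_argv)


def soft_verify_setup(purge_argv: Sequence[str]) -> None:
    # Soft-verify path: same helper, same budget, no scattered literal.
    run_setup_step(purge_argv)
```

Both paths raise on timeout or non-zero exit, so a purge that could not complete never lets verification proceed as if it had.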

Reviewed by Cursor Bugbot for commit eee4f8e.

…agents run

A real-model dogfood (deepseek-v4-pro, deepagents agent, 50 real SkillsBench
tasks on Daytona) surfaced four bugs in the benchflow harness lane (the tasks
themselves were blameless). Fixes:

1. deepagents install: provision a uv-pinned Python 3.12 venv instead of a
   system-python venv. deepagents requires Python >=3.11, but some task base
   images ship Python 3.6/3.8, so the system venv made pip report "No matching
   distribution found for deepagents" (3/50 install failures). Mirrors the
   OpenHands install pattern. Adds an `import deepagents` verification tail so a
   failed install surfaces as rc!=0 instead of being masked by the shim-deploy's
   trailing chmod.

2. deepagents shim: stream the LangGraph trajectory (agent.stream, updates mode)
   instead of a blocking invoke() + post-hoc emit. The blocking path emitted zero
   ACP updates until it returned, so benchflow's idle watchdog (which counts ACP
   updates) false-fired at 600s on long-but-productive turns (9 idle_timeouts),
   and lost the trajectory on cancel. Tool-call ids are deduped across chunks.

3. deepseek-v4 cost capture: populate MODEL_COST_PER_TOKEN with deepseek-v4
   entries (clearly-labeled v3-class PLACEHOLDERS to verify), since deepseek-v4
   is absent from LiteLLM's built-in price table and a run otherwise records $0.

4. verifier crash: prune /proc /sys /dev from the rootfs conftest-purge find so
   it stays under budget on slow network-backed FS (Daytona), and give the walk
   one owned timeout (VERIFIER_SETUP_TIMEOUT_SEC) shared by both the scoring path
   (harden_before_verify) and the soft-verify path. Previously the unpruned find
   exceeded a 10s timeout and false-errored valid verifiers. Kept fail-closed.

Tests: full suite green (4071 passed); adds a streaming/dedup test for the shim
and updates the deepagents-install and conftest-purge assertions.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eee4f8eec9

Comment on lines +870 to 873
"find / -path /proc -prune -o -path /sys -prune -o -path /dev -prune -o "
"-name conftest.py "
"-not -path '/verifier/*' -not -path '/tests/*' "
"-delete 2>/dev/null"

P1: Keep conftest cleanup compatible with -delete

When cleanup_conftests is enabled (the default), this find expression no longer removes any conftest.py: I checked GNU find and -delete automatically enables -depth, which makes -prune invalid/ineffective and causes the command to exit before deleting matches. Because stderr is redirected and _build_cleanup_cmd() joins the remaining cleanup steps with ;, the failure is hidden and agent-planted conftests outside /tests can survive into verification, corrupting scored runs.
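
One way to keep the prune while avoiding the `-delete`/`-depth` interaction is to delete through `-exec rm` instead. The sketch below only builds the command string; `build_conftest_purge_cmd` is a hypothetical name, and this is a possible rewrite rather than whatever the repository ultimately adopted.

```python
import shlex
from typing import List


def build_conftest_purge_argv() -> List[str]:
    """Build a find invocation that prunes virtual filesystems and removes conftests."""
    return [
        "find", "/",
        # Skip descent into virtual filesystems.
        "(", "-path", "/proc", "-o", "-path", "/sys", "-o", "-path", "/dev", ")",
        "-prune", "-o",
        "-name", "conftest.py",
        "-not", "-path", "/verifier/*",
        "-not", "-path", "/tests/*",
        # -exec rm does not imply -depth, so -prune stays effective.
        "-exec", "rm", "-f", "{}", "+",
    ]


def build_conftest_purge_cmd() -> str:
    # shlex.join quotes "(", ")", "{}" and the glob patterns for the shell.
    return shlex.join(build_conftest_purge_argv())
```

This sketch leaves stderr unredirected. Dropping `2>/dev/null` and chaining cleanup steps so that a non-zero exit is noticed would address the hidden-failure part of the review comment.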

cursor Bot commented Jun 14, 2026

Bugbot couldn't run - usage limit reached

Bugbot is counted against Cursor usage for this user or team, and this run hit a usage or spend limit.

A user or team admin can review and increase usage limits in the Cursor dashboard.

(requestId: serverGenReqId_5d446d08-e284-41cc-bd55-38fef4495572)

@xdotli xdotli merged commit 6ba75db into main Jun 14, 2026
4 checks passed
@xdotli xdotli deleted the fix/deepseek-dogfood-issues branch June 14, 2026 04:10