fix(acp): don't idle-timeout an agent while a tool call is in-flight#758
Conversation
The idle watchdog fires when no tool_call/message/thought ACP update arrives for idle_timeout seconds. But a long-running tool (e.g. a build/test/solver shell `execute`) emits NO ACP updates until it returns, so a tool that runs longer than idle_timeout false-fired the watchdog and discarded real work. Surfaced in a deepseek-v4-pro/deepagents dogfood: the agent emitted a tool_call (status pending) for a shell command, the command ran >600s, and the watchdog killed a productive run (which then needed a retry to recover). Fix: while a tool call is in-flight (pending/in_progress), treat the agent as active and reset the idle clock, deferring to the wall-clock `timeout` as the hard backstop for a tool that never returns. This does NOT weaken hang detection: a genuine model-side hang has no pending tool call (the prior tool already completed via tool_call_update), so it still trips the idle path — and a completed-tool-then-hang still idle-times-out (existing test unchanged). `deadline` is loop-invariant, so the reset cannot push the wall-clock backstop out. Adds test_in_flight_tool_call_defers_idle_to_wall_clock. Full suite 4072 green.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7061a1f899
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| """A pending/in-progress tool call (agent running a long shell command) | ||
| must NOT trip the idle watchdog: such tools emit no ACP updates until | ||
| they return, so the idle path would otherwise false-kill real work. The | ||
| agent is alive (executing a tool), so it defers to the wall-clock | ||
| backstop. Contrast test_idle_timeout_info_reflects_activity_counts, where | ||
| a *completed* tool that then hangs still idles out.""" |
There was a problem hiding this comment.
Add the guarded commit to this regression docstring
The repo-level AGENTS.md for /workspace/benchflow says “Regression tests must name the PR/commit they guard,” and this newly added regression test documents the in-flight-tool watchdog behavior without naming the guarding PR or commit. Please include the relevant PR/commit identifier in this docstring so the test follows the repository convention and remains traceable.
Useful? React with 👍 / 👎.
What
The ACP idle watchdog aborts a prompt when no
tool_call/agent_message_chunk/agent_thought_chunkupdate arrives foridle_timeoutseconds (default 600s). But a long-running tool — a build/test/solver shellexecute— emits no ACP updates until it returns. So a tool that runs longer thanidle_timeoutfalse-fired the watchdog and discarded real work.Surfaced in the deepseek-v4-pro / deepagents dogfood: the agent emitted a
tool_call(statuspending) for a shell command, the command ran >600s, and the watchdog killed a productive run (42 prior tool calls), which then needed a retry to recover.Fix
While a tool call is in-flight (
pending/in_progress), treat the agent as active and reset the idle clock — deferring to the wall-clocktimeoutas the hard backstop for a tool that genuinely never returns.This does not weaken hang detection (the watchdog's actual purpose):
tool_call_update), so it still trips the idle path.test_idle_timeout_info_reflects_activity_counts(uses aCOMPLETEDtool) is unchanged and passing.deadlineis loop-invariant (fixed at loop start, never re-derived fromlast_progress), so the reset cannot push the wall-clock backstop out indefinitely. Added a load-bearing comment so a future refactor can't reintroduce that hang.Verification
test_in_flight_tool_call_defers_idle_to_wall_clock: a pending tool that hangs raises the wall-clockAgentPromptTimeoutError, notIdleTimeoutError.ruff check/ruff format --check/ty checkclean.Note
A second item considered in the same investigation — an apparent summary "under-count" (8 rollouts reported as
total: 7) — turned out not to be a bug: it's the retry mechanism (a task idle-timed-out, was retried, and the retry passed), and both the live summary and the post-hocmetrics.pyaggregation correctly dedupe retries bytask_name, keeping the best attempt. No change made there. This PR's fix also reduces why that retry was needed in the first place.Note
Medium Risk
Changes rollout timeout behavior for all ACP runs with idle detection; misclassification of pending tools could weaken hang detection or allow stuck tools to run until wall-clock expiry.
Overview
Fixes false idle timeouts during long-running shell tools: the ACP idle watchdog in
_prompt_with_idle_watchdognow treats pending/in-progress tool calls as activity and refreshes the idle clock viasession.pending_tool_call_ids(), since those tools often emit no ACP updates until they finish.Model-side hangs after a completed tool are unchanged—the idle path still fires when there is no pending tool. The wall-clock
timeoutremains the hard cap (including tools that never return). Comments clarify thatdeadlinestays fixed at loop start so resettinglast_progressfor in-flight tools cannot push the wall-clock backstop out indefinitely.Adds regression test
test_in_flight_tool_call_defers_idle_to_wall_clock: a pending tool that hangs hitsAgentPromptTimeoutErrorat the wall-clock budget, notIdleTimeoutErrorat the shorter idle threshold.Reviewed by Cursor Bugbot for commit 7061a1f. Bugbot is set up for automated code reviews on this repo. Configure here.