Skip to content

fix(acp): don't idle-timeout an agent while a tool call is in-flight#758

Merged
xdotli merged 1 commit into
mainfrom
fix/idle-heartbeat-and-summary-dedup
Jun 14, 2026
Merged

fix(acp): don't idle-timeout an agent while a tool call is in-flight#758
xdotli merged 1 commit into
mainfrom
fix/idle-heartbeat-and-summary-dedup

Conversation

@xdotli

@xdotli xdotli commented Jun 14, 2026

Copy link
Copy Markdown
Member

What

The ACP idle watchdog aborts a prompt when no tool_call / agent_message_chunk / agent_thought_chunk update arrives for idle_timeout seconds (default 600s). But a long-running tool — a build/test/solver shell execute — emits no ACP updates until it returns. So a tool that runs longer than idle_timeout false-fired the watchdog and discarded real work.

Surfaced in the deepseek-v4-pro / deepagents dogfood: the agent emitted a tool_call (status pending) for a shell command, the command ran >600s, and the watchdog killed a productive run (42 prior tool calls), which then needed a retry to recover.

Fix

While a tool call is in-flight (pending/in_progress), treat the agent as active and reset the idle clock — deferring to the wall-clock timeout as the hard backstop for a tool that genuinely never returns.

if cur_count > last_count:
    last_progress = now
    last_count = cur_count
elif session.pending_tool_call_ids():   # in-flight tool = working, not hung
    last_progress = now

This does not weaken hang detection (the watchdog's actual purpose):

  • A genuine model-side hang has no pending tool call (the prior tool already completed via tool_call_update), so it still trips the idle path.
  • A completed-tool-then-hang still idle-times-out — test_idle_timeout_info_reflects_activity_counts (uses a COMPLETED tool) is unchanged and passing.
  • deadline is loop-invariant (fixed at loop start, never re-derived from last_progress), so the reset cannot push the wall-clock backstop out indefinitely. Added a load-bearing comment so a future refactor can't reintroduce that hang.

Verification

  • New regression test test_in_flight_tool_call_defers_idle_to_wall_clock: a pending tool that hangs raises the wall-clock AgentPromptTimeoutError, not IdleTimeoutError.
  • Full suite 4072 passed; ruff check / ruff format --check / ty check clean.
  • Adversarial safety review (APPROVE-WITH-NITS): confirmed the deadline is loop-invariant, the model-hang path still fires, and there's no data race (single event loop, no threads).

Note

A second item considered in the same investigation — an apparent summary "under-count" (8 rollouts reported as total: 7) — turned out not to be a bug: it's the retry mechanism (a task idle-timed-out, was retried, and the retry passed), and both the live summary and the post-hoc metrics.py aggregation correctly dedupe retries by task_name, keeping the best attempt. No change made there. This PR's fix also reduces why that retry was needed in the first place.


Note

Medium Risk
Changes rollout timeout behavior for all ACP runs with idle detection; misclassification of pending tools could weaken hang detection or allow stuck tools to run until wall-clock expiry.

Overview
Fixes false idle timeouts during long-running shell tools: the ACP idle watchdog in _prompt_with_idle_watchdog now treats pending/in-progress tool calls as activity and refreshes the idle clock via session.pending_tool_call_ids(), since those tools often emit no ACP updates until they finish.

Model-side hangs after a completed tool are unchanged—the idle path still fires when there is no pending tool. The wall-clock timeout remains the hard cap (including tools that never return). Comments clarify that deadline stays fixed at loop start so resetting last_progress for in-flight tools cannot push the wall-clock backstop out indefinitely.

Adds regression test test_in_flight_tool_call_defers_idle_to_wall_clock: a pending tool that hangs hits AgentPromptTimeoutError at the wall-clock budget, not IdleTimeoutError at the shorter idle threshold.

Reviewed by Cursor Bugbot for commit 7061a1f. Bugbot is set up for automated code reviews on this repo. Configure here.

The idle watchdog fires when no tool_call/message/thought ACP update arrives
for idle_timeout seconds. But a long-running tool (e.g. a build/test/solver
shell `execute`) emits NO ACP updates until it returns, so a tool that runs
longer than idle_timeout false-fired the watchdog and discarded real work.
Surfaced in a deepseek-v4-pro/deepagents dogfood: the agent emitted a tool_call
(status pending) for a shell command, the command ran >600s, and the watchdog
killed a productive run (which then needed a retry to recover).

Fix: while a tool call is in-flight (pending/in_progress), treat the agent as
active and reset the idle clock, deferring to the wall-clock `timeout` as the
hard backstop for a tool that never returns. This does NOT weaken hang
detection: a genuine model-side hang has no pending tool call (the prior tool
already completed via tool_call_update), so it still trips the idle path — and a
completed-tool-then-hang still idle-times-out (existing test unchanged). `deadline`
is loop-invariant, so the reset cannot push the wall-clock backstop out.

Adds test_in_flight_tool_call_defers_idle_to_wall_clock. Full suite 4072 green.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7061a1f899

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/test_acp.py
Comment on lines +694 to +699
"""A pending/in-progress tool call (agent running a long shell command)
must NOT trip the idle watchdog: such tools emit no ACP updates until
they return, so the idle path would otherwise false-kill real work. The
agent is alive (executing a tool), so it defers to the wall-clock
backstop. Contrast test_idle_timeout_info_reflects_activity_counts, where
a *completed* tool that then hangs still idles out."""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Add the guarded commit to this regression docstring

The repo-level AGENTS.md for /workspace/benchflow says “Regression tests must name the PR/commit they guard,” and this newly added regression test documents the in-flight-tool watchdog behavior without naming the guarding PR or commit. Please include the relevant PR/commit identifier in this docstring so the test follows the repository convention and remains traceable.

Useful? React with 👍 / 👎.

@xdotli xdotli merged commit 763b856 into main Jun 14, 2026
3 of 4 checks passed
@xdotli xdotli deleted the fix/idle-heartbeat-and-summary-dedup branch June 14, 2026 06:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant