fix(acp): don't idle-timeout an agent while a tool call is in-flight by xdotli · Pull Request #758 · benchflow-ai/benchflow

xdotli · 2026-06-14T06:21:06Z

What

The ACP idle watchdog aborts a prompt when no tool_call / agent_message_chunk / agent_thought_chunk update arrives for idle_timeout seconds (default 600s). But a long-running tool — a build/test/solver shell execute — emits no ACP updates until it returns. So a tool that runs longer than idle_timeout false-fired the watchdog and discarded real work.

Surfaced in the deepseek-v4-pro / deepagents dogfood: the agent emitted a tool_call (status pending) for a shell command, the command ran >600s, and the watchdog killed a productive run (42 prior tool calls), which then needed a retry to recover.

Fix

While a tool call is in-flight (pending/in_progress), treat the agent as active and reset the idle clock — deferring to the wall-clock timeout as the hard backstop for a tool that genuinely never returns.

if cur_count > last_count:
    last_progress = now
    last_count = cur_count
elif session.pending_tool_call_ids():   # in-flight tool = working, not hung
    last_progress = now

This does not weaken hang detection (the watchdog's actual purpose):

A genuine model-side hang has no pending tool call (the prior tool already completed via tool_call_update), so it still trips the idle path.
A completed-tool-then-hang still idle-times-out — test_idle_timeout_info_reflects_activity_counts (uses a COMPLETED tool) is unchanged and passing.
deadline is loop-invariant (fixed at loop start, never re-derived from last_progress), so the reset cannot push the wall-clock backstop out indefinitely. Added a load-bearing comment so a future refactor can't reintroduce that hang.

Verification

New regression test test_in_flight_tool_call_defers_idle_to_wall_clock: a pending tool that hangs raises the wall-clock AgentPromptTimeoutError, not IdleTimeoutError.
Full suite 4072 passed; ruff check / ruff format --check / ty check clean.
Adversarial safety review (APPROVE-WITH-NITS): confirmed the deadline is loop-invariant, the model-hang path still fires, and there's no data race (single event loop, no threads).

Note

A second item considered in the same investigation — an apparent summary "under-count" (8 rollouts reported as total: 7) — turned out not to be a bug: it's the retry mechanism (a task idle-timed-out, was retried, and the retry passed), and both the live summary and the post-hoc metrics.py aggregation correctly dedupe retries by task_name, keeping the best attempt. No change made there. This PR's fix also reduces why that retry was needed in the first place.

Note

Medium Risk
Changes rollout timeout behavior for all ACP runs with idle detection; misclassification of pending tools could weaken hang detection or allow stuck tools to run until wall-clock expiry.

Overview
Fixes false idle timeouts during long-running shell tools: the ACP idle watchdog in _prompt_with_idle_watchdog now treats pending/in-progress tool calls as activity and refreshes the idle clock via session.pending_tool_call_ids(), since those tools often emit no ACP updates until they finish.

Model-side hangs after a completed tool are unchanged—the idle path still fires when there is no pending tool. The wall-clock timeout remains the hard cap (including tools that never return). Comments clarify that deadline stays fixed at loop start so resetting last_progress for in-flight tools cannot push the wall-clock backstop out indefinitely.

Adds regression test test_in_flight_tool_call_defers_idle_to_wall_clock: a pending tool that hangs hits AgentPromptTimeoutError at the wall-clock budget, not IdleTimeoutError at the shorter idle threshold.

^{Reviewed by Cursor Bugbot for commit 7061a1f. Bugbot is set up for automated code reviews on this repo. Configure here.}

The idle watchdog fires when no tool_call/message/thought ACP update arrives for idle_timeout seconds. But a long-running tool (e.g. a build/test/solver shell `execute`) emits NO ACP updates until it returns, so a tool that runs longer than idle_timeout false-fired the watchdog and discarded real work. Surfaced in a deepseek-v4-pro/deepagents dogfood: the agent emitted a tool_call (status pending) for a shell command, the command ran >600s, and the watchdog killed a productive run (which then needed a retry to recover). Fix: while a tool call is in-flight (pending/in_progress), treat the agent as active and reset the idle clock, deferring to the wall-clock `timeout` as the hard backstop for a tool that never returns. This does NOT weaken hang detection: a genuine model-side hang has no pending tool call (the prior tool already completed via tool_call_update), so it still trips the idle path — and a completed-tool-then-hang still idle-times-out (existing test unchanged). `deadline` is loop-invariant, so the reset cannot push the wall-clock backstop out. Adds test_in_flight_tool_call_defers_idle_to_wall_clock. Full suite 4072 green.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7061a1f899

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T06:23:27Z

+        """A pending/in-progress tool call (agent running a long shell command)
+        must NOT trip the idle watchdog: such tools emit no ACP updates until
+        they return, so the idle path would otherwise false-kill real work. The
+        agent is alive (executing a tool), so it defers to the wall-clock
+        backstop. Contrast test_idle_timeout_info_reflects_activity_counts, where
+        a *completed* tool that then hangs still idles out."""


Add the guarded commit to this regression docstring

The repo-level AGENTS.md for /workspace/benchflow says “Regression tests must name the PR/commit they guard,” and this newly added regression test documents the in-flight-tool watchdog behavior without naming the guarding PR or commit. Please include the relevant PR/commit identifier in this docstring so the test follows the repository convention and remains traceable.

Useful? React with 👍 / 👎.

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

xdotli merged commit 763b856 into main Jun 14, 2026
3 of 4 checks passed

xdotli deleted the fix/idle-heartbeat-and-summary-dedup branch June 14, 2026 06:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(acp): don't idle-timeout an agent while a tool call is in-flight#758

fix(acp): don't idle-timeout an agent while a tool call is in-flight#758
xdotli merged 1 commit into
mainfrom
fix/idle-heartbeat-and-summary-dedup

xdotli commented Jun 14, 2026 •

edited by cursor Bot

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xdotli commented Jun 14, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Fix

Verification

Note

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xdotli commented Jun 14, 2026 •

edited by cursor Bot

Loading