fix: capture ACP agent stderr and extract error.data from JSON-RPC errors#885
Conversation
…rors Closes #854. - Pipe agent stderr and log each line at WARN level (scoped to agent command name) so operators see the real cause in kubectl logs. - Add optional `data` field to `JsonRpcError` struct to capture the JSON-RPC error.data payload that agents like codex-acp include. - Surface `error.data.message` in the Discord-facing error display (as a blockquote below the coded error) so users get actionable detail instead of only the opaque "-32603 Internal error". - Deduplicate: if data.message already appears in the top-level message, it is not repeated.
OpenAB PR ScreeningThis is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Screening reportscreened PR #885 and moved project item `PVTI_lADOEFbZWM4BUUALzgtaguY` from `Incoming` to `PR-Screening`.GitHub comment: #885 (comment) IntentMake ACP runtime failures diagnosable. It fixes discarded child-agent stderr and surfaces FeatFix. Captures ACP stderr, preserves JSON-RPC Who It ServesAgent runtime operators and Discord users. Rewritten PromptImplement ACP error observability: pipe stderr safely, log it with agent context, parse optional JSON-RPC Merge PitchWorth advancing. Small, focused observability fix. Reviewer concerns: stderr noise, sensitive data leakage, lifecycle cleanup, and defensive parsing. Best-Practice ComparisonAligns with ACP/OpenClaw/Hermes expectations around stdout as transport and stderr as diagnostics. Does not yet add redaction, structured run logs, doctor checks, or categorization. Implementation OptionsConservative: merge current focused fix. Balanced: add truncation/redaction and light docs before merge. Ambitious: build full ACP error observability with categorized errors, run logs, and recovery behavior. Comparison Table
RecommendationBalanced path. Keep the fix scoped, but add bounded output and basic redaction if reviewers can absorb that before merge. |
feiyun968-agent
left a comment
There was a problem hiding this comment.
PR Review — fix: capture ACP agent stderr and extract error.data from JSON-RPC errors
Four Core Questions
1. What problem does it solve?
ACP agent(如 codex-acp)runtime error 時,真正原因在 stderr 和 error.data.message,但 openab 兩者都丟棄。用戶只看到 ⚠️ Internal Error (code: -32603) / Internal error,無法診斷。
2. How does it solve it?
connection.rs:Stdio::null()→Stdio::piped(),spawn tokio task 逐行讀 stderr,以WARNlevel 寫入 bridge logsprotocol.rs:JsonRpcError加data: Option<Value>+data_message()helperadapter.rs: 傳err.data_message()給format_coded_errorerror_display.rs:format_coded_error接受data_message,以 blockquote 附加(dedup 防重複)
3. Were alternatives considered?
PR body 有完整 prior art research(acpx、Hermes、OpenClaw、opencode、open-design),符合 contribution guidelines。
4. Is this the best approach?
是。兩條診斷路徑都覆蓋:error.data 有值時用結構化訊息,沒有時 stderr 仍可見於 kubectl logs。
🟢 INFO — PR body 品質極高
Prior art research 涵蓋 ACP spec、acpx、Hermes、OpenClaw、opencode、open-design,完整說明 -32603 在不同 agent 的行為差異。
🟢 INFO — 測試完整
新增 test_format_coded_error_with_data_message 和 test_format_coded_error_data_message_not_duplicated,覆蓋 happy path 和 dedup 邏輯。所有舊 test 已更新新 signature。
🟢 INFO — ACP spec 合規
stderr capture 符合 spec:"Clients MAY capture, forward, or ignore this logging." 選擇 capture 是正確的。
🟡 NIT — stderr task 沒有 abort handle (connection.rs:370)
spawn 的 stderr reader task 在 AcpConnection drop 時不會被 cancel,可能在 process 結束後短暫 linger。建議存入 struct 以便 drop 時 abort:
// store in AcpConnection so it's aborted on drop
let stderr_task = tokio::spawn(async move { ... });非 blocker,但長期運行的 pod 可能累積 zombie tasks。
🟡 NIT — data_message() 只取 "message" key (protocol.rs:67)
codex-acp 的 error.data 還有 codex_error_info 欄位,其他 agent 可能用 detail、reason 等不同 key。目前只取 message 是合理的保守做法,建議在 comment 說明這是 convention 而非 spec 要求,方便日後擴展。
Verdict: 🟢 LGTM — 兩個 NIT 都不是 blocker。核心邏輯正確,prior art 充分,測試完整。
Review by 司馬懿 🪖
Strip control characters (except tab) from stderr lines before emitting to tracing::warn, preventing log injection or terminal escape sequences from reaching kubectl logs. Addresses review feedback from 普渡法師.
…_message convention - Store stderr reader JoinHandle in AcpConnection struct; abort it on drop to prevent lingering tasks in long-running pods. - Add doc comment to data_message() clarifying that the "message" key is a convention (codex-acp, JSON-RPC practice), not an ACP spec requirement — extend if other agents use different keys. Addresses NIT feedback from 司馬懿.
Summary
Closes #854.
When an ACP agent (e.g.
codex-acp) hits a runtime error, the real cause is emitted on stderr and in the JSON-RPCerror.data.messagefield — but openab was discarding both. Users only saw the opaque⚠️ Internal Error (code: -32603) / Internal errorin Discord, and operators had zero visibility inkubectl logs.This PR fixes both gaps:
WARNlevel, scoped to the agent command name.error.data— extracterror.data.messagefrom JSON-RPC error responses and surface it in the Discord error display as a blockquote.Before / After
Data Flow (ASCII)
Investigation: How the ACP Ecosystem Handles This
ACP Spec (agentclientprotocol.com/protocol/transports)
The spec explicitly defines the contract:
openab was choosing "ignore" — this PR switches to "capture and forward to bridge logs".
acpx (github.com/openclaw/acpx) — Headless ACP Client
acpx is the most mature headless ACP client (supports codex, claude, gemini, copilot, hermes, kiro, etc.). It handles this by:
--format text(default): stderr lines shown to user--format json --json-strict: stderr suppressed--verbose: full stderr forwarding for debuggingsession/loador transparently fall back tosession/newsessionId,requestId,seqfor correlationHermes Agent (github.com/NousResearch/hermes-agent) — Agent Side
Hermes explicitly documents the stderr contract in their ACP Internals:
configure stderr loggingas a step-32603error response is sent on stdoutOpenClaw Gateway (docs.openclaw.ai/tools/acp-agents)
OpenClaw's ACP integration (via the acpx plugin) handles errors at multiple levels:
/tmp/openclaw/openclaw-YYYY-MM-DD.log)error.data: documents specific patterns like "Vendor auth error from the harness" and "Model-not-found from the harness" — detected from the JSON-RPC errordatafield/acp doctor: provides health probes for the backendanomalyco/opencode#25568 — ACP session/new always returns -32603
OpenCode's ACP mode returns
-32603 Internal errorwith"data": {}(empty data object). The user gets zero diagnostic info from the JSON-RPC response alone. Key observations:initializesucceeds butsession/newalways fails with opaque-32603datafield is{}— completely empty, nomessagekeyerror.datais empty, stderr is the only diagnostic pathnexu-io/open-design#443 — session/set_model returns -32603 then timeout
Open Design's ACP client treats ALL
-32603errors as fatal (fail()→child.kill(SIGTERM)→ prompt never sent → 180s timeout). Their analysis:-32603fromsession/set_modelis recoverable (just skip model switching, use default)-32603fromsession/newis fatal (session broken)error.data.message, you cannot distinguish recoverable vs fatalsession/set_modelfails, proceed with default modelKey Takeaway Across All Repos
-32603 Internal erroris overloaded across agents for wildly different failure modes:{"message":"The gpt-5.2-codex model is not supported...","codex_error_info":"other"}{}(empty)The two diagnostic channels are:
error.data.message— structured, when the agent populates itstderr— always available, contains the real stack trace / error detailOur fix captures both.
Changes
src/acp/connection.rsStdio::null()→Stdio::piped()+ tokio task logs each stderr line at WARNsrc/acp/protocol.rsdata: Option<Value>toJsonRpcError+data_message()helper + updatedDisplaysrc/adapter.rserr.data_message()toformat_coded_errorsrc/error_display.rsformat_coded_erroraccepts optionaldata_message, appends as blockquote (deduped)Testing
format_coded_errorsignaturedata_messageextraction and deduplicationcodex-acpusing a ChatGPT account (no gpt-5.2-codex access) → stderr now visible in pod logs, Discord shows the real causehttps://discord.com/channels/1491295327620169908/1491365150664560881/1506975049779773540