Skip to content

fix: capture ACP agent stderr and extract error.data from JSON-RPC errors#885

Merged
thepagent merged 3 commits into
mainfrom
fix/854-capture-stderr-and-error-data
May 21, 2026
Merged

fix: capture ACP agent stderr and extract error.data from JSON-RPC errors#885
thepagent merged 3 commits into
mainfrom
fix/854-capture-stderr-and-error-data

Conversation

@chaodu-agent
Copy link
Copy Markdown
Collaborator

@chaodu-agent chaodu-agent commented May 21, 2026

Summary

Closes #854.

When an ACP agent (e.g. codex-acp) hits a runtime error, the real cause is emitted on stderr and in the JSON-RPC error.data.message field — but openab was discarding both. Users only saw the opaque ⚠️ Internal Error (code: -32603) / Internal error in Discord, and operators had zero visibility in kubectl logs.

This PR fixes both gaps:

  1. Capture agent stderr — pipe it and log each line at WARN level, scoped to the agent command name.
  2. Parse error.data — extract error.data.message from JSON-RPC error responses and surface it in the Discord error display as a blockquote.

Before / After

BEFORE (Discord):
⚠️ **Internal Error** (code: -32603)
Internal error

AFTER (Discord):
⚠️ **Internal Error** (code: -32603)
Internal error
> The 'gpt-5.2-codex' model is not supported when using Codex with a ChatGPT account.

Data Flow (ASCII)

┌─────────────────────────────────────────────────────────────────────┐
│                        openab bridge process                        │
│                                                                     │
│  ┌──────────────┐         spawn          ┌───────────────────────┐ │
│  │ AcpConnection├────────────────────────►│  codex-acp (child)    │ │
│  │              │◄─── stdout (JSON-RPC) ──┤                       │ │
│  │              │                         │  stdout: ACP messages │ │
│  │              │◄─── stderr (logging) ───┤  stderr: diagnostics  │ │
│  └──────┬───────┘                         └───────────────────────┘ │
│         │                                                           │
│         │  JSON-RPC error response:                                 │
│         │  {"error":{"code":-32603,                                 │
│         │            "message":"Internal error",                    │
│         │            "data":{"message":"The gpt-5.2-codex..."}}}   │
│         │                                                           │
│         ▼                                                           │
│  ┌──────────────┐                                                   │
│  │ adapter.rs   │── format_coded_error(code, msg, data_message) ──► │
│  └──────────────┘                                                   │
│         │                                                           │
│         ▼                                                           │
│  ┌──────────────┐                                                   │
│  │   Discord    │  "⚠️ **Internal Error** (code: -32603)            │
│  │   channel    │   Internal error                                  │
│  │              │   > The gpt-5.2-codex model is not supported..."  │
│  └──────────────┘                                                   │
│                                                                     │
│  Meanwhile, stderr lines appear in kubectl logs:                    │
│  WARN agent="codex-acp" ERROR codex_acp: Unhandled error...        │
└─────────────────────────────────────────────────────────────────────┘

Investigation: How the ACP Ecosystem Handles This

ACP Spec (agentclientprotocol.com/protocol/transports)

The spec explicitly defines the contract:

"The agent MAY write UTF-8 strings to its standard error (stderr) for logging purposes. Clients MAY capture, forward, or ignore this logging. The agent MUST NOT write anything to its stdout that is not a valid ACP message."

  • stdout = reserved for JSON-RPC ACP messages (the transport channel)
  • stderr = designated channel for human-readable logs/errors
  • The spec leaves it to the client to decide whether to capture stderr

openab was choosing "ignore" — this PR switches to "capture and forward to bridge logs".

acpx (github.com/openclaw/acpx) — Headless ACP Client

acpx is the most mature headless ACP client (supports codex, claude, gemini, copilot, hermes, kiro, etc.). It handles this by:

  • Capturing agent stderr and forwarding it based on output format:
    • --format text (default): stderr lines shown to user
    • --format json --json-strict: stderr suppressed
    • --verbose: full stderr forwarding for debugging
  • Crash reconnect: dead agent processes are detected; sessions are resumed via session/load or transparently fall back to session/new
  • Structured error output: JSON events include stable envelopes with sessionId, requestId, seq for correlation

Hermes Agent (github.com/NousResearch/hermes-agent) — Agent Side

Hermes explicitly documents the stderr contract in their ACP Internals:

  • "Stdout is reserved for ACP JSON-RPC transport. Human-readable logs go to stderr."
  • Boot flow explicitly includes configure stderr logging as a step
  • All runtime errors, tracebacks, and diagnostic info go to stderr
  • Unhandled errors during a turn are logged to stderr before a -32603 error response is sent on stdout

OpenClaw Gateway (docs.openclaw.ai/tools/acp-agents)

OpenClaw's ACP integration (via the acpx plugin) handles errors at multiple levels:

  • Gateway logging: captures console output to rolling JSON-line log files (/tmp/openclaw/openclaw-YYYY-MM-DD.log)
  • Error categorization from error.data: documents specific patterns like "Vendor auth error from the harness" and "Model-not-found from the harness" — detected from the JSON-RPC error data field
  • /acp doctor: provides health probes for the backend
  • Troubleshooting table with actionable fixes for each error category
  • Redaction: sensitive tokens are masked before log output leaves the process

anomalyco/opencode#25568ACP session/new always returns -32603

OpenCode's ACP mode returns -32603 Internal error with "data": {} (empty data object). The user gets zero diagnostic info from the JSON-RPC response alone. Key observations:

  • initialize succeeds but session/new always fails with opaque -32603
  • The error data field is {} — completely empty, no message key
  • This proves stderr capture is essential — when error.data is empty, stderr is the only diagnostic path
  • Issue was closed by PR #25591 (fix was on the agent side), but the client-side observability gap remains

nexu-io/open-design#443session/set_model returns -32603 then timeout

Open Design's ACP client treats ALL -32603 errors as fatal (fail()child.kill(SIGTERM) → prompt never sent → 180s timeout). Their analysis:

  • -32603 from session/set_model is recoverable (just skip model switching, use default)
  • -32603 from session/new is fatal (session broken)
  • Without error.data.message, you cannot distinguish recoverable vs fatal
  • Their fix: gracefully degrade when session/set_model fails, proceed with default model
  • Explicitly calls out: "The ideal fix would be... at minimum to return a proper error response that the client can handle gracefully"

Key Takeaway Across All Repos

-32603 Internal error is overloaded across agents for wildly different failure modes:

Agent Failure error.data
codex-acp Model not supported for account type {"message":"The gpt-5.2-codex model is not supported...","codex_error_info":"other"}
opencode Session init failure (unknown cause) {} (empty)
hermes-agent Model switching unsupported No data field

The two diagnostic channels are:

  1. error.data.message — structured, when the agent populates it
  2. stderr — always available, contains the real stack trace / error detail

Our fix captures both.

Changes

File Change
src/acp/connection.rs Stdio::null()Stdio::piped() + tokio task logs each stderr line at WARN
src/acp/protocol.rs Add data: Option<Value> to JsonRpcError + data_message() helper + updated Display
src/adapter.rs Pass err.data_message() to format_coded_error
src/error_display.rs format_coded_error accepts optional data_message, appends as blockquote (deduped)

Testing

  • Existing unit tests updated for new format_coded_error signature
  • Added tests for data_message extraction and deduplication
  • Manual: deploy with codex-acp using a ChatGPT account (no gpt-5.2-codex access) → stderr now visible in pod logs, Discord shows the real cause

https://discord.com/channels/1491295327620169908/1491365150664560881/1506975049779773540

…rors

Closes #854.

- Pipe agent stderr and log each line at WARN level (scoped to agent
  command name) so operators see the real cause in kubectl logs.
- Add optional `data` field to `JsonRpcError` struct to capture the
  JSON-RPC error.data payload that agents like codex-acp include.
- Surface `error.data.message` in the Discord-facing error display
  (as a blockquote below the coded error) so users get actionable detail
  instead of only the opaque "-32603 Internal error".
- Deduplicate: if data.message already appears in the top-level message,
  it is not repeated.
@chaodu-agent chaodu-agent requested a review from thepagent as a code owner May 21, 2026 11:28
@github-actions github-actions Bot added pending-screening closing-soon PR missing Discord Discussion URL — will auto-close in 3 days labels May 21, 2026
@shaun-agent
Copy link
Copy Markdown
Contributor

shaun-agent commented May 21, 2026

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Screening report screened PR #885 and moved project item `PVTI_lADOEFbZWM4BUUALzgtaguY` from `Incoming` to `PR-Screening`.

GitHub comment: #885 (comment)
Project action: https://github.com/orgs/openabdev/projects/1

Intent

Make ACP runtime failures diagnosable. It fixes discarded child-agent stderr and surfaces error.data.message instead of only showing opaque -32603.

Feat

Fix. Captures ACP stderr, preserves JSON-RPC error.data, extracts data.message, and shows it in Discord error output.

Who It Serves

Agent runtime operators and Discord users.

Rewritten Prompt

Implement ACP error observability: pipe stderr safely, log it with agent context, parse optional JSON-RPC error.data.message, render it in Discord coded errors, and test extraction/deduping.

Merge Pitch

Worth advancing. Small, focused observability fix. Reviewer concerns: stderr noise, sensitive data leakage, lifecycle cleanup, and defensive parsing.

Best-Practice Comparison

Aligns with ACP/OpenClaw/Hermes expectations around stdout as transport and stderr as diagnostics. Does not yet add redaction, structured run logs, doctor checks, or categorization.

Implementation Options

Conservative: merge current focused fix.

Balanced: add truncation/redaction and light docs before merge.

Ambitious: build full ACP error observability with categorized errors, run logs, and recovery behavior.

Comparison Table

Option Speed Complexity Reliability Maintainability User Impact Fit
Conservative High Low Medium High Medium High
Balanced Medium Medium High High High Highest
Ambitious Low High Highest Medium High Medium

Recommendation

Balanced path. Keep the fix scoped, but add bounded output and basic redaction if reviewers can absorb that before merge.

@github-actions github-actions Bot added pending-maintainer and removed closing-soon PR missing Discord Discussion URL — will auto-close in 3 days labels May 21, 2026
Copy link
Copy Markdown
Contributor

@feiyun968-agent feiyun968-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review — fix: capture ACP agent stderr and extract error.data from JSON-RPC errors

Four Core Questions

1. What problem does it solve?
ACP agent(如 codex-acp)runtime error 時,真正原因在 stderr 和 error.data.message,但 openab 兩者都丟棄。用戶只看到 ⚠️ Internal Error (code: -32603) / Internal error,無法診斷。

2. How does it solve it?

  • connection.rs: Stdio::null()Stdio::piped(),spawn tokio task 逐行讀 stderr,以 WARN level 寫入 bridge logs
  • protocol.rs: JsonRpcErrordata: Option<Value> + data_message() helper
  • adapter.rs: 傳 err.data_message()format_coded_error
  • error_display.rs: format_coded_error 接受 data_message,以 blockquote 附加(dedup 防重複)

3. Were alternatives considered?
PR body 有完整 prior art research(acpx、Hermes、OpenClaw、opencode、open-design),符合 contribution guidelines。

4. Is this the best approach?
是。兩條診斷路徑都覆蓋:error.data 有值時用結構化訊息,沒有時 stderr 仍可見於 kubectl logs


🟢 INFO — PR body 品質極高
Prior art research 涵蓋 ACP spec、acpx、Hermes、OpenClaw、opencode、open-design,完整說明 -32603 在不同 agent 的行為差異。

🟢 INFO — 測試完整
新增 test_format_coded_error_with_data_messagetest_format_coded_error_data_message_not_duplicated,覆蓋 happy path 和 dedup 邏輯。所有舊 test 已更新新 signature。

🟢 INFO — ACP spec 合規
stderr capture 符合 spec:"Clients MAY capture, forward, or ignore this logging." 選擇 capture 是正確的。

🟡 NIT — stderr task 沒有 abort handle (connection.rs:370)
spawn 的 stderr reader task 在 AcpConnection drop 時不會被 cancel,可能在 process 結束後短暫 linger。建議存入 struct 以便 drop 時 abort:

// store in AcpConnection so it's aborted on drop
let stderr_task = tokio::spawn(async move { ... });

非 blocker,但長期運行的 pod 可能累積 zombie tasks。

🟡 NIT — data_message() 只取 "message" key (protocol.rs:67)
codex-acp 的 error.data 還有 codex_error_info 欄位,其他 agent 可能用 detailreason 等不同 key。目前只取 message 是合理的保守做法,建議在 comment 說明這是 convention 而非 spec 要求,方便日後擴展。


Verdict: 🟢 LGTM — 兩個 NIT 都不是 blocker。核心邏輯正確,prior art 充分,測試完整。

Review by 司馬懿 🪖

chaodu-agent added 2 commits May 21, 2026 18:36
Strip control characters (except tab) from stderr lines before emitting
to tracing::warn, preventing log injection or terminal escape sequences
from reaching kubectl logs.

Addresses review feedback from 普渡法師.
…_message convention

- Store stderr reader JoinHandle in AcpConnection struct; abort it on
  drop to prevent lingering tasks in long-running pods.
- Add doc comment to data_message() clarifying that the "message" key
  is a convention (codex-acp, JSON-RPC practice), not an ACP spec
  requirement — extend if other agents use different keys.

Addresses NIT feedback from 司馬懿.
@thepagent thepagent merged commit 192f32f into main May 21, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bridge: ACP agent stderr swallowed — only opaque -32603 Internal error reaches Discord

4 participants