fro-bot · marcusrbrown · May 31, 2026 · May 31, 2026 · May 31, 2026
@@ -0,0 +1,215 @@
+---
+title: Gateway OpenCode mention-loop best practices
+date: 2026-05-30
+category: best-practices
+module: gateway
+problem_type: best_practice
+component: assistant
+severity: high
+related_components:
+  - runtime
+  - workspace-agent
+  - discord
+  - coordination
+applies_when:
+  - remote-attaching a gateway to a workspace-bound OpenCode server
+  - forwarding authorization across both HTTP and SSE event streams
+  - recovering stale execution runs after a restart
+  - streaming partial agent output during long-running sessions
+  - enforcing single-run ownership and timeout cleanup
+tags:
+  - gateway
+  - opencode
+  - remote-attach
+  - sse
+  - bearer-token
+  - recovery
+  - stream-flush
+  - abort-signal
+---
+
+# Gateway OpenCode mention-loop best practices
+
+Patterns from the gateway `@fro-bot` mention loop — an authorized Discord mention in a
+bound channel drives an OpenCode session against the cloned repo. Captured after the work
+cleared an 11-reviewer review pass plus a Fro Bot review that caught a P0 (recovery lock
+ownership) and a blocking UX gap (partial output discarded on failure). These are the
+load-bearing decisions worth reusing in later gateway units.
+
+Adjacent docs (different angle, not duplicates):
+- `discord-slash-command-orchestration-patterns-2026-05-27.md` — the `/add-project`
+  orchestration side (members.fetch auth, partial-failure recovery, IAT handling).
+- `signed-webhook-ingress-hardening-2026-05-29.md` — the inbound webhook trust boundary
+  (fail-closed auth, never-log-secret, no auth oracle).
+- `../code-quality/architectural-issues-type-safety-and-resource-cleanup.md` — the
+  finally-guaranteed cleanup discipline this doc's dual-finally sharpens.
+
+## Context
+
+Unit 6 needed the gateway (one container) to drive OpenCode against a repo checked out in a
+sandboxed workspace container. The instinct is to treat "OpenCode in another container" as a
+special transport. It is not — and the wrong instinct leads to either breaking the sandbox
+(running OpenCode in the gateway against a shared volume) or rebuilding the whole execution
+loop inside the workspace. The patterns below are the middle path: reuse the proven HTTP+SSE
+transport, put a thin auth boundary in front of it, and harden the failure and recovery edges
+that only show up under crashes, timeouts, and concurrent triggers.
+
+## Guidance
+
+### 1. Remote attach is the same transport as in-process
+
+`createOpencode()` is `createOpencodeServer()` + `createOpencodeClient({baseUrl})` talking over
+HTTP+SSE. "Remote attach" is just pointing `baseUrl` at another container — no special mode.
+Crucially, the SDK's SSE path (`event.subscribe`) is **fetch-based, not `EventSource`**, so a
+custom `Authorization` header survives on the `/event` stream, not only on HTTP calls. Verify
+this kind of transport assumption with a throwaway spike before building on it.
+
+```ts
+// packages/runtime/src/agent/remote-client.ts
+const client = createOpencodeClient({baseUrl, headers})
+return {client, server: {url: baseUrl, close: () => {}}, shutdown: () => {}}
+```
+```ts
+// packages/gateway/src/execute/opencode-attach.ts — header injected here, never logged
+export function attachOpencode(baseURL: string, token: string): OpenCodeServerHandle {
+  return createRemoteOpenCodeHandle(baseURL, {Authorization: `Bearer ${token}`})
+}
+```
+
+### 2. Bearer-token reverse proxy fronting a loopback-bound server
+
+OpenCode's SDK server has **no native auth**. So bind it to `127.0.0.1` only and make a
+bearer-token reverse proxy the sole sandbox-net-reachable surface. The proxy validates with
+`timingSafeEqual` (length pre-check, then constant-time compare) and rejects with a fixed 401
+body. Network isolation + proxy auth together are the trust model; neither alone is.
+
+```ts
+// apps/workspace-agent/src/main.ts
+const OPENCODE_PORT = 54321
+const OPENCODE_HOSTNAME = '127.0.0.1' // loopback only — never sandbox-net
+const PROXY_PORT = 9200               // the only reachable surface
+```
+```ts
+// apps/workspace-agent/src/opencode-proxy.ts
+let authorized = false
+if (presentedBuf.length === expectedBuf.length) authorized = timingSafeEqual(presentedBuf, expectedBuf)
+if (authorized === false) { res.writeHead(401, {'Content-Type': 'text/plain'}); res.end(UNAUTHORIZED_BODY); return }
+```
+
+### 3. Gate stale-run lock release on `run_id` ownership
+
+Startup recovery sweeps runs left `EXECUTING` by a crash → transitions them `FAILED` and
+releases the repo lock. But it must **verify the current lock record's `run_id` matches the
+stale run before releasing**. A stale run-state whose lease already expired may have had its
+lock re-acquired by a newer, live run; releasing blindly deletes the newer run's lock and
+permits concurrent execution against the same repo. This was a P0.
+
+```ts
+// packages/gateway/src/execute/recovery.ts
+const lockFetch = await fetchLockRecord(coordinationConfig, lockKeyResult.data, logger)
+if (lockFetch !== null) {
+  if (lockFetch.runId === run.run_id) {
+    await releaseLock(coordinationConfig, repo, lockFetch.etag, coordLogger)
+  } else {
+    logger.warn({runId: run.run_id, repo, lockRunId: lockFetch.runId}, 'recovery: lock owned by another run — skip release')
+  }
+}
+```
+
+An unparseable/missing `run_id` resolves to `null` → skip the release (fail safe: never delete
+a lock you cannot prove belongs to the stale run).
+
+### 4. Flush partial output on failure paths
+
+A streaming sink flushed only on the success path silently discards everything on
+timeout/error — so the most expensive failure (a 600s timeout) yields the least information.
+Flush buffered output best-effort **in the catch path, before the coarse error reply**, in its
+own try/catch so a flush failure cannot mask the original error. Guard against double-post.
+
+```ts
+// packages/gateway/src/execute/run.ts (catch path)
+if (sink !== null) {
+  await sink.flush().catch((flushError: unknown) =>
+    logger.warn({repo, runId, err: String(flushError)}, 'run: sink.flush failed in error path'))
+}
+// ...then send the coarse, detail-free user message
+```
+
+### 5. Bounded execution + guaranteed resource release
+
+Thread `AbortSignal.timeout(runTimeoutMs)` into the event loop so a hung stream can't run
+forever. Release resources in nested `finally` blocks: inner releases the lock + stops the
+heartbeat, outer **always** releases the concurrency slot — both survive the timeout/throw
+path.
+
+```ts
+// packages/gateway/src/execute/run.ts
+const timeoutSignal = AbortSignal.timeout(runTimeoutMs)
+try {
+  try {
+    await runOpenCodeCore({handle, directory: binding.workspacePath, promptText, sink, signal: timeoutSignal, logger})
+  } finally {
+    if (heartbeatStopped === false) await heartbeat.stop().catch(() => {})
+    await releaseLock(coordinationConfig, repo, lockEtag, coordLogger)
+  }
+} finally {
+  concurrency.release(channelId) // always, even on throw
+}
+```
+
+### 6. EOF before the terminal signal is a failure; classify auth by status, not text
+
+A stream that ends **without** `session.idle` must throw (run marked `FAILED`), not resolve as
+`COMPLETED` — otherwise a dropped connection looks like success while the agent may still be
+mutating the repo. Separately, classify auth errors by **numeric status (401/403) only**;
+substring-matching "401"/"unauthorized"/"forbidden" in a stringified error misclassifies
+unrelated errors whose payload happens to contain those tokens.
+
+```ts
+// packages/gateway/src/execute/run-core.ts
+// signal aborted → 'timeout'; stream ended without session.idle → 'stream-ended' (both throw)
+// auth detection: trust response.status === 401 || response.status === 403 only
+```
+
+## Why This Matters
+
+- **Transport clarity** — knowing remote attach is just a `baseUrl` swap (with fetch-based SSE
+  that honors headers) is what makes the bearer-proxy boundary viable instead of a rebuild.
+- **The proxy is the trust boundary** because the server has no auth of its own.
+- **Ownership-gated release** prevents one stale run from deleting a newer run's lock — a
+  silent concurrency-corruption P0 that no happy-path test catches.
+- **Failure-path flush** preserves the only useful output in exactly the cases users care about
+  most (timeouts, mid-run failures).
+- **Dual-finally + AbortSignal.timeout** make hangs and throws non-catastrophic instead of
+  slot/lock leaks that wedge a channel until the next restart.
+- **Terminal-signal correctness** stops dropped streams from masquerading as completed work;
+  **numeric auth classification** avoids false "workspace unreachable" replies.
+
+## When to Apply
+
+- Attaching OpenCode (or any SDK server) across containers/hosts over HTTP+SSE.
+- Fronting a no-auth local server with a bearer-token boundary.
+- Recovering stale leases/locks at startup where another holder may have taken over.
+- Streaming output to a user-visible sink during long-running async work.
+- Any long-running orchestration with concurrency caps and a per-resource lock.
+- Consuming SSE where the terminal event (not EOF) defines success.
+
+## Examples
+
+**Remote attach** — `createRemoteOpenCodeHandle(baseUrl, {Authorization: 'Bearer ' + token})`;
+`event.subscribe()` carries the header because the SDK SSE path is fetch-based.
+
+**Proxy security** — bind OpenCode to `127.0.0.1`; expose only the reverse proxy; verify the
+bearer with `timingSafeEqual`; reject with a fixed 401 body.
+
+**Recovery** — read the current lock record; release only when `lockFetch.runId === run.run_id`;
+skip on mismatch or unparseable record.
+
+**Failure handling** — best-effort `sink.flush()` inside catch, then the coarse user reply;
+never let a flush failure hide the real error.
+
+**Lifecycle** — `AbortSignal.timeout(runTimeoutMs)`; inner finally = heartbeat stop + lock
+release; outer finally = concurrency slot release.
+
+**Stream correctness** — `session.idle` = success; EOF without idle = `stream-ended` (throw);
+auth detection by `401/403` status, not message text.