Skip to content

feat(gateway): fail-fast provider semantics self-test at boot#739

Merged
marcusrbrown merged 2 commits into
mainfrom
feat/gateway-safety-wins
Jun 3, 2026
Merged

feat(gateway): fail-fast provider semantics self-test at boot#739
marcusrbrown merged 2 commits into
mainfrom
feat/gateway-safety-wins

Conversation

@marcusrbrown
Copy link
Copy Markdown
Collaborator

Wires the coordination provider semantics self-test into the gateway boot sequence so a misconfigured object-store provider fails startup before the gateway connects to Discord or serves traffic.

What

The gateway's lock and run-state coordination depends on the object-store honoring conditional writes (If-None-Match / If-Match). That contract was validated by validateProviderSemantics but never exercised at boot — a provider that silently ignores conditional headers would only surface as corruption during a real lock operation.

This runs the self-test fail-fast after the S3 adapter is created and before client.login(). A provider that doesn't honor conditional semantics now aborts startup with a clear error instead of booting into a broken coordination state.

  • Injected as runProviderSelfTest on the program deps (real impl in main.ts, stubbed in tests), consistent with the existing readiness/login injection pattern.
  • Two boot tests: self-test runs once before login; boot rejects and does not connect when the self-test fails.

Also reconciles the plan docs: the no-fro-bot block role is dropped as a redundant deny-list over the existing allow-list authorization model (exclude a user by not granting the trigger role).

fro-bot
fro-bot previously approved these changes Jun 3, 2026
Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: PASS

This is a clean, well-scoped safety improvement. It wires the existing validateProviderSemantics self-test into the boot sequence so a misconfigured object-store provider (one that silently ignores If-None-Match/If-Match) aborts startup before the gateway connects to Discord or serves traffic. The injection follows the established GatewayProgramDeps pattern, the placement is correct, and the new tests cover both the ordering guarantee and the fail-fast path.

Verification performed

  • Traced the error-propagation chain: validateProviderSemantics (Result) → validateProviderSemanticsEffect (Effect<void, Error>) → main.ts Effect.runPromise (rejects on failure) → program.ts Effect.tryPromise (catches + re-fails). The catch normalizes non-Error rejections via new Error(String(error)), so the message survives the double-wrap. The boot fails fast test confirms login is never called.
  • Confirmed placement (program.ts:138-141): the self-test runs after createS3Adapter and before startAnnounceServer, installShutdownHandlers, and login. A failed probe therefore never opens the HTTP ingress or registers shutdown handlers — clean abort, no partial boot.
  • Ran pnpm --filter @fro-bot/gateway test: 681 passed (34 files).

Blocking issues

None.

Non-blocking concerns

  • program.ts:131-137 constructs selfTestCoordConfig with the same five fields (storeAdapter, storeConfig, lockTtlSeconds, heartbeatIntervalMs, staleThresholdMs) duplicated at mentionDeps (lines 259-265) and recoverStaleRuns (lines 333-339). Three literal copies of the coordination-config shape now exist in one function; a small makeCoordinationConfig(s3Adapter, config) helper would remove the drift risk. Not blocking — purely a maintainability nit, and the TTL/heartbeat/stale constants are the module defaults in every case.
  • The self-test issues a real write+delete probe against the object store on every boot. This is the documented intent (fail-fast over silent corruption) and the probe key is namespaced (self-test/_probe/locks/semantics.json), so it won't collide with live coordination state. Worth keeping in mind for environments where boot frequency is high, but acceptable for a single-process gateway.

Missing tests

None that block. The two new tests (self-test runs once before login; boot rejects without connecting when the self-test fails) cover the behavior this PR adds. The underlying semantics matrix (IfNoneMatch/IfMatch phases) is already exercised by runtime/src/coordination/self-test.test.ts, and the Effect wrapper by runtime-effect.test.ts. One optional addition: assert the self-test runs before startAnnounceServer (currently only ordering vs. login is asserted) — minor, since the announce server starts after the self-test in source order regardless.

Risk assessment (LOW/MED/HIGH): LOW

Additive, dependency-injected change confined to the gateway boot path. No public API surface changes, no behavior change for an already-correctly-configured provider (the probe succeeds and boot proceeds identically). Worst-case regression is a gateway that refuses to start against a non-compliant store — which is the explicit, desired fail-closed behavior. Blast radius is limited to startup; runtime request handling is untouched. The plan-doc edits are documentation reconciliation only.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26856145401
Cache hit
Session ses_1751ef036ffe1akU51JQzX4G55

Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: PASS

Re-review after commit 020732d. The core change is unchanged from my prior review (which was dismissed): it wires the existing validateProviderSemantics self-test into the boot sequence so a misconfigured object-store provider aborts startup before the gateway connects to Discord or serves traffic. The new commit extracts a makeCoordinationConfig helper, directly resolving the maintainability nit I raised previously. The change is clean and behavior-preserving.

Verification performed

  • Confirmed makeCoordinationConfig (program.ts:34-42) is a pure helper typed as CoordinationConfig, returning the exact same five fields (storeAdapter, storeConfig, lockTtlSeconds, heartbeatIntervalMs, staleThresholdMs) that were previously duplicated. It is now applied at all three call sites: self-test (:147), mention handler (:269), and stale-run recovery (:337). No values changed — pure deduplication.
  • Re-traced the fail-fast chain: validateProviderSemantics (Result) → validateProviderSemanticsEffect (Effect<void, Error>) → main.ts Effect.runPromiseprogram.ts Effect.tryPromise (catch normalizes non-Error via new Error(String(error))). Placement is still correct: self-test runs after createS3Adapter and before startAnnounceServer, installShutdownHandlers, and login, so a failed probe never opens the HTTP ingress.
  • Ran pnpm --filter @fro-bot/gateway test: 681 passed (34 files).

Blocking issues

None.

Non-blocking concerns

  • The self-test issues a real write+delete probe against the object store on every boot. This is the documented intent (fail-fast over silent corruption) and the probe key is namespaced (self-test/_probe/locks/semantics.json), so it won't collide with live coordination state. Acceptable for a single-process gateway; flagged only for awareness in high-boot-frequency environments.

Missing tests

None that block. The two boot tests (self-test runs once before login; boot rejects without connecting when the self-test fails) cover the added behavior; the semantics matrix is covered by runtime/src/coordination/self-test.test.ts and the Effect wrapper by runtime-effect.test.ts. The makeCoordinationConfig extraction needs no dedicated test — it is exercised transitively by the existing 681 passing tests and changes no values. Optional (unchanged from prior review): assert self-test ordering vs. startAnnounceServer, not just login.

Risk assessment (LOW/MED/HIGH): LOW

Additive, dependency-injected change confined to the gateway boot path, plus a behavior-preserving local refactor. No public API surface changes; no behavior change for a correctly-configured provider. Worst-case regression is a gateway that refuses to start against a non-compliant store — the explicit, desired fail-closed behavior. The refactor's blast radius is internal to makeGatewayProgram and fully covered by the green test suite. Plan-doc edits are documentation reconciliation only.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26856437316
Cache hit
Session ses_1751ef036ffe1akU51JQzX4G55

@marcusrbrown marcusrbrown merged commit b0c8d9b into main Jun 3, 2026
13 of 22 checks passed
@marcusrbrown marcusrbrown deleted the feat/gateway-safety-wins branch June 3, 2026 00:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants