docs(rfc): expand RFC 0001 to cover operational concerns #222

Merged
ankitgoswami merged 5 commits into main from docs/rfc-0001-fleet-node-amendments
May 14, 2026

Conversation

@ankitgoswami ankitgoswami commented May 12, 2026

Contributor

Summary

Amends RFC 0001 (fleet node + cloud-server split) in place to address eight operational concerns surfaced in review. The original RFC was directionally right but underspecified on local control, HA, bidirectional security, deployment/updates, host telemetry, network-config assumptions, and on-prem-instance preservation.

Stats: 1 file, +186 / -61.

What changed in the RFC

| Section | Change |
| --- | --- |
| Architecture > Key properties | Network-assumptions bullet: outbound 443 / HTTP/2 only; no inbound port required |
| Local control during WAN outage (new) | Local degraded-mode UI on :8443, separate local-admin credential, idempotent local commands with on-reconnect reconciliation |
| High availability and self-hosted topologies (new) | Server HA via stateless replicas + Postgres advisory-lock scheduler election; Tier 0–2 support matrix; every tier fully self-hostable |
| Deployment and live updates (new) | Cloud reference architecture, ControlGoaway graceful drain, signed node releases, opt-in NodeSelfUpdate |
| Host observability (new) | /metrics endpoints, golden + domain signals, combined-mode parity, optional cloud forwarding |
| Authentication | Split into Node identity (ingress) and Command issuance (egress): per-command (actor, seq, signature) over a separate cloud command-issuance key, replay protection, audit, blast radius enumerated |
| Credentials | Per-org data key stored in a separate 0600-protected file from identity/miner-signing keys; same-host reinstall backup includes the persisted monotonic_seq counter |
| Phased rollout | Phase 1 row enumerates the proto-field surface that lands up front (signing fields, ControlGoaway, NodeSelfUpdate); each phase row gains a Combined-mode parity column |
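
The replay protection mentioned in the Authentication and Credentials rows depends on the node persisting a `monotonic_seq` counter. A minimal sketch of that check alone, assuming the command signature has already been verified elsewhere and that a single counter is kept per node; the `SeqGuard` class and the JSON state file are hypothetical illustrations, not taken from the RFC:

```python
import json
import os
import tempfile


class SeqGuard:
    """Hypothetical helper: accepts a command only if its seq is strictly
    greater than the last accepted seq, and persists the counter to disk."""

    def __init__(self, state_path: str) -> None:
        self.state_path = state_path
        self.last_seq = self._load()

    def _load(self) -> int:
        # A missing state file is treated as "no command accepted yet".
        try:
            with open(self.state_path, "r", encoding="utf-8") as f:
                return int(json.load(f)["monotonic_seq"])
        except FileNotFoundError:
            return 0

    def _persist(self, seq: int) -> None:
        # Write to a temp file and rename it into place so a crash
        # cannot leave a half-written counter behind.
        directory = os.path.dirname(self.state_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"monotonic_seq": seq}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.state_path)

    def accept(self, seq: int) -> bool:
        # Replayed or reordered commands carry a seq <= last_seq.
        if seq <= self.last_seq:
            return False
        self._persist(seq)  # persist before acting on the command
        self.last_seq = seq
        return True
```

Persisting before acting means a restart cannot roll the counter back, which is also why a reinstall backup that omits the counter would reopen the replay window.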

The terminology rename (agent → fleet node) is applied throughout the document.

🤖 Generated with Claude Code

@github-actions github-actions Bot added the documentation label May 12, 2026

github-actions Bot commented May 12, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (a2db45bbd9e381650603bbeb53e2a95994a2baa8...267d8d4eb9498257185aa0c571cd7445de8af54b, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Org-wide credential key breaks node-scoped ownership

  • Category: Cryptostealing/Pool Hijack
  • Location: docs/rfcs/0001-agent-server-split.md:198
  • Description: The accepted design places one AES-256-GCM data key for the whole org on every node. That conflicts with the new multi-node-per-site model and the stated node_device ownership boundary: compromising any one node gives access to credentials for miners owned by every other node in the org.
  • Impact: A compromised node can decrypt pool/miner credentials outside its ownership scope. If miners for sibling nodes are reachable on the same LAN, this can bypass device-scoped authorization and enable hashrate redirection or pool credential theft.
  • Recommendation: Use per-node or per-device envelope encryption and only provision credentials needed for devices owned by that node. Do not rely on LAN segmentation as an authorization boundary.
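
The "only provision credentials needed for devices owned by that node" part of this recommendation can be sketched as a filter over the ownership map. This is an illustration under assumed data shapes: the `credentials_for_node` helper and both dictionaries are hypothetical, and the envelope encryption itself is deliberately left out.

```python
from typing import Dict


def credentials_for_node(
    node_id: str,
    ownership: Dict[str, str],
    credentials: Dict[str, str],
) -> Dict[str, str]:
    """Return only the credential blobs for devices owned by node_id.

    ownership   maps device_id -> owning node_id (assumed shape).
    credentials maps device_id -> opaque, already-encrypted blob (assumed shape).
    Devices with no ownership entry are never provisioned to any node.
    """
    return {
        device_id: blob
        for device_id, blob in credentials.items()
        if ownership.get(device_id) == node_id
    }
```

With this shape, a compromised node holds blobs only for its own devices, regardless of which other miners it can reach on the LAN.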

[HIGH] Warm-standby failover can create split-brain miner control

  • Category: Reliability
  • Location: docs/rfcs/0001-agent-server-split.md:89
  • Description: Automatic standby promotion is based on the active node’s heartbeat lapsing, but the RFC does not specify fencing for the old active node. Because the old node can still reach miners locally, especially during WAN loss, both active and standby nodes can control the same devices.
  • Impact: Network partitions can lead to duplicate or conflicting reboot, curtailment, power-target, or pool-update commands. This is operationally dangerous and can corrupt telemetry/audit history.
  • Recommendation: Add an explicit fencing model before automatic promotion: lease epochs enforced by node command dispatch, old-active write shutdown after lease expiry, operator-confirmed promotion, or another mechanism that prevents two nodes from issuing writes to the same miners.
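
One of the fencing options named here, lease epochs enforced at command dispatch with write shutdown after lease expiry, can be sketched as follows. This is a simplified illustration, not the RFC's design: `Lease`, `FencedDispatcher`, and the way a node learns about newer epochs are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    epoch: int         # assumed to be incremented by the issuer on every promotion
    expires_at: float  # seconds on the node's local monotonic clock


class FencedDispatcher:
    """Hypothetical gate in front of miner writes: a node may write only
    while it holds an unexpired lease at the newest epoch it has seen."""

    def __init__(self) -> None:
        self.lease: Optional[Lease] = None
        self.highest_epoch_seen = 0

    def grant(self, lease: Lease) -> bool:
        # Never accept a lease older than an epoch already observed.
        if lease.epoch < self.highest_epoch_seen:
            return False
        self.highest_epoch_seen = lease.epoch
        self.lease = lease
        return True

    def observe_epoch(self, epoch: int) -> None:
        # Learning of a newer epoch (for example, the standby was promoted)
        # fences this node immediately.
        if epoch > self.highest_epoch_seen:
            self.highest_epoch_seen = epoch

    def can_write(self, now: float) -> bool:
        lease = self.lease
        return (
            lease is not None
            and lease.epoch == self.highest_epoch_seen
            and now < lease.expires_at
        )
```

Because expiry is checked against the local clock (for example `time.monotonic()`), the old active node stops writing on its own during a WAN partition, even though it cannot hear about the promotion.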

[MEDIUM] Local offline command reconciliation uses unsafe last-write-wins semantics

  • Category: Reliability
  • Location: docs/rfcs/0001-agent-server-split.md:70
  • Description: Offline local commands are reconciled by backfilling logs to the cloud, with conflict resolution defined as last-write-wins per miner. For operational controls like curtailment, start/stop mining, reboot, and power targets, last-write-wins is too coarse.
  • Impact: A stale cloud command or delayed local command can silently override a newer safety or operational decision when connectivity returns.
  • Recommendation: Define command-specific reconciliation rules with epochs/version checks, explicit operator conflict surfacing, and precedence for safety/curtailment actions instead of generic last-write-wins.
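
A rough sketch of what version-checked reconciliation with safety precedence could look like, in contrast to last-write-wins. Everything here is assumed for illustration: the `Command` fields, the per-miner version counter, and the choice of `curtail` as the only safety-class command.

```python
from dataclasses import dataclass, field
from typing import Dict, List

SAFETY_KINDS = {"curtail"}  # assumed set of safety-class commands


@dataclass(frozen=True)
class Command:
    miner_id: str
    kind: str
    base_version: int  # miner state version the issuer had observed
    source: str        # "local" or "cloud"


@dataclass
class Reconciler:
    versions: Dict[str, int] = field(default_factory=dict)
    conflicts: List[Command] = field(default_factory=list)

    def apply(self, cmd: Command) -> bool:
        current = self.versions.get(cmd.miner_id, 0)
        stale = cmd.base_version != current
        # A stale non-safety command is surfaced to the operator instead of
        # silently overriding a newer decision. Safety commands always apply.
        if stale and cmd.kind not in SAFETY_KINDS:
            self.conflicts.append(cmd)
            return False
        self.versions[cmd.miner_id] = current + 1
        return True
```

For example, a cloud `set_power_target` issued against version 0 is rejected and queued as a conflict if a local `curtail` has already moved the miner to version 1, while the same command reissued against version 1 applies normally.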

[MEDIUM] Command signing key has fleet-wide blast radius and no rotation protocol

  • Category: Auth
  • Location: docs/rfcs/0001-agent-server-split.md:186
  • Description: The RFC describes a cloud command-issuance Ed25519 key pinned by nodes, but does not scope the key per org/node or define a secure key-rotation/revocation protocol.
  • Impact: Compromise of that signing key appears to allow valid command signing for any node until manual remediation. This includes miner-control and pool-update commands.
  • Recommendation: Scope command signing keys per tenant or per node, include key IDs and validity windows, and define a rotation path that nodes can verify without trusting a compromised old key.
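
The key-ID, tenant-scoping, and validity-window checks suggested here could run on the node before any signature verification. A minimal sketch under assumed names (`PinnedKey`, `select_key`); the Ed25519 verification itself and the rotation protocol are out of scope for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PinnedKey:
    key_id: str
    org_id: str         # tenant the key is scoped to
    public_key: bytes   # raw public key bytes; verification not shown here
    not_before: float   # validity window start (seconds since epoch)
    not_after: float    # validity window end, exclusive
    revoked: bool = False


def select_key(
    keys: Dict[str, PinnedKey], key_id: str, org_id: str, now: float
) -> Optional[PinnedKey]:
    """Return the pinned key a command may be verified against, or None
    if the key is unknown, revoked, scoped to another org, or outside
    its validity window."""
    key = keys.get(key_id)
    if key is None or key.revoked or key.org_id != org_id:
        return None
    if not (key.not_before <= now < key.not_after):
        return None
    return key
```

Rejecting on scope and window first limits what a single leaked key can sign, since a key for one org or an expired key never reaches the verification step.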

Notes

The diff only changes docs/rfcs/0001-agent-server-split.md; there are no implementation hunks, migrations, protobuf changes, frontend changes, or Docker changes to audit in this PR.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

Amends RFC 0001 (fleet node + cloud-server split) in place to address
operational concerns surfaced in review.

New or rewritten sections:
- Local control during WAN outage: local degraded-mode UI on the node
  with a separate local-admin credential and on-reconnect reconciliation.
- High availability and self-hosted topologies: multi-node-per-site with
  disjoint device ownership in Phase 3; warm-standby node pairs in
  Phase 6 (operator-confirmed pair, cloud-issued active lease, automatic
  promotion on heartbeat lapse). Multi-replica server HA is deferred to
  a follow-up RFC; non-miner device integration trimmed to a brief
  out-of-scope note.
- Deployment and live updates: cloud reference architecture, graceful
  drain via ControlGoaway, signed release artifacts, zero-downtime
  upgrade via the warm-standby swap. NodeSelfUpdate is not in scope.
- Host observability: /metrics endpoints with golden + domain signals,
  combined-mode parity, optional cloud forwarding.
- Authentication: split into node identity (ingress) and per-command
  signing (egress) with replay protection via monotonic_seq.
- Credentials: per-org data key stored in its own 0600 file; same-host
  reinstall backup includes the persisted monotonic_seq counter.
- Phased rollout: phase table gains a combined-mode parity column and
  the Phase 1 row enumerates every proto-field surface used by later
  phases.

Terminology: agent -> fleet node throughout. Code-level identifiers
(agentbootstrap, agentauth, agentgateway.proto) are unchanged here
and rename in a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the docs/rfc-0001-fleet-node-amendments branch from 4f38a79 to 22f8aea Compare May 12, 2026 20:06
@ankitgoswami ankitgoswami marked this pull request as ready for review May 12, 2026 20:07
@ankitgoswami ankitgoswami requested a review from a team as a code owner May 12, 2026 20:07
Copilot AI review requested due to automatic review settings May 12, 2026 20:07

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22f8aea855

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread docs/rfcs/0001-agent-server-split.md Outdated

Copilot AI left a comment

Contributor

Pull request overview

Updates RFC 0001 (fleet node + cloud-server split) to better specify operational behavior and rollout planning for split deployments, including local operations during WAN outages, HA/topologies, deployment/update strategy, host observability, and a more detailed security model.

Changes:

  • Expands the RFC with new sections covering local-degraded mode, HA/self-hosted tiers, deployment/updates, and host metrics.
  • Refines security/credentials guidance (separating node identity vs per-command issuance; key storage and replay protection).
  • Updates the phased rollout plan, including combined-mode parity expectations.
Comments suppressed due to low confidence (1)

docs/rfcs/0001-agent-server-split.md:180

  • This section also calls the ownership map node_device, but the repo uses fleet_node_device in the DB schema / generated models. Aligning terminology here would make the doc easier to map onto the actual implementation.
- Gateway streams require both a long-lived bearer api_key (authorization) and a short-lived session token minted from a unary handshake (proof-of-possession). A leaked api_key alone cannot impersonate the node from another host.
- Device-scoped actions are gated by a `node_device` ownership map populated only by operator-confirmed pairing; discovery never auto-claims ownership.

Comment thread docs/rfcs/0001-agent-server-split.md Outdated
Comment thread docs/rfcs/0001-agent-server-split.md Outdated
Comment thread docs/rfcs/0001-agent-server-split.md
Comment thread docs/rfcs/0001-agent-server-split.md
Comment thread docs/rfcs/0001-agent-server-split.md
@ankitgoswami ankitgoswami changed the title docs(rfc): expand RFC 0001 to cover 8 operational concerns docs(rfc): expand RFC 0001 to cover operational concerns May 12, 2026
ankitgoswami and others added 3 commits May 12, 2026 15:28
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>
Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62920debd1

Comment thread docs/rfcs/0001-agent-server-split.md
Comment thread docs/rfcs/0001-agent-server-split.md
Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 267d8d4eb9

@block block deleted a comment from chatgpt-codex-connector Bot May 14, 2026
@ankitgoswami ankitgoswami merged commit a36703e into main May 14, 2026
26 checks passed
@ankitgoswami ankitgoswami deleted the docs/rfc-0001-fleet-node-amendments branch May 14, 2026 16:31

Labels

documentation Improvements or additions to documentation

4 participants