docs(rfc): expand RFC 0001 to cover operational concerns#222
Conversation
🔐 Codex Security Review
Review SummaryOverall Risk: HIGH Findings[HIGH] Org-wide credential key breaks node-scoped ownership
[HIGH] Warm-standby failover can create split-brain miner control
[MEDIUM] Local offline command reconciliation uses unsafe last-write-wins semantics
[MEDIUM] Command signing key has fleet-wide blast radius and no rotation protocol
NotesThe diff only changes Generated by Codex Security Review | |
Amends RFC 0001 (fleet node + cloud-server split) in place to address operational concerns surfaced in review. New or rewritten sections: - Local control during WAN outage: local degraded-mode UI on the node with a separate local-admin credential and on-reconnect reconciliation. - High availability and self-hosted topologies: multi-node-per-site with disjoint device ownership in Phase 3; warm-standby node pairs in Phase 6 (operator-confirmed pair, cloud-issued active lease, automatic promotion on heartbeat lapse). Multi-replica server HA is deferred to a follow-up RFC; non-miner device integration trimmed to a brief out-of-scope note. - Deployment and live updates: cloud reference architecture, graceful drain via ControlGoaway, signed release artifacts, zero-downtime upgrade via the warm-standby swap. NodeSelfUpdate is not in scope. - Host observability: /metrics endpoints with golden + domain signals, combined-mode parity, optional cloud forwarding. - Authentication: split into node identity (ingress) and per-command signing (egress) with replay protection via monotonic_seq. - Credentials: per-org data key stored in its own 0600 file; same-host reinstall backup includes the persisted monotonic_seq counter. - Phased rollout: phase table gains a combined-mode parity column and the Phase 1 row enumerates every proto-field surface used by later phases. Terminology: agent -> fleet node throughout. Code-level identifiers (agentbootstrap, agentauth, agentgateway.proto) are unchanged here and rename in a separate PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4f38a79 to
22f8aea
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22f8aea855
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
Updates RFC 0001 (fleet node + cloud-server split) to better specify operational behavior and rollout planning for split deployments, including local operations during WAN outages, HA/topologies, deployment/update strategy, host observability, and a more detailed security model.
Changes:
- Expands the RFC with new sections covering local-degraded mode, HA/self-hosted tiers, deployment/updates, and host metrics.
- Refines security/credentials guidance (separating node identity vs per-command issuance; key storage and replay protection).
- Updates the phased rollout plan, including combined-mode parity expectations.
Comments suppressed due to low confidence (1)
docs/rfcs/0001-agent-server-split.md:180
- This section also calls the ownership map
node_device, but the repo usesfleet_node_devicein the DB schema / generated models. Aligning terminology here would make the doc easier to map onto the actual implementation.
- Gateway streams require both a long-lived bearer api_key (authorization) and a short-lived session token minted from a unary handshake (proof-of-possession). A leaked api_key alone cannot impersonate the node from another host.
- Device-scoped actions are gated by a `node_device` ownership map populated only by operator-confirmed pairing; discovery never auto-claims ownership.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>
Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62920debd1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 267d8d4eb9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Summary
Amends RFC 0001 — fleet node + cloud-server split in place to address eight operational concerns surfaced in review. The original RFC was directionally right but underspecified on local control, HA, bidirectional security, deployment/updates, host telemetry, network-config assumptions and on-prem-instance preservation.
Stats: 1 file, +186 / -61.
What changed in the RFC
:8443, separate local-admin credential, idempotent local commands with on-reconnect reconciliationControlGoawaygraceful drain, signed node releases, opt-inNodeSelfUpdate/metricsendpoints, golden + domain signals, combined-mode parity, optional cloud forwarding(actor, seq, signature)over a separate cloud command-issuance key, replay protection, audit, blast radius enumerated0600-protected file from identity/miner-signing keys; same-host reinstall backup includes the persistedmonotonic_seqcounterControlGoaway,NodeSelfUpdate); each phase row gains a Combined-mode parity columnThe terminology rename (agent → fleet node) is applied throughout the document
🤖 Generated with Claude Code