feat(fleetnode): server-initiated discovery via ControlStream #235
🔐 Codex Security Review
Review Summary
Overall Risk: HIGH
Findings
- [HIGH] Hostname Discovery Bypasses Private-Address Enforcement
- [MEDIUM] Fleet-Node Reports Can Claim Unowned Discovery Rows By Identifier Alone
- [MEDIUM] Fleet-Node Pairing Can Race Cloud Pairing
Notes
No cryptostealing or pool-hijack behavior was evident in the reviewed diff. Generated protobuf/sqlc changes were considered only as they reflected source changes.
Generated by Codex Security Review
Three findings from a fresh re-review of PR #235 after the server/agent split (the original Codex inline comments on this PR target a file that now lives on a different branch).
- ProcedurePermissions catalog drift: Pair/Unpair/ListFleetNodeDevices/DiscoverOnFleetNode were listed as "UNIMPLEMENTED STUB" in ProceduresPendingMigration even though their handlers are fully implemented and gated via RequirePermission. The contract test reads this map as the source of truth, so a regression that dropped the gate would have gone unnoticed. Moved all four entries into ProcedurePermissions with the right key (manage / read).
- Unbounded IPList / ports counts in DiscoverOnFleetNode: an operator with fleetnode:manage could submit 1M IPs * 10k ports. Added (buf.validate.field).repeated.max_items on the proto (4096 on ip_addresses, 256 on ports in all three modes) plus defense-in-depth maxIPListEntries / maxPortsPerMode checks in normalizeDiscoverRequest so the limits hold even if the validator interceptor is misconfigured.
- Silent event drop in fleetnodecontrol.Registry: the non-blocking publish to a 16-slot buffer dropped batches silently when the operator stream fell behind. Bumped the buffer to 64 and added an atomic dropped-event counter exposed via Registry.DroppedEvents so callers and tests have a signal that batches were lost.
Two items deliberately deferred (see PR description): RFC1918 / private-range gating on discovery targets, and surfacing the dropped-batch count to the operator UX.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
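To make the dropped-event fix concrete, here is a minimal sketch of a bounded, non-blocking publish with an atomic drop counter. The `eventBuffer` type and its methods are hypothetical stand-ins, not the actual fleetnodecontrol.Registry code, and the sketch leaves out the mutex that the real registry holds around the send.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// eventBuffer is a hypothetical stand-in for one per-command event channel.
type eventBuffer struct {
	ch      chan string
	dropped atomic.Uint64
}

func newEventBuffer(size int) *eventBuffer {
	return &eventBuffer{ch: make(chan string, size)}
}

// publish never blocks. When the buffer is full it counts the drop
// and returns false instead of waiting for the reader to catch up.
func (b *eventBuffer) publish(ev string) bool {
	select {
	case b.ch <- ev:
		return true
	default:
		b.dropped.Add(1)
		return false
	}
}

// DroppedEvents reports how many events were discarded so far.
func (b *eventBuffer) DroppedEvents() uint64 { return b.dropped.Load() }

func main() {
	b := newEventBuffer(2) // tiny buffer so the drop path is easy to see
	for i := 0; i < 5; i++ {
		b.publish(fmt.Sprintf("batch-%d", i))
	}
	fmt.Println("queued:", len(b.ch), "dropped:", b.DroppedEvents())
}
```

With a 2-slot buffer and no reader, the first two batches are queued and the remaining three are counted as dropped, which is the signal the commit message says callers and tests previously lacked.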
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa3cf42025
PR 2 of a stack. Layers operator-initiated discovery on top of the pairing + agent-reporting surface in PR 1 (#332). Builds on the existing fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory registry correlates server-issued ControlCommand requests with the agent's eventual ReportDiscoveredDevices batches.
What's in this PR:
- fleetnodecontrol.Registry: single-instance in-memory map of fleet_node_id -> active ControlStream + per-command_id event channel (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a done channel (so outgoing channel is never closed under a publisher); Send selects on done to bail cleanly. Publishers hold the mutex through the bounded non-blocking send to avoid panicking on a closed channel when cleanup races. Dropped-event counter on a 64-slot buffer, exposed via DroppedEvents().
- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent cannot hold a server goroutine + HTTP/2 stream indefinitely. After Hello, registers the stream and pumps outgoing ControlCommand requests + incoming ControlAck responses through a side goroutine (2-buffer to avoid linger on exit).
- ReportDiscoveredDevices: rejects reports without a command_id or whose command_id is not in flight for this fleet_node (binds to server-issued ControlCommand). UpsertDiscoveredDevices now returns acceptedIdx []int instead of an opaque count; only the rows the store actually accepted are forwarded to the operator's command stream so ownership-rejected rows can't leak.
- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC. Validates target is CONFIRMED, normalizes IPRange to IPList (capped at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap. Wraps the operator ctx with DiscoverCommandTimeout (5m default, var for test override) so a buggy/silent agent cannot pin operator streams + registry entries forever. Returns CodeDeadlineExceeded on timeout. Uses id.GenerateID() for command_id and proto.Marshal for the payload.
- discovered_by_fleet_node_id is immutable origin tracking. Set on first agent report; never cleared by PairDevice / UnpairDevice / RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any discovered_device with DiscoveredByFleetNodeID != nil so an agent-reported private IP cannot redirect cloud credentialing later. Migration 000064 adds the column + FK + partial index.
- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto one row; mac:/serial: identifiers pass through unchanged.
- pairing.proto: buf.validate count caps on DiscoverRequest modes (4096 IPs, 256 ports per mode).
- middleware: DiscoverOnFleetNode gated on fleetnode:manage.
Review fixes folded in:
- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10) to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the column, failing the whole batch as an internal error.
- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt locals reset on closure entry, so a RunInTx retry after a retryable Postgres/commit failure can no longer double-count a batch. Adds a unit test for the retry path and a DB-backed test for the 32-char scheme.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mcharles-square
left a comment
A synthesized fingerprint (auto:...) otherwise: a hash of the fleet
node's identity plus the address, port, and device type.
I wonder if this is too forgiving/brittle? Would it be better to bail if devices can't be identified by MAC address or serial? How is this handled by other MMSs?
Antminers don't advertise MAC or serial via an unauthenticated API. Foreman actually asks for credentials even before discovery, so that's a very different UX.
1. Overview
This PR adds the server-initiated discovery path, where
an operator in the web UI can kick off a scan on a chosen fleet node and watch
results stream in live.
The cast
    flowchart LR
        Operator((Operator)) -->|web UI| ProtoFleet
        ProtoFleet -->|RPC over internet| fleetd[fleetd cloud server]
        fleetd -. "cannot reach directly" .-> Miners
        fleetd <-->|persistent control stream| Agent[Fleet node agent]
        subgraph LAN["Private miner network"]
            Agent -->|miner plugins| Miners[(Miners)]
        end

The dashed line is the whole point: the cloud cannot dial the miners. The
agent can.
2. Fleet node lifecycle
Before a node can scan anything, it has to be enrolled and trusted. This is a
one-time setup per node.
    stateDiagram-v2
        [*] --> PENDING: agent starts enrollment
        PENDING --> AWAITING_CONFIRMATION: identity handshake completes
        AWAITING_CONFIRMATION --> CONFIRMED: operator approves in the UI
        CONFIRMED --> REVOKED: operator revokes
        REVOKED --> [*]

UI. They paste it into the agent, which performs an identity handshake with
the server (proving it holds a private key it generated locally).
operator approves it. The server mints an API key the agent stores
locally. Only confirmed nodes can be asked to scan.
heartbeats and the control stream. It refreshes its session credentials
automatically before they expire.
longer authenticate, and its device pairings are released.
Everything in this guide assumes the node is CONFIRMED and running.
3. The control stream: the backbone
A running agent opens one long-lived, two-way connection to the server called
the control stream. Think of it as an always-open phone line:
Key properties:
running, the agent answers "busy" and the operator is told to retry shortly.
blip), the new connection replaces the old one cleanly.
increasing backoff delay so a flapping network does not hammer the server.
single-server design; it is not yet built for a horizontally-scaled cloud.
Separately, the agent sends a periodic heartbeat so the server knows it is
alive even when no scan is running.
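The "new connection replaces the old one" behaviour can be sketched as a newest-wins map in which the displaced stream is told to stop through a done channel. The `registry` and `stream` types below are hypothetical and only illustrate the idea; they are not the PR's Registry.

```go
package main

import (
	"fmt"
	"sync"
)

// stream is a hypothetical handle for one agent's control stream.
type stream struct {
	id   int
	done chan struct{} // closed when a newer stream replaces this one
}

type registry struct {
	mu     sync.Mutex
	active map[string]*stream
}

func newRegistry() *registry {
	return &registry{active: map[string]*stream{}}
}

// register installs s as the active stream for nodeID. Any previous stream
// is signalled through its done channel rather than closed out from under
// a sender, so writers can select on done and exit cleanly.
func (r *registry) register(nodeID string, s *stream) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if old, ok := r.active[nodeID]; ok && old != s {
		close(old.done)
	}
	r.active[nodeID] = s
}

// evicted reports whether s has been replaced by a newer stream.
func evicted(s *stream) bool {
	select {
	case <-s.done:
		return true
	default:
		return false
	}
}

func main() {
	r := newRegistry()
	first := &stream{id: 1, done: make(chan struct{})}
	second := &stream{id: 2, done: make(chan struct{})}
	r.register("node-a", first)
	r.register("node-a", second) // agent reconnects after a network blip
	fmt.Println(evicted(first), evicted(second))
}
```

Because the map lives in one process's memory, this approach only works while a single server instance owns all control streams, which matches the single-server caveat noted above.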
4. The discovery workflow
This is the main feature. An operator starts a scan and watches results arrive
in real time until the scan finishes.
    sequenceDiagram
        participant Op as Operator (ProtoFleet)
        participant S as fleetd (cloud)
        participant A as Fleet node agent
        participant M as Miners (LAN)
        Op->>S: DiscoverOnFleetNode(node, request)
        Note over S: Validate request,<br/>compute allowed "scope",<br/>assign a command id
        S->>A: command (over control stream)
        alt agent already busy
            A-->>S: ack BUSY
            S-->>Op: "node busy, retry shortly"
        else accepted
            loop for each address:port
                A->>M: probe / nmap
            end
            loop results in batches
                A->>S: ReportDiscoveredDevices(batch)
                Note over S: Drop anything out of scope,<br/>not a private IP, or over quota
                S-->>Op: stream accepted devices
            end
            A-->>S: final ack (OK or PARTIAL)
            S-->>Op: stream completes
        end

Step by step:
scan request, then calls DiscoverOnFleetNode. This requires the
fleetnode:manage permission. It is a streaming call: the connection
stays open and devices appear as they are found.
checks the request is sane (see Guardrails), generates a
unique command id, computes the scope (exactly which addresses and
ports the agent is allowed to report for this command), and sends the command
down the control stream.
miner plugins to connect and identify the device, or by running an
nmap sweep. Probes run in parallel with a per-probe time limit so one slow address
cannot stall the whole scan.
via ReportDiscoveredDevices, tagged with the command id. It does not
wait until the end; results stream up incrementally.
enforces the guardrails, stores the ones it accepts, and streams those back
to the operator's open connection.
finish) or PARTIAL (some results uploaded, but it ran out of time or hit
a limit). The operator's stream then completes.
Scan modes
The operator chooses how to describe what to scan:
nmap-style target (single IP, CIDR block, or A.B.C.D-N range)

Ports can be specified explicitly, or left empty to use each plugin's default
discovery ports.
5. How devices are identified
Every discovered device needs a stable identifier so that re-scanning the
same network updates the existing record instead of creating duplicates. The
agent picks one in this order:
- The MAC address (mac:...) if the device reports one.
- The serial number (serial:...) if it reports one.
- A synthesized fingerprint (auto:...) otherwise: a hash of the fleet node's identity plus the address, port, and device type.
The fleet node's identity is mixed into that fingerprint on purpose. Two
different networks often reuse the same private address ranges (for example,
both using 192.168.1.x). Including the node identity stops a miner on one
network from being mistaken for a different miner at the same address on another
network.
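As an illustration of why mixing in the node identity matters, a synthesized fingerprint could be computed along these lines. The choice of SHA-256, the zero-byte field separator, the 16-character truncation, and the `autoID` name are all assumptions for this sketch; the PR only states that the fingerprint is a hash of the node identity, address, port, and device type.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// autoID derives a stable "auto:" identifier from the fleet node identity
// and the probed endpoint. Hash and encoding choices are assumptions.
func autoID(nodeID, addr string, port uint16, deviceType string) string {
	h := sha256.New()
	for _, part := range []string{nodeID, addr, strconv.Itoa(int(port)), deviceType} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator so field boundaries stay unambiguous
	}
	return "auto:" + hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	a := autoID("node-a", "192.168.1.10", 4028, "antminer")
	b := autoID("node-b", "192.168.1.10", 4028, "antminer")
	fmt.Println(a == autoID("node-a", "192.168.1.10", 4028, "antminer")) // same inputs, same id
	fmt.Println(a == b)                                                  // same address, different node
}
```

Re-scanning from the same node reproduces the same identifier, while an identical private address seen from a different node yields a different one.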
6. Discovered devices, attribution, and pairing
Attribution
Every discovered device record remembers which fleet node found it
("attribution"). This single fact drives an important safety rule.
The cloud-exclusion rule (important)
The cloud has its own, separate discovery path for miners it can reach
directly. To prevent the cloud from ever trying to dial a private LAN address it
cannot actually reach (and should not be probing), the rule is:
In other words, once a device is attributed to a fleet node, only that node's
reports (and the pairing flow) touch it. The cloud leaves it alone.
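The commit message describes this rule as the cloud refusing to dial any discovered device whose DiscoveredByFleetNodeID is non-nil. A minimal sketch of such a filter follows; the `discoveredDevice` struct and the `cloudDialable` function are hypothetical and stand in for whatever pairing.PairDevices actually does.

```go
package main

import "fmt"

// discoveredDevice is a trimmed, hypothetical view of a discovery record.
type discoveredDevice struct {
	ID                      string
	DiscoveredByFleetNodeID *string // nil means the cloud found it directly
}

// cloudDialable keeps only the devices the cloud may dial itself:
// anything attributed to a fleet node is skipped.
func cloudDialable(devs []discoveredDevice) []discoveredDevice {
	var out []discoveredDevice
	for _, d := range devs {
		if d.DiscoveredByFleetNodeID != nil {
			continue // attributed to a fleet node: leave it to that node
		}
		out = append(out, d)
	}
	return out
}

func main() {
	node := "node-a"
	devs := []discoveredDevice{
		{ID: "cloud-1"},
		{ID: "lan-1", DiscoveredByFleetNodeID: &node},
	}
	for _, d := range cloudDialable(devs) {
		fmt.Println(d.ID)
	}
}
```

Keying the check on attribution rather than on pairing state means the exclusion still applies after a device is unpaired from its node.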
Pairing
To actively manage a discovered device through a fleet node, the operator
pairs it (PairDeviceToFleetNode). Pairing rules:
told to unpair it from the cloud first. Allowing both would leave the node
unable to refresh the device while the system reported it as fleet-node
managed.)
keep the record fresh.
    flowchart TD
        Found[Device discovered by Node A] -->|attributed to A| Excluded[Hidden from cloud dial list]
        Found -->|operator pairs to A| Paired[Managed via Node A]
        Paired -->|operator unpairs| Found

7. Guardrails
Because a fleet node runs on someone's network and reports back over the
internet, the server treats every report as untrusted input. These checks
all live on the server side, so a buggy or compromised agent cannot bypass them.
DiscoverOnFleetNode, pairing, confirm, and revoke require fleetnode:manage; listing requires fleetnode:read
javascript:) reaching the operator's browser
8. Outcomes and what the operator sees
Each scan ends with a coded acknowledgement. The server translates these into
clear outcomes for the operator:
9. Tricky lifecycle cases worth knowing
replacement node is enrolled, the replacement can reclaim those device
records the first time it re-scans and finds the same machines. They do not
get stuck pointing at the dead node. Attribution simply moves to the new node,
so the cloud-exclusion rule still holds and the cloud never starts dialing
those addresses.
cannot be paired to a fleet node until it is unpaired from the cloud first
(see Pairing).
distinct because the fleet node identity is folded into synthesized device
fingerprints (see How devices are identified).
10. Orientation map
If you want to trace the feature in the codebase, the moving parts are:
- server/internal/handlers/fleetnode/admin/
- server/internal/handlers/fleetnode/gateway/
- server/internal/domain/fleetnode/control/
- server/internal/domain/fleetnode/pairing/
- server/cmd/fleetnode/
- server/internal/domain/discoverylimits/, server/internal/domain/nmaptarget/
- server/migrations/000069_*, server/migrations/000070_*
Test plan
- go build ./... and go vet ./...
- golangci-lint clean on changed packages
- DB_PASSWORD=fleet go test -race across the registry, discovery, pairing, and handler packages (registry race tests, command-id binding, quota, scope binding, attribution transfer, and the DB-backed admin/gateway suites)
🤖 Generated with Claude Code