Skip to content

feat(fleetnode): carry concurrent commands over the ControlStream#389

Merged
ankitgoswami merged 7 commits into
mainfrom
fleetnode-commands
Jun 5, 2026
Merged

feat(fleetnode): carry concurrent commands over the ControlStream#389
ankitgoswami merged 7 commits into
mainfrom
fleetnode-commands

Conversation

@ankitgoswami

Copy link
Copy Markdown
Contributor

Summary

Groundwork for sending commands to miners paired via fleet nodes. Today the
fleet-node ControlStream carries exactly one thing: server-initiated discovery
(#235). This PR generalizes the control plane so it can carry many concurrent
commands per node, which is the prerequisite for routing operator miner commands
(reboot, curtail, …) down a node's stream via RFC 0001's remote-node Miner
adapter (a later PR).

No behavior change to discovery. This is Phase 1 of a 3-phase effort; it ships
no user-visible capability on its own.

What changed

  • Typed AgentCommand envelope (proto/pairing/v1): the ControlCommand.payload
    is now a pairing.v1.AgentCommand oneof so the node can tell command kinds apart.
    Discovery migrates into the discover arm (done exactly once, on both the admin
    send side and the node decode side). Fields 2/3 are left for the upcoming
    miner-command and pairing arms, coordinated with the parallel pairing effort so the
    discovery payload is never migrated twice.
  • Registry (internal/domain/fleetnode/control): replaces the
    single-in-flight-command-per-node model with a command_id-keyed map.
    Report-bearing commands (discovery, and later pairing) stream batches + a terminal
    ack on an events channel and admit device reports against the existing scope/quota;
    ack-only commands (per-miner commands) take their terminal ack on a per-command
    channel via a new blocking SendCommand. The outbound queue is buffered, and teardown
    frees every in-flight command. Channel-ownership invariants are preserved.
  • Node (cmd/fleetnode/control.go): replaces the single command worker with a
    bounded (16) worker pool, so a long (up to 10-min) discovery no longer
    head-of-line-blocks quick commands and many commands run concurrently; BUSY is acked
    only at the pool ceiling. handleCommand dispatches on envelope kind.

Test plan

  • go test -race ./internal/domain/fleetnode/control/... ./cmd/fleetnode/... — green.
    New/updated coverage: many concurrent commands not rejected, ack-routing by kind,
    SendCommand block/disconnect/ctx-cancel/teardown, node worker pool runs a quick
    command while a long discovery is in flight, BUSY past the pool ceiling, unknown
    envelope kind → BAD_REQUEST.
  • DB-backed DiscoverOnFleetNode integration suite — green (end-to-end envelope
    round-trip through the real handler + registry).
  • Full server build, golangci-lint, and pre-push tsc — clean.

Follow-ups (later PRs)

  • Phase 2: MinerCommand proto + remote-node Miner adapter + fleet_node_device
    resolution branch + node-side executor (end-to-end for the virtual driver).
  • Phase 3: firmware/logs out-of-band transfer, password credential-edit flow, unpair.

🤖 Generated with Claude Code

Groundwork for routing operator miner commands to fleet-node-paired miners
(RFC 0001's remote-node Miner adapter). No behavior change to discovery.

- Wrap the ControlStream payload in a typed pairing.v1.AgentCommand envelope so
  the node can tell command kinds apart; discovery migrates into the `discover`
  arm. Fields 2 and 3 are left for the upcoming miner-command and pairing arms
  so the discovery payload is migrated exactly once.
- Registry: replace the single-in-flight-command-per-node model with a
  command_id-keyed map. Report-bearing commands (discovery, later pairing)
  stream batches + a terminal ack on an events channel; ack-only commands take
  the terminal ack on a per-command channel via a new blocking SendCommand. The
  outbound queue is buffered and teardown frees every in-flight command.
- Node: replace the single command worker with a bounded worker pool, so a long
  discovery no longer head-of-line-blocks quick commands; BUSY is acked only at
  the pool ceiling. handleCommand dispatches on envelope kind.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added javascript Pull requests that update javascript code client server shared labels Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (617bc9030489ec4ea4bf84cc4abd4da61dead233...124e6b64d30cc3782abe8ec78ea13b8f19d7cd2e, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] ControlCommand payload change breaks mixed-version fleetd/fleetnode deployments

  • Category: Protobuf
  • Location: server/internal/domain/fleetnode/discovery/service.go:102 and server/cmd/fleetnode/control.go:240
  • Description: Discovery commands are now marshaled as pairing.v1.AgentCommand, and the fleet node now only decodes that envelope. Previously the same opaque ControlCommand.payload contained a bare DiscoverRequest. Because this is inside a bytes field, protobuf service compatibility checks will not catch it, and there is no version/capability negotiation or fallback decoder.
  • Impact: Rolling upgrades or mismatched server/fleetnode versions will break remote discovery. A new server talking to an old fleet node sends an envelope the old node interprets as the wrong DiscoverRequest shape; an old server talking to a new fleet node sends a bare request the new node rejects as a malformed or unrecognized AgentCommand.
  • Recommendation: Add a transition path: have the new fleet node accept both AgentCommand and legacy bare DiscoverRequest, or negotiate command payload capabilities during ControlHello/ControlAccepted before sending the envelope. Keep the fallback until all supported fleet nodes are guaranteed upgraded.

[MEDIUM] Commands are registered before enqueueing, allowing unbounded per-node inflight growth

  • Category: Reliability
  • Location: server/internal/domain/fleetnode/control/registry.go:172
  • Description: addCmd inserts each command into conn.cmds before Send/SendCommand attempts to enqueue it to the bounded outgoing channel. If the 64-slot outgoing queue is full or the control stream is slow to drain, every additional caller blocks in enqueue while still occupying an entry in conn.cmds. There is no cap on conn.cmds.
  • Impact: An authenticated caller issuing many concurrent DiscoverOnFleetNode requests, or a slow/misbehaving connected fleet node, can retain many inflight command records, channels, goroutines, and request contexts until cancellation or the 12-minute discovery timeout. This regresses from the previous single-inflight rejection model and can become a memory/goroutine exhaustion path.
  • Recommendation: Enforce a per-node total inflight/queued cap before inserting into conn.cmds, and fail fast with ResourceExhausted/FailedPrecondition when full. Alternatively, acquire a bounded queue/token first and only register commands that are guaranteed to be deliverable, while preserving ack routing race safety.

Notes

No auth bypass, SQL injection, command injection, cryptostealing/pool hijack, or frontend XSS issues were apparent in the scoped diff.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

ankitgoswami and others added 3 commits June 4, 2026 09:24
Condense the AgentCommand envelope doc (drop the future-arm field list),
trim a speculative line on reportBearing, and remove changelog-style
"unlike before" narration on Send. Regenerate pairing proto output for
the reworded AgentCommand doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Send and SendCommand duplicated the outbound-queue select (enqueue, or fail on
disconnect/ctx). Extract Registry.enqueue so both share one definition of the
enqueue-or-fail semantics and error mapping. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reconcile with #365 (fan-out nmap discovery), which refactored fleetnode discovery:
- registry.go / registry_test.go: keep the concurrent-command generalization and
  re-add ConnectedFleetNodeIDs (+ its test) from main.
- admin/handler.go: take main's slimmed handler; DiscoverOnFleetNode moved into the
  new fleetnode/discovery service.
- discovery/service.go: wrap the discovery payload in the AgentCommand envelope (the
  migration follows the handler here) so the node decodes it; updated its test.
- control_test.go: stub discoverer clones reports, since the worker pool runs probes
  concurrently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request Jun 4, 2026
Bring #390 onto the reconciled #389 (which now includes main + #365). Only the
shared pairing proto conflicted: keep the MinerCommand additions and the
AgentCommand envelope (field 2 = miner_command now implemented, 3 = pair still
reserved), then regenerate pairing Go/TS with the pinned toolchain. All other
files merged cleanly; #365's discovery refactor and the AgentCommand migration
arrive via #389.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ankitgoswami and others added 2 commits June 4, 2026 11:45
…mands

Discovery is a heavy network scan; running two concurrently on one node doubles the
load for no benefit. Give report-bearing commands (discovery, and the pairing
effort's future pair) an exclusive single-flight slot per node, so a second
concurrent discovery is acked BUSY. Quick per-miner commands keep the broader worker
pool, so a long discovery never head-of-line-blocks them. The node classifies by
AgentCommand arm (it has no registry).

Adds a regression test asserting a second concurrent discovery gets ACK_CODE_BUSY.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… discovery ack

- decodeAgentCommand parses ControlCommand.payload once in the receive loop and
  hands it to handleCommand, removing the redundant second unmarshal.
- deliverAck evicts one best-effort batch to deliver the terminal ack when a
  report-bearing event buffer is full, so RunOnNode no longer strands until
  DiscoverCommandTimeout. Safe under Registry.mu (every events producer holds it).
- Add TestRegistry_TerminalAckDeliveredWhenEventBufferFull.
- Drop em-dashes from comments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami marked this pull request as ready for review June 4, 2026 20:12
@ankitgoswami ankitgoswami requested a review from a team as a code owner June 4, 2026 20:12
Copilot AI review requested due to automatic review settings June 4, 2026 20:12

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c7aade407a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/fleetnode/discovery/service.go

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR lays the groundwork for routing multiple concurrent server→fleet-node commands over the existing ControlStream by (1) introducing a typed AgentCommand envelope for the payload, (2) refactoring the server-side ControlStream registry to support many in-flight commands keyed by command_id, and (3) updating the fleet node to process commands concurrently (with a bounded pool) so long-running discovery doesn’t block other commands.

Changes:

  • Wrap discovery payloads in a new pairing.v1.AgentCommand oneof envelope and update decode paths/tests accordingly.
  • Replace the single in-flight command model with a command_id→inflight map, adding SendCommand for ack-only commands and improving teardown/unblock behavior.
  • Update fleet node control loop to a two-lane concurrency model (single-flight discovery + pooled quick commands), plus new/updated tests for concurrency and error cases.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
server/internal/handlers/fleetnode/admin/handler_discover_test.go Updates admin discovery tests to decode AgentCommand envelope.
server/internal/domain/fleetnode/discovery/service.go Sends discovery as AgentCommand{discover: ...} and updates comments.
server/internal/domain/fleetnode/discovery/service_test.go Adjusts discovery service tests to decode the new envelope.
server/internal/domain/fleetnode/control/stream.go Registry stream now supports multiple in-flight commands; routes ack by command kind.
server/internal/domain/fleetnode/control/session.go Refactors report-bearing Send to use new in-flight command map and enqueue helper.
server/internal/domain/fleetnode/control/registry.go Introduces per-connection outgoing buffer + in-flight map + add/remove helpers.
server/internal/domain/fleetnode/control/registry_test.go Adds coverage for concurrent commands, ack routing, teardown, buffer-full ack delivery.
server/internal/domain/fleetnode/control/command.go Adds SendCommand API for ack-only commands (blocking until terminal ack).
server/cmd/fleetnode/control.go Replaces single worker with concurrency lanes + envelope decoding + dispatch by command kind.
server/cmd/fleetnode/control_test.go Updates payload generation to AgentCommand envelope; adds new behavior tests.
proto/pairing/v1/pairing.proto Adds AgentCommand message (oneof envelope).
server/generated/grpc/pairing/v1/pairing.pb.go Generated Go output for the new proto message.
client/src/protoFleet/api/generated/pairing/v1/pairing_pb.ts Generated TS output for the new proto message.

Comment thread server/cmd/fleetnode/control.go Outdated
…plicitly

slot is already per-iteration (declared in the loop body), so the closure
capture was correct, but cmd/env/parseErr were passed as args while the lane
was captured. Pass the lane too, so each handler provably releases the lane it
acquired and the pattern is uniform.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread server/cmd/fleetnode/control.go
Comment thread proto/pairing/v1/pairing.proto
@ankitgoswami ankitgoswami merged commit 40e1571 into main Jun 5, 2026
74 checks passed
@ankitgoswami ankitgoswami deleted the fleetnode-commands branch June 5, 2026 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants