Skip to content

fleetnode: harden enrollment transport + stage install layout (stack 1/3)#322

Merged
ankitgoswami merged 2 commits into
mainfrom
ankitg/fleetnode-install-enrollment
May 27, 2026
Merged

fleetnode: harden enrollment transport + stage install layout (stack 1/3)#322
ankitgoswami merged 2 commits into
mainfrom
ankitg/fleetnode-install-enrollment

Conversation

@ankitgoswami

Copy link
Copy Markdown
Contributor

Summary

PR 1 of 3 in a stack that splits up #256. Carries the install + enrollment surface so the agent can be built and registered before any daemon/discovery code lands.

  • Scheme-aware gateway HTTP client. TLS-validating http2.Transport for https://, h2c for http://. The previous shared AllowHTTP + DialTLSContext shim silently downgraded https:// requests to plaintext, defeating ValidateServerURL's non-loopback requirement. `NewGatewayClient` and `NewAuthenticatedGatewayClient` now return error so a bad scheme has somewhere to surface.
  • Per-RPC handshake timeout. `handshakeStepTimeout` is now a var (test override seam). `Register` is wrapped via `withHandshakeTimeout` so a blackholed server cannot hang `fleetnode enroll` while the state lock is held.
  • State-lock contention diagnostics. `LOCK_EX | LOCK_NB` with the owner PID written under the lock; contention surfaces an actionable error instead of blocking.
  • Shared h2c test helpers in `testutil/h2c.go`; the existing `fake_gateway_test.go` now serves over h2c to match production wire behavior.
  • just build-fleetnode stages `server/.fleetnode/{fleetnode, nmap, plugins/}`; `.gitignore` covers the artifact dir. Operators can build and enroll today; the run + discovery layers land in stack PRs 2 and 3.
  • README skeleton covers Overview, Subcommands (enroll/refresh/status), State directory and lock, Build, Enrollment flow, Security model (transport + state file + lock), Development.

Stack

  1. PR 1 (this one): install + enrollment hardening → `main`
  2. PR 2: `fleetnode run` heartbeat + plugin scaffolding → PR 1
  3. PR 3: discovery (`ControlStream` + nmap + IPList normalizer) → PR 2

Test plan

  • `go build ./server/...`
  • `go vet ./server/...`
  • `golangci-lint run` on cmd/fleetnode + internal/fleetnodebootstrap → 0 issues
  • `go test -race ./server/cmd/fleetnode/... ./server/internal/fleetnodebootstrap/...`
  • Build the agent: `just build-fleetnode`; verify `server/.fleetnode/fleetnode` and `server/.fleetnode/plugins/` are staged
  • Smoke-enroll against a local fleet-api (see server/cmd/fleetnode/README.md)

🤖 Generated with Claude Code

- Scheme-aware gateway HTTP client: TLS-validating http2.Transport for
  https://, h2c for http://. NewGatewayClient and NewAuthenticatedGateway-
  Client return error so the scheme check has a place to surface.
- Per-RPC handshake timeout (var so tests can shorten it). Register wrapped
  so a blackholed server cannot hang fleetnode enroll while the state lock
  is held.
- State lock: LOCK_EX | LOCK_NB plus owner-PID write under the lock so a
  contending invocation gets an actionable error instead of blocking.
- Shared testutil.NewH2CServer / NewH2CClient helpers; fake_gateway_test
  serves the agent client over h2c to match production behavior.
- just build-fleetnode recipe stages server/.fleetnode/{fleetnode, nmap,
  plugins/}; .gitignore covers the artifact dir.
- README skeleton: install/enroll surface only; run + discovery sections
  land in later stack PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 19:22
@ankitgoswami ankitgoswami requested a review from a team as a code owner May 27, 2026 19:22
@github-actions github-actions Bot added documentation Improvements or additions to documentation automation server labels May 27, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab1b22c09b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/fleetnodebootstrap/client.go Outdated
Comment thread server/internal/fleetnodebootstrap/client.go
@github-actions

github-actions Bot commented May 27, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (ba6177e23c96a4b503012f64a44fd2b1a7a59a03...83ff5297ddbac311a333307f959aaeb81092ebeb, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] Heartbeat RPCs Can Hang Indefinitely After Client Timeout Removal

  • Category: Reliability
  • Location: server/internal/fleetnodebootstrap/client.go:52
  • Description: The new HTTP client no longer sets the previous 30s http.Client.Timeout. The PR adds per-step timeouts for register/handshake calls, but daemon heartbeats still call UploadHeartbeat with the long-lived daemon context and no deadline.
  • Impact: A fleet server or network path that accepts a connection but stalls can wedge fleetnode run inside one heartbeat indefinitely. That stops future heartbeats, retries, and session refreshes until the process is killed.
  • Recommendation: Add a per-RPC timeout around heartbeat sends, for example in sendHeartbeat, or use separate bounded unary clients and unbounded streaming clients if future control streams need long-lived requests. Add a blackhole heartbeat test matching the new register timeout coverage.

Notes

No high-impact auth, SQLi, command injection, protobuf wire-format, cryptostealing/pool-hijack, or plugin trust-boundary issues were evident in the reviewed diff. Tests were not run; this was a diff-only review against .git/codex-review.diff.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR (stack 1/3) hardens the fleetnode enrollment/handshake transport and improves operational ergonomics around testing, state locking, and local build staging so the agent can be built + enrolled safely ahead of later “run/discovery” work.

Changes:

  • Make gateway clients scheme-aware (TLS-validating HTTP/2 for https://, h2c for http://) and plumb scheme/URL errors via NewGatewayClient(...)(..., error) / NewAuthenticatedGatewayClient(...)(..., error).
  • Add per-RPC handshake timeouts and improve state-lock contention diagnostics (fast-fail + PID reporting).
  • Add shared h2c test utilities; update tests and just build-fleetnode to stage artifacts into server/.fleetnode/ (and ignore that directory).

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
server/internal/testutil/h2c.go Adds shared h2c server/client helpers for HTTP/2 cleartext testing.
server/internal/fleetnodebootstrap/refresh.go Propagates NewGatewayClient errors and uses the new client factory shape.
server/internal/fleetnodebootstrap/locking_unix.go Makes state lock non-blocking and records/reads holder PID for actionable contention errors.
server/internal/fleetnodebootstrap/handshake.go Introduces per-step handshake timeouts and a test override seam.
server/internal/fleetnodebootstrap/handshake_test.go Switches tests to h2c server helper; adapts to new NewGatewayClient signature.
server/internal/fleetnodebootstrap/enrollment.go Adds per-call timeout for Register; adapts to new gateway client constructors.
server/internal/fleetnodebootstrap/enrollment_test.go Adds a blackhole timeout regression test for Register.
server/internal/fleetnodebootstrap/client.go Implements scheme-aware HTTP/2 transport selection and returns errors for invalid schemes; removes global client timeout.
server/internal/fleetnodebootstrap/client_test.go Adds TLS validation regression coverage; updates redirect test to h2c and new constructor signature.
server/internal/fleetnodebootstrap/authclient.go Returns (client, error) and uses the new scheme-aware HTTP client builder.
server/internal/fleetnodebootstrap/authclient_test.go Updates to h2c server helper and new authenticated client signature.
server/cmd/fleetnode/run.go Updates client factory signature to return errors and handles construction failures.
server/cmd/fleetnode/run_test.go Updates test stubs for the new clientFactory signature.
server/cmd/fleetnode/README.md Adds initial CLI documentation, state/lock semantics, build and security model notes.
server/cmd/fleetnode/fake_gateway_test.go Switches fake gateway tests to serve over h2c to match production wire behavior.
justfile Stages fleetnode + plugins + nmap symlink into server/.fleetnode/; makes asicrs build output directory configurable.
.gitignore Ignores server/.fleetnode/ staging output.

Comment thread server/internal/fleetnodebootstrap/enrollment_test.go
Comment thread server/cmd/fleetnode/run.go
Comment thread server/cmd/fleetnode/README.md Outdated
- Replace bare http2.Transport with net/http Transport on the https branch
  so ALPN negotiates H2 when available and falls back to HTTP/1.1.
  Packaged nginx terminates TLS and proxies upstream over HTTP/1.1, which
  the H2-only transport refused (Codex P1).
- Stop calling t.Cleanup from the blackhole accept-loop goroutine; collect
  conns under a mutex and close them via a single Cleanup registered on
  the test goroutine. testing.T.Cleanup is not goroutine-safe.
- Replace the "fleetnode run lands in the run-heartbeat PR" placeholder
  with a real Subcommands row: run is already registered in main.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83ff5297dd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/fleetnodebootstrap/client.go
@ankitgoswami

Copy link
Copy Markdown
Contributor Author

Codex Security Review response

Heartbeat RPCs can hang forever after removing client timeout (MEDIUM) — Per-RPC timeout for the daemon RPCs lands in PR #323: sendHeartbeat wraps each UploadHeartbeat in a 30s context.WithTimeout. PR #322 has no daemon caller yet, so the structural gap exists but is unreachable from this PR's surface. Kept the underlying client timeout-free so streaming RPCs (ControlStream in PR #324) aren't capped.

Comment thread server/cmd/fleetnode/README.md
Comment thread server/cmd/fleetnode/README.md
@ankitgoswami ankitgoswami merged commit fbe3efb into main May 27, 2026
78 checks passed
@ankitgoswami ankitgoswami deleted the ankitg/fleetnode-install-enrollment branch May 27, 2026 21:27
@ankitgoswami ankitgoswami restored the ankitg/fleetnode-install-enrollment branch May 27, 2026 21:27
rongxin-liu added a commit that referenced this pull request May 29, 2026
Brings in:
- #325 RBAC swap: List/GetActive/Update curtailment RPCs now gated via
  RequirePermission; new entries in ProcedurePermissions.
- #329 UpdateCurtailmentEvent client wiring + edit-modal UI.
- #321 Curtailment start/restore action UI.
- #322 / #323 fleetnode enrollment + heartbeat stack.

Conflict resolution: server/internal/handlers/middleware/rpc_permissions.go.
Main added CurtailmentService reads + Update + DeviceCollection +
DeviceSet + ErrorQuery + FleetManagement entries; the branch added
IngestCurtailmentSignal. Took the union — folded the Ingest entry
into the CurtailmentService block alongside Update/reads/AdminTerminate.

handler.go auto-merged cleanly; build + targeted tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automation documentation Improvements or additions to documentation server

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants