fix(fleetnode): drop http.Client.Timeout on authenticated gateway client#234
ankitgoswami wants to merge 1 commit into
The authenticated gateway client (used by the daemon and by future streaming RPCs) was reusing newGatewayHTTPClient() with its 30s http.Client.Timeout. http.Client.Timeout covers the entire request lifecycle including reading the response body, so it would forcibly tear down every long-lived streaming RPC at the 30s mark, making the client unsuitable for the very methods it was named to support (ControlStream, UploadTelemetry, UploadEvents).

Latent in PR #227 (the only RPC the daemon makes today is the unary UploadHeartbeat, which completes well under 30s), but the next PR that wires up any streaming RPC would discover this the hard way. Fixing the foundation now keeps the streaming PR focused on its actual feature.

Changes:
- New newStreamingGatewayHTTPClient() in authclient.go: no top-level Timeout, keeps the no-redirect CheckRedirect. Used by NewAuthenticatedGatewayClient.
- newGatewayHTTPClient() (bootstrap unary path: Register, BeginAuth, CompleteAuth) is unchanged. Those calls are bounded and the 30s circuit-breaker is correct for the interactive enroll/refresh flows.
- The daemon's sendHeartbeat now wraps each call in a per-call context.WithTimeout(heartbeatRequestTimeout). This preserves the "single hung heartbeat doesn't pin the daemon" guarantee that the client-level Timeout used to provide, but applies it only to unary calls; future streaming methods will pass the loop ctx without an artificial deadline.
- Regression test asserts newStreamingGatewayHTTPClient().Timeout is zero and that CheckRedirect still refuses redirects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
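A minimal Go sketch of what the streaming constructor might look like. The PR names `newStreamingGatewayHTTPClient()` but does not show its body, so this assumes redirects are refused by returning an error from `CheckRedirect`; `errRedirectRefused` and `refuseRedirects` are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"net/http"
)

// errRedirectRefused is a hypothetical sentinel error; the real client may
// refuse redirects differently (for example via http.ErrUseLastResponse).
var errRedirectRefused = errors.New("gateway client: redirects are not followed")

// refuseRedirects is a CheckRedirect policy that rejects every redirect.
func refuseRedirects(req *http.Request, via []*http.Request) error {
	return errRedirectRefused
}

// newStreamingGatewayHTTPClient returns a client with no top-level Timeout,
// so a long-lived response body is not cut off by the client itself.
// Callers are expected to bound unary calls with their own context deadline.
func newStreamingGatewayHTTPClient() *http.Client {
	return &http.Client{
		// Timeout is deliberately left at its zero value (no client-level limit).
		CheckRedirect: refuseRedirects,
	}
}
```

The regression test described in the PR would then reduce to two checks: `Timeout` is zero, and calling `CheckRedirect` yields a non-nil error.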
🔐 Codex Security Review
Review Summary

Overall Risk: NONE

Findings

No security, correctness, or reliability findings in the reviewed diff.

Notes

The PR scope is limited to replacing the authenticated fleet-node gateway client’s top-level

I did not run tests due to the read-only sandbox, but the added test coverage matches the intended behavior: no top-level timeout on the streaming-capable authenticated client, redirect rejection preserved, and heartbeat calls bounded by context.

Generated by Codex Security Review
Pull request overview
This PR fixes an important foundation issue for the fleetnode authenticated gateway client: it removes the http.Client.Timeout that would otherwise hard-cut long-lived streaming RPCs after 30 seconds, while preserving redirect hardening and maintaining unary-call safety via per-call context timeouts.
Changes:
- Switch `NewAuthenticatedGatewayClient` to use a new HTTP client constructor intended for streaming (no top-level `http.Client.Timeout`, still rejects redirects).
- Add a regression test to pin `Timeout == 0` and redirect rejection behavior for the streaming client.
- Add a per-heartbeat `context.WithTimeout` so a hung unary heartbeat can’t block the daemon loop indefinitely.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `server/internal/fleetnodebootstrap/authclient.go` | Introduces `newStreamingGatewayHTTPClient()` and uses it for the authenticated gateway client to avoid breaking streaming RPCs. |
| `server/internal/fleetnodebootstrap/authclient_test.go` | Adds a regression test ensuring the streaming HTTP client has no top-level timeout and still rejects redirects. |
| `server/cmd/fleetnode/run.go` | Wraps each `UploadHeartbeat` call with a per-call context timeout to preserve “no hung heartbeat pins daemon” behavior. |
Closing in favor of bundling the streaming-friendly HTTP client into the first PR that actually uses streaming. Commit b0d5e1a stays on the |
Adds the first remote-agent discovery flow under RFC 0001 phase 2: fleet_nodes scan their LAN, report findings, and an operator pairs devices to nodes through a new admin RPC. Bundled server + agent in one PR so the end-to-end smoke is the merge gate.

Protocol
- ReportDiscoveredDevices: unary, batched up to 1024 devices per request, proto-validated.
- Three pairing admin RPCs: PairDeviceToFleetNode, UnpairDevice, ListFleetNodeDevices. Session-only auth.

Server
- Migration 000050 adds discovered_device.discovered_by_fleet_node_id and backfills from existing fleet_node_device pairings so already-paired legacy rows can't be retargeted by another node.
- fleetnodepairing.Service wraps pair/unpair in a transactor; both fleet_node_device and discovered_device.discovered_by_fleet_node_id update atomically. PairDevice transfers attribution to the new owner; UnpairDevice clears it.
- UpsertDiscoveredDeviceFromFleetNode's WHERE blocks claims from a different fleet node and from NULL-attributed rows already paired to a different node. 0 rows on conflict signals rejection without raising an error.
- validateReport restricts ip_address to RFC1918/RFC4193 private ranges; rejects loopback, link-local, multicast, unspecified, and public addresses. URL scheme allowlist: http, https.
- Gateway handler ReportDiscoveredDevices derives fleet_node_id + org_id from the auth Subject (never from the body); admin handler exposes the three pairing RPCs under SessionOnlyProcedures.

Agent (server/cmd/fleetnode)
- New --scan-cidr / --scan-port / --discovery-interval flags. If no --scan-cidr is supplied, discovery no-ops (heartbeat still runs).
- Bounded-concurrency probes (32 in-flight, 3s per probe). CIDR expansion capped at 4096 hosts; IPv6 CIDRs rejected for v1.
- Production SDK drivers (Antminer/ProtoRig/...) leave DeviceIdentifier empty; the agent synthesizes "mac:<addr>" or "serial:<sn>" so reports are stable across rescans without a server-side reconciliation pass.
- Heartbeat and discovery run on separate goroutines. The discovery tick uses tickCtx for both probes and reports, hard-capping the tick at 60s. Heartbeat owns the session lifecycle; discovery delegates refresh to it. sync.Mutex guards SessionToken across the two loops; refreshAndSave does the handshake against a shallow copy so the network call doesn't block discovery token reads.

Notes
- discovered_device.ip_address is the LAN IP as seen by the reporting agent; the cloud never dials it. Per-fleet-node CIDR allowlist is a follow-up before any server-side code reads these IPs for outbound connections.
- http.Client.Timeout fix from PR #234 deliberately not folded in; rolls in with the first streaming RPC (UploadTelemetry or ControlStream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
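The ip_address rule in validateReport can be approximated with an allowlist check in Go. This is only a sketch under assumptions: `validateReportedIP` is a hypothetical helper, not the function in the PR, and it relies on `netip.Addr.IsPrivate`, which covers the RFC 1918 IPv4 ranges and the RFC 4193 `fc00::/7` range. Because it is an allowlist, loopback, link-local, multicast, unspecified, and public addresses are rejected without separate checks.

```go
package main

import (
	"fmt"
	"net/netip"
)

// validateReportedIP (hypothetical name) accepts only RFC 1918 / RFC 4193
// private addresses and rejects everything else, including unparsable input.
func validateReportedIP(s string) error {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return fmt.Errorf("ip_address %q: %w", s, err)
	}
	// Treat IPv4-mapped IPv6 (::ffff:10.0.0.1) as its IPv4 form.
	// Whether the real validator does this is an assumption.
	addr = addr.Unmap()
	if !addr.IsPrivate() {
		return fmt.Errorf("ip_address %q is not in an RFC 1918 or RFC 4193 private range", s)
	}
	return nil
}
```

For example, `validateReportedIP("192.168.1.20")` returns nil, while `"127.0.0.1"`, `"169.254.10.1"`, and `"8.8.8.8"` each return an error.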
Summary
The authenticated gateway client introduced in #227 reuses `newGatewayHTTPClient()`, whose `http.Client.Timeout: 30s` covers the entire request lifecycle (including reading the response body). For long-lived streaming RPCs the response body is the streamed message channel itself, so any `ControlStream` / `UploadTelemetry` / `UploadEvents` opened via this client would be forcibly torn down at the 30-second mark.
Latent today (`fleetnode run` only does the unary `UploadHeartbeat`, which finishes well under 30s), but the next PR that wires up any streaming RPC would have discovered it immediately. Fixing the foundation here keeps that PR focused on its actual feature.
Changes
Test plan
🤖 Generated with Claude Code