Skip to content

fix(fleetnode): drop http.Client.Timeout on authenticated gateway client#234

Closed
ankitgoswami wants to merge 1 commit into
mainfrom
ankitg/fleetnode-streaming-timeout
Closed

fix(fleetnode): drop http.Client.Timeout on authenticated gateway client#234
ankitgoswami wants to merge 1 commit into
mainfrom
ankitg/fleetnode-streaming-timeout

Conversation

@ankitgoswami

@ankitgoswami ankitgoswami commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary

The authenticated gateway client introduced in #227 reuses `newGatewayHTTPClient()` whose `http.Client.Timeout: 30s` covers the entire request lifecycle (including reading the response body). For long-lived streaming RPCs that's the streamed message channel itself — so any `ControlStream` / `UploadTelemetry` / `UploadEvents` open via this client would be forcibly torn down at the 30-second mark.

Latent today (`fleetnode run` only does the unary `UploadHeartbeat`, which finishes well under 30s), but the next PR that wires up any streaming RPC would have discovered it immediately. Fixing the foundation here keeps that PR focused on its actual feature.

Changes

  • New `newStreamingGatewayHTTPClient()` in `authclient.go`: no top-level `Timeout`; keeps the no-redirect `CheckRedirect`. Used by `NewAuthenticatedGatewayClient`.
  • Unchanged: `newGatewayHTTPClient()` (used by bootstrap's unary `Register` / `BeginAuthHandshake` / `CompleteAuthHandshake`). Those calls are bounded and the 30s circuit-breaker is correct for the interactive enroll/refresh flows.
  • Daemon: `sendHeartbeat` now wraps each call in `context.WithTimeout(heartbeatRequestTimeout)`. Preserves the "single hung heartbeat doesn't pin the daemon" guarantee that the client-level `Timeout` used to provide, but applies it only to unary calls; future streaming methods will pass the loop ctx without an artificial deadline.
  • Regression test pins that `newStreamingGatewayHTTPClient().Timeout == 0` and that `CheckRedirect` still refuses redirects.

Test plan

  • `PATH=./bin:$PATH go build ./server/...`
  • `PATH=./bin:$PATH go vet ./server/...`
  • `cd server && PATH=../bin:$PATH golangci-lint run -c .golangci.yaml ./...` — 0 issues
  • `DB_PASSWORD=fleet PATH=./bin:$PATH go test ./server/... -count=1` — 68 packages OK
  • CI runs the full suite

🤖 Generated with Claude Code

The authenticated gateway client (used by the daemon and by future
streaming RPCs) was reusing newGatewayHTTPClient() with its 30s
http.Client.Timeout. http.Client.Timeout covers the entire request
lifecycle including reading the response body, so it would forcibly
tear down every long-lived streaming RPC at the 30s mark — making
the client unsuitable for the very methods it was named to support
(ControlStream, UploadTelemetry, UploadEvents).

Latent in PR #227 (the only RPC the daemon makes today is the
unary UploadHeartbeat, which completes well under 30s), but the
next PR that wires up any streaming RPC would discover this the
hard way. Fixing the foundation now keeps the streaming PR focused
on its actual feature.

Changes:

- New newStreamingGatewayHTTPClient() in authclient.go: no top-level
  Timeout, keeps the no-redirect CheckRedirect. Used by
  NewAuthenticatedGatewayClient.
- newGatewayHTTPClient() (bootstrap unary path: Register, BeginAuth,
  CompleteAuth) is unchanged. Those calls are bounded and the 30s
  circuit-breaker is correct for the interactive enroll/refresh
  flows.
- The daemon's sendHeartbeat now wraps each call in a per-call
  context.WithTimeout(heartbeatRequestTimeout). This preserves the
  "single hung heartbeat doesn't pin the daemon" guarantee that the
  client-level Timeout used to provide, but applies it only to
  unary calls; future streaming methods will pass the loop ctx
  without an artificial deadline.
- Regression test asserts newStreamingGatewayHTTPClient().Timeout
  is zero and that CheckRedirect still refuses redirects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 14, 2026 15:15
@ankitgoswami ankitgoswami requested a review from a team as a code owner May 14, 2026 15:15
@github-actions

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (dbb1a4f9534c8cb88a74d1a11db5908f2bca3dd1...b0d5e1a58c5e6cd979e0fdee631d67d781e4fe92, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: NONE

Findings

No security, correctness, or reliability findings in the reviewed diff.

Notes

The PR scope is limited to replacing the authenticated fleet-node gateway client’s top-level http.Client.Timeout with a streaming-safe client, while adding a per-heartbeat context.WithTimeout around UploadHeartbeat. I verified the authenticated client is only used by the fleetnode daemon heartbeat path in this diff, and enrollment/refresh still use the existing bounded NewGatewayClient.

I did not run tests due to the read-only sandbox, but the added test coverage matches the intended behavior: no top-level timeout on the streaming-capable authenticated client, redirect rejection preserved, and heartbeat calls bounded by context.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an important foundation issue for the fleetnode authenticated gateway client: it removes the http.Client.Timeout that would otherwise hard-cut long-lived streaming RPCs after 30 seconds, while preserving redirect hardening and maintaining unary-call safety via per-call context timeouts.

Changes:

  • Switch NewAuthenticatedGatewayClient to use a new HTTP client constructor intended for streaming (no top-level http.Client.Timeout, still rejects redirects).
  • Add a regression test to pin Timeout == 0 and redirect rejection behavior for the streaming client.
  • Add a per-heartbeat context.WithTimeout so a hung unary heartbeat can’t block the daemon loop indefinitely.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
server/internal/fleetnodebootstrap/authclient.go Introduces newStreamingGatewayHTTPClient() and uses it for the authenticated gateway client to avoid breaking streaming RPCs.
server/internal/fleetnodebootstrap/authclient_test.go Adds a regression test ensuring the streaming HTTP client has no top-level timeout and still rejects redirects.
server/cmd/fleetnode/run.go Wraps each UploadHeartbeat call with a per-call context timeout to preserve “no hung heartbeat pins daemon” behavior.

@ankitgoswami

Copy link
Copy Markdown
Contributor Author

Closing in favor of bundling the streaming-friendly HTTP client into the first PR that actually uses streaming. Commit b0d5e1a stays on the ankitg/fleetnode-streaming-timeout branch for cherry-pick when needed.

@ankitgoswami ankitgoswami deleted the ankitg/fleetnode-streaming-timeout branch May 14, 2026 17:40
ankitgoswami added a commit that referenced this pull request May 14, 2026
Adds the first remote-agent discovery flow under RFC 0001 phase 2:
fleet_nodes scan their LAN, report findings, and an operator pairs
devices to nodes through a new admin RPC. Bundled server + agent in
one PR so the end-to-end smoke is the merge gate.

Protocol
- ReportDiscoveredDevices: unary, batched up to 1024 devices per
  request, proto-validated.
- Three pairing admin RPCs: PairDeviceToFleetNode, UnpairDevice,
  ListFleetNodeDevices. Session-only auth.

Server
- Migration 000050 adds discovered_device.discovered_by_fleet_node_id
  and backfills from existing fleet_node_device pairings so already-
  paired legacy rows can't be retargeted by another node.
- fleetnodepairing.Service wraps pair/unpair in a transactor; both
  fleet_node_device and discovered_device.discovered_by_fleet_node_id
  update atomically. PairDevice transfers attribution to the new
  owner; UnpairDevice clears it.
- UpsertDiscoveredDeviceFromFleetNode's WHERE blocks claims from a
  different fleet node and from NULL-attributed rows already paired
  to a different node. 0 rows on conflict signals rejection without
  raising an error.
- validateReport restricts ip_address to RFC1918/RFC4193 private
  ranges; rejects loopback, link-local, multicast, unspecified, and
  public addresses. URL scheme allowlist: http, https.
- Gateway handler ReportDiscoveredDevices derives fleet_node_id +
  org_id from the auth Subject (never from the body); admin handler
  exposes the three pairing RPCs under SessionOnlyProcedures.

Agent (server/cmd/fleetnode)
- New --scan-cidr / --scan-port / --discovery-interval flags. If no
  --scan-cidr is supplied, discovery no-ops (heartbeat still runs).
- Bounded-concurrency probes (32 in-flight, 3s per probe). CIDR
  expansion capped at 4096 hosts; IPv6 CIDRs rejected for v1.
- Production SDK drivers (Antminer/ProtoRig/...) leave
  DeviceIdentifier empty; the agent synthesizes "mac:<addr>" or
  "serial:<sn>" so reports are stable across rescans without a
  server-side reconciliation pass.
- Heartbeat and discovery run on separate goroutines. The discovery
  tick uses tickCtx for both probes and reports, hard-capping the
  tick at 60s. Heartbeat owns the session lifecycle; discovery
  delegates refresh to it. sync.Mutex guards SessionToken across the
  two loops; refreshAndSave does the handshake against a shallow
  copy so the network call doesn't block discovery token reads.

Notes
- discovered_device.ip_address is the LAN IP as seen by the
  reporting agent; the cloud never dials it. Per-fleet-node CIDR
  allowlist is a follow-up before any server-side code reads these
  IPs for outbound connections.
- http.Client.Timeout fix from PR #234 deliberately not folded in;
  rolls in with the first streaming RPC (UploadTelemetry or
  ControlStream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants