Skip to content

feat(fleetnode): long-running daemon (heartbeat-first)#227

Merged
ankitgoswami merged 3 commits into
mainfrom
ankitg/fleetnode-run
May 14, 2026
Merged

feat(fleetnode): long-running daemon (heartbeat-first)#227
ankitgoswami merged 3 commits into
mainfrom
ankitg/fleetnode-run

Conversation

@ankitgoswami

@ankitgoswami ankitgoswami commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds the `fleetnode run` subcommand: a long-running daemon that holds an authenticated session and posts `UploadHeartbeat` on a ticker. This is the smallest viable slice of the "agent runtime" described in RFC 0001 Phase 2, and lays the scaffolding that `UploadTelemetry` / `UploadEvents` / `ControlStream` will reuse without further plumbing.

heart beating

What's in this PR

Server side

  • New `UpdateFleetNodeLastSeenAt` sqlc query + Store wrapper.
  • `fleetnodeenrollment.Service.UpdateLastSeen` (0 rows → NotFound).
  • `Handler.UploadHeartbeat` resolves the Subject from `FleetNodeAuthInterceptor` and stamps `last_seen_at`.

Client library (`fleetnodebootstrap`)

  • New `NewAuthenticatedGatewayClient(serverURL, tokenSource)` — a connect-rpc client that attaches `Authorization: Bearer ` on every call, re-reading the token from `tokenSource` so a daemon that rotates its own `state.SessionToken` via `Refresh` picks up the new value without rebuilding the client. Streaming variants wired so future `ControlStream` work needs no client-side surgery.

CLI
\\text
fleetnode run [--heartbeat-interval=30s]
[--state-dir=]
\\

  • Refuses to start when state is missing or `api_key` is empty (no state-lock side effect on never-enrolled hosts).
  • Refreshes the session on entry if it's within 1h of expiry, surfacing `ErrBeginAuthRejected` with the standard "could be revoked / mismatched / drifted" guidance.
  • Tick loop: refresh-on-near-expiry, send heartbeat, retry once after refresh on `Unauthenticated`; transient errors are logged and the daemon keeps running.
  • `signal.NotifyContext(SIGINT, SIGTERM)` for clean shutdown; tests inject a cancellable parent ctx for deterministic shutdown.

Test coverage

  • `TestUpdateLastSeen*` (integration) — service + SQL.
  • `TestUploadHeartbeat_*` (handler integration) — happy path advances `last_seen_at`, missing Subject is rejected, unknown fleet_node returns NotFound.
  • `TestAuthenticatedClient_*` — bearer header is attached per-call and re-read after rotation; empty token returns Unauthenticated.
  • `TestRunCmd_*` — happy path 3 ticks, near-expiry refresh, Unauthenticated triggers refresh-and-retry, missing-state and missing-api_key bailouts (no state-lock side effect), initial-refresh `ErrBeginAuthRejected` propagation.

Out of scope (explicitly)

  • `UploadTelemetry` / `UploadEvents` / `ControlStream` (next PRs; same daemon scaffolding).
  • Plugin orchestration.
  • `fleetd --mode=server` wiring.
  • Per-org credential encryption, per-agent JWT signing, device-scoped authz.

Test plan

  • `PATH=./bin:$PATH go build ./server/...`
  • `PATH=./bin:$PATH go vet ./server/...`
  • `cd server && PATH=../bin:$PATH golangci-lint run -c .golangci.yaml ./...` — 0 issues
  • `DB_PASSWORD=fleet PATH=./bin:$PATH go test ./server/... -count=1` — 64 packages OK
  • CI runs the full suite
  • Manual smoke (below)

Smoke

  1. `just dev` brings up the gateway on `:4000`.
  2. Mint an enrollment code (admin cookie + curl, as in PR feat(fleetnode): operator CLI for enroll + refresh + status #221 smoke).
  3. `just build-fleetnode && ./server/fleetnode enroll --server-url=http://localhost:4000 --allow-insecure-transport`.
  4. `./server/fleetnode run --heartbeat-interval=2s` in one terminal; watch heartbeat lines.
  5. In another terminal, hit `ListFleetNodes` admin RPC and observe `last_seen_at` advancing.
  6. SIGINT (Ctrl-C) the daemon; assert it exits cleanly within a second.

🤖 Generated with Claude Code

Adds the `fleetnode run` subcommand: a long-running daemon that
holds an authenticated session and posts UploadHeartbeat on a
ticker. Closes the smallest possible slice of the Phase 2 "agent
runtime" described in RFC 0001 and lays the scaffolding that
UploadTelemetry/UploadEvents/ControlStream will reuse.

Server side:
- New sqlc query UpdateFleetNodeLastSeenAt + Store wrapper.
- New fleetnodeenrollment.Service.UpdateLastSeen (0 rows -> NotFound).
- Implement UploadHeartbeat on the gateway handler: resolves the
  Subject from the FleetNodeAuthInterceptor and stamps last_seen_at.
- Service + handler tests (integration with a real DB through
  testutil).

Client library:
- fleetnodebootstrap.NewAuthenticatedGatewayClient: a connect-rpc
  client that attaches Authorization: Bearer <token> on every call,
  re-reading the token from a getter so a daemon that mutates
  state.SessionToken via Refresh picks up the new value without
  rebuilding the client. Streaming variants are wired so future
  ControlStream work needs no client-side surgery.
- TestAuthenticatedClient_AttachesBearerHeaderPerCall pins the
  per-call read contract.

CLI:
- `fleetnode run [--heartbeat-interval=30s] [--state-dir=...]`
- Refuses to start when state is missing or api_key is empty (no
  state-lock side effect on never-enrolled hosts).
- Refreshes the session on entry if it's within 1h of expiry,
  surfacing ErrBeginAuthRejected with the standard
  "could be revoked / mismatched / drifted" guidance.
- Tick loop: refresh-on-near-expiry, send heartbeat, retry once
  after refresh on Unauthenticated; transient errors are logged
  and the daemon continues.
- signal.NotifyContext(SIGINT, SIGTERM) for clean shutdown; tests
  inject a cancellable parent ctx for deterministic shutdown.
- All shutdown paths preserve the on-disk session (final SaveState
  on each refresh; no shutdown-time write).

Out of scope (later PRs): UploadTelemetry, UploadEvents,
ControlStream, plugin orchestration, fleetd --mode=server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 13, 2026 17:56
@ankitgoswami ankitgoswami requested a review from a team as a code owner May 13, 2026 17:56
@ankitgoswami ankitgoswami marked this pull request as draft May 13, 2026 17:57
@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (0d8d763de9bff8939761b14971011c0aed5ba01e...219d70d2f6ee583a82f76116ca3a12a1efdbef3e, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] Heartbeats Can Be Driven Into Unbounded DB Write Load

  • Category: Reliability
  • Location: server/internal/handlers/fleetnodegateway/handler.go:69
  • Description: UploadHeartbeat performs a last_seen_at database update for every authenticated heartbeat. The new daemon also accepts any positive --heartbeat-interval, so a misconfigured or compromised fleet node can send a tight loop of authenticated heartbeats and force continuous writes.
  • Impact: A single valid fleet-node session can create disproportionate DB write load; at fleet scale this can saturate Postgres and degrade unrelated server operations.
  • Recommendation: Add server-side debounce/rate limiting per fleet node. For example, only update last_seen_at if it is null or older than a minimum interval, and treat “too soon” as success rather than NotFound. Also clamp the daemon heartbeat interval to a sane minimum.

Notes

No auth bypass, SQL injection, command injection, cryptostealing/pool hijack, protobuf wire-format issue, or frontend/infrastructure regression was found in the reviewed diff. I did not run tests; this was a static review of .git/codex-review.diff.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2191e4e1bc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/cmd/fleetnode/run.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the first “agent runtime” slice for FleetNode: a long-running fleetnode run daemon that maintains an authenticated session and periodically uploads heartbeats, with server-side persistence of fleet_node.last_seen_at.

Changes:

  • Server: add UpdateFleetNodeLastSeenAt query + service/store plumbing, and implement UploadHeartbeat handler that stamps last_seen_at.
  • Client lib: add an authenticated Connect-RPC gateway client that injects Authorization: Bearer <token> per call (token read dynamically from a source function).
  • CLI: add fleetnode run daemon command with refresh-on-near-expiry and retry-on-Unauthenticated behavior, plus supporting fakes/tests.

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
server/sqlc/queries/fleetnode_enrollment.sql Adds sqlc query to update fleet_node.last_seen_at.
server/internal/domain/stores/sqlstores/fleetnodeenrollment.go Store wrapper for UpdateFleetNodeLastSeenAt.
server/internal/domain/fleetnodeenrollment/service.go Service method UpdateLastSeen with NotFound on 0 rows.
server/internal/handlers/fleetnodegateway/handler.go Adds UploadHeartbeat RPC handler.
server/internal/handlers/fleetnodegateway/handler_heartbeat_test.go Integration tests for heartbeat handler behavior.
server/internal/fleetnodebootstrap/authclient.go Adds authenticated gateway client + bearer interceptor.
server/internal/fleetnodebootstrap/authclient_test.go Tests per-call bearer header + empty-token behavior.
server/internal/domain/fleetnodeenrollment/integration_test.go Integration tests for updating last_seen_at.
server/cmd/fleetnode/main.go Wires new run subcommand into CLI.
server/cmd/fleetnode/run.go Implements daemon loop: refresh, heartbeat tick, retry-on-Unauthenticated, shutdown handling.
server/cmd/fleetnode/run_test.go Tests daemon behavior (ticks, refresh flows, missing-state/api_key, etc.).
server/cmd/fleetnode/fake_gateway_test.go Extends fake gateway to record/validate heartbeats.
server/generated/sqlc/fleetnode_enrollment.sql.go Generated sqlc output for new query (auto-generated).
server/generated/sqlc/db.go Generated sqlc statement preparation/close wiring (auto-generated).
Comments suppressed due to low confidence (5)

server/cmd/fleetnode/run_test.go:121

  • ed25519.GenerateKey(nil) can panic due to a nil randomness source; pass rand.Reader instead (and import crypto/rand).
	// Arrange
	dir := t.TempDir()
	pub, _, err := ed25519.GenerateKey(nil)
	require.NoError(t, err)

server/cmd/fleetnode/run_test.go:132

  • ed25519.GenerateKey(nil) may panic; use ed25519.GenerateKey(rand.Reader) to avoid nil reader usage.
	srv := newFakeServer(t, fake)

	pubKey, priv, err := ed25519.GenerateKey(nil)
	require.NoError(t, err)

server/cmd/fleetnode/run_test.go:176

  • ed25519.GenerateKey(nil) may panic because nil is not a valid entropy source. Switch to rand.Reader.
	// Arrange
	dir := t.TempDir()
	pubKey, priv, err := ed25519.GenerateKey(nil)
	require.NoError(t, err)

server/cmd/fleetnode/run_test.go:250

  • ed25519.GenerateKey(nil) is unsafe (nil reader) and can panic. Use ed25519.GenerateKey(rand.Reader) here too.
	// Arrange
	dir := t.TempDir()
	pubKey, priv, err := ed25519.GenerateKey(nil)
	require.NoError(t, err)

server/cmd/fleetnode/run_test.go:275

  • ed25519.GenerateKey(nil) can panic; pass rand.Reader instead so this test doesn’t crash during key generation.
	// Arrange
	dir := t.TempDir()
	pubKey, priv, err := ed25519.GenerateKey(nil)
	require.NoError(t, err)

Comment thread server/cmd/fleetnode/run_test.go
Comment thread server/cmd/fleetnode/run.go
Comment thread server/cmd/fleetnode/run.go Outdated
Comment thread server/internal/fleetnodebootstrap/authclient.go
Five fixes from Codex + Copilot reviews:

1. run.go: ValidateServerURL on every entry, not just the refresh
   path. A tampered or stale state with a fresh session_token would
   have let bearer heartbeats hit a plaintext non-loopback URL.
   Added TestRunCmd_ValidatesServerURLBeforeBuildingClient. (Codex
   MEDIUM; Copilot inline.)

2. run.go: Move signal.NotifyContext above the initial refresh so
   SIGINT can interrupt a startup refresh against a slow/unreachable
   gateway. Previously the initial refreshAndSave used
   context.Background() and could hang up to the 30s HTTP timeout
   before shutting down. (Codex inline #1, Copilot #3.)

3. run.go: Bump the "heartbeat sent" log line from Debug to Info.
   The default slog handler is Info, so the daemon was silent under
   normal operation; a stuck daemon would be indistinguishable from
   a working one. (Copilot #4.)

4. authclient.go: Streaming wrapper now matches unary on empty
   token. Previously WrapStreamingClient silently omitted the
   Authorization header when tokenSource returned empty, letting an
   unauthenticated stream open and fail later in harder-to-debug
   ways. Now returns a failingStreamingClientConn whose Send/Receive
   surface Unauthenticated immediately, matching the unary path's
   fail-fast contract. (Copilot #5; affects future ControlStream
   work, not currently exercised.)

5. run_test.go: ed25519.GenerateKey(nil) -> rand.Reader explicitly.
   The stdlib actually handles nil by falling back to crypto/rand,
   so this isn't a real panic risk, but explicit is clearer and
   matches the integration tests' style. (Copilot #2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami

Copy link
Copy Markdown
Contributor Author

Replying to the two Codex MEDIUM findings:

MEDIUM #1 — Authenticated daemon skips server URL validation for fresh sessions

Fixed in 9c583a9. runLocked now calls ValidateServerURL unconditionally on every entry, before building the authenticated client. A tampered or stale state with a fresh session_token and a downgraded http:// non-loopback server_url will now be rejected at startup, before any heartbeat could leak the bearer token over plaintext. Added TestRunCmd_ValidatesServerURLBeforeBuildingClient that pins this with a stub gateway client and asserts zero heartbeats are attempted.

MEDIUM #2 — Heartbeat endpoint permits unbounded database writes

Acknowledged and deferred. The DB op behind UploadHeartbeat is a bounded single-row UPDATE last_seen_at WHERE id = $1 AND org_id = $2 against an indexed table; the auth interceptor's session lookup also caps how cheaply a request can reach the handler. An authenticated fleet node spamming heartbeats only updates its own row — the cost is bounded per-node, not per-org.

Per-session rate limiting is a reasonable hardening but isn't blocking this PR — the broader story for telemetry/events/control-stream will introduce similar concerns at higher volume, and the rate-limit policy is better designed once those land. Will file as a follow-up.

@ankitgoswami ankitgoswami marked this pull request as ready for review May 13, 2026 22:56

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c583a91ac

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/fleetnodebootstrap/authclient.go Outdated
Comment thread server/cmd/fleetnode/run.go
Comment thread server/cmd/fleetnode/run.go Outdated
…fast

Three follow-ups on round-3 review:

- run.go: tick() now returns an error; the loop exits on permanent
  conditions instead of looping forever. Two cases trigger exit:
  - Tick refresh fails with ErrBeginAuthRejected (revoked api_key,
    identity mismatch, expired challenge, clock drift). Daemon
    can't recover; operator must re-enroll. (flesher #12.)
  - Heartbeat returns CodeNotFound (fleet_node soft-deleted
    server-side). Same exit-and-re-enroll guidance. (flesher #13.)
  Transient errors (network, server down) still log + retry on the
  next tick. Two new tests: TestRunCmd_ExitsOnCodeNotFoundHeartbeat
  and TestRunCmd_ExitsWhenTickRefreshHitsBeginAuthRejected.

- authclient.go: failingStreamingClientConn no longer embeds a real
  conn. Codex flagged that CloseRequest() was delegated to the
  underlying conn, so CloseAndReceive() would open an
  unauthenticated stream before failing. The stub now implements
  every StreamingClientConn method itself (no underlying conn is
  constructed when the token is empty), so a tokenless stream open
  never touches the network. (Codex P2 #11.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami

Copy link
Copy Markdown
Contributor Author

Round-4 update on the two Codex re-run findings:

MEDIUM — Heartbeats can drive unbounded database writes — same posture as the previous reply: bounded single-row UPDATE per fleet node, gated by session auth. Rate-limiting is a real follow-up but designing it alongside telemetry/events/control-stream is cleaner than landing a per-RPC limiter for just heartbeats. Not blocking this PR.

LOW — last_seen_at can move backward under clock skew — accepted. The handler stamps time.Now().UTC() on the server side, so all writes for one fleet_node row come from the same fleetd process and the same monotonic clock. Backwards skew only happens if a different fleetd instance handles a later heartbeat AND that instance's wall clock is behind — which would require fleetd replicas with diverged clocks, a deployment shape we don't ship today. Worth documenting if/when we go multi-replica.

219d70d addresses the three actionable inline threads (exit on revoked api_key, exit on fleet_node deleted, stream fail-fast hardening). Build/lint/68 server test packages all green.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 219d70d2f6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/fleetnodebootstrap/authclient.go
@ankitgoswami ankitgoswami merged commit dbb1a4f into main May 14, 2026
66 checks passed
@ankitgoswami ankitgoswami deleted the ankitg/fleetnode-run branch May 14, 2026 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants