feat(fleetnode): long-running daemon (heartbeat-first)#227
Conversation
Adds the `fleetnode run` subcommand: a long-running daemon that holds an authenticated session and posts UploadHeartbeat on a ticker. Closes the smallest possible slice of the Phase 2 "agent runtime" described in RFC 0001 and lays the scaffolding that UploadTelemetry/UploadEvents/ControlStream will reuse. Server side: - New sqlc query UpdateFleetNodeLastSeenAt + Store wrapper. - New fleetnodeenrollment.Service.UpdateLastSeen (0 rows -> NotFound). - Implement UploadHeartbeat on the gateway handler: resolves the Subject from the FleetNodeAuthInterceptor and stamps last_seen_at. - Service + handler tests (integration with a real DB through testutil). Client library: - fleetnodebootstrap.NewAuthenticatedGatewayClient: a connect-rpc client that attaches Authorization: Bearer <token> on every call, re-reading the token from a getter so a daemon that mutates state.SessionToken via Refresh picks up the new value without rebuilding the client. Streaming variants are wired so future ControlStream work needs no client-side surgery. - TestAuthenticatedClient_AttachesBearerHeaderPerCall pins the per-call read contract. CLI: - `fleetnode run [--heartbeat-interval=30s] [--state-dir=...]` - Refuses to start when state is missing or api_key is empty (no state-lock side effect on never-enrolled hosts). - Refreshes the session on entry if it's within 1h of expiry, surfacing ErrBeginAuthRejected with the standard "could be revoked / mismatched / drifted" guidance. - Tick loop: refresh-on-near-expiry, send heartbeat, retry once after refresh on Unauthenticated; transient errors are logged and the daemon continues. - signal.NotifyContext(SIGINT, SIGTERM) for clean shutdown; tests inject a cancellable parent ctx for deterministic shutdown. - All shutdown paths preserve the on-disk session (final SaveState on each refresh; no shutdown-time write). Out of scope (later PRs): UploadTelemetry, UploadEvents, ControlStream, plugin orchestration, fleetd --mode=server. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔐 Codex Security Review
Review SummaryOverall Risk: MEDIUM Findings[MEDIUM] Heartbeats Can Be Driven Into Unbounded DB Write Load
NotesNo auth bypass, SQL injection, command injection, cryptostealing/pool hijack, protobuf wire-format issue, or frontend/infrastructure regression was found in the reviewed diff. I did not run tests; this was a static review of Generated by Codex Security Review | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2191e4e1bc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
Adds the first “agent runtime” slice for FleetNode: a long-running fleetnode run daemon that maintains an authenticated session and periodically uploads heartbeats, with server-side persistence of fleet_node.last_seen_at.
Changes:
- Server: add
UpdateFleetNodeLastSeenAtquery + service/store plumbing, and implementUploadHeartbeathandler that stampslast_seen_at. - Client lib: add an authenticated Connect-RPC gateway client that injects
Authorization: Bearer <token>per call (token read dynamically from a source function). - CLI: add
fleetnode rundaemon command with refresh-on-near-expiry and retry-on-Unauthenticated behavior, plus supporting fakes/tests.
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| server/sqlc/queries/fleetnode_enrollment.sql | Adds sqlc query to update fleet_node.last_seen_at. |
| server/internal/domain/stores/sqlstores/fleetnodeenrollment.go | Store wrapper for UpdateFleetNodeLastSeenAt. |
| server/internal/domain/fleetnodeenrollment/service.go | Service method UpdateLastSeen with NotFound on 0 rows. |
| server/internal/handlers/fleetnodegateway/handler.go | Adds UploadHeartbeat RPC handler. |
| server/internal/handlers/fleetnodegateway/handler_heartbeat_test.go | Integration tests for heartbeat handler behavior. |
| server/internal/fleetnodebootstrap/authclient.go | Adds authenticated gateway client + bearer interceptor. |
| server/internal/fleetnodebootstrap/authclient_test.go | Tests per-call bearer header + empty-token behavior. |
| server/internal/domain/fleetnodeenrollment/integration_test.go | Integration tests for updating last_seen_at. |
| server/cmd/fleetnode/main.go | Wires new run subcommand into CLI. |
| server/cmd/fleetnode/run.go | Implements daemon loop: refresh, heartbeat tick, retry-on-Unauthenticated, shutdown handling. |
| server/cmd/fleetnode/run_test.go | Tests daemon behavior (ticks, refresh flows, missing-state/api_key, etc.). |
| server/cmd/fleetnode/fake_gateway_test.go | Extends fake gateway to record/validate heartbeats. |
| server/generated/sqlc/fleetnode_enrollment.sql.go | Generated sqlc output for new query (auto-generated). |
| server/generated/sqlc/db.go | Generated sqlc statement preparation/close wiring (auto-generated). |
Comments suppressed due to low confidence (5)
server/cmd/fleetnode/run_test.go:121
ed25519.GenerateKey(nil)can panic due to a nil randomness source; passrand.Readerinstead (and importcrypto/rand).
// Arrange
dir := t.TempDir()
pub, _, err := ed25519.GenerateKey(nil)
require.NoError(t, err)
server/cmd/fleetnode/run_test.go:132
ed25519.GenerateKey(nil)may panic; useed25519.GenerateKey(rand.Reader)to avoid nil reader usage.
srv := newFakeServer(t, fake)
pubKey, priv, err := ed25519.GenerateKey(nil)
require.NoError(t, err)
server/cmd/fleetnode/run_test.go:176
ed25519.GenerateKey(nil)may panic becausenilis not a valid entropy source. Switch torand.Reader.
// Arrange
dir := t.TempDir()
pubKey, priv, err := ed25519.GenerateKey(nil)
require.NoError(t, err)
server/cmd/fleetnode/run_test.go:250
ed25519.GenerateKey(nil)is unsafe (nil reader) and can panic. Useed25519.GenerateKey(rand.Reader)here too.
// Arrange
dir := t.TempDir()
pubKey, priv, err := ed25519.GenerateKey(nil)
require.NoError(t, err)
server/cmd/fleetnode/run_test.go:275
ed25519.GenerateKey(nil)can panic; passrand.Readerinstead so this test doesn’t crash during key generation.
// Arrange
dir := t.TempDir()
pubKey, priv, err := ed25519.GenerateKey(nil)
require.NoError(t, err)
Five fixes from Codex + Copilot reviews: 1. run.go: ValidateServerURL on every entry, not just the refresh path. A tampered or stale state with a fresh session_token would have let bearer heartbeats hit a plaintext non-loopback URL. Added TestRunCmd_ValidatesServerURLBeforeBuildingClient. (Codex MEDIUM; Copilot inline.) 2. run.go: Move signal.NotifyContext above the initial refresh so SIGINT can interrupt a startup refresh against a slow/unreachable gateway. Previously the initial refreshAndSave used context.Background() and could hang up to the 30s HTTP timeout before shutting down. (Codex inline #1, Copilot #3.) 3. run.go: Bump the "heartbeat sent" log line from Debug to Info. The default slog handler is Info, so the daemon was silent under normal operation; a stuck daemon would be indistinguishable from a working one. (Copilot #4.) 4. authclient.go: Streaming wrapper now matches unary on empty token. Previously WrapStreamingClient silently omitted the Authorization header when tokenSource returned empty, letting an unauthenticated stream open and fail later in harder-to-debug ways. Now returns a failingStreamingClientConn whose Send/Receive surface Unauthenticated immediately, matching the unary path's fail-fast contract. (Copilot #5; affects future ControlStream work, not currently exercised.) 5. run_test.go: ed25519.GenerateKey(nil) -> rand.Reader explicitly. The stdlib actually handles nil by falling back to crypto/rand, so this isn't a real panic risk, but explicit is clearer and matches the integration tests' style. (Copilot #2.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Replying to the two Codex MEDIUM findings: MEDIUM #1 — Authenticated daemon skips server URL validation for fresh sessions Fixed in 9c583a9. MEDIUM #2 — Heartbeat endpoint permits unbounded database writes Acknowledged and deferred. The DB op behind UploadHeartbeat is a bounded single-row Per-session rate limiting is a reasonable hardening but isn't blocking this PR — the broader story for telemetry/events/control-stream will introduce similar concerns at higher volume, and the rate-limit policy is better designed once those land. Will file as a follow-up. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c583a91ac
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…fast
Three follow-ups on round-3 review:
- run.go: tick() now returns an error; the loop exits on permanent
conditions instead of looping forever. Two cases trigger exit:
- Tick refresh fails with ErrBeginAuthRejected (revoked api_key,
identity mismatch, expired challenge, clock drift). Daemon
can't recover; operator must re-enroll. (flesher #12.)
- Heartbeat returns CodeNotFound (fleet_node soft-deleted
server-side). Same exit-and-re-enroll guidance. (flesher #13.)
Transient errors (network, server down) still log + retry on the
next tick. Two new tests: TestRunCmd_ExitsOnCodeNotFoundHeartbeat
and TestRunCmd_ExitsWhenTickRefreshHitsBeginAuthRejected.
- authclient.go: failingStreamingClientConn no longer embeds a real
conn. Codex flagged that CloseRequest() was delegated to the
underlying conn, so CloseAndReceive() would open an
unauthenticated stream before failing. The stub now implements
every StreamingClientConn method itself (no underlying conn is
constructed when the token is empty), so a tokenless stream open
never touches the network. (Codex P2 #11.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Round-4 update on the two Codex re-run findings: MEDIUM — Heartbeats can drive unbounded database writes — same posture as the previous reply: bounded single-row UPDATE per fleet node, gated by session auth. Rate-limiting is a real follow-up but designing it alongside telemetry/events/control-stream is cleaner than landing a per-RPC limiter for just heartbeats. Not blocking this PR. LOW — last_seen_at can move backward under clock skew — accepted. The handler stamps 219d70d addresses the three actionable inline threads (exit on revoked api_key, exit on fleet_node deleted, stream fail-fast hardening). Build/lint/68 server test packages all green. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 219d70d2f6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Summary
Adds the `fleetnode run` subcommand: a long-running daemon that holds an authenticated session and posts `UploadHeartbeat` on a ticker. This is the smallest viable slice of the "agent runtime" described in RFC 0001 Phase 2, and lays the scaffolding that `UploadTelemetry` / `UploadEvents` / `ControlStream` will reuse without further plumbing.
What's in this PR
Server side
Client library (`fleetnodebootstrap`)
CLI
\\text
fleetnode run [--heartbeat-interval=30s]
[--state-dir=]
\\
Test coverage
Out of scope (explicitly)
Test plan
Smoke
🤖 Generated with Claude Code