feat(curtailment): operator read/update/admin APIs + audit + metrics#308
Conversation
Curtailment needs operational metrics — tick duration, tick failures, selector candidate exclusions, maintenance overrides — but the codebase has only OTel tracing today (no Meter, no /metrics, no Prometheus exporter). The pipeline-shape decision is platform-team scope and already in flight via the notifications + Grafana migration; curtailment shouldn't make that decision unilaterally and shouldn't block on it. Define a Metrics interface in the curtailment domain with a no-op default. Service and Reconciler accept it through a functional option so the dozens of existing test call sites (NewService(store), New(cfg, store, cmd)) keep working unchanged. main.go wires NoOpMetrics through both constructors so production has a single named site to swap when the platform observability path lands — interface-stable, one-file change, no curtailment-package churn. Recorder call sites land in follow-up commits. Refs #289
Three of the four Metrics recorders now have call sites:
- ObserveTickDuration fires from safeTick around runTick, capturing
wall-clock per tick on every path (happy, panic-recovered,
list-events-failure).
- IncTickFailure fires from safeTick on tick-infra panic AND from
processEvent on per-event panic. The list-events early-return path
is intentionally NOT counted because the heartbeat still advances
there ("freshness, not query health" — see the comment in runTick).
- IncCandidateExcluded fires from Service.Start (not Preview) after the
selector returns, once per skipped device labeled by reason. Start-
only emission keeps debounced Preview calls from flooding the counter
against a static fleet snapshot.
IncMaintenanceOverride is intentionally deferred. The per-miner
increment needs the selector to surface "this miner was kept because
the maintenance override was honored" — current candidate filtering
just lets the miner fall through without tagging. That instrumentation
lands in a follow-up commit alongside the audit-sweep work where
`curtailment_maintenance_override` activity rows are emitted on the
same code path.
Tests add a goroutine-safe recordingMetrics fake in both the
reconciler and service test files. Three reconciler tests pin
ObserveTickDuration on the happy path, IncTickFailure on tick-infra
panic, and IncTickFailure on per-event panic. One service test pins
IncCandidateExcluded on a phantom-load miner.
Refs #289
Operator-facing event history was previously Unimplemented and the settings-page history table (PR #280) was reading fixtures. This wires the RPC through every layer with a trimmed decision-snapshot policy that keeps response sizes bounded on large fleets. - sqlc: ListCurtailmentEventsForOrg, cursor-paginated by id DESC with an optional state filter. Caller passes limit+1 so the over-fetch detects whether another page remains. - Store: opaque cursor (base64-encoded JSON) so the token shape is free to grow later (sort fields, secondary keys) without breaking older clients. PageSize <=0 maps to a 50-row default; an internal upper cap of 200 mirrors the proto validator as defense in depth. - Service: ListEvents validates org and rejects negative page_size, then forwards to the store. Service-layer guard is needed because cross-tenant exposure is one query away. - Handler: replaces the Unimplemented stub. Session-based org-id resolution, proto enum → service-layer state-filter mapping. - Translate: list-view event proto omits per-target rows (heavy on 10K-miner events × N pages) and trims the per-device `skipped` array to `skipped_aggregate` reason-count map. Top-K selected and the summary fields stay intact so dashboards can render exclusion trend lines. Test fakes in three packages gain ListEvents stubs; the curtailment-package fakeStore gains a working pagination impl mirroring the SQL semantics so service-level tests can assert cursor round-trips. Refs #289
Operator-safe partial update of a non-terminal event. Replaces the Unimplemented stub on the handler. State policy: pending and active accept the patch; restoring and terminal states reject with FailedPrecondition. Operators who need to intervene mid-restore go through AdminTerminateEvent — that's the recovery surface, not Update. The conservative policy keeps the recompute-vs-freeze question (Open #13) out of v1: Update of restore_batch_size persists the new value but does NOT recompute effective_batch_size. The reconciler's restore-claim reads the Start-time stamped value through to the next event. Validation mirrors Start: restore_batch_interval_sec is gated by the non-admin cap (admin sets the session-derived bypass), max_duration must be > 0 and <= 7 days, restore_batch_size >= 0. Misconfigured values surface as InvalidArgument or Forbidden — never as a DB CHECK violation. sqlc UPDATE uses COALESCE on nil params so a partial patch preserves unset columns. The WHERE clause re-asserts state IN ('pending', 'active') as defense in depth: a race where the row advanced between the service's pre-read and the UPDATE surfaces as ErrCurtailmentUpdateStateRaceLoss → FailedPrecondition with a distinct message from the pre-read rejection. Refs #289
Adds the admin-only escape hatch for forcing a non-terminal event to
CANCELLED or FAILED when a normal stop+restore cycle can't run.
The persistence layer wraps the terminal state transition and the
swept-target update in a single transaction via db.WithTransaction so
the event row and its targets stay in sync. An idempotent re-issue
with the same target_state is a no-op; a different terminal state
surfaces ErrCurtailmentAdminTerminateStateConflict, which the service
maps to FailedPrecondition.
Service-layer defense-in-depth checks (target_state in {CANCELLED,
FAILED}, non-empty reason, org/uuid present) mirror the proto
validator so non-Connect callers can't tunnel past it.
Adds a pre-insert lookup so a re-issued Start with the same idempotency_key or (external_source, external_reference) pair returns the original event instead of re-running the selector and tripping the partial unique indexes (which would surface as a less helpful AlreadyExists from the per-org non-terminal constraint). idempotency_key takes precedence over external reference: the operator-supplied retry handle wins over upstream re-delivery. Lookup errors propagate unchanged so a transient db failure is visible rather than silently falling through to a double-insert attempt.
Adds an AuditLogger interface on the curtailment Service with a no-op default so tests that don't care can ignore the wiring. main.go injects *activity.Service via WithAuditLogger. Two override-specific event types ride alongside the base curtailment_started row so a feed of unbounded or force-include starts is a simple event-type filter rather than a metadata scan. IncMaintenanceOverride fires in parallel when force_include_maintenance is set, so the platform metrics dashboard tracks the override rate without joining against activity_log. Audit emission is intentionally absent on the idempotent-replay and insufficient-load paths: the original Start already recorded the trail, and a path that never persisted shouldn't claim it did.
Documents the curtailment_reconciler_heartbeat-based liveness signal: warn at 2 minutes of staleness with active events present, page at 5 minutes regardless. The SQL form is canonical; the vmalert rule is parked behind a placeholder bridge metric so the wiring is one config change away once a postgres-exporter publishes the staleness gauge. Runbook walks four failure modes (panic loop, slow-query contention, events not picked up, restore-loop forever) with operator response steps that lean on AdminTerminateEvent for the cases where infra mitigation isn't enough.
Walks the operator-facing service flow end-to-end against the in-memory fake: Preview (no persistence side-effects) → Start (persistence + audit + metrics) → Stop (RESTORING transition) → AdminTerminate (forced terminal). The reconciler's tick-by-tick state machine is covered piecewise in reconciler_test.go and restore_test.go; this test pins the boundary between the public service API and the persistence layer. Companion tests cover the webhook idempotency-replay path (duplicate Start short-circuits, no double-audit) and the read-path query (ListEvents returns terminal rows filtered by state). A docker-driven HTTP-level e2e for the same lifecycle is a follow-up — the existing server/e2e dir requires postgres + proto-sim and lands when the curtailment plugin path is ready.
Four real lint findings from this branch, fixed without suppressions: - Service.AdminTerminate: replace the two-case switch + default with an if-comparison so exhaustive doesn't demand the unhandled cases be enumerated. The default branch was load-bearing — the if form keeps the same behavior with less surface. - service_list_test.go / handler_list_test.go: hoist the opaque cursor literal into a file-scope const. gosec G101 looks at string literals assigned to fields whose name matches credential keywords (PageToken matches "token"); an identifier reference clears the heuristic cleanly. - service_start_idempotency_test.go: move the subtest store + svc creation inside the t.Run closure so each subtest can call t.Parallel() without sharing mutable counters across cases.
A multi-reviewer code review surfaced four merge-blocking P1s and a batch of P2/P3 hygiene items on this branch. This commit lands the focused, defensible subset that doesn't require contract or security-policy decisions; the remaining items are recorded for the next session. Idempotency / race-recovery: - Recognize uq_curtailment_event_idempotency and uq_curtailment_event_external_ref unique violations as race-loss via typed sentinels; Service.Start re-issues the corresponding replay lookup so the race-loser falls into the same response path as a deliberate retry rather than surfacing Internal with the constraint name leaked in the error string. - AdminTerminateEvent: on zero-row UPDATE caused by a concurrent terminate-to-same-state, re-read inside the transaction and echo the row idempotently (mirrors BeginRestoreTransition's pattern). Audit / observability: - Emit a curtailment_admin_terminated activity row on AdminTerminate so the privileged force-terminate path captures actor + reason in the activity feed (parallels emitStartAuditTrail). - emitStartAuditTrail now maps req.SourceActorType to activitymodels.ActorType so scheduler-triggered starts persist actor_type='scheduler' on activity_log instead of defaulting to 'user'. Update path hardening: - Reject explicit empty-string Reason as InvalidArgument and add a 256-char length cap mirroring Start. The proto-translate comment is updated to describe the actual silent-no-op behavior. Proto contract docs: - Field-level docstrings on ListCurtailmentEvents describing the omitted target_rollup/targets and the trimmed decision_snapshot shape (skipped_aggregate vs raw skipped). - max_len=1024 on ListCurtailmentEventsRequest.page_token so the cursor decode path is bounded. - Annotate the two eventStateFromProto call sites distinguishing the no-filter sentinel role from the target_state mapping role. Cleanup: - Drop duplicate finitePtr generic in handler_start_test.go (the existing ptr generic in handler_test.go covers the use case). - Inline single-call-site valueOrZero generic in service.go.
Codex security review + Copilot inline reviewers surfaced five actionable findings on this branch. All five validated; landing the fixes here. Admin gate on Update.max_duration_seconds (HIGH). Mirrors Start's post-normalization admin check inside Service.Update. Without this, a non-admin who Started at the org default could Update the same event above the default, bypassing the privilege boundary Start enforces. Fetches org config lazily — only when max_duration is in the patch and the caller lacks admin controls. AdminTerminate.reason length cap (MEDIUM). Service-level backstop rejects oversized reasons (>256 chars) so a bulky operator string can't amplify into every swept target's last_error column. The proto field gets the matching max_len=256 rule; proto regen is deferred to a clean tooling pass (the service backstop already catches the case today). List query trims decision_snapshot at the SQL boundary (MEDIUM). ListCurtailmentEventsForOrg now projects explicit columns with (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip list (multi-MB on 10K-miner events) doesn't ride the wire for every list row. Field layout matches CurtailmentEvent exactly so the existing convertEventRow path applies via a single struct conversion. Cursor rejects non-positive IDs (MEDIUM). decodeCurtailmentEventCursor now returns InvalidArgument when the decoded id is <= 0. The store never emits a non-positive id; a user-supplied token that decodes to one would silently rewind to the first page (id=0) or return zero rows (id<0). Audit metadata key naming (MEDIUM). Renamed `force_include` to `force_include_maintenance` on the curtailment_started audit row metadata so the key matches the domain/proto field name. Downstream analytics no longer have to map between abbreviated and full names. Test coverage added for each fix: non-admin max_duration rejection, admin pass-through, oversized reason rejection, cursor non-positive id rejection (zero / negative / missing).
Harden curtailment replay, list, and reconciler behavior so the new read/admin APIs keep their documented contracts under retries and races.
Prevent stale reconciler ticks from dispatching commands after event state changes and require active admin-terminated events to stop first so restore work is queued. Reserve the removed admin terminate idempotency field to protect protobuf compatibility.
…tate race-loss Two related reconciler-write-path issues from the third-iteration review: eventStillDispatchable was called once per target inside dispatchOneCurtail and dispatchRestoreBatch. For a 100-target event that's 100 redundant GetEventByUUID lookups before any Curtail dispatch fires — the event row doesn't change between iterations within a tick. Hoist the check to the event level (after target list in dispatchPending, after enforceMaxDuration in observeActive) and drop the per-target invocation. UpdateCurtailmentEventState was sqlc :exec, which discards the rows-affected count. When a concurrent AdminTerminate sweeps the event terminal, the reconciler's maybeMarkActive / maybeCompleteRestoring UPDATE matched 0 rows, the store returned nil, and the reconciler proceeded as if the transition succeeded. Switch the query to :execrows; map 0-rows to a new typed sentinel ErrCurtailmentEventStateRaceLoss; reconciler logs slog.Warn and increments a new IncEventStateRaceLoss metric on the signal. Tick behavior is unchanged beyond the new observability — making the reconciler abort the tick on this signal is a separate design decision tracked in the follow-up queue. Also clarifies the comment on the list-query skipped-trim projection: the win is application-tier bytes (network + JSON decode), not Postgres disk I/O — TOAST detoast still occurs per row.
Mirrors TestHandler_AdminTerminateEvent_StateConflictMapsFailedPrecondition for the ErrCurtailmentAdminTerminateActiveEvent path so the operator-visible "Stop first" semantics have a regression guard at the gRPC code-mapping layer (service-layer mapping was already covered).
…t guidance The AdminTerminateEvent RPC now surfaces two distinct FailedPrecondition variants that share a gRPC code (active-event-must-stop-first; state-conflict). Expand the proto service comment to enumerate both, matching StopCurtailment's existing pattern, and note reason's length bounds. The reconciler runbook's failure-mode #1 mitigation step previously prescribed direct AdminTerminateEvent on a panicking event, which now returns FailedPrecondition for ACTIVE events. Amend to call out the StopCurtailment-first requirement; PENDING and RESTORING events can still be terminated directly. Proto-only doc edit; no .pb.go regen needed.
toEventProtoListItem previously zeroed ExternalSource / ExternalReference / IdempotencyKey inline after toEventProto had set them. Move the scrub into a named helper so the intent (these fields are list-view-omitted) is explicit at the call site and a future fourth scrubbed field doesn't need a comment to explain the pattern. No behavior change.
A page_token issued before the cross-list-binding org_id field landed used to reject with InvalidArgument once the binding-check went live. A long-lived pagination loop crossing the deployment boundary should restart from the first page transparently instead of surfacing an opaque cursor error to the caller. Decode now treats OrgID==0 as the "first page" sentinel and returns a nil cursor (same as an empty token). State_filter mismatch and other non-zero violations still reject. Test renamed to reflect the new contract.
validateUpdateRequest now rejects requests where every patchable optional field (reason, restore_batch_size, restore_batch_interval_sec, max_duration_seconds) is nil. The SQL UPDATE still ran in that case and bumped updated_at via COALESCE, producing a misleading freshness signal for clients that track the column and adding write load with no semantic change. Affected service + handler tests that previously passed empty UpdateRequests to reach state-guard / not-found branches now carry a minimal valid patch (Reason). New TestService_Update_RejectsEmptyPatch pins the new contract.
…lment-read-apis-audit-metrics # Conflicts: # server/generated/sqlc/db.go
A concurrent AdminTerminate landing between target N and target N+1 of the dispatch loop would let remaining Curtail commands fire against an already-terminated event because the previous design checked liveness once per event-tick. Move the eventStillDispatchable call to the start of dispatchOneCurtail and drop the redundant pre-loop hoists in dispatchPending and observeActive. The cost is one GetEventByUUID per target — N reads per tick for an N-target event — which is acceptable; AdminTerminate is rare and the perf optimization was incorrect in claim (the event row CAN change inside a tick) and incorrect in consequence. dispatchRestoreBatch is unaffected: its single bulk Uncurtail is one command per batch, so the existing per-event check already guards it. Adds TestReconciler_SkipsRemainingCurtailDispatchesWhenEventTerminatesMidLoop asserting only the first target dispatches when the event flips between target 1 and target 2.
The previous gate rejected admin-terminate only when event state was ACTIVE. A PENDING event whose reconciler tick had already dispatched some Curtail commands had targets in DISPATCHED/CONFIRMED/DRIFTED, and admin-terminate would proceed — sweeping those targets to RESTORE_FAILED without issuing the compensating Uncurtail commands, leaving the already-curtailed miners stuck. Replace the state==active check with a SQL existence check on non-restored target states. The new check subsumes ACTIVE (which always has CONFIRMED targets via maybeMarkActive) and additionally catches PENDING events with dispatched targets. The ErrCurtailmentAdminTerminateActiveEvent sentinel keeps its name for backward compatibility with handler/test references, but its doc and operator-facing error message now describe the broader condition. Proto RPC comment and the reconciler runbook's mitigation guidance updated to match.
RBAC PR 1 landed migration 000052_create_permission_tables on main and the post-merge commit had two files claiming version 52 — the duplicate trips golang-migrate's version uniqueness check at server boot. The list-index migration only ever lived on this feature branch, so the migration-immutability rule (which gates edits to migrations on main) does not apply. Renaming to the next free slot (000055, after the RBAC track's 000052/000053/000054) is the right resolution. Content unchanged; only the file numbers move.
…NTLY on list index Two BE-5 merge gates landed together. force_include_maintenance is safety-critical: it commands curtailment on miners in active physical maintenance — forcibly power-cycling a miner a technician is servicing is a personnel hazard. The BE-1.x design intended Admin-only; the gate was never wired and any API-key caller could trip it. Wire requireAdminFromContext mirroring the allow_unbounded pattern; add table coverage in TestHandler_OverrideFieldsRoleGate. Migration 000055's index now builds with CONCURRENTLY. The earlier "annotation needed" framing turned out to be wrong — golang-migrate v4's postgres driver runs ExecContext directly without wrapping the migration body in a transaction, so CONCURRENTLY works without any no-transaction annotation. Cheap fix; a hard merge gate at high-row-count deploys.
…l / Uncurtail Closes the AdminTerminate residual race named in Known Limitation #10. Before this change, a tick that read PENDING targets, called cmd.Curtail, then lost a foot race to a concurrent AdminTerminate left miners curtailed with no compensating Uncurtail — the sweep flipped the just-dispatched-to targets to RESTORE_FAILED while the command landed against a dead event. Blast radius scaled with event size: thousands of stranded curtailments on a 5K-miner mid-dispatch event. The fix is a two-phase write. dispatchOneCurtail and dispatchRestoreBatch now stamp DISPATCHING on each target before issuing the command, and transition to DISPATCHED only after the command returns. The in-flight gate's SQL EXISTS predicate (CurtailmentEventHasInFlightTargets) counts dispatching rows alongside dispatched/confirmed/drifted. A concurrent terminate that races a mid-dispatch tick observes the dispatching row and rejects as Stop-first, so the command cannot fire against a swept event. last_dispatched_at intentionally lands on the DISPATCHED write, not the DISPATCHING pre-write — it records successful enqueue, used by the restore-batch interval gate. Filter-skipped / empty-batch failures roll back via recordDispatchFailure without leaving a misleading timestamp. New proto value CURTAILMENT_TARGET_STATE_DISPATCHING = 8. New Go constant TargetStateDispatching ("dispatching"). Regression coverage in TestReconciler_DispatchingPreWrite_CommitsBeforeCommand via a curtailHook on the fake dispatcher that inspects store state at the moment cmd.Curtail is called.
Start dispatch wrote each target with a separate InsertCurtailmentTarget round-trip inside the event transaction. For a 10K-miner curtailment that's 10K serial parameterized exec calls — easily the dominant cost of a Start RPC even with prepared-statement reuse, and a steady source of transaction holding time that conflicts with the reconciler tick. Switch to a single BulkInsertCurtailmentTargets call that fans the rows out server-side via jsonb_to_recordset. The Go side marshals the per-target inputs into one JSONB array; Postgres expands it into the typed columns (NUMERIC for baseline_power_w, JSONB for the selector rationale, missing keys → NULL). Returns rows affected so the store can pin (inserted == len(targets)) and fail the transaction loudly if jsonb_to_recordset silently drops a row that fails its column cast. Drops the unused single-row InsertCurtailmentTarget query and the rawMessageOrNullable helper that only it consumed. The JSON payload shape is pinned by a unit test so a wrong field tag or nil-handling regression breaks loud at unit-time rather than via a docker-compose integration failure.
Multiple corrections surfaced by the review pass against this PR. Runtime: - SweepCurtailmentTargetsToRestoreFailed wrote to a non-existent `updated_at` column on `curtailment_target`. Every AdminTerminate against a non-terminal event with at least one in-scope target tripped "column does not exist" and rolled back the entire termination, leaving the admin recovery escape hatch unusable. Drop the column from the UPDATE; the other target writes already don't reference it. - recordDispatchFailure now falls back to a new BumpCurtailmentTargetRetry query when the rich state-change UPDATE fails non-race-loss. Persistent state-write failures no longer block retry-budget progress; MaxRetries → RESTORE_FAILED escalation lands on the next successful UpdateTargetState. New restore-path test pins the fallback contract. - Restore the "heartbeat advances on tick freshness, not query health" semantic when ListNonTerminalEvents fails. The SQL staleness alert distinguishes reconciler-dead (no upsert) from DB-read-degraded (upsert advances, IncTickFailure rises). Test renamed to match. Contract: - UpdateCurtailmentEventRequest gains buf-validators mirroring Start (reason min/max_len, restore_batch_size lte:10000, restore_batch_interval_sec lte:3600, max_duration_seconds lte:604800). Previously a client could patch a non-terminal event with bounds Start would reject. - AdminTerminate proto docstring documents the RESTORING-event race: the in-flight gate scopes to desired_state=CURTAILED, so a terminate during restore proceeds even with in-flight Uncurtails. Race-window targets persist as RESTORE_FAILED while the device may be restored. - ListCurtailmentEvents proto docstring documents the trigger-metadata scrub on list rows; clients correlating webhook deliveries fetch via GetActiveCurtailment for unscrubbed attribution.
- Migrations 000056/000058 add operator-recovery notes for the CONCURRENTLY-split failure mode (paired CREATE in 000057/000059 fails after the DROP lands: schema_migrations stays dirty with the unique index gone; resolve duplicates, force the version, retry). - Replace dangling `Open #13` references with the recompute-vs-freeze reason inline so future readers don't chase a missing tracker link.
There was a problem hiding this comment.
Pull request overview
This PR completes the remaining v1 curtailment operator/admin RPC surface (list/update/admin-terminate) and adds supporting persistence, audit, metrics, and reliability hardening needed for production observability and safe concurrency.
Changes:
- Added operator RPC implementations for listing event history and updating operator-safe fields, plus an admin terminate RPC (service/handler/store + tests).
- Introduced cursor-based pagination for event history and persistence-layer idempotency lookups (idempotency key + external reference) with migrations to scope uniqueness to non-terminal events.
- Added curtailment audit + metrics interfaces (no-op defaults) and wired them into
fleetd, along with expanded unit and E2E coverage.
Reviewed changes
Copilot reviewed 44 out of 50 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| server/sqlc/queries/curtailment.sql | Adds queries for list pagination, idempotency lookups, admin terminate, bulk target insert, and race-loss guards. |
| server/migrations/000055_add_curtailment_event_list_index.up.sql | Adds (org_id, id DESC) index to support list pagination. |
| server/migrations/000055_add_curtailment_event_list_index.down.sql | Drops the list pagination index. |
| server/migrations/000056_scope_curtailment_idempotency_to_non_terminal.up.sql | Drops prior idempotency unique index (prep for scoped recreate). |
| server/migrations/000056_scope_curtailment_idempotency_to_non_terminal.down.sql | Recreates the unscoped idempotency unique index (rollback). |
| server/migrations/000057_scope_curtailment_idempotency_to_non_terminal.up.sql | Recreates idempotency unique index scoped to non-terminal events. |
| server/migrations/000057_scope_curtailment_idempotency_to_non_terminal.down.sql | Drops scoped idempotency unique index (rollback). |
| server/migrations/000058_scope_curtailment_external_ref_to_non_terminal.up.sql | Drops prior external-ref unique index (prep for scoped recreate). |
| server/migrations/000058_scope_curtailment_external_ref_to_non_terminal.down.sql | Recreates unscoped external-ref unique index (rollback). |
| server/migrations/000059_scope_curtailment_external_ref_to_non_terminal.up.sql | Recreates external-ref unique index scoped to non-terminal events. |
| server/migrations/000059_scope_curtailment_external_ref_to_non_terminal.down.sql | Drops scoped external-ref unique index (rollback). |
| server/internal/handlers/curtailment/translate.go | Adds request/response translation for list/update/admin-terminate and list-view scrubbing. |
| server/internal/handlers/curtailment/handler.go | Implements Update/List/AdminTerminate handlers and extends Start admin gating. |
| server/internal/handlers/curtailment/handler_update_test.go | Handler-level tests for UpdateCurtailmentEvent (auth, validation, role gates, response shape). |
| server/internal/handlers/curtailment/handler_test.go | Updates validation expectations and extends override-field role gating tests. |
| server/internal/handlers/curtailment/handler_stop_test.go | Updates stub store surface for new store interface methods. |
| server/internal/handlers/curtailment/handler_start_test.go | Extends Start handler tests for idempotent replay response rendering. |
| server/internal/handlers/curtailment/handler_list_test.go | Adds handler tests for ListCurtailmentEvents behavior and response scrubbing. |
| server/internal/handlers/curtailment/handler_admin_terminate_test.go | Adds handler tests for AdminTerminateEvent role gate and error mapping. |
| server/internal/domain/stores/sqlstores/curtailment_test.go | Pins JSON payload shape for bulk target insert into Postgres recordset. |
| server/internal/domain/stores/sqlstores/curtailment_cursor.go | Implements cursor codec for event-history pagination. |
| server/internal/domain/stores/sqlstores/curtailment_cursor_test.go | Tests cursor codec round-trip and invalid-token rejection behavior. |
| server/internal/domain/stores/interfaces/curtailment.go | Extends store interface: list/update/admin-terminate/idempotency/race-loss signaling. |
| server/internal/domain/curtailment/service_test.go | Expands fake store + introduces goroutine-safe metrics recorder for assertions. |
| server/internal/domain/curtailment/service_start_test.go | Adds Start guard + metrics emission test coverage (force maintenance, exclusion metrics). |
| server/internal/domain/curtailment/service_start_idempotency_test.go | Adds comprehensive idempotency replay + race-loser behavior tests. |
| server/internal/domain/curtailment/service_start_audit_test.go | Adds audit emission tests and audit-write-failure metric assertions. |
| server/internal/domain/curtailment/service_list_test.go | Adds service-level tests for ListEvents parameter forwarding and validation. |
| server/internal/domain/curtailment/service_lifecycle_test.go | Adds end-to-end lifecycle tests against the in-memory fake (Preview/Start/Stop/AdminTerminate/List). |
| server/internal/domain/curtailment/service_admin_terminate_test.go | Adds service-level validation + error mapping tests for AdminTerminate. |
| server/internal/domain/curtailment/selector.go | Extends Plan to support replay rendering from persisted event + targets. |
| server/internal/domain/curtailment/reconciler/restore_test.go | Expands restore-phase reconciler tests for retry-budget fallback and failure modes. |
| server/internal/domain/curtailment/models/models.go | Adds DISPATCHING target state and clarifies candidate documentation. |
| server/internal/domain/curtailment/metrics.go | Introduces Metrics interface with a default no-op implementation. |
| server/internal/domain/curtailment/audit.go | Introduces AuditLogger interface + activity type constants + no-op logger. |
| server/generated/sqlc/db.go | Auto-generated sqlc query preparation updates for new/renamed queries. |
| server/generated/grpc/curtailment/v1/curtailmentv1connect/curtailment.connect.go | Auto-generated Connect client/handler docs updated for new RPC semantics. |
| server/e2e/curtailment_e2e_test.go | Adds E2E coverage for full curtailment lifecycle and reconciler restart safety. |
| server/cmd/fleetd/main.go | Wires curtailment handler, audit logger, and metrics recorder into the server. |
| proto/curtailment/v1/curtailment.proto | Evolves proto contract: new DISPATCHING state, request bounds, list semantics, admin terminate docs. |
| client/src/protoFleet/features/energy/CurtailmentHistory.tsx | Minor refactor: extracted sort-time helper and Set-based status filtering. |
| .gitignore | Ignores docs/reviews/ tooling artifacts. |
🔐 Codex Security Review
Review SummaryOverall Risk: HIGH Findings[HIGH] Terminal idempotency keys can replay as fresh curtailments
[MEDIUM] Updating
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d6df608f07
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65678d95d4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…issions Codex Security Review flagged these as accessible to any authenticated org member: the handlers only read session.GetInfo and never invoked middleware.RequirePermission. List/Update had no role gate at all; AdminTerminate only had a session-only admin-role gate. Each handler now calls RequirePermission against the appropriate permission key — curtailment:read for List, curtailment:manage for Update + AdminTerminate. AdminTerminate keeps requireAdminFromContext as defense-in-depth so legacy seeds that have not yet had their curtailment:manage assignment audited still fail closed. The test helpers wire EffectivePermissions onto the request context so the gate fires; new denial cases prove a caller without the permission is rejected before any store work.
The PR description promises that pre-org-binding page tokens transparently restart from the first page so an in-flight pagination loop spanning a deployment doesn't surface InvalidArgument. The decoder was rejecting any cursor with org_id <= 0, which contradicted that contract. Split the OrgID check into two cases: explicit negative is still tampering and surfaces InvalidArgument, but zero/missing org_id decodes to a nil cursor so ListEvents falls through to page 1. The cross-tenant guard at the call site is unaffected — a non-nil cursor still must match the caller's org_id.
…atch The previous claim loop accepted both PENDING and DISPATCHING into the same batch up to effective_batch_size. After a reconciler restart with DISPATCHING orphans (interrupted prior tick) coexisting with fresh PENDING, one batch could carry both — doubling the inrush and bypassing the one-batch-per-interval throttle the operator relies on. The claim phase now runs in two passes. If any DISPATCHING orphans exist, the wave redispatches only those (Uncurtail is device-idempotent, so re-sending is safe) and returns; fresh PENDING is held for the next tick. Throttling stays correct because orphan recovery consumes the tick's batch slot on its own.
When recordDispatchFailure's write fails non-race-loss, the BumpTargetRetry fallback advances retry_count without changing state. After enough bumps, a target sat in DISPATCHING (or Drifted) with retry_count past MaxRetries — and observeActive's `continue` arm silently skipped it every subsequent tick, so the row never reached a terminal state. Both arms now route exhausted targets through recordDispatchFailure itself. recordDispatchFailure already escalates to RESTORE_FAILED at the MaxRetries threshold via writeTargetState; if that write fails too, the existing fallback at least keeps the IncTargetWriteFailure counter firing so an operator alert surfaces. The Drifted arm carries the same bug as DISPATCHING (same fallback path, same observeActive skip). Applied symmetrically.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb81b73824
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ail path The Curtail path's DISPATCHING pre-write only logged and returned when writeTargetState failed non-race-loss; retry_count was never bumped. A row-specific persistent write failure (one row hitting a lock or a transient column error while reads succeed) would stall the event indefinitely because the target never advanced toward RESTORE_FAILED and maybeMarkActive could not promote the event. Route the failure through recordDispatchFailure with nonTerminalFailureState as the residual state — matches the restore path's symmetric handler at dispatchRestoreBatch. recordDispatchFailure's BumpTargetRetry fallback keeps the budget advancing even when the rich state write also fails.
Rebase onto main picked up #308 / #311, which added a defense-in-depth requireAdminFromContext call ahead of the curtailment:manage gate. PR 2c's design moves RBAC to the authoritative check, so drop the legacy role gate and reorder so RequirePermission runs before the nil-service check (auth-before-implementation; preserves Unauthenticated on missing session instead of leaking Unimplemented). The TestHandler_AdminTerminateEvent_RejectsNonAdmin case asserted the legacy role gate's defense-in-depth behavior — it's redundant with the renamed TestHandler_AdminTerminateEvent_RejectsCallerWithoutCurtailmentManage which already covers the RBAC denial path. Drop the obsolete test.
Rebase onto main picked up #308 / #311, which added a defense-in-depth requireAdminFromContext call ahead of the curtailment:manage gate. PR 2c's design moves RBAC to the authoritative check, so drop the legacy role gate and reorder so RequirePermission runs before the nil-service check (auth-before-implementation; preserves Unauthenticated on missing session instead of leaking Unimplemented). The TestHandler_AdminTerminateEvent_RejectsNonAdmin case asserted the legacy role gate's defense-in-depth behavior — it's redundant with the renamed TestHandler_AdminTerminateEvent_RejectsCallerWithoutCurtailmentManage which already covers the RBAC denial path. Drop the obsolete test.
Summary
Closes the operator surface and observability scaffolding for v1 curtailment. Lifecycle, dispatch, and restore are already on main (#192 + #232); this PR fills in the missing read/update/admin RPCs, the audit + metrics hooks, and the reconciler race fixes uncovered along the way.
Closes #289.
What's in this PR
1. Operator RPCs
End-to-end (service + handler + audit + tests). All three are gated via
middleware.RequirePermissionagainstcurtailment:read(List) orcurtailment:manage(Update, AdminTerminate); AdminTerminate keeps the session-only admin-role check as defense-in-depth.ListCurtailmentEvents— cursor-paginated history. SQL trims the per-deviceskippedlist into a reason→count aggregate so responses stay bounded on large fleet events. Cursor binds(org_id, state_filter, id); legacy tokens withoutorg_idrestart from page 1 instead of crossing tenants.UpdateCurtailmentEvent— patchesreason, restore batch + interval, max duration. Empty-patch rejection (otherwiseupdated_atwould tick from a no-op); admin gate on raisingmax_duration_seconds; typedFailedPreconditionon the pre-read / write state race.AdminTerminateEvent— forces a non-terminal event toCANCELLED/FAILEDand sweeps every non-terminal target toRESTORE_FAILEDin one transaction. Stop-first gate triggers on any in-flight curtail target. Idempotent same-target echo returns the row and emits a typedcurtailment_admin_terminated_replayrow carrying the caller's actor + reason, so a concurrent race-loser's attribution still lands in the audit feed.2. Audit + metrics scaffolding
Both interfaces default to no-op; the platform implementations swap in at
cmd/fleetd/main.gowithout churn in the curtailment package.AuditLogger— wired toactivity.ServiceviaLogStrictso audit-write failures surface to a counter instead of being silently swallowed. Emitscurtailment_started(plus override-class rows forallow_unboundedandforce_include_maintenance),curtailment_admin_terminated/..._replay, andcurtailment_updated.Metrics— tick duration, tick failures, candidate-exclusion (labeled by reason), maintenance overrides, event-state race-loss, target-write failures, audit-write failures.3. Reconciler hardening
Race + reliability gaps uncovered while wiring the admin surface:
DISPATCHINGtransient state — targets are stamped beforecmd.Curtail/cmd.Uncurtailso a concurrent terminate's in-flight gate observes them during the command window. Interrupted dispatch leaves aDISPATCHINGorphan; the next tick redispatches via the normal pending loop (commands are device-idempotent).DISPATCHINGorphans exist, the restore claim phase redispatches only those and holds freshPENDINGfor the next tick. Prevents a restart race from doubling the inrush and bypassing the one-batch-per-interval throttle.:execrowszero-rows-affected surfaces asErrCurtailmentEventStateRaceLossso the reconciler counter-bumps and skips instead of treating the race-loss as a successful transition.recordDispatchFailure, which bumpsretry_countand falls back toBumpTargetRetrywhen the rich state write itself can't land. ExhaustedDISPATCHINGandDriftedorphans escalate toRESTORE_FAILEDrather than looping in the transient state.4. Webhook idempotency persistence
(org_id, idempotency_key)and(org_id, external_source, external_reference)lookups land at the persistence boundary so the eventual webhook-ingest RPC arrives without a contract change. Redelivery returns the original event (state + targets) rather than synthesizing PENDING; race-losers route through the replay branch instead of leaking Postgres constraint names.5. Migrations
Five files (one statement each — required for
CONCURRENTLYunder golang-migrate):000055—(org_id, id DESC)index for List.000056/000057— re-scope theidempotency_keypartial-unique index to non-terminal events so a completed event's key frees up for a new Start.000058/000059— same scoping for(external_source, external_reference).6. Proto contract evolution
AdminTerminateEventRequest.idempotency_key(tag 4) removed for v1; tag + name reserved so an old client's payload can't alias to a future field.reasongainsmin_len=1, max_len=256;page_tokengainsmax_len=1024;UpdateCurtailmentEventRequestadds bounds mirroring Start so the same value can't tunnel in via Update.CurtailmentTargetState.DISPATCHING(= 8) added as a read-back value.FailedPreconditionvariants for AdminTerminate (codes 1 / 2) and the trimmed response shape for List.7. Frontend touch
CurtailmentHistory.tsx — small cleanup: extract
getSortTime, switch the status filter to aSetlookup. No behavior change. UI wiring for the new RPCs is a follow-up.Out of scope (follow-ups)
AdminTerminate.forcefor the case where Stop itself fails (DB outage, adapter unreachable) — pairs with a runbook decision about abandonment semantics.Test plan
go build ./...and lint clean on the changed scope.DISPATCHINGpre-stamp visibility, orphan recovery, orphan-priority restore claim, exhausted-orphan escalation toRESTORE_FAILED, Curtail-path pre-write retry-budget bump, retry-budget fallback, and heartbeat advancement on list-query failure.RequirePermissiondenial for callers lackingcurtailment:read/curtailment:manage, malformed UUID rejection, andFailedPreconditioncode mapping for both AdminTerminate variants.org_id) restart from page 1, plus malformed / state-filter-mismatch rejection.DISPATCHINGorphan recovery.