feat(curtailment): operator read/update/admin APIs + audit + metrics by rongxin-liu · Pull Request #308 · block/proto-fleet

rongxin-liu · 2026-05-25T18:37:49Z

Summary

Closes the operator surface and observability scaffolding for v1 curtailment. Lifecycle, dispatch, and restore are already on main (#192 + #232); this PR fills in the missing read/update/admin RPCs, the audit + metrics hooks, and the reconciler race fixes uncovered along the way.

Closes #289.

What's in this PR

1. Operator RPCs

End-to-end (service + handler + audit + tests). All three are gated via middleware.RequirePermission against curtailment:read (List) or curtailment:manage (Update, AdminTerminate); AdminTerminate keeps the session-only admin-role check as defense-in-depth.

ListCurtailmentEvents — cursor-paginated history. SQL trims the per-device skipped list into a reason→count aggregate so responses stay bounded on large fleet events. Cursor binds (org_id, state_filter, id); legacy tokens without org_id restart from page 1 instead of crossing tenants.
UpdateCurtailmentEvent — patches reason, restore batch + interval, max duration. Empty-patch rejection (otherwise updated_at would tick from a no-op); admin gate on raising max_duration_seconds; typed FailedPrecondition on the pre-read / write state race.
AdminTerminateEvent — forces a non-terminal event to CANCELLED/FAILED and sweeps every non-terminal target to RESTORE_FAILED in one transaction. Stop-first gate triggers on any in-flight curtail target. Idempotent same-target echo returns the row and emits a typed curtailment_admin_terminated_replay row carrying the caller's actor + reason, so a concurrent race-loser's attribution still lands in the audit feed.

2. Audit + metrics scaffolding

Both interfaces default to no-op; the platform implementations swap in at cmd/fleetd/main.go without churn in the curtailment package.

AuditLogger — wired to activity.Service via LogStrict so audit-write failures surface to a counter instead of being silently swallowed. Emits curtailment_started (plus override-class rows for allow_unbounded and force_include_maintenance), curtailment_admin_terminated / ..._replay, and curtailment_updated.
Metrics — tick duration, tick failures, candidate-exclusion (labeled by reason), maintenance overrides, event-state race-loss, target-write failures, audit-write failures.

3. Reconciler hardening

Race + reliability gaps uncovered while wiring the admin surface:

DISPATCHING transient state — targets are stamped before cmd.Curtail / cmd.Uncurtail so a concurrent terminate's in-flight gate observes them during the command window. Interrupted dispatch leaves a DISPATCHING orphan; the next tick redispatches via the normal pending loop (commands are device-idempotent).
Orphan-priority restore claim — when DISPATCHING orphans exist, the restore claim phase redispatches only those and holds fresh PENDING for the next tick. Prevents a restart race from doubling the inrush and bypassing the one-batch-per-interval throttle.
Per-event liveness re-read before each dispatch, hoisted per-event so a 100-target event pays one DB lookup per tick, not 100.
Typed race-loss sentinels on event/target state writes — :execrows zero-rows-affected surfaces as ErrCurtailmentEventStateRaceLoss so the reconciler counter-bumps and skips instead of treating the race-loss as a successful transition.
Retry-budget fallback + escalation — both the Curtail and Restore pre-write paths route non-race-loss failures through recordDispatchFailure, which bumps retry_count and falls back to BumpTargetRetry when the rich state write itself can't land. Exhausted DISPATCHING and Drifted orphans escalate to RESTORE_FAILED rather than looping in the transient state.
Heartbeat semantics — advances on tick freshness even when the event-list query errors, so the SQL staleness alert distinguishes reconciler-dead from DB-read-degraded.

4. Webhook idempotency persistence

(org_id, idempotency_key) and (org_id, external_source, external_reference) lookups land at the persistence boundary so the eventual webhook-ingest RPC arrives without a contract change. Redelivery returns the original event (state + targets) rather than synthesizing PENDING; race-losers route through the replay branch instead of leaking Postgres constraint names.

5. Migrations

Five files (one statement each — required for CONCURRENTLY under golang-migrate):

000055 — (org_id, id DESC) index for List.
000056 / 000057 — re-scope the idempotency_key partial-unique index to non-terminal events so a completed event's key frees up for a new Start.
000058 / 000059 — same scoping for (external_source, external_reference).

6. Proto contract evolution

AdminTerminateEventRequest.idempotency_key (tag 4) removed for v1; tag + name reserved so an old client's payload can't alias to a future field.
reason gains min_len=1, max_len=256; page_token gains max_len=1024; UpdateCurtailmentEventRequest adds bounds mirroring Start so the same value can't tunnel in via Update.
CurtailmentTargetState.DISPATCHING (= 8) added as a read-back value.
RPC docs enumerate FailedPrecondition variants for AdminTerminate (codes 1 / 2) and the trimmed response shape for List.

7. Frontend touch

CurtailmentHistory.tsx — small cleanup: extract getSortTime, switch the status filter to a Set lookup. No behavior change. UI wiring for the new RPCs is a follow-up.

Out of scope (follow-ups)

AdminTerminate.force for the case where Stop itself fails (DB outage, adapter unreachable) — pairs with a runbook decision about abandonment semantics.
Webhook ingest RPC itself — gated on naming a specific upstream system.
Scoped API keys — landing separately; override fields stay API-key-reachable until then.

Test plan

go build ./... and lint clean on the changed scope.
Service-layer tests cover List, Update, and AdminTerminate end-to-end: state guards, admin gating, empty-patch rejection, race-loss handling, audit emission vs. suppression on idempotent replays.
Idempotency-replay tests cover both Start channels (key + external ref), precedence, partial-fields handling, and race-loser routing through the unique-violation sentinel.
Reconciler tests cover the per-event liveness guard, DISPATCHING pre-stamp visibility, orphan recovery, orphan-priority restore claim, exhausted-orphan escalation to RESTORE_FAILED, Curtail-path pre-write retry-budget bump, retry-budget fallback, and heartbeat advancement on list-query failure.
Handler tests cover session resolution, role gates, RequirePermission denial for callers lacking curtailment:read / curtailment:manage, malformed UUID rejection, and FailedPrecondition code mapping for both AdminTerminate variants.
Cursor codec round-trip, legacy-token (no org_id) restart from page 1, plus malformed / state-filter-mismatch rejection.
Docker E2E (server/e2e/curtailment_e2e_test.go) exercises the full lifecycle against a real Postgres + reconciler tick loop, including reconciler-restart DISPATCHING orphan recovery.

Curtailment needs operational metrics — tick duration, tick failures, selector candidate exclusions, maintenance overrides — but the codebase has only OTel tracing today (no Meter, no /metrics, no Prometheus exporter). The pipeline-shape decision is platform-team scope and already in flight via the notifications + Grafana migration; curtailment shouldn't make that decision unilaterally and shouldn't block on it. Define a Metrics interface in the curtailment domain with a no-op default. Service and Reconciler accept it through a functional option so the dozens of existing test call sites (NewService(store), New(cfg, store, cmd)) keep working unchanged. main.go wires NoOpMetrics through both constructors so production has a single named site to swap when the platform observability path lands — interface-stable, one-file change, no curtailment-package churn. Recorder call sites land in follow-up commits. Refs #289

Three of the four Metrics recorders now have call sites: - ObserveTickDuration fires from safeTick around runTick, capturing wall-clock per tick on every path (happy, panic-recovered, list-events-failure). - IncTickFailure fires from safeTick on tick-infra panic AND from processEvent on per-event panic. The list-events early-return path is intentionally NOT counted because the heartbeat still advances there ("freshness, not query health" — see the comment in runTick). - IncCandidateExcluded fires from Service.Start (not Preview) after the selector returns, once per skipped device labeled by reason. Start- only emission keeps debounced Preview calls from flooding the counter against a static fleet snapshot. IncMaintenanceOverride is intentionally deferred. The per-miner increment needs the selector to surface "this miner was kept because the maintenance override was honored" — current candidate filtering just lets the miner fall through without tagging. That instrumentation lands in a follow-up commit alongside the audit-sweep work where `curtailment_maintenance_override` activity rows are emitted on the same code path. Tests add a goroutine-safe recordingMetrics fake in both the reconciler and service test files. Three reconciler tests pin ObserveTickDuration on the happy path, IncTickFailure on tick-infra panic, and IncTickFailure on per-event panic. One service test pins IncCandidateExcluded on a phantom-load miner. Refs #289

Operator-facing event history was previously Unimplemented and the settings-page history table (PR #280) was reading fixtures. This wires the RPC through every layer with a trimmed decision-snapshot policy that keeps response sizes bounded on large fleets. - sqlc: ListCurtailmentEventsForOrg, cursor-paginated by id DESC with an optional state filter. Caller passes limit+1 so the over-fetch detects whether another page remains. - Store: opaque cursor (base64-encoded JSON) so the token shape is free to grow later (sort fields, secondary keys) without breaking older clients. PageSize <=0 maps to a 50-row default; an internal upper cap of 200 mirrors the proto validator as defense in depth. - Service: ListEvents validates org and rejects negative page_size, then forwards to the store. Service-layer guard is needed because cross-tenant exposure is one query away. - Handler: replaces the Unimplemented stub. Session-based org-id resolution, proto enum → service-layer state-filter mapping. - Translate: list-view event proto omits per-target rows (heavy on 10K-miner events × N pages) and trims the per-device `skipped` array to `skipped_aggregate` reason-count map. Top-K selected and the summary fields stay intact so dashboards can render exclusion trend lines. Test fakes in three packages gain ListEvents stubs; the curtailment-package fakeStore gains a working pagination impl mirroring the SQL semantics so service-level tests can assert cursor round-trips. Refs #289

Operator-safe partial update of a non-terminal event. Replaces the Unimplemented stub on the handler. State policy: pending and active accept the patch; restoring and terminal states reject with FailedPrecondition. Operators who need to intervene mid-restore go through AdminTerminateEvent — that's the recovery surface, not Update. The conservative policy keeps the recompute-vs-freeze question (Open #13) out of v1: Update of restore_batch_size persists the new value but does NOT recompute effective_batch_size. The reconciler's restore-claim reads the Start-time stamped value through to the next event. Validation mirrors Start: restore_batch_interval_sec is gated by the non-admin cap (admin sets the session-derived bypass), max_duration must be > 0 and <= 7 days, restore_batch_size >= 0. Misconfigured values surface as InvalidArgument or Forbidden — never as a DB CHECK violation. sqlc UPDATE uses COALESCE on nil params so a partial patch preserves unset columns. The WHERE clause re-asserts state IN ('pending', 'active') as defense in depth: a race where the row advanced between the service's pre-read and the UPDATE surfaces as ErrCurtailmentUpdateStateRaceLoss → FailedPrecondition with a distinct message from the pre-read rejection. Refs #289

Adds the admin-only escape hatch for forcing a non-terminal event to CANCELLED or FAILED when a normal stop+restore cycle can't run. The persistence layer wraps the terminal state transition and the swept-target update in a single transaction via db.WithTransaction so the event row and its targets stay in sync. An idempotent re-issue with the same target_state is a no-op; a different terminal state surfaces ErrCurtailmentAdminTerminateStateConflict, which the service maps to FailedPrecondition. Service-layer defense-in-depth checks (target_state in {CANCELLED, FAILED}, non-empty reason, org/uuid present) mirror the proto validator so non-Connect callers can't tunnel past it.

Adds a pre-insert lookup so a re-issued Start with the same idempotency_key or (external_source, external_reference) pair returns the original event instead of re-running the selector and tripping the partial unique indexes (which would surface as a less helpful AlreadyExists from the per-org non-terminal constraint). idempotency_key takes precedence over external reference: the operator-supplied retry handle wins over upstream re-delivery. Lookup errors propagate unchanged so a transient db failure is visible rather than silently falling through to a double-insert attempt.

Adds an AuditLogger interface on the curtailment Service with a no-op default so tests that don't care can ignore the wiring. main.go injects *activity.Service via WithAuditLogger. Two override-specific event types ride alongside the base curtailment_started row so a feed of unbounded or force-include starts is a simple event-type filter rather than a metadata scan. IncMaintenanceOverride fires in parallel when force_include_maintenance is set, so the platform metrics dashboard tracks the override rate without joining against activity_log. Audit emission is intentionally absent on the idempotent-replay and insufficient-load paths: the original Start already recorded the trail, and a path that never persisted shouldn't claim it did.

Documents the curtailment_reconciler_heartbeat-based liveness signal: warn at 2 minutes of staleness with active events present, page at 5 minutes regardless. The SQL form is canonical; the vmalert rule is parked behind a placeholder bridge metric so the wiring is one config change away once a postgres-exporter publishes the staleness gauge. Runbook walks four failure modes (panic loop, slow-query contention, events not picked up, restore-loop forever) with operator response steps that lean on AdminTerminateEvent for the cases where infra mitigation isn't enough.

Walks the operator-facing service flow end-to-end against the in-memory fake: Preview (no persistence side-effects) → Start (persistence + audit + metrics) → Stop (RESTORING transition) → AdminTerminate (forced terminal). The reconciler's tick-by-tick state machine is covered piecewise in reconciler_test.go and restore_test.go; this test pins the boundary between the public service API and the persistence layer. Companion tests cover the webhook idempotency-replay path (duplicate Start short-circuits, no double-audit) and the read-path query (ListEvents returns terminal rows filtered by state). A docker-driven HTTP-level e2e for the same lifecycle is a follow-up — the existing server/e2e dir requires postgres + proto-sim and lands when the curtailment plugin path is ready.

Four real lint findings from this branch, fixed without suppressions: - Service.AdminTerminate: replace the two-case switch + default with an if-comparison so exhaustive doesn't demand the unhandled cases be enumerated. The default branch was load-bearing — the if form keeps the same behavior with less surface. - service_list_test.go / handler_list_test.go: hoist the opaque cursor literal into a file-scope const. gosec G101 looks at string literals assigned to fields whose name matches credential keywords (PageToken matches "token"); an identifier reference clears the heuristic cleanly. - service_start_idempotency_test.go: move the subtest store + svc creation inside the t.Run closure so each subtest can call t.Parallel() without sharing mutable counters across cases.

A multi-reviewer code review surfaced four merge-blocking P1s and a batch of P2/P3 hygiene items on this branch. This commit lands the focused, defensible subset that doesn't require contract or security-policy decisions; the remaining items are recorded for the next session. Idempotency / race-recovery: - Recognize uq_curtailment_event_idempotency and uq_curtailment_event_external_ref unique violations as race-loss via typed sentinels; Service.Start re-issues the corresponding replay lookup so the race-loser falls into the same response path as a deliberate retry rather than surfacing Internal with the constraint name leaked in the error string. - AdminTerminateEvent: on zero-row UPDATE caused by a concurrent terminate-to-same-state, re-read inside the transaction and echo the row idempotently (mirrors BeginRestoreTransition's pattern). Audit / observability: - Emit a curtailment_admin_terminated activity row on AdminTerminate so the privileged force-terminate path captures actor + reason in the activity feed (parallels emitStartAuditTrail). - emitStartAuditTrail now maps req.SourceActorType to activitymodels.ActorType so scheduler-triggered starts persist actor_type='scheduler' on activity_log instead of defaulting to 'user'. Update path hardening: - Reject explicit empty-string Reason as InvalidArgument and add a 256-char length cap mirroring Start. The proto-translate comment is updated to describe the actual silent-no-op behavior. Proto contract docs: - Field-level docstrings on ListCurtailmentEvents describing the omitted target_rollup/targets and the trimmed decision_snapshot shape (skipped_aggregate vs raw skipped). - max_len=1024 on ListCurtailmentEventsRequest.page_token so the cursor decode path is bounded. - Annotate the two eventStateFromProto call sites distinguishing the no-filter sentinel role from the target_state mapping role. Cleanup: - Drop duplicate finitePtr generic in handler_start_test.go (the existing ptr generic in handler_test.go covers the use case). - Inline single-call-site valueOrZero generic in service.go.

Codex security review + Copilot inline reviewers surfaced five actionable findings on this branch. All five validated; landing the fixes here. Admin gate on Update.max_duration_seconds (HIGH). Mirrors Start's post-normalization admin check inside Service.Update. Without this, a non-admin who Started at the org default could Update the same event above the default, bypassing the privilege boundary Start enforces. Fetches org config lazily — only when max_duration is in the patch and the caller lacks admin controls. AdminTerminate.reason length cap (MEDIUM). Service-level backstop rejects oversized reasons (>256 chars) so a bulky operator string can't amplify into every swept target's last_error column. The proto field gets the matching max_len=256 rule; proto regen is deferred to a clean tooling pass (the service backstop already catches the case today). List query trims decision_snapshot at the SQL boundary (MEDIUM). ListCurtailmentEventsForOrg now projects explicit columns with (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip list (multi-MB on 10K-miner events) doesn't ride the wire for every list row. Field layout matches CurtailmentEvent exactly so the existing convertEventRow path applies via a single struct conversion. Cursor rejects non-positive IDs (MEDIUM). decodeCurtailmentEventCursor now returns InvalidArgument when the decoded id is <= 0. The store never emits a non-positive id; a user-supplied token that decodes to one would silently rewind to the first page (id=0) or return zero rows (id<0). Audit metadata key naming (MEDIUM). Renamed `force_include` to `force_include_maintenance` on the curtailment_started audit row metadata so the key matches the domain/proto field name. Downstream analytics no longer have to map between abbreviated and full names. Test coverage added for each fix: non-admin max_duration rejection, admin pass-through, oversized reason rejection, cursor non-positive id rejection (zero / negative / missing).

Harden curtailment replay, list, and reconciler behavior so the new read/admin APIs keep their documented contracts under retries and races.

Prevent stale reconciler ticks from dispatching commands after event state changes and require active admin-terminated events to stop first so restore work is queued. Reserve the removed admin terminate idempotency field to protect protobuf compatibility.

…etrics

…tate race-loss Two related reconciler-write-path issues from the third-iteration review: eventStillDispatchable was called once per target inside dispatchOneCurtail and dispatchRestoreBatch. For a 100-target event that's 100 redundant GetEventByUUID lookups before any Curtail dispatch fires — the event row doesn't change between iterations within a tick. Hoist the check to the event level (after target list in dispatchPending, after enforceMaxDuration in observeActive) and drop the per-target invocation. UpdateCurtailmentEventState was sqlc :exec, which discards the rows-affected count. When a concurrent AdminTerminate sweeps the event terminal, the reconciler's maybeMarkActive / maybeCompleteRestoring UPDATE matched 0 rows, the store returned nil, and the reconciler proceeded as if the transition succeeded. Switch the query to :execrows; map 0-rows to a new typed sentinel ErrCurtailmentEventStateRaceLoss; reconciler logs slog.Warn and increments a new IncEventStateRaceLoss metric on the signal. Tick behavior is unchanged beyond the new observability — making the reconciler abort the tick on this signal is a separate design decision tracked in the follow-up queue. Also clarifies the comment on the list-query skipped-trim projection: the win is application-tier bytes (network + JSON decode), not Postgres disk I/O — TOAST detoast still occurs per row.

Mirrors TestHandler_AdminTerminateEvent_StateConflictMapsFailedPrecondition for the ErrCurtailmentAdminTerminateActiveEvent path so the operator-visible "Stop first" semantics have a regression guard at the gRPC code-mapping layer (service-layer mapping was already covered).

…t guidance The AdminTerminateEvent RPC now surfaces two distinct FailedPrecondition variants that share a gRPC code (active-event-must-stop-first; state-conflict). Expand the proto service comment to enumerate both, matching StopCurtailment's existing pattern, and note reason's length bounds. The reconciler runbook's failure-mode #1 mitigation step previously prescribed direct AdminTerminateEvent on a panicking event, which now returns FailedPrecondition for ACTIVE events. Amend to call out the StopCurtailment-first requirement; PENDING and RESTORING events can still be terminated directly. Proto-only doc edit; no .pb.go regen needed.

toEventProtoListItem previously zeroed ExternalSource / ExternalReference / IdempotencyKey inline after toEventProto had set them. Move the scrub into a named helper so the intent (these fields are list-view-omitted) is explicit at the call site and a future fourth scrubbed field doesn't need a comment to explain the pattern. No behavior change.

A page_token issued before the cross-list-binding org_id field landed used to reject with InvalidArgument once the binding-check went live. A long-lived pagination loop crossing the deployment boundary should restart from the first page transparently instead of surfacing an opaque cursor error to the caller. Decode now treats OrgID==0 as the "first page" sentinel and returns a nil cursor (same as an empty token). State_filter mismatch and other non-zero violations still reject. Test renamed to reflect the new contract.

validateUpdateRequest now rejects requests where every patchable optional field (reason, restore_batch_size, restore_batch_interval_sec, max_duration_seconds) is nil. The SQL UPDATE still ran in that case and bumped updated_at via COALESCE, producing a misleading freshness signal for clients that track the column and adding write load with no semantic change. Affected service + handler tests that previously passed empty UpdateRequests to reach state-guard / not-found branches now carry a minimal valid patch (Reason). New TestService_Update_RejectsEmptyPatch pins the new contract.

…lment-read-apis-audit-metrics # Conflicts: # server/generated/sqlc/db.go

A concurrent AdminTerminate landing between target N and target N+1 of the dispatch loop would let remaining Curtail commands fire against an already-terminated event because the previous design checked liveness once per event-tick. Move the eventStillDispatchable call to the start of dispatchOneCurtail and drop the redundant pre-loop hoists in dispatchPending and observeActive. The cost is one GetEventByUUID per target — N reads per tick for an N-target event — which is acceptable; AdminTerminate is rare and the perf optimization was incorrect in claim (the event row CAN change inside a tick) and incorrect in consequence. dispatchRestoreBatch is unaffected: its single bulk Uncurtail is one command per batch, so the existing per-event check already guards it. Adds TestReconciler_SkipsRemainingCurtailDispatchesWhenEventTerminatesMidLoop asserting only the first target dispatches when the event flips between target 1 and target 2.

The previous gate rejected admin-terminate only when event state was ACTIVE. A PENDING event whose reconciler tick had already dispatched some Curtail commands had targets in DISPATCHED/CONFIRMED/DRIFTED, and admin-terminate would proceed — sweeping those targets to RESTORE_FAILED without issuing the compensating Uncurtail commands, leaving the already-curtailed miners stuck. Replace the state==active check with a SQL existence check on non-restored target states. The new check subsumes ACTIVE (which always has CONFIRMED targets via maybeMarkActive) and additionally catches PENDING events with dispatched targets. The ErrCurtailmentAdminTerminateActiveEvent sentinel keeps its name for backward compatibility with handler/test references, but its doc and operator-facing error message now describe the broader condition. Proto RPC comment and the reconciler runbook's mitigation guidance updated to match.

RBAC PR 1 landed migration 000052_create_permission_tables on main and the post-merge commit had two files claiming version 52 — the duplicate trips golang-migrate's version uniqueness check at server boot. The list-index migration only ever lived on this feature branch, so the migration-immutability rule (which gates edits to migrations on main) does not apply. Renaming to the next free slot (000055, after the RBAC track's 000052/000053/000054) is the right resolution. Content unchanged; only the file numbers move.

…NTLY on list index Two BE-5 merge gates landed together. force_include_maintenance is safety-critical: it commands curtailment on miners in active physical maintenance — forcibly power-cycling a miner a technician is servicing is a personnel hazard. The BE-1.x design intended Admin-only; the gate was never wired and any API-key caller could trip it. Wire requireAdminFromContext mirroring the allow_unbounded pattern; add table coverage in TestHandler_OverrideFieldsRoleGate. Migration 000055's index now builds with CONCURRENTLY. The earlier "annotation needed" framing turned out to be wrong — golang-migrate v4's postgres driver runs ExecContext directly without wrapping the migration body in a transaction, so CONCURRENTLY works without any no-transaction annotation. Cheap fix; a hard merge gate at high-row-count deploys.

…l / Uncurtail Closes the AdminTerminate residual race named in Known Limitation #10. Before this change, a tick that read PENDING targets, called cmd.Curtail, then lost a foot race to a concurrent AdminTerminate left miners curtailed with no compensating Uncurtail — the sweep flipped the just-dispatched-to targets to RESTORE_FAILED while the command landed against a dead event. Blast radius scaled with event size: thousands of stranded curtailments on a 5K-miner mid-dispatch event. The fix is a two-phase write. dispatchOneCurtail and dispatchRestoreBatch now stamp DISPATCHING on each target before issuing the command, and transition to DISPATCHED only after the command returns. The in-flight gate's SQL EXISTS predicate (CurtailmentEventHasInFlightTargets) counts dispatching rows alongside dispatched/confirmed/drifted. A concurrent terminate that races a mid-dispatch tick observes the dispatching row and rejects as Stop-first, so the command cannot fire against a swept event. last_dispatched_at intentionally lands on the DISPATCHED write, not the DISPATCHING pre-write — it records successful enqueue, used by the restore-batch interval gate. Filter-skipped / empty-batch failures roll back via recordDispatchFailure without leaving a misleading timestamp. New proto value CURTAILMENT_TARGET_STATE_DISPATCHING = 8. New Go constant TargetStateDispatching ("dispatching"). Regression coverage in TestReconciler_DispatchingPreWrite_CommitsBeforeCommand via a curtailHook on the fake dispatcher that inspects store state at the moment cmd.Curtail is called.

Start dispatch wrote each target with a separate InsertCurtailmentTarget round-trip inside the event transaction. For a 10K-miner curtailment that's 10K serial parameterized exec calls — easily the dominant cost of a Start RPC even with prepared-statement reuse, and a steady source of transaction holding time that conflicts with the reconciler tick. Switch to a single BulkInsertCurtailmentTargets call that fans the rows out server-side via jsonb_to_recordset. The Go side marshals the per-target inputs into one JSONB array; Postgres expands it into the typed columns (NUMERIC for baseline_power_w, JSONB for the selector rationale, missing keys → NULL). Returns rows affected so the store can pin (inserted == len(targets)) and fail the transaction loudly if jsonb_to_recordset silently drops a row that fails its column cast. Drops the unused single-row InsertCurtailmentTarget query and the rawMessageOrNullable helper that only it consumed. The JSON payload shape is pinned by a unit test so a wrong field tag or nil-handling regression breaks loud at unit-time rather than via a docker-compose integration failure.

Multiple corrections surfaced by the review pass against this PR. Runtime: - SweepCurtailmentTargetsToRestoreFailed wrote to a non-existent `updated_at` column on `curtailment_target`. Every AdminTerminate against a non-terminal event with at least one in-scope target tripped "column does not exist" and rolled back the entire termination, leaving the admin recovery escape hatch unusable. Drop the column from the UPDATE; the other target writes already don't reference it. - recordDispatchFailure now falls back to a new BumpCurtailmentTargetRetry query when the rich state-change UPDATE fails non-race-loss. Persistent state-write failures no longer block retry-budget progress; MaxRetries → RESTORE_FAILED escalation lands on the next successful UpdateTargetState. New restore-path test pins the fallback contract. - Restore the "heartbeat advances on tick freshness, not query health" semantic when ListNonTerminalEvents fails. The SQL staleness alert distinguishes reconciler-dead (no upsert) from DB-read-degraded (upsert advances, IncTickFailure rises). Test renamed to match. Contract: - UpdateCurtailmentEventRequest gains buf-validators mirroring Start (reason min/max_len, restore_batch_size lte:10000, restore_batch_interval_sec lte:3600, max_duration_seconds lte:604800). Previously a client could patch a non-terminal event with bounds Start would reject. - AdminTerminate proto docstring documents the RESTORING-event race: the in-flight gate scopes to desired_state=CURTAILED, so a terminate during restore proceeds even with in-flight Uncurtails. Race-window targets persist as RESTORE_FAILED while the device may be restored. - ListCurtailmentEvents proto docstring documents the trigger-metadata scrub on list rows; clients correlating webhook deliveries fetch via GetActiveCurtailment for unscrubbed attribution.

- Migrations 000056/000058 add operator-recovery notes for the CONCURRENTLY-split failure mode (paired CREATE in 000057/000059 fails after the DROP lands: schema_migrations stays dirty with the unique index gone; resolve duplicates, force the version, retry). - Replace dangling `Open #13` references with the recompute-vs-freeze reason inline so future readers don't chase a missing tracker link.

…of the repo

Copilot

Pull request overview

This PR completes the remaining v1 curtailment operator/admin RPC surface (list/update/admin-terminate) and adds supporting persistence, audit, metrics, and reliability hardening needed for production observability and safe concurrency.

Changes:

Added operator RPC implementations for listing event history and updating operator-safe fields, plus an admin terminate RPC (service/handler/store + tests).
Introduced cursor-based pagination for event history and persistence-layer idempotency lookups (idempotency key + external reference) with migrations to scope uniqueness to non-terminal events.
Added curtailment audit + metrics interfaces (no-op defaults) and wired them into fleetd, along with expanded unit and E2E coverage.

Reviewed changes

Copilot reviewed 44 out of 50 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
server/sqlc/queries/curtailment.sql	Adds queries for list pagination, idempotency lookups, admin terminate, bulk target insert, and race-loss guards.
server/migrations/000055_add_curtailment_event_list_index.up.sql	Adds `(org_id, id DESC)` index to support list pagination.
server/migrations/000055_add_curtailment_event_list_index.down.sql	Drops the list pagination index.
server/migrations/000056_scope_curtailment_idempotency_to_non_terminal.up.sql	Drops prior idempotency unique index (prep for scoped recreate).
server/migrations/000056_scope_curtailment_idempotency_to_non_terminal.down.sql	Recreates the unscoped idempotency unique index (rollback).
server/migrations/000057_scope_curtailment_idempotency_to_non_terminal.up.sql	Recreates idempotency unique index scoped to non-terminal events.
server/migrations/000057_scope_curtailment_idempotency_to_non_terminal.down.sql	Drops scoped idempotency unique index (rollback).
server/migrations/000058_scope_curtailment_external_ref_to_non_terminal.up.sql	Drops prior external-ref unique index (prep for scoped recreate).
server/migrations/000058_scope_curtailment_external_ref_to_non_terminal.down.sql	Recreates unscoped external-ref unique index (rollback).
server/migrations/000059_scope_curtailment_external_ref_to_non_terminal.up.sql	Recreates external-ref unique index scoped to non-terminal events.
server/migrations/000059_scope_curtailment_external_ref_to_non_terminal.down.sql	Drops scoped external-ref unique index (rollback).
server/internal/handlers/curtailment/translate.go	Adds request/response translation for list/update/admin-terminate and list-view scrubbing.
server/internal/handlers/curtailment/handler.go	Implements Update/List/AdminTerminate handlers and extends Start admin gating.
server/internal/handlers/curtailment/handler_update_test.go	Handler-level tests for UpdateCurtailmentEvent (auth, validation, role gates, response shape).
server/internal/handlers/curtailment/handler_test.go	Updates validation expectations and extends override-field role gating tests.
server/internal/handlers/curtailment/handler_stop_test.go	Updates stub store surface for new store interface methods.
server/internal/handlers/curtailment/handler_start_test.go	Extends Start handler tests for idempotent replay response rendering.
server/internal/handlers/curtailment/handler_list_test.go	Adds handler tests for ListCurtailmentEvents behavior and response scrubbing.
server/internal/handlers/curtailment/handler_admin_terminate_test.go	Adds handler tests for AdminTerminateEvent role gate and error mapping.
server/internal/domain/stores/sqlstores/curtailment_test.go	Pins JSON payload shape for bulk target insert into Postgres recordset.
server/internal/domain/stores/sqlstores/curtailment_cursor.go	Implements cursor codec for event-history pagination.
server/internal/domain/stores/sqlstores/curtailment_cursor_test.go	Tests cursor codec round-trip and invalid-token rejection behavior.
server/internal/domain/stores/interfaces/curtailment.go	Extends store interface: list/update/admin-terminate/idempotency/race-loss signaling.
server/internal/domain/curtailment/service_test.go	Expands fake store + introduces goroutine-safe metrics recorder for assertions.
server/internal/domain/curtailment/service_start_test.go	Adds Start guard + metrics emission test coverage (force maintenance, exclusion metrics).
server/internal/domain/curtailment/service_start_idempotency_test.go	Adds comprehensive idempotency replay + race-loser behavior tests.
server/internal/domain/curtailment/service_start_audit_test.go	Adds audit emission tests and audit-write-failure metric assertions.
server/internal/domain/curtailment/service_list_test.go	Adds service-level tests for ListEvents parameter forwarding and validation.
server/internal/domain/curtailment/service_lifecycle_test.go	Adds end-to-end lifecycle tests against the in-memory fake (Preview/Start/Stop/AdminTerminate/List).
server/internal/domain/curtailment/service_admin_terminate_test.go	Adds service-level validation + error mapping tests for AdminTerminate.
server/internal/domain/curtailment/selector.go	Extends Plan to support replay rendering from persisted event + targets.
server/internal/domain/curtailment/reconciler/restore_test.go	Expands restore-phase reconciler tests for retry-budget fallback and failure modes.
server/internal/domain/curtailment/models/models.go	Adds DISPATCHING target state and clarifies candidate documentation.
server/internal/domain/curtailment/metrics.go	Introduces Metrics interface with a default no-op implementation.
server/internal/domain/curtailment/audit.go	Introduces AuditLogger interface + activity type constants + no-op logger.
server/generated/sqlc/db.go	Auto-generated sqlc query preparation updates for new/renamed queries.
server/generated/grpc/curtailment/v1/curtailmentv1connect/curtailment.connect.go	Auto-generated Connect client/handler docs updated for new RPC semantics.
server/e2e/curtailment_e2e_test.go	Adds E2E coverage for full curtailment lifecycle and reconciler restart safety.
server/cmd/fleetd/main.go	Wires curtailment handler, audit logger, and metrics recorder into the server.
proto/curtailment/v1/curtailment.proto	Evolves proto contract: new DISPATCHING state, request bounds, list semantics, admin terminate docs.
client/src/protoFleet/features/energy/CurtailmentHistory.tsx	Minor refactor: extracted sort-time helper and Set-based status filtering.
.gitignore	Ignores `docs/reviews/` tooling artifacts.

github-actions · 2026-05-25T18:42:57Z

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

Reviewed pull request diff only (16121bbbfdbed395396ff06a1b1e46ee72d27764...ad191066f385fbe2abd7cde647ac9eac7e20458d, exact PR three-dot diff)

Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.

Review Summary

Overall Risk: HIGH

Findings

[HIGH] Terminal idempotency keys can replay as fresh curtailments

Category: Reliability
Location: server/migrations/000057_scope_curtailment_idempotency_to_non_terminal.up.sql:1
Description: The new partial unique index only covers pending, active, and restoring events. The replay lookups mirror that by ignoring terminal events. A duplicate webhook/API retry with the same idempotency_key or (external_source, external_reference) after the original event completes will no longer replay the original row; it can start a new curtailment.
Impact: Late retries or replayed external signals can repeatedly curtail miners after the operator believes the event is over, causing avoidable hashrate and revenue loss.
Recommendation: Keep dedupe keys valid beyond terminal state, at least for a retention window, and return the prior terminal event or a non-retryable conflict. If key reuse is required for a new curtailment, require a new explicit operation/version key instead of releasing the same idempotency key immediately at terminal state.

[MEDIUM] Updating `restore_batch_size` does not update the value the restorer uses

Category: Reliability
Location: server/internal/domain/curtailment/service.go:301
Description: UpdateCurtailmentEvent accepts restore_batch_size, but the restorer uses EffectiveBatchSize, which is stamped at Start and is not recomputed or patched by the update path. The SQL update changes restore_batch_size only, so the API can report a successful update that has no effect on restore batching.
Impact: Operators may believe they slowed down or sped up restore batching, while the reconciler continues using the old effective batch size. This can prolong restoration unexpectedly or preserve an unsafe inrush profile.
Recommendation: Either remove restore_batch_size from the update surface, or recompute and persist effective_batch_size when restore_batch_size changes while the event is still pending or active.

Notes

No pool URL, wallet, stratum, or worker credential changes were present in the reviewed diff.

_{Generated by Codex Security Review |

Triggered by: @rongxin-liu |

Review workflow run}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d6df608f07

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…etrics

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65678d95d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…issions Codex Security Review flagged these as accessible to any authenticated org member: the handlers only read session.GetInfo and never invoked middleware.RequirePermission. List/Update had no role gate at all; AdminTerminate only had a session-only admin-role gate. Each handler now calls RequirePermission against the appropriate permission key — curtailment:read for List, curtailment:manage for Update + AdminTerminate. AdminTerminate keeps requireAdminFromContext as defense-in-depth so legacy seeds that have not yet had their curtailment:manage assignment audited still fail closed. The test helpers wire EffectivePermissions onto the request context so the gate fires; new denial cases prove a caller without the permission is rejected before any store work.

The PR description promises that pre-org-binding page tokens transparently restart from the first page so an in-flight pagination loop spanning a deployment doesn't surface InvalidArgument. The decoder was rejecting any cursor with org_id <= 0, which contradicted that contract. Split the OrgID check into two cases: explicit negative is still tampering and surfaces InvalidArgument, but zero/missing org_id decodes to a nil cursor so ListEvents falls through to page 1. The cross-tenant guard at the call site is unaffected — a non-nil cursor still must match the caller's org_id.

…atch The previous claim loop accepted both PENDING and DISPATCHING into the same batch up to effective_batch_size. After a reconciler restart with DISPATCHING orphans (interrupted prior tick) coexisting with fresh PENDING, one batch could carry both — doubling the inrush and bypassing the one-batch-per-interval throttle the operator relies on. The claim phase now runs in two passes. If any DISPATCHING orphans exist, the wave redispatches only those (Uncurtail is device-idempotent, so re-sending is safe) and returns; fresh PENDING is held for the next tick. Throttling stays correct because orphan recovery consumes the tick's batch slot on its own.

When recordDispatchFailure's write fails non-race-loss, the BumpTargetRetry fallback advances retry_count without changing state. After enough bumps, a target sat in DISPATCHING (or Drifted) with retry_count past MaxRetries — and observeActive's `continue` arm silently skipped it every subsequent tick, so the row never reached a terminal state. Both arms now route exhausted targets through recordDispatchFailure itself. recordDispatchFailure already escalates to RESTORE_FAILED at the MaxRetries threshold via writeTargetState; if that write fails too, the existing fallback at least keeps the IncTargetWriteFailure counter firing so an operator alert surfaces. The Drifted arm carries the same bug as DISPATCHING (same fallback path, same observeActive skip). Applied symmetrically.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb81b73824

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…ail path The Curtail path's DISPATCHING pre-write only logged and returned when writeTargetState failed non-race-loss; retry_count was never bumped. A row-specific persistent write failure (one row hitting a lock or a transient column error while reads succeed) would stall the event indefinitely because the target never advanced toward RESTORE_FAILED and maybeMarkActive could not promote the event. Route the failure through recordDispatchFailure with nonTerminalFailureState as the residual state — matches the restore path's symmetric handler at dispatchRestoreBatch. recordDispatchFailure's BumpTargetRetry fallback keeps the budget advancing even when the rich state write also fails.

Rebase onto main picked up #308 / #311, which added a defense-in-depth requireAdminFromContext call ahead of the curtailment:manage gate. PR 2c's design moves RBAC to the authoritative check, so drop the legacy role gate and reorder so RequirePermission runs before the nil-service check (auth-before-implementation; preserves Unauthenticated on missing session instead of leaking Unimplemented). The TestHandler_AdminTerminateEvent_RejectsNonAdmin case asserted the legacy role gate's defense-in-depth behavior — it's redundant with the renamed TestHandler_AdminTerminateEvent_RejectsCallerWithoutCurtailmentManage which already covers the RBAC denial path. Drop the obsolete test.

rongxin-liu added 30 commits May 21, 2026 13:21

Regenerate codes

6530e84

Regenerate codes

1b4d25b

fix(curtailment): resolve read API review findings

8c3287a

Harden curtailment replay, list, and reconciler behavior so the new read/admin APIs keep their documented contracts under retries and races.

Merge branch 'main' into feat/issue-289-curtailment-read-apis-audit-m…

b84a593

…etrics

Merge remote-tracking branch 'origin/main' into feat/issue-289-curtai…

332d25d

…lment-read-apis-audit-metrics # Conflicts: # server/generated/sqlc/db.go

Regenerate codes

cf1c912

rongxin-liu added 6 commits May 25, 2026 12:32

Trimmed comments

a695866

fix curtailment review findings

ae84543

chore: gitignore docs/reviews/ to keep internal review artifacts out …

d6df608

…of the repo

Copilot AI review requested due to automatic review settings May 25, 2026 18:37

rongxin-liu requested a review from a team as a code owner May 25, 2026 18:37

github-actions Bot assigned rongxin-liu May 25, 2026

github-actions Bot added javascript Pull requests that update javascript code client server shared labels May 25, 2026

Copilot started reviewing on behalf of rongxin-liu May 25, 2026 18:38 View session

Copilot AI reviewed May 25, 2026

View reviewed changes

Comment thread server/internal/domain/stores/sqlstores/curtailment_cursor.go

chatgpt-codex-connector Bot reviewed May 25, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated

Merge branch 'main' into feat/issue-289-curtailment-read-apis-audit-m…

65678d9

…etrics

chatgpt-codex-connector Bot reviewed May 25, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go

rongxin-liu added 4 commits May 26, 2026 02:11

chatgpt-codex-connector Bot reviewed May 26, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go

mcharles-square approved these changes May 26, 2026

View reviewed changes

rongxin-liu merged commit aef2113 into main May 26, 2026
71 checks passed

rongxin-liu deleted the feat/issue-289-curtailment-read-apis-audit-metrics branch May 26, 2026 18:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(curtailment): operator read/update/admin APIs + audit + metrics#308

feat(curtailment): operator read/update/admin APIs + audit + metrics#308
rongxin-liu merged 67 commits into
mainfrom
feat/issue-289-curtailment-read-apis-audit-metrics

rongxin-liu commented May 25, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

github-actions Bot commented May 25, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rongxin-liu commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in this PR

1. Operator RPCs

2. Audit + metrics scaffolding

3. Reconciler hardening

4. Webhook idempotency persistence

5. Migrations

6. Proto contract evolution

7. Frontend touch

Out of scope (follow-ups)

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

github-actions Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔐 Codex Security Review

Review Summary

Findings

[HIGH] Terminal idempotency keys can replay as fresh curtailments

[MEDIUM] Updating restore_batch_size does not update the value the restorer uses

Notes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rongxin-liu commented May 25, 2026 •

edited

Loading

github-actions Bot commented May 25, 2026 •

edited

Loading

[MEDIUM] Updating `restore_batch_size` does not update the value the restorer uses