Skip to content

feat(curtailment): operator read/update/admin APIs + audit + metrics#308

Merged
rongxin-liu merged 67 commits into
mainfrom
feat/issue-289-curtailment-read-apis-audit-metrics
May 26, 2026
Merged

feat(curtailment): operator read/update/admin APIs + audit + metrics#308
rongxin-liu merged 67 commits into
mainfrom
feat/issue-289-curtailment-read-apis-audit-metrics

Conversation

@rongxin-liu

@rongxin-liu rongxin-liu commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes the operator surface and observability scaffolding for v1 curtailment. Lifecycle, dispatch, and restore are already on main (#192 + #232); this PR fills in the missing read/update/admin RPCs, the audit + metrics hooks, and the reconciler race fixes uncovered along the way.

Closes #289.

What's in this PR

1. Operator RPCs

End-to-end (service + handler + audit + tests). All three are gated via middleware.RequirePermission against curtailment:read (List) or curtailment:manage (Update, AdminTerminate); AdminTerminate keeps the session-only admin-role check as defense-in-depth.

  • ListCurtailmentEvents — cursor-paginated history. SQL trims the per-device skipped list into a reason→count aggregate so responses stay bounded on large fleet events. Cursor binds (org_id, state_filter, id); legacy tokens without org_id restart from page 1 instead of crossing tenants.
  • UpdateCurtailmentEvent — patches reason, restore batch + interval, max duration. Empty-patch rejection (otherwise updated_at would tick from a no-op); admin gate on raising max_duration_seconds; typed FailedPrecondition on the pre-read / write state race.
  • AdminTerminateEvent — forces a non-terminal event to CANCELLED/FAILED and sweeps every non-terminal target to RESTORE_FAILED in one transaction. Stop-first gate triggers on any in-flight curtail target. Idempotent same-target echo returns the row and emits a typed curtailment_admin_terminated_replay row carrying the caller's actor + reason, so a concurrent race-loser's attribution still lands in the audit feed.

2. Audit + metrics scaffolding

Both interfaces default to no-op; the platform implementations swap in at cmd/fleetd/main.go without churn in the curtailment package.

  • AuditLogger — wired to activity.Service via LogStrict so audit-write failures surface to a counter instead of being silently swallowed. Emits curtailment_started (plus override-class rows for allow_unbounded and force_include_maintenance), curtailment_admin_terminated / ..._replay, and curtailment_updated.
  • Metrics — tick duration, tick failures, candidate-exclusion (labeled by reason), maintenance overrides, event-state race-loss, target-write failures, audit-write failures.

3. Reconciler hardening

Race + reliability gaps uncovered while wiring the admin surface:

  • DISPATCHING transient state — targets are stamped before cmd.Curtail / cmd.Uncurtail so a concurrent terminate's in-flight gate observes them during the command window. Interrupted dispatch leaves a DISPATCHING orphan; the next tick redispatches via the normal pending loop (commands are device-idempotent).
  • Orphan-priority restore claim — when DISPATCHING orphans exist, the restore claim phase redispatches only those and holds fresh PENDING for the next tick. Prevents a restart race from doubling the inrush and bypassing the one-batch-per-interval throttle.
  • Per-event liveness re-read before each dispatch, hoisted per-event so a 100-target event pays one DB lookup per tick, not 100.
  • Typed race-loss sentinels on event/target state writes — :execrows zero-rows-affected surfaces as ErrCurtailmentEventStateRaceLoss so the reconciler counter-bumps and skips instead of treating the race-loss as a successful transition.
  • Retry-budget fallback + escalation — both the Curtail and Restore pre-write paths route non-race-loss failures through recordDispatchFailure, which bumps retry_count and falls back to BumpTargetRetry when the rich state write itself can't land. Exhausted DISPATCHING and Drifted orphans escalate to RESTORE_FAILED rather than looping in the transient state.
  • Heartbeat semantics — advances on tick freshness even when the event-list query errors, so the SQL staleness alert distinguishes reconciler-dead from DB-read-degraded.

4. Webhook idempotency persistence

(org_id, idempotency_key) and (org_id, external_source, external_reference) lookups land at the persistence boundary so the eventual webhook-ingest RPC arrives without a contract change. Redelivery returns the original event (state + targets) rather than synthesizing PENDING; race-losers route through the replay branch instead of leaking Postgres constraint names.

5. Migrations

Five files (one statement each — required for CONCURRENTLY under golang-migrate):

  • 000055(org_id, id DESC) index for List.
  • 000056 / 000057 — re-scope the idempotency_key partial-unique index to non-terminal events so a completed event's key frees up for a new Start.
  • 000058 / 000059 — same scoping for (external_source, external_reference).

6. Proto contract evolution

  • AdminTerminateEventRequest.idempotency_key (tag 4) removed for v1; tag + name reserved so an old client's payload can't alias to a future field.
  • reason gains min_len=1, max_len=256; page_token gains max_len=1024; UpdateCurtailmentEventRequest adds bounds mirroring Start so the same value can't tunnel in via Update.
  • CurtailmentTargetState.DISPATCHING (= 8) added as a read-back value.
  • RPC docs enumerate FailedPrecondition variants for AdminTerminate (codes 1 / 2) and the trimmed response shape for List.

7. Frontend touch

CurtailmentHistory.tsx — small cleanup: extract getSortTime, switch the status filter to a Set lookup. No behavior change. UI wiring for the new RPCs is a follow-up.

Out of scope (follow-ups)

  • AdminTerminate.force for the case where Stop itself fails (DB outage, adapter unreachable) — pairs with a runbook decision about abandonment semantics.
  • Webhook ingest RPC itself — gated on naming a specific upstream system.
  • Scoped API keys — landing separately; override fields stay API-key-reachable until then.

Test plan

  • go build ./... and lint clean on the changed scope.
  • Service-layer tests cover List, Update, and AdminTerminate end-to-end: state guards, admin gating, empty-patch rejection, race-loss handling, audit emission vs. suppression on idempotent replays.
  • Idempotency-replay tests cover both Start channels (key + external ref), precedence, partial-fields handling, and race-loser routing through the unique-violation sentinel.
  • Reconciler tests cover the per-event liveness guard, DISPATCHING pre-stamp visibility, orphan recovery, orphan-priority restore claim, exhausted-orphan escalation to RESTORE_FAILED, Curtail-path pre-write retry-budget bump, retry-budget fallback, and heartbeat advancement on list-query failure.
  • Handler tests cover session resolution, role gates, RequirePermission denial for callers lacking curtailment:read / curtailment:manage, malformed UUID rejection, and FailedPrecondition code mapping for both AdminTerminate variants.
  • Cursor codec round-trip, legacy-token (no org_id) restart from page 1, plus malformed / state-filter-mismatch rejection.
  • Docker E2E (server/e2e/curtailment_e2e_test.go) exercises the full lifecycle against a real Postgres + reconciler tick loop, including reconciler-restart DISPATCHING orphan recovery.

Curtailment needs operational metrics — tick duration, tick failures,
selector candidate exclusions, maintenance overrides — but the codebase
has only OTel tracing today (no Meter, no /metrics, no Prometheus
exporter). The pipeline-shape decision is platform-team scope and
already in flight via the notifications + Grafana migration; curtailment
shouldn't make that decision unilaterally and shouldn't block on it.

Define a Metrics interface in the curtailment domain with a no-op
default. Service and Reconciler accept it through a functional option
so the dozens of existing test call sites (NewService(store), New(cfg,
store, cmd)) keep working unchanged. main.go wires NoOpMetrics through
both constructors so production has a single named site to swap when
the platform observability path lands — interface-stable, one-file
change, no curtailment-package churn.

Recorder call sites land in follow-up commits.

Refs #289
Three of the four Metrics recorders now have call sites:

- ObserveTickDuration fires from safeTick around runTick, capturing
  wall-clock per tick on every path (happy, panic-recovered,
  list-events-failure).
- IncTickFailure fires from safeTick on tick-infra panic AND from
  processEvent on per-event panic. The list-events early-return path
  is intentionally NOT counted because the heartbeat still advances
  there ("freshness, not query health" — see the comment in runTick).
- IncCandidateExcluded fires from Service.Start (not Preview) after the
  selector returns, once per skipped device labeled by reason. Start-
  only emission keeps debounced Preview calls from flooding the counter
  against a static fleet snapshot.

IncMaintenanceOverride is intentionally deferred. The per-miner
increment needs the selector to surface "this miner was kept because
the maintenance override was honored" — current candidate filtering
just lets the miner fall through without tagging. That instrumentation
lands in a follow-up commit alongside the audit-sweep work where
`curtailment_maintenance_override` activity rows are emitted on the
same code path.

Tests add a goroutine-safe recordingMetrics fake in both the
reconciler and service test files. Three reconciler tests pin
ObserveTickDuration on the happy path, IncTickFailure on tick-infra
panic, and IncTickFailure on per-event panic. One service test pins
IncCandidateExcluded on a phantom-load miner.

Refs #289
Operator-facing event history was previously Unimplemented and the
settings-page history table (PR #280) was reading fixtures. This wires
the RPC through every layer with a trimmed decision-snapshot policy
that keeps response sizes bounded on large fleets.

- sqlc: ListCurtailmentEventsForOrg, cursor-paginated by id DESC with
  an optional state filter. Caller passes limit+1 so the over-fetch
  detects whether another page remains.
- Store: opaque cursor (base64-encoded JSON) so the token shape is
  free to grow later (sort fields, secondary keys) without breaking
  older clients. PageSize <=0 maps to a 50-row default; an internal
  upper cap of 200 mirrors the proto validator as defense in depth.
- Service: ListEvents validates org and rejects negative page_size,
  then forwards to the store. Service-layer guard is needed because
  cross-tenant exposure is one query away.
- Handler: replaces the Unimplemented stub. Session-based org-id
  resolution, proto enum → service-layer state-filter mapping.
- Translate: list-view event proto omits per-target rows (heavy on
  10K-miner events × N pages) and trims the per-device `skipped`
  array to `skipped_aggregate` reason-count map. Top-K selected and
  the summary fields stay intact so dashboards can render exclusion
  trend lines.

Test fakes in three packages gain ListEvents stubs; the
curtailment-package fakeStore gains a working pagination impl mirroring
the SQL semantics so service-level tests can assert cursor round-trips.

Refs #289
Operator-safe partial update of a non-terminal event. Replaces the
Unimplemented stub on the handler.

State policy: pending and active accept the patch; restoring and
terminal states reject with FailedPrecondition. Operators who need to
intervene mid-restore go through AdminTerminateEvent — that's the
recovery surface, not Update. The conservative policy keeps the
recompute-vs-freeze question (Open #13) out of v1: Update of
restore_batch_size persists the new value but does NOT recompute
effective_batch_size. The reconciler's restore-claim reads the
Start-time stamped value through to the next event.

Validation mirrors Start: restore_batch_interval_sec is gated by the
non-admin cap (admin sets the session-derived bypass), max_duration
must be > 0 and <= 7 days, restore_batch_size >= 0. Misconfigured
values surface as InvalidArgument or Forbidden — never as a DB CHECK
violation.

sqlc UPDATE uses COALESCE on nil params so a partial patch preserves
unset columns. The WHERE clause re-asserts state IN ('pending',
'active') as defense in depth: a race where the row advanced between
the service's pre-read and the UPDATE surfaces as
ErrCurtailmentUpdateStateRaceLoss → FailedPrecondition with a distinct
message from the pre-read rejection.

Refs #289
Adds the admin-only escape hatch for forcing a non-terminal event to
CANCELLED or FAILED when a normal stop+restore cycle can't run.

The persistence layer wraps the terminal state transition and the
swept-target update in a single transaction via db.WithTransaction so
the event row and its targets stay in sync. An idempotent re-issue
with the same target_state is a no-op; a different terminal state
surfaces ErrCurtailmentAdminTerminateStateConflict, which the service
maps to FailedPrecondition.

Service-layer defense-in-depth checks (target_state in {CANCELLED,
FAILED}, non-empty reason, org/uuid present) mirror the proto
validator so non-Connect callers can't tunnel past it.
Adds a pre-insert lookup so a re-issued Start with the same
idempotency_key or (external_source, external_reference) pair returns
the original event instead of re-running the selector and tripping
the partial unique indexes (which would surface as a less helpful
AlreadyExists from the per-org non-terminal constraint).

idempotency_key takes precedence over external reference: the
operator-supplied retry handle wins over upstream re-delivery.
Lookup errors propagate unchanged so a transient db failure is
visible rather than silently falling through to a double-insert
attempt.
Adds an AuditLogger interface on the curtailment Service with a no-op
default so tests that don't care can ignore the wiring. main.go injects
*activity.Service via WithAuditLogger. Two override-specific event types
ride alongside the base curtailment_started row so a feed of unbounded
or force-include starts is a simple event-type filter rather than a
metadata scan.

IncMaintenanceOverride fires in parallel when force_include_maintenance
is set, so the platform metrics dashboard tracks the override rate
without joining against activity_log.

Audit emission is intentionally absent on the idempotent-replay and
insufficient-load paths: the original Start already recorded the trail,
and a path that never persisted shouldn't claim it did.
Documents the curtailment_reconciler_heartbeat-based liveness signal:
warn at 2 minutes of staleness with active events present, page at 5
minutes regardless. The SQL form is canonical; the vmalert rule is
parked behind a placeholder bridge metric so the wiring is one config
change away once a postgres-exporter publishes the staleness gauge.

Runbook walks four failure modes (panic loop, slow-query contention,
events not picked up, restore-loop forever) with operator response
steps that lean on AdminTerminateEvent for the cases where infra
mitigation isn't enough.
Walks the operator-facing service flow end-to-end against the
in-memory fake: Preview (no persistence side-effects) → Start
(persistence + audit + metrics) → Stop (RESTORING transition) →
AdminTerminate (forced terminal). The reconciler's tick-by-tick
state machine is covered piecewise in reconciler_test.go and
restore_test.go; this test pins the boundary between the public
service API and the persistence layer.

Companion tests cover the webhook idempotency-replay path
(duplicate Start short-circuits, no double-audit) and the
read-path query (ListEvents returns terminal rows filtered by
state).

A docker-driven HTTP-level e2e for the same lifecycle is a
follow-up — the existing server/e2e dir requires postgres +
proto-sim and lands when the curtailment plugin path is ready.
Four real lint findings from this branch, fixed without suppressions:

- Service.AdminTerminate: replace the two-case switch + default with
  an if-comparison so exhaustive doesn't demand the unhandled cases
  be enumerated. The default branch was load-bearing — the if form
  keeps the same behavior with less surface.

- service_list_test.go / handler_list_test.go: hoist the opaque
  cursor literal into a file-scope const. gosec G101 looks at string
  literals assigned to fields whose name matches credential keywords
  (PageToken matches "token"); an identifier reference clears the
  heuristic cleanly.

- service_start_idempotency_test.go: move the subtest store + svc
  creation inside the t.Run closure so each subtest can call
  t.Parallel() without sharing mutable counters across cases.
A multi-reviewer code review surfaced four merge-blocking P1s and a
batch of P2/P3 hygiene items on this branch. This commit lands the
focused, defensible subset that doesn't require contract or
security-policy decisions; the remaining items are recorded for the
next session.

Idempotency / race-recovery:
- Recognize uq_curtailment_event_idempotency and
  uq_curtailment_event_external_ref unique violations as race-loss
  via typed sentinels; Service.Start re-issues the corresponding
  replay lookup so the race-loser falls into the same response path
  as a deliberate retry rather than surfacing Internal with the
  constraint name leaked in the error string.
- AdminTerminateEvent: on zero-row UPDATE caused by a concurrent
  terminate-to-same-state, re-read inside the transaction and echo
  the row idempotently (mirrors BeginRestoreTransition's pattern).

Audit / observability:
- Emit a curtailment_admin_terminated activity row on AdminTerminate
  so the privileged force-terminate path captures actor + reason in
  the activity feed (parallels emitStartAuditTrail).
- emitStartAuditTrail now maps req.SourceActorType to
  activitymodels.ActorType so scheduler-triggered starts persist
  actor_type='scheduler' on activity_log instead of defaulting to
  'user'.

Update path hardening:
- Reject explicit empty-string Reason as InvalidArgument and add a
  256-char length cap mirroring Start. The proto-translate comment
  is updated to describe the actual silent-no-op behavior.

Proto contract docs:
- Field-level docstrings on ListCurtailmentEvents describing the
  omitted target_rollup/targets and the trimmed decision_snapshot
  shape (skipped_aggregate vs raw skipped).
- max_len=1024 on ListCurtailmentEventsRequest.page_token so the
  cursor decode path is bounded.
- Annotate the two eventStateFromProto call sites distinguishing
  the no-filter sentinel role from the target_state mapping role.

Cleanup:
- Drop duplicate finitePtr generic in handler_start_test.go (the
  existing ptr generic in handler_test.go covers the use case).
- Inline single-call-site valueOrZero generic in service.go.
Codex security review + Copilot inline reviewers surfaced five
actionable findings on this branch. All five validated; landing the
fixes here.

Admin gate on Update.max_duration_seconds (HIGH).
  Mirrors Start's post-normalization admin check inside Service.Update.
  Without this, a non-admin who Started at the org default could Update
  the same event above the default, bypassing the privilege boundary
  Start enforces. Fetches org config lazily — only when max_duration is
  in the patch and the caller lacks admin controls.

AdminTerminate.reason length cap (MEDIUM).
  Service-level backstop rejects oversized reasons (>256 chars) so a
  bulky operator string can't amplify into every swept target's
  last_error column. The proto field gets the matching max_len=256
  rule; proto regen is deferred to a clean tooling pass (the service
  backstop already catches the case today).

List query trims decision_snapshot at the SQL boundary (MEDIUM).
  ListCurtailmentEventsForOrg now projects explicit columns with
  (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip
  list (multi-MB on 10K-miner events) doesn't ride the wire for every
  list row. Field layout matches CurtailmentEvent exactly so the
  existing convertEventRow path applies via a single struct
  conversion.

Cursor rejects non-positive IDs (MEDIUM).
  decodeCurtailmentEventCursor now returns InvalidArgument when the
  decoded id is <= 0. The store never emits a non-positive id; a
  user-supplied token that decodes to one would silently rewind to the
  first page (id=0) or return zero rows (id<0).

Audit metadata key naming (MEDIUM).
  Renamed `force_include` to `force_include_maintenance` on the
  curtailment_started audit row metadata so the key matches the
  domain/proto field name. Downstream analytics no longer have to map
  between abbreviated and full names.

Test coverage added for each fix: non-admin max_duration rejection,
admin pass-through, oversized reason rejection, cursor non-positive
id rejection (zero / negative / missing).
Harden curtailment replay, list, and reconciler behavior so the new read/admin APIs keep their documented contracts under retries and races.
Prevent stale reconciler ticks from dispatching commands after event state changes and require active admin-terminated events to stop first so restore work is queued. Reserve the removed admin terminate idempotency field to protect protobuf compatibility.
…tate race-loss

Two related reconciler-write-path issues from the third-iteration review:

eventStillDispatchable was called once per target inside dispatchOneCurtail
and dispatchRestoreBatch. For a 100-target event that's 100 redundant
GetEventByUUID lookups before any Curtail dispatch fires — the event row
doesn't change between iterations within a tick. Hoist the check to the
event level (after target list in dispatchPending, after enforceMaxDuration
in observeActive) and drop the per-target invocation.

UpdateCurtailmentEventState was sqlc :exec, which discards the rows-affected
count. When a concurrent AdminTerminate sweeps the event terminal, the
reconciler's maybeMarkActive / maybeCompleteRestoring UPDATE matched 0 rows,
the store returned nil, and the reconciler proceeded as if the transition
succeeded. Switch the query to :execrows; map 0-rows to a new typed
sentinel ErrCurtailmentEventStateRaceLoss; reconciler logs slog.Warn and
increments a new IncEventStateRaceLoss metric on the signal. Tick behavior
is unchanged beyond the new observability — making the reconciler abort
the tick on this signal is a separate design decision tracked in the
follow-up queue.

Also clarifies the comment on the list-query skipped-trim projection:
the win is application-tier bytes (network + JSON decode), not Postgres
disk I/O — TOAST detoast still occurs per row.
Mirrors TestHandler_AdminTerminateEvent_StateConflictMapsFailedPrecondition
for the ErrCurtailmentAdminTerminateActiveEvent path so the operator-visible
"Stop first" semantics have a regression guard at the gRPC code-mapping
layer (service-layer mapping was already covered).
…t guidance

The AdminTerminateEvent RPC now surfaces two distinct FailedPrecondition
variants that share a gRPC code (active-event-must-stop-first;
state-conflict). Expand the proto service comment to enumerate both,
matching StopCurtailment's existing pattern, and note reason's length
bounds.

The reconciler runbook's failure-mode #1 mitigation step previously
prescribed direct AdminTerminateEvent on a panicking event, which now
returns FailedPrecondition for ACTIVE events. Amend to call out the
StopCurtailment-first requirement; PENDING and RESTORING events can
still be terminated directly.

Proto-only doc edit; no .pb.go regen needed.
toEventProtoListItem previously zeroed ExternalSource / ExternalReference /
IdempotencyKey inline after toEventProto had set them. Move the scrub
into a named helper so the intent (these fields are list-view-omitted)
is explicit at the call site and a future fourth scrubbed field doesn't
need a comment to explain the pattern.

No behavior change.
A page_token issued before the cross-list-binding org_id field landed
used to reject with InvalidArgument once the binding-check went live.
A long-lived pagination loop crossing the deployment boundary should
restart from the first page transparently instead of surfacing an
opaque cursor error to the caller.

Decode now treats OrgID==0 as the "first page" sentinel and returns a
nil cursor (same as an empty token). State_filter mismatch and other
non-zero violations still reject. Test renamed to reflect the new
contract.
validateUpdateRequest now rejects requests where every patchable
optional field (reason, restore_batch_size, restore_batch_interval_sec,
max_duration_seconds) is nil. The SQL UPDATE still ran in that case
and bumped updated_at via COALESCE, producing a misleading freshness
signal for clients that track the column and adding write load with
no semantic change.

Affected service + handler tests that previously passed empty
UpdateRequests to reach state-guard / not-found branches now carry a
minimal valid patch (Reason). New TestService_Update_RejectsEmptyPatch
pins the new contract.
…lment-read-apis-audit-metrics

# Conflicts:
#	server/generated/sqlc/db.go
A concurrent AdminTerminate landing between target N and target N+1 of
the dispatch loop would let remaining Curtail commands fire against an
already-terminated event because the previous design checked liveness
once per event-tick.

Move the eventStillDispatchable call to the start of dispatchOneCurtail
and drop the redundant pre-loop hoists in dispatchPending and
observeActive. The cost is one GetEventByUUID per target — N reads per
tick for an N-target event — which is acceptable; AdminTerminate is rare
and the perf optimization was incorrect in claim (the event row CAN
change inside a tick) and incorrect in consequence.

dispatchRestoreBatch is unaffected: its single bulk Uncurtail is one
command per batch, so the existing per-event check already guards it.

Adds TestReconciler_SkipsRemainingCurtailDispatchesWhenEventTerminatesMidLoop
asserting only the first target dispatches when the event flips
between target 1 and target 2.
The previous gate rejected admin-terminate only when event state was
ACTIVE. A PENDING event whose reconciler tick had already dispatched
some Curtail commands had targets in DISPATCHED/CONFIRMED/DRIFTED, and
admin-terminate would proceed — sweeping those targets to RESTORE_FAILED
without issuing the compensating Uncurtail commands, leaving the
already-curtailed miners stuck.

Replace the state==active check with a SQL existence check on
non-restored target states. The new check subsumes ACTIVE (which always
has CONFIRMED targets via maybeMarkActive) and additionally catches
PENDING events with dispatched targets.

The ErrCurtailmentAdminTerminateActiveEvent sentinel keeps its name for
backward compatibility with handler/test references, but its doc and
operator-facing error message now describe the broader condition. Proto
RPC comment and the reconciler runbook's mitigation guidance updated to
match.
RBAC PR 1 landed migration 000052_create_permission_tables on main and
the post-merge commit had two files claiming version 52 — the duplicate
trips golang-migrate's version uniqueness check at server boot.

The list-index migration only ever lived on this feature branch, so
the migration-immutability rule (which gates edits to migrations on
main) does not apply. Renaming to the next free slot (000055, after
the RBAC track's 000052/000053/000054) is the right resolution.

Content unchanged; only the file numbers move.
…NTLY on list index

Two BE-5 merge gates landed together.

force_include_maintenance is safety-critical: it commands curtailment
on miners in active physical maintenance — forcibly power-cycling a
miner a technician is servicing is a personnel hazard. The BE-1.x
design intended Admin-only; the gate was never wired and any API-key
caller could trip it. Wire requireAdminFromContext mirroring the
allow_unbounded pattern; add table coverage in
TestHandler_OverrideFieldsRoleGate.

Migration 000055's index now builds with CONCURRENTLY. The earlier
"annotation needed" framing turned out to be wrong — golang-migrate
v4's postgres driver runs ExecContext directly without wrapping the
migration body in a transaction, so CONCURRENTLY works without any
no-transaction annotation. Cheap fix; a hard merge gate at
high-row-count deploys.
…l / Uncurtail

Closes the AdminTerminate residual race named in Known Limitation #10.
Before this change, a tick that read PENDING targets, called
cmd.Curtail, then lost a foot race to a concurrent AdminTerminate left
miners curtailed with no compensating Uncurtail — the sweep flipped
the just-dispatched-to targets to RESTORE_FAILED while the command
landed against a dead event. Blast radius scaled with event size:
thousands of stranded curtailments on a 5K-miner mid-dispatch event.

The fix is a two-phase write. dispatchOneCurtail and
dispatchRestoreBatch now stamp DISPATCHING on each target before
issuing the command, and transition to DISPATCHED only after the
command returns. The in-flight gate's SQL EXISTS predicate
(CurtailmentEventHasInFlightTargets) counts dispatching rows
alongside dispatched/confirmed/drifted. A concurrent terminate that
races a mid-dispatch tick observes the dispatching row and rejects as
Stop-first, so the command cannot fire against a swept event.

last_dispatched_at intentionally lands on the DISPATCHED write, not
the DISPATCHING pre-write — it records successful enqueue, used by
the restore-batch interval gate. Filter-skipped / empty-batch failures
roll back via recordDispatchFailure without leaving a misleading
timestamp.

New proto value CURTAILMENT_TARGET_STATE_DISPATCHING = 8. New Go
constant TargetStateDispatching ("dispatching"). Regression coverage
in TestReconciler_DispatchingPreWrite_CommitsBeforeCommand via a
curtailHook on the fake dispatcher that inspects store state at the
moment cmd.Curtail is called.
Start dispatch wrote each target with a separate InsertCurtailmentTarget
round-trip inside the event transaction. For a 10K-miner curtailment
that's 10K serial parameterized exec calls — easily the dominant cost of
a Start RPC even with prepared-statement reuse, and a steady source of
transaction holding time that conflicts with the reconciler tick.

Switch to a single BulkInsertCurtailmentTargets call that fans the rows
out server-side via jsonb_to_recordset. The Go side marshals the
per-target inputs into one JSONB array; Postgres expands it into the
typed columns (NUMERIC for baseline_power_w, JSONB for the selector
rationale, missing keys → NULL). Returns rows affected so the store can
pin (inserted == len(targets)) and fail the transaction loudly if
jsonb_to_recordset silently drops a row that fails its column cast.

Drops the unused single-row InsertCurtailmentTarget query and the
rawMessageOrNullable helper that only it consumed. The JSON payload
shape is pinned by a unit test so a wrong field tag or nil-handling
regression breaks loud at unit-time rather than via a docker-compose
integration failure.
Multiple corrections surfaced by the review pass against this PR.

Runtime:

- SweepCurtailmentTargetsToRestoreFailed wrote to a non-existent
  `updated_at` column on `curtailment_target`. Every AdminTerminate
  against a non-terminal event with at least one in-scope target tripped
  "column does not exist" and rolled back the entire termination,
  leaving the admin recovery escape hatch unusable. Drop the column
  from the UPDATE; the other target writes already don't reference it.

- recordDispatchFailure now falls back to a new
  BumpCurtailmentTargetRetry query when the rich state-change UPDATE
  fails non-race-loss. Persistent state-write failures no longer block
  retry-budget progress; MaxRetries → RESTORE_FAILED escalation lands
  on the next successful UpdateTargetState. New restore-path test pins
  the fallback contract.

- Restore the "heartbeat advances on tick freshness, not query health"
  semantic when ListNonTerminalEvents fails. The SQL staleness alert
  distinguishes reconciler-dead (no upsert) from DB-read-degraded
  (upsert advances, IncTickFailure rises). Test renamed to match.

Contract:

- UpdateCurtailmentEventRequest gains buf-validators mirroring Start
  (reason min/max_len, restore_batch_size lte:10000,
  restore_batch_interval_sec lte:3600, max_duration_seconds lte:604800).
  Previously a client could patch a non-terminal event with bounds
  Start would reject.

- AdminTerminate proto docstring documents the RESTORING-event race:
  the in-flight gate scopes to desired_state=CURTAILED, so a terminate
  during restore proceeds even with in-flight Uncurtails. Race-window
  targets persist as RESTORE_FAILED while the device may be restored.

- ListCurtailmentEvents proto docstring documents the trigger-metadata
  scrub on list rows; clients correlating webhook deliveries fetch via
  GetActiveCurtailment for unscrubbed attribution.
- Migrations 000056/000058 add operator-recovery notes for the
  CONCURRENTLY-split failure mode (paired CREATE in 000057/000059
  fails after the DROP lands: schema_migrations stays dirty with the
  unique index gone; resolve duplicates, force the version, retry).

- Replace dangling `Open #13` references with the recompute-vs-freeze
  reason inline so future readers don't chase a missing tracker link.
Copilot AI review requested due to automatic review settings May 25, 2026 18:37
@rongxin-liu rongxin-liu requested a review from a team as a code owner May 25, 2026 18:37
@github-actions github-actions Bot added javascript Pull requests that update javascript code client server shared labels May 25, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes the remaining v1 curtailment operator/admin RPC surface (list/update/admin-terminate) and adds supporting persistence, audit, metrics, and reliability hardening needed for production observability and safe concurrency.

Changes:

  • Added operator RPC implementations for listing event history and updating operator-safe fields, plus an admin terminate RPC (service/handler/store + tests).
  • Introduced cursor-based pagination for event history and persistence-layer idempotency lookups (idempotency key + external reference) with migrations to scope uniqueness to non-terminal events.
  • Added curtailment audit + metrics interfaces (no-op defaults) and wired them into fleetd, along with expanded unit and E2E coverage.

Reviewed changes

Copilot reviewed 44 out of 50 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
server/sqlc/queries/curtailment.sql Adds queries for list pagination, idempotency lookups, admin terminate, bulk target insert, and race-loss guards.
server/migrations/000055_add_curtailment_event_list_index.up.sql Adds (org_id, id DESC) index to support list pagination.
server/migrations/000055_add_curtailment_event_list_index.down.sql Drops the list pagination index.
server/migrations/000056_scope_curtailment_idempotency_to_non_terminal.up.sql Drops prior idempotency unique index (prep for scoped recreate).
server/migrations/000056_scope_curtailment_idempotency_to_non_terminal.down.sql Recreates the unscoped idempotency unique index (rollback).
server/migrations/000057_scope_curtailment_idempotency_to_non_terminal.up.sql Recreates idempotency unique index scoped to non-terminal events.
server/migrations/000057_scope_curtailment_idempotency_to_non_terminal.down.sql Drops scoped idempotency unique index (rollback).
server/migrations/000058_scope_curtailment_external_ref_to_non_terminal.up.sql Drops prior external-ref unique index (prep for scoped recreate).
server/migrations/000058_scope_curtailment_external_ref_to_non_terminal.down.sql Recreates unscoped external-ref unique index (rollback).
server/migrations/000059_scope_curtailment_external_ref_to_non_terminal.up.sql Recreates external-ref unique index scoped to non-terminal events.
server/migrations/000059_scope_curtailment_external_ref_to_non_terminal.down.sql Drops scoped external-ref unique index (rollback).
server/internal/handlers/curtailment/translate.go Adds request/response translation for list/update/admin-terminate and list-view scrubbing.
server/internal/handlers/curtailment/handler.go Implements Update/List/AdminTerminate handlers and extends Start admin gating.
server/internal/handlers/curtailment/handler_update_test.go Handler-level tests for UpdateCurtailmentEvent (auth, validation, role gates, response shape).
server/internal/handlers/curtailment/handler_test.go Updates validation expectations and extends override-field role gating tests.
server/internal/handlers/curtailment/handler_stop_test.go Updates stub store surface for new store interface methods.
server/internal/handlers/curtailment/handler_start_test.go Extends Start handler tests for idempotent replay response rendering.
server/internal/handlers/curtailment/handler_list_test.go Adds handler tests for ListCurtailmentEvents behavior and response scrubbing.
server/internal/handlers/curtailment/handler_admin_terminate_test.go Adds handler tests for AdminTerminateEvent role gate and error mapping.
server/internal/domain/stores/sqlstores/curtailment_test.go Pins JSON payload shape for bulk target insert into Postgres recordset.
server/internal/domain/stores/sqlstores/curtailment_cursor.go Implements cursor codec for event-history pagination.
server/internal/domain/stores/sqlstores/curtailment_cursor_test.go Tests cursor codec round-trip and invalid-token rejection behavior.
server/internal/domain/stores/interfaces/curtailment.go Extends store interface: list/update/admin-terminate/idempotency/race-loss signaling.
server/internal/domain/curtailment/service_test.go Expands fake store + introduces goroutine-safe metrics recorder for assertions.
server/internal/domain/curtailment/service_start_test.go Adds Start guard + metrics emission test coverage (force maintenance, exclusion metrics).
server/internal/domain/curtailment/service_start_idempotency_test.go Adds comprehensive idempotency replay + race-loser behavior tests.
server/internal/domain/curtailment/service_start_audit_test.go Adds audit emission tests and audit-write-failure metric assertions.
server/internal/domain/curtailment/service_list_test.go Adds service-level tests for ListEvents parameter forwarding and validation.
server/internal/domain/curtailment/service_lifecycle_test.go Adds end-to-end lifecycle tests against the in-memory fake (Preview/Start/Stop/AdminTerminate/List).
server/internal/domain/curtailment/service_admin_terminate_test.go Adds service-level validation + error mapping tests for AdminTerminate.
server/internal/domain/curtailment/selector.go Extends Plan to support replay rendering from persisted event + targets.
server/internal/domain/curtailment/reconciler/restore_test.go Expands restore-phase reconciler tests for retry-budget fallback and failure modes.
server/internal/domain/curtailment/models/models.go Adds DISPATCHING target state and clarifies candidate documentation.
server/internal/domain/curtailment/metrics.go Introduces Metrics interface with a default no-op implementation.
server/internal/domain/curtailment/audit.go Introduces AuditLogger interface + activity type constants + no-op logger.
server/generated/sqlc/db.go Auto-generated sqlc query preparation updates for new/renamed queries.
server/generated/grpc/curtailment/v1/curtailmentv1connect/curtailment.connect.go Auto-generated Connect client/handler docs updated for new RPC semantics.
server/e2e/curtailment_e2e_test.go Adds E2E coverage for full curtailment lifecycle and reconciler restart safety.
server/cmd/fleetd/main.go Wires curtailment handler, audit logger, and metrics recorder into the server.
proto/curtailment/v1/curtailment.proto Evolves proto contract: new DISPATCHING state, request bounds, list semantics, admin terminate docs.
client/src/protoFleet/features/energy/CurtailmentHistory.tsx Minor refactor: extracted sort-time helper and Set-based status filtering.
.gitignore Ignores docs/reviews/ tooling artifacts.

Comment thread server/internal/domain/stores/sqlstores/curtailment_cursor.go
@github-actions

github-actions Bot commented May 25, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (16121bbbfdbed395396ff06a1b1e46ee72d27764...ad191066f385fbe2abd7cde647ac9eac7e20458d, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Terminal idempotency keys can replay as fresh curtailments

  • Category: Reliability
  • Location: server/migrations/000057_scope_curtailment_idempotency_to_non_terminal.up.sql:1
  • Description: The new partial unique index only covers pending, active, and restoring events. The replay lookups mirror that by ignoring terminal events. A duplicate webhook/API retry with the same idempotency_key or (external_source, external_reference) after the original event completes will no longer replay the original row; it can start a new curtailment.
  • Impact: Late retries or replayed external signals can repeatedly curtail miners after the operator believes the event is over, causing avoidable hashrate and revenue loss.
  • Recommendation: Keep dedupe keys valid beyond terminal state, at least for a retention window, and return the prior terminal event or a non-retryable conflict. If key reuse is required for a new curtailment, require a new explicit operation/version key instead of releasing the same idempotency key immediately at terminal state.

[MEDIUM] Updating restore_batch_size does not update the value the restorer uses

  • Category: Reliability
  • Location: server/internal/domain/curtailment/service.go:301
  • Description: UpdateCurtailmentEvent accepts restore_batch_size, but the restorer uses EffectiveBatchSize, which is stamped at Start and is not recomputed or patched by the update path. The SQL update changes restore_batch_size only, so the API can report a successful update that has no effect on restore batching.
  • Impact: Operators may believe they slowed down or sped up restore batching, while the reconciler continues using the old effective batch size. This can prolong restoration unexpectedly or preserve an unsafe inrush profile.
  • Recommendation: Either remove restore_batch_size from the update surface, or recompute and persist effective_batch_size when restore_batch_size changes while the event is still pending or active.

Notes

No pool URL, wallet, stratum, or worker credential changes were present in the reviewed diff.


Generated by Codex Security Review |
Triggered by: @rongxin-liu |
Review workflow run

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d6df608f07

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65678d95d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go
…issions

Codex Security Review flagged these as accessible to any authenticated
org member: the handlers only read session.GetInfo and never invoked
middleware.RequirePermission. List/Update had no role gate at all;
AdminTerminate only had a session-only admin-role gate.

Each handler now calls RequirePermission against the appropriate
permission key — curtailment:read for List, curtailment:manage for
Update + AdminTerminate. AdminTerminate keeps requireAdminFromContext
as defense-in-depth so legacy seeds that have not yet had their
curtailment:manage assignment audited still fail closed.

The test helpers wire EffectivePermissions onto the request context so
the gate fires; new denial cases prove a caller without the permission
is rejected before any store work.
The PR description promises that pre-org-binding page tokens
transparently restart from the first page so an in-flight pagination
loop spanning a deployment doesn't surface InvalidArgument. The decoder
was rejecting any cursor with org_id <= 0, which contradicted that
contract.

Split the OrgID check into two cases: explicit negative is still
tampering and surfaces InvalidArgument, but zero/missing org_id decodes
to a nil cursor so ListEvents falls through to page 1. The cross-tenant
guard at the call site is unaffected — a non-nil cursor still must
match the caller's org_id.
…atch

The previous claim loop accepted both PENDING and DISPATCHING into the
same batch up to effective_batch_size. After a reconciler restart with
DISPATCHING orphans (interrupted prior tick) coexisting with fresh
PENDING, one batch could carry both — doubling the inrush and bypassing
the one-batch-per-interval throttle the operator relies on.

The claim phase now runs in two passes. If any DISPATCHING orphans
exist, the wave redispatches only those (Uncurtail is device-idempotent,
so re-sending is safe) and returns; fresh PENDING is held for the next
tick. Throttling stays correct because orphan recovery consumes the
tick's batch slot on its own.
When recordDispatchFailure's write fails non-race-loss, the BumpTargetRetry
fallback advances retry_count without changing state. After enough
bumps, a target sat in DISPATCHING (or Drifted) with retry_count past
MaxRetries — and observeActive's `continue` arm silently skipped it
every subsequent tick, so the row never reached a terminal state.

Both arms now route exhausted targets through recordDispatchFailure
itself. recordDispatchFailure already escalates to RESTORE_FAILED at
the MaxRetries threshold via writeTargetState; if that write fails too,
the existing fallback at least keeps the IncTargetWriteFailure counter
firing so an operator alert surfaces.

The Drifted arm carries the same bug as DISPATCHING (same fallback
path, same observeActive skip). Applied symmetrically.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb81b73824

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go
…ail path

The Curtail path's DISPATCHING pre-write only logged and returned when
writeTargetState failed non-race-loss; retry_count was never bumped.
A row-specific persistent write failure (one row hitting a lock or a
transient column error while reads succeed) would stall the event
indefinitely because the target never advanced toward RESTORE_FAILED
and maybeMarkActive could not promote the event.

Route the failure through recordDispatchFailure with nonTerminalFailureState
as the residual state — matches the restore path's symmetric handler at
dispatchRestoreBatch. recordDispatchFailure's BumpTargetRetry fallback
keeps the budget advancing even when the rich state write also fails.
@rongxin-liu rongxin-liu merged commit aef2113 into main May 26, 2026
71 checks passed
@rongxin-liu rongxin-liu deleted the feat/issue-289-curtailment-read-apis-audit-metrics branch May 26, 2026 18:42
mcharles-square added a commit that referenced this pull request May 26, 2026
Rebase onto main picked up #308 / #311, which added a defense-in-depth
requireAdminFromContext call ahead of the curtailment:manage gate.
PR 2c's design moves RBAC to the authoritative check, so drop the
legacy role gate and reorder so RequirePermission runs before the
nil-service check (auth-before-implementation; preserves Unauthenticated
on missing session instead of leaking Unimplemented).

The TestHandler_AdminTerminateEvent_RejectsNonAdmin case asserted the
legacy role gate's defense-in-depth behavior — it's redundant with the
renamed TestHandler_AdminTerminateEvent_RejectsCallerWithoutCurtailmentManage
which already covers the RBAC denial path. Drop the obsolete test.
mcharles-square added a commit that referenced this pull request May 27, 2026
Rebase onto main picked up #308 / #311, which added a defense-in-depth
requireAdminFromContext call ahead of the curtailment:manage gate.
PR 2c's design moves RBAC to the authoritative check, so drop the
legacy role gate and reorder so RequirePermission runs before the
nil-service check (auth-before-implementation; preserves Unauthenticated
on missing session instead of leaking Unimplemented).

The TestHandler_AdminTerminateEvent_RejectsNonAdmin case asserted the
legacy role gate's defense-in-depth behavior — it's redundant with the
renamed TestHandler_AdminTerminateEvent_RejectsCallerWithoutCurtailmentManage
which already covers the RBAC denial path. Drop the obsolete test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(curtailment): operator read APIs + admin terminate + audit + metrics interface + E2E

3 participants