Skip to content

feat(curtailment): operator read/update/admin APIs + audit + metrics#299

Closed
rongxin-liu wants to merge 61 commits into
mainfrom
feat/issue-289-curtailment-read-apis-audit-metrics
Closed

feat(curtailment): operator read/update/admin APIs + audit + metrics#299
rongxin-liu wants to merge 61 commits into
mainfrom
feat/issue-289-curtailment-read-apis-audit-metrics

Conversation

@rongxin-liu

@rongxin-liu rongxin-liu commented May 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes the operator-facing surface and observability scaffolding for v1 curtailment. Builds on the lifecycle and dispatch work already on main (preview + start + dispatch + reconciler in #192, stop + staggered restore + max-duration enforcement in #232).

Operator read / update / admin

  • ListCurtailmentEvents — cursor-paginated history. The decision snapshot is trimmed at the SQL boundary so the response stays bounded on large fleet events: the SQL projection strips the per-device skipped array and computes the skipped_aggregate reason→count map inline, and per-target rows are intentionally omitted (consumers paginate over events here and fetch per-event detail separately). The cursor token carries org_id and state_filter alongside the row id so a cursor cannot cross tenants or state filters; legacy tokens without org_id transparently restart from the first page so a pagination loop crossing the deployment boundary does not surface a confusing error. Migration 000055 adds the supporting (org_id, id DESC) index using CREATE INDEX CONCURRENTLY (with the migrate-tool no-transaction annotation) so the build does not lock the table on high-row-count deploys.
  • UpdateCurtailmentEvent — operator-safe fields only: reason, restore_batch_size, restore_batch_interval_sec, max_duration_seconds. The service rejects empty patches (a request with no patchable field set would still bump updated_at via COALESCE, producing a misleading freshness signal). The same admin gate as Start applies to max_duration_seconds: non-admin callers cannot raise the cap above the org default. Reason carries the same length bound as Start (256 chars). Race between the pre-read and the UPDATE surfaces as a typed FailedPrecondition rather than silently no-op'ing.
  • AdminTerminateEvent body. Forces a non-terminal event to CANCELLED or FAILED and sweeps every non-terminal target to RESTORE_FAILED in the same transaction. The validator restricts target_state to those two; COMPLETED is rejected because the RPC fires when restore did not actually run. The Stop-first gate now triggers on any in-flight target — DISPATCHING, DISPATCHED, CONFIRMED, or DRIFTED — not only on ACTIVE events, so a pending event whose reconciler tick already issued curtail commands cannot be sliced out from under those commands without compensating Uncurtails. Idempotent re-issue against the same target state echoes the row without re-running the transition or sweep, and suppresses audit emission so audit consumers tracking operator action history do not see a phantom action; a different terminal state surfaces FailedPrecondition with a distinct message. The reason field carries a 256-char cap so a bulky operator string cannot amplify across thousands of target rows in the sweep.

Webhook ingestion idempotency

Pre-insert lookup at the persistence boundary on (org_id, idempotency_key) first, then (org_id, external_source, external_reference). A redelivery returns the original event without re-running selection — including its persisted target list and state — so retry callers do not see a synthesized PENDING response for a terminated event. The race-loser path (two concurrent first-time Starts past the lookup) falls into the same replay branch as a deliberate retry rather than surfacing Internal with the Postgres constraint name leaked in the error string.

Audit trail

Every successful Start emits a curtailment_started activity row. When allow_unbounded or force_include_maintenance is set, a typed row (curtailment_unbounded_start / curtailment_force_include_maintenance) emits alongside the base — two rows rather than one with a flag, so a feed of override-class starts is a simple event-type filter rather than a metadata scan. The audit metadata key matches the proto field name (force_include_maintenance, not the abbreviated force_include). IncMaintenanceOverride fires in parallel so the override rate surfaces on the platform metrics dashboard without joining against activity_log. AdminTerminateEvent emits its own activity row capturing actor + reason, but suppresses emission on idempotent replays — a duplicate curtailment_admin_terminated row for a no-op echo would mislead consumers. The audit ActorType reflects source_actor_type (scheduler / user / api_key) rather than defaulting to user.

Reconciler state-guard

Every reconciler dispatch (Curtail on pending targets, Uncurtail on restoring batches) re-reads the event immediately before the command issues so a tick that read its event list before a concurrent AdminTerminateEvent does not dispatch commands against a now-terminated event. The check is hoisted to the per-event level so a 100-target event pays one DB read per tick, not 100. Targets are stamped DISPATCHING before cmd.Curtail / cmd.Uncurtail so the row is visible to a concurrent terminate's in-flight gate during the command window; a tick interrupted between the pre-write and the post-command transition leaves a DISPATCHING orphan, and the next tick redispatches it via the normal pending-target loop (Curtail / Uncurtail are device-idempotent). UpdateCurtailmentEventState and UpdateCurtailmentTargetState are both :execrows and surface zero-rows-affected as typed sentinels (ErrCurtailmentEventStateRaceLoss / ErrCurtailmentUpdateTargetStateRaceLoss); the reconciler logs the signal and increments a dedicated counter rather than silently treating the race-loss as a successful transition.

Metrics interface

A reconciler.Metrics interface inside the curtailment domain with tick-duration, tick-failure, candidate-exclusion (labeled by reason), maintenance-override, and event-state-race-loss recorders. The default is a no-op; the concrete implementation wires at cmd/fleetd/main.go once the platform observability path lands. Interface shape is stable enough that the swap is a one-file change with no curtailment-package churn.

Heartbeat staleness runbook

The 5-minute staleness signal is canonically a SQL check against the curtailment_reconciler_heartbeat row, not an application metric — the runbook documents the SQL form and walks four failure modes (panic loop, slow-query contention, events not picked up, restore loop). Operator response steps lean on AdminTerminateEvent for the cases where infrastructure mitigation isn't enough; the runbook calls out the Stop-first requirement so an operator does not hit FailedPrecondition trying to terminate an active event directly.

Proto contract evolution

  • AdminTerminateEventRequest.idempotency_key (field 4) is removed for v1; tag and field name are reserved so the slot cannot accidentally be reused. AdminTerminate idempotency is state-based — a re-issue against the same target state echoes the row.
  • AdminTerminateEventRequest.reason now carries min_len = 1, max_len = 256; the service mirrors the cap as defense in depth.
  • ListCurtailmentEventsRequest.page_token carries max_len = 1024 so the base64+JSON decode path is bounded.
  • CurtailmentTargetState adds DISPATCHING (enum value 8) to model the transient stamp between the pre-command write and the post-command state — the value is part of the read-back contract for any consumer paging over targets.
  • AdminTerminateEvent and ListCurtailmentEvents RPC doc comments enumerate the FailedPrecondition variants and the trimmed response-shape contract respectively, matching the convention StopCurtailment already established.

Follow-up

A few items are intentionally outside this PR's scope, captured for the next iteration:

  • AdminTerminate lacks a force flag for cases where Stop also fails (DB outage, target adapter unreachable). Adding the escape hatch is a contract decision that pairs with a runbook entry describing the abandonment semantics.
  • The per-event liveness check covers dispatch paths; the confirmation and drift-detection paths bail on the typed target-state race-loss sentinel but the event-state promotion path still relies on the SQL-level EXISTS guard alone. A unified write-path bail on the typed sentinel is a follow-up.
  • A scoped API key (least-privilege replacement for the current admin-API-key blanket allow) lands separately; allow_unbounded and the operator override fields are intentionally API-key-reachable until that surface exists.

Test plan

Service-layer unit tests cover ListCurtailmentEvents, UpdateCurtailmentEvent, and AdminTerminateEvent end to end — happy path, state-machine guards, admin gating, empty-patch rejection, race-loss handling, the broadened in-flight-targets requirement, and the audit-suppression / audit-emission split across idempotent-replay and real-transition arms.

Idempotency-replay tests cover both Start channels (key, external-source/reference) including precedence ordering, partial-fields handling, lookup error propagation, persisted-payload return on replay, and the unique-violation race-loser path through the constraint-name sentinel routing.

Audit-emission tests pin the base row + override-specific rows under expected conditions, that the source actor type maps correctly, and that AdminTerminate suppresses emission on idempotent echoes. The lifecycle test pins Preview → Start → Stop → AdminTerminate persistence + emission.

Reconciler tests cover the per-event state-guard skip path on Curtail and restore dispatch, the DISPATCHING pre-command stamp visible during the command window, orphan recovery on the next tick for interrupted curtail and restore dispatches, and the typed race-loss signals on event-state and target-state updates.

Handler-level tests for each new RPC cover session resolution, role gates, malformed UUID rejection, proto/service translation, and the FailedPrecondition error-code mapping for both AdminTerminate variants. Cursor codec tests cover round-trip, malformed-base64 rejection, missing-org-id legacy restart, state-filter mismatch rejection, and non-positive id rejection.

A docker-driven HTTP-level E2E in server/e2e/ exercises the lifecycle path against a real Postgres + reconciler tick loop, including the reconciler-restart recovery path that re-picks up a DISPATCHING orphan after a process restart.

go build ./... clean; curtailment domain + handler + cursor test suites green; lint clean on the changed scope (pre-existing repo-wide lint debt unrelated to this branch).

Closes #289

Curtailment needs operational metrics — tick duration, tick failures,
selector candidate exclusions, maintenance overrides — but the codebase
has only OTel tracing today (no Meter, no /metrics, no Prometheus
exporter). The pipeline-shape decision is platform-team scope and
already in flight via the notifications + Grafana migration; curtailment
shouldn't make that decision unilaterally and shouldn't block on it.

Define a Metrics interface in the curtailment domain with a no-op
default. Service and Reconciler accept it through a functional option
so the dozens of existing test call sites (NewService(store), New(cfg,
store, cmd)) keep working unchanged. main.go wires NoOpMetrics through
both constructors so production has a single named site to swap when
the platform observability path lands — interface-stable, one-file
change, no curtailment-package churn.

Recorder call sites land in follow-up commits.

Refs #289
Three of the four Metrics recorders now have call sites:

- ObserveTickDuration fires from safeTick around runTick, capturing
  wall-clock per tick on every path (happy, panic-recovered,
  list-events-failure).
- IncTickFailure fires from safeTick on tick-infra panic AND from
  processEvent on per-event panic. The list-events early-return path
  is intentionally NOT counted because the heartbeat still advances
  there ("freshness, not query health" — see the comment in runTick).
- IncCandidateExcluded fires from Service.Start (not Preview) after the
  selector returns, once per skipped device labeled by reason. Start-
  only emission keeps debounced Preview calls from flooding the counter
  against a static fleet snapshot.

IncMaintenanceOverride is intentionally deferred. The per-miner
increment needs the selector to surface "this miner was kept because
the maintenance override was honored" — current candidate filtering
just lets the miner fall through without tagging. That instrumentation
lands in a follow-up commit alongside the audit-sweep work where
`curtailment_maintenance_override` activity rows are emitted on the
same code path.

Tests add a goroutine-safe recordingMetrics fake in both the
reconciler and service test files. Three reconciler tests pin
ObserveTickDuration on the happy path, IncTickFailure on tick-infra
panic, and IncTickFailure on per-event panic. One service test pins
IncCandidateExcluded on a phantom-load miner.

Refs #289
Operator-facing event history was previously Unimplemented and the
settings-page history table (PR #280) was reading fixtures. This wires
the RPC through every layer with a trimmed decision-snapshot policy
that keeps response sizes bounded on large fleets.

- sqlc: ListCurtailmentEventsForOrg, cursor-paginated by id DESC with
  an optional state filter. Caller passes limit+1 so the over-fetch
  detects whether another page remains.
- Store: opaque cursor (base64-encoded JSON) so the token shape is
  free to grow later (sort fields, secondary keys) without breaking
  older clients. PageSize <=0 maps to a 50-row default; an internal
  upper cap of 200 mirrors the proto validator as defense in depth.
- Service: ListEvents validates org and rejects negative page_size,
  then forwards to the store. Service-layer guard is needed because
  cross-tenant exposure is one query away.
- Handler: replaces the Unimplemented stub. Session-based org-id
  resolution, proto enum → service-layer state-filter mapping.
- Translate: list-view event proto omits per-target rows (heavy on
  10K-miner events × N pages) and trims the per-device `skipped`
  array to `skipped_aggregate` reason-count map. Top-K selected and
  the summary fields stay intact so dashboards can render exclusion
  trend lines.

Test fakes in three packages gain ListEvents stubs; the
curtailment-package fakeStore gains a working pagination impl mirroring
the SQL semantics so service-level tests can assert cursor round-trips.

Refs #289
Operator-safe partial update of a non-terminal event. Replaces the
Unimplemented stub on the handler.

State policy: pending and active accept the patch; restoring and
terminal states reject with FailedPrecondition. Operators who need to
intervene mid-restore go through AdminTerminateEvent — that's the
recovery surface, not Update. The conservative policy keeps the
recompute-vs-freeze question (Open #13) out of v1: Update of
restore_batch_size persists the new value but does NOT recompute
effective_batch_size. The reconciler's restore-claim reads the
Start-time stamped value through to the next event.

Validation mirrors Start: restore_batch_interval_sec is gated by the
non-admin cap (admin sets the session-derived bypass), max_duration
must be > 0 and <= 7 days, restore_batch_size >= 0. Misconfigured
values surface as InvalidArgument or Forbidden — never as a DB CHECK
violation.

sqlc UPDATE uses COALESCE on nil params so a partial patch preserves
unset columns. The WHERE clause re-asserts state IN ('pending',
'active') as defense in depth: a race where the row advanced between
the service's pre-read and the UPDATE surfaces as
ErrCurtailmentUpdateStateRaceLoss → FailedPrecondition with a distinct
message from the pre-read rejection.

Refs #289
Adds the admin-only escape hatch for forcing a non-terminal event to
CANCELLED or FAILED when a normal stop+restore cycle can't run.

The persistence layer wraps the terminal state transition and the
swept-target update in a single transaction via db.WithTransaction so
the event row and its targets stay in sync. An idempotent re-issue
with the same target_state is a no-op; a different terminal state
surfaces ErrCurtailmentAdminTerminateStateConflict, which the service
maps to FailedPrecondition.

Service-layer defense-in-depth checks (target_state in {CANCELLED,
FAILED}, non-empty reason, org/uuid present) mirror the proto
validator so non-Connect callers can't tunnel past it.
Adds a pre-insert lookup so a re-issued Start with the same
idempotency_key or (external_source, external_reference) pair returns
the original event instead of re-running the selector and tripping
the partial unique indexes (which would surface as a less helpful
AlreadyExists from the per-org non-terminal constraint).

idempotency_key takes precedence over external reference: the
operator-supplied retry handle wins over upstream re-delivery.
Lookup errors propagate unchanged so a transient db failure is
visible rather than silently falling through to a double-insert
attempt.
Adds an AuditLogger interface on the curtailment Service with a no-op
default so tests that don't care can ignore the wiring. main.go injects
*activity.Service via WithAuditLogger. Two override-specific event types
ride alongside the base curtailment_started row so a feed of unbounded
or force-include starts is a simple event-type filter rather than a
metadata scan.

IncMaintenanceOverride fires in parallel when force_include_maintenance
is set, so the platform metrics dashboard tracks the override rate
without joining against activity_log.

Audit emission is intentionally absent on the idempotent-replay and
insufficient-load paths: the original Start already recorded the trail,
and a path that never persisted shouldn't claim it did.
Documents the curtailment_reconciler_heartbeat-based liveness signal:
warn at 2 minutes of staleness with active events present, page at 5
minutes regardless. The SQL form is canonical; the vmalert rule is
parked behind a placeholder bridge metric so the wiring is one config
change away once a postgres-exporter publishes the staleness gauge.

Runbook walks four failure modes (panic loop, slow-query contention,
events not picked up, restore-loop forever) with operator response
steps that lean on AdminTerminateEvent for the cases where infra
mitigation isn't enough.
Walks the operator-facing service flow end-to-end against the
in-memory fake: Preview (no persistence side-effects) → Start
(persistence + audit + metrics) → Stop (RESTORING transition) →
AdminTerminate (forced terminal). The reconciler's tick-by-tick
state machine is covered piecewise in reconciler_test.go and
restore_test.go; this test pins the boundary between the public
service API and the persistence layer.

Companion tests cover the webhook idempotency-replay path
(duplicate Start short-circuits, no double-audit) and the
read-path query (ListEvents returns terminal rows filtered by
state).

A docker-driven HTTP-level e2e for the same lifecycle is a
follow-up — the existing server/e2e dir requires postgres +
proto-sim and lands when the curtailment plugin path is ready.
Four real lint findings from this branch, fixed without suppressions:

- Service.AdminTerminate: replace the two-case switch + default with
  an if-comparison so exhaustive doesn't demand the unhandled cases
  be enumerated. The default branch was load-bearing — the if form
  keeps the same behavior with less surface.

- service_list_test.go / handler_list_test.go: hoist the opaque
  cursor literal into a file-scope const. gosec G101 looks at string
  literals assigned to fields whose name matches credential keywords
  (PageToken matches "token"); an identifier reference clears the
  heuristic cleanly.

- service_start_idempotency_test.go: move the subtest store + svc
  creation inside the t.Run closure so each subtest can call
  t.Parallel() without sharing mutable counters across cases.
A multi-reviewer code review surfaced four merge-blocking P1s and a
batch of P2/P3 hygiene items on this branch. This commit lands the
focused, defensible subset that doesn't require contract or
security-policy decisions; the remaining items are recorded for the
next session.

Idempotency / race-recovery:
- Recognize uq_curtailment_event_idempotency and
  uq_curtailment_event_external_ref unique violations as race-loss
  via typed sentinels; Service.Start re-issues the corresponding
  replay lookup so the race-loser falls into the same response path
  as a deliberate retry rather than surfacing Internal with the
  constraint name leaked in the error string.
- AdminTerminateEvent: on zero-row UPDATE caused by a concurrent
  terminate-to-same-state, re-read inside the transaction and echo
  the row idempotently (mirrors BeginRestoreTransition's pattern).

Audit / observability:
- Emit a curtailment_admin_terminated activity row on AdminTerminate
  so the privileged force-terminate path captures actor + reason in
  the activity feed (parallels emitStartAuditTrail).
- emitStartAuditTrail now maps req.SourceActorType to
  activitymodels.ActorType so scheduler-triggered starts persist
  actor_type='scheduler' on activity_log instead of defaulting to
  'user'.

Update path hardening:
- Reject explicit empty-string Reason as InvalidArgument and add a
  256-char length cap mirroring Start. The proto-translate comment
  is updated to describe the actual silent-no-op behavior.

Proto contract docs:
- Field-level docstrings on ListCurtailmentEvents describing the
  omitted target_rollup/targets and the trimmed decision_snapshot
  shape (skipped_aggregate vs raw skipped).
- max_len=1024 on ListCurtailmentEventsRequest.page_token so the
  cursor decode path is bounded.
- Annotate the two eventStateFromProto call sites distinguishing
  the no-filter sentinel role from the target_state mapping role.

Cleanup:
- Drop duplicate finitePtr generic in handler_start_test.go (the
  existing ptr generic in handler_test.go covers the use case).
- Inline single-call-site valueOrZero generic in service.go.
@rongxin-liu rongxin-liu requested a review from a team as a code owner May 21, 2026 23:13
Copilot AI review requested due to automatic review settings May 21, 2026 23:13
@github-actions github-actions Bot added documentation Improvements or additions to documentation automation server shared labels May 21, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes the operator-facing curtailment management surface (list/update/admin terminate) and adds observability scaffolding (audit events, metrics interfaces, heartbeat runbook/alert template) to support v1 curtailment operations end-to-end in the server.

Changes:

  • Add operator RPCs for listing historical curtailment events (cursor pagination) and updating operator-safe fields, plus an admin RPC to force-terminate an event and sweep targets.
  • Add webhook-style Start idempotency lookups (idempotency key + external source/reference) with race-loser handling, plus audit + metrics interfaces wired through the service and reconciler.
  • Add reconciler heartbeat runbook + placeholder vmalert rules for stalled reconciler/tick failures.

Reviewed changes

Copilot reviewed 28 out of 31 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
server/sqlc/queries/curtailment.sql Adds SQLC queries for idempotency lookups, operator-field update, admin terminate + target sweep, and org event history listing.
server/internal/handlers/curtailment/translate.go Adds request/response translators for AdminTerminate/Update/List; trims decision snapshot for list view; adds proto↔model event-state mapping helper.
server/internal/handlers/curtailment/handler.go Implements UpdateCurtailmentEvent, ListCurtailmentEvents, and AdminTerminateEvent handlers with session/admin gating.
server/internal/handlers/curtailment/handler_update_test.go Handler tests for UpdateCurtailmentEvent auth, validation, and admin gating behavior.
server/internal/handlers/curtailment/handler_stop_test.go Updates Stop handler test stub to satisfy expanded store interface.
server/internal/handlers/curtailment/handler_start_test.go Updates Start handler test stub for new store methods and adjusts optional pointer helper usage.
server/internal/handlers/curtailment/handler_list_test.go Handler tests for ListCurtailmentEvents pagination/filtering and decision-snapshot trimming behavior.
server/internal/handlers/curtailment/handler_admin_terminate_test.go Handler tests for AdminTerminateEvent admin gating, UUID validation, and state-conflict mapping.
server/internal/domain/stores/sqlstores/curtailment.go Implements SQL store methods for idempotency lookups, ListEvents pagination, operator-field update, and AdminTerminateEvent transaction.
server/internal/domain/stores/sqlstores/curtailment_cursor.go Adds base64+JSON cursor encode/decode helpers for ListEvents pagination.
server/internal/domain/stores/interfaces/curtailment.go Extends CurtailmentStore interface with list/update/admin-terminate/idempotency methods and new typed error sentinels.
server/internal/domain/curtailment/service.go Adds metrics/audit plumbing, Start replay lookups + race handling, ListEvents/Update/AdminTerminate service methods, and audit emission helpers.
server/internal/domain/curtailment/service_update_test.go Unit tests for Update service method validation/state-guard/race-loss behavior.
server/internal/domain/curtailment/service_test.go Expands fake store to support new store methods; adds a metrics recorder test helper.
server/internal/domain/curtailment/service_start_test.go Adds Start metrics test for candidate-exclusion counters.
server/internal/domain/curtailment/service_start_idempotency_test.go Adds Start idempotency replay + precedence + error-path tests.
server/internal/domain/curtailment/service_start_audit_test.go Adds Start audit emission tests (base row + override-specific rows + replay suppression).
server/internal/domain/curtailment/service_list_test.go Adds ListEvents service tests for forwarding/validation and store error propagation.
server/internal/domain/curtailment/service_lifecycle_test.go Adds service-layer end-to-end lifecycle test across Preview→Start→Stop→AdminTerminate and replay/list behavior.
server/internal/domain/curtailment/service_admin_terminate_test.go Adds AdminTerminate service tests for validation and conflict/error mapping.
server/internal/domain/curtailment/reconciler/reconciler.go Adds metrics injection and records tick duration/failure counters on panic paths.
server/internal/domain/curtailment/reconciler/reconciler_test.go Adds reconciler tests asserting tick duration/failure metric emission.
server/internal/domain/curtailment/metrics.go Introduces curtailment.Metrics interface + NoOpMetrics implementation.
server/internal/domain/curtailment/audit.go Introduces curtailment.AuditLogger interface + NoOpAuditLogger and curtailment audit event-type constants.
server/generated/sqlc/db.go Regenerated SQLC prepared-statement wiring for new curtailment queries.
server/generated/sqlc/curtailment.sql.go Regenerated SQLC query implementations/types for new curtailment queries.
server/docs/curtailment-reconciler-runbook.md Adds heartbeat staleness runbook, SQL alert query, and failure-mode triage guidance.
server/cmd/fleetd/main.go Wires NoOpMetrics + audit logger into curtailment Service and passes metrics into reconciler.
proto/curtailment/v1/curtailment.proto Documents list-response trimming and adds page_token max length validation.
deployment-files/server/monitoring/vmalert/rules.d/proto-fleet-curtailment.yml Adds placeholder vmalert rules for stalled reconciler and tick failure rate using bridge metrics.

Comment thread server/internal/domain/stores/sqlstores/curtailment_cursor.go
Comment thread server/internal/domain/curtailment/service.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62a996a1f5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/service.go
@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (16121bbbfdbed395396ff06a1b1e46ee72d27764...d6df608f07a8e5442136d62d451652d7f15567bc, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Stop/AdminTerminate Can Strand Miners After A Racing Curtail Dispatch

  • Category: Concurrency
  • Location: server/sqlc/queries/curtailment.sql:372
  • Description: ResetCurtailmentTargetsForRestore resets every non-terminal target, including dispatching targets, to desired_state='active'. The reconciler sets a target to DISPATCHING before calling cmd.Curtail; if StopCurtailment runs in that window, it clears the only marker that AdminTerminate uses to detect in-flight Curtail commands. An admin can then terminate the restoring event before the compensating Uncurtail is enqueued, while the stale Curtail call can still succeed and its post-command DB write will race-lose.
  • Impact: A miner can remain curtailed after Stop/AdminTerminate, causing lost hashrate/revenue and violating the recovery semantics that active Curtails must be compensated before force termination.
  • Recommendation: Make Stop retry/block while any target is in the Curtail dispatching pre-command state, or preserve an in-flight marker that AdminTerminate blocks on until a compensating Uncurtail is guaranteed. Add a regression test for Stop followed by AdminTerminate between the DISPATCHING pre-write and Curtail completion.

[MEDIUM] Updating Restore Batch Size Does Not Affect Restore Behavior

  • Category: Reliability
  • Location: server/internal/domain/curtailment/service.go:304
  • Description: UpdateCurtailmentEvent accepts and persists restore_batch_size, but the restorer reads the frozen effective_batch_size stamped at Start. The service comment even says operators needing a different batch size should cancel and restart, but the API returns success for a field that no longer drives execution.
  • Impact: Operators can lower batch size before stopping and see the event updated, while restore still dispatches the old effective batch size. That can restore more miners per batch than expected and defeat inrush/thermal controls.
  • Recommendation: Remove or reject restore_batch_size from Update with a clear FailedPrecondition, or safely recompute and persist effective_batch_size under locking before restore can begin.

[MEDIUM] Dispatch Errors Are Persisted And Returned Verbatim

  • Category: gRPC
  • Location: server/internal/domain/curtailment/reconciler/reconciler.go:390
  • Description: The reconciler stores dispatchErr.Error() into target last_error, which is later returned in curtailment target responses. This bypasses normal Connect error sanitization because the raw internal error is persisted as event data.
  • Impact: DB/queue/preflight internals, topology details, or future dispatcher/plugin error contents could be exposed to authenticated org users.
  • Recommendation: Persist a bounded, sanitized failure reason/code for client responses and log the raw error server-side.

Notes

No changed pool, wallet, stratum, mining credential, discovery shell-out, or plugin binary execution paths were present in the reviewed diff.


Generated by Codex Security Review |
Triggered by: @rongxin-liu |
Review workflow run

@github-actions github-actions Bot added javascript Pull requests that update javascript code client labels May 21, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3aad80aa6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/handlers/curtailment/translate.go
Comment thread server/internal/domain/stores/sqlstores/curtailment.go
Codex security review + Copilot inline reviewers surfaced five
actionable findings on this branch. All five validated; landing the
fixes here.

Admin gate on Update.max_duration_seconds (HIGH).
  Mirrors Start's post-normalization admin check inside Service.Update.
  Without this, a non-admin who Started at the org default could Update
  the same event above the default, bypassing the privilege boundary
  Start enforces. Fetches org config lazily — only when max_duration is
  in the patch and the caller lacks admin controls.

AdminTerminate.reason length cap (MEDIUM).
  Service-level backstop rejects oversized reasons (>256 chars) so a
  bulky operator string can't amplify into every swept target's
  last_error column. The proto field gets the matching max_len=256
  rule; proto regen is deferred to a clean tooling pass (the service
  backstop already catches the case today).

List query trims decision_snapshot at the SQL boundary (MEDIUM).
  ListCurtailmentEventsForOrg now projects explicit columns with
  (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip
  list (multi-MB on 10K-miner events) doesn't ride the wire for every
  list row. Field layout matches CurtailmentEvent exactly so the
  existing convertEventRow path applies via a single struct
  conversion.

Cursor rejects non-positive IDs (MEDIUM).
  decodeCurtailmentEventCursor now returns InvalidArgument when the
  decoded id is <= 0. The store never emits a non-positive id; a
  user-supplied token that decodes to one would silently rewind to the
  first page (id=0) or return zero rows (id<0).

Audit metadata key naming (MEDIUM).
  Renamed `force_include` to `force_include_maintenance` on the
  curtailment_started audit row metadata so the key matches the
  domain/proto field name. Downstream analytics no longer have to map
  between abbreviated and full names.

Test coverage added for each fix: non-admin max_duration rejection,
admin pass-through, oversized reason rejection, cursor non-positive
id rejection (zero / negative / missing).
@rongxin-liu

Copy link
Copy Markdown
Contributor Author

All three findings addressed in 35594cd:

  • HIGH — Update.max_duration_seconds admin gate: Service.Update now fetches OrgConfig and applies the same post-normalization gate as Service.Start. Coverage in TestService_Update_{RejectsNonAdmin,AllowsAdmin}MaxDurationAboveOrgDefault.
  • MEDIUM — list query loads full snapshots: ListCurtailmentEventsForOrg projects (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip list (multi-MB on 10K-miner events) doesn't ride the wire for every list row.
  • MEDIUM — AdminTerminate.reason unbounded: added max_len=256 on the proto field and a service-level length backstop. Coverage in TestService_AdminTerminate_RejectsOversizedReason.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51e44672d8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/service.go
Comment thread server/sqlc/queries/curtailment.sql Outdated
…ygiene

- restore_batch_interval_sec's non-admin cap now runs against the
  effective patch in Service.Update, mirroring the max_duration_seconds
  fix landed earlier in this PR. A non-admin echoing an admin-elevated
  value as part of an unrelated patch (UI form re-submission) collapses
  to no-op and no longer trips Forbidden — asymmetric gate placement
  between the two fields fixed.
- observeActive comment refreshed to reflect the post-hoist safety
  model. The load-bearing race-closure is the DISPATCHING pre-write's
  EXISTS guard inside dispatchOneCurtail, not a per-target liveness
  read that no longer exists.
…re-write, utf8 boundary

- Update audit emission: a real field change produces one
  curtailment_updated row whose `fields` metadata lists only the
  actually-changed field names (no-op echoes excluded). A patch where
  every field matches the persisted value collapses to no-op with zero
  store calls and zero audit rows.
- Update gate symmetry: non-admin echo of admin-elevated max_duration
  passes; non-admin echo of admin-elevated restore_batch_interval also
  passes (the asymmetric pre-fix would have rejected this).
- Update reason length boundary: 256 multi-byte runes (768 bytes) pass
  rune-count validation; 257 reject. Pins the byte-vs-rune fix.
- observeActive DISPATCHING orphan recovery: a target left in
  DISPATCHING on an ACTIVE event is redispatched on the next tick;
  budget-exhausted orphans are not redispatched.
- dispatchRestoreBatch partial pre-write failure: a non-race-loss pre-
  write error drops just that target from this tick's batch; Uncurtail
  fires for the surviving devices; the failed target stays in its prior
  state for next-tick reclaim.
- Removed the now-unused getEventByUUIDHook + getEventByUUIDCalls
  fields from the reconciler fakeStore (the mid-loop-terminate test
  was rewritten to use updateTargetStateHook in the prior commit).
…note

- observeActive: trim the over-explanatory race-closure block to one
  line; move the comment to sit above the actual eventStillDispatchable
  call rather than the ListCandidates fetch above it.
- validateUpdateRequest: drop the explanatory comment under the
  restore_batch_interval_sec block. The symmetric max_duration_seconds
  block carries no such note; an annotation on one field and silence on
  the other is more confusing than consistent silence. Both gates live
  in Service.Update via effectiveUpdatePatch — readers tracing the
  cap-check follow the existing max_duration_seconds pattern.
- Active-phase orphan budget-exhausted test now asserts the final state
  stays DISPATCHING and RetryCount stays at the cap. Mirror the rigor
  of the symmetric Drifted-arm exhaustion test so a silent state flip
  on the no-redispatch path would not pass.
- New TestReconciler_ObserveActive_DispatchingOrphanRaceLossDoesNotIssueCommand
  pins the EXISTS-guard race-closure on the observeActive redispatch
  path: when the DISPATCHING pre-write returns the race-loss sentinel,
  cmd.Curtail must not fire and the mirror must not advance.
- Restore partial-pre-write test now asserts m1.RetryCount==0 so the
  "skip without budget burn" invariant is pinned (the dispatch attempt
  never reached cmd.Uncurtail, so no retry slot should be consumed).
- New TestReconciler_Restoring_AllPreWriteFailuresSkipUncurtail pins
  the degenerate dispatchSet-empty path: every pre-write fails, no
  Uncurtail fires, no retry burns, targets stay Pending for next-tick
  reclaim.
- Reworded the active-orphan test docstring to describe the invariant
  rather than the review process that surfaced the gap.
…minate

Both write RPCs were returning the response event through toEventProto,
which only maps scalar metadata. The structured fields — scope, mode
params, and decision_snapshot — were silently absent from the wire shape,
along with the persisted target list. A client merging the response into
a cached event object would lose those fields and observe an unexplained
downgrade in event detail.

Fetch the post-write target list via ListTargetsByEvent and route the
response through toEventProtoWithTargets, mirroring StopCurtailment's
already-established pattern. Read endpoints (GetActiveCurtailment, Stop)
already use this path; write endpoints now do too, so cached-event merge
semantics are consistent across the RPC surface.

For AdminTerminate this also surfaces the post-sweep target state
(RESTORE_FAILED rows) to the caller, so operators reviewing the response
see which devices were swept without a separate list call.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91b9d5672c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/sqlc/queries/curtailment.sql
…ection targets

CurtailmentEventHasInFlightTargets treated any target in
dispatching/dispatched/confirmed/drifted as a Stop-first blocker without
checking desired_state. Once StopCurtailment runs,
ResetCurtailmentTargetsForRestore flips every non-terminal target to
desired_state='active' and the reconciler issues compensating Uncurtails
through the same lifecycle states. The unscoped predicate counted those
in-flight Uncurtails as Curtails and rejected AdminTerminate on
RESTORING events with "call StopCurtailment first" — but Stop has
already been called, leaving the operator with no force-terminate path
during incident recovery.

Add `desired_state = 'curtailed'` to the predicate so the gate fires
only on in-flight Curtail commands (which still need compensating
Uncurtails before termination is safe). In-flight Uncurtail commands
during restore no longer block — AdminTerminate proceeds and the
target sweep transitions remaining non-terminal rows to RESTORE_FAILED
as designed.

Proto + Connect/TS doc comments updated to reflect the scoped contract
so wire callers see the same precondition semantics the SQL enforces.

The error-mapping contract at the service layer is unchanged
(TestService_AdminTerminate_ActiveEventRequiresStopFirst still pins the
sentinel → FailedPrecondition mapping). E2E coverage of the mid-restore
AdminTerminate path remains a follow-up — the existing lifecycle E2E
does not exercise this scenario.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7c509b1cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +377 to +384
if patch.MaxDurationSeconds != nil && !req.CanUseAdminControls {
orgConfig, err := s.store.GetOrgConfig(ctx, req.OrgID)
if err != nil {
return nil, err
}
if orgConfig.MaxDurationDefaultSec > 0 &&
*patch.MaxDurationSeconds > orgConfig.MaxDurationDefaultSec {
return nil, fleeterror.NewForbiddenErrorf(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject max-duration updates on unbounded events

Update allows writing max_duration_seconds without checking whether the event is already allow_unbounded=true, so the patch is accepted and persisted even though reconciler.enforceMaxDuration will ignore that cap for the lifetime of the event. In practice, an operator can submit a successful update expecting to re-cap a runaway unbounded event, but the event will continue uncapped. Validate this combination in Service.Update (or atomically clear allow_unbounded when applying a finite cap) so the API cannot acknowledge a no-op safety control change.

Useful? React with 👍 / 👎.

Two pairs of sentinels were handled identically at every callsite. Merge
them into one canonical name each, keeping the discriminators that drive
distinct routing.

- ErrCurtailmentIdempotencyKeyRaceLoss and ErrCurtailmentExternalReferenceRaceLoss
  → ErrCurtailmentReplayRaceLoss. Service.Start's race-loser branch
  already checked both with `||` and routed both into the same
  lookupIdempotentReplay → replayPlanFromPersistedEvent path. The
  partial-unique index that fired doesn't change the recovery action:
  re-issue the matching lookup and return the persisted winner. One
  sentinel removes the OR check and shortens the godoc surface.
- ErrCurtailmentUpdateStateRaceLoss → ErrCurtailmentEventStateRaceLoss.
  Both indicate "row state advanced before the write." The first was
  emitted by UpdateOperatorFields; the second by UpdateEventState and
  UpdateTargetState. Callers route by context (Service.Update returns
  FailedPrecondition; reconciler logs + meters) — the sentinel value
  itself doesn't drive behavior. Folding them keeps the contract
  surface concise.

Kept separate because their downstream routing genuinely differs:
- ErrCurtailmentNonTerminalEventExists → AlreadyExists with a specific
  "non-terminal event exists" message (not a replay candidate).
- ErrCurtailmentAdminTerminateStateConflict → FailedPrecondition with
  "already terminal in a different state" (not retryable).
- ErrCurtailmentAdminTerminateActiveEvent → FailedPrecondition with
  "must Stop first" (operator action required).

Net: 7 sentinels → 5. No behavior change.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7b3cea875

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

)
}

updated, err := s.store.UpdateOperatorFields(ctx, event.ID, req.OrgID, patch)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject max-duration patches on unbounded events

Fresh evidence in this commit is that Update now writes max_duration_seconds through UpdateOperatorFields whenever the field is patched, but it never checks event.AllowUnbounded first. For events created with allow_unbounded=true, the reconciler’s max-duration enforcement path short-circuits, so this update is acknowledged and persisted yet never enforced. In practice, operators can submit a successful “re-cap” update on a runaway unbounded event and still get no stop behavior, which is a misleading safety control outcome.

Useful? React with 👍 / 👎.

dispatchOneCurtail's post-cmd DISPATCHED write was vulnerable to a
Stop→reconciler cascade: if StopCurtailment ran concurrently between
the pre-cmd DISPATCHING write and the post-cmd DISPATCHED write,
ResetCurtailmentTargetsForRestore had already flipped the target's
desired_state to 'active' and reset state to PENDING. The post-cmd
write's EXISTS guard admitted RESTORING events, so the UPDATE clobbered
Stop's reset with state=DISPATCHED + last_batch_uuid=<Curtail batch>.
observeRestoring then waited for a never-arriving Uncurtail telemetry
confirmation, timed out to RESTORE_FAILED, and the device stayed
curtailed forever with no compensating Uncurtail queued.

Add an optional expected_desired_state predicate to
UpdateCurtailmentTargetState. The post-cmd Curtail write sets it to
'curtailed'; the symmetric Restore-phase post-cmd write sets it to
'active'. When Stop's reset has flipped desired_state out from under
the dispatch, the write race-loses (zero rows → ErrCurtailmentEventStateRaceLoss),
the target stays in {state=pending, desired_state=active} for
observeRestoring to claim via the normal restore-batch path, and the
device gets uncurtailed via the Uncurtail batch.

The predicate is optional so confirmation/error/bookkeeping writes that
legitimately apply across phases stay unaffected. The reconciler
fake-store mirrors the SQL contract; an empty DesiredState in tests is
treated as "test didn't set it; don't enforce" to avoid backfilling
DesiredState across 100+ pre-existing target literals.

New regression test (TestReconciler_CurtailPostWriteRaceLosesWhenStopFlipsDesiredState)
pins the cascade: simulate Stop's reset via curtailHook mid-command,
assert cmd.Curtail still fired (unavoidable mid-flight) but the post-cmd
write race-loses — target retains Stop's reset state with no Curtail
batch identifier stamped.
…nal events

A webhook retry whose key (idempotency_key or external_source+reference)
matched a long-completed event returned that historical row through the
replay path with no new dispatch fired. Operator believed curtailment
was in flight while the original event ended at some prior time.

Migration 000056 tightens both partial unique indexes to
`WHERE … AND state IN ('pending', 'active', 'restoring')`, and the two
replay-lookup queries gain the same state filter. Together they make
the dedup window match the event's actual in-flight lifetime:

- Retries during an event's in-flight lifetime still hit the replay
  path (the partial indexes still cover pending/active/restoring rows
  and the lookups still match them).
- Retries with a key matching a terminal event are treated as a fresh
  Start — the operator gets the new dispatch they intended.

The DROP/CREATE INDEX CONCURRENTLY sequence opens a brief uniqueness
gap; the companion uq_curtailment_event_one_non_terminal_per_org index
continues to block concurrent in-flight duplicates during that window,
so the practical exposure is narrow.

Fake-store updates honor the new state predicate via a single
filterNonTerminalReplayEvent helper so service-level tests exercise the
production semantics. Two new regression tests pin the fresh-Start
behavior for both lookup channels.
… failure

A non-race-loss DISPATCHING pre-write failure in dispatchRestoreBatch
previously dropped the target from this tick's dispatch set with a log
and continued. No retry slot burned. A target whose pre-write
persistently failed (DB column lock, replica-lag spike, etc.) cycled
forever with no path to RESTORE_FAILED; the event could not complete.

Route the failure through recordDispatchFailure(..., TargetStatePending)
so retry_count bumps. At MaxRetries the target transitions to
RESTORE_FAILED via the same path every other restore-phase failure
takes, and the event proceeds to terminal. The dispatchSet drop is
preserved — cmd.Uncurtail still doesn't fire for the failed target —
so the partial-batch progress contract from the prior fix continues to
hold.

When the underlying DB is fully degraded (every write fails), the
recovery write also fails; retry_count stays at its current value and
the target cycles to the next tick. That's the right behavior: with
no DB writes landing the system cannot make progress, and burning
retry slots against a transient DB-down would mask the real problem.

Tests:
- TestReconciler_Restoring_PreWriteFailureSkipsTargetButDispatchesRest
  now asserts m1.RetryCount == 1 + LastError stamped (intermittent
  failure: pre-write fails, recovery write succeeds).
- TestReconciler_Restoring_AllPreWriteFailuresSkipUncurtail docstring
  updated to reflect the fully-degraded-DB scenario (retry stays 0
  because the recovery write also fails).
- New TestReconciler_Restoring_PreWriteFailurePersistsExhaustsRetryBudget
  walks 3 consecutive intermittent failures → RESTORE_FAILED at the
  third tick — the bug fix's load-bearing assertion.
…echo

AdminTerminate previously suppressed audit emission whenever the store
returned transitioned=false (event was already in the requested
terminal state on first read, or a concurrent terminate landed first).
That conflated two distinct cases:

- Same-actor retry: a network retry by the same operator. Suppressing
  the duplicate row is reasonable.
- Concurrent race-loser: a second operator's call to AdminTerminate
  with a different reason. Suppressing this row silently drops the
  loser's actor + reason from the audit feed. A historian
  reconstructing "who tried to terminate this event" sees only the
  winner.

Emit a new ActivityTypeAdminTerminatedReplay row on every
transitioned=false return, carrying THIS caller's actor + reason.
Consumers tracking primary terminate actions filter by
ActivityTypeAdminTerminated and ignore the replay type; consumers
reconstructing complete operator-attempt history union both.

The same-actor retry case now produces a small amount of noise (one
replay row per duplicate click) — acceptable tradeoff for getting the
race-loser attribution right. AdminTerminate is a rare manual action;
duplicate-click frequency is negligible compared to the silent
attribution loss it replaces.

Tests:
- TestService_AdminTerminate_IdempotentReplay (renamed
  _SuppressesAudit → _EmitsReplayAuditRow) inverts its assertion:
  expects one replay-type row with the caller's reason.
- TestService_AdminTerminate_RaceLoserAttributionPreserved walks the
  race end to end — two callers with distinct reasons land two audit
  rows (primary + replay), each with its caller's reason. Pins the
  fix's load-bearing assertion: the loser's attribution survives.
…lures

Two failure modes were operationally invisible:

- Heartbeat hides degraded write path: the heartbeat upsert uses a
  detached context with a 5s timeout and runs at the bottom of every
  tick. When PgBouncer drains the writable pool or a replica-lag spike
  blocks per-target writes, the cheap heartbeat row still upserts
  inside its budget while every per-target UPDATE times out. The
  heartbeat staleness SQL check (the canonical operator alert) reports
  "reconciler healthy" — but no events are progressing.

- Audit failures swallowed silently: emitStartAuditTrail,
  emitAdminTerminateAuditTrail, and emitUpdateAuditTrail call the
  activity store via Log, which logs and discards errors. A failing
  activity_log write produces a slog.Error line; nothing in metrics
  surfaces it. Compliance dashboards reading the audit feed see gaps
  with no signal explaining them.

Two new counters on the curtailment.Metrics interface:

- IncTargetWriteFailure() — non-race-loss UpdateTargetState failures.
  Pair with the heartbeat SQL check: heartbeat fresh + counter
  climbing = degraded write path. Race-loss continues to feed
  IncEventStateRaceLoss; the two signals are intentionally distinct.
- IncAuditWriteFailure(activityType string) — activity_log persistence
  failures during curtailment audit emits. Labeled by activity type so
  one consistently-failing event shape can be isolated from the rest.

AuditLogger interface gained LogStrict(ctx, event) error so the
curtailment side can observe the failure activity.Service already
exposes via its existing LogStrict method. NoOpAuditLogger returns nil.

writeTargetState gates by error type:
- Race-loss → IncEventStateRaceLoss (existing) + slog.Warn + return err
- Other error → IncTargetWriteFailure (new) + return err
- The caller still gets the error for caller-specific logging.

Each emit* function switches from Log to LogStrict, logs the error
when one fires, and bumps IncAuditWriteFailure(activityType). The
curtailment RPC still succeeds — audit failures are best-effort by
design — but operators now have a counter to alert on.

Tests:
- TestService_Start_AuditPersistenceFailureIncrementsMetric (R4)
  pins the audit-failure counter on the Start path. logStrictErr
  injection on the recordingAuditLogger fake makes the fault path
  testable without touching the live activity store.
- TestReconciler_TargetWriteFailure_IncrementsCounter (AD12) pins
  the non-race-loss → IncTargetWriteFailure path.
- TestReconciler_TargetStateRaceLoss_… now also asserts
  TargetWriteFailureCount == 0 to keep the two counters from drifting
  into each other.
…itions

Machine callers had to string-match the debug message to distinguish
the two recoverable FailedPrecondition variants: "call Stop first"
vs. "already settled in a different terminal state." That coupling
breaks the moment the message is reworded for a different audience.

Switch both emits to fleeterror.NewErrorWithServiceCode with stable
int32 codes (1 = in-flight commands, 2 = state conflict) so the wire
carries the discriminator in commonv1.FleetErrorDetails.Service. The
proto doc enumerates the codes; tests pin both ErrorCodeTypeService
and the constant value.
…shot.skipped

ListCurtailmentEventsForOrg already projects decision_snapshot_jsonb with
the per-device `skipped` array stripped and a `skipped_aggregate`
reason→count map substituted (see queries/curtailment.sql). The Go-side
walk in populateEventDecisionSnapshotTrimmed re-ran the same aggregation
on data the SQL had already trimmed — the `skipped` key was never present
in production, so the loop produced nothing and the function devolved to
"hydrate the JSON to structpb."

Delete the trimmed-variant wrapper, point the list path at the existing
populateEventDecisionSnapshot, and pin the SQL boundary in the godoc so
the contract is obvious. The handler test now feeds the pre-trimmed
shape the SQL emits — the previous fixture relied on the dead walk to
produce the aggregate, which obscured where the trim actually happens.
Start dispatch wrote each target with a separate InsertCurtailmentTarget
round-trip inside the event transaction. For a 10K-miner curtailment
that's 10K serial parameterized exec calls — easily the dominant cost of
a Start RPC even with prepared-statement reuse, and a steady source of
transaction holding time that conflicts with the reconciler tick.

Switch to a single BulkInsertCurtailmentTargets call that fans the rows
out server-side via jsonb_to_recordset. The Go side marshals the
per-target inputs into one JSONB array; Postgres expands it into the
typed columns (NUMERIC for baseline_power_w, JSONB for the selector
rationale, missing keys → NULL). Returns rows affected so the store can
pin (inserted == len(targets)) and fail the transaction loudly if
jsonb_to_recordset silently drops a row that fails its column cast.

Drops the unused single-row InsertCurtailmentTarget query and the
rawMessageOrNullable helper that only it consumed. The JSON payload
shape is pinned by a unit test so a wrong field tag or nil-handling
regression breaks loud at unit-time rather than via a docker-compose
integration failure.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 15b6beb41e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +19 to +20
DROP INDEX CONCURRENTLY IF EXISTS uq_curtailment_event_idempotency;
CREATE UNIQUE INDEX CONCURRENTLY uq_curtailment_event_idempotency

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Split concurrent index rebuild into single-statement migrations

000056_scope_curtailment_idempotency_to_non_terminal.up.sql contains four ... CONCURRENTLY statements in one migration file. In this repo, runMigrations constructs the Postgres driver with postgres.WithInstance(conn, &postgres.Config{}) (no multi-statement mode enabled), so multi-statement migrations are executed in a single transaction context; Postgres rejects CREATE/DROP INDEX CONCURRENTLY inside a transaction block. This will fail migration version 56 during deploy and leave the database dirty until manual recovery.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 72743e5808

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

-- golang-migrate postgres driver executes a migration file as one command
-- unless statement splitting is enabled, and PostgreSQL rejects concurrent
-- index operations in a multi-command transaction block.
DROP INDEX CONCURRENTLY IF EXISTS uq_curtailment_event_idempotency;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve idempotency uniqueness while scoping index

Dropping uq_curtailment_event_idempotency in this migration before recreating it in version 57 leaves a deployment window with no DB uniqueness enforcement for (org_id, idempotency_key) on non-terminal events. During that gap (or if 57 fails mid-rollout), concurrent StartCurtailment requests with the same key can both insert, permanently breaking replay determinism because GetCurtailmentEventByIdempotencyKey ... LIMIT 1 will return an arbitrary winner.

Useful? React with 👍 / 👎.

@@ -0,0 +1 @@
DROP INDEX CONCURRENTLY IF EXISTS uq_curtailment_event_external_ref;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep external-reference uniqueness coverage during migration

This DROP INDEX runs one migration before the replacement unique index is created in version 59, creating a no-protection interval for (org_id, external_source, external_reference) on non-terminal events. In that window, duplicate webhook deliveries can persist multiple live rows with the same external reference, which undermines the external replay contract and makes future lookup-by-reference ambiguous.

Useful? React with 👍 / 👎.

Multiple corrections surfaced by the review pass against this PR.

Runtime:

- SweepCurtailmentTargetsToRestoreFailed wrote to a non-existent
  `updated_at` column on `curtailment_target`. Every AdminTerminate
  against a non-terminal event with at least one in-scope target tripped
  "column does not exist" and rolled back the entire termination,
  leaving the admin recovery escape hatch unusable. Drop the column
  from the UPDATE; the other target writes already don't reference it.

- recordDispatchFailure now falls back to a new
  BumpCurtailmentTargetRetry query when the rich state-change UPDATE
  fails non-race-loss. Persistent state-write failures no longer block
  retry-budget progress; MaxRetries → RESTORE_FAILED escalation lands
  on the next successful UpdateTargetState. New restore-path test pins
  the fallback contract.

- Restore the "heartbeat advances on tick freshness, not query health"
  semantic when ListNonTerminalEvents fails. The SQL staleness alert
  distinguishes reconciler-dead (no upsert) from DB-read-degraded
  (upsert advances, IncTickFailure rises). Test renamed to match.

Contract:

- UpdateCurtailmentEventRequest gains buf-validators mirroring Start
  (reason min/max_len, restore_batch_size lte:10000,
  restore_batch_interval_sec lte:3600, max_duration_seconds lte:604800).
  Previously a client could patch a non-terminal event with bounds
  Start would reject.

- AdminTerminate proto docstring documents the RESTORING-event race:
  the in-flight gate scopes to desired_state=CURTAILED, so a terminate
  during restore proceeds even with in-flight Uncurtails. Race-window
  targets persist as RESTORE_FAILED while the device may be restored.

- ListCurtailmentEvents proto docstring documents the trigger-metadata
  scrub on list rows; clients correlating webhook deliveries fetch via
  GetActiveCurtailment for unscrubbed attribution.
- Migrations 000056/000058 add operator-recovery notes for the
  CONCURRENTLY-split failure mode (paired CREATE in 000057/000059
  fails after the DROP lands: schema_migrations stays dirty with the
  unique index gone; resolve duplicates, force the version, retry).

- Replace dangling `Open #13` references with the recompute-vs-freeze
  reason inline so future readers don't chase a missing tracker link.
@rongxin-liu rongxin-liu force-pushed the feat/issue-289-curtailment-read-apis-audit-metrics branch 2 times, most recently from a48a376 to d6df608 Compare May 25, 2026 16:55
@github-actions github-actions Bot removed the documentation Improvements or additions to documentation label May 25, 2026
@rongxin-liu

rongxin-liu commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

Closing in favor of a fresh PR 308 with a cleaner description; same branch and same scope.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(curtailment): operator read APIs + admin terminate + audit + metrics interface + E2E

2 participants