Skip to content

feat(curtailment): persistence + preview selector#185

Closed
rongxin-liu wants to merge 9 commits into
mainfrom
feat/issue-140-curtailment-persistence-preview
Closed

feat(curtailment): persistence + preview selector#185
rongxin-liu wants to merge 9 commits into
mainfrom
feat/issue-140-curtailment-persistence-preview

Conversation

@rongxin-liu

@rongxin-liu rongxin-liu commented May 6, 2026

Copy link
Copy Markdown
Contributor

Background

Curtailment is the proto-fleet feature that reduces a fleet's mining power on demand — operators preview a plan, start an event, verify reduction via telemetry, then restore safely.

The contract foundation (#118) shipped the proto surface with handler stubs returning Unimplemented. The admin layer (#173) added AdminTerminateEvent, three admin-gated override fields, session-only registration of the recovery RPC, and requireAdminFromContext. This PR is the next ticket: persistence + the first operator-facing RPC. Builds on #118, #173, and the command-preflight groundwork from #110. Supersedes the closed #141.

Summary

Lands the curtailment persistence layer plus a working read-only PreviewCurtailmentPlan. Preview computes a plan but does not write events, write target rows, dispatch commands, or start reconciliation. The remaining curtailment RPCs continue to return Unimplemented and ship in subsequent tickets.

Six logical commits, layered top-down:

  1. 6f7d59c Foundation — migration 000040, sqlc queries, store interface + sqlstore, domain models, FixedKw mode + tests, TestCurtailmentEventStateNumericPins guard for the AdminTerminateEvent validator from feat(curtailment): admin RPC, override fields, and session-only auth #173.
  2. c6d9b9a Selector — pure-logic BuildPlan with the dual-signal filter, ranking, and mode hand-off. 10 unit tests.
  3. b5e20b1 Service + handler — orchestration, the cross-table candidate-aggregation query, proto translators, fleetd startup wiring.
  4. 8b2f64e Service tests — 14 cases against an in-memory fake store covering scope resolution, the filter matrix, cooldown vs. EMERGENCY bypass, the maintenance-pair gate, the admin-gated override, cross-tenant isolation, and insufficient-load detail propagation.
  5. 0f3cc14 Lint cleanup — golangci exhaustive / gosec / unconvert.
  6. 58c90ecd329ff3 Review-pass fixes — bug fixes from a multi-reviewer pass plus a comment scrub and small simplifications. Detailed below.

What changed

Migration 000040

  • curtailment_event — full lifecycle columns plus the admin fields shipped in feat(curtailment): admin RPC, override fields, and session-only auth #173 (allow_unbounded, effective_batch_size). CHECK constraints for the maintenance-pair invariant and non-empty external_source / external_reference / idempotency_key / reason. Partial UNIQUE indexes for idempotency, webhook dedupe, and active-event lookup.
  • curtailment_target — composite PK (curtailment_event_id, device_identifier). Partial indexes for pending work and active-by-device lookup.
  • curtailment_reconciler_heartbeat — singleton row with CHECK (id = 1). Seeded so the staleness alert always reads a row.
  • curtailment_org_config — per-org tunables. One row per existing non-deleted org seeded in the same transaction; down migration drops the table.

Preview pipeline

  • Service.Preview validates the request → resolves scope (whole-org / device-list; device-set returns Unimplemented) → loads org config → resolves the effective candidate_min_power_w (admin-gated request override > org default) → fetches active-event and cooldown exclusion sets (EMERGENCY priority bypasses cooldown) → classifies candidates with skip-reason attribution → hands the partition to BuildPlan + FixedKw.
  • BuildPlan is pure (no I/O, no time, no shared state): dual-signal filter (power_w >= candidate_min_power_w AND hash_rate_hs > 0, with distinct skip reasons for phantom-load / dead-monitor / below-threshold) → rank by avg_efficiency descending with unknown-efficiency last → mode hand-off.
  • FixedKw walks the ranking and produces target-reached / undershoot-tolerated / insufficient-load. tolerance_kw == 0 collapses the undershoot branch.
  • handlers/curtailment.Handler accepts an optional *Service: nil keeps the existing stub-test contract; populated wires the real implementation. The admin gate from feat(curtailment): admin RPC, override fields, and session-only auth #173 still runs first when the request carries an override field. OutcomeInsufficientLoad maps to InvalidArgument carrying available_kw, requested_kw, tolerance_kw, and per-reason exclusion counts (Connect-RPC error-detail propagation is a future enhancement).
  • cmd/fleetd/main.go instantiates NewSQLCurtailmentStore + curtailmentDomain.NewService and threads them into the curtailment route registration.
  • A new domain/curtailment/models package holds boundary types so the rest of the curtailment domain doesn't import sqlc-generated code.

Review-pass fixes

Layered on top of commits 1–4, from a multi-reviewer pass. All P0 / P1 behavior findings addressed; coverage-only items routed to follow-ups.

  • P0 — New orgs created post-migration no longer break Preview with NotFound. GetOrgConfig is now an INSERT ... ON CONFLICT DO UPDATE upsert that always returns a row with table-level DEFAULTs.
  • P1ListCurtailmentCandidatesByOrg no longer crashes the row scan when a device has no device_status row joined (COALESCE(ds.status::text, ''::text)::text).
  • P1InsufficientLoadDetail per-reason counters (below-threshold / phantom-load / dead-monitor) now populate; BuildPlan tracks each category locally and merges into the rejection detail post-mode.Select.
  • P1 — Empty DeviceStatus is skipped with SkipStaleTelemetry instead of admitted as eligible.
  • P1 — Device-list scope with cross-org device_identifiers rejects with NotFound listing the unrecognized IDs instead of returning InsufficientLoad.
  • P2latest_hourly CTE is org-scoped and 24h-bucket-bounded.
  • P2candidate_min_power_w_override = 0 is rejected at the service layer.
  • P2 — Capability gate (partial). Devices missing driver metadata are skipped with SkipCurtailFullUnsupported. The full plugin-registry-driven check is follow-up work.
  • P3reason_selected is derived from the request strategy via strategyReasonLabel(s) instead of hard-coded.
  • P3 — Migration's per-org seed excludes soft-deleted organizations.

Operational notes for reviewers

  • Write-on-read for curtailment_org_configGetOrgConfig upserts on every Preview call. The conflict arm is a PK-keyed no-op; on first hit for a new org it inserts a row with table DEFAULTs (max_duration_default_sec=14400, candidate_min_power_w=1500, post_event_cooldown_sec=600).
  • latest_hourly CTE plan — the new shape pulls strictly fewer rows (org-scoped + 24h window).
  • Migration immutability000040 is not yet on main, so this PR edits it in place. The next ticket inherits the post-merge immutability rule.

What is intentionally not in this PR

  • StartCurtailment, Curtail dispatch, reconciler, and drift handling.
  • StopCurtailment and staggered restore. The effective_batch_size column lands here; the restorer that populates it lives in a follow-up.
  • GetActiveCurtailment, ListCurtailmentEvents, UpdateCurtailmentEvent implementations.
  • CurtailmentActiveFilter registration on commandSvc.
  • An admin RPC for mutating curtailment_org_config — the table and seed exist; the write surface lands later.
  • Device-set scope resolution; full plugin-registry-driven curtail_full capability gating (partial driver-metadata gate landed inline); webhook-triggered curtailment; Fleet-level efficiency events; smart PDU / outlet / rack target controls.
  • Frontend settings page and preview/start flow.
  • Coverage-only follow-ups from the review pass: handler / translate.go / SQL-integration tests, migration CHECK DB-level test, OutcomeUndershootTolerated end-to-end test, InsufficientLoad printf format pinning, Strategy / Mode / Level drop-or-wire decision, P3 polish.

Test plan

  • buf lint clean; golangci-lint run clean for the changed packages; sqlc generate clean against the migration; go build ./server/... clean.
  • go test ./internal/domain/curtailment/... ./internal/handlers/curtailment/... ./internal/domain/activity/models/... — 39+ cases green.
  • TestCurtailmentEventStateNumericPins asserts the validator pins from feat(curtailment): admin RPC, override fields, and session-only auth #173 (CANCELLED == 6, FAILED == 7).
  • modes/FixedKw covers all three outcomes, boundary ordering, empty-input guard, and constructor input validation.
  • selector_test.go covers the dual-signal matrix, pre-filtered forwarding, ranking (worst-efficiency-first, unknown-last, stable equal-efficiency), realized + remaining kW arithmetic, and an end-to-end FixedKw integration.
  • service_test.go covers (against an in-memory fake store): happy path; request validation; scope resolution; the pre-selector filter matrix; maintenance-pair gate; cooldown vs. EMERGENCY bypass; active-event exclusion; candidate_min_power_w_override precedence; cross-tenant isolation; insufficient-load detail propagation; the partial capability gate; and four review-pass regressions (EmptyDeviceStatusSkipsAsStale, DeviceListScopeRejectsCrossOrgIdentifiers, DualSignalCountersPropagateIntoInsufficientLoadDetail, MissingDriverSkipsAsCurtailFullUnsupported).

Closes #140
Refs #110
Refs #118
Refs #173

🤖 Generated with Claude Code

rongxin-liu and others added 4 commits May 6, 2026 18:16
…-2 foundation)

Lays the foundation for the BE-2 ticket (#140): the curtailment persistence
layer, sqlc-backed store, domain models, the FIXED_KW mode implementation,
and the enum-stability guard for the AdminTerminateEvent validator pinned
in BE-1.x (#173).

Migration 000040:
- curtailment_event with full lifecycle columns plus the BE-1.x admin-only
  fields (allow_unbounded BOOLEAN, effective_batch_size INT). CHECK
  constraints enforce maintenance-pair consistency, non-empty external
  source/reference/idempotency_key, and non-empty reason. Partial UNIQUE
  indexes cover idempotency, webhook dedupe, and active-event lookup.
- curtailment_target with composite PK (event_id, device_identifier),
  partial indexes for pending work and active-by-device schedule lookup.
- curtailment_reconciler_heartbeat singleton seeded at migration time so
  the staleness alert always has a row to read.
- curtailment_org_config with per-org tunables seeded one row per existing
  org in the same migration transaction; down migration drops the table.

Domain layer:
- server/internal/domain/curtailment/models defines the boundary shapes
  (Event, Target, OrgConfig, Heartbeat, EventState/TargetState typed
  wrappers) so selector/handler/modes do not import sqlc-generated code.
- server/internal/domain/curtailment/modes ships the Mode interface and
  the FixedKw implementation. Pure logic — no I/O, no time, no shared
  state. Covers the three design-doc outcomes: target reached (overshoot
  bounded by last-added candidate), undershoot tolerated (only with
  explicit positive tolerance_kw), and insufficient curtailable load
  (with a structured InsufficientLoadDetail the handler can echo back).

Store layer:
- interfaces/curtailment.go defines the org-scoped CurtailmentStore;
  v1 surface is the minimum needed to support Preview plus the basic
  event/target CRUD primitives so store tests can verify the schema
  constraints round-trip.
- sqlstores/curtailment.go implements the interface using the sqlc-
  generated queries (GetCurtailmentOrgConfig, ListActiveCurtailedDevicesByOrg,
  ListRecentlyResolvedCurtailedDevicesByOrg, InsertCurtailmentEvent,
  GetCurtailmentEventByUUID, InsertCurtailmentTarget,
  ListCurtailmentTargetsByEvent, GetCurtailmentReconcilerHeartbeat).

BE-1.x guard:
- TestCurtailmentEventStateNumericPins asserts CANCELLED == 6 and
  FAILED == 7 at build time. The AdminTerminateEventRequest validator
  pins on (buf.validate.field).enum.in: [6, 7]; this test fails CI before
  any future enum reorder can silently desynchronize the validator.

Selector + handler implementation lands in follow-up commits on this branch.

Refs #140
Refs #118
Refs #173

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…de hand-off)

Pure-logic BuildPlan(inputs, preFiltered, candidateMinPowerW, mode) → Plan.
Three responsibilities, separable for testing:

1. Dual-signal filter. Each candidate is admitted only when both
   power_w >= candidate_min_power_w AND hash_rate_hs > 0. The skip
   reason vocabulary surfaces the diagnostic flavor:
   - phantom_load_no_hash: drawing power but not hashing (stuck firmware)
   - power_telemetry_unreliable: hashing but power reads near zero
     (broken AC monitor)
   - below_candidate_min_power_w: both signals fail (likely fully idle)

2. Ranking by avg_efficiency descending — worst J/H first. Devices with
   nil avg_efficiency rank LAST (not first via a COALESCE-to-zero
   artifact), so an unranked miner doesn't accidentally get treated as
   best-in-class. Stable sort preserves input order on ties.

3. Mode hand-off. The ranked list goes to mode.Select; the mode owns
   the stop condition. The selector maps the Result back to a Plan
   carrying selected/skipped lists, realized + remaining kW, outcome,
   and the InsufficientLoadDetail when applicable.

Status / pairing / capability / cooldown / active-event filters happen
upstream in the service layer before reaching the selector; their skip
reasons arrive via the preFiltered argument and forward through unchanged.

Tests cover all three dual-signal branches, pre-filtered forwarding, ranking
worst-first, unknown-efficiency-last, stable equal-efficiency ordering,
realized/remaining kW arithmetic, and an end-to-end FixedKw integration.
Insufficient-load propagation includes the selector-supplied exclusion
counts so the handler can echo CandidateMinPowerW and per-reason counts
back to the caller without a re-query.

Refs #140

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires PreviewCurtailmentPlan end-to-end on top of the BE-2 foundation.

Service layer (server/internal/domain/curtailment/service.go):
- Validates the request (mode/level/target_kw/tolerance + maintenance pair).
- Resolves scope: whole_org and device_list pass through; device_set is
  rejected as Unimplemented (DeviceSetStore wiring is a follow-up).
- Loads org config and resolves the effective candidate_min_power_w —
  per-org default tier or the admin-gated request override (the handler
  enforces the role gate before the service runs, per BE-1.x contract).
- Pulls active-event and cooldown exclusion sets from the store; EMERGENCY
  priority bypasses cooldown, matching the design doc.
- Calls ListCandidates and partitions the cross-table rows into
  (eligible, preFiltered) with skip-reason attribution: pairing,
  device_status (UPDATING / REBOOT_REQUIRED / OFFLINE / MAINTENANCE),
  stale telemetry, cooldown, active-event. The maintenance override pair
  admits MAINTENANCE-state miners only when both flags are true.
- Hands the partition to the existing selector + FixedKw mode, which
  applies the dual-signal filter, ranks worst-efficiency-first, and
  produces a Plan with selected/skipped lists, realized + remaining kW,
  and an InsufficientLoadDetail when the rejection branch fires.

Store + sqlc:
- ListCurtailmentCandidatesByOrg joins device + latest device_metrics
  (15-min freshness) + latest device_metrics_hourly + device_pairing +
  device_status. LEFT JOINs leave nil pointers on absent rows so the
  service can attribute stale-telemetry skips.
- ListCandidates added to CurtailmentStore + SQLCurtailmentStore.
- New domain Candidate model carries the cross-table state independent
  of sqlc-generated types.

Handler (server/internal/handlers/curtailment/handler.go):
- NewHandler(service *curtailment.Service): the service is optional —
  nil keeps the existing stub-test contract (PreviewCurtailmentPlan
  returns Unimplemented), and the production wiring at fleetd startup
  passes a real service.
- PreviewCurtailmentPlan body: admin gate from BE-1.x runs first when
  the request carries an override field; then session.OrganizationID
  drives the service call; then the Plan is translated to the proto
  response. OutcomeInsufficientLoad maps to InvalidArgument with a
  message body carrying available/requested kW and per-reason
  exclusion counts (Connect-RPC error-detail propagation is a future
  enhancement).
- translate.go isolates the proto<->service-shape conversion so the
  service stays decoupled from generated proto types and is testable
  with plain Go inputs.

Wiring (server/cmd/fleetd/main.go):
- NewSQLCurtailmentStore + NewService instantiated alongside the other
  domain services; passed into NewHandler at the curtailment route
  registration.

Existing handler tests pass nil and continue to assert the
Unimplemented stub contract for the non-Preview RPCs and the BE-1.x
admin-gate behavior; they all still pass after the constructor change.

Refs #140

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… / scope / org-isolation matrix

14 cases against an in-memory fake CurtailmentStore. Pins the behaviors the
BE-2 acceptance criteria call out plus the cross-tenant isolation
invariant. Each case exercises the Preview orchestration without touching
the database; the fake captures arguments so tests can assert that the
service threads the caller's org_id through every store call.

Coverage:

- Happy path: 3 ranked miners, FIXED_KW target reached at the second.
- Request validation: unsupported mode, unbalanced maintenance pair,
  zero/negative target.
- Scope resolution: device-set returns Unimplemented (deferred); device-list
  narrows the store query to the supplied identifiers; empty device list
  is rejected.
- Pre-selector filter matrix: each non-eligible device_status / pairing /
  staleness branch maps to its expected SkipReason; the lone ACTIVE +
  PAIRED + fresh miner is the only one selected.
- Maintenance pair: MAINTENANCE-state miner is admitted only when both
  include_maintenance and force_include_maintenance are true on the request.
- Cooldown: NORMAL priority calls the cooldown lookup with the org's
  post_event_cooldown_sec; EMERGENCY skips the lookup entirely.
- Active-event exclusion: device locked in another non-terminal event is
  skipped with reason active_event.
- candidate_min_power_w override: the request's admin-gated override
  takes precedence over the per-org default.
- Cross-tenant isolation: every store method receives the caller's org_id;
  the fake captures and asserts no leakage from a populated other-org row set.
- Insufficient-load propagation: AvailableKW / RequestedKW /
  CandidateMinPowerW / per-reason exclusion counts forward into the
  detail the handler echoes back to the caller.

The fake panics on methods Preview does not exercise (InsertEvent,
GetEventByUUID, InsertTarget, ListTargetsByEvent, GetHeartbeat) so a
regression that newly invokes one surfaces immediately rather than
silently consuming a zero value.

Refs #140

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 17:09
@rongxin-liu rongxin-liu requested a review from a team as a code owner May 6, 2026 17:09
@github-actions github-actions Bot added the server label May 6, 2026
@chatgpt-codex-connector

Copy link
Copy Markdown

💡 Codex Review

ds.status::text AS device_status,

P1 Badge Coalesce nullable device status in candidate query

device_status is selected from a LEFT JOIN as ds.status::text, so devices without a status row produce NULL. The generated sqlc row type for this column is string, and scanning a SQL NULL into a Go string fails the whole query. In practice, a single unstatused device causes ListCurtailmentCandidatesByOrg to error and PreviewCurtailmentPlan to return an internal failure instead of a plan.


INSERT INTO curtailment_org_config (org_id)
SELECT id FROM organization
ON CONFLICT (org_id) DO NOTHING;

P2 Badge Populate curtailment config for organizations created later

This migration only backfills curtailment_org_config rows for organizations that already exist at migration time. Service.Preview hard-depends on GetOrgConfig and returns an error when the row is missing, so any organization created after this migration will be unable to run preview until a manual insert happens. Add an automatic row-creation path (trigger/app write) or a safe fallback on read.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@chatgpt-codex-connector

Copy link
Copy Markdown

💡 Codex Review

ds.status::text AS device_status,

P1 Badge Coalesce nullable device_status before scanning

ListCurtailmentCandidatesByOrg selects ds.status::text from a LEFT JOIN, so devices without a device_status row produce NULL; sqlc generated DeviceStatus as a non-null string, which will fail scan at runtime (converting NULL to string is unsupported) and make PreviewCurtailmentPlan return an internal error for affected orgs. This is reachable because device_status is not guaranteed to exist for every device row.


INSERT INTO curtailment_org_config (org_id)
SELECT id FROM organization
ON CONFLICT (org_id) DO NOTHING;

P1 Badge Provision curtailment config for newly created orgs

This migration only backfills curtailment_org_config from organizations present at migration time, but does not add any mechanism for future organizations; new org creation (e.g., SQLUserStore.CreateAdminUserWithOrganization) inserts into organization without inserting a matching curtailment config row, so GetOrgConfig returns NotFound and Preview fails for every org created after this migration.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (98904856f189d82db44b62872d317116831526bf...d329ff326c0ed6461811eb6059bfc4cc77952546, exact PR three-dot diff)
  • Model: gpt-5.4

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] Preview now performs a real config-table write on every read

  • Category: Reliability
  • Location: server/internal/domain/stores/sqlstores/curtailment.go:38
  • Description: GetOrgConfig() now unconditionally calls EnsureCurtailmentOrgConfig(). The new SQL behind that helper uses INSERT ... ON CONFLICT DO UPDATE SET org_id = EXCLUDED.org_id RETURNING ..., so an existing row is updated on every preview request. Because curtailment_org_config also has a BEFORE UPDATE trigger that rewrites updated_at, this is not a no-op read path.
  • Impact: Every preview hits the primary with a write, creates avoidable WAL/churn, hot-locks the same org row under concurrent previews, and makes updated_at look like the config was modified by read traffic.
  • Recommendation: Split this into a true read path: SELECT first, and only on ErrNoRows do an INSERT ... ON CONFLICT DO NOTHING followed by SELECT (or use a CTE that writes only when missing).

[MEDIUM] Preview can select miners that the plugin layer cannot actually curtail

  • Category: Plugin
  • Location: server/internal/domain/curtailment/service.go:218
  • Description: The new selector only rejects candidates with missing driver_name. It does not verify that the resolved device/model advertises curtail_full support, and the file itself notes this is still a TODO. A populated driver name is not equivalent to curtailment capability.
  • Impact: Preview can overstate available curtailable load and return plans containing devices that the plugin boundary will reject later. That makes the new API produce plans that are not executable on mixed-capability fleets.
  • Recommendation: Gate eligibility on actual curtailment capability, not just driver metadata. Inject a capability provider/registry lookup into preview and skip any candidate that lacks curtail_full.

[MEDIUM] Candidate ordering is nondeterministic for equal-efficiency miners

  • Category: Correctness
  • Location: server/sqlc/queries/curtailment.sql:207
  • Description: The final candidate query has no ORDER BY. In the selector, ties are intentionally left to sort.SliceStable() so input order is preserved for equal/unknown efficiencies, but PostgreSQL does not guarantee row order without an explicit sort.
  • Impact: Identical preview requests can pick different devices whenever multiple miners have the same hourly efficiency or no efficiency row at all. Because fixed-kW selection stops once the target is met, this can make plans flap across calls and will make future execution/retry behavior non-idempotent.
  • Recommendation: Add a deterministic tie-breaker before the selector runs, e.g. an ORDER BY on stable fields such as efficiency-preserving rank inputs plus device_identifier, or add an explicit secondary sort key in Go.

Notes

I did not find an auth bypass, SQL injection, command injection, or pool-hijack path in the reviewed hunks. The main risks in this diff are reliability and plan correctness in the new curtailment preview path.


Generated by Codex Security Review |
Triggered by: @rongxin-liu |
Review workflow run

CI's golangci-lint surfaced four issues on #185:

- exhaustive: EventState.IsTerminal switch was missing the non-terminal
  cases. Adding pending/active/restoring as an explicit no-match arm
  preserves the same return value while satisfying the linter.
- exhaustive: priorityName switch hid UNSPECIFIED / NORMAL / HIGH behind
  the default arm. Listing them explicitly documents that HIGH (rejected
  by the proto validator before this code runs) collapses to NORMAL.
- gosec G115: the uint32 -> int32 cast on candidate_min_power_w_override
  needed an explicit bounds check. The buf.validate ceiling already
  caps the value well below int32 max, but a defense-in-depth guard
  surfaces a clean InvalidArgument if interceptor wiring is ever
  bypassed instead of silently wrapping.
- unconvert: int32(*req.CandidateMinPowerWOverride) in the service was a
  no-op since the field is already *int32. Drop the conversion.
@chatgpt-codex-connector

Copy link
Copy Markdown

💡 Codex Review

ds.status::text AS device_status,

P1 Badge Coalesce nullable device status in candidate query

ListCurtailmentCandidatesByOrg LEFT JOINs device_status, so ds.status is NULL when a device has no status row, but the query selects ds.status::text directly. sqlc generated DeviceStatus as a plain string, so scanning a NULL here will fail at runtime and make PreviewCurtailmentPlan return an internal error for any org containing such devices. Please coalesce this column (or otherwise make the type nullable end-to-end) before scanning.


INSERT INTO curtailment_org_config (org_id)
SELECT id FROM organization
ON CONFLICT (org_id) DO NOTHING;

P2 Badge Backfill curtailment config for newly created orgs

This migration only inserts curtailment_org_config rows for organizations that already exist at migration time. Organizations created afterward will not get a config row, and GetOrgConfig currently treats that as NotFound, which blocks preview/start flows for those tenants until someone inserts the row manually. Add a creation hook/trigger or lazy upsert path so every new org has defaults automatically.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

rongxin-liu and others added 2 commits May 6, 2026 20:54
… dual-signal counters, latest_hourly scope, cross-org guard, capability-meta gate

Layered on top of the four logical commits already on this branch,
addressing review findings (multi-reviewer pass, with three of the
critical items previously flagged on Codex review runs without
resolution).

P0
- Lazy-upsert curtailment_org_config in GetOrgConfig so orgs created
  after migration 000040 don't break Preview with NotFound. New
  EnsureCurtailmentOrgConfig query (INSERT ... ON CONFLICT DO UPDATE)
  always returns a row with table-level DEFAULTs.

P1
- COALESCE(ds.status::text, ''::text)::text in
  ListCurtailmentCandidatesByOrg so a NULL device_status row no longer
  crashes the scan; sqlc now generates DeviceStatus as string.
- Skip empty-DeviceStatus candidates with SkipStaleTelemetry rather
  than admitting them as eligible (closes the COALESCE-fallout).
- Thread dual-signal exclusion counts through BuildPlan into
  Result.InsufficientDetail so per-reason counts reach the rejection
  detail (was always zero for below-threshold/phantom-load/dead-monitor).
- Reject device-list scope with cross-org device_identifiers via
  NotFound listing the unrecognized IDs (org-ownership boundary the
  BE-2 plan called for).

P2
- Org-scope and 24h-bucket the latest_hourly CTE
  (INNER JOIN device d3 ON ... org_id = $1, WHERE bucket > NOW() - 24h)
  so Preview is no longer a global TimescaleDB hot spot.
- Reject candidate_min_power_w_override < 1 at the service layer as a
  backstop for callers that bypass the proto validator.
- Defense-in-depth capability gate: skip candidates whose
  discovered_device LEFT JOIN missed (DriverName == nil or empty) with
  SkipCurtailFullUnsupported. Full plugin-registry check still defers
  to BE-3 / BE-4.

P3
- Derive reason_selected from the request strategy enum via
  strategyReasonLabel(s) so a future strategy forces the surface to be
  touched.
- Exclude soft-deleted orgs from the migration's per-org seed.

Tests: four additive regression tests in service_test.go pinning the
new behaviors:
- TestService_Preview_EmptyDeviceStatusSkipsAsStale
- TestService_Preview_DeviceListScopeRejectsCrossOrgIdentifiers
- TestService_Preview_DualSignalCountersPropagateIntoInsufficientLoadDetail
- TestService_Preview_MissingDriverSkipsAsCurtailFullUnsupported

Verified: go test ./internal/domain/curtailment/... ./internal/handlers/curtailment/...
green; golangci-lint 0 issues; go build ./... clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes BE-1.x / BE-2 / BE-3 / BE-4 / BE-5 references and the related
PR/issue numbers from in-code comments across the curtailment surface.
Comments now describe behavior in terms of feature work (Start dispatch,
restorer, plugin registry, etc.) rather than internal ticket scopes,
matching the project convention of keeping process artifacts out of
shipped source.

Touches: cmd/fleetd/main.go, internal/domain/curtailment/{selector,
service,service_test,enum_stability_test}.go, internal/domain/stores/
interfaces/curtailment.go, internal/handlers/curtailment/handler.go,
migrations/000040_create_curtailment.up.sql, sqlc/queries/curtailment.sql,
and the regenerated server/generated/sqlc/curtailment.sql.go.

No behavior change. Tests, lint, and build remain green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 58c90ec0ed

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

-- and the caller can read the effective row in a single round trip.
INSERT INTO curtailment_org_config (org_id)
VALUES (sqlc.arg('org_id'))
ON CONFLICT (org_id) DO UPDATE SET org_id = EXCLUDED.org_id

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Replace no-op conflict update with non-mutating ensure

ON CONFLICT ... DO UPDATE SET org_id = EXCLUDED.org_id is not a read-only no-op here: curtailment_org_config has a BEFORE UPDATE trigger (update_curtailment_org_config_updated_at in migration 000040) that rewrites updated_at, so every Preview call performs a real row update. This introduces continuous write amplification on a hot read path and corrupts the semantic meaning of updated_at (it becomes "last preview read" rather than "last config change").

Useful? React with 👍 / 👎.

Comment on lines +230 to +234
AND d.deleted_at IS NULL
AND (
sqlc.narg('device_identifiers')::text[] IS NULL
OR d.device_identifier = ANY(sqlc.narg('device_identifiers')::text[])
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add ORDER BY to make tie-breaking deterministic

This candidate query returns rows without an ORDER BY, but selector logic relies on stable ordering for equal-efficiency miners (sort.SliceStable preserves input order). Without deterministic SQL ordering, identical inputs can produce different selected miners across calls whenever efficiencies tie near the target boundary, making preview results non-reproducible.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86e78aeff1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}

// Loop ended without hitting target — totalW is the full-set sum.
if toleranceW > 0 && totalW >= targetW-toleranceW {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject tolerance bands that make zero-load plans succeed

The undershoot branch treats any request with tolerance_kw >= target_kw as successful even when no load is available, because totalW >= targetW - toleranceW becomes true for totalW == 0. With current proto validation (tolerance_kw >= 0 only), a caller can get OutcomeUndershootTolerated and a 200 Preview response with an empty candidate list for a positive target, which is a false-success plan. Please add a guard (for example reject tolerance_kw > target_kw, or otherwise prevent zero/near-zero availability from entering the tolerated-success path).

Useful? React with 👍 / 👎.

…tes allocation

Simplification pass over the curtailment surface — comment trims plus one
small allocation fix. No behavior change.

Comments
- service.go: shorten the cross-org-guard rationale and the
  capability-gate TODO; per-PR depth lives in the ticket, not the source.
- selector.go: replace the 20-line BuildPlan doc that re-narrated the
  function body with a 6-line summary. Kept the load-bearing "pure: no
  time, no I/O, no shared state" invariant.
- modes/mode.go: trim per-Outcome doc comments; keep the non-obvious
  "small overshoot is unavoidable since miners are atomic" note.

Code
- translate.go: collapse the levelName two-if-same-return into a single
  OR-check.
- service.go: pre-allocate skipped with len(cands) capacity in
  classifyCandidates so a low-power fleet doesn't re-grow the slice on
  every iteration.

Verified: go test ./internal/domain/curtailment/...
./internal/handlers/curtailment/... green; golangci-lint 0 issues;
go build ./... clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d329ff326c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +90 to +93
Outcome: OutcomeInsufficientLoad,
Selected: nil,
RealizedReductionW: totalW,
InsufficientDetail: &detail,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Report zero reduction when insufficient load selects no miners

In the insufficient-load branch, Selected is empty but RealizedReductionW is set to totalW, which violates the mode contract that realized reduction is the sum of selected candidates. This makes downstream Plan math contradictory (EstimatedReductionKW can be non-zero while zero miners are selected, and remaining power becomes 0), which can corrupt diagnostics for any caller using Service.Preview directly or if the handler later returns plans for insufficient cases.

Useful? React with 👍 / 👎.

Comment on lines +232 to +233
sqlc.narg('device_identifiers')::text[] IS NULL
OR d.device_identifier = ANY(sqlc.narg('device_identifiers')::text[])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat empty device identifier filters as whole-org scope

The filter condition only treats NULL as whole-org scope; an empty array ([]) falls into ANY('{}'), which matches nothing. This conflicts with the store contract that nil/empty should both mean whole-org and can silently produce empty candidate sets (and false insufficient-load results) when callers pass an empty slice instead of nil.

Useful? React with 👍 / 👎.

@rongxin-liu

Copy link
Copy Markdown
Contributor Author

Closing in favor of a fresh PR off the same branch. Earlier iterations accumulated stale review feedback (now resolved in code) that adds noise to ongoing review. Re-opening with the same diff and PR description on the same branch; will link the replacement here once it's up.

@rongxin-liu rongxin-liu closed this May 6, 2026
@rongxin-liu

Copy link
Copy Markdown
Contributor Author

Replacement: #188. Same branch, same diff, same description — fresh review thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add curtailment persistence and preview selector

1 participant