device-health-oracle: add controller_success device activation criterion #3503

Merged
nikw9944 merged 3 commits into main from nikw9944/doublezero-3493
Apr 14, 2026
@nikw9944 nikw9944 commented Apr 9, 2026

Resolves: #3493

Summary of Changes

  • Add controller_success device activation criterion that queries ClickHouse to verify devices have called the controller at least once per minute over the burn-in period before advancing health
  • Introduce criteria-based evaluation pattern with stage-aware health progression (Pending → ReadyForLinks → ReadyForUsers) to support the many criteria planned in RFC-12
  • Skip onchain health writes when the value is already at the desired state
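  • The ClickHouse connection is optional and configured via the CLICKHOUSE_ADDR env var; when it is not set, the oracle keeps its previous no-criteria behavior, so existing deployments are unaffected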
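
As a rough illustration of the stage-aware progression and the skip-write optimization described above, here is a minimal sketch. The names `Health`, `DeviceCriterion`, `staticCriterion`, `nextHealth`, and `needsWrite` are assumptions made for this example and may not match the code in criteria.go or worker.go.

```go
package main

// Health mirrors the three device health stages named in this PR.
// The type and constant names are illustrative assumptions.
type Health int

const (
	Pending Health = iota
	ReadyForLinks
	ReadyForUsers
)

// DeviceCriterion is a hypothetical stand-in for the interface in criteria.go.
type DeviceCriterion interface {
	Name() string
	Check(devicePubkey string) (bool, error)
}

// staticCriterion is a trivial criterion used only to exercise the sketch.
type staticCriterion struct{ pass bool }

func (s staticCriterion) Name() string                 { return "static" }
func (s staticCriterion) Check(string) (bool, error)   { return s.pass, nil }

// nextHealth advances a device by at most one stage per tick, and only
// when every criterion passes. A failing or erroring criterion holds the
// device at its current stage.
func nextHealth(current Health, devicePubkey string, criteria []DeviceCriterion) (Health, error) {
	if current >= ReadyForUsers {
		return current, nil
	}
	for _, c := range criteria {
		ok, err := c.Check(devicePubkey)
		if err != nil {
			return current, err
		}
		if !ok {
			return current, nil
		}
	}
	return current + 1, nil
}

// needsWrite reports whether an onchain update is required.
// Writes are skipped when the value is already at the desired state.
func needsWrite(current, desired Health) bool {
	return current != desired
}
```

Because each call moves a device forward by one stage at most, a device at Pending needs at least two ticks to reach ReadyForUsers, which matches the progression stated in the summary.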

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 4 | +332 / -4 | +328 |
| Scaffolding | 3 | +72 / -2 | +70 |
| Tests | 3 | +437 / -0 | +437 |
| Docs | 2 | +4 / -2 | +2 |

~330 lines of core logic implementing the criteria pattern and ClickHouse integration, well-covered by ~440 lines of tests.

Key files
  • criteria.go — DeviceCriterion/LinkCriterion interfaces, BurnInTimes context helpers, and stage-aware evaluators
  • clickhouse.go — ClickHouse client with ControllerCallCoverage query and database name validation
  • controller_success.go — ControllerSuccessCriterion querying ClickHouse for per-minute call coverage
  • worker.go — Evaluator integration, GetBlockTime burn-in resolution, skip-update optimization

Testing Verification

  • Evaluator enforces stage progression — devices at Pending advance to ReadyForLinks (not directly to ReadyForUsers), and failing criteria block advancement
  • ClickHouse client tested via mocked driver.Conn, verifying query construction, database name quoting, and error propagation
  • Burn-in slot count correctly selects provisioning (200K) vs drained (5K) based on status.IsDrained()
  • Zero-length burn-in window (new environment) short-circuits to pass without querying ClickHouse

@nikw9944 nikw9944 marked this pull request as draft April 9, 2026 14:38
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3493 branch from c4e29fb to efe6fa4 Compare April 9, 2026 14:45
@nikw9944 nikw9944 changed the title from "device-health-oracle: add controller_success activation criterion" to "device-health-oracle: add controller_success device activation criterion" Apr 9, 2026
Add a criteria-based evaluation pattern to the device-health-oracle and
implement the first criterion: devices must have called the controller
at least once per minute over the burn-in period (verified via ClickHouse
controller_grpc_getconfig_success table).

- Introduce DeviceCriterion/LinkCriterion interfaces and stage-aware
  evaluators that enforce Pending → ReadyForLinks → ReadyForUsers
  progression for devices (minimum two ticks to reach ReadyForUsers)
- Add ControllerSuccessCriterion querying ClickHouse for controller call
  coverage over the burn-in window, with start times resolved via
  GetBlockTime
- Optimize update logic to skip onchain health writes when the value is
  already at the desired state
- ClickHouse connection is optional via CLICKHOUSE_ADDR env var; when
  not set, the oracle falls back to no-criteria behavior
- Validate ClickHouse database name to prevent SQL injection
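
One common way to validate a database name before interpolating it into a query is an allowlist pattern followed by identifier quoting. The sketch below shows that approach; the exact rule used in clickhouse.go is not shown in this PR, so the pattern and the function names `quoteDatabase` and `coverageTable` are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// validDBName accepts only plain identifiers: a letter or underscore
// followed by letters, digits, or underscores.
var validDBName = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// quoteDatabase rejects anything outside the allowlist and returns the
// name wrapped in backticks, which ClickHouse accepts as identifier quotes.
func quoteDatabase(name string) (string, error) {
	if !validDBName.MatchString(name) {
		return "", fmt.Errorf("invalid ClickHouse database name %q", name)
	}
	return "`" + name + "`", nil
}

// coverageTable builds the fully qualified name of the table named in
// the commit message above.
func coverageTable(db string) (string, error) {
	quoted, err := quoteDatabase(db)
	if err != nil {
		return "", err
	}
	return quoted + ".controller_grpc_getconfig_success", nil
}
```

Database and table names cannot be passed as bound query parameters in most SQL drivers, which is why the name is validated up front instead.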
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3493 branch from 00575f6 to efb5861 Compare April 13, 2026 14:57
@nikw9944 nikw9944 marked this pull request as ready for review April 13, 2026 15:35
Comment thread controlplane/device-health-oracle/internal/worker/worker.go
Comment thread controlplane/device-health-oracle/cmd/device-health-oracle/main.go
Comment thread controlplane/device-health-oracle/cmd/device-health-oracle/main.go Outdated
Comment thread controlplane/device-health-oracle/internal/worker/criteria.go
- Set burn-in start to Now when slot is 0 (new environments) so the
  zero-length window passes criteria immediately
- Default CLICKHOUSE_DB to "default" instead of env name, matching
  controller and geoprobe conventions
- Log error on every tick when no device criteria are configured
- Use base58 pubkey string instead of raw bytes in criterion logs
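
The optional ClickHouse configuration and the "default" database fallback could be loaded along these lines. CLICKHOUSE_ADDR and CLICKHOUSE_DB are the variables named in this PR; the `clickhouseConfig` struct, the function names, and the injected lookup function are assumptions for the sketch.

```go
package main

import "os"

// clickhouseConfig holds the two settings discussed in this PR.
type clickhouseConfig struct {
	Addr     string
	Database string
}

// loadClickhouseConfig reads settings through lookup so it can be tested
// without touching the process environment. The boolean is false when
// CLICKHOUSE_ADDR is unset or empty, meaning ClickHouse is disabled.
func loadClickhouseConfig(lookup func(string) (string, bool)) (clickhouseConfig, bool) {
	addr, ok := lookup("CLICKHOUSE_ADDR")
	if !ok || addr == "" {
		return clickhouseConfig{}, false
	}
	db, ok := lookup("CLICKHOUSE_DB")
	if !ok || db == "" {
		db = "default" // matches the controller and geoprobe convention
	}
	return clickhouseConfig{Addr: addr, Database: db}, true
}

// loadFromEnv wires the loader to the real environment.
func loadFromEnv() (clickhouseConfig, bool) {
	return loadClickhouseConfig(os.LookupEnv)
}
```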
@nikw9944 nikw9944 enabled auto-merge (squash) April 14, 2026 20:49
@nikw9944 nikw9944 merged commit 7c3c957 into main Apr 14, 2026
33 checks passed
@nikw9944 nikw9944 deleted the nikw9944/doublezero-3493 branch April 14, 2026 21:06
Development

Successfully merging this pull request may close these issues.

device-health-oracle: Add activation criterion: devices must consistently call controller
