device-health-oracle: add controller_success device activation criterion#3503
Merged
Add a criteria-based evaluation pattern to the device-health-oracle and implement the first criterion: devices must have called the controller at least once per minute over the burn-in period (verified via the ClickHouse controller_grpc_getconfig_success table).

- Introduce DeviceCriterion/LinkCriterion interfaces and stage-aware evaluators that enforce Pending → ReadyForLinks → ReadyForUsers progression for devices (minimum two ticks to reach ReadyForUsers)
- Add ControllerSuccessCriterion querying ClickHouse for controller call coverage over the burn-in window, with start times resolved via GetBlockTime
- Optimize update logic to skip onchain health writes when the value is already at the desired state
- ClickHouse connection is optional via CLICKHOUSE_ADDR env var; when not set, the oracle falls back to no-criteria behavior
- Validate ClickHouse database name to prevent SQL injection
- Set burn-in start to Now when slot is 0 (new environments) so the zero-length window passes criteria immediately
- Default CLICKHOUSE_DB to "default" instead of env name, matching controller and geoprobe conventions
- Log error on every tick when no device criteria are configured
- Use base58 pubkey string instead of raw bytes in criterion logs
martinsander00
approved these changes
Apr 14, 2026
Resolves: #3493
Summary of Changes
- Adds a `controller_success` device activation criterion that queries ClickHouse to verify devices have called the controller at least once per minute over the burn-in period before advancing health
- Introduces a criteria-based evaluation pattern with stage-aware progression (`Pending → ReadyForLinks → ReadyForUsers`) to support the many criteria planned in RFC-12
- ClickHouse connection is optional via the `CLICKHOUSE_ADDR` env var; backward compatible when not set

Diff Breakdown
~330 lines of core logic implementing the criteria pattern and ClickHouse integration, well-covered by ~440 lines of tests.
Key files
- `criteria.go`: DeviceCriterion/LinkCriterion interfaces, BurnInTimes context helpers, and stage-aware evaluators
- `clickhouse.go`: ClickHouse client with ControllerCallCoverage query and database name validation
- `controller_success.go`: ControllerSuccessCriterion querying ClickHouse for per-minute call coverage
- `worker.go`: Evaluator integration, GetBlockTime burn-in resolution, skip-update optimization

Testing Verification
- Devices in `Pending` advance to `ReadyForLinks` (not directly to `ReadyForUsers`), and failing criteria block advancement
- ClickHouse client tested through `driver.Conn`, verifying query construction, database name quoting, and error propagation
- […] `status.IsDrained()`
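Database name validation and quoting of the kind tested above might look like the following sketch. The allow-list pattern, the `qualifiedTable` helper, and the use of backtick quoting are assumptions for illustration, not the PR's actual code; only the table name `controller_grpc_getconfig_success` comes from the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// validDBName is an assumed allow-list: a letter or underscore followed by
// letters, digits, or underscores. The PR's actual rule may be different.
var validDBName = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// qualifiedTable validates the database name and returns a backtick-quoted
// db.table reference suitable for interpolation into a query string.
// Identifiers cannot be bound as query parameters, which is why the name is
// checked before it is embedded.
func qualifiedTable(db, table string) (string, error) {
	if !validDBName.MatchString(db) {
		return "", fmt.Errorf("invalid ClickHouse database name %q", db)
	}
	return fmt.Sprintf("`%s`.`%s`", db, table), nil
}

func main() {
	for _, db := range []string{"default", "x`; DROP TABLE y"} {
		ref, err := qualifiedTable(db, "controller_grpc_getconfig_success")
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("SELECT count() FROM " + ref)
	}
}
```

Rejecting anything outside the allow-list before quoting keeps the check simple: no escaping logic is needed because a name that could break out of the quotes never reaches the query.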