| 1 | +# OpenTelemetry Telemetry Strategy |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Migrate the controller from `prometheus/client_golang` to the OpenTelemetry metrics SDK. The design follows the Google SRE Four Golden Signals: start with core service health, then add granularity once the sharp edges are understood. |
| 6 | + |
| 7 | +## Architecture |
| 8 | + |
| 9 | +Single `MeterProvider` with two readers: |
| 10 | + |
| 11 | +``` |
| 12 | +                                    +---> OTLP PeriodicReader ---> user's collector |
| 13 | +                                    | |
| 14 | +OTel MeterProvider (with 2 readers) | |
| 15 | +                                    | |
| 16 | +                                    +---> Prometheus Exporter ---> controller-runtime /metrics |
| 17 | +``` |
| 18 | + |
| 19 | +- **Prometheus exporter**: registers into controller-runtime's `metrics.Registry`. OTel metrics appear alongside controller-runtime's own metrics on `/metrics`. Zero scraping changes. |
| 20 | +- **OTLP exporter**: enabled when `OTEL_EXPORTER_OTLP_ENDPOINT` is set. Users point at any backend. |
| 21 | + |
| 22 | +## Configuration |
| 23 | + |
| 24 | +Standard OTel environment variables — no custom flags: |
| 25 | +- `OTEL_EXPORTER_OTLP_ENDPOINT` — collector address |
| 26 | +- `OTEL_SERVICE_NAME` — override (defaults to `sei-k8s-controller`) |
| 27 | +- `OTEL_RESOURCE_ATTRIBUTES` — user-provided extras |
| 28 | + |
| 29 | +## Resource Attributes (Soft Context) |
| 30 | + |
| 31 | +| Attribute | Source | |
| 32 | +|-----------|--------| |
| 33 | +| `service.name` | Default `sei-k8s-controller`; overridable via `OTEL_SERVICE_NAME` | |
| 34 | +| `service.version` | Build-time ldflags | |
| 35 | +| `k8s.pod.name` | Downward API | |
| 36 | +| `k8s.namespace.name` | Downward API | |
| 37 | + |
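The precedence implied by the configuration and the table above can be sketched in plain Go. This is an illustration only, not the controller's code: the env var names `POD_NAME` and `POD_NAMESPACE`, the `version` variable, and the helper `resourceAttributes` are hypothetical, and a real implementation would likely hand these values to the OTel SDK as resource attributes rather than build a map.

```go
package main

import (
	"fmt"
	"os"
)

// defaultServiceName is the documented default for service.name.
const defaultServiceName = "sei-k8s-controller"

// version is a hypothetical variable, assumed to be set at build time,
// e.g. with -ldflags "-X main.version=v1.2.3".
var version = "dev"

// resourceAttributes resolves the four attributes from the table.
// getenv is injected so the lookup can be exercised without touching the
// process environment. POD_NAME and POD_NAMESPACE are assumed names for
// the Downward API env vars.
func resourceAttributes(getenv func(string) string) map[string]string {
	name := getenv("OTEL_SERVICE_NAME")
	if name == "" {
		name = defaultServiceName
	}
	attrs := map[string]string{
		"service.name":    name,
		"service.version": version,
	}
	if v := getenv("POD_NAME"); v != "" {
		attrs["k8s.pod.name"] = v
	}
	if v := getenv("POD_NAMESPACE"); v != "" {
		attrs["k8s.namespace.name"] = v
	}
	return attrs
}

// demoAttrs resolves attributes against a fake environment.
func demoAttrs() map[string]string {
	env := map[string]string{"POD_NAME": "controller-0", "POD_NAMESPACE": "sei-system"}
	return resourceAttributes(func(k string) string { return env[k] })
}

func main() {
	fmt.Println(resourceAttributes(os.Getenv))
	fmt.Println(demoAttrs())
}
```

Injecting `getenv` keeps the resolution logic a pure function, which makes the default/override behaviour easy to unit test.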
| 38 | +## Four Golden Signals → 7 Metrics |
| 39 | + |
| 40 | +### LATENCY (2 metrics) |
| 41 | + |
| 42 | +**1. `sei.controller.node.phase.duration`** — Float64 Histogram (unit: s) |
| 43 | +- Dimensions: `namespace`, `chain_id`, `phase` |
| 44 | +- Buckets: 10, 30, 60, 120, 300, 600, 900, 1800, 3600 |
| 45 | +- Records the time spent in each phase when a node transitions out of it. Phase is the dimension, so there is no separate init-specific metric. `Initializing` durations indicate bootstrap health; `Running` durations are expected to be long and may exceed the largest bucket (3600 s). |
| 46 | +- Requires adding `PhaseTransitionTime` to SeiNodeStatus (set on every phase transition, observed on the next transition). |
| 47 | +- Cardinality at 200 nodes: ~5 phases × ~5 chains = ~25 histogram series. No per-node label. |
| 48 | + |
| 49 | +**2. `sei.controller.plan.duration`** — Float64 Histogram (unit: s) |
| 50 | +- Dimensions: `controller`, `namespace`, `plan_type` |
| 51 | +- Buckets: 1, 5, 10, 30, 60, 120, 300, 600, 1800 |
| 52 | +- Records wall-clock time from plan creation to completion/failure. Answers "how long do init plans take vs node-update plans?" |
| 53 | +- Cardinality: ~2 controllers × ~4 plan_types = ~8 histogram series. |
| 54 | + |
| 55 | +### TRAFFIC (2 metrics) |
| 56 | + |
| 57 | +**3. `sei.controller.node.phase`** — Observable Float64 Gauge |
| 58 | +- Dimensions: `namespace`, `name`, `phase` |
| 59 | +- 0/1 per phase (kube-state-metrics pattern). Observable gauge with phaseTracker callback — cleanup on deletion is automatic. |
| 60 | +- Cardinality at 200 nodes: 200 × 5 phases = 1,000 series. |
| 61 | + |
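One possible shape for the phaseTracker behind the 0/1 gauge, sketched with the standard library only. The type, its methods, and the phase list are assumptions (only `Initializing` and `Running` appear in this document); in the real controller the snapshot would be emitted from an observable gauge callback rather than returned as a slice.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// nodeKey identifies a node by namespace and name.
type nodeKey struct{ Namespace, Name string }

// sample is one gauge data point: 1 if the node is in Phase, otherwise 0.
type sample struct {
	Namespace, Name, Phase string
	Value                  float64
}

// phaseTracker (hypothetical shape) holds the current phase of each node.
type phaseTracker struct {
	mu     sync.Mutex
	phases []string
	nodes  map[nodeKey]string
}

func newPhaseTracker(phases ...string) *phaseTracker {
	return &phaseTracker{phases: phases, nodes: map[nodeKey]string{}}
}

// Set records the node's current phase.
func (t *phaseTracker) Set(ns, name, phase string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.nodes[nodeKey{ns, name}] = phase
}

// Delete forgets a node; its series disappear from the next snapshot.
func (t *phaseTracker) Delete(ns, name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.nodes, nodeKey{ns, name})
}

// Snapshot returns one sample per (node, phase) pair in a stable order.
func (t *phaseTracker) Snapshot() []sample {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make([]sample, 0, len(t.nodes)*len(t.phases))
	for key, current := range t.nodes {
		for _, p := range t.phases {
			v := 0.0
			if p == current {
				v = 1
			}
			out = append(out, sample{key.Namespace, key.Name, p, v})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		if a.Name != b.Name {
			return a.Name < b.Name
		}
		return a.Phase < b.Phase
	})
	return out
}

// demoSeries tracks two nodes, deletes one, and summarizes the snapshot.
func demoSeries() (total int, active int) {
	// Phase names other than Initializing and Running are placeholders.
	t := newPhaseTracker("Pending", "Initializing", "Running", "Failed", "Terminating")
	t.Set("sei", "node-0", "Running")
	t.Set("sei", "node-1", "Initializing")
	t.Delete("sei", "node-1")
	for _, s := range t.Snapshot() {
		total++
		if s.Value == 1 {
			active++
		}
	}
	return total, active
}

func main() {
	fmt.Println(demoSeries())
}
```

Since every collection reports only what is currently in the map, deleting a node removes all of its phase series without any explicit unregister step, which is the "automatic cleanup" property described above.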
| 62 | +**4. `sei.controller.phase.transitions`** — Int64 Counter |
| 63 | +- Dimensions: `controller`, `namespace`, `from_phase`, `to_phase` |
| 64 | +- Unified for both SeiNode and SeiNodeDeployment (the `controller` label disambiguates). |
| 65 | +- Cardinality: ~2 controllers × ~10 valid transitions = ~20 series. |
| 66 | + |
| 67 | +### ERRORS (2 metrics) |
| 68 | + |
| 69 | +**5. `sei.controller.plan.failures`** — Int64 Counter |
| 70 | +- Dimensions: `controller`, `namespace`, `plan_type` |
| 71 | +- Incremented when a plan reaches terminal Failed state. |
| 72 | +- Cardinality: ~2 controllers × ~4 plan_types = ~8 series. |
| 73 | + |
| 74 | +**6. `sei.controller.reconcile.errors`** — Int64 Counter |
| 75 | +- Dimensions: `controller`, `namespace` |
| 76 | +- No per-node `name` label — unbounded cardinality at scale. Namespace-scoped is sufficient for alerting; logs have the node name for debugging. |
| 77 | +- Cardinality: ~2 controllers × ~3 namespaces = ~6 series. |
| 78 | + |
| 79 | +### SATURATION (1 metric) |
| 80 | + |
| 81 | +**7. `sei.controller.plan.active`** — Int64 Gauge |
| 82 | +- Dimensions: `controller`, `namespace` |
| 83 | +- Aggregate count of in-progress plans (not per-node). Answers "how loaded is the controller?" |
| 84 | +- Cardinality: ~2 controllers × ~3 namespaces = ~6 series. |
| 85 | + |
| 86 | +### Deployment-Level (optional, low cardinality) |
| 87 | + |
| 88 | +These cover SeiNodeDeployment resources (3-10 per cluster, not 200): |
| 89 | + |
| 90 | +**8. `sei.controller.deployment.phase`** — Observable Float64 Gauge |
| 91 | +- Dimensions: `namespace`, `name`, `phase` |
| 92 | +- Same 0/1 pattern. |
| 93 | + |
| 94 | +**9. `sei.controller.deployment.replicas`** — Float64 Gauge |
| 95 | +- Dimensions: `namespace`, `name`, `replica_state` (desired/ready) |
| 96 | +- Renamed from `type` to `replica_state` to avoid attribute overload. |
| 97 | + |
| 98 | +## What Gets Cut |
| 99 | + |
| 100 | +| Metric | Reason | |
| 101 | +|--------|--------| |
| 102 | +| `seinode.init_duration_seconds` | Replaced by generic `node.phase.duration{phase=Initializing}` | |
| 103 | +| `seinode.last_init_duration_seconds` | Per-node gauge, high cardinality, replaced by phase duration histogram | |
| 104 | +| `seinode.phase_transitions_total` | Replaced by unified `phase.transitions` counter | |
| 105 | +| `task.retries_total` | Task-level detail, too granular for launch | |
| 106 | +| `task.failures_total` | Rolled up into `plan.failures` | |
| 107 | +| `reconcile_errors_total` (per-name) | Fixed: drop `name` label, keep namespace-scoped | |
| 108 | +| `seinodedeployment.condition` | Conditions are in CRD status — queryable via kube API, not needed in metrics for launch | |
| 109 | +| `seinodedeployment.reconcile_substep_duration` | Internal implementation detail | |
| 110 | + |
| 111 | +## Attribute Constants |
| 112 | + |
| 113 | +```go |
| 114 | +var ( |
| 115 | + AttrController = attribute.Key("controller") |
| 116 | + AttrNamespace = attribute.Key("namespace") |
| 117 | + AttrName = attribute.Key("name") |
| 118 | + AttrPhase = attribute.Key("phase") |
| 119 | + AttrFromPhase = attribute.Key("from_phase") |
| 120 | + AttrToPhase = attribute.Key("to_phase") |
| 121 | + AttrChainID = attribute.Key("chain_id") |
| 122 | + AttrPlanType = attribute.Key("plan_type") |
| 123 | + AttrReplicaState = attribute.Key("replica_state") |
| 124 | +) |
| 125 | +``` |
| 126 | + |
| 127 | +No overloaded `type` attribute. |
| 128 | + |
| 129 | +## Cardinality at 200 Nodes |
| 130 | + |
| 131 | +| Metric | Series | |
| 132 | +|--------|--------| |
| 133 | +| node.phase.duration (histogram) | ~25 | |
| 134 | +| plan.duration (histogram) | ~8 | |
| 135 | +| node.phase (gauge) | ~1,000 | |
| 136 | +| phase.transitions (counter) | ~20 | |
| 137 | +| plan.failures (counter) | ~8 | |
| 138 | +| reconcile.errors (counter) | ~6 | |
| 139 | +| plan.active (gauge) | ~6 | |
| 140 | +| deployment.phase (gauge) | ~50 | |
| 141 | +| deployment.replicas (gauge) | ~20 | |
| 142 | +| **Total** | **~1,143 series** | |
| 143 | + |
| 144 | +~70% reduction from the previous design. |
| 145 | + |
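The total in the table can be re-derived from the per-metric factors quoted earlier. The sketch below is just that arithmetic in Go; the deployment factors (10 deployments, 5 phases, 2 replica states) are an assumption inferred from the "3-10 per cluster" upper bound, since the document lists only the resulting ~50 and ~20.

```go
package main

import "fmt"

// estimate pairs a metric with its approximate series count.
type estimate struct {
	Metric string
	Series int
}

// estimates rebuilds each row of the cardinality table from its factors.
func estimates() []estimate {
	const (
		nodes       = 200
		phases      = 5
		chains      = 5
		controllers = 2
		planTypes   = 4
		transitions = 10
		namespaces  = 3
		deployments = 10 // assumed upper bound of "3-10 per cluster"
		replicaKind = 2  // desired, ready
	)
	return []estimate{
		{"node.phase.duration", phases * chains},
		{"plan.duration", controllers * planTypes},
		{"node.phase", nodes * phases},
		{"phase.transitions", controllers * transitions},
		{"plan.failures", controllers * planTypes},
		{"reconcile.errors", controllers * namespaces},
		{"plan.active", controllers * namespaces},
		{"deployment.phase", deployments * phases},
		{"deployment.replicas", deployments * replicaKind},
	}
}

// totalSeries sums the per-metric estimates.
func totalSeries() int {
	total := 0
	for _, e := range estimates() {
		total += e.Series
	}
	return total
}

func main() {
	for _, e := range estimates() {
		fmt.Printf("%-22s %5d\n", e.Metric, e.Series)
	}
	fmt.Println("total:", totalSeries())
}
```

The per-node `node.phase` gauge accounts for 1,000 of the total, so node count is the factor to watch as the fleet grows.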
| 146 | +## Alerts (4) |
| 147 | + |
| 148 | +| Alert | Signal | Expression | |
| 149 | +|-------|--------|-----------| |
| 150 | +| Node stuck | Latency | `node.phase{phase=Initializing}` held for >30min with no transition | |
| 151 | +| Plan failure rate | Errors | `rate(plan.failures) / rate(phase.transitions)` > threshold | |
| 152 | +| Sustained reconcile errors | Errors | controller-runtime `controller_runtime_reconcile_total{result="error"}` >25% of reconciles over 10min | |
| 153 | +| Controller saturated | Saturation | `plan.active` growing monotonically OR `workqueue_depth` growing | |
| 154 | + |
| 155 | +## CRD Change |
| 156 | + |
| 157 | +Add `PhaseTransitionTime` to `SeiNodeStatus`: |
| 158 | +```go |
| 159 | +// PhaseTransitionTime is when the node last changed phases. |
| 160 | +// Used to compute phase duration metrics. |
| 161 | +// +optional |
| 162 | +PhaseTransitionTime *metav1.Time `json:"phaseTransitionTime,omitempty"` |
| 163 | +``` |
| 164 | + |
| 165 | +## Implementation Phases |
| 166 | + |
| 167 | +### Phase 1: OTel Infrastructure |
| 168 | +- Add OTel SDK dependencies |
| 169 | +- `initMeterProvider()` in `cmd/main.go` |
| 170 | +- Attribute constants in `observability/` |
| 171 | +- Downward API env vars |
| 172 | +- No metric changes |
| 173 | + |
| 174 | +### Phase 2: Core Metrics (the 7+2) |
| 175 | +- Replace all existing metrics with the new set |
| 176 | +- Add `PhaseTransitionTime` to CRD |
| 177 | +- Implement phaseTracker for observable gauges |
| 178 | +- Single PR, atomic swap |
| 179 | + |
| 180 | +### Phase 3: Remove Prometheus Client |
| 181 | +- Remove `prometheus/client_golang` direct dependency |
| 182 | +- It remains only as a transitive dependency through controller-runtime |