
Commit 7027b90

bdchatham and claude authored
design: OpenTelemetry telemetry strategy (#96)
* design: OpenTelemetry telemetry strategy

  Operational-first telemetry design for the sei-k8s-controller. Migrates from prometheus/client_golang to the OTel metrics SDK so users can configure any backend.

  Key decisions:
  - 15 metrics (cut 4, keep 8 with new dimensions, add 7)
  - Every metric answers a specific operational question
  - OTel MeterProvider with dual readers (Prometheus bridge + OTLP)
  - Resource attributes for soft context (service identity)
  - Per-datapoint attributes match existing Prometheus labels for compat
  - 4 alerts designed around the "stuck vs slow" distinction
  - 5-phase incremental migration, each phase independently deployable

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* design: slim to Four Golden Signals — 7+2 metrics, cut task granularity

  Addresses PR feedback: start with the Google SRE Golden Signals, add granularity later.

  Key changes:
  - 7 core metrics + 2 optional deployment metrics (down from 15)
  - Phase duration is generic — phase IS the dimension, not a separate metric
  - Cut task-level details (retries, duration, progress) — too granular for launch
  - Cut sidecar observability — separate dashboard later
  - Fix attribute overload: `type` → `replica_state` / `condition_type`
  - ~1,143 series at 200 nodes (70% reduction)
  - 3 implementation phases instead of 5

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 993c785 commit 7027b90

1 file changed

Lines changed: 182 additions & 0 deletions

# OpenTelemetry Telemetry Strategy

## Overview

Migrate the controller from `prometheus/client_golang` to the OpenTelemetry metrics SDK. Designed around the Google SRE Four Golden Signals — start with core service health, add granularity once we understand the sharp edges.
## Architecture

Single `MeterProvider` with two readers:

```
                                      +---> OTLP PeriodicReader ---> user's collector
                                      |
OTel MeterProvider (with 2 readers) --+
                                      |
                                      +---> Prometheus Exporter ---> controller-runtime /metrics
```

- **Prometheus exporter**: registers into controller-runtime's `metrics.Registry`. OTel metrics appear alongside controller-runtime's own metrics on `/metrics`. Zero scraping changes.
- **OTLP exporter**: enabled when `OTEL_EXPORTER_OTLP_ENDPOINT` is set. Users point it at any backend.
## Configuration

Standard OTel environment variables — no custom flags:

- `OTEL_EXPORTER_OTLP_ENDPOINT` — collector address
- `OTEL_SERVICE_NAME` — override (defaults to `sei-k8s-controller`)
- `OTEL_RESOURCE_ATTRIBUTES` — user-provided extras
## Resource Attributes (Soft Context)

| Attribute | Source |
|-----------|--------|
| `service.name` | Hardcoded: `sei-k8s-controller` |
| `service.version` | Build-time ldflags |
| `k8s.pod.name` | Downward API |
| `k8s.namespace.name` | Downward API |
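The Downward API rows above map to env vars on the controller Deployment; a minimal sketch (the env var names `K8S_POD_NAME`/`K8S_NAMESPACE_NAME` are illustrative assumptions, the `fieldPath` values are standard Kubernetes Downward API fields):

```yaml
env:
  - name: K8S_POD_NAME          # assumed name; read at startup into k8s.pod.name
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: K8S_NAMESPACE_NAME    # assumed name; read at startup into k8s.namespace.name
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
```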
## Four Golden Signals → 7 Metrics

### LATENCY (2 metrics)

**1. `sei.controller.node.phase.duration`** — Float64 Histogram (unit: s)
- Dimensions: `namespace`, `chain_id`, `phase`
- Buckets: 10, 30, 60, 120, 300, 600, 900, 1800, 3600
- Records time spent in each phase when a node transitions out. Phase IS the dimension — no separate init-specific metric. Initializing durations tell you bootstrap health; Running durations are expected to be long.
- Requires adding `PhaseTransitionTime` to `SeiNodeStatus` (set on every phase transition, observed on the next transition).
- Cardinality at 200 nodes: ~5 phases × ~5 chains = ~25 histogram series. No per-node label.
**2. `sei.controller.plan.duration`** — Float64 Histogram (unit: s)
- Dimensions: `controller`, `namespace`, `plan_type`
- Buckets: 1, 5, 10, 30, 60, 120, 300, 600, 1800
- Records wall-clock time from plan creation to completion/failure. Answers "how long do init plans take vs node-update plans?"
- Cardinality: ~2 controllers × ~4 plan_types = ~8 histogram series.
### TRAFFIC (2 metrics)

**3. `sei.controller.node.phase`** — Observable Float64 Gauge
- Dimensions: `namespace`, `name`, `phase`
- 0/1 per phase (kube-state-metrics pattern). Observable gauge with a phaseTracker callback — cleanup on deletion is automatic.
- Cardinality at 200 nodes: 200 × 5 phases = 1,000 series.
**4. `sei.controller.phase.transitions`** — Int64 Counter
- Dimensions: `controller`, `namespace`, `from_phase`, `to_phase`
- Unified for both SeiNode and SeiNodeDeployment (the `controller` label disambiguates).
- Cardinality: ~2 controllers × ~10 valid transitions = ~20 series.
### ERRORS (2 metrics)

**5. `sei.controller.plan.failures`** — Int64 Counter
- Dimensions: `controller`, `namespace`, `plan_type`
- Incremented when a plan reaches the terminal Failed state.
- Cardinality: ~2 controllers × ~4 plan_types = ~8 series.

**6. `sei.controller.reconcile.errors`** — Int64 Counter
- Dimensions: `controller`, `namespace`
- No per-node `name` label — that would be unbounded cardinality at scale. Namespace-scoped is sufficient for alerting; logs carry the node name for debugging.
- Cardinality: ~2 controllers × ~3 namespaces = ~6 series.
### SATURATION (1 metric)

**7. `sei.controller.plan.active`** — Int64 Gauge
- Dimensions: `controller`, `namespace`
- Aggregate count of in-progress plans (not per-node). Answers "how loaded is the controller?"
- Cardinality: ~2 controllers × ~3 namespaces = ~6 series.
### Deployment-Level (optional, low cardinality)

These cover SeiNodeDeployment resources (3-10 per cluster, not 200):

**8. `sei.controller.deployment.phase`** — Observable Float64 Gauge
- Dimensions: `namespace`, `name`, `phase`
- Same 0/1 pattern.

**9. `sei.controller.deployment.replicas`** — Float64 Gauge
- Dimensions: `namespace`, `name`, `replica_state` (desired/ready)
- Renamed from `type` to `replica_state` to avoid attribute overload.
## What Gets Cut

| Metric | Reason |
|--------|--------|
| `seinode.init_duration_seconds` | Replaced by generic `node.phase.duration{phase=Initializing}` |
| `seinode.last_init_duration_seconds` | Per-node gauge, high cardinality; replaced by the phase duration histogram |
| `seinode.phase_transitions_total` | Replaced by the unified `phase.transitions` counter |
| `task.retries_total` | Task-level detail, too granular for launch |
| `task.failures_total` | Rolled up into `plan.failures` |
| `reconcile_errors_total` (per-name) | Fixed: drop the `name` label, keep namespace scope |
| `seinodedeployment.condition` | Conditions live in CRD status — queryable via the kube API, not needed in metrics for launch |
| `seinodedeployment.reconcile_substep_duration` | Internal implementation detail |
## Attribute Constants

```go
var (
	AttrController   = attribute.Key("controller")
	AttrNamespace    = attribute.Key("namespace")
	AttrName         = attribute.Key("name")
	AttrPhase        = attribute.Key("phase")
	AttrFromPhase    = attribute.Key("from_phase")
	AttrToPhase      = attribute.Key("to_phase")
	AttrChainID      = attribute.Key("chain_id")
	AttrPlanType     = attribute.Key("plan_type")
	AttrReplicaState = attribute.Key("replica_state")
)
```

No overloaded `type` attribute.
## Cardinality at 200 Nodes

| Metric | Series |
|--------|--------|
| node.phase.duration (histogram) | ~25 |
| plan.duration (histogram) | ~8 |
| node.phase (gauge) | ~1,000 |
| phase.transitions (counter) | ~20 |
| plan.failures (counter) | ~8 |
| reconcile.errors (counter) | ~6 |
| plan.active (gauge) | ~6 |
| deployment.phase (gauge) | ~50 |
| deployment.replicas (gauge) | ~20 |
| **Total** | **~1,143 series** |

~70% reduction from the previous design.
## Alerts (4)

| Alert | Signal | Expression |
|-------|--------|------------|
| Node stuck | Latency | `node.phase{phase=Initializing}` held for >30 min with no transition |
| Plan failure rate | Errors | `rate(plan.failures) / rate(phase.transitions)` > threshold |
| Sustained reconcile errors | Errors | controller-runtime `reconcile_total{result=error}` >25% over 10 min |
| Controller saturated | Saturation | `plan.active` growing monotonically OR `workqueue_depth` growing |
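As a rough sketch, the first two rows could become PrometheusRule entries like the following. This assumes the standard Prometheus bridge naming (dots become underscores, counters gain a `_total` suffix); the exact metric names, thresholds, and severities are assumptions, not from the source:

```yaml
groups:
  - name: sei-controller
    rules:
      - alert: SeiNodeStuckInitializing
        # Gauge held at 1 for the whole 30m window => no transition out.
        expr: min_over_time(sei_controller_node_phase{phase="Initializing"}[30m]) == 1
        labels:
          severity: warning
      - alert: SeiPlanFailureRate
        # Example threshold of 5%; tune per environment.
        expr: |
          sum(rate(sei_controller_plan_failures_total[10m]))
            / sum(rate(sei_controller_phase_transitions_total[10m])) > 0.05
        labels:
          severity: warning
```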
## CRD Change

Add `PhaseTransitionTime` to `SeiNodeStatus`:

```go
// PhaseTransitionTime is when the node last changed phases.
// Used to compute phase duration metrics.
// +optional
PhaseTransitionTime *metav1.Time `json:"phaseTransitionTime,omitempty"`
```
## Implementation Phases

### Phase 1: OTel Infrastructure
- Add OTel SDK dependencies
- `initMeterProvider()` in `cmd/main.go`
- Attribute constants in `observability/`
- Downward API env vars
- No metric changes

### Phase 2: Core Metrics (the 7+2)
- Replace all existing metrics with the new set
- Add `PhaseTransitionTime` to the CRD
- Implement phaseTracker for observable gauges
- Single PR, atomic swap

### Phase 3: Remove Prometheus Client
- Remove the `prometheus/client_golang` direct dependency
- Keep it only as a transitive dependency through controller-runtime
