feat: replace all Prometheus metrics with OTel instruments (7+2) (#98)
* feat: replace all Prometheus metrics with OTel instruments (7+2)
Phase 2 of the OTel migration. Atomic swap of all 12 prometheus/client_golang
metrics with 7 core + 2 deployment OTel instruments mapped to the Four
Golden Signals.
LATENCY:
- sei.controller.seinode.phase.duration — time in each phase (histogram)
- sei.controller.plan.duration — plan wall-clock time (histogram)
TRAFFIC:
- sei.controller.seinode.phase — observable gauge with PhaseTracker callback
- sei.controller.seinodedeployment.phase — same pattern
ERRORS:
- sei.controller.plan.failures — terminal plan failure counter
- sei.controller.reconcile.errors — namespace-scoped error counter
SATURATION:
- sei.controller.plan.active — UpDownCounter for concurrent plans
DEPLOYMENT:
- sei.controller.seinodedeployment.replicas — replica_state dimension
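The observable-gauge pattern used for the two phase instruments above can be sketched with the standard library alone. This is a minimal illustration, not the controller's code: only the name PhaseTracker comes from this commit, while the methods `Set`, `Delete`, `Observe`, and `Lines` are assumptions made for the example. In the real controller the callback would presumably be registered with the OTel meter rather than called directly.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// PhaseTracker remembers the current phase of each node so that a
// collection-time callback can report how many nodes are in each phase.
// The method set is hypothetical; only the type name comes from the commit.
type PhaseTracker struct {
	mu     sync.Mutex
	phases map[string]string // node key -> current phase
}

func NewPhaseTracker() *PhaseTracker {
	return &PhaseTracker{phases: make(map[string]string)}
}

// Set records the latest phase for a node, replacing any previous one.
func (t *PhaseTracker) Set(node, phase string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.phases[node] = phase
}

// Delete forgets a node, for example once its resource is gone.
func (t *PhaseTracker) Delete(node string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.phases, node)
}

// Observe plays the role of the gauge callback: it calls observe once per
// phase with the number of nodes currently in that phase.
func (t *PhaseTracker) Observe(observe func(phase string, count int64)) {
	t.mu.Lock()
	counts := make(map[string]int64)
	for _, p := range t.phases {
		counts[p]++
	}
	t.mu.Unlock()

	// Sort phase names so the reporting order is deterministic.
	names := make([]string, 0, len(counts))
	for p := range counts {
		names = append(names, p)
	}
	sort.Strings(names)
	for _, p := range names {
		observe(p, counts[p])
	}
}

// Lines renders the current observations as text for display.
func (t *PhaseTracker) Lines() []string {
	var out []string
	t.Observe(func(phase string, count int64) {
		out = append(out, fmt.Sprintf("phase=%s count=%d", phase, count))
	})
	return out
}

func main() {
	tr := NewPhaseTracker()
	tr.Set("default/node-a", "Pending")
	tr.Set("default/node-b", "Initializing")
	tr.Set("default/node-a", "Initializing") // node-a moves on
	for _, line := range tr.Lines() {
		fmt.Println(line)
	}
}
```

Because the gauge reads state at collection time, a node that changes phase simply overwrites its entry; nothing has to decrement the old phase explicitly.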
CRD change: PhaseTransitionTime added to SeiNodeStatus for phase duration
computation. Stamped in setTargetPhase (executor) and ResolvePlan
(Pending→Initializing transition).
Cut: phase_transitions_total (combinatorial), last_init_duration (per-node),
reconcile_substep_duration (internal), reconcile_errors per-name (unbounded),
condition gauge (CRD queryable), timeSubstep wrapper.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: address OTel expert review — per-package meters, no redundant counter
Three fixes from the OpenTelemetry expert review:
1. Per-package meters: each package now creates its own meter via
observability.NewMeter("pkg") for proper scope identification in
backends. Replaces the shared global Meter.
2. Remove redundant planFailures counter: planDuration histogram with
an outcome attribute (complete/failed) provides both duration AND
count via _count. The histogram's outcome dimension replaces the
separate failure counter.
3. Document context.Background() usage in handleTerminalPlan as a
TODO for when tracing is added. Acceptable for metrics-only
recording in a controller reconcile loop.
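The reasoning in fix 2 can be illustrated with a toy histogram keyed by an outcome attribute. This is a stdlib-only sketch, not the OTel SDK: the type `PlanDuration` and its methods are hypothetical, and bucket boundaries are left out because only the per-attribute count matters for the argument.

```go
package main

import (
	"fmt"
	"sync"
)

// series holds the aggregate a histogram keeps for one attribute set.
// Buckets are omitted; count and sum are enough for this illustration.
type series struct {
	Count uint64
	Sum   float64
}

// PlanDuration is a toy stand-in for a histogram with an outcome attribute.
type PlanDuration struct {
	mu        sync.Mutex
	byOutcome map[string]series
}

func NewPlanDuration() *PlanDuration {
	return &PlanDuration{byOutcome: make(map[string]series)}
}

// Record adds one plan's wall-clock seconds under the given outcome.
func (h *PlanDuration) Record(outcome string, seconds float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	s := h.byOutcome[outcome]
	s.Count++
	s.Sum += seconds
	h.byOutcome[outcome] = s
}

// Count returns how many plans finished with the outcome. This is the
// number a separate failure counter would otherwise have tracked.
func (h *PlanDuration) Count(outcome string) uint64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.byOutcome[outcome].Count
}

// Summary renders count and sum for one outcome.
func (h *PlanDuration) Summary(outcome string) string {
	h.mu.Lock()
	defer h.mu.Unlock()
	s := h.byOutcome[outcome]
	return fmt.Sprintf("outcome=%s count=%d sum=%.1fs", outcome, s.Count, s.Sum)
}

func main() {
	h := NewPlanDuration()
	h.Record("complete", 12.5)
	h.Record("failed", 3.0)
	h.Record("failed", 4.5)
	fmt.Println(h.Summary("complete"))
	fmt.Println(h.Summary("failed"))
}
```

Every recorded duration already increments a count for its attribute set, so the failed-plan total falls out of the histogram without a second instrument.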
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: thread context through metric emission sites
Thread the reconcile ctx to all metric emission call sites instead of
using context.Background(). This ensures proper context propagation
when tracing is added.
- ResolvePlan now accepts ctx, passes it to handleTerminalPlan
- emitGroupReplicas now accepts ctx from the reconciler
- No remaining context.Background() in production metric code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: use t.Context() in planner tests instead of context.Background()
Per OTel expert guide: use t.Context() in tests for automatic
cancellation when the test ends.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: address PR comments — gauge for plan.active, cleanup, constant
1. planActiveCount: switched from UpDownCounter to Int64Gauge backed
by an atomic counter. Records the absolute count on each change.
2. Remove dead CleanupPlanMetrics no-op function.
3. Use seiNodeControllerName constant instead of "seinode" string.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* revert: use UpDownCounter for plan.active — designed for increment/decrement
Int64UpDownCounter is the correct OTel instrument for "number of active X"
patterns. It has Add(+1)/Add(-1) built in, maps to a Prometheus gauge,
and doesn't need a separate atomic counter. Reverts the gauge change.
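The increment/decrement usage described above can be sketched as follows. The `UpDownCounter` type here is a stdlib stand-in that only mimics the delta-based shape of the OTel instrument (the real `Add` also takes a context and options), and `runPlan` and `peakDuring` are hypothetical helpers for the demonstration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// UpDownCounter is a stdlib stand-in for an OTel Int64UpDownCounter:
// callers report deltas and the instrument keeps the running sum.
type UpDownCounter struct {
	sum atomic.Int64
}

// Add applies a positive or negative delta.
func (c *UpDownCounter) Add(delta int64) { c.sum.Add(delta) }

// Value returns the current running sum.
func (c *UpDownCounter) Value() int64 { return c.sum.Load() }

// String renders the counter under the metric name from this commit.
func (c *UpDownCounter) String() string {
	return fmt.Sprintf("sei.controller.plan.active=%d", c.Value())
}

// runPlan marks a plan active for exactly as long as work runs.
func runPlan(active *UpDownCounter, work func()) {
	active.Add(1)
	defer active.Add(-1)
	work()
}

// peakDuring starts n plans that block until released and returns the
// counter value observed while all of them are still running.
func peakDuring(active *UpDownCounter, n int) int64 {
	started := make(chan struct{})
	release := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runPlan(active, func() {
				started <- struct{}{} // signal that this plan is active
				<-release             // hold the plan open
			})
		}()
	}
	for i := 0; i < n; i++ {
		<-started
	}
	during := active.Value()
	close(release)
	wg.Wait()
	return during
}

func main() {
	var active UpDownCounter
	fmt.Println("while running:", peakDuring(&active, 3))
	fmt.Println("after completion:", active.String())
}
```

Pairing `Add(1)` with a deferred `Add(-1)` keeps the two calls next to each other, so an early return or panic in the plan cannot leave the count inflated.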
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>