Commit fd8708a

bdchatham and claude authored
feat: replace all Prometheus metrics with OTel instruments (7+2) (#98)
* feat: replace all Prometheus metrics with OTel instruments (7+2)

  Phase 2 of the OTel migration. Atomic swap of all 12 prometheus/client_golang
  metrics for 7 core + 2 deployment OTel instruments mapped to the Four Golden
  Signals.

  LATENCY:
  - sei.controller.seinode.phase.duration — time in each phase (histogram)
  - sei.controller.plan.duration — plan wall-clock time (histogram)

  TRAFFIC:
  - sei.controller.seinode.phase — observable gauge with PhaseTracker callback
  - sei.controller.seinodedeployment.phase — same pattern

  ERRORS:
  - sei.controller.plan.failures — terminal plan failure counter
  - sei.controller.reconcile.errors — namespace-scoped error counter

  SATURATION:
  - sei.controller.plan.active — UpDownCounter for concurrent plans

  DEPLOYMENT:
  - sei.controller.seinodedeployment.replicas — replica_state dimension

  CRD change: PhaseTransitionTime added to SeiNodeStatus for phase duration
  computation. Stamped in setTargetPhase (executor) and ResolvePlan
  (Pending → Initializing transition).

  Cut: phase_transitions_total (combinatorial), last_init_duration (per-node),
  reconcile_substep_duration (internal), reconcile_errors per-name (unbounded),
  condition gauge (CRD queryable), timeSubstep wrapper.

* refactor: address OTel expert review — per-package meters, no redundant counter

  Three fixes from the OpenTelemetry expert review:

  1. Per-package meters: each package now creates its own meter via
     observability.NewMeter("pkg") for proper scope identification in backends.
     Replaces the shared global Meter.
  2. Remove the redundant planFailures counter: the planDuration histogram with
     an outcome attribute (complete/failed) provides both duration and count via
     its _count series, so the histogram's outcome dimension replaces the
     separate failure counter.
  3. Document the context.Background() usage in handleTerminalPlan as a TODO for
     when tracing is added. Acceptable for metrics-only recording in a controller
     reconcile loop.

* refactor: thread context through metric emission sites

  Thread the reconcile ctx to all metric emission call sites instead of using
  context.Background(). This ensures proper context propagation when tracing is
  added.

  - ResolvePlan now accepts ctx and passes it to handleTerminalPlan
  - emitGroupReplicas now accepts ctx from the reconciler
  - No remaining context.Background() in production metric code

* refactor: use t.Context() in planner tests instead of context.Background()

  Per the OTel expert guide: use t.Context() in tests for automatic cancellation
  when the test ends.

* refactor: address PR comments — gauge for plan.active, cleanup, constant

  1. planActiveCount: switched from UpDownCounter to Int64Gauge backed by an
     atomic counter. Records the absolute count on each change.
  2. Remove the dead CleanupPlanMetrics no-op function.
  3. Use the seiNodeControllerName constant instead of the "seinode" string.

* revert: use UpDownCounter for plan.active — designed for increment/decrement

  Int64UpDownCounter is the correct OTel instrument for "number of active X"
  patterns. It has Add(+1)/Add(-1) built in, maps to a Prometheus gauge, and
  doesn't need a separate atomic counter. Reverts the gauge change.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
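The review's point about dropping the planFailures counter rests on a property of histograms: every attribute combination carries its own _count, so a duration histogram keyed by an outcome attribute already counts failures. A stdlib-only model of that bookkeeping (the histoCell and planDurationHisto names are illustrative, not from the repo):

```go
package main

import "fmt"

// histoCell models one series of a duration histogram: its count field is
// the _count a backend would expose, which alone recovers what a separate
// failure counter would have reported.
type histoCell struct {
	count int
	sum   float64
}

// planDurationHisto keys series by the outcome attribute (complete/failed).
type planDurationHisto map[string]*histoCell

func (h planDurationHisto) Record(seconds float64, outcome string) {
	c, ok := h[outcome]
	if !ok {
		c = &histoCell{}
		h[outcome] = c
	}
	c.count++
	c.sum += seconds
}

func main() {
	h := planDurationHisto{}
	h.Record(12.5, "complete")
	h.Record(3.0, "failed")
	h.Record(7.5, "failed")
	// The failed-series _count is exactly the failure count the removed
	// counter would have held.
	fmt.Println(h["failed"].count)
}
```

In PromQL terms, the same idea reads as querying `_count` filtered by `outcome="failed"` instead of scraping a dedicated failures counter.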
1 parent 2bf87e4 · commit fd8708a

16 files changed: 336 additions & 251 deletions


api/v1alpha1/seinode_types.go

Lines changed: 5 additions & 0 deletions
@@ -229,6 +229,11 @@ type SeiNodeStatus struct {
 	// +optional
 	Conditions []metav1.Condition `json:"conditions,omitempty"`

+	// PhaseTransitionTime is when the node last changed phases.
+	// Used to compute phase duration metrics.
+	// +optional
+	PhaseTransitionTime *metav1.Time `json:"phaseTransitionTime,omitempty"`
+
 	// Plan tracks the active task sequence for this node. A planner generates
 	// the plan based on the node's current state and conditions.
 	// +optional

api/v1alpha1/zz_generated.deepcopy.go

Lines changed: 4 additions & 0 deletions
Some generated files are not rendered by default.

config/crd/sei.io_seinodes.yaml

Lines changed: 6 additions & 0 deletions
@@ -582,6 +582,12 @@ spec:
                 - Failed
                 - Terminating
                 type: string
+              phaseTransitionTime:
+                description: |-
+                  PhaseTransitionTime is when the node last changed phases.
+                  Used to compute phase duration metrics.
+                format: date-time
+                type: string
               plan:
                 description: |-
                   Plan tracks the active task sequence for this node. A planner generates

go.mod

Lines changed: 2 additions & 2 deletions
@@ -8,12 +8,12 @@ require (
 	github.com/aws/aws-sdk-go-v2/service/s3 v1.97.2
 	github.com/google/uuid v1.6.0
 	github.com/onsi/gomega v1.38.2
-	github.com/prometheus/client_golang v1.23.2
 	github.com/sei-protocol/sei-config v0.0.11
 	github.com/sei-protocol/seictl v0.0.30
 	go.opentelemetry.io/otel v1.43.0
 	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.43.0
 	go.opentelemetry.io/otel/exporters/prometheus v0.65.0
+	go.opentelemetry.io/otel/metric v1.43.0
 	go.opentelemetry.io/otel/sdk v1.43.0
 	go.opentelemetry.io/otel/sdk/metric v1.43.0
 	k8s.io/api v0.35.0
@@ -78,6 +78,7 @@ require (
 	github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
 	github.com/oapi-codegen/runtime v1.2.0 // indirect
 	github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
+	github.com/prometheus/client_golang v1.23.2 // indirect
 	github.com/prometheus/client_model v0.6.2 // indirect
 	github.com/prometheus/common v0.67.5 // indirect
 	github.com/prometheus/otlptranslator v1.0.0 // indirect
@@ -92,7 +93,6 @@ require (
 	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.62.0 // indirect
 	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.37.0 // indirect
 	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.37.0 // indirect
-	go.opentelemetry.io/otel/metric v1.43.0 // indirect
 	go.opentelemetry.io/otel/trace v1.43.0 // indirect
 	go.opentelemetry.io/proto/otlp v1.10.0 // indirect
 	go.uber.org/multierr v1.11.0 // indirect

internal/controller/node/controller.go

Lines changed: 21 additions & 6 deletions
@@ -5,6 +5,7 @@ import (
 	"fmt"
 	"time"

+	"go.opentelemetry.io/otel/metric"
 	appsv1 "k8s.io/api/apps/v1"
 	batchv1 "k8s.io/api/batch/v1"
 	corev1 "k8s.io/api/core/v1"
@@ -20,6 +21,7 @@ import (
 	"sigs.k8s.io/controller-runtime/pkg/predicate"

 	seiv1alpha1 "github.com/sei-protocol/sei-k8s-controller/api/v1alpha1"
+	"github.com/sei-protocol/sei-k8s-controller/internal/controller/observability"
 	"github.com/sei-protocol/sei-k8s-controller/internal/noderesource"
 	"github.com/sei-protocol/sei-k8s-controller/internal/planner"
 	"github.com/sei-protocol/sei-k8s-controller/internal/platform"
@@ -101,7 +103,7 @@ func (r *SeiNodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
 	// builds a new one based on the node's phase, stamping it onto
 	// node.Status.Plan (and transitioning Pending → Initializing).
 	planAlreadyActive := node.Status.Plan != nil && node.Status.Plan.Phase == seiv1alpha1.TaskPlanActive
-	if err := planner.ResolvePlan(node); err != nil {
+	if err := planner.ResolvePlan(ctx, node); err != nil {
 		return ctrl.Result{}, fmt.Errorf("resolving plan: %w", err)
 	}

@@ -136,15 +138,28 @@ func (r *SeiNodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
 	// Emit metrics/events if the phase changed.
 	if node.Status.Phase != observedPhase {
 		ns, name := node.Namespace, node.Name
-		nodePhaseTransitions.WithLabelValues(ns, string(observedPhase), string(node.Status.Phase)).Inc()
+		nodePhaseTransitions.Add(ctx, 1,
+			metric.WithAttributes(
+				observability.AttrController.String(seiNodeControllerName),
+				observability.AttrNamespace.String(ns),
+				observability.AttrFromPhase.String(string(observedPhase)),
+				observability.AttrToPhase.String(string(node.Status.Phase)),
+			),
+		)
 		emitNodePhase(ns, name, node.Status.Phase)
 		r.Recorder.Eventf(node, corev1.EventTypeNormal, "PhaseTransition",
 			"Phase changed from %s to %s", observedPhase, node.Status.Phase)

-		if node.Status.Phase == seiv1alpha1.PhaseRunning {
-			dur := time.Since(node.CreationTimestamp.Time).Seconds()
-			nodeInitDuration.WithLabelValues(ns, node.Spec.ChainID).Observe(dur)
-			nodeLastInitDuration.WithLabelValues(ns, name).Set(dur)
+		// Record time spent in the previous phase.
+		if node.Status.PhaseTransitionTime != nil && observedPhase != "" {
+			dur := time.Since(node.Status.PhaseTransitionTime.Time).Seconds()
+			nodePhaseDuration.Record(ctx, dur,
+				metric.WithAttributes(
+					observability.AttrNamespace.String(ns),
+					observability.AttrChainID.String(node.Spec.ChainID),
+					observability.AttrPhase.String(string(observedPhase)),
+				),
+			)
 		}
 	}

Lines changed: 37 additions & 42 deletions
@@ -1,12 +1,10 @@
 package node

 import (
-	"github.com/prometheus/client_golang/prometheus"
-	"sigs.k8s.io/controller-runtime/pkg/metrics"
+	"go.opentelemetry.io/otel/metric"

 	seiv1alpha1 "github.com/sei-protocol/sei-k8s-controller/api/v1alpha1"
 	"github.com/sei-protocol/sei-k8s-controller/internal/controller/observability"
-	"github.com/sei-protocol/sei-k8s-controller/internal/planner"
 )

 var allNodePhases = []string{
@@ -17,57 +15,54 @@ var allNodePhases = []string{
 	string(seiv1alpha1.PhaseTerminating),
 }

+var nodePhaseTracker = observability.NewPhaseTracker(allNodePhases)
+
 var (
-	nodePhaseGauge = prometheus.NewGaugeVec(
-		prometheus.GaugeOpts{
-			Name: "sei_controller_seinode_phase",
-			Help: "Current phase of each SeiNode (1=active, 0=inactive)",
-		},
-		[]string{"namespace", "name", "phase"},
-	)
+	// nodePhaseTransitions counts phase transitions.
+	nodePhaseTransitions metric.Int64Counter

-	nodePhaseTransitions = prometheus.NewCounterVec(
-		prometheus.CounterOpts{
-			Name: "sei_controller_seinode_phase_transitions_total",
-			Help: "Phase state machine transitions",
-		},
-		[]string{"namespace", "from", "to"},
-	)
+	// nodePhaseDuration records time spent in each phase when transitioning out.
+	nodePhaseDuration metric.Float64Histogram
+)
+
+var meter = observability.NewMeter("node")

-	nodeInitDuration = prometheus.NewHistogramVec(
-		prometheus.HistogramOpts{
-			Name:    "sei_controller_seinode_init_duration_seconds",
-			Help:    "Time from Pending to Running",
-			Buckets: observability.InitBuckets,
-		},
-		[]string{"namespace", "chain_id"},
+func init() {
+	var err error
+
+	// The observable gauge is registered for its callback side effect.
+	_, err = meter.Float64ObservableGauge(
+		"sei.controller.seinode.phase",
+		metric.WithDescription("Current phase of each SeiNode (1=active, 0=inactive)"),
+		metric.WithFloat64Callback(nodePhaseTracker.Observe),
 	)
+	handleInitErr(err)

-	nodeLastInitDuration = prometheus.NewGaugeVec(
-		prometheus.GaugeOpts{
-			Name: "sei_controller_seinode_last_init_duration_seconds",
-			Help: "Per-node init duration, set once when node reaches Running",
-		},
-		[]string{"namespace", "name"},
+	nodePhaseTransitions, err = meter.Int64Counter(
+		"sei.controller.seinode.phase.transitions",
+		metric.WithDescription("Phase state machine transitions"),
 	)
-)
+	handleInitErr(err)

-func init() {
-	metrics.Registry.MustRegister(
-		nodePhaseGauge,
-		nodePhaseTransitions,
-		nodeInitDuration,
-		nodeLastInitDuration,
+	nodePhaseDuration, err = meter.Float64Histogram(
+		"sei.controller.seinode.phase.duration",
+		metric.WithDescription("Time spent in each phase before transitioning"),
+		metric.WithUnit("s"),
+		metric.WithExplicitBucketBoundaries(observability.InitBuckets...),
 	)
+	handleInitErr(err)
+}
+
+func handleInitErr(err error) {
+	if err != nil {
+		panic("otel metric init: " + err.Error())
+	}
 }

 func emitNodePhase(ns, name string, phase seiv1alpha1.SeiNodePhase) {
-	observability.EmitPhaseGauge(nodePhaseGauge, ns, name, string(phase), allNodePhases)
+	nodePhaseTracker.Set(ns, name, string(phase))
 }

 func cleanupNodeMetrics(namespace, name string) {
-	observability.DeletePhaseGauge(nodePhaseGauge, namespace, name, allNodePhases)
-	nodeLastInitDuration.DeleteLabelValues(namespace, name)
-	observability.ReconcileErrorsTotal.DeleteLabelValues(seiNodeControllerName, namespace, name)
-	planner.CleanupPlanMetrics("seinode", namespace, name)
+	nodePhaseTracker.Delete(namespace, name)
 }
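The diff leans on observability.NewPhaseTracker, whose implementation is not shown in this commit. A minimal stdlib-only sketch of what such a tracker plausibly does, assuming the set/delete/observe-as-0-or-1 semantics implied by the old EmitPhaseGauge helper; Snapshot stands in for the OTel Observe callback, and all names here are guesses at the shape, not the repo's code:

```go
package main

import (
	"fmt"
	"sync"
)

// phaseKey identifies one (namespace, name) resource.
type phaseKey struct{ ns, name string }

// PhaseTracker remembers each resource's current phase. On observation it
// reports 1 for the active phase and 0 for every other known phase, which
// is what an observable gauge callback needs to emit.
type PhaseTracker struct {
	mu     sync.Mutex
	phases []string
	state  map[phaseKey]string
}

func NewPhaseTracker(phases []string) *PhaseTracker {
	return &PhaseTracker{phases: phases, state: make(map[phaseKey]string)}
}

func (t *PhaseTracker) Set(ns, name, phase string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.state[phaseKey{ns, name}] = phase
}

func (t *PhaseTracker) Delete(ns, name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.state, phaseKey{ns, name})
}

// Snapshot returns, per tracked resource, a 0/1 value for each known phase:
// the series an ObservableGauge callback would report on each collection.
func (t *PhaseTracker) Snapshot() map[phaseKey]map[string]float64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make(map[phaseKey]map[string]float64, len(t.state))
	for k, cur := range t.state {
		row := make(map[string]float64, len(t.phases))
		for _, p := range t.phases {
			if p == cur {
				row[p] = 1
			} else {
				row[p] = 0
			}
		}
		out[k] = row
	}
	return out
}

func main() {
	tr := NewPhaseTracker([]string{"Pending", "Running", "Failed"})
	tr.Set("default", "node-0", "Running")
	fmt.Println(tr.Snapshot()[phaseKey{"default", "node-0"}]["Running"]) // 1
}
```

The callback pattern keeps cleanup simple: cleanupNodeMetrics only has to delete the tracker entry, and the stale series stops being reported on the next collection.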

internal/controller/node/plan_execution_test.go

Lines changed: 8 additions & 8 deletions
@@ -36,7 +36,7 @@ const (

 func mustBuildPlan(t *testing.T, node *seiv1alpha1.SeiNode) *seiv1alpha1.TaskPlan {
 	t.Helper()
-	if err := planner.ResolvePlan(node); err != nil {
+	if err := planner.ResolvePlan(context.Background(), node); err != nil {
 		t.Fatalf("ResolvePlan: %v", err)
 	}
 	return node.Status.Plan
@@ -550,7 +550,7 @@ func TestExecutePlan_NilPlan_ReturnsError(t *testing.T) {

 func TestResolvePlan_FullNode(t *testing.T) {
 	node := snapshotNode()
-	if err := planner.ResolvePlan(node); err != nil {
+	if err := planner.ResolvePlan(context.Background(), node); err != nil {
 		t.Fatal(err)
 	}
 	if node.Status.Plan == nil {
@@ -560,7 +560,7 @@ func TestResolvePlan_FullNode(t *testing.T) {

 func TestResolvePlan_Archive(t *testing.T) {
 	node := snapshotterNode()
-	if err := planner.ResolvePlan(node); err != nil {
+	if err := planner.ResolvePlan(context.Background(), node); err != nil {
 		t.Fatal(err)
 	}
 	if node.Status.Plan == nil {
@@ -570,7 +570,7 @@ func TestResolvePlan_Archive(t *testing.T) {

 func TestResolvePlan_Validator(t *testing.T) {
 	node := genesisNode()
-	if err := planner.ResolvePlan(node); err != nil {
+	if err := planner.ResolvePlan(context.Background(), node); err != nil {
 		t.Fatal(err)
 	}
 	if node.Status.Plan == nil {
@@ -580,7 +580,7 @@ func TestResolvePlan_Validator(t *testing.T) {

 func TestResolvePlan_Replayer(t *testing.T) {
 	node := replayerNode()
-	if err := planner.ResolvePlan(node); err != nil {
+	if err := planner.ResolvePlan(context.Background(), node); err != nil {
 		t.Fatal(err)
 	}
 	if node.Status.Plan == nil {
@@ -595,7 +595,7 @@ func TestResolvePlan_NoSubSpec(t *testing.T) {
 			Image: "sei:latest",
 		},
 	}
-	err := planner.ResolvePlan(node)
+	err := planner.ResolvePlan(context.Background(), node)
 	if err == nil {
 		t.Error("expected error for node with no sub-spec")
 	}
@@ -607,7 +607,7 @@ func TestResolvePlan_ResumesActivePlan(t *testing.T) {
 		ID:    "existing-plan",
 		Phase: seiv1alpha1.TaskPlanActive,
 	}
-	if err := planner.ResolvePlan(node); err != nil {
+	if err := planner.ResolvePlan(context.Background(), node); err != nil {
 		t.Fatal(err)
 	}
 	if node.Status.Plan.ID != "existing-plan" {
@@ -684,7 +684,7 @@ func TestExecutePlan_CompletedPlan_StaysForPlannerCleanup(t *testing.T) {
 	node.Status.Phase = seiv1alpha1.PhaseRunning
 	node.Status.Plan = nil
 	// Build a NodeUpdate plan for a Running node with drift.
-	if err := planner.ResolvePlan(node); err != nil {
+	if err := planner.ResolvePlan(context.Background(), node); err != nil {
 		t.Fatal(err)
 	}
 	g.Expect(node.Status.Plan).NotTo(BeNil())

internal/controller/nodedeployment/controller.go

Lines changed: 4 additions & 18 deletions
@@ -18,7 +18,6 @@ import (
 	"sigs.k8s.io/controller-runtime/pkg/predicate"

 	seiv1alpha1 "github.com/sei-protocol/sei-k8s-controller/api/v1alpha1"
-	"github.com/sei-protocol/sei-k8s-controller/internal/controller/observability"
 	"github.com/sei-protocol/sei-k8s-controller/internal/planner"
 )

@@ -82,13 +81,8 @@ func (r *SeiNodeDeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Re
 	statusBase := client.MergeFromWithOptions(group.DeepCopy(), client.MergeFromWithOptimisticLock{})
 	ns, name := group.Namespace, group.Name

-	// Networking runs before node creation so the DNS readiness gate
-	// can block until public routes are resolvable.
-	if err := timeSubstep("reconcileNetworking", func() error {
-		return r.reconcileNetworking(ctx, group)
-	}); err != nil {
+	if err := r.reconcileNetworking(ctx, group); err != nil {
 		logger.Error(err, "reconciling networking")
-		observability.ReconcileErrorsTotal.WithLabelValues(controllerName, ns, name).Inc()
 		return ctrl.Result{}, fmt.Errorf("reconciling networking: %w", err)
 	}

@@ -101,18 +95,14 @@ func (r *SeiNodeDeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Re
 		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
 	}

-	if err := timeSubstep("reconcileSeiNodes", func() error {
-		return r.reconcileSeiNodes(ctx, group)
-	}); err != nil {
+	if err := r.reconcileSeiNodes(ctx, group); err != nil {
 		logger.Error(err, "reconciling SeiNodes")
-		observability.ReconcileErrorsTotal.WithLabelValues(controllerName, ns, name).Inc()
 		return ctrl.Result{}, fmt.Errorf("reconciling SeiNodes: %w", err)
 	}

 	planResult, planErr := r.reconcilePlan(ctx, group)
 	if planErr != nil {
 		logger.Error(planErr, "reconciling plan")
-		observability.ReconcileErrorsTotal.WithLabelValues(controllerName, ns, name).Inc()
 		return ctrl.Result{}, fmt.Errorf("reconciling plan: %w", planErr)
 	}
 	if shouldRequeue(planResult) {
@@ -122,11 +112,8 @@ func (r *SeiNodeDeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Re
 		return planResult, nil
 	}

-	if err := timeSubstep("reconcileMonitoring", func() error {
-		return r.reconcileMonitoring(ctx, group)
-	}); err != nil {
+	if err := r.reconcileMonitoring(ctx, group); err != nil {
 		logger.Error(err, "reconciling monitoring")
-		observability.ReconcileErrorsTotal.WithLabelValues(controllerName, ns, name).Inc()
 		return ctrl.Result{}, fmt.Errorf("reconciling monitoring: %w", err)
 	}

@@ -135,8 +122,7 @@ func (r *SeiNodeDeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Re
 	}

 	emitGroupPhase(ns, name, group.Status.Phase)
-	emitGroupReplicas(ns, name, group.Spec.Replicas, group.Status.ReadyReplicas)
-	emitGroupConditions(ns, name, group.Status.Conditions)
+	emitGroupReplicas(ctx, ns, name, group.Spec.Replicas, group.Status.ReadyReplicas)

 	return ctrl.Result{RequeueAfter: statusPollInterval}, nil
 }
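The final revert in the commit history settled on Int64UpDownCounter for sei.controller.plan.active. The increment/decrement discipline it encodes can be modeled with a stdlib atomic; the activePlans type below is a stand-in for the real OTel instrument, not the repo's code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// activePlans models the Int64UpDownCounter pattern: Add(+1) when a plan
// starts, Add(-1) when it reaches a terminal state. The instrument tracks
// the net count itself, so no separate atomic is needed in real OTel code.
type activePlans struct{ n atomic.Int64 }

func (a *activePlans) Add(delta int64) { a.n.Add(delta) }
func (a *activePlans) Value() int64    { return a.n.Load() }

func main() {
	var a activePlans
	a.Add(+1) // plan started
	a.Add(+1) // second concurrent plan
	a.Add(-1) // first plan finished
	fmt.Println(a.Value()) // 1
}
```

This is why the earlier Int64Gauge attempt needed its own atomic counter: a gauge records absolute values, so the caller must track the count, whereas the UpDownCounter owns the arithmetic and still exports as a Prometheus gauge.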
