perf(erpc:PLA-1058): reduce metric cardinality#59
perf(erpc:PLA-1058): reduce metric cardinality#590x666c6f wants to merge 6 commits intomorpho-mainfrom
Conversation
|
@codex review |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if len(a.cfg.AllowMethods) > 0 { | ||
| for _, allowMethod := range a.cfg.AllowMethods { | ||
| match, err := common.WildcardMatch(allowMethod, method) | ||
| if err != nil { | ||
| a.logger.Error().Err(err).Msgf("error matching ignore method %s with method %s", ignoreMethod, method) | ||
| a.logger.Error().Err(err).Msgf("error matching allow method %s with method %s", allowMethod, method) | ||
| continue | ||
| } | ||
| if match { | ||
| shouldApply = false | ||
| break | ||
| return true | ||
| } | ||
| } | ||
| return false | ||
| } | ||
|
|
||
| if len(a.cfg.AllowMethods) > 0 { | ||
| for _, allowMethod := range a.cfg.AllowMethods { | ||
| match, err := common.WildcardMatch(allowMethod, method) | ||
| if len(a.cfg.IgnoreMethods) > 0 { | ||
| for _, ignoreMethod := range a.cfg.IgnoreMethods { | ||
| match, err := common.WildcardMatch(ignoreMethod, method) | ||
| if err != nil { | ||
| a.logger.Error().Err(err).Msgf("error matching allow method %s with method %s", allowMethod, method) | ||
| a.logger.Error().Err(err).Msgf("error matching ignore method %s with method %s", ignoreMethod, method) | ||
| continue | ||
| } | ||
| if match { | ||
| shouldApply = true | ||
| break | ||
| return false | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return shouldApply | ||
| return true |
There was a problem hiding this comment.
shouldApplyToMethod now treats AllowMethods as a strict allowlist (returns false when AllowMethods is non-empty and no pattern matches). This is a behavioral change from the documented semantics where allowMethods “takes precedence over ignoreMethods” (i.e., it overrides blocks) but doesn’t restrict methods unless ignoreMethods blocks them. This can unintentionally deny requests for configs that set allowMethods while leaving ignoreMethods empty.
Consider restoring the previous override semantics (start with shouldApply=true, apply IgnoreMethods to set false on match, then apply AllowMethods to set true on match) or, if the new allowlist behavior is intended, update the config defaults/docs to reflect that AllowMethods implies IgnoreMethods=["*"] unless explicitly overridden (similar to upstream method filters).
There was a problem hiding this comment.
Devin Review found 1 new potential issue.
🐛 1 issue in files not directly in the diff
🐛 MetricUpstreamErrorTotal WithLabelValues passes 11 args but metric now only has 10 labels — runtime panic (erpc/networks.go:1431-1436)
The PR removed the agent_name label from MetricUpstreamErrorTotal (telemetry/metrics.go:30), reducing its label count from 11 to 10. However, the call site at erpc/networks.go:1431-1436 was not updated and still passes 11 values (including req.AgentName() as the 11th argument). The Prometheus client library panics when WithLabelValues receives more values than there are label names defined on the metric vector. This code path is hit during the block-ahead skip logic in network forwarding, so it will cause a crash in production whenever an upstream is skipped due to stale block data.
View 13 additional findings in Devin Review.
Summary
erpc_network_request_duration_secondsfromproject,network,vendor,upstream,category,finality,usertoproject,network,category,finality,outcome.userfromerpc_upstream_request_duration_seconds, all cache duration histograms, anderpc_network_evm_get_logs_range_requested; normalize cache get/set error labels toErrorFingerprint(err).go test ./telemetry ./architecture/evm,make build,pnpm build, andjq empty monitoring/grafana/dashboards/erpc.json monitoring/datadog/dashboard.json. Broadermake agent-gateremains blocked locally by unrelated untrackedauth/authorizer_test.go, whichgo testpicks up under./auth.Changes
outcome=success|cache|errorfor network latency instead ofvendor,upstream, anduser.userfrom cache duration histograms anderpc_network_evm_get_logs_range_requested.ErrorSummary(err)toErrorFingerprint(err).monitoring/cardinality-audit-2026-04.mdwith prod snapshot, expected impact, and follow-up ticketsPLA-1064andPLA-1065.Metrics Diff Summary
erpc_network_request_duration_seconds:project,network,vendor,upstream,category,finality,user->project,network,category,finality,outcome.erpc_upstream_request_duration_seconds:project,vendor,network,upstream,category,composite,finality,user->project,vendor,network,upstream,category,composite,finality.erpc_cache_set_success_duration_seconds,erpc_cache_set_error_duration_seconds,erpc_cache_get_success_hit_duration_seconds,erpc_cache_get_success_miss_duration_seconds,erpc_cache_get_error_duration_seconds: droppeduser.erpc_network_evm_get_logs_range_requested:project,network,category,user,finality->project,network,category,finality.erpc_cache_get_error_duration_seconds/erpc_cache_set_error_duration_seconds:ErrorSummary(err)->ErrorFingerprint(err).328kactive series overall; top app-side families includederpc_upstream_request_duration_seconds_bucketat about51kseries anderpc_network_request_duration_seconds_bucketat about41k.Additional Series Savings In This PR
prd-morphoon April 7, 2026 for the newly coarsened families:erpc_cache_set_success_duration_seconds_bucket:22,625 -> 3,665(-18,960,-83.8%)erpc_cache_get_success_miss_duration_seconds_bucket:18,915 -> 2,685(-16,230,-85.8%)erpc_cache_get_success_hit_duration_seconds_bucket:11,780 -> 1,430(-10,350,-87.9%)erpc_network_evm_get_logs_range_requested_bucket:3,438 -> 468(-2,970,-86.4%)erpc_network_hedge_delay_seconds_bucketis still large at about30,920live series, but labels are already lean (project,network,category,finality); next cut there is bucket-budget/category coarsening, not another obvious label drop.Linear