perf(erpc:PLA-1058): reduce metric cardinality #59

Open
0x666c6f wants to merge 6 commits into morpho-main from feature/pla-1058-audit-erpc-metric-cardinality-and-histogram-budget

@0x666c6f 0x666c6f commented Apr 2, 2026

Summary

  • Audit and reduce Prometheus series pressure from eRPC latency histograms, cache duration histograms, get-logs range histograms, and cache error labels.
  • Coarsen erpc_network_request_duration_seconds from project,network,vendor,upstream,category,finality,user to project,network,category,finality,outcome.
  • Drop user from erpc_upstream_request_duration_seconds, all cache duration histograms, and erpc_network_evm_get_logs_range_requested; normalize cache get/set error labels to ErrorFingerprint(err).
  • Update bundled Grafana and Datadog queries, monitoring docs, and add an audit note plus regression tests.
  • Validated with go test ./telemetry ./architecture/evm, make build, pnpm build, and jq empty monitoring/grafana/dashboards/erpc.json monitoring/datadog/dashboard.json. Broader make agent-gate remains blocked locally by unrelated untracked auth/authorizer_test.go, which go test picks up under ./auth.

Changes

  • Emit outcome=success|cache|error for network latency instead of vendor, upstream, and user.
  • Keep detailed counters as-is; only coarsen the highest-cost histograms.
  • Drop user from cache duration histograms and erpc_network_evm_get_logs_range_requested.
  • Switch cache error metrics from ErrorSummary(err) to ErrorFingerprint(err).
  • Refresh bundled dashboards and monitoring docs for the new histogram shapes.
  • Add monitoring/cardinality-audit-2026-04.md with prod snapshot, expected impact, and follow-up tickets PLA-1064 and PLA-1065.

Metrics Diff Summary

  • erpc_network_request_duration_seconds: project,network,vendor,upstream,category,finality,user -> project,network,category,finality,outcome.
  • erpc_upstream_request_duration_seconds: project,vendor,network,upstream,category,composite,finality,user -> project,vendor,network,upstream,category,composite,finality.
  • erpc_cache_set_success_duration_seconds, erpc_cache_set_error_duration_seconds, erpc_cache_get_success_hit_duration_seconds, erpc_cache_get_success_miss_duration_seconds, erpc_cache_get_error_duration_seconds: dropped user.
  • erpc_network_evm_get_logs_range_requested: project,network,category,user,finality -> project,network,category,finality.
  • erpc_cache_get_error_duration_seconds / erpc_cache_set_error_duration_seconds: ErrorSummary(err) -> ErrorFingerprint(err).
  • Prod incident snapshot behind this change: about 328k active series overall; top app-side families included erpc_upstream_request_duration_seconds_bucket at about 51k series and erpc_network_request_duration_seconds_bucket at about 41k.
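To see why dropping a single label matters for a histogram, the worst-case number of _bucket series can be estimated as the product of the label cardinalities times the number of buckets (explicit bounds plus the implicit +Inf bucket). The sketch below uses made-up cardinalities and a made-up bucket count; none of these numbers come from the prod snapshot.

```go
package main

import "fmt"

// maxBucketSeries returns an upper bound on the _bucket series of one histogram:
// the product of all label cardinalities times (explicit buckets + 1 for +Inf).
// Real series counts are usually lower because not every combination occurs.
func maxBucketSeries(labelCardinality map[string]int, explicitBuckets int) int {
	total := explicitBuckets + 1
	for _, n := range labelCardinality {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical cardinalities, chosen only for illustration.
	withUser := map[string]int{"project": 2, "network": 10, "category": 5, "finality": 4, "user": 8}
	withoutUser := map[string]int{"project": 2, "network": 10, "category": 5, "finality": 4}
	const explicitBuckets = 12 // assumed bucket count

	fmt.Println("upper bound with user:   ", maxBucketSeries(withUser, explicitBuckets))
	fmt.Println("upper bound without user:", maxBucketSeries(withoutUser, explicitBuckets))
}
```

Under these assumptions the bound shrinks by exactly the cardinality of the dropped label, which is why user is the first label removed from the histograms.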

Additional Series Savings In This PR

  • Live prod snapshot from prd-morpho on April 7, 2026 for the newly coarsened families:
    • erpc_cache_set_success_duration_seconds_bucket: 22,625 -> 3,665 (-18,960, -83.8%)
    • erpc_cache_get_success_miss_duration_seconds_bucket: 18,915 -> 2,685 (-16,230, -85.8%)
    • erpc_cache_get_success_hit_duration_seconds_bucket: 11,780 -> 1,430 (-10,350, -87.9%)
    • erpc_network_evm_get_logs_range_requested_bucket: 3,438 -> 468 (-2,970, -86.4%)
  • erpc_network_hedge_delay_seconds_bucket is still large at about 30,920 live series, but its labels are already lean (project,network,category,finality); the next reduction there would have to come from a smaller bucket budget or category coarsening, since no obvious label remains to drop.
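The deltas and percentages in the list above can be rechecked from the before/after counts. The following sketch does that arithmetic; the family type and savings function are illustrative names, and the counts are copied from the April 7, 2026 snapshot quoted above.

```go
package main

import "fmt"

// family holds the before/after _bucket series counts for one metric family.
type family struct {
	name          string
	before, after int
}

// savings returns the absolute and relative reduction.
// Assumption: before is greater than zero.
func savings(before, after int) (int, float64) {
	saved := before - after
	return saved, 100 * float64(saved) / float64(before)
}

func main() {
	families := []family{
		{"erpc_cache_set_success_duration_seconds_bucket", 22625, 3665},
		{"erpc_cache_get_success_miss_duration_seconds_bucket", 18915, 2685},
		{"erpc_cache_get_success_hit_duration_seconds_bucket", 11780, 1430},
		{"erpc_network_evm_get_logs_range_requested_bucket", 3438, 468},
	}
	total := 0
	for _, f := range families {
		saved, pct := savings(f.before, f.after)
		total += saved
		fmt.Printf("%-52s -%d (-%.1f%%)\n", f.name, saved, pct)
	}
	fmt.Println("total bucket series saved:", total)
}
```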


@0x666c6f 0x666c6f self-assigned this Apr 2, 2026
Copilot AI review requested due to automatic review settings April 2, 2026 13:11

0x666c6f commented Apr 7, 2026

@codex review

@0x666c6f 0x666c6f marked this pull request as ready for review April 7, 2026 15:18

Copilot AI review requested due to automatic review settings April 7, 2026 16:21

Copilot AI review requested due to automatic review settings April 7, 2026 16:45

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


Comment on lines +84 to +111
+	if len(a.cfg.AllowMethods) > 0 {
+		for _, allowMethod := range a.cfg.AllowMethods {
+			match, err := common.WildcardMatch(allowMethod, method)
 			if err != nil {
-				a.logger.Error().Err(err).Msgf("error matching ignore method %s with method %s", ignoreMethod, method)
+				a.logger.Error().Err(err).Msgf("error matching allow method %s with method %s", allowMethod, method)
 				continue
 			}
 			if match {
-				shouldApply = false
-				break
+				return true
 			}
 		}
+		return false
 	}
 
-	if len(a.cfg.AllowMethods) > 0 {
-		for _, allowMethod := range a.cfg.AllowMethods {
-			match, err := common.WildcardMatch(allowMethod, method)
+	if len(a.cfg.IgnoreMethods) > 0 {
+		for _, ignoreMethod := range a.cfg.IgnoreMethods {
+			match, err := common.WildcardMatch(ignoreMethod, method)
 			if err != nil {
-				a.logger.Error().Err(err).Msgf("error matching allow method %s with method %s", allowMethod, method)
+				a.logger.Error().Err(err).Msgf("error matching ignore method %s with method %s", ignoreMethod, method)
 				continue
 			}
 			if match {
-				shouldApply = true
-				break
+				return false
 			}
 		}
 	}
 
-	return shouldApply
+	return true

Copilot AI Apr 7, 2026

shouldApplyToMethod now treats AllowMethods as a strict allowlist (returns false when AllowMethods is non-empty and no pattern matches). This is a behavioral change from the documented semantics where allowMethods “takes precedence over ignoreMethods” (i.e., it overrides blocks) but doesn’t restrict methods unless ignoreMethods blocks them. This can unintentionally deny requests for configs that set allowMethods while leaving ignoreMethods empty.

Consider restoring the previous override semantics (start with shouldApply=true, apply IgnoreMethods to set false on match, then apply AllowMethods to set true on match) or, if the new allowlist behavior is intended, update the config defaults/docs to reflect that AllowMethods implies IgnoreMethods=["*"] unless explicitly overridden (similar to upstream method filters).
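The difference between the two behaviors can be shown side by side. This is a stand-alone sketch, not the eRPC code: methodFilter, overrideSemantics, and allowlistSemantics are illustrative names, and the standard library's path.Match stands in for common.WildcardMatch, whose exact pattern syntax may differ.

```go
package main

import (
	"fmt"
	"path"
)

// methodFilter is a hypothetical stand-in for the relevant config fields.
type methodFilter struct {
	AllowMethods  []string
	IgnoreMethods []string
}

// matchesAny reports whether any pattern matches the method.
// path.Match is an assumption standing in for common.WildcardMatch;
// malformed patterns are skipped, mirroring the "continue" in the diff.
func matchesAny(patterns []string, method string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, method); err == nil && ok {
			return true
		}
	}
	return false
}

// overrideSemantics models the previous behavior: ignore blocks, allow re-enables.
func overrideSemantics(f methodFilter, method string) bool {
	shouldApply := true
	if matchesAny(f.IgnoreMethods, method) {
		shouldApply = false
	}
	if matchesAny(f.AllowMethods, method) {
		shouldApply = true
	}
	return shouldApply
}

// allowlistSemantics models the new behavior: a non-empty allow list is strict.
func allowlistSemantics(f methodFilter, method string) bool {
	if len(f.AllowMethods) > 0 {
		return matchesAny(f.AllowMethods, method)
	}
	return !matchesAny(f.IgnoreMethods, method)
}

func main() {
	// allowMethods set, ignoreMethods empty: the case called out in the review.
	f := methodFilter{AllowMethods: []string{"eth_call"}}
	fmt.Println("override: ", overrideSemantics(f, "eth_getLogs"))
	fmt.Println("allowlist:", allowlistSemantics(f, "eth_getLogs"))
}
```

With allowMethods set and ignoreMethods empty, the two functions disagree for any method outside the allow list, which is the regression risk described in the comment.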

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

🐛 1 issue in files not directly in the diff

🐛 MetricUpstreamErrorTotal WithLabelValues passes 11 args but metric now only has 10 labels — runtime panic (erpc/networks.go:1431-1436)

The PR removed the agent_name label from MetricUpstreamErrorTotal (telemetry/metrics.go:30), reducing its label count from 11 to 10. However, the call site at erpc/networks.go:1431-1436 was not updated and still passes 11 values (including req.AgentName() as the 11th argument). The Prometheus client library panics when WithLabelValues receives more values than there are label names defined on the metric vector. This code path is hit during the block-ahead skip logic in network forwarding, so it will cause a crash in production whenever an upstream is skipped due to stale block data.
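A mismatch like this can be caught before it reaches production by comparing the number of label values against the declared label names. The sketch below is a standard-library-only illustration of that check; checkLabelArity is a hypothetical helper and the label slices are placeholders, not the real label names. In the Prometheus Go client, GetMetricWithLabelValues returns an error in this situation instead of panicking, which may be a gentler option for call sites that are hard to test.

```go
package main

import "fmt"

// checkLabelArity is a hypothetical guard: it reports an error when the number
// of label values does not match the number of declared label names.
func checkLabelArity(metric string, names, values []string) error {
	if len(names) != len(values) {
		return fmt.Errorf("%s: expected %d label values, got %d",
			metric, len(names), len(values))
	}
	return nil
}

func main() {
	// Placeholder slices: 10 declared labels, 11 values passed at the call site.
	names := make([]string, 10)
	values := make([]string, 11)

	if err := checkLabelArity("MetricUpstreamErrorTotal", names, values); err != nil {
		fmt.Println("label mismatch:", err)
	}
}
```

A unit test that exercises each call site with the current label list would surface the same problem at build time rather than on the block-ahead skip path.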

View 13 additional findings in Devin Review.
