Add Prometheus alerting rule for agent loop consecutive failures #224

Description

@robertocarlous

Summary

The agent loop can fail repeatedly without triggering any alert. An alerting rule on consecutive agent loop errors would page the on-call engineer before user funds are affected.

Proposed Solution

Add to deploy/monitoring/prometheus/alert-rules.yaml:

- alert: AgentLoopConsecutiveFailures
  expr: increase(agent_loop_errors_total[5m]) > 3
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Agent loop has failed {{ $value }} times in 5 minutes"
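
One way to check that this rule fires as intended is a promtool unit test. The sketch below assumes the rule sits inside a `groups:` block in alert-rules.yaml and that the test file (hypothetical name `alert-rules.test.yaml`) lives in the same directory. The input series are synthetic.

```yaml
# alert-rules.test.yaml (hypothetical file name, next to alert-rules.yaml)
rule_files:
  - alert-rules.yaml

evaluation_interval: 1m

tests:
  # Case 1: the error counter grows by one per minute, so the 5m increase
  # stays above 3 for longer than the 2m "for" window and the alert fires.
  - interval: 1m
    input_series:
      - series: 'agent_loop_errors_total'
        values: '0+1x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: AgentLoopConsecutiveFailures
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "Agent loop has failed 5 times in 5 minutes"

  # Case 2: the counter is flat, so no alert is expected.
  - interval: 1m
    input_series:
      - series: 'agent_loop_errors_total'
        values: '7+0x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: AgentLoopConsecutiveFailures
```

Run it with `promtool test rules alert-rules.test.yaml`. If further annotations such as a runbook URL are added to the rule, they would also need to appear under `exp_annotations`, since promtool compares annotations exactly.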

Also add:

  • AgentLoopStalled: no successful loop tick in > 10 minutes
  • DLQDepthHigh: DLQ depth > 50 unprocessed events
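
The two additional alerts could look roughly like the sketch below. Both metric names (`agent_loop_last_success_timestamp_seconds`, `dlq_depth`) are assumptions, as are the `for` duration, the severities, and the placeholder runbook URLs. Substitute whatever the service actually exports.

```yaml
# Sketch only: metric names, severities, and runbook URLs are assumptions.
- alert: AgentLoopStalled
  # Assumes a gauge holding the Unix timestamp of the last successful tick.
  expr: time() - agent_loop_last_success_timestamp_seconds > 600
  labels:
    severity: critical
  annotations:
    summary: "No successful agent loop tick for more than 10 minutes"
    runbook_url: "https://example.com/runbooks/agent-loop-stalled"  # placeholder

- alert: DLQDepthHigh
  # Assumes a gauge reporting the number of unprocessed DLQ events.
  expr: dlq_depth > 50
  for: 5m  # assumed, to avoid paging on short spikes
  labels:
    severity: warning
  annotations:
    summary: "DLQ holds {{ $value }} unprocessed events"
    runbook_url: "https://example.com/runbooks/dlq-depth-high"  # placeholder
```

Note that the AgentLoopStalled expression returns nothing if the timestamp metric is missing entirely, so a separate `absent(...)` rule may be worth considering for a process that never reports.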

Acceptance Criteria

  • Alert rules added and validated with promtool check rules
  • Alerts fire correctly in a local Prometheus instance
  • Runbook URL added to each alert annotation
  • CI step added to validate alert rules in k8s-validate.yml
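
A possible shape for the CI step, assuming k8s-validate.yml is a GitHub Actions workflow and that running promtool from the official `prom/prometheus` image is acceptable. The job name is hypothetical, and the image tag should be pinned to whichever Prometheus version the deployment uses.

```yaml
# Hypothetical job fragment for k8s-validate.yml (GitHub Actions assumed).
jobs:
  validate-alert-rules:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate Prometheus alert rules
        run: |
          # Mount the rules directory read-only and run promtool from the image.
          docker run --rm \
            -v "${PWD}/deploy/monitoring/prometheus:/rules:ro" \
            --entrypoint /bin/promtool \
            prom/prometheus:latest \
            check rules /rules/alert-rules.yaml
```

`promtool check rules` expects a plain rule file with a top-level `groups:` key. If alert-rules.yaml is instead a Kubernetes manifest wrapping the rules (for example a PrometheusRule resource), the rule groups would need to be extracted before validation.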

Labels

Stellar Wave (Issues in the Stellar wave program), enhancement (New feature or request)
