Summary
The agent loop can silently fail repeatedly without triggering any alert. An alert rule on consecutive agent loop errors would page on-call before user funds are affected.
Proposed Solution
Add to deploy/monitoring/prometheus/alert-rules.yaml:
- alert: AgentLoopConsecutiveFailures
expr: increase(agent_loop_errors_total[5m]) > 3
for: 2m
labels:
severity: critical
annotations:
summary: "Agent loop has failed {{ $value }} times in 5 minutes"
Also add:
AgentLoopStalled: no successful loop tick in > 10 minutes
DLQDepthHigh: DLQ depth > 50 unprocessed events
Acceptance Criteria
Summary
The agent loop can silently fail repeatedly without triggering any alert. An alert rule on consecutive agent loop errors would page on-call before user funds are affected.
Proposed Solution
Add to
deploy/monitoring/prometheus/alert-rules.yaml:Also add:
AgentLoopStalled: no successful loop tick in > 10 minutesDLQDepthHigh: DLQ depth > 50 unprocessed eventsAcceptance Criteria
promtool check rulesk8s-validate.yml