Add Prometheus alerting rule for agent loop consecutive failures #248

Open
oyinade247 wants to merge 2 commits into Neurowealth:main from oyinade247:Add-Prometheus-alerting-rule-for-agent-loop-consecutive-failures

@oyinade247 oyinade247 commented Jun 27, 2026

Closes #224

Summary

Add agent loop health monitoring with Prometheus metrics, alerting rules, and CI validation.

Changes

Metrics (src/utils/metrics.ts)

  • agent_loop_errors_total (Counter) — incremented on each loop tick failure
  • agent_loop_last_success_timestamp (Gauge) — set on each successful loop tick

Instrumentation (src/agent/loop.ts)

  • recordAgentLoopSuccess() on success paths in rebalanceCheckJob and snapshotJob
  • recordAgentLoopError() on each catch block
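
The metrics and instrumentation described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the `Counter` and `Gauge` classes are dependency-free stand-ins for whatever Prometheus client library `src/utils/metrics.ts` uses, the `nowMs` parameter and the `runTick` wrapper are hypothetical, and storing the timestamp in Unix seconds is an assumption made so that `time() - agent_loop_last_success_timestamp` yields seconds.

```typescript
// Stand-in for a Prometheus client Counter (assumption: the real code uses a client library).
class Counter {
  private value = 0;
  inc(): void {
    this.value += 1;
  }
  get(): number {
    return this.value;
  }
}

// Stand-in for a Prometheus client Gauge.
class Gauge {
  private value = 0;
  set(v: number): void {
    this.value = v;
  }
  get(): number {
    return this.value;
  }
}

// agent_loop_errors_total: incremented on each loop tick failure.
const agentLoopErrorsTotal = new Counter();
// agent_loop_last_success_timestamp: set on each successful loop tick.
const agentLoopLastSuccessTimestamp = new Gauge();

// nowMs is a hypothetical parameter, added here to make the sketch testable.
function recordAgentLoopSuccess(nowMs: number = Date.now()): void {
  // Assumption: the gauge holds Unix seconds, matching PromQL's time().
  agentLoopLastSuccessTimestamp.set(nowMs / 1000);
}

function recordAgentLoopError(): void {
  agentLoopErrorsTotal.inc();
}

// Hypothetical wrapper showing the success-path / catch-block pattern
// that jobs such as rebalanceCheckJob and snapshotJob would follow.
async function runTick(job: () => Promise<void>): Promise<boolean> {
  try {
    await job();
    recordAgentLoopSuccess();
    return true;
  } catch {
    recordAgentLoopError();
    return false;
  }
}
```

A job would call `await runTick(async () => { /* tick body */ })`; a failed tick leaves the gauge untouched, so the last-success timestamp keeps ageing while the error counter grows.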

Alerting rules (deploy/monitoring/prometheus/alert-rules.yaml)

| Alert | Condition | Severity |
| --- | --- | --- |
| AgentLoopConsecutiveFailures | increase(agent_loop_errors_total[5m]) > 3 for 2m | critical |
| AgentLoopStalled | time() - agent_loop_last_success_timestamp > 600 for 2m | critical |
| DLQDepthHigh | dlq_size > 50 for 1m | critical |

Each new alert includes a runbook_url annotation.
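
To make the thresholds concrete, here is a rough TypeScript model of the three alert conditions. It is a sketch for reasoning about the rules, not how Prometheus evaluates them: `increaseOver` is a hypothetical helper that takes last-minus-first over the window and ignores counter resets and PromQL's extrapolation, samples are assumed to be sorted by time, and the `for` durations are not modelled.

```typescript
interface Sample {
  timestampSec: number;
  value: number;
}

// Simplified increase(): last sample minus first sample inside the window.
// Assumes samples are sorted by timestamp; ignores resets and extrapolation.
function increaseOver(samples: Sample[], nowSec: number, windowSec: number): number {
  const inWindow = samples.filter(
    (s) => s.timestampSec > nowSec - windowSec && s.timestampSec <= nowSec,
  );
  if (inWindow.length < 2) return 0;
  return inWindow[inWindow.length - 1].value - inWindow[0].value;
}

// AgentLoopConsecutiveFailures: increase(agent_loop_errors_total[5m]) > 3
function consecutiveFailuresCondition(errorSamples: Sample[], nowSec: number): boolean {
  return increaseOver(errorSamples, nowSec, 300) > 3;
}

// AgentLoopStalled: time() - agent_loop_last_success_timestamp > 600
function stalledCondition(lastSuccessSec: number, nowSec: number): boolean {
  return nowSec - lastSuccessSec > 600;
}

// DLQDepthHigh: dlq_size > 50
function dlqDepthCondition(dlqSize: number): boolean {
  return dlqSize > 50;
}
```

Note that the first expression counts errors within a five-minute window; it does not check that the failures were strictly back to back, so interleaved successes would not prevent it from firing.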

CI validation (.github/workflows/k8s-validate.yml)

  • deploy/monitoring/** added to push/PR trigger paths
  • promtool check rules step added to validate alert rules

drips-wave Bot commented Jun 27, 2026

@oyinade247 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while this PR awaits review. Keep up the great work! 🚀
