Add prometheus alerting rule for agent loop consecutive failures by oyinade247 · Pull Request #248 · Neurowealth/Backend

oyinade247 · 2026-06-27T00:38:34Z

Closes #224

Summary

Add agent loop health monitoring with Prometheus metrics, alerting rules, and CI validation.

Changes

Metrics (`src/utils/metrics.ts`)

- `agent_loop_errors_total` (Counter): incremented on each loop tick failure
- `agent_loop_last_success_timestamp` (Gauge): set on each successful loop tick
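
As a rough illustration of how these two metrics behave, here is a minimal sketch. It does not reproduce the actual contents of `src/utils/metrics.ts`: the `Counter` and `Gauge` classes are dependency-free stand-ins (assumptions) for whatever Prometheus client the project uses, and the help strings are hypothetical. Only the metric names and the two `record…` helper names come from this PR.

```typescript
// In-memory stand-ins for a Prometheus Counter and Gauge (assumption:
// the real module uses a Prometheus client library with a similar surface).
class Counter {
  private value = 0;
  readonly name: string;
  readonly help: string;
  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }
  inc(by: number = 1): void {
    this.value += by; // counters only ever go up
  }
  get(): number {
    return this.value;
  }
}

class Gauge {
  private value = 0;
  readonly name: string;
  readonly help: string;
  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }
  set(v: number): void {
    this.value = v; // gauges hold the latest observed value
  }
  get(): number {
    return this.value;
  }
}

// Metric names are from the PR; help strings are hypothetical.
const agentLoopErrorsTotal = new Counter(
  "agent_loop_errors_total",
  "Total number of failed agent loop ticks",
);
const agentLoopLastSuccessTimestamp = new Gauge(
  "agent_loop_last_success_timestamp",
  "Unix time in seconds of the last successful agent loop tick",
);

// Intended to be called from each catch block in the loop.
function recordAgentLoopError(): void {
  agentLoopErrorsTotal.inc();
}

// Intended to be called after each successful tick. PromQL's time() returns
// seconds since the epoch, so the gauge stores seconds, not milliseconds.
function recordAgentLoopSuccess(nowMs: number = Date.now()): void {
  agentLoopLastSuccessTimestamp.set(nowMs / 1000);
}
```

The unit matters for the `AgentLoopStalled` rule: if the gauge were set from `Date.now()` in milliseconds, `time() - agent_loop_last_success_timestamp` would be a large negative number and the alert could never fire.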

Instrumentation (`src/agent/loop.ts`)

- `recordAgentLoopSuccess()` is called on the success paths of `rebalanceCheckJob` and `snapshotJob`
- `recordAgentLoopError()` is called in each catch block
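
The placement described above could look roughly like the sketch below. `runTick`, `Job`, and `LoopMetrics` are hypothetical names introduced for illustration; the actual code in `src/agent/loop.ts` may call the recorders inline in each job rather than through a wrapper.

```typescript
// A job is one unit of loop work, e.g. rebalanceCheckJob or snapshotJob.
type Job = () => Promise<void>;

// The two recorder functions named in the PR, injected so that the
// sketch stays self-contained.
interface LoopMetrics {
  recordAgentLoopSuccess: () => void;
  recordAgentLoopError: () => void;
}

// Hypothetical helper: runs a single tick and records its outcome.
// Returns true on success and false on failure; the error is swallowed
// so that one failed tick does not stop the loop.
async function runTick(job: Job, metrics: LoopMetrics): Promise<boolean> {
  try {
    await job();
    // Success path: refresh the last-success timestamp gauge.
    metrics.recordAgentLoopSuccess();
    return true;
  } catch {
    // Failure path: bump the error counter.
    metrics.recordAgentLoopError();
    return false;
  }
}
```

Recording success only after `await job()` resolves means a job that rejects partway through never refreshes the timestamp, which is what lets the stall alert detect a loop that keeps running but never completes a tick.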

Alerting rules (`deploy/monitoring/prometheus/alert-rules.yaml`)

| Alert | Condition | Severity |
| --- | --- | --- |
| `AgentLoopConsecutiveFailures` | `increase(agent_loop_errors_total[5m]) > 3` for 2m | critical |
| `AgentLoopStalled` | `time() - agent_loop_last_success_timestamp > 600` for 2m | critical |
| `DLQDepthHigh` | `dlq_size > 50` for 1m | critical |

Each new alert includes a runbook_url annotation.
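
To make the "condition for duration" semantics concrete, the sketch below simulates how a rule with a `for` clause moves from inactive to pending to firing, using the `AgentLoopStalled` condition as the example. It is an illustrative model written for this description, not Prometheus code; the function names and the 30-second evaluation interval in the usage note are assumptions.

```typescript
type AlertState = "inactive" | "pending" | "firing";

interface Evaluation {
  atSeconds: number; // time of the rule evaluation
  conditionTrue: boolean; // whether the alert expression matched
}

// AgentLoopStalled condition from the table:
//   time() - agent_loop_last_success_timestamp > 600
function isStalled(nowSeconds: number, lastSuccessSeconds: number): boolean {
  return nowSeconds - lastSuccessSeconds > 600;
}

// Models the `for` clause: the condition must hold at every evaluation
// for at least `forSeconds` before the alert fires. A single false
// evaluation resets the alert to inactive.
function alertStateAfter(evals: Evaluation[], forSeconds: number): AlertState {
  let activeSince: number | null = null;
  let state: AlertState = "inactive";
  for (const e of evals) {
    if (!e.conditionTrue) {
      activeSince = null;
      state = "inactive";
      continue;
    }
    if (activeSince === null) {
      activeSince = e.atSeconds; // first evaluation where the condition held
    }
    state = e.atSeconds - activeSince >= forSeconds ? "firing" : "pending";
  }
  return state;
}
```

With a last success at t = 0 and evaluations every 30 seconds, the condition first holds at t = 630, so under this model the alert is pending until t = 750 and fires from then on. In other words, a stalled loop pages roughly 12 to 13 minutes after its last successful tick, not 10.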

CI validation (`.github/workflows/k8s-validate.yml`)

- `deploy/monitoring/**` added to the push/PR trigger paths
- `promtool check rules` step added to validate the alert rules

drips-wave · 2026-06-27T00:39:40Z

@oyinade247 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

oyinade247 added 2 commits June 27, 2026 01:18

refactor: improve type safety across core modules by replacing any with specific types and unknown (26d9666)

feat: implement agent loop health monitoring with metrics instrumentation and Prometheus alerting rules (67e4a9a)

oyinade247 wants to merge 2 commits into Neurowealth:main from oyinade247:Add-Prometheus-alerting-rule-for-agent-loop-consecutive-failures
