fix: prevent A2A on_message trigger infinite loop#666
Conversation
- Fix 1 (P0): Add role filter to on_message query — only match 'assistant' and 'user' messages, exclude 'tool_call' and 'system' - Fix 2 (P1): Exclude trigger internal sessions (source_channel='trigger') from on_message message scanning - Fix 3 (P1): Add per-agent on_message hourly rate limiter (30/hr cap) with auto-disable and warning log - Fix 4 (P2): Default max_fires=100 and expires_at=7d for on_message triggers created via set_trigger tool - Cleanup: Remove unused MAX_AGENT_CHAIN_DEPTH dead code
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 101a7ce498
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if ttype == "on_message": | ||
| trigger.max_fires = trigger.max_fires or 100 | ||
| if not trigger.expires_at: | ||
| trigger.expires_at = datetime.now(timezone.utc) + timedelta(days=7) |
There was a problem hiding this comment.
Apply safety caps when re-enabling on_message triggers
Move these defaults into the existing-trigger path as well. When an agent cancels and recreates a previously unlimited on_message trigger (or converts any disabled trigger to this type), the function returns at line 8389 before reaching this block, preserving max_fires=None and expires_at=None; consequently the triggers most likely to predate this fix remain permanently uncapped.
Useful? React with 👍 / 👎.
| recent.append(now) | ||
| _on_msg_fire_log[trigger.agent_id] = recent | ||
| await enqueue_due_trigger(trigger, now) |
There was a problem hiding this comment.
Count only newly enqueued on_message executions
Record a rate-limit hit only when enqueue_due_trigger actually creates an execution. The current increment occurs before enqueueing and the enqueue helper discards its (execution, created) result, so an idempotency collision—such as an agent sending the same message text twice, since the key hashes only sender and content—leaves last_fired_at unchanged while this code increments every 15-second tick; the trigger is then disabled after about 7.5 minutes despite no new execution firing.
Useful? React with 👍 / 👎.
| # Safety: per-agent on_message fire rate limiter | ||
| _ON_MSG_RATE_WINDOW = 3600 # 1 hour window | ||
| _ON_MSG_RATE_LIMIT = 30 # max on_message fires per agent per hour | ||
| _on_msg_fire_log: dict[uuid.UUID, list[datetime]] = {} # agent_id -> list of fire timestamps |
There was a problem hiding this comment.
Persist the on_message rate-limit window
Store this safety state in shared durable storage rather than process memory. Every backend worker starts its own trigger daemon, Helm explicitly supports multiple backend replicas, and this dictionary is empty after every restart or rolling deployment, so an existing unlimited looping trigger receives a fresh allowance on each process lifecycle and the advertised per-agent hourly limit is not reliably enforced across the deployment.
Useful? React with 👍 / 👎.
Problem
Two agents with mutual
on_messagetriggers could enter an infinite loop — Agent A's internal tool_call records were misidentified as "new messages" by Agent B's trigger, causing Agent B to wake up and vice versa.Impact: Tenant
musan.aiconsumed 85,846 credits in one day (normal: ~2,000/day).Root Cause
5 defects identified, 4 fixed (1 was already fixed):
Fixes
Fix 1 (P0): Add
rolefilter to on_message queryOnly match
assistantandusermessages; excludetool_callandsystemrecords that were causing false matches.Fix 2 (P1): Limit session scope
Exclude trigger internal sessions (
source_channel='trigger') from message scanning, preventing cross-trigger false matches.Fix 3 (P1): Per-agent on_message rate limiter
Auto-disable on_message triggers if a single agent fires them >30 times per hour. Logs warning for monitoring.
Fix 4 (P2): Default safety caps for
set_triggerAgent-created
on_messagetriggers now default tomax_fires=100andexpires_at=7 days.Cleanup
Removed dead code
MAX_AGENT_CHAIN_DEPTH = 5(defined but never referenced).Defense in Depth
Files Changed
backend/app/services/trigger_runtime/evaluator.py— Fix 1 + Fix 2backend/app/services/trigger_daemon.py— Fix 3 + dead code cleanupbackend/app/services/agent_tools.py— Fix 4No database migrations required.