fix: prevent A2A on_message trigger infinite loop #666

Merged
yaojin3616 merged 1 commit into main from fix/a2a-on-message-loop-bug
Jun 10, 2026

Conversation

@wisdomqin (Contributor)

Problem

Two agents with mutual on_message triggers could enter an infinite loop — Agent A's internal tool_call records were misidentified as "new messages" by Agent B's trigger, causing Agent B to wake up and vice versa.

Impact: Tenant musan.ai consumed 85,846 credits in one day (normal: ~2,000/day).

Root Cause

Five defects were identified. This PR fixes four; one was already fixed.

Fixes

Fix 1 (P0): Add role filter to on_message query

Only match assistant and user messages; exclude tool_call and system records that were causing false matches.
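As an illustration of the role filter, here is a minimal in-memory sketch. The real change lives in the on_message query in evaluator.py, which is not shown in this PR page; `MessageRecord`, `TRIGGERING_ROLES`, and `messages_that_can_trigger` are hypothetical names used only for this example.

```python
from dataclasses import dataclass

# Roles that count as real conversation turns (per Fix 1).
TRIGGERING_ROLES = frozenset({"assistant", "user"})


@dataclass
class MessageRecord:
    """Hypothetical stand-in for a stored message row."""
    role: str
    content: str


def messages_that_can_trigger(records):
    """Keep only records whose role may wake an on_message trigger."""
    return [r for r in records if r.role in TRIGGERING_ROLES]


history = [
    MessageRecord("user", "please summarize"),
    MessageRecord("tool_call", '{"name": "search"}'),
    MessageRecord("system", "context refreshed"),
    MessageRecord("assistant", "here is the summary"),
]
print([m.role for m in messages_that_can_trigger(history)])
# prints ['user', 'assistant']
```

An allowlist is used here rather than a denylist, so any record role added later is excluded by default.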

Fix 2 (P1): Limit session scope

Exclude trigger internal sessions (source_channel='trigger') from message scanning, preventing cross-trigger false matches.
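The session-scope rule can be sketched the same way. Only the `source_channel='trigger'` value comes from this PR; the `Session` class and `scannable_sessions` helper are assumptions for illustration, not the project's actual model or query.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical stand-in for a chat session row."""
    id: str
    source_channel: str


def scannable_sessions(sessions):
    """Drop sessions that triggers created for their own internal runs."""
    return [s for s in sessions if s.source_channel != "trigger"]


sessions = [
    Session("s1", "web"),
    Session("s2", "trigger"),
    Session("s3", "a2a"),
]
print([s.id for s in scannable_sessions(sessions)])
# prints ['s1', 's3']
```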

Fix 3 (P1): Per-agent on_message rate limiter

Auto-disable an agent's on_message triggers if that agent fires them more than 30 times per hour, and log a warning for monitoring.
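A sliding-window limiter of this kind might look roughly like the sketch below. It assumes the "more than 30 per hour" reading of the threshold and keeps state in process memory. The names `RATE_WINDOW`, `RATE_LIMIT`, `_fire_log`, and `record_fire_and_check` are illustrative, and the code is not copied from trigger_daemon.py.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

RATE_WINDOW = timedelta(hours=1)  # length of the sliding window
RATE_LIMIT = 30                   # fires allowed per agent per window

# agent_id -> timestamps of recent on_message fires (in-memory, per process)
_fire_log = defaultdict(list)


def record_fire_and_check(agent_id, now):
    """Record one fire and return True if the agent is now over the limit."""
    cutoff = now - RATE_WINDOW
    # Drop timestamps that have slid out of the window, then add this fire.
    recent = [t for t in _fire_log[agent_id] if t > cutoff]
    recent.append(now)
    _fire_log[agent_id] = recent
    return len(recent) > RATE_LIMIT


start = datetime(2026, 1, 1, tzinfo=timezone.utc)
over = [record_fire_and_check("agent-a", start + timedelta(seconds=i)) for i in range(31)]
print(over.count(True))
# prints 1 (only the 31st fire exceeds the limit)
```

A caller would disable the agent's on_message triggers and log a warning whenever the function returns True.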

Fix 4 (P2): Default safety caps for set_trigger

Agent-created on_message triggers now default to max_fires=100 and expires_at=7 days.

Cleanup

Removed dead code MAX_AGENT_CHAIN_DEPTH = 5 (defined but never referenced).

Defense in Depth

| Layer | What it prevents |
| --- | --- |
| Role filter (Fix 1) | Root cause: tool_call messages no longer trigger on_message |
| Session scope (Fix 2) | Cross-session false matches from trigger internal sessions |
| Rate limiter (Fix 3) | Auto-disables on_message triggers if 30+ fires/hr per agent |
| Default caps (Fix 4) | Limits agent-created on_message triggers to 100 fires / 7 days |
| Cooldown (existing) | 60s minimum between consecutive fires of same trigger |

Files Changed

  • backend/app/services/trigger_runtime/evaluator.py — Fix 1 + Fix 2
  • backend/app/services/trigger_daemon.py — Fix 3 + dead code cleanup
  • backend/app/services/agent_tools.py — Fix 4

No database migrations required.

- Fix 1 (P0): Add role filter to on_message query — only match
  'assistant' and 'user' messages, exclude 'tool_call' and 'system'
- Fix 2 (P1): Exclude trigger internal sessions (source_channel='trigger')
  from on_message message scanning
- Fix 3 (P1): Add per-agent on_message hourly rate limiter (30/hr cap)
  with auto-disable and warning log
- Fix 4 (P2): Default max_fires=100 and expires_at=7d for on_message
  triggers created via set_trigger tool
- Cleanup: Remove unused MAX_AGENT_CHAIN_DEPTH dead code

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 101a7ce498

Comment on lines +8401 to +8404
if ttype == "on_message":
    trigger.max_fires = trigger.max_fires or 100
    if not trigger.expires_at:
        trigger.expires_at = datetime.now(timezone.utc) + timedelta(days=7)

P1: Apply safety caps when re-enabling on_message triggers

Move these defaults into the existing-trigger path as well. When an agent cancels and recreates a previously unlimited on_message trigger (or converts any disabled trigger to this type), the function returns at line 8389 before reaching this block, preserving max_fires=None and expires_at=None; consequently the triggers most likely to predate this fix remain permanently uncapped.
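One way to act on this suggestion is to put the defaults in a small helper that both the new-trigger path and the existing-trigger path call before returning. The sketch below assumes a simplified `Trigger` dataclass in place of the real model, and `apply_on_message_caps` is a hypothetical name. Unlike the `or 100` expression in the diff, it only fills values that are `None`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_MAX_FIRES = 100
DEFAULT_LIFETIME = timedelta(days=7)


@dataclass
class Trigger:
    """Hypothetical stand-in for the ORM trigger model."""
    type: str
    max_fires: Optional[int] = None
    expires_at: Optional[datetime] = None


def apply_on_message_caps(trigger, now=None):
    """Fill in missing safety caps on an on_message trigger; leave others alone."""
    if trigger.type != "on_message":
        return trigger
    now = now or datetime.now(timezone.utc)
    if trigger.max_fires is None:
        trigger.max_fires = DEFAULT_MAX_FIRES
    if trigger.expires_at is None:
        trigger.expires_at = now + DEFAULT_LIFETIME
    return trigger


# An old, previously unlimited trigger being re-enabled gets capped too.
legacy = apply_on_message_caps(Trigger(type="on_message"))
print(legacy.max_fires)
# prints 100
```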

Comment on lines +154 to 156
recent.append(now)
_on_msg_fire_log[trigger.agent_id] = recent
await enqueue_due_trigger(trigger, now)

P1: Count only newly enqueued on_message executions

Record a rate-limit hit only when enqueue_due_trigger actually creates an execution. The current increment occurs before enqueueing and the enqueue helper discards its (execution, created) result, so an idempotency collision—such as an agent sending the same message text twice, since the key hashes only sender and content—leaves last_fired_at unchanged while this code increments every 15-second tick; the trigger is then disabled after about 7.5 minutes despite no new execution firing.

Comment on lines +37 to +40
# Safety: per-agent on_message fire rate limiter
_ON_MSG_RATE_WINDOW = 3600 # 1 hour window
_ON_MSG_RATE_LIMIT = 30 # max on_message fires per agent per hour
_on_msg_fire_log: dict[uuid.UUID, list[datetime]] = {} # agent_id -> list of fire timestamps

P2: Persist the on_message rate-limit window

Store this safety state in shared durable storage rather than process memory. Every backend worker starts its own trigger daemon, Helm explicitly supports multiple backend replicas, and this dictionary is empty after every restart or rolling deployment, so an existing unlimited looping trigger receives a fresh allowance on each process lifecycle and the advertised per-agent hourly limit is not reliably enforced across the deployment.
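To make the idea concrete, the sketch below keeps the fire log in a database table instead of a module-level dictionary. SQLite from the standard library is used purely as a stand-in so the example is self-contained. The review does not say which shared store the project would use, and the table name `on_msg_fires` and both function names are assumptions.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open the stand-in store and make sure the fire-log table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS on_msg_fires ("
        "agent_id TEXT NOT NULL, fired_at REAL NOT NULL)"
    )
    return conn


def record_fire(conn, agent_id, now=None, window=3600.0):
    """Insert one fire, prune expired rows, and return the count in the window."""
    now = time.time() if now is None else now
    with conn:  # wraps the statements in one transaction, committed on success
        conn.execute(
            "DELETE FROM on_msg_fires WHERE agent_id = ? AND fired_at <= ?",
            (agent_id, now - window),
        )
        conn.execute("INSERT INTO on_msg_fires VALUES (?, ?)", (agent_id, now))
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM on_msg_fires WHERE agent_id = ?", (agent_id,)
        ).fetchone()
    return count


conn = open_store()
print(record_fire(conn, "agent-a", now=0.0), record_fire(conn, "agent-a", now=10.0))
# prints 1 2
```

Because the count comes from the store rather than process memory, a restart or a second worker pointing at the same store would see the same window. A multi-replica deployment would need a store that all replicas can reach, which a local SQLite file is not.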

@yaojin3616 yaojin3616 merged commit 43ff3e8 into main Jun 10, 2026
1 check failed
2 participants