fix(voice): scope playout state per generation step in SpeechHandle by anshulkulhari7 · Pull Request #6107 · livekit/agents

anshulkulhari7 · 2026-06-15T09:59:13Z

Summary

Follow-up to #5594 (the cross-step stale-paused-speech fix).

SpeechHandle is the public whole-turn handle, but a multi-step turn reuses the
same handle across generation steps while some state (playout / pause) is
generation-step local. #5594 resets _paused_speech in _scheduling_task after
the current generation's _wait_for_generation completes, which covers the
cross-step leak. It does not cover the case @jayeshp19 flagged in #5533:

the false-interruption timer fires before the silent tool-call generation
finishes. In that case the runtime can still do a thinking → listening → thinking false-resume during the silent step itself … the pause-on-thinking
path can still record _paused_speech.

This is the "pausing while no audio is actually playing" edge case, and it is what
issue #5545 proposes to fix by scoping playout/pause state per generation step.

The bug (reproduced)

User asks "what's the weather in Tokyo?"
The model emits only a function call — a silent tool-call step, no audio.
VAD-only speech starts and ends during that silent step, so the pause-on-thinking
path records _paused_speech and the false-interruption timer is scheduled.
The timer fires while the silent step is still active (before it advances to
the tool reply).

On main, _on_false_interruption resumes and emits
AgentFalseInterruptionEvent(resumed=True) even though no audio ever played for
this step — leaking stale per-step playout state. The added regression test
asserts no such resume is emitted; it fails on main:

>       assert not any(ev.resumed for ev in false_interruption_events)
E       assert not True
tests/test_agent_session.py: AssertionError

Approach

Mirroring the issue's proposal ("a private generation-step ref … playout callbacks,
pause state, and cleanup timers should only mutate state when their captured
generation ref still matches the handle's current generation"), this scopes playout
state to the current step:

SpeechHandle._playout_started — set when the current step starts audio playout
(in the existing _on_first_frame callbacks), reset on every step advance
(_authorize_generation). It records whether the current step is actually
playing audio.
_on_false_interruption now only treats a pause as a resumable false interruption
when the current step actually started playout. For a silent step it undoes the
preemptive pause silently and emits no resume event.

The change is additive and backward-compatible. The existing #5594 regression test
(test_silent_tool_call_pause_state_does_not_leak_into_tool_reply) and the rest of
the interruption suite still pass, since a real tool reply has started playout by the
time its timer fires.

Question for maintainers

@longcw @davidzhao — does this per-step _playout_started gating match the internal
direction you intended for #5545? I kept it minimal (a per-step playout flag + a
single early-return in the false-interruption timer) rather than introducing a
broader public per-step handle. Happy to fold playout-state tracking more fully into
a generation-step ref if you'd prefer that shape.

Verification

uv run pytest tests/test_agent_session.py --unit — 34 passed (incl. new test and
the fix: clear stale paused speech state across generation steps #5594 regression test)
uv run ruff check / ruff format --check — clean on changed files
uv run mypy -p livekit.agents (strict) — Success, no issues

Closes #5545

AI-assisted: implemented with AI assistance; all changes were reviewed, tested, and verified by the author.

A SpeechHandle represents a whole assistant turn, but a multi-step turn (e.g. a silent tool-call step followed by a tool-reply step) reuses the same handle across steps while playout state is generation-step local. When a silent tool-call step records a preemptive pause and the false-interruption timer fires while that step is still active, the runtime resumed and emitted an AgentFalseInterruptionEvent(resumed=True) even though no audio ever played for that step, leaking stale per-step playout state. Track playout per generation step via a _playout_started flag on SpeechHandle that is set when a step starts audio playout and reset on every step advance. The false-interruption resume now only fires when the current generation step actually started playout; otherwise the preemptive pause is undone silently. Closes livekit#5545

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

tsushanth

Solid fix. The reproduction in the PR body is precise, the regression test exhibits the bug on main, and the chosen scope (per-step _playout_started flag) is the right minimal shape for closing #5545 without dragging in the broader per-step generation-ref refactor that the issue gestures at. A few items worth checking before merge but nothing structural.

Worth verifying

1. Coverage of all audio-playout entry points.

This PR patches three _on_first_frame callbacks in livekit-agents/livekit/agents/voice/agent_activity.py (around lines 2411, 2806, and 3299), each adding speech_handle._mark_generation_playout_started() immediately before the "speaking" state transition. If there's a fourth audio-start callsite anywhere (a different playout path that emits the "speaking" state without going through one of these three callbacks), it would skip the marker, leave _playout_started=False while audio actually plays, and the new guard in _on_false_interruption would then suppress a legitimate resume event on that path.

A quick scope check like rg '_update_agent_state\(\s*"speaking"' livekit-agents to confirm three matches and no others — or rg 'on_first_frame|on_playout_started' livekit-agents to confirm three callbacks total — would close that risk. The bug being prevented (false suppress) is the inverse of the bug being fixed, so it's worth a moment of due diligence.

2. Timer cleanup symmetry in _on_false_interruption.

The new early-return branch in agent_activity.py nullifies the timer explicitly:

self._paused_speech = None
self._false_interruption_timer = None
return

Whereas the existing audio-bearing path further down doesn't appear to do that in the diff context. If the existing path also resets the timer somewhere downstream, this is consistent. If not, the new branch is doing extra cleanup the old one doesn't — defensible, but worth a one-line comment explaining why the silent-step path specifically needs explicit nullification (e.g. "no resume event is emitted, so downstream resume-completion teardown won't fire and clear the timer for us").

Worth discussing

A. _generation_playout_started (property in speech_handle.py) duplicates _playout_started (the field at the same line):

# speech_handle.py
self._playout_started = False
...
@property
def _generation_playout_started(self) -> bool:
    return self._playout_started

def _mark_generation_playout_started(self) -> None:
    self._playout_started = True

The property is a pure pass-through. Two clean options:

Drop the property entirely; have the caller in _on_false_interruption access self._paused_speech.handle._playout_started directly. Less indirection, same readability since the field name is already descriptive.
Or keep the property and add an @property.setter for the write side, so there's exactly one symbol pair (_generation_playout_started getter+setter) instead of three names referring to the same state.

The current shape forces a reader to verify three names point at one piece of state. Not a blocker, but the indirection costs more than it earns when the property does nothing beyond returning the field.

B. On the design question raised in the PR body (per-step generation refs vs the minimal _playout_started flag): I'd argue the minimal approach is right here. The bug is narrowly scoped (playout state leaking across steps), and a broader per-step ref refactor would expand the public/internal surface (the issue gestures at "a private generation-step ref … passed through playout callbacks") without a current second use case. If a future bug surfaces another piece of state that needs per-step scoping, the broader refactor can be done then with two concrete examples to design against. YAGNI applies — landing the narrow fix now keeps the change reviewable and reversible.

Decision

Approving on the strength of the reproduction + regression test + correct minimal scope. The two verification items above are quick to confirm and would tighten merge confidence; the property/field consolidation is a follow-up nit, not a blocker.

anshulkulhari7 requested a review from a team as a code owner June 15, 2026 09:59

devin-ai-integration Bot reviewed Jun 15, 2026

View reviewed changes

tsushanth approved these changes Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(voice): scope playout state per generation step in SpeechHandle#6107

fix(voice): scope playout state per generation step in SpeechHandle#6107
anshulkulhari7 wants to merge 1 commit into
livekit:mainfrom
anshulkulhari7:refactor/5545-playout-pause-per-step

anshulkulhari7 commented Jun 15, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

tsushanth left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

anshulkulhari7 commented Jun 15, 2026

Summary

The bug (reproduced)

Approach

Question for maintainers

Verification

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

tsushanth left a comment

Choose a reason for hiding this comment

Worth verifying

Worth discussing

Decision

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants