Skip to content

fix(voice): scope playout state per generation step in SpeechHandle#6107

Open
anshulkulhari7 wants to merge 1 commit into
livekit:mainfrom
anshulkulhari7:refactor/5545-playout-pause-per-step
Open

fix(voice): scope playout state per generation step in SpeechHandle#6107
anshulkulhari7 wants to merge 1 commit into
livekit:mainfrom
anshulkulhari7:refactor/5545-playout-pause-per-step

Conversation

@anshulkulhari7

Copy link
Copy Markdown
Contributor

Summary

Follow-up to #5594 (the cross-step stale-paused-speech fix).

SpeechHandle is the public whole-turn handle, but a multi-step turn reuses the
same handle across generation steps while some state (playout / pause) is
generation-step local. #5594 resets _paused_speech in _scheduling_task after
the current generation's _wait_for_generation completes, which covers the
cross-step leak. It does not cover the case @jayeshp19 flagged in #5533:

the false-interruption timer fires before the silent tool-call generation
finishes. In that case the runtime can still do a thinking → listening → thinking false-resume during the silent step itself … the pause-on-thinking
path can still record _paused_speech.

This is the "pausing while no audio is actually playing" edge case, and it is what
issue #5545 proposes to fix by scoping playout/pause state per generation step.

The bug (reproduced)

  1. User asks "what's the weather in Tokyo?"
  2. The model emits only a function call — a silent tool-call step, no audio.
  3. VAD-only speech starts and ends during that silent step, so the pause-on-thinking
    path records _paused_speech and the false-interruption timer is scheduled.
  4. The timer fires while the silent step is still active (before it advances to
    the tool reply).

On main, _on_false_interruption resumes and emits
AgentFalseInterruptionEvent(resumed=True) even though no audio ever played for
this step — leaking stale per-step playout state. The added regression test
asserts no such resume is emitted; it fails on main:

>       assert not any(ev.resumed for ev in false_interruption_events)
E       assert not True
tests/test_agent_session.py: AssertionError

Approach

Mirroring the issue's proposal ("a private generation-step ref … playout callbacks,
pause state, and cleanup timers should only mutate state when their captured
generation ref still matches the handle's current generation"), this scopes playout
state to the current step:

  • SpeechHandle._playout_started — set when the current step starts audio playout
    (in the existing _on_first_frame callbacks), reset on every step advance
    (_authorize_generation). It records whether the current step is actually
    playing audio.
  • _on_false_interruption now only treats a pause as a resumable false interruption
    when the current step actually started playout. For a silent step it undoes the
    preemptive pause silently and emits no resume event.

The change is additive and backward-compatible. The existing #5594 regression test
(test_silent_tool_call_pause_state_does_not_leak_into_tool_reply) and the rest of
the interruption suite still pass, since a real tool reply has started playout by the
time its timer fires.

Question for maintainers

@longcw @davidzhao — does this per-step _playout_started gating match the internal
direction you intended for #5545? I kept it minimal (a per-step playout flag + a
single early-return in the false-interruption timer) rather than introducing a
broader public per-step handle. Happy to fold playout-state tracking more fully into
a generation-step ref if you'd prefer that shape.

Verification

Closes #5545


AI-assisted: implemented with AI assistance; all changes were reviewed, tested, and verified by the author.

A SpeechHandle represents a whole assistant turn, but a multi-step turn
(e.g. a silent tool-call step followed by a tool-reply step) reuses the
same handle across steps while playout state is generation-step local.

When a silent tool-call step records a preemptive pause and the
false-interruption timer fires while that step is still active, the
runtime resumed and emitted an AgentFalseInterruptionEvent(resumed=True)
even though no audio ever played for that step, leaking stale per-step
playout state.

Track playout per generation step via a _playout_started flag on
SpeechHandle that is set when a step starts audio playout and reset on
every step advance. The false-interruption resume now only fires when the
current generation step actually started playout; otherwise the preemptive
pause is undone silently.

Closes livekit#5545
@anshulkulhari7 anshulkulhari7 requested a review from a team as a code owner June 15, 2026 09:59

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

@tsushanth tsushanth left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid fix. The reproduction in the PR body is precise, the regression test exhibits the bug on main, and the chosen scope (per-step _playout_started flag) is the right minimal shape for closing #5545 without dragging in the broader per-step generation-ref refactor that the issue gestures at. A few items worth checking before merge but nothing structural.

Worth verifying

1. Coverage of all audio-playout entry points.

This PR patches three _on_first_frame callbacks in livekit-agents/livekit/agents/voice/agent_activity.py (around lines 2411, 2806, and 3299), each adding speech_handle._mark_generation_playout_started() immediately before the "speaking" state transition. If there's a fourth audio-start callsite anywhere (a different playout path that emits the "speaking" state without going through one of these three callbacks), it would skip the marker, leave _playout_started=False while audio actually plays, and the new guard in _on_false_interruption would then suppress a legitimate resume event on that path.

A quick scope check like rg '_update_agent_state\(\s*"speaking"' livekit-agents to confirm three matches and no others — or rg 'on_first_frame|on_playout_started' livekit-agents to confirm three callbacks total — would close that risk. The bug being prevented (false suppress) is the inverse of the bug being fixed, so it's worth a moment of due diligence.

2. Timer cleanup symmetry in _on_false_interruption.

The new early-return branch in agent_activity.py nullifies the timer explicitly:

self._paused_speech = None
self._false_interruption_timer = None
return

Whereas the existing audio-bearing path further down doesn't appear to do that in the diff context. If the existing path also resets the timer somewhere downstream, this is consistent. If not, the new branch is doing extra cleanup the old one doesn't — defensible, but worth a one-line comment explaining why the silent-step path specifically needs explicit nullification (e.g. "no resume event is emitted, so downstream resume-completion teardown won't fire and clear the timer for us").

Worth discussing

A. _generation_playout_started (property in speech_handle.py) duplicates _playout_started (the field at the same line):

# speech_handle.py
self._playout_started = False
...
@property
def _generation_playout_started(self) -> bool:
    return self._playout_started

def _mark_generation_playout_started(self) -> None:
    self._playout_started = True

The property is a pure pass-through. Two clean options:

  • Drop the property entirely; have the caller in _on_false_interruption access self._paused_speech.handle._playout_started directly. Less indirection, same readability since the field name is already descriptive.
  • Or keep the property and add an @property.setter for the write side, so there's exactly one symbol pair (_generation_playout_started getter+setter) instead of three names referring to the same state.

The current shape forces a reader to verify three names point at one piece of state. Not a blocker, but the indirection costs more than it earns when the property does nothing beyond returning the field.

B. On the design question raised in the PR body (per-step generation refs vs the minimal _playout_started flag): I'd argue the minimal approach is right here. The bug is narrowly scoped (playout state leaking across steps), and a broader per-step ref refactor would expand the public/internal surface (the issue gestures at "a private generation-step ref … passed through playout callbacks") without a current second use case. If a future bug surfaces another piece of state that needs per-step scoping, the broader refactor can be done then with two concrete examples to design against. YAGNI applies — landing the narrow fix now keeps the change reviewable and reversible.

Decision

Approving on the strength of the reproduction + regression test + correct minimal scope. The two verification items above are quick to confirm and would tighten merge confidence; the property/field consolidation is a follow-up nit, not a blocker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Track voice playout and pause state by generation step instead of reused SpeechHandle

2 participants