fix(voice): scope playout state per generation step in SpeechHandle#6107
fix(voice): scope playout state per generation step in SpeechHandle#6107anshulkulhari7 wants to merge 1 commit into
Conversation
A SpeechHandle represents a whole assistant turn, but a multi-step turn (e.g. a silent tool-call step followed by a tool-reply step) reuses the same handle across steps while playout state is generation-step local. When a silent tool-call step records a preemptive pause and the false-interruption timer fires while that step is still active, the runtime resumed and emitted an AgentFalseInterruptionEvent(resumed=True) even though no audio ever played for that step, leaking stale per-step playout state. Track playout per generation step via a _playout_started flag on SpeechHandle that is set when a step starts audio playout and reset on every step advance. The false-interruption resume now only fires when the current generation step actually started playout; otherwise the preemptive pause is undone silently. Closes livekit#5545
tsushanth
left a comment
There was a problem hiding this comment.
Solid fix. The reproduction in the PR body is precise, the regression test exhibits the bug on main, and the chosen scope (per-step _playout_started flag) is the right minimal shape for closing #5545 without dragging in the broader per-step generation-ref refactor that the issue gestures at. A few items worth checking before merge but nothing structural.
Worth verifying
1. Coverage of all audio-playout entry points.
This PR patches three _on_first_frame callbacks in livekit-agents/livekit/agents/voice/agent_activity.py (around lines 2411, 2806, and 3299), each adding speech_handle._mark_generation_playout_started() immediately before the "speaking" state transition. If there's a fourth audio-start callsite anywhere (a different playout path that emits the "speaking" state without going through one of these three callbacks), it would skip the marker, leave _playout_started=False while audio actually plays, and the new guard in _on_false_interruption would then suppress a legitimate resume event on that path.
A quick scope check like rg '_update_agent_state\(\s*"speaking"' livekit-agents to confirm three matches and no others — or rg 'on_first_frame|on_playout_started' livekit-agents to confirm three callbacks total — would close that risk. The bug being prevented (false suppress) is the inverse of the bug being fixed, so it's worth a moment of due diligence.
2. Timer cleanup symmetry in _on_false_interruption.
The new early-return branch in agent_activity.py nullifies the timer explicitly:
self._paused_speech = None
self._false_interruption_timer = None
returnWhereas the existing audio-bearing path further down doesn't appear to do that in the diff context. If the existing path also resets the timer somewhere downstream, this is consistent. If not, the new branch is doing extra cleanup the old one doesn't — defensible, but worth a one-line comment explaining why the silent-step path specifically needs explicit nullification (e.g. "no resume event is emitted, so downstream resume-completion teardown won't fire and clear the timer for us").
Worth discussing
A. _generation_playout_started (property in speech_handle.py) duplicates _playout_started (the field at the same line):
# speech_handle.py
self._playout_started = False
...
@property
def _generation_playout_started(self) -> bool:
return self._playout_started
def _mark_generation_playout_started(self) -> None:
self._playout_started = TrueThe property is a pure pass-through. Two clean options:
- Drop the property entirely; have the caller in
_on_false_interruptionaccessself._paused_speech.handle._playout_starteddirectly. Less indirection, same readability since the field name is already descriptive. - Or keep the property and add an
@property.setterfor the write side, so there's exactly one symbol pair (_generation_playout_startedgetter+setter) instead of three names referring to the same state.
The current shape forces a reader to verify three names point at one piece of state. Not a blocker, but the indirection costs more than it earns when the property does nothing beyond returning the field.
B. On the design question raised in the PR body (per-step generation refs vs the minimal _playout_started flag): I'd argue the minimal approach is right here. The bug is narrowly scoped (playout state leaking across steps), and a broader per-step ref refactor would expand the public/internal surface (the issue gestures at "a private generation-step ref … passed through playout callbacks") without a current second use case. If a future bug surfaces another piece of state that needs per-step scoping, the broader refactor can be done then with two concrete examples to design against. YAGNI applies — landing the narrow fix now keeps the change reviewable and reversible.
Decision
Approving on the strength of the reproduction + regression test + correct minimal scope. The two verification items above are quick to confirm and would tighten merge confidence; the property/field consolidation is a follow-up nit, not a blocker.
Summary
Follow-up to #5594 (the cross-step stale-paused-speech fix).
SpeechHandleis the public whole-turn handle, but a multi-step turn reuses thesame handle across generation steps while some state (playout / pause) is
generation-step local. #5594 resets
_paused_speechin_scheduling_taskafterthe current generation's
_wait_for_generationcompletes, which covers thecross-step leak. It does not cover the case @jayeshp19 flagged in #5533:
This is the "pausing while no audio is actually playing" edge case, and it is what
issue #5545 proposes to fix by scoping playout/pause state per generation step.
The bug (reproduced)
path records
_paused_speechand the false-interruption timer is scheduled.the tool reply).
On
main,_on_false_interruptionresumes and emitsAgentFalseInterruptionEvent(resumed=True)even though no audio ever played forthis step — leaking stale per-step playout state. The added regression test
asserts no such resume is emitted; it fails on
main:Approach
Mirroring the issue's proposal ("a private generation-step ref … playout callbacks,
pause state, and cleanup timers should only mutate state when their captured
generation ref still matches the handle's current generation"), this scopes playout
state to the current step:
SpeechHandle._playout_started— set when the current step starts audio playout(in the existing
_on_first_framecallbacks), reset on every step advance(
_authorize_generation). It records whether the current step is actuallyplaying audio.
_on_false_interruptionnow only treats a pause as a resumable false interruptionwhen the current step actually started playout. For a silent step it undoes the
preemptive pause silently and emits no resume event.
The change is additive and backward-compatible. The existing #5594 regression test
(
test_silent_tool_call_pause_state_does_not_leak_into_tool_reply) and the rest ofthe interruption suite still pass, since a real tool reply has started playout by the
time its timer fires.
Question for maintainers
@longcw @davidzhao — does this per-step
_playout_startedgating match the internaldirection you intended for #5545? I kept it minimal (a per-step playout flag + a
single early-return in the false-interruption timer) rather than introducing a
broader public per-step handle. Happy to fold playout-state tracking more fully into
a generation-step ref if you'd prefer that shape.
Verification
uv run pytest tests/test_agent_session.py --unit— 34 passed (incl. new test andthe fix: clear stale paused speech state across generation steps #5594 regression test)
uv run ruff check/ruff format --check— clean on changed filesuv run mypy -p livekit.agents(strict) — Success, no issuesCloses #5545
AI-assisted: implemented with AI assistance; all changes were reviewed, tested, and verified by the author.