feat(backchannel): add agent backchannel #6130
| tts_instructions_template: Instructions | str | ||
| tts_instructions_append: str | ||
| audio_recognition_instructions_template: Instructions | str | ||
| backchannel: NotGivenOr[bool | list[str | AudioSource | BackchannelConfig] | BackchannelOptions] |
i'm wondering if we could simplify these types somehow
claude suggested:
# backchannel.py
+ BackchannelSource: TypeAlias = "str | AudioSource | BackchannelConfig"
class BackchannelOptions(TypedDict, total=False):
frequency: float
- source: NotGivenOr[list[str | AudioSource | BackchannelConfig]]
+ source: list[BackchannelSource]
- DEFAULT_BACKCHANNEL_SOURCE: list[str | AudioSource | BackchannelConfig] = [...]
+ DEFAULT_BACKCHANNEL_SOURCE: list[BackchannelSource] = [...]
def resolve_backchannel_options(
- backchannel: NotGivenOr[bool | list[str | AudioSource | BackchannelConfig] | BackchannelOptions],
+ backchannel: NotGivenOr[bool | list[BackchannelSource] | BackchannelOptions],
) -> BackchannelOptions | None: ...
- def _as_config(entry: str | AudioSource | BackchannelConfig) -> BackchannelConfig: ...
+ def _as_config(entry: BackchannelSource) -> BackchannelConfig: ...
# agent_session.py
class ExpressiveOptions(TypedDict, total=False):
...
- backchannel: NotGivenOr[bool | list[str | AudioSource | BackchannelConfig] | BackchannelOptions]
+ backchannel: bool | list[BackchannelSource] | BackchannelOptions
…channel-expressive-mode
…channel-expressive-mode # Conflicts: # examples/hotel_receptionist/agent.py
tsushanth left a comment
Thoughtful design. The EOT-band gating (safe sounds near threshold, risky lexical words only at low EOT fractions), the per-clip cooldown via _HOTNESS_DECAY, and the explicit TTFF cap on the first render all show genuine engineering taste about the latency/relevance tradeoff that's specific to backchannels: "a late one is worse than none" is exactly the right framing. The 300 ms first-frame budget with silent-drop fallback is good defensive design.
One concrete architectural concern, plus a few worth-flagging items.
Must-confirm
1. activity.say(...) while the agent may already be mid-utterance.
on_agent_backchannel_opportunity fires (per audio_recognition.py:1511) when the cloud turn detector emits a backchannel signal, which is gated by the user's speech state. But the agent's speech state isn't checked anywhere in _BackchannelEmitter. The edge case I'm worried about:
- Agent finishes saying a long response, the activity is still draining the last few audio frames
- User starts a new utterance overlapping the tail of the agent's audio
- Turn detector emits a backchannel opportunity ("user has just started talking")
- maybe_emit → _play → activity.say(transcript, audio=_iter_frames(frames), allow_interruptions=False)
- The backchannel audio now overlaps the agent's own still-draining response
activity.say may already gate this. I don't see the implementation in the diff, but if it queues backchannel speech behind active speech, the backchannel arrives late (and the in-code comment "a late one is worse than none" applies: better to drop than queue). If activity.say plays immediately on a separate channel, you get overlap of the agent's own audio.
Two clean resolutions, either works:
- (a) Gate maybe_emit on not activity.is_speaking() (or whatever the AgentActivity equivalent is) before adding to the pool. Backchannels are dropped silently when the agent is mid-utterance.
- (b) Document the contract on activity.say and either rely on it (with a comment in _play referencing the assumed behavior) or pick (a).
Worth confirming the current behavior with a test that has the agent mid-say() when a backchannel opportunity fires; currently test_backchannel.py doesn't appear to exercise this overlap case based on a quick scan.
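To make option (a) concrete, here is a minimal sketch of the drop-instead-of-queue gate. Everything in it is hypothetical: GatedBackchannelEmitter, the is_agent_speaking hook, and the recording _play are stand-ins I made up for illustration, not the PR's _BackchannelEmitter or the real AgentActivity API.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GatedBackchannelEmitter:
    """Hypothetical stand-in for _BackchannelEmitter; not the PR's actual class."""

    # Assumed hook: returns True while the agent still has audio playing or draining.
    is_agent_speaking: Callable[[], bool]
    played: list[str] = field(default_factory=list)
    dropped: int = 0

    async def maybe_emit(self, clip: str) -> bool:
        # Option (a): drop rather than queue, since a late backchannel is worse than none.
        if self.is_agent_speaking():
            self.dropped += 1
            return False
        await self._play(clip)
        return True

    async def _play(self, clip: str) -> None:
        # The real code would call activity.say(...) here; this sketch only records the clip.
        self.played.append(clip)


async def demo() -> tuple[list[str], int]:
    speaking = True
    emitter = GatedBackchannelEmitter(is_agent_speaking=lambda: speaking)
    await emitter.maybe_emit("mm-hmm")  # agent mid-utterance: dropped
    speaking = False
    await emitter.maybe_emit("right")  # agent idle: played
    return emitter.played, emitter.dropped
```

The same shape doubles as the missing test: force the speaking state to True, fire an opportunity, and assert that nothing reached _play.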
Worth discussing
A. _synthesize enforces TTFF only on the first frame:
first = await asyncio.wait_for(it.__anext__(), timeout=_SYNTH_TTFF_TIMEOUT)
frames.append(first.frame)
async for ev in it: # no timeout on subsequent frames
    frames.append(ev.frame)

The first-frame cap matches the "late one is worse than none" design, which is good. But once the first frame arrives, the remaining frames have no time budget. A TTS that's fast on TTFF but slow on subsequent frames could hang the _render task. For typical backchannel clips (sub-second utterances) this is unlikely to be a real issue, but a total-time budget (e.g. _SYNTH_TTFF_TIMEOUT * 4 for the full clip) would be a cheap defensive guard.
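A rough sketch of what that total-time guard could look like, using a deadline on the event loop clock. The names collect_frames and _SYNTH_TOTAL_TIMEOUT, the bytes frame type, and the two fake streams are assumptions for illustration; the 0.3 s value mirrors the 300 ms budget mentioned above, and the 4x multiplier is only the example figure from this comment.

```python
import asyncio
from typing import AsyncIterator

_SYNTH_TTFF_TIMEOUT = 0.3  # first-frame budget (300 ms, as discussed above)
_SYNTH_TOTAL_TIMEOUT = _SYNTH_TTFF_TIMEOUT * 4  # assumed whole-clip budget


async def collect_frames(
    it: AsyncIterator[bytes],
    ttff: float = _SYNTH_TTFF_TIMEOUT,
    total: float = _SYNTH_TOTAL_TIMEOUT,
) -> list[bytes]:
    """Collect all frames; raise asyncio.TimeoutError if either budget is exceeded."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total
    # The first frame is bounded by the tighter TTFF cap.
    frames = [await asyncio.wait_for(it.__anext__(), timeout=ttff)]
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise asyncio.TimeoutError("synthesis exceeded total budget")
        try:
            # Later frames share whatever is left of the total budget.
            frames.append(await asyncio.wait_for(it.__anext__(), timeout=remaining))
        except StopAsyncIteration:
            return frames


async def fast_stream() -> AsyncIterator[bytes]:
    # Fake TTS stream that yields three frames immediately.
    for i in range(3):
        yield bytes([i])


async def slow_tail_stream() -> AsyncIterator[bytes]:
    # Fake TTS stream: quick first frame, then a stall long enough to blow the budget.
    yield b"\x00"
    await asyncio.sleep(1.0)
    yield b"\x01"
```

The caller can treat the timeout exactly like a TTFF miss and drop the clip silently.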
B. _clip_key(source) for AsyncIterator sources uses id(source) and the iterator is one-shot:
def _clip_key(source: str | AudioSource) -> str:
    ...
    return f"iter:{id(source)}"

If _render raises before populating self._cache[key] (the broad except Exception swallows it), the iterator is consumed but no cached frames exist. The next opportunity for the same clip key re-enters _render_and_play, but _decode_source(source) returns the same already-consumed iterator. Result: empty frames list, silent failure on every subsequent attempt.
For str (TTS text) or BuiltinAudioClip sources this isn't an issue, since re-rendering or re-decoding from a file works. Only the raw AsyncIterator case has the one-shot consumption hazard. Two options:
- Materialize the iterator into a list eagerly inside _decode_source (defeats some of the streaming benefit but avoids the silent-failure case)
- Detect the failed-render case explicitly and refuse to register the same iterator key (force the caller to provide a factory function rather than a raw iterator)
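For the factory variant, a minimal sketch follows. FrameFactory, ClipCache and flaky_factory are illustrative names, not anything in the PR, and frames are modelled as plain bytes. The point is only that each render attempt starts from a fresh stream and the cache is written after a complete render.

```python
import asyncio
from typing import AsyncIterator, Callable

# Hypothetical type: a zero-argument callable that returns a fresh frame stream per call.
FrameFactory = Callable[[], AsyncIterator[bytes]]


class ClipCache:
    """Illustrative render-then-cache helper; not the PR's _BackchannelEmitter."""

    def __init__(self) -> None:
        self._cache: dict[str, list[bytes]] = {}

    async def get(self, key: str, factory: FrameFactory) -> list[bytes]:
        if key in self._cache:
            return self._cache[key]
        # Each attempt pulls from a new iterator, so a failed render can be retried.
        frames = [frame async for frame in factory()]
        self._cache[key] = frames  # only reached after a complete render
        return frames


def flaky_factory() -> FrameFactory:
    # Simulated source whose first stream fails midway and whose later streams succeed.
    attempts = 0

    async def stream() -> AsyncIterator[bytes]:
        nonlocal attempts
        attempts += 1
        yield b"a"
        if attempts == 1:
            raise RuntimeError("simulated decode failure")
        yield b"b"

    return stream


async def demo() -> list[bytes]:
    cache = ClipCache()
    factory = flaky_factory()
    try:
        await cache.get("clip", factory)
    except RuntimeError:
        pass  # first render failed; nothing was cached
    return await cache.get("clip", factory)  # retry works from a fresh stream
```

With a raw one-shot iterator the retry would see an exhausted stream instead, which is the silent-failure case described above.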
C. Cloud-turn-detector-only gate with only a warning log:
if not isinstance(self._turn_detection, inference.TurnDetector):
    logger.warning(
        "backchannel is enabled but the active turn detector does not provide a "
        "backchannel signal (requires the LiveKit cloud turn detector); disabling it"
    )
    return None

A user who passes backchannel=True and never sees a backchannel fire (because they're on the local mini detector) will likely scroll past a WARNING log line. The contract that backchannels require the cloud detector is significant enough to surface in:
- The BackchannelOptions TypedDict docstring
- A note in the ExpressiveOptions["backchannel"] documentation (when those docs land; see item E)
- Possibly upgrade the log to ERROR for visibility, since the feature is silently disabled
Quality-of-merge
D. Empty PR body. The architectural decisions here (EOT-band gating, render-then-cache, TTFF cap, frequency pre-gate) all warrant 2-3 sentences in the PR description so a maintainer reading the diff doesn't have to reverse-engineer the design from the inline comments. Most of the design rationale is already written excellently as docstrings in backchannel.py; just hoisting the key paragraphs into the PR body would be enough.
E. No public docs entry. BackchannelOptions (TypedDict) and BackchannelConfig (dataclass) are exposed via livekit/agents/__init__.py and voice/__init__.py; both are public surface that users will configure. Worth a docs page or at least an expressive config section update. (Mentioning because the in-tree docstrings are good; they just need to be discoverable.)
Acknowledging what's good
A few things that are worth not regressing during iteration:
- The two-tier default source list (safe sounds near EOT threshold, risky words only at low EOT) shows real understanding of the backchannel UX failure modes
- _HOTNESS_DECAY = 3 with _cooldown returning 1.0 - heat is a clean implementation of recency-suppression without needing time tracking
- The render-once-then-cache design is the right shape; pre-warming at activity init would be marginally better but adds complexity that's hard to justify before observing the failure mode in production
- add_to_chat_ctx=False correctly excludes backchannels from conversation history
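For readers who haven't opened backchannel.py, this is roughly the kind of clock-free recency suppression I mean. It is a sketch under my own assumptions: only the _HOTNESS_DECAY = 3 constant and the "1.0 - heat" shape come from the diff, while the ClipHeat class, the per-opportunity tick, and the linear decay step are guesses and may not match the PR's actual bookkeeping.

```python
_HOTNESS_DECAY = 3  # value from the diff; how it is applied below is assumed


class ClipHeat:
    """Counter-based recency suppression: no clocks, only opportunity counts."""

    def __init__(self) -> None:
        # Opportunities left until each clip is fully cool again.
        self._remaining: dict[str, int] = {}

    def mark_played(self, key: str) -> None:
        self._remaining[key] = _HOTNESS_DECAY  # clip is now fully "hot"

    def tick(self) -> None:
        # Called once per backchannel opportunity; every clip cools by one step.
        for key, left in self._remaining.items():
            self._remaining[key] = max(0, left - 1)

    def cooldown(self, key: str) -> float:
        heat = self._remaining.get(key, 0) / _HOTNESS_DECAY
        return 1.0 - heat  # 0.0 right after playing, 1.0 once fully cooled
```

The cooldown value can then scale a clip's selection weight, so a clip that was just played is effectively excluded for the next few opportunities without any timestamp handling.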
Decision
Requesting changes, but only because of item 1 (the agent-speech overlap question). The other items are worth-flagging or polish. If activity.say already gates against active speech, item 1 reduces to a documentation ask; if it doesn't, the gate needs adding.
Happy to re-approve once item 1 is confirmed/resolved.
No description provided.