feat(backchannel): add agent backchannel by chenghao-mou · Pull Request #6130 · livekit/agents

chenghao-mou · 2026-06-16T21:04:30Z

No description provided.

tinalenguyen · 2026-06-17T00:08:44Z

    tts_instructions_template: Instructions | str
    tts_instructions_append: str
    audio_recognition_instructions_template: Instructions | str
+    backchannel: NotGivenOr[bool | list[str | AudioSource | BackchannelConfig] | BackchannelOptions]


i'm wondering if we could simplify these types somehow

claude suggested:

    # backchannel.py
    + BackchannelSource: TypeAlias = "str | AudioSource | BackchannelConfig"

      class BackchannelOptions(TypedDict, total=False):
          frequency: float
    -     source: NotGivenOr[list[str | AudioSource | BackchannelConfig]]
    +     source: list[BackchannelSource]

    - DEFAULT_BACKCHANNEL_SOURCE: list[str | AudioSource | BackchannelConfig] = [...]
    + DEFAULT_BACKCHANNEL_SOURCE: list[BackchannelSource] = [...]

      def resolve_backchannel_options(
    -     backchannel: NotGivenOr[bool | list[str | AudioSource | BackchannelConfig] | BackchannelOptions],
    +     backchannel: NotGivenOr[bool | list[BackchannelSource] | BackchannelOptions],
      ) -> BackchannelOptions | None: ...

    - def _as_config(entry: str | AudioSource | BackchannelConfig) -> BackchannelConfig: ...
    + def _as_config(entry: BackchannelSource) -> BackchannelConfig: ...

    # agent_session.py
      class ExpressiveOptions(TypedDict, total=False):
          ...
    -     backchannel: NotGivenOr[bool | list[str | AudioSource | BackchannelConfig] | BackchannelOptions]
    +     backchannel: bool | list[BackchannelSource] | BackchannelOptions

tsushanth

Thoughtful design. The EOT-band gating (safe sounds near threshold, risky lexical words only at low EOT fractions), the per-clip cooldown via _HOTNESS_DECAY, and the explicit TTFF cap on the first render all show genuine engineering taste about the latency/relevance tradeoff that's specific to backchannels — "a late one is worse than none" is exactly the right framing. The 300ms first-frame budget with silent-drop fallback is good defensive design.

One concrete architectural concern, plus a few worth-flagging items.

Must-confirm

1. activity.say(...) while the agent may already be mid-utterance.

on_agent_backchannel_opportunity fires (per audio_recognition.py:1511) when the cloud turn detector emits a backchannel signal — which is gated by the user's speech state. But the agent's speech state isn't checked anywhere in _BackchannelEmitter. The edge case I'm worried about:

Agent finishes saying a long response, the activity is still draining the last few audio frames
User starts a new utterance overlapping the tail of the agent's audio
Turn detector emits a backchannel opportunity ("user has just started talking")
maybe_emit → _play → activity.say(transcript, audio=_iter_frames(frames), allow_interruptions=False)
The backchannel audio now overlaps the agent's own still-draining response

activity.say may already gate this — I don't see the implementation in the diff, but if it queues backchannel speech behind active speech, the backchannel arrives late (and the in-code comment "a late one is worse than none" applies — better to drop than queue). If activity.say plays immediately on a separate channel, you get overlap of the agent's own audio.

Two clean resolutions, either works:

(a) Gate maybe_emit on not activity.is_speaking() (or whatever the AgentActivity equivalent is) before adding to the pool. Backchannels are dropped silently when the agent is mid-utterance.
(b) Document the contract on activity.say and either rely on it (with a comment in _play referencing the assumed behavior) or pick (a).

Worth confirming the current behavior with a test that has the agent mid-say() when a backchannel opportunity fires — currently test_backchannel.py doesn't appear to exercise this overlap case based on a quick scan.
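A minimal sketch of resolution (a) and of the overlap test suggested above. `FakeActivity`, `GatedEmitter`, and the `is_speaking()` method name are hypothetical stand-ins for illustration; the real `AgentActivity` and `_BackchannelEmitter` interfaces are not visible in this diff and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeActivity:
    """Hypothetical stand-in for AgentActivity; only what the gate needs."""

    speaking: bool = False
    played: list[str] = field(default_factory=list)

    def is_speaking(self) -> bool:
        # Assumed accessor name; the real equivalent is not shown in the diff.
        return self.speaking

    def say(self, text: str) -> None:
        self.played.append(text)


class GatedEmitter:
    """Drops a backchannel instead of queueing it while the agent is speaking."""

    def __init__(self, activity: FakeActivity) -> None:
        self._activity = activity

    def maybe_emit(self, clip: str) -> bool:
        if self._activity.is_speaking():
            # A late backchannel is worse than none, so drop silently.
            return False
        self._activity.say(clip)
        return True
```

A test for the overlap case would set the activity to speaking, fire an opportunity, and assert that nothing was played.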

Worth discussing

A. _synthesize enforces TTFF only on the first frame:

first = await asyncio.wait_for(it.__anext__(), timeout=_SYNTH_TTFF_TIMEOUT)
frames.append(first.frame)
async for ev in it:  # no timeout on subsequent frames
    frames.append(ev.frame)

The first-frame cap matches the "late one is worse than none" design — good. But once the first frame arrives, the remaining frames have no time budget. A TTS that's fast on TTFF but slow on subsequent frames could hang the _render task. For typical backchannel clips (sub-second utterances) this is unlikely to be a real issue, but a total-time budget (e.g. _SYNTH_TTFF_TIMEOUT * 4 for the full clip) would be a cheap defensive guard.
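One way the total-time budget could look, as a sketch only. `collect_frames` and both constants are hypothetical names, frames are modelled as plain `bytes` rather than the synthesis events in the PR, and the `* 4` multiplier is just the example figure from this comment.

```python
import asyncio
from collections.abc import AsyncIterator

TTFF_TIMEOUT = 0.3                # first-frame budget (300 ms, per the review)
TOTAL_TIMEOUT = TTFF_TIMEOUT * 4  # whole-clip budget; multiplier is illustrative


async def collect_frames(
    stream: AsyncIterator[bytes],
    ttff: float = TTFF_TIMEOUT,
    total: float = TOTAL_TIMEOUT,
) -> list[bytes]:
    """Collect all frames, enforcing a first-frame cap and an overall deadline."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total
    it = stream.__aiter__()

    # First frame: strict time-to-first-frame cap, as in the original snippet.
    frames = [await asyncio.wait_for(it.__anext__(), timeout=ttff)]

    # Remaining frames: each wait is bounded by what is left of the total budget.
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise asyncio.TimeoutError
        try:
            frames.append(await asyncio.wait_for(it.__anext__(), timeout=remaining))
        except StopAsyncIteration:
            break
    return frames
```

The caller can treat `asyncio.TimeoutError` from either stage the same way: drop the clip rather than play it late.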

B. _clip_key(source) for AsyncIterator sources uses id(source) and the iterator is one-shot:

def _clip_key(source: str | AudioSource) -> str:
    ...
    return f"iter:{id(source)}"

If _render raises before populating self._cache[key] (the broad except Exception swallows it), the iterator is consumed but no cached frames exist. The next opportunity for the same clip key re-enters _render_and_play, but _decode_source(source) returns the same already-consumed iterator. Result: empty frames list, silent failure on every subsequent attempt.

For str (TTS text) or BuiltinAudioClip sources this isn't an issue — re-rendering / re-decoding from a file works. Only the raw AsyncIterator case has the one-shot consumption hazard. Two options:

Materialize the iterator into a list eagerly inside _decode_source (defeats some of the streaming benefit but avoids the silent-failure case)
Detect the failed-render case explicitly and refuse to register the same iterator key (force the caller to provide a factory function rather than a raw iterator)
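A sketch of the factory-function variant, under the assumption that callers can supply a zero-argument callable that returns a fresh iterator each time. `ClipCache` and `FrameFactory` are hypothetical names, not part of the PR, and frames are modelled as `bytes`.

```python
import asyncio
from collections.abc import AsyncIterator, Callable

# Hypothetical type: a callable that yields a *fresh* frame iterator per call.
FrameFactory = Callable[[], AsyncIterator[bytes]]


class ClipCache:
    """Caches fully rendered clips; a failed render leaves nothing half-consumed."""

    def __init__(self) -> None:
        self._cache: dict[str, list[bytes]] = {}

    async def get(self, key: str, factory: FrameFactory) -> list[bytes]:
        if key in self._cache:
            return self._cache[key]
        # Materialize eagerly from a new iterator, so a retry starts from scratch.
        frames = [frame async for frame in factory()]
        if not frames:
            raise ValueError(f"clip {key!r} produced no frames")
        # Only cache after a complete, non-empty render.
        self._cache[key] = frames
        return frames
```

Because the cache is populated only after a complete render, an exception on the first attempt does not poison later attempts for the same key.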

C. Cloud-turn-detector-only gate with only a warning log:

if not isinstance(self._turn_detection, inference.TurnDetector):
    logger.warning(
        "backchannel is enabled but the active turn detector does not provide a "
        "backchannel signal (requires the LiveKit cloud turn detector); disabling it"
    )
    return None

A user who passes backchannel=True and never sees a backchannel fire (because they're on the local mini detector) will likely scroll past a WARNING log line. The contract that backchannels require the cloud detector is significant enough to surface in:

The BackchannelOptions TypedDict docstring
A note in the ExpressiveOptions["backchannel"] documentation (when those docs land — see item E)
Possibly upgrade the log to ERROR for visibility, since the feature is silently disabled

Quality-of-merge

D. Empty PR body. The architectural decisions here (EOT-band gating, render-then-cache, TTFF cap, frequency pre-gate) all warrant 2-3 sentences in the PR description so a maintainer reading the diff doesn't have to reverse-engineer the design from the inline comments. Most of the design rationale is already written excellently as docstrings in backchannel.py — just hoisting the key paragraphs into the PR body would be enough.

E. No public docs entry. BackchannelOptions (TypedDict) and BackchannelConfig (dataclass) are exposed via livekit/agents/__init__.py and voice/__init__.py — both are public surface that users will configure. Worth a docs page or at least an expressive config section update. (Mentioning because the in-tree docstrings are good; just need to be discoverable.)

Acknowledging what's good

A few things that are worth not regressing during iteration:

The two-tier default source list (safe sounds near EOT threshold, risky words only at low EOT) shows real understanding of the backchannel UX failure modes
_HOTNESS_DECAY = 3 with _cooldown returning 1.0 - heat is a clean implementation of recency-suppression without needing time tracking
The render-once-then-cache design is the right shape — pre-warming at activity init would be marginally better but adds complexity that's hard to justify before observing the failure mode in production
add_to_chat_ctx=False correctly excludes backchannels from conversation history
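For readers unfamiliar with the recency-suppression idea, here is a sketch of how a tick-based cooldown can work without wall-clock time. The PR's actual `_cooldown` implementation is not shown in this thread; the linear decay, the `ClipHeat` class, and its method names are assumptions, and only the decay constant of 3 and the `1.0 - heat` return shape come from the review.

```python
from __future__ import annotations

HOTNESS_DECAY = 3  # opportunities until a played clip is fully "cool" again


class ClipHeat:
    """Tracks per-clip recency in opportunity ticks rather than seconds."""

    def __init__(self, decay: int = HOTNESS_DECAY) -> None:
        self._decay = decay
        self._tick = 0
        self._last_played: dict[str, int] = {}

    def advance(self) -> None:
        """Call once per backchannel opportunity."""
        self._tick += 1

    def mark_played(self, key: str) -> None:
        self._last_played[key] = self._tick

    def cooldown(self, key: str) -> float:
        """Selection weight in [0, 1]: 0 right after playing, 1 once cooled."""
        if key not in self._last_played:
            return 1.0
        age = self._tick - self._last_played[key]
        # Assumed linear decay of heat over `decay` opportunities.
        heat = max(0.0, 1.0 - age / self._decay)
        return 1.0 - heat
```

Multiplying each clip's selection weight by `cooldown(key)` makes an immediate repeat impossible and a near repeat unlikely.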

Decision

Requesting changes — but only because of item 1 (the agent-speech overlap question). The other items are worth-flagging or polish. If activity.say already gates against active speech, item 1 reduces to a documentation ask; if it doesn't, the gate needs adding.

Happy to re-approve once item 1 is confirmed/resolved.

feat(backchannel): add agent backchannel

02ee73c

tinalenguyen reviewed Jun 17, 2026

tinalenguyen added 4 commits June 16, 2026 20:29

Merge branch 'tina/expressive-mode' into chenghao/feat/add-agent-backchannel-expressive-mode

00e0da2

add backchannel to examples

f805d43

add hotel agent and update backchannels for drive thru

84038b9

Merge branch 'tina/expressive-mode' into chenghao/feat/add-agent-backchannel-expressive-mode (conflicts: examples/hotel_receptionist/agent.py)

257d2ab

tsushanth suggested changes Jun 17, 2026

chenghao-mou wants to merge 5 commits into tina/expressive-mode from chenghao/feat/add-agent-backchannel-expressive-mode

3 participants
