
feat(backchannel): add agent backchannel #6130

Draft
chenghao-mou wants to merge 5 commits into tina/expressive-mode from chenghao/feat/add-agent-backchannel-expressive-mode

Conversation

@chenghao-mou

Member

No description provided.

tts_instructions_template: Instructions | str
tts_instructions_append: str
audio_recognition_instructions_template: Instructions | str
backchannel: NotGivenOr[bool | list[str | AudioSource | BackchannelConfig] | BackchannelOptions]

Member

i'm wondering if we could simplify these types somehow

Member

claude suggested:

  # backchannel.py
+ BackchannelSource: TypeAlias = "str | AudioSource | BackchannelConfig"

  class BackchannelOptions(TypedDict, total=False):
      frequency: float
-     source: NotGivenOr[list[str | AudioSource | BackchannelConfig]]
+     source: list[BackchannelSource]

- DEFAULT_BACKCHANNEL_SOURCE: list[str | AudioSource | BackchannelConfig] = [...]
+ DEFAULT_BACKCHANNEL_SOURCE: list[BackchannelSource] = [...]

  def resolve_backchannel_options(
-     backchannel: NotGivenOr[bool | list[str | AudioSource | BackchannelConfig] | BackchannelOptions],
+     backchannel: NotGivenOr[bool | list[BackchannelSource] | BackchannelOptions],
  ) -> BackchannelOptions | None: ...

- def _as_config(entry: str | AudioSource | BackchannelConfig) -> BackchannelConfig: ...
+ def _as_config(entry: BackchannelSource) -> BackchannelConfig: ...

  # agent_session.py
  class ExpressiveOptions(TypedDict, total=False):
      ...
-     backchannel: NotGivenOr[bool | list[str | AudioSource | BackchannelConfig] | BackchannelOptions]
+     backchannel: bool | list[BackchannelSource] | BackchannelOptions

Member Author

good point

@tsushanth left a comment

Thoughtful design. The EOT-band gating (safe sounds near threshold, risky lexical words only at low EOT fractions), the per-clip cooldown via _HOTNESS_DECAY, and the explicit TTFF cap on the first render all show genuine engineering taste about the latency/relevance tradeoff that's specific to backchannels: "a late one is worse than none" is exactly the right framing. The 300ms first-frame budget with silent-drop fallback is good defensive design.

One concrete architectural concern, plus a few worth-flagging items.

Must-confirm

1. activity.say(...) while the agent may already be mid-utterance.

on_agent_backchannel_opportunity fires (per audio_recognition.py:1511) when the cloud turn detector emits a backchannel signal, which is gated by the user's speech state. But the agent's speech state isn't checked anywhere in _BackchannelEmitter. The edge case I'm worried about:

  • Agent finishes saying a long response, the activity is still draining the last few audio frames
  • User starts a new utterance overlapping the tail of the agent's audio
  • Turn detector emits a backchannel opportunity ("user has just started talking")
  • maybe_emit → _play → activity.say(transcript, audio=_iter_frames(frames), allow_interruptions=False)
  • The backchannel audio now overlaps the agent's own still-draining response

activity.say may already gate this. I don't see the implementation in the diff, but if it queues backchannel speech behind active speech, the backchannel arrives late (and the in-code comment "a late one is worse than none" applies: better to drop than queue). If activity.say plays immediately on a separate channel, the backchannel overlaps the agent's own audio.

Two clean resolutions, either works:

  • (a) Gate maybe_emit on not activity.is_speaking() (or whatever the AgentActivity equivalent is) before adding to the pool. Backchannels are dropped silently when the agent is mid-utterance.
  • (b) Document the contract on activity.say and either rely on it (with a comment in _play referencing the assumed behavior) or pick (a).

Worth confirming the current behavior with a test that has the agent mid-say() when a backchannel opportunity fires. Based on a quick scan, test_backchannel.py doesn't currently appear to exercise this overlap case.
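A minimal sketch of option (a), to make the drop-instead-of-queue behavior concrete. FakeActivity, is_speaking(), and GatedEmitter are hypothetical stand-ins written for this comment; the real AgentActivity API and the _BackchannelEmitter internals are not visible in the diff, so treat the names and signatures as assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeActivity:
    """Hypothetical stand-in for AgentActivity; only the speaking flag matters here."""

    speaking: bool = False
    played: list[str] = field(default_factory=list)

    def is_speaking(self) -> bool:
        # Assumed accessor; the real equivalent may be named differently.
        return self.speaking

    def say(self, text: str) -> None:
        self.played.append(text)


class GatedEmitter:
    """Drops a backchannel instead of queueing it while the agent is speaking."""

    def __init__(self, activity: FakeActivity, clips: list[str]) -> None:
        self._activity = activity
        self._clips = clips

    def maybe_emit(self) -> bool:
        # Check the agent's own speech state before anything else.
        if self._activity.is_speaking():
            return False  # silent drop: a late backchannel is worse than none
        # Clip selection is simplified to "first clip" for this sketch.
        self._activity.say(self._clips[0])
        return True
```

The same shape doubles as the overlap test: set the activity to speaking, fire an opportunity, and assert that nothing was played.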

Worth discussing

A. _synthesize enforces TTFF only on the first frame:

first = await asyncio.wait_for(it.__anext__(), timeout=_SYNTH_TTFF_TIMEOUT)
frames.append(first.frame)
async for ev in it:  # no timeout on subsequent frames
    frames.append(ev.frame)

The first-frame cap matches the "late one is worse than none" design, which is good. But once the first frame arrives, the remaining frames have no time budget. A TTS that's fast on TTFF but slow on subsequent frames could hang the _render task. For typical backchannel clips (sub-second utterances) this is unlikely to be a real issue, but a total-time budget (e.g. _SYNTH_TTFF_TIMEOUT * 4 for the full clip) would be a cheap defensive guard.
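One way the total budget could look, as a sketch only. collect_frames, TTFF_TIMEOUT, and TOTAL_TIMEOUT are illustrative names, and the stream here yields raw bytes rather than the TTS events used in _synthesize; the 4x multiplier is just the example figure from this comment, not a tuned value.

```python
import asyncio
from collections.abc import AsyncIterator

TTFF_TIMEOUT = 0.3  # first-frame budget (the 300 ms mentioned above)
TOTAL_TIMEOUT = TTFF_TIMEOUT * 4  # whole-clip budget; multiplier is illustrative


async def collect_frames(
    stream: AsyncIterator[bytes],
    ttff: float = TTFF_TIMEOUT,
    total: float = TOTAL_TIMEOUT,
) -> list[bytes]:
    """Collect all frames, bounding both the first frame and the whole clip."""
    it = stream.__aiter__()
    frames: list[bytes] = []

    async def _drain() -> None:
        # Inner cap: the first frame must arrive within the TTFF budget.
        frames.append(await asyncio.wait_for(it.__anext__(), timeout=ttff))
        # Remaining frames have no per-frame cap; the outer wait_for bounds them.
        async for frame in it:
            frames.append(frame)

    # Outer cap: the entire clip, first frame included, must finish in time.
    await asyncio.wait_for(_drain(), timeout=total)
    return frames
```

Either timeout surfaces as asyncio.TimeoutError, so the caller can keep a single drop-silently path for both cases.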

B. _clip_key(source) for AsyncIterator sources uses id(source) and the iterator is one-shot:

def _clip_key(source: str | AudioSource) -> str:
    ...
    return f"iter:{id(source)}"

If _render raises before populating self._cache[key] (the broad except Exception swallows it), the iterator is consumed but no cached frames exist. The next opportunity for the same clip key re-enters _render_and_play, but _decode_source(source) returns the same already-consumed iterator. Result: empty frames list, silent failure on every subsequent attempt.

For str (TTS text) or BuiltinAudioClip sources this isn't an issue, since re-rendering / re-decoding from a file works. Only the raw AsyncIterator case has the one-shot consumption hazard. Two options:

  • Materialize the iterator into a list eagerly inside _decode_source (defeats some of the streaming benefit but avoids the silent-failure case)
  • Detect the failed-render case explicitly and refuse to register the same iterator key (force the caller to provide a factory function rather than a raw iterator)
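A rough sketch of the factory variant, combined with materializing before caching. ClipCache, FrameFactory, and frames_for are hypothetical names for illustration and are not taken from the PR; frames are plain bytes here instead of audio frames.

```python
import asyncio
from collections.abc import AsyncIterator, Callable

# Hypothetical type: a callable that returns a fresh iterator on every call.
FrameFactory = Callable[[], AsyncIterator[bytes]]


class ClipCache:
    """Caches fully decoded clips; a failed decode leaves no cache entry behind."""

    def __init__(self) -> None:
        self._cache: dict[str, list[bytes]] = {}

    async def frames_for(self, key: str, factory: FrameFactory) -> list[bytes]:
        if key in self._cache:
            return self._cache[key]
        # A fresh iterator per attempt, so a previous failure cannot leave
        # behind a half-consumed source.
        frames = [frame async for frame in factory()]
        # Register the clip only after the whole iterator was consumed.
        self._cache[key] = frames
        return frames
```

Because the key maps to a factory rather than to one iterator object, a retry after a failed render starts from the beginning of the clip instead of from an exhausted iterator.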

C. Cloud-turn-detector-only gate with only a warning log:

if not isinstance(self._turn_detection, inference.TurnDetector):
    logger.warning(
        "backchannel is enabled but the active turn detector does not provide a "
        "backchannel signal (requires the LiveKit cloud turn detector); disabling it"
    )
    return None

A user who passes backchannel=True and never sees a backchannel fire (because they're on the local mini detector) will likely scroll past a WARNING log line. The contract that backchannels require the cloud detector is significant enough to surface in:

  • The BackchannelOptions TypedDict docstring
  • A note in the ExpressiveOptions["backchannel"] documentation (when those docs land; see item E)
  • Possibly upgrade the log to ERROR for visibility, since the feature is silently disabled

Quality-of-merge

D. Empty PR body. The architectural decisions here (EOT-band gating, render-then-cache, TTFF cap, frequency pre-gate) all warrant 2-3 sentences in the PR description so a maintainer reading the diff doesn't have to reverse-engineer the design from the inline comments. Most of the design rationale is already written excellently as docstrings in backchannel.py, so just hoisting the key paragraphs into the PR body would be enough.

E. No public docs entry. BackchannelOptions (TypedDict) and BackchannelConfig (dataclass) are exposed via livekit/agents/__init__.py and voice/__init__.py β€” both are public surface that users will configure. Worth a docs page or at least an expressive config section update. (Mentioning because the in-tree docstrings are good; just need to be discoverable.)

Acknowledging what's good

A few things that are worth not regressing during iteration:

  • The two-tier default source list (safe sounds near EOT threshold, risky words only at low EOT) shows real understanding of the backchannel UX failure modes
  • _HOTNESS_DECAY = 3 with _cooldown returning 1.0 - heat is a clean implementation of recency-suppression without needing time tracking
  • The render-once-then-cache design is the right shape. Pre-warming at activity init would be marginally better but adds complexity that's hard to justify before observing the failure mode in production
  • add_to_chat_ctx=False correctly excludes backchannels from conversation history

Decision

Requesting changes, but only because of item 1 (the agent-speech overlap question). The other items are either worth flagging or polish. If activity.say already gates against active speech, item 1 reduces to a documentation ask; if it doesn't, the gate needs adding.

Happy to re-approve once item 1 is confirmed/resolved.

3 participants