feat(voice): expose item_id on UserInputTranscribedEvent (closes #6109) #6127
Open
tsushanth wants to merge 1 commit into
…kit#6109)

`AgentSession`'s `user_input_transcribed` event fires once per interim transcript on realtime models (every streamed delta produces a new `is_final=False` event), so a single user utterance produces many events. The internal `llm.InputTranscriptionCompleted` already carries an `item_id` that uniquely identifies the utterance, but when `AgentActivity._on_input_audio_transcription_completed` re-emits upward as `UserInputTranscribedEvent`, the id is dropped:

    self._session._user_input_transcribed(
        UserInputTranscribedEvent(transcript=ev.transcript, is_final=ev.is_final)
    )  # ev.item_id is dropped

Consequence today: consumers that need to react exactly once per utterance (e.g. notify the frontend "user speech received" so it renders a placeholder before the agent responds) must either keep manual `last_item_id` state the event can't actually provide, or bypass the provider-agnostic event entirely and read item_id from the raw provider stream (e.g. `openai_server_event_received`). The raw-event escape hatch isn't portable: Gemini's realtime plugin doesn't emit `openai_server_event_received` at all, so the same consumer code can't work across realtime backends.

This commit adds `item_id: str | None = None` to `UserInputTranscribedEvent` and threads it through the realtime emission site. STT paths leave it at the default `None` because there's no corresponding upstream item id there. Pydantic's optional default keeps the field fully backwards-compatible: existing event subscribers reading `transcript` / `is_final` / `language` / `speaker_id` / `created_at` see no change.

- livekit-agents/livekit/agents/voice/events.py: add `item_id` field with a docstring documenting the realtime-stable / STT-none semantics
- livekit-agents/livekit/agents/voice/agent_activity.py: thread `ev.item_id` through `_on_input_audio_transcription_completed`
- tests/test_user_input_transcribed_event.py: new unit test module pinning (1) the field round-trips on the schema, (2) it defaults to None on omission, (3) it survives `model_dump` (relevant for the cross-process transport in `test_session_host.py`), and (4) the realtime-path data flow can thread `InputTranscriptionCompleted.item_id` through without modification.

All four fail on unpatched source because Pydantic rejects the unknown `item_id` kwarg.
Comment on lines +1737 to 1740

        UserInputTranscribedEvent(
            transcript=ev.transcript, is_final=ev.is_final, item_id=ev.item_id
        )
    )
Contributor
🚩 Remote session transport does not forward item_id
The `_on_user_input_transcribed` handler in livekit-agents/livekit/agents/voice/remote_session.py:466-474 constructs the protobuf `UserInputTranscribed` message with only `transcript` and `is_final`; the new `item_id` field is not forwarded. This means remote session consumers won't receive the `item_id` for dedup purposes. This is not a regression (the protobuf schema would need a separate update), but it limits the utility of the feature for remote/distributed deployments.
Why
Closes #6109.
`AgentSession`'s `user_input_transcribed` event fires once per interim transcript on realtime models: every streamed delta produces a new `is_final=False` event, so a single user utterance produces many events with no stable correlation key. The internal `llm.InputTranscriptionCompleted` already carries an `item_id` that uniquely identifies the utterance, but when `AgentActivity._on_input_audio_transcription_completed` re-emits it upward as `UserInputTranscribedEvent`, the id is dropped on the floor:
```python
# livekit-agents/livekit/agents/voice/agent_activity.py (before)
def _on_input_audio_transcription_completed(self, ev: llm.InputTranscriptionCompleted) -> None:
    self._session._user_input_transcribed(
        UserInputTranscribedEvent(transcript=ev.transcript, is_final=ev.is_final)
    )  # ev.item_id is dropped
```
Consequence: consumers that need per-utterance dedup (e.g. "notify the frontend 'user speech received' so it renders a placeholder exactly once") must either keep manual `last_item_id` state the event can't actually provide, or bypass the provider-agnostic event entirely and read `item_id` from `openai_server_event_received`. That escape hatch isn't portable: Gemini's realtime plugin doesn't emit `openai_server_event_received` at all.
Fix
Add `item_id: str | None = None` to `UserInputTranscribedEvent` and thread it through the realtime emission site:
```python
# livekit-agents/livekit/agents/voice/events.py (added field)
class UserInputTranscribedEvent(BaseModel):
    ...
    item_id: str | None = None
    """Stable id identifying the user utterance this transcript belongs to. On
    realtime models, every interim and final UserInputTranscribedEvent for a
    single utterance shares the same item_id, so consumers can dedup interim
    transcripts and react exactly once per utterance using the provider-agnostic
    event surface. None on STT paths where no upstream item id exists."""
```
```python
# livekit-agents/livekit/agents/voice/agent_activity.py (threaded through)
def _on_input_audio_transcription_completed(self, ev: llm.InputTranscriptionCompleted) -> None:
    self._session._user_input_transcribed(
        UserInputTranscribedEvent(
            transcript=ev.transcript, is_final=ev.is_final, item_id=ev.item_id
        )
    )
```
STT paths (`on_interim_transcript` / `on_final_transcript` in the same file) leave `item_id` at the default `None` because the STT layer has no corresponding upstream id concept. Existing subscribers reading `transcript` / `is_final` / `language` / `speaker_id` / `created_at` see no behavioral change.
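For illustration, here is a minimal sketch of how a consumer might use the new field to react once per utterance. It is not part of this PR: `TranscribedEvent` is a hypothetical dataclass stand-in for `UserInputTranscribedEvent` (only the fields used here), and `UtteranceNotifier` is a hypothetical helper name. The fallback to final transcripts when `item_id` is `None` is one possible policy for STT paths, not something the PR prescribes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass
class TranscribedEvent:
    """Hypothetical stand-in for UserInputTranscribedEvent (subset of fields)."""

    transcript: str
    is_final: bool
    item_id: Optional[str] = None  # None on STT paths, per the PR description


class UtteranceNotifier:
    """Hypothetical helper: calls `notify` at most once per utterance."""

    def __init__(self, notify: Callable[[TranscribedEvent], None]) -> None:
        self._notify = notify
        self._seen: Set[str] = set()  # item_ids already reacted to

    def on_transcribed(self, ev: TranscribedEvent) -> None:
        if ev.item_id is None:
            # No utterance id available (STT path): assumed fallback is to
            # react only to final transcripts.
            if ev.is_final:
                self._notify(ev)
            return
        if ev.item_id in self._seen:
            # Later interim or final event for an utterance already handled.
            return
        self._seen.add(ev.item_id)
        self._notify(ev)


if __name__ == "__main__":
    notifier = UtteranceNotifier(lambda ev: print("user speech received:", ev.item_id))
    notifier.on_transcribed(TranscribedEvent("he", False, "item_1"))
    notifier.on_transcribed(TranscribedEvent("hello", False, "item_1"))
    notifier.on_transcribed(TranscribedEvent("hello there", True, "item_1"))
```

In a real agent, `on_transcribed` would be registered as the handler for `user_input_transcribed`. The `_seen` set grows without bound in this sketch; a long-lived session would likely want to evict old ids.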
Test
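New `tests/test_user_input_transcribed_event.py` (4 unit tests):
- the `item_id` field round-trips on the schema
- it defaults to `None` when omitted
- it survives `model_dump` (relevant for the cross-process transport in `test_session_host.py`)
- the realtime-path data flow can thread `InputTranscriptionCompleted.item_id` through without modification

All four fail on unpatched source because Pydantic rejects the unknown `item_id` kwarg.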
```
$ uv run pytest tests/test_user_input_transcribed_event.py --unit -v
PASSED tests/test_user_input_transcribed_event.py::test_user_input_transcribed_event_carries_item_id
PASSED tests/test_user_input_transcribed_event.py::test_user_input_transcribed_event_item_id_defaults_to_none
PASSED tests/test_user_input_transcribed_event.py::test_user_input_transcribed_event_serialises_item_id
PASSED tests/test_user_input_transcribed_event.py::test_input_transcription_completed_item_id_can_thread_to_event
4 passed in 0.05s
```
Backwards compatibility
Strictly additive change. `item_id` is optional and defaults to `None`, so: