feat(voice): expose item_id on UserInputTranscribedEvent (closes #6109)#6127

Open
tsushanth wants to merge 1 commit into livekit:main from tsushanth:fix/user-input-transcribed-event-item-id

@tsushanth

Why

Closes #6109.

`AgentSession`'s `user_input_transcribed` event fires once per interim transcript on realtime models: every streamed delta produces a new `is_final=False` event, so a single user utterance produces many events with no stable correlation key. The internal `llm.InputTranscriptionCompleted` already carries an `item_id` that uniquely identifies the utterance, but when `AgentActivity._on_input_audio_transcription_completed` re-emits it upward as `UserInputTranscribedEvent`, the id is dropped on the floor:

```python
# livekit-agents/livekit/agents/voice/agent_activity.py (before)
def _on_input_audio_transcription_completed(self, ev: llm.InputTranscriptionCompleted) -> None:
    self._session._user_input_transcribed(
        UserInputTranscribedEvent(transcript=ev.transcript, is_final=ev.is_final)
    )  # ev.item_id is dropped
```

Consequence: consumers that need per-utterance dedup (e.g. notifying the frontend "user speech received" so it renders a placeholder exactly once) must either maintain `last_item_id` state even though the event never supplies an id to compare against, or bypass the provider-agnostic event entirely and read `item_id` from `openai_server_event_received`. That escape hatch isn't portable: Gemini's realtime plugin doesn't emit `openai_server_event_received` at all.

Fix

Add `item_id: str | None = None` to `UserInputTranscribedEvent` and thread it through the realtime emission site:

```python
# livekit-agents/livekit/agents/voice/events.py (added field)
class UserInputTranscribedEvent(BaseModel):
    ...
    item_id: str | None = None
    """Stable id identifying the user utterance this transcript belongs to. On
    realtime models, every interim and final UserInputTranscribedEvent for a
    single utterance shares the same item_id, so consumers can dedup interim
    transcripts and react exactly once per utterance using the provider-agnostic
    event surface. None on STT paths where no upstream item id exists."""
```

```python
# livekit-agents/livekit/agents/voice/agent_activity.py (threaded through)
def _on_input_audio_transcription_completed(self, ev: llm.InputTranscriptionCompleted) -> None:
    self._session._user_input_transcribed(
        UserInputTranscribedEvent(
            transcript=ev.transcript, is_final=ev.is_final, item_id=ev.item_id
        )
    )
```
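To make the data flow concrete, here is a minimal, self-contained sketch of the re-emission step. It uses stdlib dataclasses as hypothetical stand-ins for `llm.InputTranscriptionCompleted` and the Pydantic `UserInputTranscribedEvent` (only the fields discussed here are modelled), and a plain callback in place of `self._session._user_input_transcribed`; it is an illustration, not the livekit code itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical stand-in for llm.InputTranscriptionCompleted (subset of fields).
@dataclass
class InputTranscriptionCompleted:
    item_id: str
    transcript: str
    is_final: bool


# Hypothetical stand-in for the patched UserInputTranscribedEvent.
@dataclass
class UserInputTranscribedEvent:
    transcript: str
    is_final: bool
    item_id: Optional[str] = None  # left as None on STT paths


def on_input_audio_transcription_completed(
    ev: InputTranscriptionCompleted,
    emit: Callable[[UserInputTranscribedEvent], None],
) -> None:
    # Mirrors the patched handler: copy item_id through instead of dropping it.
    emit(
        UserInputTranscribedEvent(
            transcript=ev.transcript, is_final=ev.is_final, item_id=ev.item_id
        )
    )


emitted: list[UserInputTranscribedEvent] = []
for delta in [
    InputTranscriptionCompleted(item_id="item_1", transcript="hel", is_final=False),
    InputTranscriptionCompleted(item_id="item_1", transcript="hello", is_final=False),
    InputTranscriptionCompleted(item_id="item_1", transcript="hello there", is_final=True),
]:
    on_input_audio_transcription_completed(delta, emitted.append)

# Every interim and final event for the utterance carries the same key.
print({e.item_id for e in emitted})  # {'item_1'}
```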

STT paths (`on_interim_transcript` / `on_final_transcript` in the same file) leave `item_id` at the default `None` because the STT layer has no corresponding upstream id concept. Existing subscribers reading `transcript` / `is_final` / `language` / `speaker_id` / `created_at` see no behavioral change.
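On the consumer side, the new field is what lets a handler react exactly once per utterance. The sketch below is hypothetical: `PlaceholderNotifier`, its `notify` callback, and the `max_tracked` bound are illustrative names, and the event is a minimal stand-in exposing only `transcript`, `is_final`, and `item_id`. When `item_id` is `None` (the STT paths), it assumes utterances don't overlap and treats everything up to the next final transcript as one utterance. In a real agent the handler would presumably be registered with something like `session.on("user_input_transcribed", ...)`.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UserInputTranscribedEvent:  # hypothetical stand-in for the real event
    transcript: str
    is_final: bool
    item_id: Optional[str] = None


class PlaceholderNotifier:
    """Hypothetical consumer: calls `notify` once per user utterance."""

    def __init__(self, notify: Callable[[str], None], max_tracked: int = 256) -> None:
        self._notify = notify
        self._seen: "OrderedDict[str, None]" = OrderedDict()  # insertion-ordered ids
        self._max_tracked = max_tracked
        self._stt_utterance_open = False

    def on_user_input_transcribed(self, ev: UserInputTranscribedEvent) -> None:
        if ev.item_id is not None:
            # Realtime path: the stable item_id is the dedup key.
            if ev.item_id not in self._seen:
                self._seen[ev.item_id] = None
                if len(self._seen) > self._max_tracked:
                    self._seen.popitem(last=False)  # evict the oldest id
                self._notify(ev.item_id)
            return
        # STT path (no item_id): fire on the first event after a final.
        if not self._stt_utterance_open:
            self._stt_utterance_open = True
            self._notify("stt")
        if ev.is_final:
            self._stt_utterance_open = False


calls: list[str] = []
notifier = PlaceholderNotifier(calls.append)
for ev in [
    UserInputTranscribedEvent("hel", False, item_id="item_1"),
    UserInputTranscribedEvent("hello", False, item_id="item_1"),
    UserInputTranscribedEvent("hello", True, item_id="item_1"),
    UserInputTranscribedEvent("yes", True, item_id="item_2"),
    UserInputTranscribedEvent("hi", False),  # STT path: no item_id
    UserInputTranscribedEvent("hi there", True),
]:
    notifier.on_user_input_transcribed(ev)
print(calls)  # ['item_1', 'item_2', 'stt']
```

The bounded `OrderedDict` keeps memory flat in long sessions; the trade-off is that an id evicted from the window would fire again if it ever reappeared.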

Test

New `tests/test_user_input_transcribed_event.py` (4 unit tests):

  • `test_user_input_transcribed_event_carries_item_id`: schema accepts the field
  • `test_user_input_transcribed_event_item_id_defaults_to_none`: backwards-compat for STT paths
  • `test_user_input_transcribed_event_serialises_item_id`: `model_dump` includes the field (relevant for the cross-process host transport already exercised in `test_session_host.py`)
  • `test_input_transcription_completed_item_id_can_thread_to_event`: pins the realtime data flow without instantiating the full `AgentActivity` (Pydantic schema contract is the boundary under test)
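The serialisation test matters because an event that crosses a process boundary is flattened to plain data and rebuilt on the other side. A minimal sketch of that round trip, assuming `dataclasses.asdict` plus `json` as stand-ins for Pydantic's `model_dump` and the actual host transport (neither of which is reproduced here); the event class is again a hypothetical stand-in:

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class UserInputTranscribedEvent:  # hypothetical stand-in for the Pydantic model
    transcript: str
    is_final: bool
    item_id: Optional[str] = None


def round_trip(ev: UserInputTranscribedEvent) -> UserInputTranscribedEvent:
    # Serialise to a JSON payload and rebuild it, as a transport might.
    payload = json.dumps(asdict(ev))
    return UserInputTranscribedEvent(**json.loads(payload))


realtime = round_trip(UserInputTranscribedEvent("hello", False, item_id="item_1"))
stt = round_trip(UserInputTranscribedEvent("hello", True))  # pre-existing call shape
print(realtime.item_id, stt.item_id)  # item_1 None
```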

All four fail on unpatched source because `UserInputTranscribedEvent` has no `item_id` field there.

```
$ uv run pytest tests/test_user_input_transcribed_event.py --unit -v
PASSED tests/test_user_input_transcribed_event.py::test_user_input_transcribed_event_carries_item_id
PASSED tests/test_user_input_transcribed_event.py::test_user_input_transcribed_event_item_id_defaults_to_none
PASSED tests/test_user_input_transcribed_event.py::test_user_input_transcribed_event_serialises_item_id
PASSED tests/test_user_input_transcribed_event.py::test_input_transcription_completed_item_id_can_thread_to_event
4 passed in 0.05s
```

Backwards compatibility

Strictly additive change. `item_id` is optional and defaults to `None`, so:

  • Existing call sites that construct `UserInputTranscribedEvent` without `item_id` continue to work (verified in the test where the STT-style construction without `item_id` round-trips with `item_id=None`)
  • Existing subscribers that don't read `item_id` see no behavioral change
  • The new field is carried on the existing `user_input_transcribed` event, so no new event name needs to be registered downstream

…kit#6109)

`AgentSession`'s `user_input_transcribed` event fires once per interim
transcript on realtime models (every streamed delta produces a new
`is_final=False` event), so a single user utterance produces many
events. The internal `llm.InputTranscriptionCompleted` already carries
an `item_id` that uniquely identifies the utterance, but when
`AgentActivity._on_input_audio_transcription_completed` re-emits
upward as `UserInputTranscribedEvent`, the id is dropped:

    self._session._user_input_transcribed(
        UserInputTranscribedEvent(transcript=ev.transcript, is_final=ev.is_final)
    )  # ev.item_id is dropped

Consequence today: consumers that need to react exactly once per
utterance (e.g. notify the frontend "user speech received" so it
renders a placeholder before the agent responds) must either maintain
`last_item_id` state even though the event never supplies an id, or
bypass the provider-agnostic event entirely and read item_id from
the raw provider stream (e.g. `openai_server_event_received`). The
raw-event escape hatch isn't portable: Gemini's realtime plugin
doesn't emit `openai_server_event_received` at all, so the same
consumer code can't work across realtime backends.

This commit adds `item_id: str | None = None` to
`UserInputTranscribedEvent` and threads it through the realtime
emission site. STT paths leave it at the default `None` because
there's no corresponding upstream item id there. Pydantic's optional
default keeps the field fully backwards-compatible: existing event
subscribers reading `transcript` / `is_final` / `language` /
`speaker_id` / `created_at` see no change.

- livekit-agents/livekit/agents/voice/events.py: add `item_id` field
  with a docstring documenting the realtime-stable / STT-none
  semantics
- livekit-agents/livekit/agents/voice/agent_activity.py: thread
  `ev.item_id` through `_on_input_audio_transcription_completed`
- tests/test_user_input_transcribed_event.py: new unit test module
  pinning (1) the field round-trips on the schema, (2) it defaults
  to None on omission, (3) it survives `model_dump` (relevant for
  the cross-process transport in `test_session_host.py`), and (4)
  the realtime-path data flow can thread `InputTranscriptionCompleted.item_id`
  through without modification. All four fail on unpatched source
  because `UserInputTranscribedEvent` has no `item_id` field there.
@tsushanth tsushanth requested a review from a team as a code owner June 16, 2026 18:36

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

Comment on lines +1737 to 1740
```python
        UserInputTranscribedEvent(
            transcript=ev.transcript, is_final=ev.is_final, item_id=ev.item_id
        )
    )
```

🚩 Remote session transport does not forward item_id

The `_on_user_input_transcribed` handler in `livekit-agents/livekit/agents/voice/remote_session.py:466-474` constructs the protobuf `UserInputTranscribed` message with only `transcript` and `is_final`; the new `item_id` field is not forwarded. This means remote session consumers won't receive the `item_id` for dedup purposes. This is not a regression (the protobuf schema would need a separate update), but it limits the utility of the feature for remote/distributed deployments.
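If the remote schema were later extended, forwarding would need to preserve the `None` case rather than collapsing it to an empty string (in protobuf terms, presumably a proto3 `optional` field so presence can be checked). The sketch below uses plain dicts as a hypothetical stand-in for the `UserInputTranscribed` message; `encode_transcribed` and `decode_transcribed` are illustrative names, and none of this reflects the current schema, which has no `item_id` field.

```python
from typing import Any, Optional


def encode_transcribed(transcript: str, is_final: bool, item_id: Optional[str]) -> dict[str, Any]:
    # Hypothetical wire message: omit item_id entirely when absent, so the
    # receiver can tell "no id" (STT path) apart from an empty-string id.
    msg: dict[str, Any] = {"transcript": transcript, "is_final": is_final}
    if item_id is not None:
        msg["item_id"] = item_id
    return msg


def decode_transcribed(msg: dict[str, Any]) -> tuple[str, bool, Optional[str]]:
    # Senders that never set item_id decode to None via .get().
    return msg["transcript"], msg["is_final"], msg.get("item_id")


print(decode_transcribed(encode_transcribed("hi", True, None)))  # ('hi', True, None)
print(decode_transcribed(encode_transcribed("hi", False, "item_1")))  # ('hi', False, 'item_1')
```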

Development

Successfully merging this pull request may close these issues.

UserInputTranscribedEvent drops item_id, making per-utterance dedup of interim transcripts impossible without provider-specific events