UserInputTranscribedEvent drops item_id, making per-utterance dedup of interim transcripts impossible without provider-specific events #6109

@areebkhan-tech

Bug Description

AgentSession's user_input_transcribed event fires once per interim transcript. For realtime models (OpenAI/xAI), the input-transcription delta handler emits a new input_audio_transcription_completed(is_final=False) for every streamed delta chunk, so a single user utterance produces many user_input_transcribed events with is_final=False.
A common need is to react exactly once per utterance — e.g. notify the frontend "user speech received" so it can render a placeholder before the agent responds. To do that you need a stable key to correlate all the interim events of the same utterance.
That key already exists internally: livekit.agents.llm.InputTranscriptionCompleted carries item_id (and confidence). But when AgentActivity re-emits it upward, the id is discarded:

```python
# livekit/agents/voice/agent_activity.py
def _on_input_audio_transcription_completed(self, ev: llm.InputTranscriptionCompleted) -> None:
    self._session._user_input_transcribed(
        UserInputTranscribedEvent(transcript=ev.transcript, is_final=ev.is_final)
    )  # ev.item_id is dropped
```
UserInputTranscribedEvent itself only has transcript, is_final, speaker_id, language, created_at (livekit/agents/voice/events.py) — no item_id.

Consequence: to dedup per utterance, a consumer must either (a) keep manual last_item_id state that the event can't actually provide, or (b) bypass the provider-agnostic event entirely and subscribe to the raw openai_server_event_received to read item_id from the raw dict. Option (b) is not portable: Gemini's realtime plugin doesn't emit openai_server_event_received at all, so the same code can't work across providers.
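
To make the gap concrete: the dedup a consumer wants is only a few lines once a per-utterance key is available. The following is a minimal sketch, not working code against the current library. `TranscribedEvent` is a hypothetical stand-in for UserInputTranscribedEvent with an `item_id` attribute that the real event does not have today, and `UtteranceNotifier` is an illustrative helper name.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class TranscribedEvent:
    """Hypothetical stand-in for UserInputTranscribedEvent.

    item_id is an ASSUMPTION: the real event does not expose it today.
    """

    transcript: str
    is_final: bool
    item_id: str | None = None


class UtteranceNotifier:
    """Calls on_first exactly once for each distinct item_id it sees."""

    def __init__(self, on_first: Callable[[str], None]) -> None:
        self._on_first = on_first
        self._seen: set[str] = set()

    def handle(self, ev: TranscribedEvent) -> None:
        if ev.item_id is None:
            return  # nothing to correlate on, so do not notify
        if ev.item_id in self._seen:
            return  # a later interim/final transcript of a known utterance
        self._seen.add(ev.item_id)
        self._on_first(ev.item_id)
```

In this sketch the handler passed to session.on("user_input_transcribed", ...) would simply be `notifier.handle`. The `_seen` set grows for the lifetime of the notifier, so a long-running session would probably want to evict an id once its final transcript has arrived.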

### Expected Behavior

UserInputTranscribedEvent should expose the item_id that is already present on llm.InputTranscriptionCompleted, so consumers can correlate all interim/final transcripts of a single utterance and act exactly once per utterance using the provider-agnostic user_input_transcribed event, without dropping down to provider-specific raw events.

Concretely:

- Add item_id: str | None to UserInputTranscribedEvent and pass ev.item_id through in _on_input_audio_transcription_completed.
- (Optional) Include a stable timestamp marking when this item/utterance first started, so consumers can dedup/order without tracking state themselves.

This keeps a single, provider-agnostic subscription point that works uniformly across OpenAI, xAI, and Gemini.
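
The first item could look roughly like the sketch below. It uses self-contained stand-in dataclasses rather than the real livekit classes, lists only a subset of the event's fields, and `to_user_event` is a hypothetical helper that mirrors what _on_input_audio_transcription_completed would construct. Defaulting `item_id` to None is an assumption, chosen so that call sites with no item id to report would not need to change.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class InputTranscriptionCompleted:
    """Stand-in for llm.InputTranscriptionCompleted (fields named in this issue)."""

    item_id: str
    transcript: str
    is_final: bool


@dataclass
class UserInputTranscribedEvent:
    """Stand-in for the public event, with the proposed field added."""

    transcript: str
    is_final: bool
    item_id: str | None = None  # proposed: forwarded from the realtime model
    created_at: float = field(default_factory=time.time)


def to_user_event(ev: InputTranscriptionCompleted) -> UserInputTranscribedEvent:
    """Hypothetical helper: build the public event without dropping item_id."""
    return UserInputTranscribedEvent(
        transcript=ev.transcript,
        is_final=ev.is_final,
        item_id=ev.item_id,
    )
```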

### Reproduction Steps

1. Start an AgentSession with any realtime model (e.g. OpenAI/xAI gpt-realtime).
2. Register a handler with session.on("user_input_transcribed", handler).
3. Speak one sentence.
4. Observe that the handler fires many times with is_final=False during the single utterance, and that the event provides no item_id to group them.

Operating System

macOS

Models Used

Realtime: xAI realtime, OpenAI gpt-realtime, Gemini

Package Versions

livekit-agents 1.5.17
livekit (rtc) 1.1.8
livekit-plugins-openai 1.5.17
Python 3.13, macOS

Session/Room/Call IDs

No response

Proposed Solution

Additional Context

No response

Screenshots and Recordings

No response
