Feature Type
I cannot use LiveKit without it
Feature Description
AgentSession's user_input_transcribed event fires once per interim transcript. With realtime models (OpenAI/xAI), the input-transcription delta handler emits a new input_audio_transcription_completed(is_final=False) for every streamed delta chunk, so a single user utterance produces many user_input_transcribed events with is_final=False.
A common need is to react exactly once per utterance — e.g. send an RPC to the frontend so it can show a "user is being transcribed" placeholder before the agent responds. To do that, a consumer needs a stable key to correlate all interim events belonging to the same utterance.
That key already exists internally: livekit.agents.llm.InputTranscriptionCompleted carries item_id. But it is discarded when the framework re-emits the event upward:
# livekit/agents/voice/agent_activity.py
def _on_input_audio_transcription_completed(self, ev: llm.InputTranscriptionCompleted) -> None:
    self._session._user_input_transcribed(
        UserInputTranscribedEvent(transcript=ev.transcript, is_final=ev.is_final)
    )  # ev.item_id is dropped
UserInputTranscribedEvent (livekit/agents/voice/events.py) only exposes transcript, is_final, speaker_id, language, created_at — no item_id.
Request:
- Add item_id: str | None to UserInputTranscribedEvent and forward ev.item_id in _on_input_audio_transcription_completed.
- Optionally, include a stable timestamp marking when the utterance/item first started, so consumers can dedup/order without holding their own state.
Why: this lets a single, provider-agnostic user_input_transcribed subscription dedup to one action per utterance, instead of forcing consumers to either track last_item_id manually or drop down to a provider-specific raw event.
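The requested change could look roughly like the sketch below. It is not the actual LiveKit code: both dataclasses are simplified stand-ins for the real classes (only the fields relevant here are modeled), and to_user_event is a hypothetical helper that stands in for the body of _on_input_audio_transcription_completed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InputTranscriptionCompleted:
    """Simplified stand-in for livekit.agents.llm.InputTranscriptionCompleted."""

    item_id: str
    transcript: str
    is_final: bool


@dataclass
class UserInputTranscribedEvent:
    """Simplified stand-in for the high-level event, with the proposed field."""

    transcript: str
    is_final: bool
    # Proposed addition: None when the source has no item id to forward.
    item_id: str | None = None


def to_user_event(ev: InputTranscriptionCompleted) -> UserInputTranscribedEvent:
    # Hypothetical helper: forward item_id instead of dropping it on re-emit.
    return UserInputTranscribedEvent(
        transcript=ev.transcript,
        is_final=ev.is_final,
        item_id=ev.item_id,
    )
```

Defaulting item_id to None keeps the field optional, so existing code that constructs the event without it would presumably keep working.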
Workarounds / Alternatives
Two workarounds, both unsatisfying:
- Manual state: keep a last_item_id in the handler closure and skip duplicates. But UserInputTranscribedEvent has no item_id, so this isn't actually possible from the high-level event: the only stable correlation key is unavailable.
- Bypass the abstraction: subscribe to the OpenAI plugin's raw openai_server_event_received, parse the raw dict, and read item_id to dedup:
session.current_agent.realtime_llm_session.on("openai_server_event_received")(handler)
# handler filters event["type"] == "conversation.item.input_audio_transcription.delta"
# and dedups on event["item_id"]
This works for OpenAI/xAI but is NOT portable — the Gemini realtime plugin (livekit-plugins-google) never emits openai_server_event_received, so the same code silently does nothing on Gemini.
Subscribing to the provider-agnostic input_audio_transcription_completed on the realtime session works across providers, but still can't dedup because the re-emitted UserInputTranscribedEvent lacks item_id (the underlying InputTranscriptionCompleted has it). Hence this request.
Additional Context
Versions:
- livekit-agents 1.5.17
- livekit (rtc) 1.1.8
- livekit-plugins-openai 1.5.17
- livekit-plugins-google 1.5.17
- Python 3.13, macOS
Relevant code references:
- UserInputTranscribedEvent fields: livekit/agents/voice/events.py (class UserInputTranscribedEvent)
- item_id dropped on re-emit: livekit/agents/voice/agent_activity.py (_on_input_audio_transcription_completed)
- item_id present at source: livekit/agents/llm/realtime.py (class InputTranscriptionCompleted)
- OpenAI/xAI emit one interim per delta chunk: livekit/plugins/openai/realtime/realtime_model.py (_handle_conversion_item_input_audio_transcription_delta)
- Gemini emits the same normalized event from a different base class: livekit/plugins/google/realtime/realtime_api.py (_handle_server_content)
Confirmed behavior: with the same handler bound to input_audio_transcription_completed, interim events fire repeatedly per utterance on both OpenAI/xAI (token-by-token) and Gemini (whole-sentence in one interim). A passed-through item_id would let consumers collapse these to one action per utterance uniformly.
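For illustration, this is the kind of consumer-side dedup a forwarded item_id would enable. It is a minimal sketch under assumptions: make_once_per_utterance is a hypothetical helper, not part of LiveKit, and the handler takes the item id directly rather than a real event object so the example runs without the framework installed.

```python
from __future__ import annotations

from typing import Callable


def make_once_per_utterance(
    action: Callable[[str], None],
) -> Callable[[str | None], None]:
    """Return a handler that runs `action` once per distinct item_id."""
    seen: set[str] = set()

    def handler(item_id: str | None) -> None:
        # Without an item_id there is nothing to correlate on, so skip.
        if item_id is None or item_id in seen:
            return
        seen.add(item_id)
        action(item_id)

    return handler


fired: list[str] = []
on_interim = make_once_per_utterance(fired.append)

# One utterance streamed as several interims, then one sent as a single interim.
for item_id in ["item_a", "item_a", "item_a", "item_b"]:
    on_interim(item_id)
```

Here fired ends up holding one entry per utterance regardless of how many interims each produced. The seen set grows for the lifetime of the handler, so a long-running session would likely want to prune it, for example when the final transcript for an item arrives.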