Expose item_id (and speech-start timestamp) on UserInputTranscribedEvent for per-utterance dedup #6110

@areebkhan-tech

Description

Feature Type

I cannot use LiveKit without it

Feature Description

AgentSession's user_input_transcribed event fires once per interim transcript. With realtime models (OpenAI/xAI), the input-transcription delta handler emits a new input_audio_transcription_completed(is_final=False) for every streamed delta chunk, so a single user utterance produces many user_input_transcribed events with is_final=False.

A common need is to react exactly once per utterance — e.g. send an RPC to the frontend so it can show a "user is being transcribed" placeholder before the agent responds. To do that, a consumer needs a stable key to correlate all interim events belonging to the same utterance.

That key already exists internally: livekit.agents.llm.InputTranscriptionCompleted carries item_id. But it is discarded when the framework re-emits the event upward:

# livekit/agents/voice/agent_activity.py
def _on_input_audio_transcription_completed(self, ev: llm.InputTranscriptionCompleted) -> None:
    self._session._user_input_transcribed(
        UserInputTranscribedEvent(transcript=ev.transcript, is_final=ev.is_final)
    )  # ev.item_id is dropped

UserInputTranscribedEvent (livekit/agents/voice/events.py) only exposes transcript, is_final, speaker_id, language, created_at — no item_id.

Request:

  • Add item_id: str | None to UserInputTranscribedEvent and forward ev.item_id in _on_input_audio_transcription_completed.
  • Optionally, include a stable timestamp marking when the utterance/item first started, so consumers can dedup/order without holding their own state.
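A minimal sketch of what the two requested changes could look like. The classes below are self-contained stand-ins, not the real livekit-agents types; the started_at field name and the Forwarder helper are hypothetical and only illustrate forwarding item_id plus a per-item first-seen timestamp.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class InputTranscriptionCompleted:
    """Stand-in for llm.InputTranscriptionCompleted (only the fields used here)."""

    item_id: str
    transcript: str
    is_final: bool


@dataclass
class UserInputTranscribedEvent:
    """Stand-in for the public event, with the two proposed fields added."""

    transcript: str
    is_final: bool
    item_id: str | None = None       # proposed field
    started_at: float | None = None  # proposed field; the name is an assumption
    created_at: float = field(default_factory=time.time)


class Forwarder:
    """Hypothetical replacement for the re-emit step that keeps item_id."""

    def __init__(self) -> None:
        # item_id -> time the first interim for that item was seen
        self._first_seen: dict[str, float] = {}

    def forward(self, ev: InputTranscriptionCompleted) -> UserInputTranscribedEvent:
        # The first event for an item records the timestamp; later ones reuse it.
        started_at = self._first_seen.setdefault(ev.item_id, time.time())
        out = UserInputTranscribedEvent(
            transcript=ev.transcript,
            is_final=ev.is_final,
            item_id=ev.item_id,
            started_at=started_at,
        )
        if ev.is_final:
            # The utterance is complete, so its bookkeeping entry can go.
            self._first_seen.pop(ev.item_id, None)
        return out
```

Here the timestamp is the time the first interim was observed, which approximates but is not the same as the actual start of speech; a real implementation might have a better source for it.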

Why: this lets a single, provider-agnostic user_input_transcribed subscription dedup to one action per utterance, instead of forcing consumers to either track last_item_id manually or drop down to a provider-specific raw event.
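For illustration, this is roughly the consumer-side code the request would enable. It is a sketch under the assumption that the event carries item_id; TranscribedEvent is a stand-in for the extended UserInputTranscribedEvent, and once_per_utterance is a hypothetical helper name.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class TranscribedEvent:
    """Stand-in for UserInputTranscribedEvent with the proposed item_id."""

    transcript: str
    is_final: bool
    item_id: str | None = None


def once_per_utterance(
    action: Callable[[TranscribedEvent], None],
) -> Callable[[TranscribedEvent], None]:
    """Wrap `action` so it runs only on the first event of each item_id."""
    last_item_id: str | None = None

    def handler(ev: TranscribedEvent) -> None:
        nonlocal last_item_id
        # Without a key there is nothing to dedup on; this sketch skips such events.
        if ev.item_id is None or ev.item_id == last_item_id:
            return
        last_item_id = ev.item_id
        action(ev)

    return handler
```

The wrapped handler could then be registered once on the session's user_input_transcribed event, with action being whatever sends the placeholder RPC to the frontend, and the same code would apply to every provider.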

Workarounds / Alternatives

Two workarounds, both unsatisfying:

  1. Manual state: keep a last_item_id in the handler closure and skip duplicates. This is not actually possible from the high-level event, because UserInputTranscribedEvent has no item_id, so the only stable correlation key is unavailable.

  2. Bypass the abstraction: subscribe to the OpenAI plugin's raw openai_server_event_received, parse the raw dict, and read item_id to dedup:

    session.current_agent.realtime_llm_session.on("openai_server_event_received")(handler)
    # handler filters event["type"] == "conversation.item.input_audio_transcription.delta"
    # and dedups on event["item_id"]
    

    This works for OpenAI/xAI but is NOT portable — the Gemini realtime plugin (livekit-plugins-google) never emits openai_server_event_received, so the same code silently does nothing on Gemini.
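For reference, the handler in workaround 2 looks roughly like the sketch below. It assumes the raw event arrives as a plain dict with "type" and "item_id" keys, as described above; make_raw_dedup_handler and on_new_utterance are hypothetical names.

```python
from __future__ import annotations

from typing import Any, Callable

DELTA_TYPE = "conversation.item.input_audio_transcription.delta"


def make_raw_dedup_handler(
    on_new_utterance: Callable[[str], None],
) -> Callable[[dict[str, Any]], None]:
    """Build a raw-event handler that fires once per distinct item_id."""
    # Grows for the lifetime of the handler; a long-running session
    # would want to evict old ids.
    seen: set[str] = set()

    def handler(event: dict[str, Any]) -> None:
        # Ignore every raw server event except transcription deltas.
        if event.get("type") != DELTA_TYPE:
            return
        item_id = event.get("item_id")
        if not item_id or item_id in seen:
            return
        seen.add(item_id)
        on_new_utterance(item_id)

    return handler
```

Everything in it is tied to the OpenAI event schema, which is the portability problem with this workaround.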

Subscribing to the provider-agnostic input_audio_transcription_completed on the realtime session works across providers, but still can't dedup because the re-emitted UserInputTranscribedEvent lacks item_id (the underlying InputTranscriptionCompleted has it). Hence this request.

Additional Context

Versions:

  • livekit-agents 1.5.17
  • livekit (rtc) 1.1.8
  • livekit-plugins-openai 1.5.17
  • livekit-plugins-google 1.5.17
  • Python 3.13, macOS

Relevant code references:

  • UserInputTranscribedEvent fields: livekit/agents/voice/events.py (class UserInputTranscribedEvent)
  • item_id dropped on re-emit: livekit/agents/voice/agent_activity.py (_on_input_audio_transcription_completed)
  • item_id present at source: livekit/agents/llm/realtime.py (class InputTranscriptionCompleted)
  • OpenAI/xAI emit one interim per delta chunk: livekit/plugins/openai/realtime/realtime_model.py (_handle_conversion_item_input_audio_transcription_delta)
  • Gemini emits the same normalized event from a different base class: livekit/plugins/google/realtime/realtime_api.py (_handle_server_content)

Confirmed behavior: with the same handler bound to input_audio_transcription_completed, interim events fire repeatedly per utterance on both OpenAI/xAI (token-by-token) and Gemini (whole-sentence in one interim). A passed-through item_id would let consumers collapse these to one action per utterance uniformly.

Labels: enhancement (New feature or request)
