Expose item_id (and speech-start timestamp) on UserInputTranscribedEvent for per-utterance dedup

### Feature Type

I cannot use LiveKit without it

### Feature Description

`AgentSession`'s `user_input_transcribed` event fires once per interim transcript. With realtime models (OpenAI/xAI), the input-transcription delta handler emits a new `input_audio_transcription_completed(is_final=False)` for every streamed delta chunk, so a single user utterance produces many `user_input_transcribed` events with `is_final=False`.

A common need is to react exactly once per utterance — e.g. send an RPC to the frontend so it can show a "user is being transcribed" placeholder before the agent responds. To do that, a consumer needs a stable key to correlate all interim events belonging to the same utterance.

That key already exists internally: `livekit.agents.llm.InputTranscriptionCompleted` carries `item_id`. But it is discarded when the framework re-emits the event upward:

    # livekit/agents/voice/agent_activity.py
    def _on_input_audio_transcription_completed(self, ev: llm.InputTranscriptionCompleted) -> None:
        self._session._user_input_transcribed(
            UserInputTranscribedEvent(transcript=ev.transcript, is_final=ev.is_final)
        )  # ev.item_id is dropped

`UserInputTranscribedEvent` (livekit/agents/voice/events.py) only exposes `transcript`, `is_final`, `speaker_id`, `language`, `created_at` — no `item_id`.

Request:
- Add `item_id: str | None` to `UserInputTranscribedEvent` and forward `ev.item_id` in `_on_input_audio_transcription_completed`.
- Optionally, include a stable timestamp marking when the utterance/item first started, so consumers can dedup/order without holding their own state.
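
A minimal sketch of what the two bullets above could look like, using stand-in dataclasses so it runs without LiveKit installed. The names `speech_started_at` and `Forwarder.to_user_event` are hypothetical and chosen only for illustration; the real `UserInputTranscribedEvent` also carries `speaker_id`, `language`, and `created_at`, which are omitted here.

```python
from __future__ import annotations

import time
from dataclasses import dataclass


@dataclass
class InputTranscriptionCompleted:
    """Stand-in for livekit.agents.llm.InputTranscriptionCompleted."""
    item_id: str
    transcript: str
    is_final: bool


@dataclass
class UserInputTranscribedEvent:
    """Stand-in for the session-level event with the proposed fields added."""
    transcript: str
    is_final: bool
    item_id: str | None = None               # proposed: forwarded from the realtime event
    speech_started_at: float | None = None   # proposed, optional: hypothetical field name


class Forwarder:
    """Hypothetical stand-in for the re-emit step in agent_activity.py."""

    def __init__(self) -> None:
        # item_id -> time the first event for that item was seen
        self._first_seen: dict[str, float] = {}

    def to_user_event(self, ev: InputTranscriptionCompleted) -> UserInputTranscribedEvent:
        # setdefault stores the timestamp only on the first event for this item
        started = self._first_seen.setdefault(ev.item_id, time.time())
        if ev.is_final:
            # the utterance is complete, so its bookkeeping entry can be dropped
            self._first_seen.pop(ev.item_id, None)
        return UserInputTranscribedEvent(
            transcript=ev.transcript,
            is_final=ev.is_final,
            item_id=ev.item_id,
            speech_started_at=started,
        )
```

Every event for the same item then shares one `item_id` and one start timestamp, so consumers would not need to keep their own first-seen state.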

Why: this lets a single, provider-agnostic `user_input_transcribed` subscription dedup to one action per utterance, instead of forcing consumers to either track `last_item_id` manually or drop down to a provider-specific raw event.
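
For illustration, the consumer side could then be as small as the sketch below. It assumes the proposed `item_id` field exists on the event; `OncePerUtterance` is a hypothetical helper, and the events are duck-typed with `SimpleNamespace` so the snippet runs on its own.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Callable


class OncePerUtterance:
    """Call `action` the first time each item_id is seen; ignore later events for it."""

    def __init__(self, action: Callable[[str], None]) -> None:
        self._action = action
        self._active: set[str] = set()

    def __call__(self, ev) -> None:
        item_id = getattr(ev, "item_id", None)  # assumes the proposed field
        if item_id is None:
            return  # no correlation key available, so dedup is not possible
        if item_id not in self._active:
            self._active.add(item_id)
            self._action(item_id)
        if ev.is_final:
            self._active.discard(item_id)  # utterance finished; stop tracking it


# Simulate three events for one utterance: two interims and a final.
calls: list[str] = []
handler = OncePerUtterance(calls.append)
for text, final in [("he", False), ("hello", False), ("hello there", True)]:
    handler(SimpleNamespace(item_id="item_1", transcript=text, is_final=final))
# calls == ["item_1"]
```

In a real agent the same callable would presumably be registered with `session.on("user_input_transcribed", handler)`, with `action` sending the RPC to the frontend.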

### Workarounds / Alternatives

Two workarounds, both unsatisfying:

1. Manual state: keep a `last_item_id` in the handler closure and skip events whose `item_id` matches it. This is not actually possible from the high-level event, because `UserInputTranscribedEvent` has no `item_id`; the only stable correlation key is unavailable.

2. Bypass the abstraction: subscribe to the OpenAI plugin's raw `openai_server_event_received`, parse the raw dict, and read `item_id` to dedup:

       session.current_agent.realtime_llm_session.on("openai_server_event_received")(handler)
       # handler filters event["type"] == "conversation.item.input_audio_transcription.delta"
       # and dedups on event["item_id"]

   This works for OpenAI/xAI but is NOT portable — the Gemini realtime plugin (livekit-plugins-google) never emits `openai_server_event_received`, so the same code silently does nothing on Gemini.

The provider-agnostic `input_audio_transcription_completed` event on the realtime session is emitted across providers, and its `InputTranscriptionCompleted` payload carries `item_id`. That key is lost when the event is re-emitted as `UserInputTranscribedEvent`, so a `user_input_transcribed` subscriber still cannot dedup. Hence this request.

### Additional Context

Versions:
- livekit-agents 1.5.17
- livekit (rtc) 1.1.8
- livekit-plugins-openai 1.5.17
- livekit-plugins-google 1.5.17
- Python 3.13, macOS

Relevant code references:
- UserInputTranscribedEvent fields: livekit/agents/voice/events.py (class UserInputTranscribedEvent)
- item_id dropped on re-emit: livekit/agents/voice/agent_activity.py (_on_input_audio_transcription_completed)
- item_id present at source: livekit/agents/llm/realtime.py (class InputTranscriptionCompleted)
- OpenAI/xAI emit one interim per delta chunk: livekit/plugins/openai/realtime/realtime_model.py (_handle_conversion_item_input_audio_transcription_delta)
- Gemini emits the same normalized event from a different base class: livekit/plugins/google/realtime/realtime_api.py (_handle_server_content)

Confirmed behavior: with the same handler bound to `input_audio_transcription_completed`, interim events fire repeatedly per utterance on both OpenAI/xAI (token-by-token) and Gemini (whole-sentence in one interim). A passed-through `item_id` would let consumers collapse these to one action per utterance uniformly.
