AssemblyAI supports four paths for building voice agents:
- Speech-to-Speech API — single WebSocket for full voice agent (speech-in → LLM → speech-out)
- LiveKit Agents — fastest path to deployment using U3 Pro STT
- Pipecat (by Daily) — open-source, maximum customizability using U3 Pro STT
- Direct WebSocket — fully custom STT builds (see streaming.md)
AssemblyAI's Speech-to-Speech API is a single WebSocket that handles the full voice agent loop: speech-in → LLM → speech-out. It includes built-in VAD, TTS, tool calling, and barge-in handling.
Note: Requires a credit card on file to activate.
```
wss://speech-to-speech.us.assemblyai.com/v1/realtime
Authorization: Bearer YOUR_API_KEY
```
Audio format: PCM16, 24kHz mono, base64-encoded, ~50ms chunks (2400 bytes).
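As an illustration only (not official AssemblyAI client code), a minimal Python sketch of streaming audio over this WebSocket with the websockets package; the endpoint, auth header, chunk size, and input.audio event come from the spec above, while get_pcm_chunk is a hypothetical helper for your audio source:

```python
# Sketch only. Assumes a PCM16, 24kHz mono audio source; get_pcm_chunk is a
# hypothetical helper that returns raw bytes (empty bytes when the stream ends).
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

URL = "wss://speech-to-speech.us.assemblyai.com/v1/realtime"

async def stream_audio(get_pcm_chunk):
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}
    # "additional_headers" on recent websockets versions; "extra_headers" on older ones.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        while True:
            chunk = get_pcm_chunk(2400)  # 2400 bytes = ~50ms of PCM16 at 24kHz
            if not chunk:
                break
            await ws.send(json.dumps({
                "type": "input.audio",
                "audio": base64.b64encode(chunk).decode("ascii"),
            }))
            await asyncio.sleep(0.05)  # pace roughly in real time
```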
Client → server events:

| Event | Description |
|---|---|
| input.audio | Send an audio chunk: {"type": "input.audio", "audio": "<base64>"} |
| session.update | Configure the session: system_prompt, greeting, tools, turn_detection |
| session.resume | Reconnect to an existing session: {"type": "session.resume", "session_id": "..."} |
| tool.result | Return a tool call result back to the agent |
Server → client events:

| Event | Description |
|---|---|
| session.ready | Session is initialized and ready |
| session.updated | Session configuration has been updated |
| input.speech.started | VAD detected speech start (for barge-in) |
| input.speech.stopped | VAD detected speech end |
| transcript.user.delta | Partial user transcript |
| transcript.user | Final user transcript |
| reply.started | Agent is starting a reply |
| reply.audio | Agent audio chunk (base64 PCM16 24kHz) |
| transcript.agent | Agent's reply text |
| reply.done | Agent reply complete |
| tool.call | Agent wants to call a tool |
| error | Non-fatal error |
| session.error | Fatal session error |
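The payload of tool.call is not documented in this section; purely as an illustration, a handler for the tool round-trip could look like the sketch below, where the id, name, and arguments fields (and the shape of result) are assumptions rather than confirmed API fields:

```python
# Hypothetical sketch of a tool.call -> tool.result round-trip. The payload
# field names (id, name, arguments) and the result format are assumptions;
# check the API reference for the actual schema.
import json

async def handle_event(ws, event: dict) -> None:
    if event.get("type") == "tool.call":
        if event.get("name") == "lookup_order":              # assumed field
            args = event.get("arguments", {})                # assumed field
            result = {"status": "shipped", "eta": "2 days"}  # fake lookup result
            await ws.send(json.dumps({
                "type": "tool.result",
                "id": event.get("id"),                       # assumed field
                "result": json.dumps(result),
            }))
```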
Sessions are preserved for 30 seconds after disconnection. Reconnect using session.resume with the session ID to continue without losing context.
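A minimal resume sketch, assuming the session ID was captured earlier in the conversation (how it is delivered to the client is not shown in this section); the session.resume message itself matches the event table above:

```python
# Sketch: resume within the 30-second window. Assumes "ws" is a freshly opened
# WebSocket to the same endpoint and session_id was stored earlier.
import json

async def resume_session(ws, session_id: str) -> None:
    await ws.send(json.dumps({"type": "session.resume", "session_id": session_id}))
    # Then continue streaming input.audio; prior conversation context is preserved.
```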
Note: S2S session.update wraps all config under a "session" key. Tool definitions use a flat format (not the nested function object used by the LLM Gateway).
```json
{
  "type": "session.update",
  "session": {
    "system_prompt": "You are a helpful customer support agent for Acme Corp.",
    "greeting": "Hello! How can I help you today?",
    "tools": [
      {
        "type": "function",
        "name": "lookup_order",
        "description": "Look up an order by order ID",
        "parameters": {
          "type": "object",
          "properties": {
            "order_id": {"type": "string"}
          },
          "required": ["order_id"]
        }
      }
    ]
  }
}
```

u3-rt-pro (Universal-3 Pro Streaming) is the recommended model for all new voice agent work.
| Feature | u3-rt-pro | universal-streaming-english | universal-streaming-multilingual |
|---|---|---|---|
| Turn detection | Punctuation-based | Confidence-based | Confidence-based |
| Custom prompting (beta) | Yes | No | No |
| Keyterms boosting | Yes | Yes | Yes |
| Speaker diarization | Yes | Yes | Yes |
| Dynamic mid-session updates | Yes | Yes | Yes |
| Multilingual code switching | Yes | No | Yes |
| Languages | 6 (en, es, fr, de, it, pt) | English only | Multiple |
end_of_turn_confidence_threshold does NOT work with u3-rt-pro — it only applies to older universal-streaming models.
How u3-rt-pro turn detection works:
- User pauses for min_turn_silence (e.g., 100ms)
- Model checks for terminal punctuation (. ? !)
- If found: the turn ends immediately (end_of_turn: true)
- If not found: a partial is emitted and listening continues
- If silence reaches max_turn_silence: the turn is forced to end regardless
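Purely as an illustration of the decision flow above (not AssemblyAI's actual implementation), the logic amounts to something like:

```python
# Illustrative pseudologic for punctuation-based endpointing; a sketch of the
# flow described above, not AssemblyAI's implementation.
def should_end_turn(transcript: str, silence_ms: int,
                    min_turn_silence: int = 100,
                    max_turn_silence: int = 1000) -> bool:
    if silence_ms < min_turn_silence:
        return False                       # still within the minimum pause
    if transcript.rstrip().endswith((".", "?", "!")):
        return True                        # terminal punctuation: end the turn now
    return silence_ms >= max_turn_silence  # otherwise wait for the hard ceiling
```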
| Profile | min_turn_silence | max_turn_silence | Use Case |
|---|---|---|---|
| Fast | 100ms | 800ms | IVR, yes/no, quick confirmations |
| Balanced | 100ms | 1000ms | Most voice agents (recommended default) |
| Patient | 200ms | 2000ms | Entity dictation: emails, phone numbers, addresses |
Low min_turn_silence can split entities (phone numbers, emails) across turns. Dynamically increase max_turn_silence to 2000-3000ms during entity collection phases, then reduce it afterward.
```bash
# For u3-rt-pro (requires livekit-agents >= 1.4.4)
pip install "livekit-agents[assemblyai,silero,codecs]~=1.0" python-dotenv livekit-plugins-turn-detector~=1.0
```

Required env vars: ASSEMBLYAI_API_KEY, LIVEKIT_URL, LIVEKIT_API_KEY, LIVEKIT_API_SECRET, plus LLM/TTS provider keys.
```python
from dotenv import load_dotenv

from livekit import agents
from livekit.agents import AgentSession, Agent
from livekit.plugins import assemblyai, silero

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=assemblyai.STT(
            model="u3-rt-pro",
            min_turn_silence=100,
            max_turn_silence=1000,
            vad_threshold=0.3,
        ),
        vad=silero.VAD.load(activation_threshold=0.3),
        turn_detection="stt",
        min_endpointing_delay=0,  # CRITICAL: avoid additive 500ms delay
        # Add llm=... and tts=... for your chosen providers (keys in .env).
    )

    await session.start(room=ctx.room, agent=Assistant())
    await session.generate_reply(instructions="Greet the user and offer your assistance.")


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

Run with python voice_agent.py dev, then test at https://agents-playground.livekit.io/.
To use LiveKit's turn-detector plugin for end-of-turn detection instead of STT-based endpointing:

```python
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    turn_detection=MultilingualModel(),
    stt=assemblyai.STT(model="u3-rt-pro", vad_threshold=0.3),
    vad=silero.VAD.load(activation_threshold=0.3),
    min_endpointing_delay=0.5,
    max_endpointing_delay=3.0,
)
```

Other modes: VAD-only (purely silence-based) and Manual (explicit session.commit_user_turn(), session.clear_user_turn(), session.interrupt()).
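A minimal sketch of Manual mode, assuming turn_detection="manual" is the corresponding AgentSession setting (verify against the LiveKit Agents docs) and using the control methods named above:

```python
# Sketch of manual turn control. turn_detection="manual" is assumed to be the
# matching AgentSession value -- check the LiveKit docs before relying on it.
session = AgentSession(
    stt=assemblyai.STT(model="u3-rt-pro"),
    vad=silero.VAD.load(activation_threshold=0.3),
    turn_detection="manual",
)

# Your application decides when a turn is complete, e.g. on push-to-talk release:
async def on_push_to_talk_released():
    session.commit_user_turn()  # end the user's turn and trigger a reply

async def on_cancel_pressed():
    session.clear_user_turn()   # discard buffered user input
    session.interrupt()         # stop any in-progress agent reply
```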
| Pitfall | Fix |
|---|---|
| max_turn_silence defaults to 100ms in LiveKit (API default is 1000ms) | Always set max_turn_silence=1000 explicitly |
| min_endpointing_delay adds 500ms on top of AssemblyAI endpointing | Set min_endpointing_delay=0 in STT mode |
| Silero VAD default threshold is 0.5, AssemblyAI default is 0.3 | Set both to 0.3 — mismatch creates a dead zone delaying interruption |
| u3-rt-pro requires livekit-agents >= 1.4.4 | Check version before debugging |
pip install "pipecat-ai[assemblyai,openai,cartesia]"
# or swap providers:
pip install "pipecat-ai[assemblyai,anthropic,elevenlabs]"from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.assemblyai.config import AssemblyAIConnectionParams
stt = AssemblyAISTTService(
api_key=os.getenv("ASSEMBLYAI_API_KEY"),
connection_params=AssemblyAIConnectionParams(
speech_model="u3-rt-pro",
min_turn_silence=100,
),
vad_force_turn_endpoint=True, # Default — Pipecat controls turns
)In Pipecat mode, VAD + Smart Turn analyzer controls endpointing. max_turn_silence auto-syncs with min_turn_silence. A ForceEndpoint message is sent to AssemblyAI when silence is detected.
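For context, a sketch of how this STT service might be wired into a full Pipecat pipeline. The transport, LLM, and TTS choices shown (Daily, OpenAI, Cartesia) are illustrative, and module paths can vary between pipecat-ai versions; treat this as a wiring diagram rather than copy-paste code:

```python
# Sketch only: provider choices and module paths vary across pipecat-ai versions.
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

# Transport with local VAD so Pipecat can drive turn endpointing
# (matches vad_force_turn_endpoint=True on the STT service above).
transport = DailyTransport(
    os.getenv("DAILY_ROOM_URL"),
    None,
    "Voice Agent",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)

llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
tts = CartesiaTTSService(api_key=os.getenv("CARTESIA_API_KEY"), voice_id="your-voice-id")

context = OpenAILLMContext(
    messages=[{"role": "system", "content": "You are a helpful voice agent."}]
)
context_aggregator = llm.create_context_aggregator(context)

pipeline = Pipeline([
    transport.input(),               # caller audio in
    stt,                             # AssemblyAI STT service configured above
    context_aggregator.user(),       # add user transcripts to the LLM context
    llm,                             # generate the agent's reply
    tts,                             # synthesize the reply
    transport.output(),              # agent audio out
    context_aggregator.assistant(),  # record the reply in the context
])

task = PipelineTask(pipeline)
# PipelineRunner().run(task) is awaited inside your async main().
```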
To let AssemblyAI's punctuation-based endpointing control turns instead:

```python
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
    ),
    vad_force_turn_endpoint=False,  # AssemblyAI controls turns
)
```

Keyterms boosting for names and domain terms:

```python
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        min_turn_silence=100,
        keyterms_prompt=["Xiomara", "Saoirse", "Pipecat", "AssemblyAI"],
    ),
)
```

Dynamic mid-session updates:

```python
from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.services.assemblyai.stt import AssemblyAISTTSettings

await task.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTSettings(
            connection_params=AssemblyAIConnectionParams(
                keyterms_prompt=["NewName", "NewCompany"],
                min_turn_silence=200,
                max_turn_silence=3000,
            )
        )
    )
)
```

Speaker diarization:

```python
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        speaker_labels=True,
    ),
    speaker_format="<Speaker {speaker}>{text}</Speaker {speaker}>",
)
```

keyterms_prompt and prompt cannot be used simultaneously — choose one.
Monitor SpeechStarted events from AssemblyAI:
{"type": "SpeechStarted", "timestamp": 14400, "confidence": 0.79}On detection: stop TTS playback immediately, switch to listening mode, wait for full turn before responding.
Both frameworks support updating parameters mid-session without reconnecting:
| Stage | Configuration |
|---|---|
| Caller identification | Boost specific names via keyterms_prompt |
| Entity dictation (email, phone) | Increase max_turn_silence to 3000ms |
| Yes/no questions | Use prompt anticipating short responses |
| Payment collection | Boost card brand terms + extend silence |
- Set min_endpointing_delay=0 in LiveKit STT mode — default 500ms is additive
- Use 16kHz sample rate — higher rates don't improve accuracy
- Synchronize VAD thresholds — set both local VAD and AssemblyAI vad_threshold to 0.3
- Avoid audio preprocessing/noise cancellation before sending to AssemblyAI — artifacts cause more harm than background noise
- Only enable features you need — skip speaker_labels unless required
- Use dynamic configuration to adjust silence only when needed
| Component | Latency |
|---|---|
| Network transmission | ~50ms |
| Speech-to-text processing | 200-300ms (sub-300ms P50) |
| min_turn_silence check | 100ms+ (configurable) |
| max_turn_silence fallback | 1000ms+ (only if no terminal punctuation) |
SIP trunking routes phone calls into LiveKit rooms. Configure inbound/outbound trunks and dispatch rules.
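As one possible shape only (the field names and CLI command are assumptions and should be verified against LiveKit's SIP documentation), a dispatch rule that places each inbound call into its own room might look like:

```json
{
  "name": "inbound-calls",
  "rule": {
    "dispatchRuleIndividual": {
      "roomPrefix": "call-"
    }
  }
}
```

Applied with something like lk sip dispatch create dispatch-rule.json; exact command naming varies by livekit-cli version.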
WebSocket media streaming with TeXML. Critical: Telnyx uses 8kHz audio, not 16kHz:
```python
transport = TelnyxTransport(
    # ...
    audio_in_sample_rate=8000,
    audio_out_sample_rate=8000,
)
```

Rate limits:
- Free tier: 5 new streams/minute
- Pay-as-you-go: 100 new streams/minute
- No hard cap on concurrent streams
- Automatic 10% capacity increase every 60 seconds at 70%+ utilization
Accuracy optimizations:
- Keyterms prompting (highest impact) — up to 100 terms, max 50 chars each
- Dynamic configuration updates — contextual adaptation per conversation stage
- Silence threshold tuning — entity preservation
- Avoid preprocessing noise cancellation — artifacts hurt more than noise