Developer Voice
The voice subsystem adds speech-to-text, text-to-speech, microphone recording, and voice settings on top of the existing agent panel.
Voice backend code lives under src-tauri/src/voice/.
| File | Responsibility |
|---|---|
| `mod.rs` | Module exports and subsystem overview. |
| `commands.rs` | Tauri command surface for recording, transcription, settings, voices, and preview. |
| `recorder.rs` | `cpal` microphone capture into temporary WAV files. |
| `settings.rs` | Persistent voice settings stored inside `agent_provider_settings.json`. |
| `stt.rs` | OpenAI/OpenRouter audio transcription HTTP client. |
| `tts.rs` | OpenAI speech HTTP client returning MP3 bytes. |
| `voices.rs` | Static TTS voice catalog with gender hints. |
src-tauri/src/lib.rs registers VoiceRecorderState as managed state and registers every voice command in generate_handler![].
Voice frontend code lives in:
- `src/workbench/agent_panel/voice_orb/`: microphone orb, recording state machine, playback, and push-to-talk hotkey.
- `src/workbench/harness_voice_pane/mod.rs`: voice settings UI.
- `src/tauri_bridge.rs`: voice settings structs and typed command wrappers.
- `src/agent_wire.rs`: `voice_input` on `UserTurn` and `voice_ready` on `AgentEvent`.
Voice settings are stored as a voice sub-object in the same JSON envelope as agent provider settings:
<app-config>/agent_provider_settings.json
The voice settings module deserializes the file as serde_json::Value, updates only the voice object, and round-trips the other settings untouched.
Default settings:
- STT provider: OpenAI.
- STT model: `gpt-4o-mini-transcribe`.
- STT sample rate: `16000` Hz.
- TTS provider: OpenAI.
- TTS model: `gpt-4o-mini-tts`.
- TTS voice: `nova`.
- TTS enabled: `true`.
- Post-STT flow: `AutoSend`.
- STT language: `FollowApp`.
- PTT hotkey: Space.
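As a rough illustration, these defaults could be expressed as a `Default` implementation. The struct, enum, and field names below are hypothetical; only the values come from the list above, and the real types in `settings.rs` may be shaped differently.

```rust
// Hypothetical mirror of the documented defaults. Names are illustrative
// assumptions; only the default values are taken from the documentation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    OpenAi,
    OpenRouter,
}

// Only the documented default variant is shown; other variants are omitted.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PostSttFlow {
    AutoSend,
}

// Only the documented default variant is shown; other variants are omitted.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SttLanguage {
    FollowApp,
}

#[derive(Debug, Clone, PartialEq)]
pub struct VoiceSettings {
    pub stt_provider: Provider,
    pub stt_model: String,
    pub stt_sample_rate_hz: u32,
    pub tts_provider: Provider,
    pub tts_model: String,
    pub tts_voice: String,
    pub tts_enabled: bool,
    pub post_stt_flow: PostSttFlow,
    pub stt_language: SttLanguage,
    pub ptt_hotkey: String,
}

impl Default for VoiceSettings {
    fn default() -> Self {
        Self {
            stt_provider: Provider::OpenAi,
            stt_model: "gpt-4o-mini-transcribe".to_string(),
            stt_sample_rate_hz: 16_000,
            tts_provider: Provider::OpenAi,
            tts_model: "gpt-4o-mini-tts".to_string(),
            tts_voice: "nova".to_string(),
            tts_enabled: true,
            post_stt_flow: PostSttFlow::AutoSend,
            stt_language: SttLanguage::FollowApp,
            ptt_hotkey: "Space".to_string(),
        }
    }
}
```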
Voice provider keys piggyback on agent_settings::provider_key_pub, so OpenAI voice uses the OpenAI keyring entry and OpenRouter STT uses the OpenRouter keyring entry.
- The user activates the voice orb or presses the configured push-to-talk hotkey.
- The frontend calls `voice_start_recording(sampleRateHz)`.
- The backend creates a UUID turn ID and starts `cpal` capture from the default input device.
- Audio is downmixed to mono and resampled to the configured target rate.
- Samples are written as 16-bit PCM WAV under `<app-cache>/voice/`.
- The frontend stops recording with `voice_stop_and_transcribe(turnId, localeHint)`.
- The backend finalizes the WAV, sends it to STT, deletes the WAV, and returns transcript text.
Cancelling calls voice_cancel_recording(turnId), stops the worker, and removes the temporary WAV.
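The downmix, resample, and 16-bit conversion steps can be sketched with plain functions. This is a minimal sketch, not the code in `recorder.rs`: the function names are hypothetical, and linear interpolation is an assumed resampling method chosen for brevity.

```rust
/// Average each interleaved frame down to a single mono sample.
pub fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    if channels <= 1 {
        return interleaved.to_vec();
    }
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resample mono audio using linear interpolation (assumed method).
pub fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == 0 || to_hz == 0 {
        return Vec::new();
    }
    if from_hz == to_hz {
        return input.to_vec();
    }
    let out_len = ((input.len() as u64 * to_hz as u64) / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            // Position of this output sample on the input timeline.
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

/// Convert a float sample to 16-bit PCM, clamping out-of-range input.
pub fn to_pcm16(sample: f32) -> i16 {
    (sample.clamp(-1.0, 1.0) * i16::MAX as f32).round() as i16
}
```

Linear interpolation applies no low-pass filter, so downsampling this way can alias; a production recorder may prefer a filtered resampler.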
stt::transcribe_wav posts multipart form data to:
- OpenAI: `https://api.openai.com/v1/audio/transcriptions`
- OpenRouter: `https://openrouter.ai/api/v1/audio/transcriptions`
The request includes:
- `model`.
- `file` as `audio/wav`.
- `response_format=text`.
- Optional `language` reduced to a primary ISO-639-1 code.
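Endpoint selection and the language reduction might look like the sketch below. The enum and function names are hypothetical, and returning `None` for hints without a two-letter primary subtag is an assumption about how unusable hints are handled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SttProvider {
    OpenAi,
    OpenRouter,
}

/// Map a provider to the transcription endpoint listed in this document.
pub fn transcription_endpoint(provider: SttProvider) -> &'static str {
    match provider {
        SttProvider::OpenAi => "https://api.openai.com/v1/audio/transcriptions",
        SttProvider::OpenRouter => "https://openrouter.ai/api/v1/audio/transcriptions",
    }
}

/// Reduce a locale hint such as "en-US" or "pt_BR" to its primary
/// two-letter subtag, lowercased. Returns None when the primary subtag
/// is not exactly two ASCII letters (assumed behavior).
pub fn primary_language(locale_hint: &str) -> Option<String> {
    let primary = locale_hint
        .trim()
        .split(|c: char| c == '-' || c == '_')
        .next()?;
    if primary.len() == 2 && primary.chars().all(|c| c.is_ascii_alphabetic()) {
        Some(primary.to_ascii_lowercase())
    } else {
        None
    }
}
```

With this shape, the optional `language` form field is simply omitted whenever `primary_language` returns `None`.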
Responses are parsed as JSON first for compatibility with providers that still return { "text": "..." }, then as raw text.
TTS runs only for turns that originated from voice input. The agent panel marks the next UserTurn with voice_input=true after a successful STT transcript.
After the model turn finishes, session_orchestrator::maybe_emit_tts:
- Loads voice settings.
- Skips work if TTS is disabled.
- Reads the final assistant text from conversation state.
- Resolves the TTS provider key.
- Calls `tts::synthesize`.
- Pushes `AgentEvent::VoiceReady { audio_b64, mime }`.
The frontend converts the base64 MP3 into a Blob URL and plays it through an <audio> element.
TTS currently supports OpenAI only. If another provider is selected, tts::synthesize returns an unsupported-provider error.
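The gating rules described above (voice-originated turns only, TTS enabled, OpenAI only) can be condensed into one decision function. This is an illustrative sketch: `plan_tts`, `TtsPlan`, `TtsProvider`, and `TtsError` are hypothetical names, and in the real code the provider check lives in `tts::synthesize` rather than in a separate planner.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TtsProvider {
    OpenAi,
    OpenRouter,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TtsPlan {
    /// Nothing to speak for this turn.
    Skip,
    /// Proceed to synthesis.
    Synthesize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TtsError {
    UnsupportedProvider(TtsProvider),
}

/// Decide whether a finished turn should be spoken.
pub fn plan_tts(
    voice_input: bool,
    tts_enabled: bool,
    provider: TtsProvider,
) -> Result<TtsPlan, TtsError> {
    // Typed turns and disabled TTS both skip synthesis without an error.
    if !voice_input || !tts_enabled {
        return Ok(TtsPlan::Skip);
    }
    match provider {
        TtsProvider::OpenAi => Ok(TtsPlan::Synthesize),
        other => Err(TtsError::UnsupportedProvider(other)),
    }
}
```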
The Tauri command surface in `commands.rs` consists of:
- `voice_start_recording`
- `voice_stop_and_transcribe`
- `voice_cancel_recording`
- `voice_settings_get`
- `voice_settings_save`
- `voice_provider_voices`
- `voice_tts_preview`
Keep new command arguments owned and serializable. Validate provider/model assumptions on the backend, not only in the settings UI.
- Missing microphone: `voice_start_recording` returns an error.
- Failed STT: the temporary WAV is still removed and the error is returned to the frontend.
- Failed TTS: the text answer remains available and an `AgentEvent::Error` is queued.
- Missing key: key resolution fails through the shared provider key lookup.
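One way to guarantee that the temporary WAV is removed even when STT fails is a drop guard. The sketch below is an assumption about how such cleanup could be structured, not a description of the actual backend; `TempWav` and `transcribe_and_cleanup` are hypothetical names, and the transcription step is passed in as a closure so the example stays free of network code.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Owns a temporary WAV path and deletes the file when dropped.
pub struct TempWav {
    path: PathBuf,
}

impl TempWav {
    pub fn new(path: PathBuf) -> Self {
        Self { path }
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempWav {
    fn drop(&mut self) {
        // Best-effort removal; a file that is already gone is not an error.
        let _ = fs::remove_file(&self.path);
    }
}

/// Run a fallible transcription step, removing the WAV on success and failure.
pub fn transcribe_and_cleanup<F>(wav: TempWav, transcribe: F) -> Result<String, String>
where
    F: FnOnce(&Path) -> Result<String, String>,
{
    let result = transcribe(wav.path());
    // Dropping the guard deletes the file regardless of the result.
    drop(wav);
    result
}
```

Because cleanup is tied to `Drop`, an early return or `?` inside a longer function would also remove the file without extra bookkeeping.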
Voice currently depends on:
- `cpal` for cross-platform audio input.
- `hound` for WAV writing.
- `uuid` for recording turn IDs.
- `reqwest` multipart support for STT uploads.