From 44da8666914e9b87d5134940889596fdcd7d7c71 Mon Sep 17 00:00:00 2001
From: Cale Shapera
Date: Thu, 18 Jun 2026 18:05:51 -0700
Subject: [PATCH] docs: address week-1 onboarding feedback

- API reference: Create Model (POST /model) response now shows train_mode
  as const "fast" (TTS-only); GET /model/{id} and the shared model schema
  default to "fast" while still allowing "full" for non-TTS models. Fixes
  the stale "full" example in the clone/create model reference.
- Emotion Control: document the [emphasis] tone marker (previously only in
  the changelog and models overview), with usage example.
- Emotion Control: move Tone Markers, Audio Effects and Special Effects out
  from under "Complete Emotion Reference" into a new "Sound & Delivery
  Markers" section, since these are not emotions.
- Real-time streaming + TTS: clarify latency defaults. The raw
  HTTP/WebSocket API defaults to "normal" (quality-tuned, high
  time-to-first-audio) while the Python SDK defaults to "balanced"; add a
  warning and a per-surface support matrix so real-time users (e.g. via
  LiveKit) set balanced/low.

Note: openapi.json is regenerated from platform-api; the durable train_mode
fix also needs platform-api apps/models/schemas.py updated (follow-up).
---
 api-reference/openapi.json                 | 10 +++-------
 developer-guide/core-features/emotions.mdx | 12 ++++++++++--
 features/realtime-streaming.mdx            | 14 +++++++++++++-
 features/text-to-speech.mdx                |  7 ++++++-
 snippets/emotion-list-tones-s2.mdx         |  1 +
 5 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/api-reference/openapi.json b/api-reference/openapi.json
index 8d2659f..cb160f6 100644
--- a/api-reference/openapi.json
+++ b/api-reference/openapi.json
@@ -1345,11 +1345,7 @@
   "type": "string"
 },
 "train_mode": {
-  "default": "full",
-  "enum": [
-    "fast",
-    "full"
-  ],
+  "const": "fast",
   "title": "Train Mode",
   "type": "string"
 },
@@ -1647,7 +1643,7 @@
   "type": "string"
 },
 "train_mode": {
-  "default": "full",
+  "default": "fast",
   "enum": [
     "fast",
     "full"
@@ -4052,7 +4048,7 @@
   "type": "string"
 },
 "train_mode": {
-  "default": "full",
+  "default": "fast",
   "enum": [
     "fast",
     "full"
diff --git a/developer-guide/core-features/emotions.mdx b/developer-guide/core-features/emotions.mdx
index c66f8b6..fae52ea 100644
--- a/developer-guide/core-features/emotions.mdx
+++ b/developer-guide/core-features/emotions.mdx
@@ -60,9 +60,17 @@ The S2 TTS models will interpret these markers and adjust the voice accordingly.
-### Tone Markers (5 expressions)
+## Sound & Delivery Markers
-Control volume and intensity:
+These markers aren't emotions — they shape *how* a line is delivered, add natural human sounds, or layer in ambient effects. Combine them with the emotion cues above.
+
+### Tone Markers (6 expressions)
+
+Control volume, intensity, and emphasis. Place `[emphasis]` right before the word or phrase you want to stress:
+
+```text
+This is [emphasis] really important.
+```
diff --git a/features/realtime-streaming.mdx b/features/realtime-streaming.mdx
index 2dd90bc..005c130 100644
--- a/features/realtime-streaming.mdx
+++ b/features/realtime-streaming.mdx
@@ -151,7 +151,7 @@ for chunk in client.tts.stream_websocket(script(), reference_id="YOUR_VOICE_ID")
 Both streaming paths take a `latency` mode:
-- `latency="balanced"` (default) — lowest time-to-first-audio. Use it for voice agents and live LLM output.
+- `latency="balanced"` (Python SDK default) — lowest time-to-first-audio. Use it for voice agents and live LLM output.
 - `latency="normal"` — slightly higher latency, best audio quality. Use it for narration where you can afford a beat.
 ```python
@@ -159,6 +159,18 @@ for chunk in client.tts.stream_websocket(llm_tokens(), latency="balanced"):
     ...
 ```
+
+  **Set `latency` explicitly for real-time use.** The Python SDK defaults to `balanced`, but the raw HTTP/WebSocket API defaults to `normal`, which is tuned for quality and noticeably increases time-to-first-audio — you may wait several seconds for the first chunk. If you call the API directly, or through a third-party integration such as the LiveKit plugin, pass `balanced` (or `low`) for interactive latency.
+
+
+The available modes differ slightly between the raw API and the SDK:
+
+| Mode       | Raw HTTP/WebSocket API | Python SDK    | Behavior                                       |
+| ---------- | ---------------------- | ------------- | ---------------------------------------------- |
+| `low`      | Supported              | Not available | Lowest latency                                 |
+| `balanced` | Supported              | Default       | Reduced latency — recommended for real-time    |
+| `normal`   | Default                | Supported     | Best quality, highest time-to-first-audio      |
+
 For finer control, pass a `TTSConfig` with chunk tuning. Smaller chunks emit audio sooner (lower latency); larger chunks give the model more context (smoother prosody):
 ```python
diff --git a/features/text-to-speech.mdx b/features/text-to-speech.mdx
index 0d9a104..e84c505 100644
--- a/features/text-to-speech.mdx
+++ b/features/text-to-speech.mdx
@@ -215,10 +215,15 @@ audio = client.tts.convert(
 `latency` trades stability for speed; `chunk_length` controls how much text the engine batches before it starts generating.
-- `latency="balanced"` (default) — lower time-to-first-audio (~300ms). Good for interactive use.
+- `latency="balanced"` (Python SDK default) — lower time-to-first-audio (~300ms). Good for interactive use.
 - `latency="normal"` — most stable output, at slightly higher latency.
+- `latency="low"` (raw API only) — lowest latency.
 - `chunk_length` (`100`–`300`, default `200`) — smaller chunks start audio sooner; larger chunks are more efficient for long text.
+
+  The raw HTTP/WebSocket API defaults `latency` to `normal` (quality-tuned), while the Python SDK defaults to `balanced`. For real-time use over the raw API, set `latency` to `balanced` or `low` explicitly — see [Tune latency vs. quality](/features/realtime-streaming#tune-latency-vs-quality).
+
+
 ```python Python
 from fishaudio.types import TTSConfig
diff --git a/snippets/emotion-list-tones-s2.mdx b/snippets/emotion-list-tones-s2.mdx
index e4d83d1..b979abe 100644
--- a/snippets/emotion-list-tones-s2.mdx
+++ b/snippets/emotion-list-tones-s2.mdx
@@ -5,3 +5,4 @@
 | Screaming | `[screaming]` | Very loud, panicked | Emergencies, fear |
 | Whispering | `[whispering]` | Very soft, secretive | Secrets, quiet scenes |
 | Soft | `[soft tone]` | Gentle, quiet | Comfort, lullabies |
+| Emphasis | `[emphasis]` | Stress a word/phrase | Highlighting key words |
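
Review note: the per-surface latency matrix added in features/realtime-streaming.mdx can be read as a small lookup. The sketch below only encodes that table as plain data. It is not part of the SDK or the API; the surface labels `raw_api` and `python_sdk` and both helper names are hypothetical, chosen for illustration.

```python
from typing import Optional

# Support matrix as stated in the realtime-streaming.mdx hunk of this patch.
# The surface labels are invented for this sketch, not real identifiers.
SUPPORTED = {
    "raw_api": {"low", "balanced", "normal"},
    "python_sdk": {"balanced", "normal"},
}
DEFAULTS = {"raw_api": "normal", "python_sdk": "balanced"}


def effective_latency(surface: str, requested: Optional[str] = None) -> str:
    """Return the latency mode a call would use on the given surface."""
    if surface not in SUPPORTED:
        raise ValueError(f"unknown surface: {surface!r}")
    if requested is None:
        # No explicit value: fall back to the documented per-surface default.
        return DEFAULTS[surface]
    if requested not in SUPPORTED[surface]:
        raise ValueError(f"{requested!r} is not available on {surface}")
    return requested


def is_realtime_safe(surface: str, requested: Optional[str] = None) -> bool:
    """True when the effective mode is one the docs recommend for real-time use."""
    return effective_latency(surface, requested) in {"low", "balanced"}
```

Under this reading, `is_realtime_safe("raw_api")` is false because the raw API falls back to `normal`, which is the case the new warning is meant to catch.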