From 44da8666914e9b87d5134940889596fdcd7d7c71 Mon Sep 17 00:00:00 2001
From: Cale Shapera <cale@fish.audio>
Date: Thu, 18 Jun 2026 18:05:51 -0700
Subject: [PATCH] docs: address week-1 onboarding feedback

- API reference: Create Model (POST /model) response now shows train_mode
  as const "fast" (TTS-only); GET /model/{id} and the shared model schema
  default to "fast" while still allowing "full" for non-TTS models. Fixes
  the stale "full" example in the clone/create model reference.
- Emotion Control: document the [emphasis] tone marker (previously only in
  the changelog and models overview), with usage example.
- Emotion Control: move Tone Markers, Audio Effects and Special Effects out
  from under "Complete Emotion Reference" into a new "Sound & Delivery
  Markers" section, since these are not emotions.
- Real-time streaming + TTS: clarify latency defaults. The raw HTTP/WebSocket
  API defaults to "normal" (quality-tuned, high time-to-first-audio) while
  the Python SDK defaults to "balanced"; add a warning and a per-surface
  support matrix so real-time users (e.g. via LiveKit) set balanced/low.

Note: openapi.json is regenerated from platform-api, so the edits here
hold only until the next regeneration; the durable train_mode fix also
requires updating platform-api apps/models/schemas.py (follow-up).
---
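Reviewer note (below the `---`, so not part of the commit message): the
per-surface latency matrix added to features/realtime-streaming.mdx can be
sanity-checked with a small sketch. Only the mode names and defaults come
from the table in this patch. `SUPPORTED`, `DEFAULTS`, `effective_latency`
and the surface labels `raw_api` / `python_sdk` are hypothetical names made
up for illustration, not part of the Fish Audio SDK or API.

```python
from typing import Optional

# Support matrix transcribed from the table this patch adds to
# features/realtime-streaming.mdx. The surface keys are hypothetical labels.
SUPPORTED = {
    "raw_api": {"low", "balanced", "normal"},
    "python_sdk": {"balanced", "normal"},
}

# Default mode per surface, as documented in the same table.
DEFAULTS = {"raw_api": "normal", "python_sdk": "balanced"}


def effective_latency(surface: str, requested: Optional[str] = None) -> str:
    """Return the latency mode a request on `surface` would end up using."""
    if surface not in SUPPORTED:
        raise ValueError(f"unknown surface: {surface!r}")
    if requested is None:
        # Nothing passed: the surface's documented default applies.
        return DEFAULTS[surface]
    if requested not in SUPPORTED[surface]:
        raise ValueError(f"{requested!r} is not available on {surface}")
    return requested


if __name__ == "__main__":
    for surface in ("raw_api", "python_sdk"):
        print(surface, "->", effective_latency(surface))
```

This shows the case the new warning targets: a caller on the raw API who
passes nothing ends up on `normal`, while the same omission in the Python
SDK yields `balanced`.
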
 api-reference/openapi.json                 | 10 +++-------
 developer-guide/core-features/emotions.mdx | 12 ++++++++++--
 features/realtime-streaming.mdx            | 14 +++++++++++++-
 features/text-to-speech.mdx                |  7 ++++++-
 snippets/emotion-list-tones-s2.mdx         |  1 +
 5 files changed, 33 insertions(+), 11 deletions(-)
diff --git a/api-reference/openapi.json b/api-reference/openapi.json
index 8d2659f..cb160f6 100644
--- a/api-reference/openapi.json
+++ b/api-reference/openapi.json
@@ -1345,11 +1345,7 @@
                       "type": "string"
                     },
                     "train_mode": {
-                      "default": "full",
-                      "enum": [
-                        "fast",
-                        "full"
-                      ],
+                      "const": "fast",
                       "title": "Train Mode",
                       "type": "string"
                     },
@@ -1647,7 +1643,7 @@
                       "type": "string"
                     },
                     "train_mode": {
-                      "default": "full",
+                      "default": "fast",
                       "enum": [
                         "fast",
                         "full"
@@ -4052,7 +4048,7 @@
             "type": "string"
           },
           "train_mode": {
-            "default": "full",
+            "default": "fast",
             "enum": [
               "fast",
               "full"
diff --git a/developer-guide/core-features/emotions.mdx b/developer-guide/core-features/emotions.mdx
index c66f8b6..fae52ea 100644
--- a/developer-guide/core-features/emotions.mdx
+++ b/developer-guide/core-features/emotions.mdx
@@ -60,9 +60,17 @@ The S2 TTS models will interpret these markers and adjust the voice accordingly.
 
 <AdvancedEmotions />
 
-### Tone Markers (5 expressions)
+## Sound & Delivery Markers
 
-Control volume and intensity:
+These markers aren't emotions — they shape *how* a line is delivered, add natural human sounds, or layer in ambient effects. Combine them with the emotion cues above.
+
+### Tone Markers (6 expressions)
+
+Control volume, intensity, and emphasis. Place `[emphasis]` right before the word or phrase you want to stress:
+
+```text
+This is [emphasis] really important.
+```
 
 <ToneMarkers />
 
diff --git a/features/realtime-streaming.mdx b/features/realtime-streaming.mdx
index 2dd90bc..005c130 100644
--- a/features/realtime-streaming.mdx
+++ b/features/realtime-streaming.mdx
@@ -151,7 +151,7 @@ for chunk in client.tts.stream_websocket(script(), reference_id="YOUR_VOICE_ID")
 
 Both streaming paths take a `latency` mode:
 
-- `latency="balanced"` (default) — lowest time-to-first-audio. Use it for voice agents and live LLM output.
+- `latency="balanced"` (Python SDK default) — lowest time-to-first-audio the SDK offers. Use it for voice agents and live LLM output.
 - `latency="normal"` — slightly higher latency, best audio quality. Use it for narration where you can afford a beat.
 
 ```python
@@ -159,6 +159,18 @@ for chunk in client.tts.stream_websocket(llm_tokens(), latency="balanced"):
     ...
 ```
 
+<Warning>
+  **Set `latency` explicitly for real-time use.** The Python SDK defaults to `balanced`, but the raw HTTP/WebSocket API defaults to `normal`, which is tuned for quality and noticeably increases time-to-first-audio — you may wait several seconds for the first chunk. If you call the API directly, or through a third-party integration such as the LiveKit plugin, pass `balanced` (or `low`) for interactive latency.
+</Warning>
+
+The available modes differ slightly between the raw API and the SDK:
+
+| Mode       | Raw HTTP/WebSocket API | Python SDK    | Behavior                                       |
+| ---------- | ---------------------- | ------------- | ---------------------------------------------- |
+| `low`      | Supported              | Not available | Lowest latency                                 |
+| `balanced` | Supported              | Default       | Reduced latency — recommended for real-time    |
+| `normal`   | Default                | Supported     | Best quality, highest time-to-first-audio      |
+
 For finer control, pass a `TTSConfig` with chunk tuning. Smaller chunks emit audio sooner (lower latency); larger chunks give the model more context (smoother prosody):
 
 ```python
diff --git a/features/text-to-speech.mdx b/features/text-to-speech.mdx
index 0d9a104..e84c505 100644
--- a/features/text-to-speech.mdx
+++ b/features/text-to-speech.mdx
@@ -215,10 +215,15 @@ audio = client.tts.convert(
 
 `latency` trades stability for speed; `chunk_length` controls how much text the engine batches before it starts generating.
 
-- `latency="balanced"` (default) — lower time-to-first-audio (~300ms). Good for interactive use.
+- `latency="balanced"` (Python SDK default) — lower time-to-first-audio (~300ms). Good for interactive use.
 - `latency="normal"` — most stable output, at slightly higher latency.
+- `latency="low"` (raw API only) — lowest latency.
 - `chunk_length` (`100`–`300`, default `200`) — smaller chunks start audio sooner; larger chunks are more efficient for long text.
 
+<Note>
+  The raw HTTP/WebSocket API defaults `latency` to `normal` (quality-tuned), while the Python SDK defaults to `balanced`. For real-time use over the raw API, set `latency` to `balanced` or `low` explicitly — see [Tune latency vs. quality](/features/realtime-streaming#tune-latency-vs-quality).
+</Note>
+
 <CodeGroup>
 ```python Python
 from fishaudio.types import TTSConfig
diff --git a/snippets/emotion-list-tones-s2.mdx b/snippets/emotion-list-tones-s2.mdx
index e4d83d1..b979abe 100644
--- a/snippets/emotion-list-tones-s2.mdx
+++ b/snippets/emotion-list-tones-s2.mdx
@@ -5,3 +5,4 @@
 | Screaming  | `[screaming]`       | Very loud, panicked  | Emergencies, fear          |
 | Whispering | `[whispering]`      | Very soft, secretive | Secrets, quiet scenes      |
 | Soft       | `[soft tone]`       | Gentle, quiet        | Comfort, lullabies         |
+| Emphasis   | `[emphasis]`        | Stress a word/phrase | Highlighting key words     |