From 438ddfbf9c5e82c42c2990f695262f08a250ff41 Mon Sep 17 00:00:00 2001
From: lengyue
Date: Tue, 23 Jun 2026 17:51:47 -0700
Subject: [PATCH] docs: add s2.1 pro pricing and model guidance

---
 .../models-pricing/choosing-a-model.mdx       |  5 ++-
 .../models-pricing/models-overview.mdx        | 42 ++++++++++++++-----
 .../pricing-and-rate-limits.mdx               | 10 +++--
 features/text-to-speech.mdx                   |  6 ++-
 overview/capabilities.mdx                     |  8 ++--
 5 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/developer-guide/models-pricing/choosing-a-model.mdx b/developer-guide/models-pricing/choosing-a-model.mdx
index f3a10ca..9f05621 100644
--- a/developer-guide/models-pricing/choosing-a-model.mdx
+++ b/developer-guide/models-pricing/choosing-a-model.mdx
@@ -12,7 +12,8 @@ import { AudioTranscript } from '/snippets/audio-transcript.jsx';
+We recommend using **Fish Audio S2.1-Pro** for production projects. It improves on S2-Pro quality, latency, and throughput, and is the right choice when you need production TTFA and DPA guarantees.
-We recommend using **Fish Audio S2-Pro** for all projects - our flagship model with industry-leading quality and performance.
+Use **`s2.1-pro-free`** for testing, prototyping, development, and smaller businesses. It is the same model as S2.1-Pro at $0, but it does not guarantee TTFA or DPA.
-
\ No newline at end of file
+
diff --git a/developer-guide/models-pricing/models-overview.mdx b/developer-guide/models-pricing/models-overview.mdx
index 9047691..4ae4382 100644
--- a/developer-guide/models-pricing/models-overview.mdx
+++ b/developer-guide/models-pricing/models-overview.mdx
@@ -20,10 +20,32 @@ Fish Audio offers state-of-the-art text-to-speech models optimized for different
 ### Recommended Model
-
-  **Fish Audio S2-Pro** - Our next-generation TTS model with best-in-class performance
+
+  **Fish Audio S2.1-Pro** - Our recommended production TTS model and an improved version of S2-Pro
   - Natural language control with `[bracket]` syntax — not limited to a fixed set (e.g., `[whispers sweetly]`, `[laughing nervously]`)
-  - Multi-speaker dialogue support **(S2-Pro exclusive)**
+  - Multi-speaker dialogue support
+  - 83 languages
+  - Improved quality, latency, and throughput over S2-Pro
+  - Production option for workloads that need TTFA and DPA guarantees
+
+
+### Free Development Model
+
+
+  **Fish Audio S2.1-Pro Free** - The same model as S2.1-Pro, available at $0 for development and testing
+  - Use the `s2.1-pro-free` model string with the same TTS API endpoint
+  - Same model quality and language coverage as `s2.1-pro`
+  - Free to use under fair-use limits
+  - No TTFA or DPA guarantees
+  - Best for testing, prototyping, development, and smaller businesses
+
+
+### Previous S2 Model
+
+
+  **Fish Audio S2-Pro** - Previous-generation S2 TTS model
+  - Natural language control with `[bracket]` syntax — not limited to a fixed set (e.g., `[whispers sweetly]`, `[laughing nervously]`)
+  - Multi-speaker dialogue support
   - 80+ languages
   - 100ms time-to-first-audio
   - Full SGLang-based serving stack
@@ -31,7 +53,7 @@ Fish Audio offers state-of-the-art text-to-speech models optimized for different
-We recommend using `s2-pro` for all new projects to access the latest capabilities and performance improvements. S1 remains available for existing integrations.
+We recommend using `s2.1-pro` for production projects. Use `s2.1-pro-free` when you want the same model for evaluation, prototyping, development, and smaller businesses without TTFA or DPA guarantees. S1 remains available for existing integrations.
 ### Previous Model
@@ -54,9 +76,9 @@ We recommend using `s2-pro` for all new projects to access the latest capabiliti
 ## Supported Languages
-### S2-Pro
+### S2.1-Pro and S2-Pro
-S2-Pro supports 80+ languages with automatic language detection and inline emotion and paralinguistic cue support.
+S2.1-Pro supports 83 languages, while S2-Pro supports 80+ languages. Both use automatic language detection and support inline emotion and paralinguistic cues.
 Language detection is automatic - simply provide text in your target language.
@@ -76,9 +98,9 @@ Russian, Dutch, Italian, Polish, Portuguese
 Fish Audio models support emotional expressions and voice styles that can be controlled through text markers in your input.
-### S2-Pro Natural Language Control
+### S2.1-Pro and S2-Pro Natural Language Control
-S2-Pro treats `[bracket]` tags as standard text rather than dedicated control tokens. Through training on massive datasets, the model learned implicit mappings between natural language descriptions and acoustic variations. This means you are not limited to a predefined set of tags — you can use any descriptive expression and the model will interpret it, such as `[whispers sweetly]` or `[laughing nervously]`.
+S2.1-Pro and S2-Pro treat `[bracket]` tags as standard text rather than dedicated control tokens. Through training on massive datasets, the models learned implicit mappings between natural language descriptions and acoustic variations. This means you are not limited to a predefined set of tags — you can use any descriptive expression and the model will interpret it, such as `[whispers sweetly]` or `[laughing nervously]`.
 Common examples include:
@@ -88,7 +110,7 @@ Common examples include:
 ```
-S2-Pro cues can be placed anywhere in your text to control emotion at specific positions. For example: `"I can't believe it [gasp] you actually did it [laugh]"`
+S2 cues can be placed anywhere in your text to control emotion at specific positions. For example: `"I can't believe it [gasp] you actually did it [laugh]"`
 ### S1 Voice Styles and Emotions
@@ -127,4 +149,4 @@ S1 supports 64+ emotional expressions using `(parenthesis)` syntax.
 You can also use natural expressions like "Ha,ha,ha" for laughter. Experiment with combinations to achieve the perfect emotional tone for your application.
-
\ No newline at end of file
+
diff --git a/developer-guide/models-pricing/pricing-and-rate-limits.mdx b/developer-guide/models-pricing/pricing-and-rate-limits.mdx
index 89eae97..a5927f9 100644
--- a/developer-guide/models-pricing/pricing-and-rate-limits.mdx
+++ b/developer-guide/models-pricing/pricing-and-rate-limits.mdx
@@ -21,10 +21,12 @@ The Fish Audio API uses pay-as-you-go pricing based on actual usage. There are n
 TTS pricing is based on the size of input text, measured in millions of UTF-8 bytes.
-| Model Name   | Price (USD)            |
-|--------------|------------------------|
-| `s2-pro`     | $15.00 / M UTF-8 bytes |
-| `s1`         | $15.00 / M UTF-8 bytes |
+| Model Name        | Price (USD)            |
+|-------------------|------------------------|
+| `s2.1-pro`        | $15.00 / M UTF-8 bytes |
+| `s2.1-pro-free`   | $0.00 / M UTF-8 bytes  |
+| `s2-pro`          | $15.00 / M UTF-8 bytes |
+| `s1`              | $15.00 / M UTF-8 bytes |
 1M UTF-8 bytes is approximately 180,000 English words, or about 12 hours of speech
diff --git a/features/text-to-speech.mdx b/features/text-to-speech.mdx
index 0d9a104..d075f7c 100644
--- a/features/text-to-speech.mdx
+++ b/features/text-to-speech.mdx
@@ -5,7 +5,7 @@ description: "Turn text into lifelike speech — use it however you build"
 icon: "microphone"
 ---
-Generate natural speech from text with the `s2-pro` and `s1` models. Pick a voice, choose a format, and go — from the API directly, the Python library, or JavaScript.
+Generate natural speech from text with the `s2.1-pro`, `s2-pro`, and `s1` models. Pick a voice, choose a format, and go — from the API directly, the Python library, or JavaScript.
@@ -107,7 +107,9 @@ curl --request POST https://api.fish.audio/v1/tts \
 ### Models
-- **`s2-pro`** (default) — highest quality, multi-speaker, natural-language expression control.
+- **`s2.1-pro`** — recommended for production, with improved quality, latency, and throughput over S2-Pro.
+- **`s2.1-pro-free`** — the same model at $0 for testing, prototyping, development, and smaller businesses, without TTFA or DPA guarantees.
+- **`s2-pro`** (default) — previous-generation S2 model with multi-speaker and natural-language expression control.
 - **`s1`** — previous generation, `(parenthesis)` emotion tags.
 In the API, select with the `model` request header. In Python, pass `model="s2-pro"`. See [Choosing a Model](/developer-guide/models-pricing/choosing-a-model).
diff --git a/overview/capabilities.mdx b/overview/capabilities.mdx
index 57184e8..13a54b7 100644
--- a/overview/capabilities.mdx
+++ b/overview/capabilities.mdx
@@ -11,7 +11,7 @@ Fish Audio is a voice AI platform. Every core feature is available three ways: i
-  Convert text into lifelike speech with the `s2-pro` and `s1` models.
+  Convert text into lifelike speech with the `s2.1-pro`, `s2-pro`, and `s1` models.
@@ -55,9 +55,11 @@ These run in the browser, no code required — see the [Platform guide](/overvie
 ## Models
-Two text-to-speech models power most capabilities:
+These text-to-speech models power most capabilities:
-- **`s2-pro`** — the default, highest-quality model, with multi-speaker and natural-language expression control.
+- **`s2.1-pro`** — the recommended production model, with improved quality, latency, and throughput over S2-Pro.
+- **`s2.1-pro-free`** — the same model at $0 for testing, prototyping, development, and smaller businesses, without TTFA or DPA guarantees.
+- **`s2-pro`** — the previous-generation S2 model, with multi-speaker and natural-language expression control.
 - **`s1`** — the previous generation, with `(parenthesis)` emotion tags.
 See [Models Overview](/developer-guide/models-pricing/models-overview) and [Choosing a Model](/developer-guide/models-pricing/choosing-a-model) for the full lineup, languages, and limits.
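
The byte-based pricing in the `pricing-and-rate-limits.mdx` hunk can be sanity-checked with a short calculation. The following is a minimal Python sketch, assuming the per-model rates in the updated table and assuming the whole input string (including any `[bracket]` cues) counts toward the billed size, which the patch does not state. `PRICE_PER_MILLION_BYTES`, `utf8_size`, and `estimate_tts_cost` are hypothetical names for illustration and are not part of any Fish Audio SDK.

```python
# Hypothetical rate table, copied from the pricing table in this patch
# (USD per 1,000,000 UTF-8 bytes of input text).
PRICE_PER_MILLION_BYTES = {
    "s2.1-pro": 15.00,
    "s2.1-pro-free": 0.00,
    "s2-pro": 15.00,
    "s1": 15.00,
}


def utf8_size(text: str) -> int:
    """Return the size of the text in UTF-8 bytes (not characters)."""
    return len(text.encode("utf-8"))


def estimate_tts_cost(text: str, model: str = "s2.1-pro") -> float:
    """Estimate the USD cost of synthesizing the text with the given model.

    Assumption: the entire input string is billed, cues included.
    """
    if model not in PRICE_PER_MILLION_BYTES:
        raise ValueError(f"unknown model: {model!r}")
    return utf8_size(text) / 1_000_000 * PRICE_PER_MILLION_BYTES[model]


if __name__ == "__main__":
    sample = "I can't believe it [gasp] you actually did it [laugh]"
    for name in PRICE_PER_MILLION_BYTES:
        print(name, utf8_size(sample), estimate_tts_cost(sample, name))
```

Measuring encoded bytes rather than characters matters for non-ASCII text: accented Latin letters take two bytes each and most CJK characters take three, so the same character count can cost noticeably more than plain English.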