Merged
5 changes: 3 additions & 2 deletions developer-guide/models-pricing/choosing-a-model.mdx
Original file line number Diff line number Diff line change
@@ -12,7 +12,8 @@ import { AudioTranscript } from '/snippets/audio-transcript.jsx';
<AudioTranscript page="models-pricing-choosing-a-model" />
</Visibility>

We recommend using **Fish Audio S2.1-Pro** for production projects. It improves on S2-Pro in quality, latency, and throughput, and is the right choice when you need guaranteed TTFA (time-to-first-audio) and DPA in production.

We recommend using **Fish Audio S2-Pro** for all projects - our flagship model with industry-leading quality and performance.
Use **`s2.1-pro-free`** for testing, prototyping, development, and smaller businesses. It is the same model as S2.1-Pro at $0, but it does not guarantee TTFA or DPA.

<Support />
<Support />
42 changes: 32 additions & 10 deletions developer-guide/models-pricing/models-overview.mdx
@@ -20,18 +20,40 @@ Fish Audio offers state-of-the-art text-to-speech models optimized for different

### Recommended Model

<Card title="s2-pro" icon="star">
**Fish Audio S2-Pro** - Our next-generation TTS model with best-in-class performance
<Card title="s2.1-pro" icon="star">
**Fish Audio S2.1-Pro** - Our recommended production TTS model and an improved version of S2-Pro
- Natural language control with `[bracket]` syntax — not limited to a fixed set (e.g., `[whispers sweetly]`, `[laughing nervously]`)
- Multi-speaker dialogue support **(S2-Pro exclusive)**
- Multi-speaker dialogue support
- 83 languages
- Improved quality, latency, and throughput over S2-Pro
- Production option for workloads that need TTFA and DPA guarantees
</Card>

### Free Development Model

<Card title="s2.1-pro-free" icon="flask">
**Fish Audio S2.1-Pro Free** - The same model as S2.1-Pro, available at $0 for development and testing
- Use the `s2.1-pro-free` model string with the same TTS API endpoint
- Same model quality and language coverage as `s2.1-pro`
- Free to use under fair-use limits
- No TTFA or DPA guarantees
- Best for testing, prototyping, development, and smaller businesses
</Card>

### Previous S2 Model

<Card title="s2-pro" icon="microchip">
**Fish Audio S2-Pro** - Previous-generation S2 TTS model
- Natural language control with `[bracket]` syntax — not limited to a fixed set (e.g., `[whispers sweetly]`, `[laughing nervously]`)
- Multi-speaker dialogue support
- 80+ languages
- 100ms time-to-first-audio
- Full SGLang-based serving stack
- Open-source
</Card>

<Note>
We recommend using `s2-pro` for all new projects to access the latest capabilities and performance improvements. S1 remains available for existing integrations.
We recommend using `s2.1-pro` for production projects. Use `s2.1-pro-free` when you want the same model for evaluation, prototyping, development, and smaller businesses without TTFA or DPA guarantees. S1 remains available for existing integrations.
</Note>
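
The recommendation in the note above comes down to one question: does the workload need TTFA or DPA guarantees? A minimal sketch of that rule follows. The helper name `pick_model` and its parameter are hypothetical and not part of any Fish Audio SDK; only the two model strings come from this page.

```python
def pick_model(needs_guarantees: bool) -> str:
    """Return a model string following the recommendation in the note above.

    Hypothetical helper for illustration only.
    """
    # s2.1-pro is the production option with TTFA and DPA guarantees;
    # s2.1-pro-free is the same model at $0 without those guarantees.
    return "s2.1-pro" if needs_guarantees else "s2.1-pro-free"


print(pick_model(needs_guarantees=False))  # s2.1-pro-free
```

Keeping the choice in one place makes it easy to move from the free model during development to `s2.1-pro` once guarantees matter.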

### Previous Model
@@ -54,9 +76,9 @@ We recommend using `s2-pro` for all new projects to access the latest capabiliti

## Supported Languages

### S2-Pro
### S2.1-Pro and S2-Pro

S2-Pro supports 80+ languages with automatic language detection and inline emotion and paralinguistic cue support.
S2.1-Pro supports 83 languages, while S2-Pro supports 80+ languages. Both use automatic language detection and support inline emotion and paralinguistic cues.

<Info>
Language detection is automatic - simply provide text in your target language.
@@ -76,9 +98,9 @@ Russian, Dutch, Italian, Polish, Portuguese

Fish Audio models support emotional expressions and voice styles that can be controlled through text markers in your input.

### S2-Pro Natural Language Control
### S2.1-Pro and S2-Pro Natural Language Control

S2-Pro treats `[bracket]` tags as standard text rather than dedicated control tokens. Through training on massive datasets, the model learned implicit mappings between natural language descriptions and acoustic variations. This means you are not limited to a predefined set of tags — you can use any descriptive expression and the model will interpret it, such as `[whispers sweetly]` or `[laughing nervously]`.
S2.1-Pro and S2-Pro treat `[bracket]` tags as standard text rather than dedicated control tokens. Through training on massive datasets, the models learned implicit mappings between natural language descriptions and acoustic variations. This means you are not limited to a predefined set of tags — you can use any descriptive expression and the model will interpret it, such as `[whispers sweetly]` or `[laughing nervously]`.

Common examples include:

@@ -88,7 +110,7 @@ Common examples include:
```

<Tip>
S2-Pro cues can be placed anywhere in your text to control emotion at specific positions. For example: `"I can't believe it [gasp] you actually did it [laugh]"`
S2 cues can be placed anywhere in your text to control emotion at specific positions. For example: `"I can't believe it [gasp] you actually did it [laugh]"`
</Tip>

### S1 Voice Styles and Emotions
@@ -127,4 +149,4 @@ S1 supports 64+ emotional expressions using `(parenthesis)` syntax.
You can also use natural expressions like "Ha,ha,ha" for laughter. Experiment with combinations to achieve the perfect emotional tone for your application.
</Tip>

<Support />
<Support />
10 changes: 6 additions & 4 deletions developer-guide/models-pricing/pricing-and-rate-limits.mdx
@@ -21,10 +21,12 @@ The Fish Audio API uses pay-as-you-go pricing based on actual usage. There are n

TTS pricing is based on the size of input text, measured in millions of UTF-8 bytes.

| Model Name | Price (USD) |
|--------------|------------------------|
| `s2-pro` | $15.00 / M UTF-8 bytes |
| `s1` | $15.00 / M UTF-8 bytes |
| Model Name | Price (USD) |
|-------------------|------------------------|
| `s2.1-pro` | $15.00 / M UTF-8 bytes |
| `s2.1-pro-free` | $0.00 / M UTF-8 bytes |
| `s2-pro` | $15.00 / M UTF-8 bytes |
| `s1` | $15.00 / M UTF-8 bytes |
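
Because billing is measured in UTF-8 bytes rather than characters, the cost of a request can be estimated from the encoded length of its text. The sketch below copies the prices from the table above. The helper names `utf8_size` and `estimate_cost_usd` are hypothetical, and the result is only an estimate: it ignores any rounding or fair-use limits applied to actual bills.

```python
# USD per million UTF-8 bytes, copied from the pricing table above.
PRICE_PER_MILLION_BYTES = {
    "s2.1-pro": 15.00,
    "s2.1-pro-free": 0.00,
    "s2-pro": 15.00,
    "s1": 15.00,
}


def utf8_size(text: str) -> int:
    """Count the UTF-8 bytes in the text, the unit that pricing is based on."""
    return len(text.encode("utf-8"))


def estimate_cost_usd(text: str, model: str) -> float:
    """Estimate the TTS cost of one request (hypothetical helper)."""
    return utf8_size(text) / 1_000_000 * PRICE_PER_MILLION_BYTES[model]


# Non-ASCII text is measured by encoded size, not by character count.
print(utf8_size("hello"))  # 5
print(utf8_size("你好"))  # 6
```

Two Chinese characters count as six bytes here, so text in non-Latin scripts generally costs more per character than English text.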

<Info>
1M UTF-8 bytes is approximately 180,000 English words, or about 12 hours of speech
6 changes: 4 additions & 2 deletions features/text-to-speech.mdx
@@ -5,7 +5,7 @@
icon: "microphone"
---

Generate natural speech from text with the `s2-pro` and `s1` models. Pick a voice, choose a format, and go — from the API directly, the Python library, or JavaScript.
Generate natural speech from text with the `s2.1-pro`, `s2-pro`, and `s1` models. Pick a voice, choose a format, and go — from the API directly, the Python library, or JavaScript.

<CardGroup cols={3}>
<Card title="Use it in the web app" icon="browser" href="https://fish.audio/app/text-to-speech">
@@ -23,7 +23,7 @@

<CardGroup cols={2}>
<Card title="Voiceovers & narration" icon="film">
Audiobooks, explainers, ads, and video narration.
</Card>
<Card title="Conversational AI" icon="comments">
Speak an assistant's replies — pair with [streaming](/features/realtime-streaming) for low latency.
@@ -42,7 +42,7 @@

<CodeGroup>
```python Python
from fishaudio import FishAudio
from fishaudio.utils import save

client = FishAudio() # reads FISH_API_KEY
@@ -107,7 +107,9 @@

### Models

- **`s2-pro`** (default) — highest quality, multi-speaker, natural-language expression control.
- **`s2.1-pro`** — recommended for production, with improved quality, latency, and throughput over S2-Pro.
- **`s2.1-pro-free`** — the same model at $0 for testing, prototyping, development, and smaller businesses, without TTFA or DPA guarantees.
- **`s2-pro`** (default) — previous-generation S2 model with multi-speaker and natural-language expression control.
- **`s1`** — previous generation, `(parenthesis)` emotion tags.

In the API, select with the `model` request header. In Python, pass `model="s2-pro"`. See [Choosing a Model](/developer-guide/models-pricing/choosing-a-model).
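
As a rough illustration of selecting a model through the `model` request header, the sketch below assembles a request without sending it. Only the header name and the model strings come from this page. The endpoint URL, the bearer-token auth scheme, the JSON payload shape, and the helper `build_tts_request` are assumptions for illustration, so check them against the API reference before use.

```python
import json

# Model strings listed on this page.
KNOWN_MODELS = {"s2.1-pro", "s2.1-pro-free", "s2-pro", "s1"}

# Assumption: illustrative endpoint URL, not confirmed by this page.
TTS_URL = "https://api.fish.audio/v1/tts"


def build_tts_request(text: str, model: str, api_key: str) -> dict:
    """Assemble the parts of a TTS request (hypothetical helper, sends nothing)."""
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return {
        "url": TTS_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
            "model": model,  # the model is selected with this request header
        },
        "body": json.dumps({"text": text}),  # assumed minimal payload
    }


request = build_tts_request("Hello!", "s2.1-pro-free", api_key="YOUR_API_KEY")
print(request["headers"]["model"])  # s2.1-pro-free
```

Because the model travels in a header rather than in the payload, switching from `s2.1-pro-free` to `s2.1-pro` should not require changing the request body.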
@@ -191,16 +193,16 @@

To reuse a voice across many requests, [clone it once](/features/voice-cloning) and pass the resulting `reference_id` instead.

### Format & bitrate

Pick a format for your delivery channel, and tune bitrate to trade size against quality:

| Format | Notes |
|---|---|
| `mp3` (default) | good size/quality balance; set `mp3_bitrate` to `64`, `128`, or `192` |
| `wav` | uncompressed, highest quality; set `sample_rate` (e.g. `44100`) |
| `pcm` | raw samples, no container — for low-latency playback and telephony pipelines |
| `opus` | efficient for streaming; bitrate is automatic (`opus_bitrate=-1000`) |
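
The table above can be read as a small set of per-format rules. The sketch below encodes them as a plain dictionary builder that does not depend on the SDK. The option names `mp3_bitrate`, `sample_rate`, and `opus_bitrate` and their values come from the table. The `format` key, the default values, and the helper name `audio_options` are assumptions made for illustration.

```python
# Bitrates the table lists for mp3.
ALLOWED_MP3_BITRATES = {64, 128, 192}


def audio_options(fmt: str = "mp3", mp3_bitrate: int = 128,
                  sample_rate: int = 44100) -> dict:
    """Build output options for one format (hypothetical helper)."""
    if fmt == "mp3":
        if mp3_bitrate not in ALLOWED_MP3_BITRATES:
            raise ValueError(f"mp3_bitrate must be one of {sorted(ALLOWED_MP3_BITRATES)}")
        return {"format": "mp3", "mp3_bitrate": mp3_bitrate}
    if fmt == "wav":
        # Uncompressed output; the sample rate is the main setting.
        return {"format": "wav", "sample_rate": sample_rate}
    if fmt == "pcm":
        # Raw samples with no container.
        return {"format": "pcm"}
    if fmt == "opus":
        # -1000 is the automatic-bitrate value shown in the table.
        return {"format": "opus", "opus_bitrate": -1000}
    raise ValueError(f"unsupported format: {fmt!r}")


print(audio_options("mp3", mp3_bitrate=192))
```

Validating the bitrate before a request is sent surfaces a bad value locally instead of as an API error.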

```python
from fishaudio.types import TTSConfig
8 changes: 5 additions & 3 deletions overview/capabilities.mdx
@@ -11,7 +11,7 @@

<CardGroup cols={2}>
<Card title="Text to Speech" icon="microphone" href="/features/text-to-speech">
Convert text into lifelike speech with the `s2-pro` and `s1` models.
Convert text into lifelike speech with the `s2.1-pro`, `s2-pro`, and `s1` models.
</Card>

<Card title="Speech to Text" icon="waveform" href="/features/speech-to-text">
@@ -41,7 +41,7 @@
</Card>

<Card title="Story Studio" icon="book-open" href="/overview/platform">
Produce multi-speaker, long-form audio — audiobooks and narration.
</Card>

<Card title="Music & Sound Effects" icon="music" href="/overview/platform">
@@ -55,9 +55,11 @@

## Models

Two text-to-speech models power most capabilities:
These text-to-speech models power most capabilities:

- **`s2-pro`** — the default, highest-quality model, with multi-speaker and natural-language expression control.
- **`s2.1-pro`** — the recommended production model, with improved quality, latency, and throughput over S2-Pro.
- **`s2.1-pro-free`** — the same model at $0 for testing, prototyping, development, and smaller businesses, without TTFA or DPA guarantees.
- **`s2-pro`** — the previous-generation S2 model, with multi-speaker and natural-language expression control.
- **`s1`** — the previous generation, with `(parenthesis)` emotion tags.

See [Models Overview](/developer-guide/models-pricing/models-overview) and [Choosing a Model](/developer-guide/models-pricing/choosing-a-model) for the full lineup, languages, and limits.