Pure Rust inference for Qwen3-TTS, a text-to-speech model from Alibaba. Built on candle — no Python or ONNX runtime required.
All code in this repo was written with Claude Code. This is an experiment -- not a production library.
- Pre-allocated KV cache with InplaceOp2 (zero-copy CUDA writes, no Tensor::cat)
- GPU-side repetition penalty mask (incremental slice_assign, eliminates growing CPU transfer)
- Deferred acoustic codes transfer (single bulk GPU→CPU at end of generation)
- Fused residual + RMSNorm CUDA kernel
- GPU→CPU syncs reduced from 3/frame to 1/frame (4-byte EOS check)
- Non-streaming RTF: 0.48–0.67 across all variants (97-100% of theoretical throughput)
- GPU-side sampling: batched argmax, on-device top-k/top-p/repetition penalty
- Eliminated 15 of 16 GPU→CPU syncs per frame in code predictor
- Cached token suppression mask in streaming sessions
- Tokenizer fallback from vocab.json + merges.txt when tokenizer.json is unavailable
- Profiling infrastructure: Chrome tracing, flamegraph, Nsight Systems via Makefile
- Benchmarked all 4 model variants (0.6B Base, 1.7B Base/CustomVoice/VoiceDesign)
- Self-contained model directories (removed tokenizer symlinks)
- Enhanced waveform plots with stats annotation bar
- ICL voice cloning now works correctly with proper reference audio
- Fixed WAV output format (WAVEX/float32 → standard WAV/PCM16) — resolves playback speed issues in some players
- Improved tokenizer path resolution with explicit `--tokenizer-dir` override
- Added benchmarking suite (Criterion micro-benchmarks + E2E speed tests)
- Automatic resampling of reference audio to 24kHz for voice cloning
- Docker base image updated to NGC pytorch:25.11 (CUDA 13.0)
Thanks to u/rngesius for feedback on playback speed and tokenizer issues.
- Qwen Team (Alibaba) — Qwen3-TTS model, weights, and technical report
- candle — Rust ML framework by Hugging Face
- mlx-audio — reference implementation that helped clarify model details
- Claude Code — wrote the code
- CPU inference with optional MKL/Accelerate for faster BLAS operations
- CUDA support for NVIDIA GPU acceleration with bf16 and Flash Attention 2
- Metal support for Apple Silicon
- Streaming synthesis for low-latency audio output
- Voice cloning via x-vector or ICL (in-context learning) from reference audio (Base models)
- Preset speakers with 9 built-in voices (CustomVoice models)
- Text-described voices via natural language prompts (VoiceDesign models)
- Auto-detection of model variant from `config.json`
- HuggingFace Hub integration for easy model downloads
Benchmarked on an NVIDIA DGX Spark (GB10 Blackwell, ARM Cortex-X925, 120 GB unified memory). Default generation parameters, seed 42, 2 warmup + 3 timed iterations.
| Model | RTF (short) | RTF (long) | Tok/s | Memory |
|---|---|---|---|---|
| 0.6B Base (CUDA BF16) | 0.48 | 0.50 | 25.9 | 767 MB |
| 1.7B Base (CUDA BF16) | 0.65 | 0.65 | 19.4 | 767 MB |
| 1.7B CustomVoice (CUDA BF16) | 0.64 | 0.67 | 19.2 | 772 MB |
| 1.7B VoiceDesign (CUDA BF16) | 0.64 | 0.66 | 19.3 | 770 MB |
| 1.7B CustomVoice (CPU F32) | 5.39 | 6.48 | 2.1 | 9.1 GB |
RTF (real-time factor) = wall-clock time / audio duration; values below 1.0 are faster than real-time (e.g., generating 10 s of audio in 5 s of wall-clock time gives RTF 0.5). Non-streaming results shown above. Streaming adds ~8-12% overhead with TTFA (time to first audio) of ~444 ms (0.6B) / ~580 ms (1.7B).
See docs/BENCHMARKS.md for full results, test corpus, micro-benchmarks, and reproduction instructions.
All samples generated with 1.7B models, seed 42. Text: "The sun set behind the mountains, painting the sky in shades of gold and violet."
Five official model variants exist across two size classes. Each variant supports a different speaker conditioning method:
| Variant | Params | Speaker Conditioning | Use Case |
|---|---|---|---|
| 0.6B Base | 1.8 GB | Voice cloning from reference audio | Clone any voice from a WAV file |
| 0.6B CustomVoice | 1.8 GB | 9 preset speakers | Pick from built-in voices |
| 1.7B Base | 3.9 GB | Voice cloning from reference audio | Higher quality voice cloning |
| 1.7B CustomVoice | 3.9 GB | 9 preset speakers | Higher quality preset voices |
| 1.7B VoiceDesign | 3.8 GB | Text description | Describe a voice in natural language |
- Want to clone a specific voice? Use a Base model with `--ref-audio` (ICL mode) or `--ref-audio --x-vector-only` (faster, lower quality).
- Want a quick preset voice? Use a CustomVoice model with `--speaker`.
- Want to describe a voice in text? Use 1.7B VoiceDesign with `--instruct`.
- Unsure? Start with 0.6B CustomVoice for the fastest results.
| | Preset speakers | Voice clone (x_vector) | Voice clone (ICL) | Text-described voice |
|---|---|---|---|---|
| Base | | x | x | |
| CustomVoice | x | | | |
| VoiceDesign | | | | x |
Using the wrong combination (e.g. preset speakers on a Base model) won't crash, but produces unpredictable voice output. The library and CLI warn when this happens.
Add to your `Cargo.toml`:

```toml
[dependencies]
qwen3-tts = { version = "0.1", features = ["hub"] }
```

| Feature | Description |
|---|---|
| `cpu` (default) | CPU inference |
| `cuda` | NVIDIA GPU acceleration |
| `flash-attn` | Flash Attention 2 (requires CUDA toolkit; enables bf16 compute) |
| `metal` | Apple Silicon GPU acceleration |
| `mkl` | Intel MKL for faster CPU inference |
| `accelerate` | Apple Accelerate framework |
| `hub` | HuggingFace Hub model downloads |
| `cli` | Command-line tools |
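Features compose; for example, a GPU build that also downloads models from the Hub might look like this (a sketch — pick the features that match your hardware):

```toml
[dependencies]
qwen3-tts = { version = "0.1", features = ["cuda", "flash-attn", "hub"] }
```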
```rust
use qwen3_tts::{Qwen3TTS, Speaker, Language, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/customvoice_model", device)?;
    let audio = model.synthesize_with_voice(
        "Hello, world!",
        Speaker::Ryan,
        Language::English,
        None,
    )?;
    audio.save("output.wav")?;
    Ok(())
}
```

Available speakers: Serena, Vivian, UncleFu, Ryan, Aiden, OnoAnna, Sohee, Eric, Dylan
```rust
use qwen3_tts::{Qwen3TTS, Language, AudioBuffer, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/base_model", device)?;

    // Load reference audio
    let ref_audio = AudioBuffer::load("reference_voice.wav")?;

    // ICL mode: full voice cloning with reference text
    let prompt = model.create_voice_clone_prompt(&ref_audio, Some("transcript of ref audio"))?;
    // x_vector_only: faster, speaker embedding only (no reference text needed)
    // let prompt = model.create_voice_clone_prompt(&ref_audio, None)?;

    let audio = model.synthesize_voice_clone(
        "Hello in the cloned voice!",
        &prompt,
        Language::English,
        None,
    )?;
    audio.save("cloned.wav")?;
    Ok(())
}
```

```rust
use qwen3_tts::{Qwen3TTS, Language, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/voicedesign_model", device)?;
    let audio = model.synthesize_voice_design(
        "Hello from a designed voice!",
        "A cheerful young female voice with high pitch and energetic tone",
        Language::English,
        None,
    )?;
    audio.save("designed.wav")?;
    Ok(())
}
```

```rust
use qwen3_tts::{Qwen3TTS, SynthesisOptions, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/model", device)?;
    let options = SynthesisOptions {
        temperature: 0.8,
        top_k: 30,
        top_p: 0.85,
        repetition_penalty: 1.05,
        ..Default::default()
    };
    let audio = model.synthesize("Custom settings!", Some(options))?;
    audio.save("output.wav")?;
    Ok(())
}
```

For low-latency applications, stream audio in chunks:

```rust
use qwen3_tts::{Qwen3TTS, Speaker, Language, SynthesisOptions, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/model", device)?;
    let options = SynthesisOptions {
        chunk_frames: 10, // ~800ms per chunk
        ..Default::default()
    };
    for chunk in model.synthesize_streaming(
        "Hello, world!",
        Speaker::Ryan,
        Language::English,
        options,
    )? {
        let audio = chunk?;
        // Play or stream this chunk
        println!("Got {} samples", audio.samples.len());
    }
    Ok(())
}
```

```rust
use qwen3_tts::{Qwen3TTS, ModelPaths, auto_device};

fn main() -> anyhow::Result<()> {
    let paths = ModelPaths::download(None)?;
    let device = auto_device()?;
    let model = Qwen3TTS::from_paths(&paths, device)?;
    let audio = model.synthesize("Hello from HuggingFace!", None)?;
    audio.save("output.wav")?;
    Ok(())
}
```

The TTS pipeline consists of three stages:
- `TalkerModel`: 28-layer transformer generating semantic tokens from text autoregressively. Uses MRoPE (multimodal rotary position encoding) across all variants.
- `CodePredictor`: 5-layer decoder that generates 15 acoustic tokens per semantic token. Always 1024 hidden dim; 1.7B models use a projection layer to bridge from the talker's 2048-dim space.
- `Decoder12Hz`: Converts 16-codebook tokens to 24kHz audio via ConvNeXt blocks and transposed convolution upsampling. Shared across all model variants.

```
Text --> TalkerModel --> Semantic Token --> CodePredictor --> [16 codes] --> Decoder --> Audio
                               ^                                   ^
                        (autoregressive,                      (per frame,
                         one per frame)                   15 acoustic codes)
```
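As a rough illustration of this control flow, here is a minimal, self-contained Rust sketch. Every type, method name, and stub body below is an illustrative stand-in, not the library's internal API:

```rust
// Illustrative sketch of the three-stage pipeline; all names are stand-ins.
const EOS: u32 = 0; // placeholder end-of-speech token id

struct Talker;
struct CodePredictor;
struct Decoder12Hz;

impl Talker {
    // Stage 1: autoregressively emit one semantic token per frame.
    fn next_semantic_token(&mut self, _text: &[u32], frames: &[[u32; 16]]) -> u32 {
        if frames.len() >= 3 { EOS } else { 1 } // stub: stop after 3 frames
    }
}

impl CodePredictor {
    // Stage 2: expand each semantic token into 15 acoustic tokens (16 codes total).
    fn predict_codes(&self, semantic: u32) -> [u32; 16] {
        let mut codes = [0u32; 16];
        codes[0] = semantic;
        codes
    }
}

impl Decoder12Hz {
    // Stage 3: decode 16-codebook frames into 24 kHz mono samples.
    // At 12 frames/s and 24 kHz output, one frame covers 2000 samples.
    fn decode_frames(&self, frames: &[[u32; 16]]) -> Vec<f32> {
        vec![0.0; frames.len() * 2000]
    }
}

fn generate(
    talker: &mut Talker,
    predictor: &CodePredictor,
    decoder: &Decoder12Hz,
    text: &[u32],
    max_frames: usize,
) -> Vec<f32> {
    let mut frames: Vec<[u32; 16]> = Vec::new();
    for _ in 0..max_frames {
        let semantic = talker.next_semantic_token(text, &frames);
        if semantic == EOS {
            break; // the per-frame EOS check is the stopping condition
        }
        frames.push(predictor.predict_codes(semantic));
    }
    decoder.decode_frames(&frames)
}

fn main() {
    let audio = generate(&mut Talker, &CodePredictor, &Decoder12Hz, &[1, 2, 3], 2048);
    println!("decoded {} samples", audio.len());
}
```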
The model variant is auto-detected from `config.json`. The CLI warns if your flags don't match the model type.
```bash
# CustomVoice: preset speaker
cargo run --release --features cli --bin generate_audio -- \
    --model-dir path/to/customvoice \
    --text "Hello world" \
    --speaker ryan \
    --language english

# Base: voice cloning (ICL — best quality, requires reference text)
cargo run --release --features cli --bin generate_audio -- \
    --model-dir path/to/base \
    --text "Hello world" \
    --ref-audio reference.wav \
    --ref-text "transcript of the reference audio"

# Base: voice cloning (x_vector_only — faster, no transcript needed)
cargo run --release --features cli --bin generate_audio -- \
    --model-dir path/to/base \
    --text "Hello world" \
    --ref-audio reference.wav \
    --x-vector-only

# VoiceDesign: describe the voice you want
cargo run --release --features cli --bin generate_audio -- \
    --model-dir path/to/voicedesign \
    --text "Hello world" \
    --instruct "A cheerful young female voice with high pitch and energetic tone" \
    --language english

# Reproducible generation with fixed seed
cargo run --release --features cli --bin generate_audio -- \
    --model-dir path/to/model \
    --text "Hello" \
    --seed 42
```

| Flag | Default | Description |
|---|---|---|
| `--model-dir` | `test_data/model` | Path to model directory |
| `--text` | `"Hello"` | Text to synthesize |
| `--speaker` | `ryan` | Preset speaker (CustomVoice only) |
| `--language` | `english` | Target language |
| `--instruct` | | Voice description for VoiceDesign models |
| `--ref-audio` | | Reference audio WAV for voice cloning (Base only) |
| `--ref-text` | | Reference transcript for ICL voice cloning (requires `--ref-audio`) |
| `--x-vector-only` | | Speaker embedding only, no ICL (use with `--ref-audio`) |
| `--output` | | Output WAV file path (overrides default naming) |
| `--device` | `auto` | Device: `auto`, `cpu`, `cuda`, `cuda:N`, `metal` |
| `--duration` | | Max duration in seconds (overrides `--frames`) |
| `--frames` | `2048` | Max frames (~164 s); generation stops at EOS |
| `--temperature` | `0.7` | Sampling temperature |
| `--top-k` | `50` | Top-k sampling |
| `--top-p` | `0.9` | Nucleus sampling threshold |
| `--repetition-penalty` | `1.05` | Repetition penalty |
| `--seed` | `42` | Random seed for reproducibility |
On CUDA devices, the talker and code predictor automatically run in bf16 for lower memory usage and faster inference. The codec decoder and speaker encoder remain in f32 (convolutional, no attention).
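As a sketch of that policy using candle's `Device` and `DType` types (the helper itself is illustrative, not part of this library's API; per the precision table below, Metal follows the same split):

```rust
use candle_core::{DType, Device};

// Illustrative helper, not the library's API: transformer stacks (talker,
// code predictor) run in BF16 on GPU; the convolutional codec decoder and
// speaker encoder stay in F32 everywhere.
fn component_dtype(device: &Device, is_transformer: bool) -> DType {
    if is_transformer && (device.is_cuda() || device.is_metal()) {
        DType::BF16
    } else {
        DType::F32
    }
}
```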
For best performance, build with Flash Attention 2 (requires CUDA toolkit in PATH):
```bash
cargo build --release --features flash-attn,cli
```

For GPU builds, use the build script which auto-detects your GPU architecture:
```bash
# Build GPU image (auto-detects compute capability)
./build-docker.sh qwen3-tts flash-attn,cli

# Build CPU-only image
./build-docker.sh qwen3-tts-cpu cli
```

The script builds inside a running container with GPU access, ensuring correct PTX compilation for your exact hardware (Ampere, Ada, Hopper, Blackwell, etc.).
Run inference:
```bash
docker run --gpus all \
    -v /path/to/models:/models \
    -v /path/to/output:/output \
    qwen3-tts \
    --model-dir /models/0.6b-customvoice \
    --speaker ryan \
    --text "Hello world, this is a test." \
    --device cuda \
    --output /output/hello.wav
```

For CPU-only builds without GPU access, use the Dockerfile directly:
```bash
docker build --build-arg FEATURES=cli --build-arg BASE=ubuntu:22.04 -t qwen3-tts-cpu .
```

| Component | CPU | CUDA/Metal | CUDA + flash-attn |
|---|---|---|---|
| Talker (transformer) | F32 | BF16 | BF16 |
| Code Predictor | F32 | BF16 | BF16 |
| Codec Decoder | F32 | F32 | F32 |
| Speaker Encoder | F32 | F32 | F32 |
All models share the same speech tokenizer and text tokenizer.
| Component | HuggingFace Repo | Size |
|---|---|---|
| 0.6B Base | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | 1.8 GB |
| 0.6B CustomVoice | `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | 1.8 GB |
| 1.7B Base | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | 3.9 GB |
| 1.7B CustomVoice | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | 3.9 GB |
| 1.7B VoiceDesign | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | 3.8 GB |
| Speech Tokenizer | `Qwen/Qwen3-TTS-Tokenizer-12Hz` | 682 MB |
| Text Tokenizer | `Qwen/Qwen2-0.5B` | 7 MB |
English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Output audio is always 24kHz mono. Use `audio::resample()` for other rates:

```rust
use qwen3_tts::audio;

let audio_48k = audio::resample(&audio, 48000)?;
```

MIT License. See the main Qwen3-TTS repository for model license information.





