Vocalino is the first voice acting pipeline built from open-weights components and open post-training data that combines zero-shot voice cloning with natural-language performance direction. You provide a reference voice (or generate one from scratch) and use free-form text instructions to direct how the line is performed. The generated speech stays consistent with the voice in your reference audio while following your emotional and stylistic prompts, giving you control over both the actor and the performance without any model training.
Vocalino v0.1 was built to address a limitation of existing open-weights and open-source audio models: standard Text-to-Speech (TTS) can generate emotions, but usually with random voices, while standard Voice Conversion (VC) can clone a specific person but requires pre-acted source audio.
To our knowledge, Vocalino is the first open pipeline (using open-weights models and open post-training data) that decouples vocal identity from performance style. By chaining advanced stylistic generation with high-fidelity voice conversion, Vocalino lets you "cast" an actor (via a reference clip) and "direct" them (via text prompts like "whisper with trembling fear" or "shout with overwhelming joy"). The result is a unified audio file that sounds like your target speaker performing exactly the way you instructed.
                 ┌────────────────────┐
Text + Style ──> │  Qwen3-TTS 1.7B    │ ──> Raw TTS audio
                 │  (VoiceDesign)     │     (12 Hz codec tokens → wav)
                 └────────────────────┘
                           │
                           ▼
                 ┌────────────────────┐
Reference WAV ─> │  Seed-VC V2        │ ──> Voice-converted audio
                 │  (CFM + AR)        │     (matches reference timbre)
                 └────────────────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │  ECAPA-TDNN        │ ──> 2048-dim embedding
                 │  (Speaker Encoder) │     → cosine similarity vs ref
                 └────────────────────┘
Stage 1 — Qwen3-TTS Voice Design (by Alibaba, 1.7B parameters)
A large language model that generates speech from text + a natural-language style instruction. You describe the desired emotion, pace, energy, and pitch in plain English (e.g. "speak with trembling fear, whispering, medium-pitched male voice"), and the model generates audio that matches. The voice identity is random — only the style/emotion is controlled.
Stage 2 — Seed-VC V2 (by Plachtaa / ByteDance)
A voice conversion model combining Conditional Flow Matching (CFM) diffusion with an autoregressive (AR) style transfer model. Given any source audio and a short reference clip of the target speaker:
- Content Extraction: The source audio's phonetic content and prosody (rhythm, intonation, emotion) are preserved while speaker identity is removed.
- Speaker Embedding: The reference clip is processed to extract a speaker embedding capturing timbre, formant structure, and vocal characteristics.
- CFM Diffusion Decoder: A flow-matching generative model synthesizes a new waveform that matches the source content/prosody while sounding like the target speaker — all from just a few seconds of reference audio.
Stage 3 — ECAPA-TDNN Ranking (from Qwen3-TTS-Base)
When generating K candidates, each voice-converted output is scored against the reference audio using cosine similarity of 2048-dimensional speaker embeddings. Candidates are ranked so you always get the most voice-consistent result.
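The ranking step can be illustrated with a small, dependency-free sketch. It assumes the speaker embeddings are already available as plain lists of floats; the names `cosine_similarity` and `rank_candidates` are illustrative and are not taken from `server.py`.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as "no similarity"
    return dot / (norm_a * norm_b)


def rank_candidates(
    reference: Sequence[float],
    candidates: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (candidate_index, similarity) pairs, best match first."""
    scored = [(i, cosine_similarity(reference, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors stand in for the real 2048-dimensional embeddings.
reference = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_candidates(reference, candidates)
best_index = ranking[0][0]
```

Cosine similarity ignores vector magnitude, so two clips of the same speaker at different loudness can still score close to 1.0 as long as the encoder maps them to similar directions.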
- Web UI: dark-themed browser interface served at `/ui` for interactive voice design and pipeline generation
- Batched TTS: generate K candidates in a single forward pass instead of K sequential calls (~2x faster)
- SSE Streaming: candidates stream to the UI as they complete, no waiting for all K
- Speaker Similarity Ranking: ECAPA-TDNN embeddings rank candidates by voice consistency
- INT8 Quantization: optional bitsandbytes INT8 reduces TTS VRAM from ~15 GB to ~7 GB
- Multi-GPU: split TTS and VC across GPUs for VRAM isolation and concurrency
Vocalino-V0.1-Voice-Acting-Pipeline/
├── README.md # This file
├── server.py # FastAPI server (all endpoints + optimizations)
├── ui/
│ └── index.html # Web UI (served at /ui)
├── setup.py # Install dependencies + download model weights
├── generate_samples.py # Standalone: TTS+VC for multiple emotions
├── convert_voice.py # Standalone: voice conversion only
├── OPTIMIZATION_PLAN.txt # Detailed optimization roadmap
├── seed_vc_repo/ # Seed-VC V2 repository (cloned separately)
├── models/ # (auto-created) local model cache
└── output/ # Generated at runtime
- NVIDIA GPU with >= 24 GB VRAM (RTX 3090, A5000, etc.)
- With INT8 quantization: >= 16 GB VRAM is sufficient
- Two GPUs recommended for multi-GPU mode (reduces VRAM contention)
| Package | Version | Purpose |
|---|---|---|
| Python | >= 3.10 | |
| PyTorch | >= 2.6 | with CUDA 12.4 |
| transformers | >= 4.57 | HuggingFace model loading |
| qwen-tts | >= 0.1.1 | Qwen3-TTS model + tokenizer |
| fastapi | >= 0.129 | HTTP server |
| uvicorn | >= 0.40 | ASGI server |
| soundfile | >= 0.12 | WAV I/O |
| librosa | >= 0.10 | Audio resampling |
| safetensors | >= 0.5 | Weight loading |
| huggingface_hub | >= 0.36 | Model downloads |
| hydra-core | >= 1.3 | Seed-VC config loading |
| omegaconf | >= 2.3 | Seed-VC config parsing |
| pydub | >= 0.25 | MP3/format handling |
| bitsandbytes | >= 0.45 | INT8 quantization (optional) |
| pydantic | >= 2.11 | Request validation |
| numpy, scipy | | Numerical ops |
| Model | Size | Source |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | ~3.4 GB (bf16) | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign |
| Seed-VC V2 (CFM + AR) | ~800 MB | seed_vc_repo/ checkpoints |
| ECAPA-TDNN speaker encoder | ~20 MB | Extracted from Qwen/Qwen3-TTS-12Hz-1.7B-Base |
# Clone this repository
git clone https://github.com/LAION-AI/Vocalino-V0.1-Voice-Acting-Pipeline.git
cd Vocalino-V0.1-Voice-Acting-Pipeline
# Run the setup script (installs dependencies + downloads model weights)
python setup.py
# Clone Seed-VC V2 into the expected directory
git clone https://github.com/Plachtaa/seed-vc.git seed_vc_repo
# (Optional) Install flash-attn for faster TTS attention
pip install flash-attn --no-build-isolation

# Basic launch (single GPU, bfloat16)
python server.py
# With INT8 quantization (halves TTS VRAM)
TTS_QUANTIZE=int8 python server.py
# Multi-GPU (TTS on GPU 0, VC on GPU 1)
CUDA_VISIBLE_DEVICES=0,1 VC_DEVICE=cuda:1 python server.py
# Multi-GPU + INT8 quantization
CUDA_VISIBLE_DEVICES=0,1 TTS_QUANTIZE=int8 VC_DEVICE=cuda:1 python server.py

The server starts on http://0.0.0.0:8000. Open the web UI at http://<server-ip>:8000/ui/.
All settings are configured via environment variables:
| Variable | Default | Description |
|---|---|---|
| `TTS_DEVICE` | `cuda:0` | GPU device for Qwen3-TTS model |
| `VC_DEVICE` | (same as TTS) | GPU device for Seed-VC model |
| `SE_DEVICE` | (same as TTS) | GPU device for speaker encoder |
| `TTS_QUANTIZE` | `none` | `none` = bfloat16, `int8` = bitsandbytes INT8 |
| `DEFAULT_DIFF_STEPS` | `12` | Default VC diffusion steps (lower = faster) |
| `EMB_CACHE_SIZE` | `32` | Max speaker embedding cache entries |
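As a rough illustration of how these variables and their defaults fit together, the sketch below reads them into a small config object. The `ServerConfig` class and `load_config` function are hypothetical and only mirror the table; `server.py` may parse its environment differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerConfig:
    tts_device: str
    vc_device: str
    se_device: str
    tts_quantize: str
    default_diff_steps: int
    emb_cache_size: int


def load_config(env: Optional[Mapping[str, str]] = None) -> ServerConfig:
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    tts_device = env.get("TTS_DEVICE", "cuda:0")
    quantize = env.get("TTS_QUANTIZE", "none")
    if quantize not in ("none", "int8"):
        raise ValueError(f"TTS_QUANTIZE must be 'none' or 'int8', got {quantize!r}")
    return ServerConfig(
        tts_device=tts_device,
        # VC and the speaker encoder fall back to the TTS device when unset.
        vc_device=env.get("VC_DEVICE", tts_device),
        se_device=env.get("SE_DEVICE", tts_device),
        tts_quantize=quantize,
        default_diff_steps=int(env.get("DEFAULT_DIFF_STEPS", "12")),
        emb_cache_size=int(env.get("EMB_CACHE_SIZE", "32")),
    )


# Equivalent to launching with: TTS_QUANTIZE=int8 VC_DEVICE=cuda:1
config = load_config({"VC_DEVICE": "cuda:1", "TTS_QUANTIZE": "int8"})
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise without touching the real process environment.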
| Mode | VRAM | Speed | Quality |
|---|---|---|---|
| `none` (bfloat16) | ~15 GB | Baseline | Best |
| `int8` | ~7 GB | Similar or slightly slower | Near-identical for TTS |
INT8 quantization uses bitsandbytes to quantize the transformer's linear layers to 8-bit integers. The codec token decoder and speech tokenizer remain in full precision. Quality impact is minimal for speech synthesis since the output is discrete codec tokens.
When two GPUs are available, splitting TTS and VC across devices:
- Eliminates VRAM contention between the two largest models
- Allows the server to handle concurrent requests more efficiently
- Frees VRAM on each GPU for other processes
# Example: RTX 3090 x2
CUDA_VISIBLE_DEVICES=0,1 VC_DEVICE=cuda:1 python server.py
The web UI is served at `/ui` and provides two sections: voice design (the first three items below) and pipeline generation (the remaining five).
- Enter text and a natural-language voice/style description
- Generate N samples (batched for speed)
- Listen, download, or select any sample as reference for the pipeline
- Upload or select a reference audio (target speaker identity)
- Enter text and emotion/style instruction
- Generate K candidates — each streamed to the UI as it completes
- Candidates ranked by speaker embedding similarity (green = best match)
- Download final audio or intermediate TTS (before voice conversion)
Generate speech using Qwen3-TTS Voice Design.
{
"text": "Hello, how are you today?",
"style_prompt": "A warm female voice, speaking calmly",
"language": "English"
}

Response: { "status": "success", "sample_rate": 12000, "audio_base64": "..." }
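A client for this request/response pair might look like the sketch below, using only the standard library. It assumes the endpoint accepts a JSON `POST` and that `audio_base64` decodes to the bytes of an audio file; the helper names are illustrative, and the endpoint URL is left to the caller because the path is not given here.

```python
import base64
import json
import urllib.request
from typing import Tuple


def build_tts_payload(text: str, style_prompt: str, language: str = "English") -> bytes:
    """Serialize the request body shown above as UTF-8 JSON."""
    body = {"text": text, "style_prompt": style_prompt, "language": language}
    return json.dumps(body).encode("utf-8")


def decode_tts_response(raw: bytes) -> Tuple[int, bytes]:
    """Return (sample_rate, audio_bytes) from a JSON response body."""
    reply = json.loads(raw)
    if reply.get("status") != "success":
        raise RuntimeError(f"TTS request failed: {reply}")
    return reply["sample_rate"], base64.b64decode(reply["audio_base64"])


def request_tts(url: str, text: str, style_prompt: str) -> Tuple[int, bytes]:
    """POST the payload to `url`; the caller supplies the full endpoint URL."""
    req = urllib.request.Request(
        url,
        data=build_tts_payload(text, style_prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return decode_tts_response(resp.read())


# Offline check with a fabricated response body (placeholder bytes, not real audio).
fake_reply = json.dumps({
    "status": "success",
    "sample_rate": 12000,
    "audio_base64": base64.b64encode(b"RIFF").decode("ascii"),
}).encode("utf-8")
sample_rate, audio = decode_tts_response(fake_reply)
```

Keeping payload construction and response decoding separate from the network call makes both halves testable without a running server.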
Generate N voice design samples in a single batched forward pass. Each output differs due to stochastic sampling.
{
"text": "Hello, how are you today?",
"style_prompt": "A warm female voice, speaking calmly",
"language": "English",
"n_samples": 3
}

Response: { "status": "success", "samples": [...], "batch_time": 65.2 }
Voice conversion with Seed-VC V2.
{
"source_audio_base64": "<base64 WAV>",
"target_audio_base64": "<base64 WAV>",
"diffusion_steps": 12
}

Combined: generate styled speech then convert to target voice.
{
"text": "Hello!",
"style_prompt": "Excited, high energy",
"target_audio_base64": "<base64 reference WAV>",
"language": "English",
"diffusion_steps": 12,
"return_intermediate": true
}

Set `return_intermediate: true` to also receive the raw TTS output (before voice conversion) in `intermediate_tts_audio_base64`.
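Building this request from a reference clip mostly comes down to base64-encoding the file bytes. The sketch below uses the field names from the example above; the helper functions are hypothetical, and the placeholder bytes stand in for a real WAV file that would normally be read from disk.

```python
import base64
import json
from typing import Optional


def build_pipeline_payload(text: str, style_prompt: str, reference_wav: bytes,
                           diffusion_steps: int = 12,
                           return_intermediate: bool = False) -> dict:
    """Assemble the JSON body shown above from raw reference-clip bytes."""
    return {
        "text": text,
        "style_prompt": style_prompt,
        "target_audio_base64": base64.b64encode(reference_wav).decode("ascii"),
        "language": "English",
        "diffusion_steps": diffusion_steps,
        "return_intermediate": return_intermediate,
    }


def intermediate_audio(reply: dict) -> Optional[bytes]:
    """Decode the raw TTS audio if the server included it, else return None."""
    encoded = reply.get("intermediate_tts_audio_base64")
    return base64.b64decode(encoded) if encoded else None


reference_wav = b"RIFF\x00\x00\x00\x00WAVE"  # placeholder bytes, not a playable clip
payload = build_pipeline_payload(
    "Hello!", "Excited, high energy", reference_wav, return_intermediate=True
)
body = json.dumps(payload)  # ready to send as the request body
```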
Generate K candidates, rank by ECAPA-TDNN speaker similarity. Returns all candidates sorted by similarity after the last one completes.
{
"text": "Hello!",
"style_prompt": "Excited, high energy",
"reference_audio_base64": "<base64 reference WAV>",
"language": "English",
"k_candidates": 3,
"diffusion_steps": 12
}

Same as /pipeline/ranked but streams results via Server-Sent Events. Each candidate is sent as it completes. The UI uses this endpoint.

SSE event types:
- `event: progress`: TTS batch completed, VC phase starting
- `event: candidate`: one completed candidate (audio + similarity)
- `event: done`: final summary with `best_id`
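A client can consume this stream with a small line-based parser. The sketch below follows the general Server-Sent Events framing (`event:` and `data:` fields, blank line between events) and assumes each `data:` payload is JSON. Apart from `best_id`, the field names in the sample stream are invented for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, Dict]]:
    """Yield (event_name, decoded_json_data) for each event in an SSE stream."""
    event_name = "message"  # SSE default when no "event:" field is given
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates the current event
            if data_lines:
                yield event_name, json.loads("\n".join(data_lines))
            event_name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip(" "))


# Fabricated stream; "phase", "id" and "similarity" are hypothetical field names.
sample_stream = [
    "event: progress\n",
    'data: {"phase": "vc"}\n',
    "\n",
    "event: candidate\n",
    'data: {"id": 0, "similarity": 0.91}\n',
    "\n",
    "event: done\n",
    'data: {"best_id": 0}\n',
    "\n",
]
events = list(parse_sse(sample_stream))
```

Because the parser is a generator, it can be fed the lines of a live HTTP response and will yield each candidate as soon as its terminating blank line arrives.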
Returns model status and server configuration.
{
"qwen_loaded": true,
"seed_vc_loaded": true,
"speaker_encoder_loaded": true,
"tts_device": "cuda:0",
"vc_device": "cuda:1",
"tts_quantize": "int8",
"emb_cache_used": 3
}

See OPTIMIZATION_PLAN.txt for the full roadmap. Summary of implemented tiers:
| Tier | Optimization | Status | Impact |
|---|---|---|---|
| 1 | VC diffusion steps 25 → 12 | Done | ~50% faster VC |
| 1 | Speaker embedding LRU cache | Done | Skip repeated ECAPA-TDNN passes |
| 2 | SSE streaming | Done | First result visible immediately |
| 4 | TTS batching | Done | K samples in ~1.5x single-call time |
| 5a | torch.compile | Skipped | Known issues with Qwen models |
| 5b | Multi-GPU split | Done | Opt-in via VC_DEVICE env var |
| 5c | vLLM-Omni serving | Future | Requires vllm-omni package |
| 5d | INT8 quantization | Done | Opt-in, halves TTS VRAM |
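The speaker embedding LRU cache listed in the table could work roughly as follows. This is a minimal sketch, assuming entries are keyed by a hash of the reference audio bytes; the `EmbeddingCache` class is hypothetical and the real cache in `server.py` may key and store entries differently.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """Least-recently-used cache keyed by a hash of the reference audio bytes."""

    def __init__(self, max_entries: int = 32) -> None:
        self.max_entries = max_entries
        self._store: "OrderedDict[str, List[float]]" = OrderedDict()

    def get_or_compute(self, audio: bytes,
                       encoder: Callable[[bytes], List[float]]) -> List[float]:
        key = hashlib.sha256(audio).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        embedding = encoder(audio)  # the expensive speaker-encoder pass
        self._store[key] = embedding
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return embedding

    def __len__(self) -> int:
        return len(self._store)


calls: List[bytes] = []


def fake_encoder(audio: bytes) -> List[float]:
    """Stand-in for the ECAPA-TDNN pass; records each invocation."""
    calls.append(audio)
    return [float(len(audio))]


cache = EmbeddingCache(max_entries=2)
cache.get_or_compute(b"ref-a", fake_encoder)
cache.get_or_compute(b"ref-a", fake_encoder)  # cache hit, encoder not called again
cache.get_or_compute(b"ref-b", fake_encoder)
cache.get_or_compute(b"ref-c", fake_encoder)  # evicts "ref-a"
```

When the same reference clip is reused across many requests, only the first request pays for the encoder pass.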
| Configuration | TTS Phase | VC Phase | Total |
|---|---|---|---|
| Sequential (baseline) | ~135s | ~24s | ~160s |
| Batched TTS | ~65s | ~24s | ~90s |
| Batched TTS + INT8 | ~65s* | ~24s | ~90s* |
*INT8 timing depends on hardware; may be slightly faster or slower than bf16.
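The speedups implied by the table can be worked out directly. The snippet below only restates the table's approximate figures, so the results are rough.

```python
# Timings in seconds, copied from the table above.
timings = {
    "sequential": {"tts": 135, "vc": 24, "total": 160},
    "batched": {"tts": 65, "vc": 24, "total": 90},
}

tts_speedup = timings["sequential"]["tts"] / timings["batched"]["tts"]
total_speedup = timings["sequential"]["total"] / timings["batched"]["total"]
vc_share_batched = timings["batched"]["vc"] / timings["batched"]["total"]

print(f"TTS speedup:   {tts_speedup:.2f}x")              # 2.08x
print(f"Total speedup: {total_speedup:.2f}x")            # 1.78x
print(f"VC share of batched total: {vc_share_batched:.0%}")  # 27%
```

Batching roughly halves the TTS phase, but because the VC phase is unchanged the end-to-end gain is smaller, and VC accounts for about a quarter of the batched total.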
vLLM-Omni provides official Qwen3-TTS support with continuous batching, PagedAttention, and CUDA graph acceleration. This is the recommended production path for high-throughput TTS serving.
pip install vllm-omni
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
--stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
--omni --port 8091 --trust-remote-code --enforce-eager

Note: `--enforce-eager` is required because torch.compile has known compatibility issues with Qwen-family models.
The Seed-VC CFM (diffusion) model natively supports batched inputs, but the wrapper API and AR generation loop are single-sample only. Refactoring these could enable batched voice conversion for additional throughput gains.
A missing flash-attn installation is non-fatal: the model falls back to PyTorch SDPA attention. To install it (requires a compatible glibc):
pip install flash-attn --no-build-isolation

Check logs:
tail -f server.log

Model loading takes 1-2 minutes. Wait for "Models loaded. Server ready."
- Use INT8 quantization: `TTS_QUANTIZE=int8`
- Reduce batch size (K candidates) in the UI
- Use multi-GPU: `CUDA_VISIBLE_DEVICES=0,1 VC_DEVICE=cuda:1`

- Increase `diffusion_steps` (e.g., 20-25) for higher quality at the cost of speed
- Adjust `similarity_cfg_rate` (0.5-0.9) to balance intelligibility vs similarity
- Ensure reference audio is clean, 5-30 seconds, single speaker
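The duration part of the last recommendation is easy to check before uploading a clip. The sketch below uses Python's standard `wave` module, which reads PCM WAV files; the helper names are illustrative. It cannot judge whether the audio is clean or contains a single speaker.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Length of a PCM WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(wav_bytes: bytes, min_s: float = 5.0, max_s: float = 30.0) -> bool:
    """True when the clip length falls inside the recommended 5-30 second window."""
    return min_s <= wav_duration_seconds(wav_bytes) <= max_s


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Create a mono 16-bit silent WAV in memory (test data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


ok_clip = make_silent_wav(6.0)
short_clip = make_silent_wav(1.0)
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes to `check_reference` before base64-encoding them for a request.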