A self-hosted web interface for queued transcription jobs with two modes:
- Local vLLM model: Voxtral Mini 3B 2507
- External Mistral API model: `voxtral-mini-2602`
## Features

- File upload transcription (audio or video)
- Queue-based job processing with shareable job URLs (`/jobs/<uuid>`)
- Automatic normalization to 16 kHz mono WAV via FFmpeg
- Language selector with autodetection (default)
- Conditional provider tabs (the second tab appears only when `MISTRAL_API_KEY` is set)
- Mistral API options: output formats, word-level timestamps, diarization, and context bias
- Copy result and download as `.txt` (plus `.srt`/`.vtt` for external jobs)
- CPU and NVIDIA GPU Docker setup
## Quick start

CPU (default):

```shell
docker compose up --build
```

GPU requirements:

- NVIDIA driver >= 525
- NVIDIA Container Toolkit installed

```shell
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build
```

## Environment variables

| Variable | Default | Description |
|---|---|---|
| `VLLM_DEVICE` | `cpu` | Build target for the vLLM image (`cpu` or `gpu`) |
| `HF_TOKEN` | (none) | Optional HuggingFace token |
| `MISTRAL_API_KEY` | (none) | Enables the external Mistral API transcription tab |
| `VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` | `200` | Maximum accepted audio file size in MB for local vLLM transcription |
| `MAX_QUEUED_JOBS` | `10` | Maximum number of queued jobs before new submissions are rejected (HTTP 429) |
| `JOB_TTL_SECONDS` | `3600` | Time in seconds before completed/failed jobs are cleaned up |
| `LOCAL_TRANSCRIBE_TIMEOUT_SECONDS` | `600` | Timeout for a single local vLLM transcription request (increase for long audio or slow hardware) |
| `LOCAL_CHUNK_SECONDS` | `480` | Duration in seconds of each audio chunk for local transcription (default 8 minutes) |
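To see how `LOCAL_CHUNK_SECONDS` relates to `LOCAL_TRANSCRIBE_TIMEOUT_SECONDS` for long recordings, here is a rough sizing sketch; the ceiling-division chunk count is an assumption about how the splitter behaves, not a description of the actual implementation:

```python
import math

LOCAL_CHUNK_SECONDS = 480   # default: 8-minute chunks
duration_seconds = 30 * 60  # e.g. a 30-minute recording

# Assumed chunking model: simple ceiling division of the total duration.
chunks = math.ceil(duration_seconds / LOCAL_CHUNK_SECONDS)
print(chunks)  # 4
```

Each chunk is transcribed as its own request, so the per-request timeout only needs to cover one chunk, not the whole file.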
Set variables in `.env` next to `docker-compose.yml`.
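A minimal `.env` might look like this (values are illustrative; every variable is optional and falls back to the defaults above):

```env
VLLM_DEVICE=gpu
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
MISTRAL_API_KEY=your-mistral-api-key
VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=500
LOCAL_TRANSCRIBE_TIMEOUT_SECONDS=1800
```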
## API

| Method | Path | Description |
|---|---|---|
| GET | `/` | Web UI |
| GET | `/jobs/{job_id}` | Job status and result page |
| GET | `/health` | App and vLLM reachability status |
| POST | `/api/jobs` | Create a transcription job (`provider` is `local` or `mistral`) |
| GET | `/api/jobs/{job_id}` | Get job status, progress, and result |
Create a local transcription job:

```shell
curl -X POST http://localhost:8080/api/jobs \
  -F "file=@recording.mp3" \
  -F "language=auto" \
  -F "provider=local"
```
Create an external (Mistral API) transcription job:

```shell
curl -X POST http://localhost:8080/api/jobs \
  -F "file=@recording.mp3" \
  -F "language=en" \
  -F "provider=mistral" \
  -F "word_timestamps=true" \
  -F "diarize=false" \
  -F "context_bias_enabled=true" \
  -F "context_bias=Chicago,Joplin,Boston,American_spirit" \
  -F "want_srt=true" \
  -F "want_vtt=true"
```

## Notes

- The local model can be used for confidential transcriptions.
- The external model sends audio to the Mistral API, so use it only for non-confidential, public information.
- In Mistral mode, word-level timestamps and diarization are mutually exclusive.
- Context bias accepts comma-separated terms (max 100). Allowed characters are letters, numbers, `_` and `-`; spaces are converted to `_`.
- If local transcription fails with `Maximum file size exceeded`, increase `VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` and restart the containers.
- If local jobs fail on long files, increase `LOCAL_TRANSCRIBE_TIMEOUT_SECONDS` (e.g. `1800` for 30+ minute files) and restart the containers.
- Completed and failed jobs are automatically cleaned up after `JOB_TTL_SECONDS` (default 1 hour).
- The job queue is capped at `MAX_QUEUED_JOBS` (default 10); requests beyond that limit receive HTTP 429.
- Only audio and video files are accepted: MP3, WAV, OGG, FLAC, M4A, AAC, OPUS, MP4, MKV, MOV, AVI, WEBM.
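The context-bias rules above can be mirrored client-side before submitting a job. This is a sketch, not the server's actual implementation; the function name and the exact handling of disallowed characters (stripped here) are assumptions:

```python
import re

def sanitize_context_bias(terms, max_terms=100):
    """Normalize terms to the documented constraints: spaces become
    underscores, only letters, digits, '_' and '-' are kept, and at
    most max_terms terms survive."""
    cleaned = []
    for term in terms:
        term = term.strip().replace(" ", "_")
        term = re.sub(r"[^A-Za-z0-9_-]", "", term)  # drop disallowed characters
        if term:
            cleaned.append(term)
    return cleaned[:max_terms]

# Produces the context_bias value used in the curl example above.
print(",".join(sanitize_context_bias(["Chicago", "Joplin", "Boston", "American spirit"])))
# Chicago,Joplin,Boston,American_spirit
```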
## Security

Voxtral UI has no built-in authentication or rate limiting. It is designed to run in trusted local environments or private networks.
Do not expose this application directly to the public internet.
For internet-facing or shared deployments, place Voxtral UI behind a reverse proxy with authentication enabled. Example using Nginx with HTTP basic auth:
```nginx
server {
    listen 443 ssl;
    server_name voxtral.example.com;

    # Placeholder paths: point these at your own TLS certificate and key.
    ssl_certificate     /etc/nginx/certs/voxtral.crt;
    ssl_certificate_key /etc/nginx/certs/voxtral.key;

    auth_basic "Voxtral UI";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        client_max_body_size 500M;
    }
}
```

Generate the password file with:
```shell
htpasswd -c /etc/nginx/.htpasswd youruser
```

This project uses the Voxtral Mini 3B 2507 model and the Mistral API for speech-to-text transcription. Voxtral UI is an independent, community-built interface and is not developed, maintained, endorsed by, or affiliated with Mistral AI in any way. All trademarks belong to their respective owners.
## License

This project is licensed under the MIT License.