A self-hosted web interface for queued transcription jobs with two modes:
- Local vLLM model: Voxtral Mini 3B 2507
- External Mistral API model: `voxtral-mini-2602`
## Features

- File upload transcription (audio or video)
- Queue-based job processing with shareable job URLs (`/jobs/<uuid>`)
- Automatic normalization to 16 kHz mono WAV via FFmpeg
- Language selector with autodetection (default)
- Conditional provider tabs (the second tab appears only when `MISTRAL_API_KEY` is set)
- Mistral API options: output formats, word-level timestamps, diarization, and context bias
- Copy result and download as `.txt` (plus `.srt`/`.vtt` for external jobs)
- CPU and NVIDIA GPU Docker setup
## Quick start

CPU (default):

```shell
docker compose up --build
```

GPU requirements:

- NVIDIA driver >= 525
- NVIDIA Container Toolkit installed

```shell
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build
```

## Environment variables

| Variable | Default | Description |
|---|---|---|
| `VLLM_DEVICE` | `cpu` | Build target for the vLLM image (`cpu` or `gpu`) |
| `HF_TOKEN` | (none) | Optional HuggingFace token |
| `MISTRAL_API_KEY` | (none) | Enables the external Mistral API transcription tab |
| `VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` | `200` | Maximum accepted audio file size in MB for local vLLM transcription |
| `MAX_QUEUED_JOBS` | `10` | Maximum number of queued jobs before new submissions are rejected (HTTP 429) |
| `JOB_TTL_SECONDS` | `3600` | Time in seconds before completed/failed jobs are cleaned up |
| `LOCAL_TRANSCRIBE_TIMEOUT_SECONDS` | `600` | Timeout for a single local vLLM transcription request (increase for long audio or slow hardware) |
| `LOCAL_CHUNK_SECONDS` | `480` | Duration in seconds of each audio chunk for local transcription (default 8 minutes) |
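To see how `LOCAL_CHUNK_SECONDS` relates to `LOCAL_TRANSCRIBE_TIMEOUT_SECONDS` for long recordings, here is a rough sizing sketch; the ceiling-division chunk count is an assumption about how the splitter behaves, not a description of the actual implementation:

```python
import math

LOCAL_CHUNK_SECONDS = 480   # default: 8-minute chunks
duration_seconds = 30 * 60  # e.g. a 30-minute recording

# Assumed chunking model: simple ceiling division of the total duration.
chunks = math.ceil(duration_seconds / LOCAL_CHUNK_SECONDS)
print(chunks)  # 4
```

Each chunk is transcribed as its own request, so the per-request timeout only needs to cover one chunk, not the whole file.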
Set variables in `.env` next to `docker-compose.yml`.
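A minimal `.env` might look like this (values are illustrative; every variable is optional and falls back to the defaults above):

```env
VLLM_DEVICE=gpu
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
MISTRAL_API_KEY=your-mistral-api-key
VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=500
LOCAL_TRANSCRIBE_TIMEOUT_SECONDS=1800
```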
## API

| Method | Path | Description |
|---|---|---|
| GET | `/` | Web UI |
| GET | `/jobs/{job_id}` | Job status and result page |
| GET | `/health` | App and vLLM reachability status |
| POST | `/api/jobs` | Create a transcription job (`provider` is `local` or `mistral`) |
| GET | `/api/jobs/{job_id}` | Get job status, progress, and result |
Create a local transcription job:

```shell
curl -X POST http://localhost:8080/api/jobs \
  -F "file=@recording.mp3" \
  -F "language=auto" \
  -F "provider=local"
```
Create an external (Mistral API) transcription job:

```shell
curl -X POST http://localhost:8080/api/jobs \
  -F "file=@recording.mp3" \
  -F "language=en" \
  -F "provider=mistral" \
  -F "word_timestamps=true" \
  -F "diarize=false" \
  -F "context_bias_enabled=true" \
  -F "context_bias=Chicago,Joplin,Boston,American_spirit" \
  -F "want_srt=true" \
  -F "want_vtt=true"
```

## Notes

- The local model can be used for confidential transcriptions.
- The external model sends audio to the Mistral API, so use it only for non-confidential, public information.
- In Mistral mode, word-level timestamps and diarization are mutually exclusive.
- Context bias accepts comma-separated terms (max 100). Allowed characters are letters, numbers, `_` and `-`; spaces are converted to `_`.
- If local transcription fails with `Maximum file size exceeded`, increase `VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` and restart the containers.
- If local jobs fail on long files, increase `LOCAL_TRANSCRIBE_TIMEOUT_SECONDS` (e.g. `1800` for 30+ minute files) and restart the containers.
- Completed and failed jobs are automatically cleaned up after `JOB_TTL_SECONDS` (default 1 hour).
- The job queue is capped at `MAX_QUEUED_JOBS` (default 10); requests beyond that limit receive HTTP 429.
- Only audio and video files are accepted: MP3, WAV, OGG, FLAC, M4A, AAC, OPUS, MP4, MKV, MOV, AVI, WEBM.
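The context-bias rules above can be mirrored client-side before submitting a job. This is a sketch, not the server's actual implementation; the function name and the exact handling of disallowed characters (stripped here) are assumptions:

```python
import re

def sanitize_context_bias(terms, max_terms=100):
    """Normalize terms to the documented constraints: spaces become
    underscores, only letters, digits, '_' and '-' are kept, and at
    most max_terms terms survive."""
    cleaned = []
    for term in terms:
        term = term.strip().replace(" ", "_")
        term = re.sub(r"[^A-Za-z0-9_-]", "", term)  # drop disallowed characters
        if term:
            cleaned.append(term)
    return cleaned[:max_terms]

# Produces the context_bias value used in the curl example above.
print(",".join(sanitize_context_bias(["Chicago", "Joplin", "Boston", "American spirit"])))
# Chicago,Joplin,Boston,American_spirit
```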
## Security

Voxtral UI has no built-in authentication or rate limiting. It is designed to run in trusted local environments or private networks.
Do not expose this application directly to the public internet.
For internet-facing or shared deployments, place Voxtral UI behind a reverse proxy with authentication enabled. Example using Nginx with HTTP basic auth:
```nginx
server {
    listen 443 ssl;
    server_name voxtral.example.com;

    # Placeholder paths: point these at your own TLS certificate and key.
    ssl_certificate     /etc/nginx/certs/voxtral.crt;
    ssl_certificate_key /etc/nginx/certs/voxtral.key;

    auth_basic "Voxtral UI";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        client_max_body_size 500M;
    }
}
```

Generate the password file with:
```shell
htpasswd -c /etc/nginx/.htpasswd youruser
```

This project uses the Voxtral Mini 3B 2507 model and the Mistral API for speech-to-text transcription. Voxtral UI is an independent, community-built interface and is not developed, maintained, endorsed by, or affiliated with Mistral AI in any way. All trademarks belong to their respective owners.
## License

This project is licensed under the MIT License.