
Voxtral UI

A self-hosted web interface for queued transcription jobs with two modes: local transcription with a self-hosted vLLM server (Voxtral Mini 3B) and external transcription through the Mistral API.

Features

  • File upload transcription (audio or video)
  • Queue-based job processing with shareable job URLs (/jobs/<uuid>)
  • Automatic normalization to 16 kHz mono WAV via FFmpeg
  • Language selector with auto detection (default)
  • Conditional provider tabs (second tab appears only when MISTRAL_API_KEY is set)
  • Mistral API options: output formats, word-level timestamps, diarization, and context bias
  • Copy result and download as .txt (plus .srt/.vtt for external jobs)
  • CPU and NVIDIA GPU Docker setup

Quick Start

CPU (default)

docker compose up --build

Open http://localhost:8080

NVIDIA GPU

Requirements:

  • NVIDIA driver >= 525
  • NVIDIA Container Toolkit installed

docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build
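The GPU override file's job is to route the NVIDIA device into the vLLM container and switch the build target. A minimal sketch of what such a docker-compose.gpu.yml typically contains (the repository's actual file may differ; the service name `vllm` is an assumption):

```yaml
# Sketch of a GPU override; the real docker-compose.gpu.yml may differ.
services:
  vllm:                       # assumed service name
    build:
      args:
        VLLM_DEVICE: gpu      # matches the VLLM_DEVICE variable below
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```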

Environment Variables

| Variable | Default | Description |
|---|---|---|
| VLLM_DEVICE | cpu | Build target for the vLLM image (cpu or gpu) |
| HF_TOKEN | (none) | Optional Hugging Face token |
| MISTRAL_API_KEY | (none) | Enables the external Mistral API transcription tab |
| VLLM_MAX_AUDIO_CLIP_FILESIZE_MB | 200 | Maximum accepted audio file size in MB for local vLLM transcription |
| MAX_QUEUED_JOBS | 10 | Maximum number of queued jobs before new submissions are rejected (HTTP 429) |
| JOB_TTL_SECONDS | 3600 | Time in seconds before completed/failed jobs are cleaned up |
| LOCAL_TRANSCRIBE_TIMEOUT_SECONDS | 600 | Timeout for a single local vLLM transcription request (increase for long audio or slow hardware) |
| LOCAL_CHUNK_SECONDS | 480 | Duration in seconds of each audio chunk for local transcription (8 minutes) |

Set variables in .env next to docker-compose.yml.

API

| Method | Path | Description |
|---|---|---|
| GET | / | Web UI |
| GET | /jobs/{job_id} | Job status and result page |
| GET | /health | App and vLLM reachability status |
| POST | /api/jobs | Create a transcription job (provider=local\|mistral) |
| GET | /api/jobs/{job_id} | Get job status, progress, and result |

Example curl

curl -X POST http://localhost:8080/api/jobs \
  -F "file=@recording.mp3" \
  -F "language=auto" \
  -F "provider=local"

curl -X POST http://localhost:8080/api/jobs \
  -F "file=@recording.mp3" \
  -F "language=en" \
  -F "provider=mistral" \
  -F "word_timestamps=true" \
  -F "diarize=false" \
  -F "context_bias_enabled=true" \
  -F "context_bias=Chicago,Joplin,Boston,American_spirit" \
  -F "want_srt=true" \
  -F "want_vtt=true"
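After creating a job, a client polls GET /api/jobs/{job_id} until it reaches a terminal state. A minimal polling loop, written against an injected fetch function so the HTTP layer is swappable (the "completed"/"failed" status values are assumptions about the response shape, not documented fields):

```python
import time

def poll_job(fetch_status, interval_s=2.0, timeout_s=600.0, sleep=time.sleep):
    """Poll fetch_status() until the job reaches a terminal state.

    fetch_status is any callable returning a job dict, e.g. one that GETs
    /api/jobs/{job_id}. The 'status' values below are assumed, not documented.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish in time")
        sleep(interval_s)
```

With the requests library, fetch_status could be as simple as `lambda: requests.get(f"http://localhost:8080/api/jobs/{job_id}").json()`.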

Notes

  • The local model keeps audio on your own hardware, making it suitable for confidential transcriptions.
  • The external Mistral API sends audio to a third party, so reserve it for non-sensitive material.
  • In Mistral mode, word-level timestamps and diarization are mutually exclusive.
  • Context bias accepts comma-separated terms (max 100). Allowed characters are letters, numbers, _ and -; spaces are converted to _.
  • If local transcription fails with a "Maximum file size exceeded" error, increase VLLM_MAX_AUDIO_CLIP_FILESIZE_MB and restart the containers.
  • If local jobs fail on long files, increase LOCAL_TRANSCRIBE_TIMEOUT_SECONDS (e.g. 1800 for files over 30 minutes) and restart the containers.
  • Completed and failed jobs are automatically cleaned up after JOB_TTL_SECONDS (default 1 hour).
  • The job queue is capped at MAX_QUEUED_JOBS (default 10). Requests beyond that limit receive HTTP 429.
  • Only audio and video files are accepted: MP3, WAV, OGG, FLAC, M4A, AAC, OPUS, MP4, MKV, MOV, AVI, WEBM.
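The context-bias rules above (comma-separated terms, max 100, letters/numbers/_/- only, spaces mapped to _) can also be applied client-side before submission. A sketch of such a sanitizer (a hypothetical helper, not the app's own code):

```python
import re

def sanitize_context_bias(raw: str, max_terms: int = 100) -> str:
    """Normalize a comma-separated context-bias string per the rules above."""
    terms = []
    for term in raw.split(","):
        term = term.strip().replace(" ", "_")        # spaces become underscores
        term = re.sub(r"[^A-Za-z0-9_-]", "", term)   # drop disallowed characters
        if term:
            terms.append(term)
    return ",".join(terms[:max_terms])               # cap at max_terms entries

print(sanitize_context_bias("Chicago, American spirit, José!"))
# Chicago,American_spirit,Jos
```

Note that characters outside the allowed set (including accented letters) are simply dropped rather than transliterated.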

Security

Voxtral UI has no built-in authentication or rate limiting. It is designed to run in trusted local environments or private networks.

Do not expose this application directly to the public internet.

For internet-facing or shared deployments, place Voxtral UI behind a reverse proxy with authentication enabled. Example using Nginx with HTTP basic auth:

server {
    listen 443 ssl;
    server_name voxtral.example.com;

    auth_basic "Voxtral UI";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        client_max_body_size 500M;
    }
}

Generate the password file with:

htpasswd -c /etc/nginx/.htpasswd youruser

About Voxtral

This project uses the Voxtral Mini 3B 2507 model and the Mistral API for speech-to-text transcription. Voxtral UI is an independent, community-built interface; it is not developed, maintained, or endorsed by Mistral AI, and is not affiliated with Mistral AI in any way. All trademarks belong to their respective owners.

License

This project is licensed under the MIT License.
