Open-source Indian language speech-to-text server
WebSocket + REST speech-to-text server wrapping the ai4bharat/indic-conformer-600m-multilingual model (600M parameters). Self-hosted, zero API costs, full data sovereignty.
Built by VEXYL AI — the team behind the AI Voice Gateway, an enterprise platform that bridges telephony (PSTN, SIP, Asterisk, WebRTC) with LLMs and AI services. VEXYL-STT is the open-source STT component, extracted for standalone use and community contribution.
VEXYL-STT provides two transcription modes on a single port:
- Real-time streaming — WebSocket connection with energy-based VAD, accepts 16kHz 16-bit mono PCM audio, returns JSON transcripts in real time
- Batch transcription — REST API for async file-based transcription (WAV, MP3, FLAC, OGG, M4A). Upload a file, poll for results
- 14 Indian languages supported
- Energy-based VAD (no external VAD dependency)
- WebSocket streaming + batch REST API on the same port
- API key authentication (optional)
- Docker and Cloud Run ready
- Browser test clients included
| Code | Language | Code | Language |
|---|---|---|---|
| `ml-IN` | Malayalam | `mr-IN` | Marathi |
| `hi-IN` | Hindi | `pa-IN` | Punjabi |
| `ta-IN` | Tamil | `or-IN` | Odia |
| `te-IN` | Telugu | `as-IN` | Assamese |
| `kn-IN` | Kannada | `ur-IN` | Urdu |
| `bn-IN` | Bengali | `sa-IN` | Sanskrit |
| `gu-IN` | Gujarati | `ne-IN` | Nepali |
```bash
# 1. Run the automated setup (one command)
./setup.sh

# 2. Start the server
./run.sh

# 3. Test in browser
open test.html
```

Requirements:

- Python 3.10+
- macOS or Linux
- HuggingFace account with access approved for the gated model
- ~3 GB disk space for model weights and dependencies
The setup script handles everything: creates a virtual environment, installs dependencies, authenticates with HuggingFace, downloads the model, and generates config files.
```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip

# PyTorch (CPU-only, smaller download)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

# Other dependencies
pip install transformers websockets numpy onnxruntime soundfile
```

For GPU acceleration:

```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
```

```bash
pip install huggingface_hub
huggingface-cli login
```

You need a token from huggingface.co/settings/tokens with read access. Request access to the model at huggingface.co/ai4bharat/indic-conformer-600m-multilingual.
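If you prefer to authenticate from Python rather than the CLI, the `huggingface_hub` package provides an equivalent `login()` helper:

```python
# Programmatic alternative to `huggingface-cli login`.
from huggingface_hub import login

login(token="hf_your_token")  # token from huggingface.co/settings/tokens
```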
python3 -c "
from transformers import AutoModel
AutoModel.from_pretrained('ai4bharat/indic-conformer-600m-multilingual', trust_remote_code=True)
"VEXYL_STT_HOST=127.0.0.1
VEXYL_STT_PORT=8091
VEXYL_STT_DECODE=ctc
VEXYL_STT_DEVICE=cpu
# VEXYL_STT_API_KEY=your-secret-heresource venv/bin/activate
export $(grep -v '^#' .env | xargs)
python3 vexyl_stt_server.py| Variable | Default | Options | Description |
|---|---|---|---|
| `VEXYL_STT_HOST` | `0.0.0.0` | Any IP | Bind address. Use `127.0.0.1` for local-only |
| `VEXYL_STT_PORT` | `8080` | Any port | Port number (via `PORT` or `VEXYL_STT_PORT`). The sample `.env` uses 8091 |
| `VEXYL_STT_DECODE` | `ctc` | `ctc`, `rnnt` | Decoding mode. CTC is faster, RNNT is more accurate |
| `VEXYL_STT_DEVICE` | `auto` | `auto`, `cpu`, `cuda` | Inference device. `auto` uses CUDA if available |
| `VEXYL_STT_MAX_CONN` | `50` | Any integer | Max concurrent WebSocket connections |
| `VEXYL_STT_API_KEY` | (empty) | Any string | Shared secret for authentication. Clients must send `X-API-Key` header |
Set `VEXYL_STT_API_KEY` on both server and client. The client sends the key as an `X-API-Key` header. The `/health` endpoint is always exempt. When the variable is empty, authentication is disabled.
```bash
# Server .env
VEXYL_STT_API_KEY=your-shared-secret

# Test with wscat
wscat -c ws://127.0.0.1:8091 -H "X-API-Key: your-shared-secret"
```

```
Client                                  Server
  │                                       │
  │◄──── {"type":"ready"} ───────────────│  (immediate on connect)
  │── {"type":"start",...} ─────────────►│  (begin session)
  │◄──── {"type":"started",...} ─────────│
  │── [binary PCM audio] ───────────────►│  (stream 16kHz 16-bit mono PCM)
  │◄──── {"type":"final",...} ───────────│  (VAD triggers transcription)
  │── {"type":"stop"} ──────────────────►│  (end session)
  │◄──── {"type":"final",...} ───────────│  (flush remaining audio)
  │◄──── {"type":"stopped"} ─────────────│
```
| Message | Description |
|---|---|
| `{"type":"start","lang":"ml-IN","session_id":"abc"}` | Begin transcription session |
| `[binary]` | Raw 16kHz 16-bit mono PCM audio bytes |
| `{"type":"stop"}` | End session (flushes buffered audio) |
| `{"type":"ping"}` | Keepalive |
| Message | Description |
|---|---|
| `{"type":"ready","model":"..."}` | Server loaded, ready for sessions |
| `{"type":"started","session_id":"...","lang":"..."}` | Session begun |
| `{"type":"final","text":"...","lang":"...","duration":2.45,"latency_ms":320}` | Transcription result |
| `{"type":"stopped"}` | Session ended |
| `{"type":"pong"}` | Keepalive response |
| `{"type":"error","message":"..."}` | Error |
```
POST /batch/transcribe       → submit audio file
GET  /batch/status/{job_id}  → check job status
GET  /batch/result/{job_id}  → get transcript (202 if not ready)
GET  /health                 → health check
```
```bash
curl -X POST http://localhost:8091/batch/transcribe \
  -H "X-API-Key: your-secret" \
  -F "file=@recording.wav" \
  -F "language_code=hi-IN"
```

Response (201):

```json
{"job_id": "batch_a1b2c3d4", "status": "queued", "language": "hi-IN", "audio_duration": 4.52}
```

`GET /batch/status/{job_id}` returns the job status, with the transcript once completed. `GET /batch/result/{job_id}` returns 202 while the job is processing and 200 when it is complete.
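The same submit-and-poll flow from Python, using the `requests` package. The endpoint paths, `X-API-Key` header, and `job_id` field match the docs above; the exact shape of the completed-result payload is an assumption, so the sketch just prints the whole JSON body.

```python
# Batch submit-and-poll sketch (assumes: pip install requests).
import time

import requests

BASE = "http://localhost:8091"
HEADERS = {"X-API-Key": "your-secret"}  # omit if auth is disabled

# Submit the file for transcription.
with open("recording.wav", "rb") as f:
    resp = requests.post(
        f"{BASE}/batch/transcribe",
        headers=HEADERS,
        files={"file": f},
        data={"language_code": "hi-IN"},
    )
resp.raise_for_status()
job_id = resp.json()["job_id"]  # e.g. "batch_a1b2c3d4"

# Poll until done: 202 while processing, 200 when complete.
while True:
    result = requests.get(f"{BASE}/batch/result/{job_id}", headers=HEADERS)
    if result.status_code == 200:
        print(result.json())  # completed transcript payload
        break
    time.sleep(1)
```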
| Limit | Value |
|---|---|
| Max file size | 25 MB |
| Max audio duration | 5 minutes |
| Max pending jobs | 1,000 |
| Job TTL | 1 hour |
| Supported formats | WAV, MP3, FLAC, OGG, M4A |
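A client-side pre-check against these limits can avoid a rejected upload. This sketch uses the `soundfile` package and the limit values from the table above; note that `soundfile` cannot read every format the server accepts (M4A in particular), so treat it as illustrative.

```python
# Pre-flight check against the documented batch limits
# (assumes: pip install soundfile).
import os

import soundfile as sf

MAX_BYTES = 25 * 1024 * 1024  # 25 MB file-size limit
MAX_SECONDS = 5 * 60          # 5 minute duration limit

def check_upload(path: str) -> None:
    size = os.path.getsize(path)
    duration = sf.info(path).duration  # seconds, read from the file header
    if size > MAX_BYTES:
        raise ValueError(f"{path}: {size} bytes exceeds the 25 MB limit")
    if duration > MAX_SECONDS:
        raise ValueError(f"{path}: {duration:.1f}s exceeds the 5 minute limit")

check_upload("recording.wav")
```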
```bash
curl http://127.0.0.1:8091/health
```

```json
{
  "status": "ok",
  "model": "indic-conformer-600m-multilingual",
  "device": "cpu",
  "decode_mode": "ctc",
  "active_sessions": 0,
  "max_connections": 50,
  "uptime_seconds": 3600.5,
  "batch_jobs_queued": 0,
  "batch_jobs_total": 0
}
```

`test.html` — open directly in a browser. It records from the microphone, streams to the server, and displays transcripts in real time.
`test-batch.html` — upload audio files or record from the microphone for async batch transcription.
```bash
docker build --build-arg HF_TOKEN=$HF_TOKEN -t vexyl-stt .
```

```bash
docker run -p 8080:8080 vexyl-stt

# With API key
docker run -p 8080:8080 -e VEXYL_STT_API_KEY=mysecret vexyl-stt
```

See DEPLOY.md for a complete guide.
Quick deploy:

```bash
export GCP_PROJECT_ID=your-project-id
export HF_TOKEN=hf_your_token
./deploy.sh
```

VEXYL AI Voice Gateway is an enterprise platform that connects phone calls directly to AI — bridging traditional telephony (PSTN, SIP, Asterisk, WebRTC) with LLMs, STT, and TTS providers. It supports 17+ AI providers including OpenAI, Groq, Deepgram, and ElevenLabs, with sub-200ms latency and features like barge-in, human escalation, and outbound calling.
VEXYL-STT plugs into the Voice Gateway as a self-hosted STT provider, giving you Indian language transcription with zero external API calls — ideal for data sovereignty, cost control, or as a fallback when cloud STT providers are unavailable.
Key benefits of using VEXYL-STT with the Voice Gateway:
- Zero API cost for Indian language calls — no per-minute STT billing
- Full data sovereignty — audio never leaves your infrastructure
- Fallback resilience — automatic failover from cloud STT to local model
- Low latency — same-machine WebSocket connection, no network round-trip
Visit vexyl.ai to learn more about the enterprise product.
The vexyl-stt-client.js module provides a Node.js client that follows the same interface pattern as other Voice Gateway STT providers (Groq, Deepgram, Sarvam, etc.).
```javascript
const { VexylSTT } = require('./vexyl-stt-client.js');

const stt = new VexylSTT('ml-IN');
stt.onTranscript = (text) => console.log('Transcript:', text);
stt.onError = (err) => console.error('Error:', err);

await stt.connect();
stt.sendAudio(pcmBuffer); // 16kHz 16-bit mono PCM
await stt.stop();
```

Environment variables:

- `VEXYL_STT_URL` — WebSocket URL (default: `ws://127.0.0.1:8091`)
- `VEXYL_STT_API_KEY` — Shared secret for the `X-API-Key` header
See stt-provider-patch.md and language-config-patch.md for Voice Gateway integration instructions.
```bash
pm2 start run.sh --name vexyl-stt
pm2 logs vexyl-stt
pm2 save && pm2 startup
```

To switch an existing install to GPU:

```bash
source venv/bin/activate
pip uninstall torch torchaudio -y
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Set `VEXYL_STT_DEVICE=auto` (or `cuda`) in `.env` and restart.
For higher accuracy at the cost of slightly increased latency:

```ini
VEXYL_STT_DECODE=rnnt
```

If the model download fails, verify your token (`huggingface-cli whoami`) and ensure you have been granted access to the gated model.
If the port is already in use:

```bash
lsof -i :8091
# Or change port: VEXYL_STT_PORT=8092
```

If no transcripts come back:

- Check audio format: the server expects 16kHz, 16-bit, mono PCM
- Check language code: unknown codes default to Malayalam (`ml`)
- Check VAD threshold: quiet audio may not exceed `SILENCE_THRESHOLD` (0.015); see the sketch below
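For intuition on that last point, here is an illustrative energy-based VAD check. It assumes the 0.015 threshold applies to the RMS energy of samples normalized to [-1, 1]; the server's actual VAD internals may differ.

```python
# Illustrative energy-based VAD check (assumption: the threshold applies to
# RMS energy of samples normalized to [-1, 1]; server internals may differ).
import numpy as np

SILENCE_THRESHOLD = 0.015

def is_speech(pcm_bytes: bytes) -> bool:
    if not pcm_bytes:
        return False
    # Decode 16-bit PCM and normalize to [-1, 1].
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    rms = np.sqrt(np.mean(samples ** 2))  # frame energy
    return rms > SILENCE_THRESHOLD
```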
If the client cannot connect:

- Ensure the server is running: `./run.sh`
- Check that host/port match between the client and `.env`
- For remote access, set `VEXYL_STT_HOST=0.0.0.0`
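A quick reachability test, assuming the host and port from the sample `.env`:

```python
# Two-line check that something is listening on the STT port.
import socket

socket.create_connection(("127.0.0.1", 8091), timeout=2).close()
print("server port reachable")
```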
| File | Description |
|---|---|
| `vexyl_stt_server.py` | Python server — WebSocket streaming + batch REST API |
| `vexyl-stt-client.js` | Node.js client library for Voice Gateway integration |
| `setup.sh` | Automated setup — venv, deps, HuggingFace auth, model download |
| `run.sh` | Start script — loads `.env`, activates venv, launches server |
| `deploy.sh` | One-command Cloud Run deployment |
| `Dockerfile` | Container image with baked-in model |
| `.env.example` | Template for server configuration |
| `test.html` | Browser test client for real-time streaming |
| `test-batch.html` | Browser test client for batch API |
| `stt-provider-patch.md` | Voice Gateway `stt-provider.js` integration guide |
| `language-config-patch.md` | Voice Gateway `language-config.js` integration guide |
Contributions are welcome! Please open an issue or submit a pull request.
Apache License 2.0 — Copyright 2025 VEXYL AI