Automatically detect speech segments in long audio files and split them into clean, silence-free clips – perfect for feeding into AI models, transcription pipelines, or dataset preparation.
Features · Quick Start · Usage · Configuration · Contributing
You can't feed a 10-minute audio file to most AI/ML models at once. You need to cut it into small pieces of 3–10 seconds. Doing this manually is painful and error-prone.
This app uses Silero VAD (Voice Activity Detection), a state-of-the-art neural network, to automatically:
- Detect where speech occurs in your audio
- Remove silence gaps between speech segments
- Split the audio into clean, manageable clips (3–10 s)
- Export everything as a downloadable ZIP of WAV files
All through a beautiful Liquid Glass UI with Material Design components – no command line needed.
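As an illustration of the detect-and-split flow, here is a pure-Python sketch. The `{"start": …, "end": …}` sample-index timestamp format mirrors what Silero VAD's `get_speech_timestamps()` returns; the audio here is a toy list of samples rather than a real waveform:

```python
# Sketch of the "detect speech -> drop silence -> emit clips" step.
# Timestamps use sample indices, as Silero VAD's get_speech_timestamps()
# returns by default; `samples` stands in for a decoded waveform.

def cut_speech_clips(samples, timestamps):
    """Return one clip per detected speech region, with the silence between regions removed."""
    return [samples[ts["start"]:ts["end"]] for ts in timestamps]

# Toy example: 10 "samples" with speech detected at indices 2-4 and 7-9
audio = list(range(10))
ts = [{"start": 2, "end": 4}, {"start": 7, "end": 9}]
clips = cut_speech_clips(audio, ts)  # -> [[2, 3], [7, 8]]
```

Each clip is then resampled, padded, and written out as its own WAV file by the app.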
| Feature | Description |
|---|---|
| AI-Powered VAD | Uses Silero VAD v5 – 87.7% TPR, processes 30 ms chunks in <1 ms on CPU |
| Liquid Glass UI | Frosted-glass cards, animated gradients, Material Design components |
| Visual Timeline | Interactive canvas visualization showing speech vs. silence regions |
| Fine-Tunable | Adjustable sensitivity, min/max duration, silence gap, and padding |
| Multi-Format | Supports WAV, MP3, OGG, FLAC, AAC, M4A, WMA, OPUS, WebM |
| Drag & Drop | Simply drag your audio file into the browser window |
| Selective Export | Choose which segments to include in your download |
| ZIP Download | All segments packaged into a single downloadable ZIP |
| Standalone | Pure Python – no Node.js, no npm, just `python app.py` |
- Python 3.10+ – Download Python
- FFmpeg (optional, for MP3/AAC support) – Download FFmpeg
After initial setup, just double-click `run.cmd` – it activates the virtual environment, starts the server, and opens your browser automatically.
```bash
git clone https://github.com/Tharinda-Pamindu/Voice-Activity-Detector.git
cd Voice-Activity-Detector
setup.bat
```

Then double-click `run.cmd` to launch.
```bash
# Clone the repository
git clone https://github.com/Tharinda-Pamindu/Voice-Activity-Detector.git
cd Voice-Activity-Detector

# Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate       # Windows
source .venv/bin/activate    # macOS/Linux

# Install PyTorch (CPU-only – a lightweight ~150 MB download)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

# Install remaining dependencies
pip install flask silero-vad pydub soundfile librosa numpy

# Run the app
python app.py
```

Then open http://localhost:5000 in your browser.
Drag and drop any audio file (WAV, MP3, FLAC, OGG, etc.) into the upload zone, or click to browse.
Fine-tune the detection parameters using the intuitive sliders:
| Setting | Default | Description |
|---|---|---|
| Detection Sensitivity | 0.50 | Higher = stricter speech detection (0.1 – 0.95) |
| Min Speech Duration | 250ms | Ignore speech segments shorter than this |
| Min Silence Duration | 300ms | Minimum silence gap to split segments |
| Max Segment Length | 10s | Automatically split segments longer than this |
| Padding | 200ms | Extra audio buffer around each segment |
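How these settings interact can be sketched as a hypothetical post-processing pass over raw VAD timestamps (all times in milliseconds). The exact logic in `app.py` may differ; this only illustrates what each slider controls:

```python
# Illustrative post-processing of raw VAD segments [[start_ms, end_ms], ...]
# using the default slider values from the table above.

def postprocess(segments, min_speech_ms=250, min_silence_ms=300,
                max_segment_ms=10_000, padding_ms=200, total_ms=None):
    # 1. Merge neighbours separated by a gap shorter than min_silence_ms
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < min_silence_ms:
            merged[-1][1] = end
        else:
            merged.append([start, end])
    # 2. Drop segments shorter than min_speech_ms (coughs, clicks)
    kept = [s for s in merged if s[1] - s[0] >= min_speech_ms]
    # 3. Split anything longer than max_segment_ms
    split = []
    for start, end in kept:
        while end - start > max_segment_ms:
            split.append([start, start + max_segment_ms])
            start += max_segment_ms
        split.append([start, end])
    # 4. Pad each segment, clamped to the file boundaries
    hi = total_ms if total_ms is not None else float("inf")
    return [[max(0, s - padding_ms), min(hi, e + padding_ms)] for s, e in split]

segs = postprocess([[0, 100], [150, 12_000], [12_400, 12_500]], total_ms=13_000)
# -> [[0, 10200], [9800, 12200]]
```

The first two raw segments merge (their 50 ms gap is below the silence threshold), the 100 ms blip is dropped, and the resulting 12 s segment is split at the 10 s cap before padding is applied.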
Click "Analyze Audio" – the app will process your file with Silero VAD and display:
- A visual timeline showing speech (highlighted) vs. silence regions
- Statistics – total duration, segment count, speech time, silence removed
- A segment list with timestamps and duration bars
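The statistics follow directly from the detected segments. A minimal sketch (the function and field names here are illustrative, not taken from `app.py`):

```python
# Deriving the summary statistics from detected segments (times in seconds).

def summarize(segments, total_s):
    speech = sum(end - start for start, end in segments)
    return {
        "total_s": total_s,                    # total file duration
        "segments": len(segments),             # segment count
        "speech_s": speech,                    # time kept as speech
        "silence_removed_s": total_s - speech, # time trimmed away
    }

stats = summarize([(0.5, 3.0), (4.0, 9.5)], total_s=12.0)
# -> 2 segments, 8.0 s of speech, 4.0 s of silence removed
```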
Select the segments you want, then click "Download Selected Segments" to get a ZIP file containing numbered WAV clips.
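The export step can be reproduced with only the standard library. The `segment_NNN.wav` naming and the in-memory ZIP below are assumptions for illustration, not necessarily what `app.py` does:

```python
# Sketch of the ZIP-export step: each selected clip is written as a
# numbered 16-bit mono WAV and packed into one in-memory ZIP archive.
import io
import wave
import zipfile

def export_zip(clips, sample_rate=16_000):
    """clips: list of bytes objects holding raw 16-bit PCM samples."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i, pcm in enumerate(clips, start=1):
            wav_buf = io.BytesIO()
            with wave.open(wav_buf, "wb") as w:
                w.setnchannels(1)              # mono
                w.setsampwidth(2)              # 16-bit samples
                w.setframerate(sample_rate)
                w.writeframes(pcm)
            zf.writestr(f"segment_{i:03d}.wav", wav_buf.getvalue())
    return buf.getvalue()

data = export_zip([b"\x00\x00" * 100, b"\x00\x00" * 50])
```

The resulting archive contains `segment_001.wav`, `segment_002.wav`, and so on, ready to drop into a dataset folder.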
| Parameter | Range | Impact |
|---|---|---|
| `threshold` | 0.1 – 0.95 | Lower = more sensitive (catches quiet speech); higher = fewer false positives |
| `min_speech_ms` | 100 – 2000 | Filters out very short sounds (coughs, clicks) |
| `min_silence_ms` | 100 – 3000 | Controls how long a pause must be to split segments |
| `max_segment_s` | 3 – 30 | Forces long monologues to be split at this length |
| `padding_ms` | 0 – 500 | Adds a buffer to avoid cutting off word beginnings/endings |
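These parameters map naturally onto the keyword arguments of the `silero-vad` package's `get_speech_timestamps()`. The mapping below is a sketch; verify the kwarg names against the version you install:

```python
# Hypothetical mapping from this app's parameters onto silero-vad's
# get_speech_timestamps() keyword arguments.

def to_vad_kwargs(threshold=0.5, min_speech_ms=250, min_silence_ms=300,
                  max_segment_s=10, padding_ms=200):
    return {
        "threshold": threshold,
        "min_speech_duration_ms": min_speech_ms,
        "min_silence_duration_ms": min_silence_ms,
        "max_speech_duration_s": max_segment_s,
        "speech_pad_ms": padding_ms,
    }

kwargs = to_vad_kwargs(threshold=0.6)
# then: get_speech_timestamps(wav, model, **kwargs)
```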
By default, this app installs CPU-only PyTorch for a smaller footprint. To use GPU acceleration:
```bash
# Replace step 2 with CUDA-enabled PyTorch
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
```

```
Voice-Activity-Detector/
├── app.py              # Flask backend + VAD processing engine
├── run.cmd             # One-click launcher (activates venv + opens browser)
├── setup.bat           # Windows first-time setup script
├── requirements.txt    # Python dependencies
├── templates/
│   └── index.html      # Main HTML page (Liquid Glass UI)
├── static/
│   ├── style.css       # Liquid Glass + Material Design styles
│   └── app.js          # Frontend logic (upload, timeline, download)
├── .gitignore
├── LICENSE             # MIT License
├── CONTRIBUTING.md     # Contribution guidelines
└── README.md
```
| Component | Technology |
|---|---|
| VAD Engine | Silero VAD v5 (ONNX) |
| Backend | Flask 3.0 (Python) |
| Audio Processing | PyTorch, torchaudio, pydub, soundfile, librosa |
| Frontend | Vanilla HTML/CSS/JS |
| Design | Liquid Glass UI + Material Design |
| Fonts | Inter, Outfit (Google Fonts) |
| Icons | Material Icons Round |
Contributions are welcome! Please read the Contributing Guide for details on:
- Setting up your development environment
- Our commit message convention
- Code style guidelines
- How to submit pull requests
This project is licensed under the MIT License – see the LICENSE file for details.
- Silero VAD – State-of-the-art voice activity detection
- Flask – Lightweight Python web framework
- PyTorch – Deep learning framework
- Google Material Design – Design system inspiration
Built with ❤️ by Tharinda Pamindu
⭐ Star this repo if you find it useful!