Tharinda-Pamindu/Voice-Activity-Detector

🎙️ Voice Activity Detector

AI-Powered Audio Segmentation with Silero VAD

Python 3.10+ · Flask · Silero VAD · License: MIT · PRs Welcome

Automatically detect speech segments in long audio files and split them into clean, silence-free clips, perfect for feeding into AI models, transcription pipelines, or dataset preparation.

Features · Quick Start · Usage · Configuration · Contributing


🎯 The Problem

You can't feed a 10-minute audio file to most AI/ML models at once. You need to cut it into small pieces of 3–10 seconds. Doing this manually is painful and error-prone.

✅ The Solution

This app uses Silero VAD (Voice Activity Detection), a state-of-the-art neural network, to automatically:

  1. Detect where speech occurs in your audio
  2. Remove silence gaps between speech segments
  3. Split the audio into clean, manageable clips (3–10s)
  4. Export everything as a downloadable ZIP of WAV files

All of this runs in a browser-based Liquid Glass UI with Material Design components; once the app is launched, no command line is needed.
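Steps 2 and 3 above can be sketched in plain Python. This is a minimal illustration, not the code in app.py: it assumes the detector has already returned speech regions as (start, end) pairs in seconds, and the name `split_into_clips` is hypothetical.

```python
def split_into_clips(regions, max_len=10.0):
    """Cut each speech region into consecutive clips of at most max_len seconds.

    `regions` is a list of (start, end) pairs in seconds. Silence between
    regions is dropped simply because it is never copied into a clip.
    """
    clips = []
    for start, end in regions:
        cursor = start
        # Emit full-length clips while more than max_len seconds remain.
        while end - cursor > max_len:
            clips.append((cursor, cursor + max_len))
            cursor += max_len
        # Emit whatever is left of the region as the final clip.
        if end > cursor:
            clips.append((cursor, end))
    return clips


regions = [(0.5, 4.0), (6.0, 31.0)]
clips = split_into_clips(regions, max_len=10.0)
print(clips)
```

A fixed-length cut like this can slice through a word; a real implementation would more likely look for a pause near the limit before cutting.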


✨ Features

| Feature | Description |
|---|---|
| 🧠 AI-Powered VAD | Uses Silero VAD v5: 87.7% TPR, processes 30ms chunks in <1ms on CPU |
| 🎨 Liquid Glass UI | Frosted-glass cards, animated gradients, Material Design components |
| 📊 Visual Timeline | Interactive canvas visualization showing speech vs. silence regions |
| ⚙️ Fine-Tunable | Adjustable sensitivity, min/max duration, silence gap, and padding |
| 📁 Multi-Format | Supports WAV, MP3, OGG, FLAC, AAC, M4A, WMA, OPUS, WebM |
| 🖱️ Drag & Drop | Simply drag your audio file into the browser window |
| ✅ Selective Export | Choose which segments to include in your download |
| 📦 ZIP Download | All segments packaged into a single downloadable ZIP |
| 💻 Standalone | Pure Python: no Node.js, no npm, just python app.py |

🚀 Quick Start

Prerequisites

Option A: One-Click Launch (Windows)

After initial setup, just double-click run.cmd. It activates the virtual environment, starts the server, and opens your browser automatically.

Option B: Automated Setup (Windows, First Time)

git clone https://github.com/Tharinda-Pamindu/Voice-Activity-Detector.git
cd Voice-Activity-Detector
setup.bat

Then double-click run.cmd to launch.

Option C: Manual Setup

# Clone the repository
git clone https://github.com/Tharinda-Pamindu/Voice-Activity-Detector.git
cd Voice-Activity-Detector

# Create and activate virtual environment
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

# Install PyTorch (CPU-only, lightweight, ~150MB)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

# Install remaining dependencies
pip install flask silero-vad pydub soundfile librosa numpy

# Run the app
python app.py

Then open http://localhost:5000 in your browser 🎉


📖 Usage

1. Upload Your Audio

Drag and drop any audio file (WAV, MP3, FLAC, OGG, etc.) into the upload zone, or click to browse.
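As a rough illustration of how an upload could be screened by file type, the sketch below checks a filename against the formats named in the Multi-Format feature row. The helper name `is_supported_audio` is hypothetical, and whether app.py validates uploads this way is an assumption.

```python
from pathlib import Path

# Extensions taken from the Multi-Format row of the feature list.
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".ogg", ".flac", ".aac", ".m4a", ".wma", ".opus", ".webm",
}


def is_supported_audio(filename):
    """Return True if the filename ends in one of the listed audio extensions."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_audio("Interview.MP3"))
print(is_supported_audio("notes.txt"))
```

An extension check only filters obvious mistakes; the decoder still has the final say on whether a file is readable.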

2. Configure VAD Settings

Fine-tune the detection parameters using the intuitive sliders:

| Setting | Default | Description |
|---|---|---|
| Detection Sensitivity | 0.50 | Higher = stricter speech detection (0.1 – 0.95) |
| Min Speech Duration | 250ms | Ignore speech segments shorter than this |
| Min Silence Duration | 300ms | Minimum silence gap to split segments |
| Max Segment Length | 10s | Automatically split segments longer than this |
| Padding | 200ms | Extra audio buffer around each segment |
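To make the Min Speech Duration and Padding settings concrete, here is a small sketch of how they could act on detected regions. It assumes regions are (start, end) pairs in seconds; the function name `pad_and_filter` is hypothetical, and Silero VAD applies its own equivalents of both steps internally, so this only illustrates the idea.

```python
def pad_and_filter(regions, total_s, min_speech_ms=250, padding_ms=200):
    """Drop regions shorter than min_speech_ms, then widen the rest by padding_ms.

    Padded boundaries are clamped to [0, total_s] so a clip never reaches
    outside the audio file.
    """
    min_speech_s = min_speech_ms / 1000
    pad_s = padding_ms / 1000
    result = []
    for start, end in regions:
        if end - start < min_speech_s:
            continue  # too short to count as speech (a click or a cough)
        result.append((max(0.0, start - pad_s), min(total_s, end + pad_s)))
    return result


padded = pad_and_filter([(0.1, 0.2), (1.0, 3.0), (9.5, 9.9)], total_s=10.0)
print(padded)
```

In this example the 100 ms region is discarded, the second region grows by 200 ms on each side, and the last region is clamped at the 10-second end of the file.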

3. Analyze

Click "Analyze Audio". The app processes your file with Silero VAD and displays:

  • A visual timeline showing speech (highlighted) vs. silence regions
  • Statistics: total duration, segment count, speech time, silence removed
  • A segment list with timestamps and duration bars
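The statistics listed above follow directly from the segment boundaries. The sketch below shows one way to derive them, assuming non-overlapping (start, end) segments in seconds; the function name and dictionary keys are illustrative, not the app's actual response format.

```python
def summarize(segments, total_s):
    """Compute summary figures from non-overlapping (start, end) segments."""
    speech_s = sum(end - start for start, end in segments)
    return {
        "total_duration_s": total_s,
        "segment_count": len(segments),
        "speech_s": speech_s,
        # Everything that is not speech counts as removed silence.
        "silence_removed_s": max(0.0, total_s - speech_s),
    }


stats = summarize([(0.0, 4.0), (6.0, 9.0)], total_s=12.0)
print(stats)
```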

4. Download

Select the segments you want, then click "Download Selected Segments" to get a ZIP file containing numbered WAV clips.
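A ZIP of numbered WAV clips can be built with the Python standard library alone. The sketch below is one possible approach, not the app's implementation: it assumes each clip is already raw 16-bit mono PCM, and the `segment_001.wav` naming pattern is a guess at what "numbered" means here.

```python
import io
import wave
import zipfile


def build_zip(clips, sample_rate=16000):
    """Pack raw 16-bit mono PCM clips into an in-memory ZIP of numbered WAV files."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for index, pcm in enumerate(clips, start=1):
            wav_bytes = io.BytesIO()
            with wave.open(wav_bytes, "wb") as wf:
                wf.setnchannels(1)   # mono
                wf.setsampwidth(2)   # 16-bit samples
                wf.setframerate(sample_rate)
                wf.writeframes(pcm)
            zf.writestr(f"segment_{index:03d}.wav", wav_bytes.getvalue())
    return archive.getvalue()


# Two clips of digital silence (0.5 s and 1 s) stand in for real speech segments.
demo_zip = build_zip([b"\x00\x00" * 8000, b"\x00\x00" * 16000])
names = zipfile.ZipFile(io.BytesIO(demo_zip)).namelist()
print(names)
```

Building the archive in memory avoids temporary files, which suits a small web app; very large exports might be better written to disk.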


⚙️ Configuration

VAD Parameters (via UI sliders)

| Parameter | Range | Impact |
|---|---|---|
| threshold | 0.1 – 0.95 | Lower = more sensitive (catches quiet speech), Higher = fewer false positives |
| min_speech_ms | 100 – 2000 | Filters out very short sounds (coughs, clicks) |
| min_silence_ms | 100 – 3000 | Controls how long a pause must be to split segments |
| max_segment_s | 3 – 30 | Forces long monologues to be split at this length |
| padding_ms | 0 – 500 | Adds a buffer to avoid cutting off word beginnings/endings |
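For readers who want to reproduce the analysis outside the UI, the parameters above correspond closely to keyword arguments of `get_speech_timestamps` in the silero-vad package. The mapping below is a sketch under that assumption; how app.py actually passes these values is not shown in this README, and `to_silero_kwargs` and `detect_speech` are hypothetical helper names.

```python
def to_silero_kwargs(threshold=0.5, min_speech_ms=250, min_silence_ms=300,
                     max_segment_s=10, padding_ms=200):
    """Translate the UI parameter names into silero-vad keyword arguments."""
    return {
        "threshold": threshold,
        "min_speech_duration_ms": min_speech_ms,
        "min_silence_duration_ms": min_silence_ms,
        "max_speech_duration_s": max_segment_s,
        "speech_pad_ms": padding_ms,
    }


def detect_speech(path, **ui_settings):
    """Run Silero VAD on a file and return speech timestamps in seconds."""
    # Imported lazily so the mapping above works without silero-vad installed.
    from silero_vad import get_speech_timestamps, load_silero_vad, read_audio

    model = load_silero_vad()
    audio = read_audio(path, sampling_rate=16000)
    return get_speech_timestamps(
        audio, model, sampling_rate=16000, return_seconds=True,
        **to_silero_kwargs(**ui_settings),
    )


print(to_silero_kwargs(threshold=0.6))
```

Calling `detect_speech("interview.wav", threshold=0.6)` would then return a list of dictionaries with `start` and `end` keys; the filename is only an example.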

GPU Support

By default, this app installs CPU-only PyTorch for a smaller footprint. To use GPU acceleration:

# Instead of the CPU-only PyTorch install in Manual Setup, install a CUDA-enabled build (cu121 = CUDA 12.1)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
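After installing a CUDA build, a quick check like the one below can confirm whether PyTorch sees a GPU. It is a generic PyTorch check, not part of this app; whether app.py moves the model to the GPU is not stated here, and `pick_device` is a hypothetical helper.

```python
def pick_device():
    """Return "cuda" if PyTorch is installed and a GPU is visible, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```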

📁 Project Structure

Voice-Activity-Detector/
├── app.py                 # Flask backend + VAD processing engine
├── run.cmd                # One-click launcher (activates venv + opens browser)
├── setup.bat              # Windows first-time setup script
├── requirements.txt       # Python dependencies
├── templates/
│   └── index.html         # Main HTML page (Liquid Glass UI)
├── static/
│   ├── style.css          # Liquid Glass + Material Design styles
│   └── app.js             # Frontend logic (upload, timeline, download)
├── .gitignore
├── LICENSE                # MIT License
├── CONTRIBUTING.md        # Contribution guidelines
└── README.md

🛠️ Tech Stack

| Component | Technology |
|---|---|
| VAD Engine | Silero VAD v5 (ONNX) |
| Backend | Flask 3.0 (Python) |
| Audio Processing | PyTorch, torchaudio, pydub, soundfile, librosa |
| Frontend | Vanilla HTML/CSS/JS |
| Design | Liquid Glass UI + Material Design |
| Fonts | Inter, Outfit (Google Fonts) |
| Icons | Material Icons Round |

🤝 Contributing

Contributions are welcome! Please read the Contributing Guide for details on:

  • Setting up your development environment
  • Our commit message convention
  • Code style guidelines
  • How to submit pull requests

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🙏 Acknowledgments


Built with ❤️ by Tharinda Pamindu

⭐ Star this repo if you find it useful!
