Transcribe large audio files with automatic speaker diarization and a web interface.
Uses the NVIDIA Parakeet v2 model for transcription and pyannote models for diarization.
- Automatic Speech Transcription (English)
- Speaker Diarization: Distinguish between multiple speakers
- Chunked Processing: Handles long audio files efficiently
- Speaker Recognition: Consistently tracks the same speakers across all chunks
- Modern Web UI: Powered by Gradio
- Downloadable Results: Get both formatted text and grouped-by-speaker files
- CPU-Only Inference: Runs fast on CPU using PyTorch and ONNX; a GPU is optional
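To illustrate the chunked-processing idea above: long audio is cut into fixed-length chunks with a small overlap so words and speaker turns are not sliced exactly at chunk boundaries. This is a minimal sketch with hypothetical names and parameters; the project's actual chunking logic lives in `transcript_big_file.py` and may differ.

```python
import numpy as np

def split_into_chunks(samples: np.ndarray, sample_rate: int,
                      chunk_seconds: float = 30.0,
                      overlap_seconds: float = 1.0):
    """Split a long waveform into fixed-length chunks with a small overlap.

    Returns a list of (start_time_seconds, chunk_samples) tuples.
    Hypothetical helper for illustration only.
    """
    chunk_len = int(chunk_seconds * sample_rate)
    hop = chunk_len - int(overlap_seconds * sample_rate)
    chunks = []
    for start in range(0, len(samples), hop):
        chunk = samples[start:start + chunk_len]
        chunks.append((start / sample_rate, chunk))
        if start + chunk_len >= len(samples):
            break  # last chunk already covers the tail of the audio
    return chunks

# Example: 65 s of audio at 16 kHz produces three overlapping <= 30 s chunks
audio = np.zeros(65 * 16000, dtype=np.float32)
chunks = split_into_chunks(audio, 16000)
```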
```shell
git clone https://github.com/deepanshu-yadav/big_audio_file_transcription.git
cd big_audio_file_transcription
```

If you choose the pyannote models for diarization, you need to download the following files. If you choose Resemblyzer instead, you can skip this step.
- Segmentation Model

  Download from pyannote/segmentation-3.0, then run:

  ```shell
  cp pytorch_model.bin model_components/pyannote/
  mv model_components/pyannote/pytorch_model.bin model_components/pyannote/segmentation-3.0.bin
  ```
- Embedding Model

  Download from pyannote/wespeaker-voxceleb-resnet34-LM, then run:

  ```shell
  cp pytorch_model.bin model_components/pyannote/
  mv model_components/pyannote/pytorch_model.bin model_components/pyannote/wespeaker-voxceleb-resnet34-LM.bin
  ```
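Cross-chunk speaker recognition relies on speaker embeddings such as those produced by the wespeaker model above: each chunk-local speaker is matched against already-known speakers by cosine similarity. The sketch below is illustrative only — the function name, threshold, and labeling scheme are assumptions, not the repo's actual implementation.

```python
import numpy as np

def match_speaker(embedding: np.ndarray, known: dict,
                  threshold: float = 0.6) -> str:
    """Match a chunk-local speaker embedding against known global speakers.

    Returns an existing label when cosine similarity clears the threshold,
    otherwise registers the embedding under a new label.
    Hypothetical helper for illustration only.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_label, best_score = None, -1.0
    for label, ref in known.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        return best_label
    new_label = f"SPEAKER_{len(known):02d}"
    known[new_label] = embedding
    return new_label

known: dict = {}
a = np.array([1.0, 0.0, 0.0])
first = match_speaker(a, known)        # unseen speaker gets a fresh label
second = match_speaker(a * 2.0, known) # same direction, so same label
```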
Download and extract:
```shell
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8.tar.bz2
tar xvf sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8.tar.bz2
rm sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8.tar.bz2
```

Copy all .onnx and tokens.txt files into the model_components folder.
```shell
pip install -r requirements.txt
```

Install FFmpeg:

- Ubuntu/Debian: `sudo apt update && sudo apt install ffmpeg`
- Arch Linux: `sudo pacman -S ffmpeg`
- macOS (Homebrew): `brew install ffmpeg`
- Windows (Winget): `winget install Gyan.FFmpeg`
- Windows (Chocolatey): `choco install ffmpeg`
- Windows (Scoop): `scoop install ffmpeg`

Or download it from the official FFmpeg website.
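FFmpeg is used to normalize uploaded audio to the 16 kHz mono WAV that ASR models typically expect. A minimal sketch of building such a conversion command (a hypothetical helper — see `utils.py` for the project's actual conversion logic):

```python
def ffmpeg_to_wav_cmd(src: str, dst: str, sample_rate: int = 16000):
    """Build an ffmpeg command converting any input to 16 kHz mono WAV.

    Hypothetical helper for illustration; run the result with
    subprocess.run(cmd, check=True).
    """
    return [
        "ffmpeg", "-y",           # overwrite the output file without asking
        "-i", src,                # input file (.mp3, .m4a, ...)
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to 16 kHz
        dst,
    ]

cmd = ffmpeg_to_wav_cmd("meeting.mp3", "meeting.wav")
```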
```shell
python app.py
```

Visit http://127.0.0.1:7860 in your browser.
- Upload your audio file (supports .wav, .mp3, etc.).
- Set the number of speakers and chunk duration.
- Click "Transcribe".
- View the transcription with speaker labels.
- Download the full results as a text file.
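The grouped-by-speaker download can be produced by collecting each speaker's segments into one block. A minimal sketch, assuming segments arrive as (speaker, text) tuples in time order; the repo's actual output format may differ:

```python
from collections import defaultdict

def group_by_speaker(segments):
    """Group (speaker_label, text) transcript segments per speaker.

    Returns one text block per speaker, in order of first appearance.
    Hypothetical helper for illustration only.
    """
    grouped = defaultdict(list)
    for speaker, text in segments:
        grouped[speaker].append(text)
    lines = []
    for speaker, texts in grouped.items():
        lines.append(f"{speaker}:")
        lines.append(" ".join(texts))
        lines.append("")  # blank line between speakers
    return "\n".join(lines)

segments = [
    ("SPEAKER_00", "Hello everyone."),
    ("SPEAKER_01", "Hi there."),
    ("SPEAKER_00", "Let's get started."),
]
grouped_text = group_by_speaker(segments)
```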
```
repo/
│
├── app.py                   # Gradio web interface
├── transcript_big_file.py   # Main diarization & transcription logic
├── transcript.py            # Feature extraction & ONNX model handling
├── utils.py                 # Utility functions (audio conversion, file writing)
├── requirements.txt         # Python dependencies
├── model_components/        # Place all downloaded models here
│   └── pyannote/
│       └── config_diarize.yaml
└── README.md
```
- Support for languages other than English
- Improved handling of overlapping speakers
Pull requests and issues are welcome! Please open an issue for bugs or feature requests.
This project is licensed under the MIT License.
Enjoy fast, accurate, and speaker-aware audio transcription!
