An AI Voice Assistant web platform combining six smart voice modules (transcription, voice answering, emotion detection, podcast summarization, document Q&A, and voice cloning), all accessible via a lightweight Flask web interface.
- Project Overview
- Features
- Project Structure
- Tech Stack
- Installation
- Feature Details
- How It Works (Behind the Scenes)
- Known Issues
- Future Work
- License
- Contributing
- Contact
- Transcription from audio to text
- Conversational voice responses
- Real-time Q&A over documents via speech
- Emotion detection via CNN model
- Long podcast summarization
- Voice cloning and TTS using speaker sample
| Feature Name | Description | Technique/Model Used |
|---|---|---|
| Voice Transcription | Convert audio (mic/file) into text | Whisper (openai/whisper base) |
| Text-to-Speech (TTS) Answering | Generate voice output from text | CoquiTTS / Tacotron2-DDC |
| Voice Cloning | Clone user voice & read text | YourTTS / speaker_wav TTS synthesis |
| Emotion Detection | Detect emotion from voice (e.g., angry, sad) | CNN + MFCC (custom trained on RAVDESS) |
| Document Q&A | Ask voice-based questions over documents | Whisper + ChromaDB + SentenceTransformer |
| Podcast Summarizer | Transcribe & summarize long podcasts into bullet summary | Whisper + BART/DistilBART summarizer |
├── flask_app.py # Flask web app
├── __pycache__/
├── images/
├── ravdess-data/ # RAVDESS dataset
├── templates/ # HTML interface
├── static/ # Output audio files
├── uploads/ # Uploaded inputs
├── voice_transcriber.py # Feature 1
├── Text_to_Speech_generator.py # Feature 2
├── voice_cloner.py # Feature 3
├── emotion_detector.py # Feature 4 (CNN)
├── voice_rag_agent.py # Feature 5
├── podcast_summarizer.py # Feature 6
├── train_emotion_cnn.py # CNN training script
├── train_emotion_model.py # RandomForest training script
├── emotion_cnn.pth # Trained CNN weights
├── emotion_label_encoder.pkl # Label encoder for emotion
├── requirements-ai.txt # Python dependencies
├── README.md
└── LICENSE
| Purpose | Libraries Used |
|---|---|
| Transcription | whisper, ffmpeg |
| Text-to-Speech | TTS (Coqui) |
| Cloning | TTS (YourTTS) |
| Q&A, Embedding | sentence-transformers, chromadb, transformers |
| Emotion Detection | PyTorch, librosa, scikit-learn, joblib |
| Summarization | transformers (distilbart-cnn-12-6) |
| Web UI | Flask, Jinja2, HTML5 |
# Clone repository
git clone https://github.com/paht2005/ai-voice-assistant-suite.git
cd ai-voice-assistant-suite
# Install dependencies
pip install -r requirements-ai.txt
# Run Flask web app
python flask_app.py
Then open your browser: http://127.0.0.1:5000
- Uses Whisper model to transcribe audio files or mic input.
- Auto language detection & punctuation recovery.
- Enter any text → generate voice using the Coqui TTS model.
- Output saved as `static/tts_output.wav`.
- Upload a 3–5 sec voice sample (.wav).
- Type any text → generates response in that speaker's voice.
- Trained on the RAVDESS dataset.
- Input `.wav` → predicts 1 of 8 emotions.
- Model: CNN + MFCC → `emotion_cnn.pth`
- Upload a voice question (`.wav`).
- Uses Whisper to transcribe → SentenceTransformer + ChromaDB to retrieve document context → an LLM (e.g., Falcon) to answer.
- Upload a long `.wav` podcast → split into chunks → summarized using a BART-based model.
- Summary returned as a paragraph.
Each feature operates through a dedicated audio or NLP processing pipeline:
- Input: `.wav` file recorded from the user
- Process:
- The Whisper model converts the audio waveform into a log-Mel spectrogram.
- A multilingual decoder processes the spectrogram and generates the corresponding text transcription.
- Output: Clean, normalized text
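A minimal sketch of this pipeline, assuming the openai-whisper package from `requirements-ai.txt` and the `base` checkpoint named in the feature table; the input path is a placeholder:

```python
# Minimal transcription sketch using openai-whisper.
import whisper

model = whisper.load_model("base")               # downloads weights on first run
result = model.transcribe("uploads/input.wav")   # hypothetical upload path
print(result["language"])                        # auto-detected language
print(result["text"])                            # normalized transcript
```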
- Input: Text string generated or typed by the user
- Process:
- A Tacotron2 model (or similar) converts text to a mel-spectrogram.
- A vocoder (e.g., HiFi-GAN, WaveGlow) synthesizes audio from the spectrogram.
- Optionally, the voice output is adjusted using a cloned speaker embedding.
- Output: `tts_output.wav` file saved in the `/static/` directory
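In practice the text-to-spectrogram and vocoder stages are wrapped by Coqui TTS's high-level API. A minimal sketch, assuming the Tacotron2-DDC checkpoint from the feature table:

```python
# TTS step sketch with Coqui TTS; the model name is taken from the feature table.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(
    text="Hello from the voice assistant.",
    file_path="static/tts_output.wav",   # served by Flask from /static/
)
```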
- Input: A reference voice `.wav` file + target text
- Process:
- Extract a speaker embedding from the input voice.
- Use a multi-speaker TTS model to synthesize speech that matches the tone and identity of the reference speaker.
- Output: Synthetic speech in the cloned voice
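A hedged sketch of this step with Coqui's multi-speaker YourTTS checkpoint, consistent with the `speaker_wav` synthesis named in the feature table; both file paths are hypothetical:

```python
# Voice cloning sketch: the speaker embedding is extracted internally
# from speaker_wav; paths are placeholders.
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="This sentence is spoken in the cloned voice.",
    speaker_wav="uploads/speaker_sample.wav",  # 3-5 s reference recording
    language="en",
    file_path="static/cloned_output.wav",
)
```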
- Input: A short audio segment
- Process:
- Extract MFCC (Mel-Frequency Cepstral Coefficients) features.
- Feed the MFCC vector into a 2-layer CNN or classifier.
- Predict one of several emotion classes, e.g., `angry`, `happy`, `sad`.
- Output: Detected emotion label (among 8 predefined classes)
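The MFCC-to-CNN flow looks roughly like the sketch below. The exact architecture stored in `emotion_cnn.pth` is not reproduced here, so the two-layer CNN is a stand-in under that assumption:

```python
# Illustrative MFCC + CNN inference; the network is a stand-in, not the
# exact architecture saved in emotion_cnn.pth.
import librosa
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),   # makes flatten size input-independent
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))  # dynamic flatten

audio, sr = librosa.load("uploads/clip.wav", sr=22050)   # placeholder path
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)   # shape (40, frames)
x = torch.tensor(mfcc, dtype=torch.float32)[None, None]  # (1, 1, 40, T)

model = EmotionCNN().eval()
# With a matching architecture, the trained weights would be restored via:
# model.load_state_dict(torch.load("emotion_cnn.pth"))
with torch.no_grad():
    emotion_idx = model(x).argmax(dim=1).item()  # index into the 8 labels
```

The `AdaptiveAvgPool2d` + `flatten(1)` combination is also what resolves the "audio shape mismatch" issue listed under Known Issues, since the flattened size no longer depends on clip length.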
- Input: Spoken question from the user
- Process:
- Whisper transcribes the spoken query into text.
- The query is embedded using SentenceTransformer.
- A vector search is performed using ChromaDB to find relevant documents.
- A language model (LLM) generates the final answer using the retrieved context.
- Output: Answer in natural language (text)
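A compact sketch of the retrieval step, assuming the `all-MiniLM-L6-v2` SentenceTransformer model and an in-memory ChromaDB client; the documents, IDs, and final LLM call are placeholders:

```python
# RAG retrieval sketch: embed with SentenceTransformer, search with ChromaDB.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
collection = chromadb.Client().create_collection("docs")

docs = ["Refunds are processed within 14 days.", "Support is open 9-17 CET."]
collection.add(
    documents=docs,
    embeddings=embedder.encode(docs).tolist(),
    ids=[f"doc{i}" for i in range(len(docs))],
)

question = "How long do refunds take?"               # Whisper transcript
hits = collection.query(
    query_embeddings=[embedder.encode(question).tolist()],
    n_results=1,
)
context = hits["documents"][0][0]   # feed this context into the LLM prompt
```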
- Input: Long-form `.wav` file (e.g., podcast, recorded lecture)
- Process:
- Whisper transcribes the full audio into text.
- The transcript is chunked into manageable segments.
- Each segment is summarized using a model like BART or DistilBART.
- Output: Final summary as a paragraph or structured bullet points.
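A sketch of the chunk-and-summarize loop, using the `sshleifer/distilbart-cnn-12-6` checkpoint from the tech stack table; the word-based chunk size is an assumption chosen to stay under the model's input limit:

```python
# Podcast summarization sketch: Whisper transcript -> word chunks -> DistilBART.
import whisper
from transformers import pipeline

transcript = whisper.load_model("base").transcribe("uploads/podcast.wav")["text"]
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

words = transcript.split()
chunks = [" ".join(words[i:i + 500]) for i in range(0, len(words), 500)]
parts = [
    summarizer(chunk, max_length=120, min_length=30)[0]["summary_text"]
    for chunk in chunks
]
print(" ".join(parts))   # final paragraph-style summary
```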
| Issue | Cause | Solution |
|---|---|---|
| Whisper FP16 warning | No GPU | Ignore, or use a GPU for speed |
| `punkt` not found | Missing NLTK tokenizer | Run `nltk.download('punkt')` |
| Audio shape mismatch | CNN flatten mismatch | Use dynamic flatten in the CNN |
| ffmpeg not found | Whisper depends on it | Install ffmpeg and add it to PATH |
- Add real-time streaming voice interface
- WebSocket for fast speech interaction
- Export as REST API (for mobile use)
- Integrate multilingual support (Vietnamese, etc.)
- User auth system for personalized interaction
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions to improve this project are welcome! Feel free to fork the repo, open a pull request, or file an issue with ideas.
- Contact for work: Nguyễn Công Phát – congphatnguyen.work@gmail.com
- GitHub
