b-smyers/voice-agent-framework

Voice Agent Framework

A flexible, modular Python framework for building voice-activated assistants by mixing and matching STT (Speech-to-Text), TTS (Text-to-Speech), and LLM (Large Language Model) backends, whether open source or proprietary.

Features

✔️ Modular backend swapping — Easily swap STT, TTS, and LLM providers without changing core logic.
✔️ Supports local and cloud providers — Mix offline open-source providers with cloud APIs in a single pipeline.
✔️ Wake word activation — Trigger recording hands-free using configurable wake words via Porcupine.
✔️ Start/stop recording tones — Audible tones signal when recording starts and stops, with automatic silence detection.
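As a rough illustration of what backend swapping can look like, here is a minimal sketch. It is not the framework's actual API: the STT, LLM, and TTS protocols, the Agent class, its respond method, and the three stand-in providers are all hypothetical names invented for this example. The idea is only that the pipeline depends on small interfaces, so any provider that satisfies them can be dropped in without touching the core logic.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical interfaces; the real framework may define these differently.
class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class Agent:
    """Hypothetical pipeline: audio in -> transcript -> reply -> audio out."""
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        reply = self.llm.complete(text)
        return self.tts.synthesize(reply)


# Stand-in providers so the sketch runs without any model or API key.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperLLM:
    def complete(self, prompt: str) -> str:
        return prompt.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


agent = Agent(stt=EchoSTT(), llm=UpperLLM(), tts=BytesTTS())
```

Replacing one stand-in with a different class that exposes the same method leaves Agent.respond unchanged, which is the property the first feature describes.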

Providers

A variety of STT, TTS, and LLM providers are supported out of the box, allowing you to experiment with both local and cloud-based models to match different use cases.

Supported provider types:

  • 🗣️ Speech-to-Text (STT): Whisper, Silero
  • 💬 Large Language Model (LLM): Gemini, ChatGPT
  • 🔊 Text-to-Speech (TTS): ElevenLabs, Silero, Piper, Gemini

For detailed information on each provider — including features, usage notes, and recommendations — check the full provider reference.

Usage

  1. Clone the repo
git clone git@github.com:b-smyers/voice-agent-framework.git
cd voice-agent-framework
  2. Create a Python 3.10 virtual environment
python3.10 -m venv venv/
source venv/bin/activate
  3. Install dependencies
pip install -r requirements.txt
  4. (Optional) Configure a custom Agent in main.py by swapping out different providers
  5. Set environment variables in .env
    • cp .env-sample .env
    • The default Agent configuration requires a Picovoice API key and a Gemini API key. Edit .env to add them. This guide assumes you have access to both.
  6. Run the application
python main.py
  7. Presto! After successful setup, you can now use your Assistant:
    • Say the wake word: "ok agent".
    • Wait for the start tone, which means the microphone is listening.
    • Ask your question naturally.
    • When you stop speaking, the system will detect silence and play a stop tone.
    • After processing your request, the Assistant will reply using text-to-speech.
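If the application fails at startup, a missing API key is a likely cause. The sketch below is one way to check for that before running main.py. The variable names PICOVOICE_API_KEY and GEMINI_API_KEY are assumptions made for illustration, so check .env-sample for the names the project actually uses. The helper missing_keys is also hypothetical and not part of the framework.

```python
import os
from typing import Mapping, Sequence

# Assumed names for illustration only; .env-sample lists the real ones.
REQUIRED_KEYS = ("PICOVOICE_API_KEY", "GEMINI_API_KEY")


def missing_keys(env: Mapping[str, str],
                 required: Sequence[str] = REQUIRED_KEYS) -> list:
    """Return the required names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Example with a plain dict standing in for the environment.
sample_env = {"PICOVOICE_API_KEY": "abc123", "GEMINI_API_KEY": ""}
print(missing_keys(sample_env))  # ['GEMINI_API_KEY']
```

Calling missing_keys(os.environ) performs the same check against the real process environment, assuming the values from .env have already been loaded into it.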

Note

The recording stays open while you're speaking. If there’s background noise or if you don't pause clearly, it may stay active longer than expected.
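To see why background noise can keep the recording open, consider a simple energy-threshold detector. This is a sketch of a common technique, not necessarily how this framework detects silence. The function names, the threshold of 500, and the limit of 3 silent frames are all hypothetical values chosen for the example.

```python
import math
from typing import Iterable, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def frames_until_silence(frames: Iterable[Sequence[int]],
                         threshold: float = 500.0,
                         max_silent_frames: int = 3) -> int:
    """Return how many frames are consumed before recording would stop.

    Recording stops once max_silent_frames consecutive frames fall below
    threshold. Any louder frame resets the counter. If the stream ends
    first, the total number of frames is returned.
    """
    silent_run = 0
    consumed = 0
    for frame in frames:
        consumed += 1
        if rms(frame) < threshold:
            silent_run += 1
            if silent_run >= max_silent_frames:
                break
        else:
            silent_run = 0  # speech or noise: start counting again
    return consumed


speech = [1000] * 160  # loud frame
quiet = [10] * 160     # near-silent frame

clean_pause = [speech, speech, quiet, quiet, quiet, speech]
noisy_pause = [speech, quiet, quiet, speech, quiet, quiet, quiet]
```

With these inputs, the clean pause ends the recording after 5 frames, while the noisy pause takes 7 because the loud frame in the middle resets the silent-frame counter. Under this kind of scheme, pausing clearly in a quiet room gives the fastest cutoff.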
