Mod³ — Model Modality Modulator

Give your AI agent a voice.

Mod³ is a Python MCP server that provides text-to-speech for Claude Code, Cursor, and other MCP-compatible AI tools. It runs four TTS engines locally on Apple Silicon, generates speech faster than realtime, and returns immediately so the agent keeps working while audio plays.

What it does

  • Non-blocking speech -- speak() returns immediately with a job ID. Audio plays in the background. The agent writes code while it talks.
  • Queue-aware output -- Every speak() return includes queue position, estimated wait time, and active job state. The agent knows what's playing without making a separate status call.
  • Barge-in detection -- VAD (voice activity detection) monitors the microphone. If the user starts talking, playback stops and the agent is notified. No talking over people.
  • Turn-taking -- Bidirectional awareness of who's speaking. The agent can check user state before deciding to speak or wait.
  • Multi-model routing -- Four TTS engines behind one interface. Voice name determines which engine handles the request.
  • Adaptive buffering -- EMA-based arrival rate tracking with dynamic startup threshold. Gapless playback under normal load, graceful degradation under GPU contention.
  • Structured metrics -- Every call returns TTFA, RTF, per-chunk timing, buffer health, underrun counts, and memory usage. The agent can diagnose its own audio quality.
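A minimal sketch of how an agent might consume the queue-aware return described above. The field names (job_id, queue_position, estimated_wait_s, active_job) are illustrative assumptions, not the actual Mod³ schema.

```python
# Hypothetical shape of a speak() return -- field names are assumptions,
# not the actual Mod3 schema.
result = {
    "job_id": "job-42",
    "queue_position": 1,
    "estimated_wait_s": 2.3,
    "active_job": {"job_id": "job-41", "state": "playing"},
}

def should_queue_more(result, max_wait_s=5.0):
    """Decide whether to queue another utterance now or hold off."""
    return result["estimated_wait_s"] <= max_wait_s

print(should_queue_more(result))  # True: 2.3s wait is under the 5s budget
```

Because the return already carries queue state, the agent can make this decision without a separate status call.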

Engines

Engine      Model                     Size   TTFA     Control Surfaces
Kokoro      Kokoro-82M-bf16           82M    ~60ms    Speed, emphasis (ALL CAPS), pacing (punctuation)
Voxtral     Voxtral-4B-TTS-mlx-4bit   4B     ~500ms   20 voice presets, multi-language
Chatterbox  chatterbox-4bit           ~1B    ~60ms    Emotion/exaggeration (0-1), voice cloning
Spark       Spark-TTS-0.5B-bf16       0.5B   ~1s      Pitch (5-level), speed, gender

Models are downloaded on first use via HuggingFace Hub.
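Since the voice name alone selects the engine, routing can be as simple as a lookup table. This sketch uses voice names from the examples in this README plus one hypothetical Spark voice; the real registry and its default are assumptions.

```python
# Voice-name -> engine routing sketch. The mapping below is inferred from
# the README's examples; "spark_female" and the kokoro default are
# hypothetical, not the actual Mod3 registry.
VOICE_TO_ENGINE = {
    "bm_lewis": "kokoro",
    "am_michael": "kokoro",
    "casual_male": "voxtral",
    "chatterbox": "chatterbox",
    "spark_female": "spark",
}

def route(voice: str) -> str:
    # Fall back to the default engine for unknown voices (assumed behavior).
    return VOICE_TO_ENGINE.get(voice, "kokoro")

print(route("casual_male"))  # voxtral
```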

Quick Start

git clone https://github.com/cogos-dev/mod3.git
cd mod3
./setup.sh

Then add to your project's .mcp.json:

{
  "mcpServers": {
    "mod3": {
      "command": "/path/to/mod3/.venv/bin/python",
      "args": ["/path/to/mod3/server.py"]
    }
  }
}

MCP Tools

speak(text, voice?, stream?, speed?, emotion?)

Synthesize text and play through speakers. Returns immediately with a job ID, queue state, and estimated wait time.

speak("Hello world")                                        → default voice (bm_lewis @ 1.25x)
speak("Hello world", voice="casual_male")                   → Voxtral
speak("Hello world", voice="chatterbox", emotion=0.8)       → Chatterbox with high emotion
speak("Hello world", voice="am_michael", speed=1.4)         → Kokoro fast

speech_status(job_id?, verbose?)

Check if speech is still playing, or get metrics from the last completed job. Pass verbose=True for per-chunk detail.
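The non-blocking pattern this enables looks roughly like the following. speak() and speech_status() are stubbed here with assumed return fields (job_id, done); in practice they are MCP tool calls and the loop body is real agent work.

```python
# Agent-side pattern: fire speak(), keep working, poll speech_status().
# Both functions are stubs with assumed fields, standing in for MCP calls.
import time

def speak(text):  # stub for the MCP tool
    return {"job_id": "job-1", "estimated_wait_s": 0.0}

_done = iter([False, False, True])

def speech_status(job_id):  # stub: reports done on the third poll
    return {"job_id": job_id, "done": next(_done)}

job = speak("Tests passed; starting the deploy now.")
polls = 0
while not speech_status(job["job_id"])["done"]:
    polls += 1          # the agent would do real work here, not just count
    time.sleep(0.01)
print(polls)  # 2
```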

stop()

Interrupt current speech immediately.

vad_check()

Check microphone for voice activity. Returns whether the user is currently speaking, enabling the agent to wait for a natural pause before responding.
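A turn-taking sketch built on that check: poll until the user pauses, then speak. vad_check() is stubbed with an assumed user_speaking field; the polling interval and retry cap are illustrative choices, not Mod³ behavior.

```python
# Turn-taking sketch: wait for a pause in user speech before replying.
# vad_check is a stub with an assumed return shape.
import time

_user_speaking = iter([True, True, False])

def vad_check():  # stub: user talks for two checks, then pauses
    return {"user_speaking": next(_user_speaking)}

def wait_for_pause(poll_s=0.01, max_checks=50):
    """Return True once the user stops speaking, False if they never do."""
    for _ in range(max_checks):
        if not vad_check()["user_speaking"]:
            return True
        time.sleep(poll_s)
    return False

ok = wait_for_pause()
print(ok)  # True
```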

list_voices()

List all available voices grouped by engine, with control surface tags.

set_output_device(device?)

List audio output devices, or switch the active one mid-session.

diagnostics()

Show loaded engines, active jobs, output device, and last generation metrics.

Architecture

Two files:

  • server.py -- MCP tool definitions, multi-model registry, sentence chunking, non-blocking job management, queue-aware returns
  • adaptive_player.py -- Callback-based audio playback with EMA arrival rate tracking, adaptive startup threshold, and structured metrics collection

The adaptive player is model-agnostic. Any TTS engine that produces audio chunks feeds the same pipeline.
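The adaptive buffering idea can be sketched as follows, under an assumed rule: start playback once buffered audio covers the EMA-smoothed chunk arrival interval times a safety factor. This is not the actual adaptive_player.py logic, just one plausible shape of it.

```python
# EMA-based adaptive startup threshold -- a sketch, not the real player.
class AdaptiveBuffer:
    def __init__(self, alpha=0.2, safety=2.0):
        self.alpha = alpha        # EMA smoothing factor (assumed value)
        self.safety = safety      # buffer this many arrival intervals first
        self.ema_interval = None  # smoothed seconds between chunk arrivals
        self.buffered_s = 0.0     # seconds of audio queued but unplayed

    def on_chunk(self, chunk_duration_s, arrival_interval_s):
        # Track arrival rate with an exponential moving average.
        if self.ema_interval is None:
            self.ema_interval = arrival_interval_s
        else:
            self.ema_interval = (self.alpha * arrival_interval_s
                                 + (1 - self.alpha) * self.ema_interval)
        self.buffered_s += chunk_duration_s

    def ready_to_start(self):
        # Start once the buffer outlasts the expected gap to the next chunk.
        return (self.ema_interval is not None
                and self.buffered_s >= self.safety * self.ema_interval)

buf = AdaptiveBuffer()
buf.on_chunk(0.5, 0.3)   # chunks arriving faster than realtime
buf.on_chunk(0.5, 0.3)
print(buf.ready_to_start())  # True: 1.0s buffered >= 2.0 * 0.3s
```

Under GPU contention, arrival intervals grow, the EMA rises, and the startup threshold stretches automatically, which is the "graceful degradation" behavior described above.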

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • espeak-ng (brew install espeak-ng) -- required for Kokoro's phonemizer

Using Voice as a Modality

See skills/voice/SKILL.md for the full guide on dual-modal communication -- when to speak vs write, non-blocking patterns, reading metrics, and anti-patterns.

Voice carries the ephemeral (context, intent, tone). Text carries the persistent (code, data, decisions). Both channels stay active simultaneously.

Ecosystem

Mod³ is the voice layer in the CogOS ecosystem. It integrates as a modality channel -- the kernel routes intents to Mod³ when voice output is appropriate. Works standalone without CogOS.

Repo           Purpose
cogos          The daemon
mod3           Voice -- this repo
constellation  Distributed identity and trust
skills         Agent skill library
charts         Helm charts for deployment
desktop        macOS dashboard app

License

MIT
