Mod³ — Model Modality Modulator

Give your AI agent a voice.

Mod³ is a Python MCP server that provides text-to-speech for Claude Code, Cursor, and other MCP-compatible AI tools. It runs four TTS engines locally on Apple Silicon, generates speech faster than realtime, and returns immediately so the agent keeps working while audio plays.

What it does

  • Non-blocking speech -- speak() returns immediately with a job ID. Audio plays in the background. The agent writes code while it talks.
  • Queue-aware output -- Every speak() return includes queue position, estimated wait time, and active job state. The agent knows what's playing without making a separate status call.
  • Barge-in detection -- VAD (voice activity detection) monitors the microphone. If the user starts talking, playback stops and the agent is notified. No talking over people.
  • Turn-taking -- Bidirectional awareness of who's speaking. The agent can check user state before deciding to speak or wait.
  • Multi-model routing -- Four TTS engines behind one interface. Voice name determines which engine handles the request.
  • Adaptive buffering -- EMA-based arrival rate tracking with dynamic startup threshold. Gapless playback under normal load, graceful degradation under GPU contention.
  • Structured metrics -- Every call returns TTFA, RTF, per-chunk timing, buffer health, underrun counts, and memory usage. The agent can diagnose its own audio quality.
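A minimal sketch of how an agent might consume the queue-aware return described above. The field names (job_id, queue_position, estimated_wait_s, active_job) are illustrative assumptions, not the actual Mod³ schema.

```python
# Hypothetical shape of a speak() return -- field names are assumptions,
# not the actual Mod3 schema.
result = {
    "job_id": "job-42",
    "queue_position": 1,
    "estimated_wait_s": 2.3,
    "active_job": {"job_id": "job-41", "state": "playing"},
}

def should_queue_more(result, max_wait_s=5.0):
    """Decide whether to queue another utterance now or hold off."""
    return result["estimated_wait_s"] <= max_wait_s

print(should_queue_more(result))  # True: 2.3s wait is under the 5s budget
```

Because the return already carries queue state, the agent can make this decision without a separate status call.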

Engines

Engine      Model                     Size   TTFA     Control Surfaces
Kokoro      Kokoro-82M-bf16           82M    ~60ms    Speed, emphasis (ALL CAPS), pacing (punctuation)
Voxtral     Voxtral-4B-TTS-mlx-4bit   4B     ~500ms   20 voice presets, multi-language
Chatterbox  chatterbox-4bit           ~1B    ~60ms    Emotion/exaggeration (0-1), voice cloning
Spark       Spark-TTS-0.5B-bf16       0.5B   ~1s      Pitch (5-level), speed, gender

Models are downloaded on first use via HuggingFace Hub.
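Since the voice name alone selects the engine, routing can be as simple as a lookup table. This sketch uses voice names from the examples in this README plus one hypothetical Spark voice; the real registry and its default are assumptions.

```python
# Voice-name -> engine routing sketch. The mapping below is inferred from
# the README's examples; "spark_female" and the kokoro default are
# hypothetical, not the actual Mod3 registry.
VOICE_TO_ENGINE = {
    "bm_lewis": "kokoro",
    "am_michael": "kokoro",
    "casual_male": "voxtral",
    "chatterbox": "chatterbox",
    "spark_female": "spark",
}

def route(voice: str) -> str:
    # Fall back to the default engine for unknown voices (assumed behavior).
    return VOICE_TO_ENGINE.get(voice, "kokoro")

print(route("casual_male"))  # voxtral
```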

Quick Start

git clone https://github.com/cogos-dev/mod3.git
cd mod3
./setup.sh

Then add to your project's .mcp.json:

{
  "mcpServers": {
    "mod3": {
      "command": "/path/to/mod3/.venv/bin/python",
      "args": ["/path/to/mod3/server.py"]
    }
  }
}

MCP Tools

speak(text, voice?, stream?, speed?, emotion?)

Synthesize text and play through speakers. Returns immediately with a job ID, queue state, and estimated wait time.

speak("Hello world")                                        → default voice (bm_lewis @ 1.25x)
speak("Hello world", voice="casual_male")                   → Voxtral
speak("Hello world", voice="chatterbox", emotion=0.8)       → Chatterbox with high emotion
speak("Hello world", voice="am_michael", speed=1.4)         → Kokoro fast

speech_status(job_id?, verbose?)

Check if speech is still playing, or get metrics from the last completed job. Pass verbose=True for per-chunk detail.
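The non-blocking pattern this enables looks roughly like the following. speak() and speech_status() are stubbed here with assumed return fields (job_id, done); in practice they are MCP tool calls and the loop body is real agent work.

```python
# Agent-side pattern: fire speak(), keep working, poll speech_status().
# Both functions are stubs with assumed fields, standing in for MCP calls.
import time

def speak(text):  # stub for the MCP tool
    return {"job_id": "job-1", "estimated_wait_s": 0.0}

_done = iter([False, False, True])

def speech_status(job_id):  # stub: reports done on the third poll
    return {"job_id": job_id, "done": next(_done)}

job = speak("Tests passed; starting the deploy now.")
polls = 0
while not speech_status(job["job_id"])["done"]:
    polls += 1          # the agent would do real work here, not just count
    time.sleep(0.01)
print(polls)  # 2
```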

stop()

Interrupt current speech immediately.

vad_check()

Check microphone for voice activity. Returns whether the user is currently speaking, enabling the agent to wait for a natural pause before responding.
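A turn-taking sketch built on that check: poll until the user pauses, then speak. vad_check() is stubbed with an assumed user_speaking field; the polling interval and retry cap are illustrative choices, not Mod³ behavior.

```python
# Turn-taking sketch: wait for a pause in user speech before replying.
# vad_check is a stub with an assumed return shape.
import time

_user_speaking = iter([True, True, False])

def vad_check():  # stub: user talks for two checks, then pauses
    return {"user_speaking": next(_user_speaking)}

def wait_for_pause(poll_s=0.01, max_checks=50):
    """Return True once the user stops speaking, False if they never do."""
    for _ in range(max_checks):
        if not vad_check()["user_speaking"]:
            return True
        time.sleep(poll_s)
    return False

ok = wait_for_pause()
print(ok)  # True
```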

list_voices()

List all available voices grouped by engine, with control surface tags.

set_output_device(device?)

List audio output devices, or switch the active one mid-session.

diagnostics()

Show loaded engines, active jobs, output device, and last generation metrics.

Architecture

Two files:

  • server.py -- MCP tool definitions, multi-model registry, sentence chunking, non-blocking job management, queue-aware returns
  • adaptive_player.py -- Callback-based audio playback with EMA arrival rate tracking, adaptive startup threshold, and structured metrics collection

The adaptive player is model-agnostic. Any TTS engine that produces audio chunks feeds the same pipeline.
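The adaptive buffering idea can be sketched as follows, under an assumed rule: start playback once buffered audio covers the EMA-smoothed chunk arrival interval times a safety factor. This is not the actual adaptive_player.py logic, just one plausible shape of it.

```python
# EMA-based adaptive startup threshold -- a sketch, not the real player.
class AdaptiveBuffer:
    def __init__(self, alpha=0.2, safety=2.0):
        self.alpha = alpha        # EMA smoothing factor (assumed value)
        self.safety = safety      # buffer this many arrival intervals first
        self.ema_interval = None  # smoothed seconds between chunk arrivals
        self.buffered_s = 0.0     # seconds of audio queued but unplayed

    def on_chunk(self, chunk_duration_s, arrival_interval_s):
        # Track arrival rate with an exponential moving average.
        if self.ema_interval is None:
            self.ema_interval = arrival_interval_s
        else:
            self.ema_interval = (self.alpha * arrival_interval_s
                                 + (1 - self.alpha) * self.ema_interval)
        self.buffered_s += chunk_duration_s

    def ready_to_start(self):
        # Start once the buffer outlasts the expected gap to the next chunk.
        return (self.ema_interval is not None
                and self.buffered_s >= self.safety * self.ema_interval)

buf = AdaptiveBuffer()
buf.on_chunk(0.5, 0.3)   # chunks arriving faster than realtime
buf.on_chunk(0.5, 0.3)
print(buf.ready_to_start())  # True: 1.0s buffered >= 2.0 * 0.3s
```

Under GPU contention, arrival intervals grow, the EMA rises, and the startup threshold stretches automatically, which is the "graceful degradation" behavior described above.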

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • espeak-ng (brew install espeak-ng) -- required for Kokoro's phonemizer

Using Voice as a Modality

See skills/voice/SKILL.md for the full guide on dual-modal communication -- when to speak vs write, non-blocking patterns, reading metrics, and anti-patterns.

Voice carries the ephemeral (context, intent, tone). Text carries the persistent (code, data, decisions). Both channels stay active simultaneously.

Ecosystem

Mod³ is the voice layer in the CogOS ecosystem. It integrates as a modality channel -- the kernel routes intents to Mod³ when voice output is appropriate. Works standalone without CogOS.

Repo           Purpose
cogos          The daemon
mod3           Voice -- this repo
constellation  Distributed identity and trust
skills         Agent skill library
charts         Helm charts for deployment
desktop        macOS dashboard app

License

MIT
