Commit 99a700a

docs: rewrite README for practical, no-buzzwords tone
Lead with what it does (give your agent a voice), document new queue-aware output, barge-in detection, and turn-taking features. Remove cognitive metaphors and taglines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7c43b97 commit 99a700a

1 file changed: README.md (32 additions, 15 deletions)
@@ -1,18 +1,18 @@
 # Mod³ — Model Modality Modulator
 
-> Part of the [CogOS ecosystem](https://github.com/cogos-dev)**how it ACTS**
+Give your AI agent a voice.
 
-Mod³ translates between thinking and acting. It's the modality bus for CogOS — the layer where cognitive intents become physical signals and physical signals become cognitive events. Voice is the first modality. The architecture supports any modality.
-
-Currently: an MCP server that runs local TTS models on Apple Silicon, with adaptive buffering, multi-model routing, non-blocking speech, voice activity detection, and structured metrics. Built for [Claude Code](https://claude.ai/claude-code), works with any MCP-compatible client.
+Mod³ is a Python MCP server that provides text-to-speech for Claude Code, Cursor, and other MCP-compatible AI tools. It runs four TTS engines locally on Apple Silicon, generates speech faster than realtime, and returns immediately so the agent keeps working while audio plays.
 
 ## What it does
 
-- **Non-blocking speech** — `speak()` returns immediately. Audio plays in the background while the agent keeps working. Two output channels: voice for the ephemeral, text for the persistent.
-- **Multi-model routing** — Four TTS engines, one interface. Voice name auto-routes to the right model.
-- **Adaptive buffering** — EMA-based arrival rate tracking with dynamic startup threshold. Gapless playback under normal conditions, graceful degradation under GPU contention.
-- **Structured metrics** — Every call returns TTFA, RTF, per-chunk timing, buffer health, underrun counts, memory usage. The agent can diagnose its own audio quality.
-- **Sentence chunking** — Text is split at sentence boundaries for natural prosody. Feathered edges (fade-out + breath gap) between sentences.
+- **Non-blocking speech** -- `speak()` returns immediately with a job ID. Audio plays in the background. The agent writes code while it talks.
+- **Queue-aware output** -- Every `speak()` return includes queue position, estimated wait time, and active job state. The agent knows what's playing without making a separate status call.
+- **Barge-in detection** -- VAD (voice activity detection) monitors the microphone. If the user starts talking, playback stops and the agent is notified. No talking over people.
+- **Turn-taking** -- Bidirectional awareness of who's speaking. The agent can check user state before deciding to speak or wait.
+- **Multi-model routing** -- Four TTS engines behind one interface. Voice name determines which engine handles the request.
+- **Adaptive buffering** -- EMA-based arrival rate tracking with dynamic startup threshold. Gapless playback under normal load, graceful degradation under GPU contention.
+- **Structured metrics** -- Every call returns TTFA, RTF, per-chunk timing, buffer health, underrun counts, and memory usage. The agent can diagnose its own audio quality.
 
 ## Engines
 
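To make the queue-aware `speak()` return concrete, here is a minimal agent-side sketch of acting on it. The field names (`job_id`, `queue_position`, `est_wait_s`, `active_job`) are assumptions for illustration, not Mod³'s documented schema:

```python
# Hypothetical sketch: deciding whether a queued utterance is still worth
# playing, based on the queue state that speak() returns. All field names
# here are assumed, not taken from Mod3's actual payload.

def should_skip_speech(result: dict, max_wait_s: float = 5.0) -> bool:
    """Return True when the queue is deep enough that the audio would
    play long after the moment has passed; the agent can then cancel
    and fall back to text output instead."""
    return (
        result.get("queue_position", 0) > 0
        and result.get("est_wait_s", 0.0) > max_wait_s
    )

# Example payload shaped like the README describes: job ID, queue
# position, estimated wait, and active job state in a single return.
result = {
    "job_id": "job-42",
    "queue_position": 2,
    "est_wait_s": 9.5,
    "active_job": {"job_id": "job-40", "voice": "bm_lewis"},
}

print(should_skip_speech(result))  # True: deep queue, long wait
```

The point of the combined return is visible here: the agent makes this decision from the `speak()` response alone, with no extra status round-trip.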
@@ -50,7 +50,7 @@ Then add to your project's `.mcp.json`:
 
 ### `speak(text, voice?, stream?, speed?, emotion?)`
 
-Synthesize text and play through speakers. Returns immediately with a job ID.
+Synthesize text and play through speakers. Returns immediately with a job ID, queue state, and estimated wait time.
 
 ```
 speak("Hello world") → default voice (bm_lewis @ 1.25x)
@@ -67,6 +67,10 @@ Check if speech is still playing, or get metrics from the last completed job. Pa
 
 Interrupt current speech immediately.
 
+### `vad_check()`
+
+Check microphone for voice activity. Returns whether the user is currently speaking, enabling the agent to wait for a natural pause before responding.
+
 ### `list_voices()`
 
 List all available voices grouped by engine, with control surface tags.
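The turn-taking pattern the new `vad_check()` tool enables can be sketched as a polling loop. Here `client.call(...)` is a stand-in for whatever MCP client invocation the agent uses, and the `{"user_speaking": bool}` return shape is an assumption:

```python
import time

# Hypothetical turn-taking helper built on the vad_check() tool.
# `client` is any object with a call(tool_name) -> dict method; the
# return shape {"user_speaking": bool} is assumed for illustration.

def wait_for_pause(client, timeout_s: float = 10.0, poll_s: float = 0.25) -> bool:
    """Poll vad_check() until the user stops speaking or we time out.

    Returns True if the floor is free (safe to speak), False if we
    gave up waiting and should fall back to text output.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = client.call("vad_check")
        if not status.get("user_speaking", False):
            return True  # natural pause detected
        time.sleep(poll_s)
    return False
```

An agent would call this before `speak()` when barge-in events indicate the user has been talking, rather than speaking blind and relying on barge-in to cut it off.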
@@ -83,22 +87,35 @@ Show loaded engines, active jobs, output device, and last generation metrics.
 
 Two files:
 
-- **`server.py`** — MCP tool definitions, multi-model registry, sentence chunking, non-blocking job management
-- **`adaptive_player.py`** — Callback-based audio playback with EMA arrival rate tracking, adaptive startup threshold, and structured metrics collection
+- **`server.py`** -- MCP tool definitions, multi-model registry, sentence chunking, non-blocking job management, queue-aware returns
+- **`adaptive_player.py`** -- Callback-based audio playback with EMA arrival rate tracking, adaptive startup threshold, and structured metrics collection
 
 The adaptive player is model-agnostic. Any TTS engine that produces audio chunks feeds the same pipeline.
 
 ## Requirements
 
 - macOS with Apple Silicon (M1/M2/M3/M4)
 - Python 3.10+
-- espeak-ng (`brew install espeak-ng`) required for Kokoro's phonemizer
+- espeak-ng (`brew install espeak-ng`) -- required for Kokoro's phonemizer
 
 ## Using Voice as a Modality
 
-See [`skills/voice/SKILL.md`](skills/voice/SKILL.md) for the full guide on dual-modal communication — when to speak vs write, non-blocking patterns, reading metrics, and anti-patterns.
+See [`skills/voice/SKILL.md`](skills/voice/SKILL.md) for the full guide on dual-modal communication -- when to speak vs write, non-blocking patterns, reading metrics, and anti-patterns.
+
+Voice carries the ephemeral (context, intent, tone). Text carries the persistent (code, data, decisions). Both channels active simultaneously.
+
+## Ecosystem
+
+Mod³ is the voice layer in the [CogOS](https://github.com/cogos-dev/cogos) ecosystem. It integrates as a modality channel -- the kernel routes intents to Mod³ when voice output is appropriate. Works standalone without CogOS.
 
-The short version: voice carries the ephemeral (context, intent, tone). Text carries the persistent (code, data, decisions). Both channels active simultaneously. That's the point.
+| Repo | Purpose |
+|------|---------|
+| [cogos](https://github.com/cogos-dev/cogos) | The daemon |
+| **mod3** | **Voice -- this repo** |
+| [constellation](https://github.com/cogos-dev/constellation) | Distributed identity and trust |
+| [skills](https://github.com/cogos-dev/skills) | Agent skill library |
+| [charts](https://github.com/cogos-dev/charts) | Helm charts for deployment |
+| [desktop](https://github.com/cogos-dev/desktop) | macOS dashboard app |
 
 ## License
 
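The adaptive buffering this diff describes (EMA-based arrival-rate tracking with a dynamic startup threshold) can be sketched roughly as below. The smoothing factor and the 2x safety margin are illustrative assumptions, not `adaptive_player.py`'s actual policy:

```python
class ArrivalTracker:
    """Track audio-chunk arrival rate with an exponential moving average.

    The playback side delays starting until enough audio is buffered to
    ride out the observed gap between chunks: slow arrivals (e.g. GPU
    contention) raise the startup threshold, fast arrivals keep it at a
    single chunk. Alpha and the 2x margin are illustrative choices.
    """

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.ema_interval_s = None  # smoothed seconds between chunk arrivals

    def observe(self, interval_s: float) -> None:
        """Fold one inter-chunk arrival interval into the EMA."""
        if self.ema_interval_s is None:
            self.ema_interval_s = interval_s
        else:
            self.ema_interval_s = (
                self.alpha * interval_s + (1 - self.alpha) * self.ema_interval_s
            )

    def startup_threshold_s(self, chunk_duration_s: float) -> float:
        """Seconds of audio to buffer before starting playback."""
        if self.ema_interval_s is None:
            return chunk_duration_s  # no data yet: one chunk is enough
        # If chunks arrive slower than they play out, buffer 2x the deficit.
        deficit = max(0.0, self.ema_interval_s - chunk_duration_s)
        return chunk_duration_s + 2.0 * deficit
```

This is the shape of "gapless playback under normal load, graceful degradation under contention": when arrivals keep pace with playback the threshold stays minimal, and when they lag, startup is delayed just enough to absorb the lag instead of underrunning mid-sentence.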