docs: rewrite README for practical, no-buzzwords tone
Lead with what it does (give your agent a voice), document new
queue-aware output, barge-in detection, and turn-taking features.
Remove cognitive metaphors and taglines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README.md
# Mod³ — Model Modality Modulator

Give your AI agent a voice.

Mod³ is a Python MCP server that provides text-to-speech for Claude Code, Cursor, and other MCP-compatible AI tools. It runs four TTS engines locally on Apple Silicon, generates speech faster than realtime, and returns immediately so the agent keeps working while audio plays.

## What it does

- **Non-blocking speech** -- `speak()` returns immediately with a job ID. Audio plays in the background. The agent writes code while it talks.
- **Queue-aware output** -- Every `speak()` return includes queue position, estimated wait time, and active job state. The agent knows what's playing without making a separate status call.
- **Barge-in detection** -- VAD (voice activity detection) monitors the microphone. If the user starts talking, playback stops and the agent is notified. No talking over people.
- **Turn-taking** -- Bidirectional awareness of who's speaking. The agent can check user state before deciding to speak or wait.
- **Multi-model routing** -- Four TTS engines behind one interface. Voice name determines which engine handles the request.
- **Adaptive buffering** -- EMA-based arrival rate tracking with dynamic startup threshold. Gapless playback under normal load, graceful degradation under GPU contention.
- **Structured metrics** -- Every call returns TTFA, RTF, per-chunk timing, buffer health, underrun counts, and memory usage. The agent can diagnose its own audio quality.

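The queue-aware, non-blocking return described above can be pictured with a toy model. This is an illustrative sketch only: the field names (`job_id`, `queue_position`, `est_wait_s`, `active_job`) and the words-per-minute duration estimate are assumptions, not Mod³'s actual schema.

```python
# Toy model of a queue-aware speak() return.
# Field names are illustrative, not Mod3's actual payload schema.
from dataclasses import dataclass, field
from itertools import count


@dataclass
class SpeechQueue:
    _jobs: list = field(default_factory=list)  # (job_id, est_duration_s)
    _ids: count = field(default_factory=count)

    def speak(self, text: str, wpm: float = 160.0) -> dict:
        # Rough duration estimate from word count and speaking rate.
        est_duration = len(text.split()) / wpm * 60.0
        job_id = next(self._ids)
        # Estimated wait = everything already queued must finish first.
        est_wait = sum(d for _, d in self._jobs)
        self._jobs.append((job_id, est_duration))
        return {
            "job_id": job_id,
            "queue_position": len(self._jobs) - 1,  # 0 = playing now
            "est_wait_s": round(est_wait, 2),
            "active_job": self._jobs[0][0],
        }


q = SpeechQueue()
first = q.speak("Build finished, all tests passing.")
second = q.speak("Starting the deploy now.")
assert first["queue_position"] == 0 and first["est_wait_s"] == 0.0
assert second["queue_position"] == 1 and second["active_job"] == first["job_id"]
```

The point of returning this alongside the job ID is that the agent can decide whether to queue another utterance or stay quiet without a second round-trip.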
## Engines

...

Then add to your project's `.mcp.json`:
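The diff elides the actual config block, so as a hedged sketch only: a standard `.mcp.json` entry might look like the following, where the server name `mod3`, the command, and the path are placeholders rather than Mod³'s documented values.

```json
{
  "mcpServers": {
    "mod3": {
      "command": "python",
      "args": ["/path/to/mod3/server.py"]
    }
  }
}
```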

...

- **`adaptive_player.py`** -- Callback-based audio playback with EMA arrival rate tracking, adaptive startup threshold, and structured metrics collection

The adaptive player is model-agnostic. Any TTS engine that produces audio chunks feeds the same pipeline.
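The adaptive startup behavior described above can be sketched in a few lines. This is a simplified illustration of the EMA idea, not `adaptive_player.py`'s actual implementation; the smoothing factor, chunk size, and threshold cap are made-up constants.

```python
# Sketch of EMA-based arrival-rate tracking with a dynamic startup threshold.
# Constants are illustrative, not the values used by adaptive_player.py.
class AdaptiveBuffer:
    def __init__(self, alpha: float = 0.2, chunk_s: float = 0.5):
        self.alpha = alpha        # EMA smoothing factor
        self.chunk_s = chunk_s    # seconds of audio per chunk
        self.ema_gap = None       # smoothed inter-arrival time
        self.buffered = 0         # chunks waiting to play

    def on_chunk(self, gap_s: float) -> None:
        """Record one chunk arriving gap_s seconds after the previous one."""
        self.ema_gap = gap_s if self.ema_gap is None else (
            self.alpha * gap_s + (1 - self.alpha) * self.ema_gap)
        self.buffered += 1

    def startup_threshold(self) -> int:
        """Chunks to hold before starting playback.

        Faster-than-realtime arrivals (gap <= chunk_s) need no headroom;
        the slower the arrivals, the more we buffer before starting."""
        if self.ema_gap is None or self.ema_gap <= self.chunk_s:
            return 1
        deficit = self.ema_gap / self.chunk_s  # arrival slowdown factor
        return min(8, max(1, round(deficit * 2)))

    def ready(self) -> bool:
        return self.buffered >= self.startup_threshold()


fast = AdaptiveBuffer()
fast.on_chunk(0.1)            # GPU idle: chunks arrive faster than realtime
assert fast.startup_threshold() == 1 and fast.ready()

slow = AdaptiveBuffer()
for _ in range(3):
    slow.on_chunk(1.0)        # GPU contention: 2x slower than realtime
assert slow.startup_threshold() > 1   # buffer extra headroom before playing
```

Under contention the threshold grows, trading a little startup latency for gapless playback; under normal load it stays at one chunk.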
## Requirements
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- espeak-ng (`brew install espeak-ng`) -- required for Kokoro's phonemizer
## Using Voice as a Modality

See [`skills/voice/SKILL.md`](skills/voice/SKILL.md) for the full guide on dual-modal communication -- when to speak vs write, non-blocking patterns, reading metrics, and anti-patterns.

Voice carries the ephemeral (context, intent, tone). Text carries the persistent (code, data, decisions). Both channels active simultaneously.

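To make "reading metrics" concrete, here is a toy calculation of TTFA and RTF from per-chunk timings. The field names and the input shape are illustrative assumptions, not Mod³'s actual metrics payload.

```python
# Toy TTFA / RTF calculation from per-chunk synthesis timings.
# Illustrative only; Mod3's real metrics payload may differ.
def summarize(chunks):
    """chunks: list of (arrival_time_s, audio_seconds), times relative to request start."""
    ttfa = chunks[0][0]                          # time to first audio
    synthesis_time = chunks[-1][0]               # when the last chunk arrived
    audio_duration = sum(a for _, a in chunks)   # total seconds of speech produced
    rtf = synthesis_time / audio_duration        # < 1.0 means faster than realtime
    return {"ttfa_s": ttfa, "rtf": round(rtf, 3), "audio_s": audio_duration}


m = summarize([(0.12, 1.5), (0.45, 1.8), (0.90, 1.2)])
assert m["ttfa_s"] == 0.12
assert m["rtf"] < 1.0   # 4.5 s of audio synthesized in 0.9 s
```

An agent reading numbers like these can tell the difference between a healthy run (low TTFA, RTF well under 1.0) and GPU contention before the user ever hears a glitch.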
## Ecosystem
Mod³ is the voice layer in the [CogOS](https://github.com/cogos-dev/cogos) ecosystem. It integrates as a modality channel -- the kernel routes intents to Mod³ when voice output is appropriate. Works standalone without CogOS.

| Repo | Purpose |
|------|---------|
| [cogos](https://github.com/cogos-dev/cogos) | The daemon |
| **mod3** | **Voice -- this repo** |
| [constellation](https://github.com/cogos-dev/constellation) | Distributed identity and trust |