# Mod3 Architecture: The Modality Bus

The modality bus is the sensorimotor boundary between cognitive agents and physical signals. Agents think in cognitive events ("someone spoke", "say this"); the bus translates between those events and raw bytes (audio and text today; vision and spatial planned).

```
                    ModalityBus
┌──────────────────────────────────────────────┐
│                                              │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐       │
│  │  Voice  │  │  Text   │  │ Vision* │  ...  │
│  │ Module  │  │ Module  │  │ Module  │       │
│  └────┬────┘  └────┬────┘  └────┬────┘       │
│       │            │            │            │
│  ┌────┴────────────┴────────────┴──────┐     │
│  │        Event Log + Listeners        │     │
│  └────┬────────────┬────────────┬──────┘     │
│       │            │            │            │
│  ┌────┴────┐  ┌────┴──────┐  ┌──┴───┐        │
│  │ Channel │  │  Channel  │  │ ...  │        │
│  │ discord │  │ http-api  │  │      │        │
│  └─────────┘  └───────────┘  └──────┘        │
└──────────────────────────────────────────────┘

* Vision/Spatial are defined in ModalityType but not yet implemented.
```

## Core Types (modality.py)

### Cognitive Primitives

The agent never touches raw bytes. It sees these:

```python
@dataclass
class CognitiveEvent:              # Input percept
    modality: ModalityType         # VOICE, TEXT, VISION, SPATIAL
    content: str                   # The meaning (transcribed text, caption, etc.)
    source_channel: str            # Which channel it arrived on
    confidence: float              # Decoder certainty (0.0 - 1.0)
    timestamp: float
    metadata: dict[str, Any]

@dataclass
class CognitiveIntent:             # Output intent (not yet encoded)
    modality: ModalityType | None  # None = let the bus decide
    content: str                   # What to communicate
    target_channel: str            # Specific channel, or "" for bus routing
    priority: int                  # Higher = more urgent
    metadata: dict[str, Any]       # voice, speed, emotion, etc.

@dataclass
class EncodedOutput:               # Raw signal ready for delivery
    modality: ModalityType
    data: bytes                    # WAV, PNG, JSON, etc.
    format: str                    # "wav", "png", "text", etc.
    duration_sec: float
    metadata: dict[str, Any]
```
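
To make the shapes concrete, here is a self-contained sketch that mirrors these dataclasses with a minimal `ModalityType` stand-in (the real enum lives in `modality.py`; default values here are illustrative) and builds one percept:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any
import time

# Minimal stand-in for the enum in modality.py (VISION/SPATIAL omitted).
class ModalityType(Enum):
    VOICE = "voice"
    TEXT = "text"

@dataclass
class CognitiveEvent:
    modality: ModalityType
    content: str
    source_channel: str = ""
    confidence: float = 1.0
    timestamp: float = field(default_factory=time.time)
    metadata: dict[str, Any] = field(default_factory=dict)

# A percept as the agent would see it: meaning, not bytes.
event = CognitiveEvent(
    modality=ModalityType.VOICE,
    content="someone spoke: 'hello there'",
    source_channel="discord-voice",
    confidence=0.93,
)
print(event.modality.value, event.confidence)  # voice 0.93
```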

### Abstract Base Classes

Every modality module implements up to three components:

```python
class Gate(ABC):
    def check(self, raw: bytes, **kwargs) -> GateResult: ...

class Decoder(ABC):
    def decode(self, raw: bytes, **kwargs) -> CognitiveEvent: ...

class Encoder(ABC):
    def encode(self, intent: CognitiveIntent) -> EncodedOutput: ...

class ModalityModule(ABC):
    modality_type: ModalityType    # Which modality this handles
    gate: Gate | None              # Input filter (None = pass all)
    decoder: Decoder | None        # raw -> CognitiveEvent
    encoder: Encoder | None        # CognitiveIntent -> EncodedOutput
    state: ModuleState             # Live HUD state
    def health(self) -> dict: ...  # Diagnostics
```

`Gate` is optional: text has no gate (all text passes), while voice uses VAD to reject silence.
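
As a sketch of the gate pattern, here is a toy gate that rejects empty or oversized payloads before the decoder ever runs, the same role `VoiceGate` plays for silence. The `GateResult` fields (`passed`, `reason`) are assumed, not taken from `modality.py`:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Assumed shape of GateResult; the real class lives in modality.py.
@dataclass
class GateResult:
    passed: bool
    reason: str = ""

class Gate(ABC):
    @abstractmethod
    def check(self, raw: bytes, **kwargs) -> GateResult: ...

# Hypothetical gate: reject empty or oversized payloads.
class LengthGate(Gate):
    def __init__(self, max_bytes: int = 1_000_000):
        self.max_bytes = max_bytes

    def check(self, raw: bytes, **kwargs) -> GateResult:
        if not raw:
            return GateResult(False, "empty payload")
        if len(raw) > self.max_bytes:
            return GateResult(False, "payload too large")
        return GateResult(True)

gate = LengthGate()
print(gate.check(b"hello").passed)  # True
print(gate.check(b"").passed)       # False
```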

## The Bus (bus.py)

`ModalityBus` manages module registration, signal routing, and state tracking.

### perceive() -- Input Path

```
raw bytes ──→ Gate.check() ──→ Decoder.decode() ──→ CognitiveEvent
                   │                   │
              (rejected?)       (empty content?)
                   ↓                   ↓
                 None            None (filtered)
```

```python
bus.perceive(raw: bytes, modality: str | ModalityType, channel: str = "", **kwargs)
    -> CognitiveEvent | None
```

1. Resolve the modality module from the registry.
2. If the module has a gate, run `gate.check(raw)`. Emit a `modality.gate` bus event; return `None` if rejected.
3. Run `decoder.decode(raw)`. If content is empty (e.g., hallucination filtered), emit `modality.filtered` and return `None`.
4. Stamp `source_channel`, emit `modality.input`, and return the event.

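The four steps above can be sketched as a standalone function. This is illustrative, not the real `bus.py`: the real bus resolves the module from its registry and emits the bus events noted in the comments, and the fake module here exists only to exercise the path:

```python
from dataclasses import dataclass
from types import SimpleNamespace

def perceive(raw, module, channel=""):
    # 1. (the real bus first resolves `module` from its registry)
    # 2. Optional gate: rejected input never reaches the decoder.
    if module.gate is not None:
        result = module.gate.check(raw)   # bus emits "modality.gate" here
        if not result.passed:
            return None
    # 3. Decode the raw signal into a cognitive event.
    event = module.decoder.decode(raw)
    if not event.content:                 # bus emits "modality.filtered"
        return None
    # 4. Stamp provenance and hand the percept to the agent
    #    (bus emits "modality.input").
    event.source_channel = channel
    return event

# A fake text-like module to exercise the path.
@dataclass
class FakeEvent:
    content: str
    source_channel: str = ""

fake = SimpleNamespace(
    gate=None,
    decoder=SimpleNamespace(decode=lambda raw: FakeEvent(raw.decode("utf-8"))),
)
print(perceive(b"hi", fake, channel="http-api").source_channel)  # http-api
```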
### act() -- Output Path

```
CognitiveIntent ──→ resolve modality ──→ Encoder.encode() ──→ EncodedOutput
                                                                    │
                                                            channel.deliver()
```

```python
bus.act(intent: CognitiveIntent, channel: str = "", blocking: bool = False)
    -> QueuedJob | EncodedOutput
```

1. Resolve the output modality: explicit on the intent, or inferred from channel capabilities (voice preferred over text), or defaulting to text.
2. Encode via the module's encoder. Emits `modality.encode_start` and `modality.output` bus events.
3. If the target channel has a `deliver` callback, call it with the encoded output.
4. If `blocking=True`, return `EncodedOutput` directly; otherwise queue via `OutputQueueManager` and return a `QueuedJob`.

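The modality-resolution order in step 1 can be sketched as a small function. Plain strings stand in for `ModalityType`, and the function name is illustrative, not the real bus internals:

```python
def resolve_modality(intent_modality, channel_capabilities):
    """Mirror of step 1: explicit intent wins, then channel
    capabilities (voice preferred over text), then text."""
    if intent_modality is not None:
        return intent_modality
    if "voice" in channel_capabilities:
        return "voice"
    if "text" in channel_capabilities:
        return "text"
    return "text"

print(resolve_modality(None, ["voice", "text"]))  # voice
print(resolve_modality("text", ["voice"]))        # text
print(resolve_modality(None, []))                 # text
```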
### hud() -- Agent Awareness

```python
bus.hud() -> dict
```

Returns a live snapshot of all modules and channels: current status, active jobs, queue depths, and recent events. The snapshot is designed to be injected into the agent's context window so the agent knows what its body is doing.

### Channels

Channels declare which modalities they support. The bus auto-routes output based on channel capabilities.

```python
bus.register_channel("discord-voice", [ModalityType.VOICE, ModalityType.TEXT],
                     deliver=send_to_discord)
```

### Bus Events

Every boundary crossing is recorded as a `BusEvent` (type, modality, channel, timestamp, data). Listeners can subscribe via `bus.on_event(callback)` for ledger integration. The bus keeps the last 500 events in memory.

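The bounded log plus listener fan-out can be sketched with a `deque`. The `EventLog` class name and `emit` method are illustrative stand-ins; only the `BusEvent` fields, `on_event`, and the 500-event limit come from the text above:

```python
from collections import deque
from dataclasses import dataclass, field
import time

@dataclass
class BusEvent:   # mirrors the fields named above
    type: str
    modality: str
    channel: str
    timestamp: float = field(default_factory=time.time)
    data: dict = field(default_factory=dict)

class EventLog:
    """Sketch of the event log: bounded memory, fan-out to listeners."""
    def __init__(self, maxlen=500):
        self.events = deque(maxlen=maxlen)   # keeps only the last 500
        self.listeners = []

    def on_event(self, callback):
        self.listeners.append(callback)

    def emit(self, event):
        self.events.append(event)            # old events fall off the left
        for cb in self.listeners:
            cb(event)                        # e.g. append to a ledger

log = EventLog()
seen = []
log.on_event(seen.append)
for _ in range(600):
    log.emit(BusEvent("modality.input", "text", "http-api"))
print(len(log.events), len(seen))  # 500 600
```

Listeners see every event even after it ages out of the in-memory window, which is why ledger integration hangs off the callback rather than the deque.
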
## Current Modalities

### Voice (modules/voice.py)

| Component | Class | Implementation |
|-----------|-------|----------------|
| Gate | `VoiceGate` | Silero VAD via `vad.detect_speech()`. Threshold-configurable (default 0.5). Rejects audio with no detected speech. |
| Decoder | `WhisperDecoder` | `mlx_whisper` STT on Apple Silicon. Lazy-loads `mlx-community/whisper-turbo`. Applies the `vad.is_hallucination()` filter to reject phantom transcripts. |
| Decoder (legacy) | `PlaceholderDecoder` | Accepts pre-transcribed text. Used by the MCP server for the `speak` tool path, where the text is already known. |
| Encoder | `VoiceEncoder` | Wraps `engine.synthesize()` (Kokoro, Voxtral, Chatterbox, Spark). Default voice: `bm_lewis` at 1.25x speed. Returns WAV bytes. |

### Text (modules/text.py)

| Component | Class | Implementation |
|-----------|-------|----------------|
| Gate | None | All text passes through. |
| Decoder | `TextDecoder` | Identity transform: `bytes.decode("utf-8")` -> `CognitiveEvent`. |
| Encoder | `TextEncoder` | Identity transform: `intent.content.encode("utf-8")` -> `EncodedOutput`. |

Text gets its own module so that text is a first-class modality on the bus, not a special case.

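The identity property of the text module can be stated as a round trip. The function names here are illustrative; the real classes wrap these transforms in `CognitiveEvent` / `EncodedOutput` objects:

```python
# Sketch of the identity transforms in modules/text.py.
def text_decode(raw: bytes) -> str:
    return raw.decode("utf-8")

def text_encode(content: str) -> bytes:
    return content.encode("utf-8")

msg = "say this"
print(text_decode(text_encode(msg)) == msg)  # True
```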
## Integration Points

### MCP Server (server.py)

The MCP server creates the bus singleton at module level:

```python
_bus = _create_bus()  # ModalityBus with VoiceModule(decoder=PlaceholderDecoder())
```

MCP tools (`speak`, `diagnostics`, `vad_check`) use `_bus` for voice state tracking, health reports, and VAD. The `speak` tool resolves voices through the bus's voice module, sets encoder state, and uses the engine directly for synthesis (the adaptive player handles local playback).

The `diagnostics` tool returns `_bus.health()` and `_bus.hud()`.

### HTTP API (http_api.py)

The HTTP API imports the bus singleton from the MCP server:

```python
from server import _bus as _shared_bus  # Shared instance when co-hosted
_bus = _shared_bus                      # Falls back to a fresh ModalityBus if the import fails
```

It ensures both Text and Voice modules are registered, then exposes the bus directly:

| Endpoint | Bus Method |
|----------|------------|
| `GET /v1/bus/hud` | `_bus.hud()` |
| `GET /v1/bus/health` | `_bus.health()` |
| `POST /v1/bus/perceive` | `_bus.perceive(raw, modality, channel)` |
| `POST /v1/bus/act` | `_bus.act(intent, channel, blocking=True)` |
| `GET /health` | includes `_bus.health()` and `_bus.hud()` |

When running with `--all`, both MCP and HTTP share the same bus instance and model cache.

## Adding a New Modality

1. **Create `modules/your_modality.py`** -- implement `Gate`, `Decoder`, and `Encoder` (all optional) plus a `ModalityModule` subclass that wires them together. See `modules/text.py` for the minimal case or `modules/voice.py` for the full pattern.

2. **Add the modality type** to `ModalityType` in `modality.py` if needed. `VISION` and `SPATIAL` are already defined.

3. **Register with the bus** where it is created (`server.py` and/or `http_api.py`):
   ```python
   bus.register(VisionModule())
   bus.register_channel("webcam-feed", [ModalityType.VISION])
   ```

4. **No routing changes needed.** The bus auto-routes `act()` based on channel capabilities, and the HTTP API's `/v1/bus/perceive` and `/v1/bus/act` already accept any registered modality via the `modality` parameter.
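
A skeleton for step 1 might look like the following. It is a hedged sketch, not the real implementation: the base classes live in `modality.py`, so minimal stand-ins are inlined here to keep it self-contained, strings stand in for `ModalityType`, and `VisionDecoder`'s caption is a placeholder for a real captioning model:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class CognitiveEvent:            # stand-in for modality.CognitiveEvent
    modality: str
    content: str
    source_channel: str = ""
    confidence: float = 1.0
    metadata: dict[str, Any] = field(default_factory=dict)

class VisionDecoder:
    """Hypothetical decoder: image bytes -> caption. A real one
    would call a captioning model; this stub returns a placeholder."""
    def decode(self, raw: bytes, **kwargs) -> CognitiveEvent:
        caption = f"image received ({len(raw)} bytes)"
        return CognitiveEvent(modality="vision", content=caption,
                              confidence=0.5,
                              metadata={"size_bytes": len(raw)})

class VisionModule:
    """Wires gate/decoder/encoder together, like modules/voice.py."""
    modality_type = "vision"
    gate = None                  # no input filter yet: all frames pass
    decoder = VisionDecoder()
    encoder = None               # vision output not supported

    def health(self) -> dict:
        return {"modality": self.modality_type, "decoder": "stub"}

event = VisionModule().decoder.decode(b"\x89PNG...")
print(event.content)  # image received (7 bytes)
```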