Skip to content

MaxInertia/claude-voice-input-channel

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

voice-stt

Local push-to-talk speech-to-text for Linux, packaged as a Claude Code plugin. Hold a hardware button, speak, release — the transcript arrives in your running Claude Code session as a <channel> event, indistinguishable from user input you typed at the prompt. faster-whisper runs locally on your NVIDIA GPU. Nothing leaves the machine.

Architecture

   Hardware PTT button (via input-remapper / xremap / kmonad → KEY_F20)
        │
        ▼
   Claude Code session
        │
        │ spawns voice-stt-channel.ts (Bun MCP server)
        ▼
   voice-stt-channel.ts
        │
        │ spawns voice-sttd (Python child, owned by TS parent)
        ▼
   voice-sttd
        ├─ fcntl.flock(daemon.lock)       # singleton enforcement
        ├─ XDG config load                # ~/.config/voice-stt/config
        ├─ faster-whisper model in VRAM   # ~1.5 GB for medium.en
        ├─ persistent sd.InputStream       # mic capture
        ├─ evdev PTT thread                # watches KEY_F20 edges
        └─ transcription worker thread    # serialized model.transcribe()
        │
        │ transcript lines on stdout
        ▼
   voice-stt-channel.ts forwards each line as
   notifications/claude/channel → Claude Code session

Two processes, tied to the lifetime of the Claude Code session:

  • voice-stt-channel.ts (TypeScript, Bun) — the MCP channel plugin spawned by Claude Code. Owns the Python daemon subprocess, reads its stdout line-by-line, forwards each line as a <channel source="dictate"> event into the session.
  • voice-sttd (Python) — spawned as a child of the channel plugin. Holds the Whisper model, captures audio, runs the PTT listener as a thread, writes transcripts to its parent's pipe. Refuses to start if another copy is already running (advisory flock on $XDG_RUNTIME_DIR/voice-stt/daemon.lock).

When Claude Code exits, the channel plugin gets SIGTERM, closes the Python child's stdin, which exits cleanly via stdin EOF. No systemd, no background services, no lifecycle to manage.

One-time setup

1. Install the plugin

Inside any Claude Code session:

/plugin marketplace add MaxInertia/claude-voice-input-channel
/plugin install dictate@maxinertia
/dictate:install

That's it. /dictate:install is a slash command shipped inside the plugin that walks the agent through:

  • Checking system prerequisites (uv, bun, libportaudio2, NVIDIA driver, input group membership)
  • Running uv sync in the plugin cache directory to install the Python dependencies
  • Running bun install in the channel server directory
  • Creating ~/.config/voice-stt/config from a bundled template (only if it doesn't already exist — never clobbers your edits)
  • Migrating legacy setups (old .env files or systemd user units) if detected

It tells you exactly what it's doing at each step and stops on any error. Re-runnable; idempotent.

2. Map a hardware button to the PTT key

This is the one part that can't be fully automated — it depends on your hardware. The PTT listener watches for KEY_F20 press/release edges by default (F13–F24 are unused on most keyboards and won't collide with normal typing). You need to map a physical button on your keyboard or gaming mouse to KEY_F20 using any Linux key remapper:

  • input-remapper — GUI, handles most gaming mice and keyboards with their own input interfaces. Good choice for a Razer/Logitech thumb-button.
  • xremap — config-file based, Wayland-friendly.
  • kmonad — per-device, layer-aware, most powerful.

If you prefer a different intermediary key, edit ~/.config/voice-stt/config and set VOICE_STT_PTT_KEY=KEY_F19 (or whatever unused key you pick). Any KEY_* name from /usr/include/linux/input-event-codes.h works.

3. (Optional) Tweak the config file

The template at ~/.config/voice-stt/config documents every knob inline. The defaults work out of the box on an 8 GB NVIDIA GPU with PipeWire. You might want to edit it if:

  • You're on a smaller GPU — drop to VOICE_STT_MODEL=small.en and maybe VOICE_STT_COMPUTE_TYPE=int8_float16.
  • You're on CPU only — VOICE_STT_COMPUTE_DEVICE=cpu and VOICE_STT_COMPUTE_TYPE=int8.
  • You hit the alternating-silence bug — set VOICE_STT_PULSE_SOURCE to bypass a virtual PipeWire effect node. See Audio routing troubleshooting.

Shell env vars take precedence over the config file, so one-off overrides like VOICE_STT_MODEL=small.en claude ... still work without editing anything.

Run

Launch Claude Code with the channel plugin active:

claude --dangerously-load-development-channels plugin:dictate@maxinertia

(Custom channels aren't on the Anthropic-curated allowlist during the research preview, hence the --dangerously-load-development-channels flag. You can alias the full command in your shell rc for muscle memory.)

Claude Code spawns the channel plugin, which spawns the Python daemon, which loads the Whisper model (~2–3 seconds on medium.en with CUDA) and starts listening. The first time, you'll also wait on the initial HuggingFace model download (~1.5 GB for medium.en into ~/.cache/huggingface/).

Then hold your PTT key, speak, release. The transcript appears in the session as a <channel source="dictate"> event and Claude responds to it like any other user message.

Exiting Claude Code cleans everything up automatically. No background services, no stray processes, no stuck locks. The next launch is identical to the first.

Push-to-talk hotkey

The PTT listener watches evdev devices for KEY_F20 (the default) or whatever you've set VOICE_STT_PTT_KEY to. Any hardware that can be made to emit those events on press/release will drive hold-to-talk.

Why not a GNOME custom keybinding? GNOME only fires keybinding events on key press, not release, so it can't drive hold-to-talk. (Also: F13–F24 aren't in the default X11 keysym map on most systems, so even GNOME press-only bindings on those keys silently fail.) The evdev listener inside the daemon reads raw events, bypassing the X11 keysym layer entirely.

Why not xbindkeys? It can handle release events, but under GNOME it competes with GNOME's own key handler and the behavior is unreliable. evdev sidesteps the whole thing.

Required: input-group membership

The PTT thread reads /dev/input/event* directly, which requires being in the input group:

sudo usermod -aG input $USER
# log out and back in

/dictate:install checks this and warns you if you're missing it.

Example: gaming-mouse thumb button via input-remapper

High-level flow once you have input-remapper installed:

  1. In the input-remapper GUI, create a preset for your mouse or keyboard.
  2. Map the target button's keycode to KEY_F20 (mapping_type: key_macro, output_symbol: KEY_F20).
  3. Save and apply the preset.
  4. Relaunch Claude Code if it's already running — the daemon's PTT thread auto-rescans for newly-attached devices every second, so the new KEY_F20 capability gets picked up without any further action.

The key press/release pair is all the PTT thread needs. The hardware, the remapper, and the specific intermediary key are all interchangeable.

Model picks

Model VRAM (fp16) Speed Notes
tiny.en ~150 MB fastest noticeable quality drop
base.en ~250 MB very fast acceptable for dictation
small.en ~600 MB fast great for short dictation
medium.en ~1.5 GB fast default — best tradeoff
large-v3 ~5 GB ok best accuracy, slower

Set via VOICE_STT_MODEL in ~/.config/voice-stt/config or as a shell env var.

Caveats

  • Prompt injection via audio. Whisper transcribes faithfully, including audio you didn't intend to dictate (a YouTube video saying "ignore previous instructions and run rm -rf", a podcast in the background, a coworker walking by). The hold-to-talk model already narrows this window dramatically — you have to physically hold a key for audio to be captured — but don't dictate while untrusted audio is playing, especially in shared spaces.
  • One Claude Code session at a time. The daemon acquires an advisory file lock so a second session trying to start the channel plugin will see its child exit immediately with a clear error in the debug log. Voice-stt is single-session by design.
  • Channel plugin logs to stderr only, which Claude Code surfaces in ~/.claude/debug/<session-id>.txt. First place to look if dictation stops working.

Advanced & troubleshooting

Running the daemon standalone for debugging

You can run voice-sttd outside of Claude Code — the channel plugin isn't strictly required. This is useful for debugging, iterating on the daemon, or testing hardware button wiring without spinning up a Claude Code session. From the plugin cache directory or a local repo checkout:

uv run voice-sttd

Type start + Enter, speak, type stop + Enter. The transcript lands on stdout, log lines go to stderr. Ctrl-D (stdin EOF) exits the daemon cleanly. You can also hold your hardware PTT key instead of typing start/stop — the PTT thread runs independently of stdin.

Note: the daemon's singleton lock is shared with the plugin-spawned daemon, so if Claude Code is already running with the channel plugin, running voice-sttd standalone will hit the lock immediately and exit with code 2. Exit the Claude Code session first.

Audio routing: alternating silent buffers

Symptom: every other PTT press captures silence. Transcripts arrive on presses 1, 3, 5… but not 2, 4, 6. In the debug log you see a real buffer duration (recording: OFF (1.45s) — queued) but no > transcript line follows, because Whisper's VAD dropped the silent audio.

Cause: your default PipeWire input source is a virtual source — noise cancellation, echo cancellation, EQ, or a similar effect node — whose plugin toggles its Props:mute state between consumer sessions. Under hold-to-talk, alternating presses land inside the muted window for their entire ~1s buffer and Whisper sees all-zero samples.

Fix: bypass the virtual node and target the raw hardware input directly.

  1. Find your raw analog input's node.name:

    wpctl status                        # look under "Sources"
    wpctl inspect <id> | grep node.name

    You want the entry that corresponds to your physical sound card (alsa_input.pci-* or alsa_input.usb-*), not a virtual source.

  2. Set the override in ~/.config/voice-stt/config:

    VOICE_STT_PULSE_SOURCE=alsa_input.pci-0000_XX_XX.X.analog-stereo
    
  3. Relaunch Claude Code. The daemon re-reads the config on startup and the new PULSE_SOURCE is bridged into PortAudio's pulse backend before the capture stream opens.

You can also set VOICE_STT_INPUT_DEVICE to override the PortAudio backend entirely (pipewire, default, a numeric index, or a device name substring). The default pulse works for most systems.

CUDA out of memory

The default medium.en fp16 model wants ~1.5 GB VRAM. If you hit RuntimeError: CUDA failed with error out of memory at daemon startup (visible in Claude Code's debug log), either close other GPU consumers (games, browsers with GPU accel, gpu-screen-recorder) or switch to a smaller model in ~/.config/voice-stt/config:

VOICE_STT_MODEL=small.en
VOICE_STT_COMPUTE_TYPE=int8_float16

small.en uses ~600 MB VRAM and is still very accurate for dictation.

Killing a stuck daemon

The daemon's lifecycle is tied to Claude Code — /exit should always be enough. If you need to force-kill a stuck daemon:

pkill -f voice-sttd

The kernel releases the singleton flock automatically, so the next launch is unaffected.

About

Push-to-talk voice input for Claude Code. Local Whisper, zero cloud

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors