Complete modern dictation model evaluation by future3OOO · Pull Request #4 · future3OOO/Whisper-Smart

future3OOO · 2026-05-02T06:11:03Z

Summary

Adds the production benchmark path for modern faster-whisper engines, including real-audio WAV benchmarking, WER scoring, model matrix runs, JSON output, and model-VAD comparison.
Moves transcription and post-processing behind focused modules, keeps CUDA model load/inference/close on one model thread, and updates runtime dependencies to faster-whisper 1.2.1 / ctranslate2 4.7.1.
Switches the English default to large-v3-turbo based on recorded RTX 3080 benchmark evidence plus live dictation feedback showing better accuracy than distil-large-v3 while remaining fast.

Test plan

pytest tests -q -m "not gpu" -> 118 passed, 3 deselected
pytest tests/test_gpu_smoke.py -q -m gpu -> 3 passed
black --check dictation_tool tests
ruff check dictation_tool tests pyproject.toml
mypy dictation_tool
pip check

Notes

CPU mode is still available with --device cpu, but the benchmarked production recommendation is GPU/CUDA. CPU use of large-v3-turbo is expected to be much slower than GPU for this workflow.

Made with Cursor

Summary by CodeRabbit

New Features
- CLI benchmark matrix, real-audio benchmarking, per-model runs, model-VAD comparison, JSONL profiling, WER scoring, and deterministic postprocessing helpers.
Changed
- Default English model -> large-v3-turbo; updated runtime dependency constraints for transcription backends and setuptools pin for compatibility.
Bug Fixes
- Fixed No-VAD buffering, benchmark/CLI chunk wiring, clipboard behavior, bullet/spacing rendering, and Windows model teardown.
Documentation
- Expanded README, CHANGELOG, and performance plan with benchmarks and usage guidance.
Tests
- Added/updated unit and golden tests covering benchmarking, postprocessing, transcription, I/O, engine, and GPU smoke.

Prevent mouse-hold dictation from transcribing overlapping VAD batches before release, and harden Windows microphone fallback so default devices are portable while explicit devices fail closed. Co-authored-by: Cursor <cursoragent@cursor.com>

Add the benchmarked faster-whisper model matrix path, real-audio scoring, and dedicated model-thread execution so modern CUDA models can be evaluated reliably. Switch the English default to distil-large-v3 based on recorded RTX 3080 benchmark evidence and document the remaining model guidance. Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai · 2026-05-02T06:11:14Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

Adds a faster-whisper transcription backend, benchmark and profiling CLI (model matrix, real-audio runs, WER), deterministic post-processing, typed audio I/O, Pydantic v2 config migration, dependency constraint updates, engine/model lifecycle refactor, and corresponding tests and docs updates.

Changes

Dictation pipeline, benchmarking, and typing modernization

Layer / File(s)	Summary
Data Shape & Types `dictation_tool/benchmark.py`, `dictation_tool/transcription.py`, `dictation_tool/io.py`, `dictation_tool/engine.py`	Introduce numpy audio type aliases (`Int16Audio`, `Float32Audio`, `Int32Audio`, `Float64Array`), `TranscriptionOptions`/`TranscriptionResult`, and `DEFAULT_MODEL_MATRIX`.
Core Implementation `dictation_tool/transcription.py`, `dictation_tool/postprocess.py`, `dictation_tool/benchmark.py`	Add `FasterWhisperBackend` (lazy load/close/transcribe), deterministic `DictationPostProcessor` (clean/format pipeline), and benchmark utilities (`parse_model_matrix`, `summarize_latency_ms`, `load_wav_mono_int16`, `word_error_rate`).
Engine & I/O Behavior `dictation_tool/engine.py`, `dictation_tool/io.py`	Engine uses a lazily-loaded backend and model pool, routes model output through postprocessor, refactors `_Ring` to typed int16 buffer, updates VAD pre/post-buffering semantics and AudioStream resampling, and implements benchmark timing/WER reporting.
Configuration & CLI Wiring `dictation_tool/config.py`, `dictation_tool/__main__.py`	Migrate to Pydantic v2 (`model_config`, `@field_validator`, `@model_validator`), add `profile` field, change default model to `large-v3-turbo`, normalize vad/aggr/chunk fields; extend CLI with `--bench-models`, `--bench-audio`, `--bench-reference`, `--bench-runs`, `--no-model-vad`, `--profile`, and a benchmark-matrix runner that returns early.
Dependencies & Packaging `pyproject.toml`, `requirements.txt`	Adjust dependency constraints: `faster-whisper>=1.2.1,<2`, add `ctranslate2>=4.7.1,<5`, add `setuptools<81`, and update tooling/pytest markers.
Tests / Documentation `tests/`, `CHANGELOG.md`, `README.md`, `docs/plans/`	Add and update tests for benchmark helpers, FasterWhisper adapter, post-processing golden tests, engine/VAD/IO behavior, and CLI benchmark flags; update README, CHANGELOG, and add a PRD doc.
Utilities / Minor `dictation_tool/utils.py`, `dictation_tool/utils/__init__.py`, `dictation_tool/utils/profile.py`, `dictation_tool/prompts.py`	Minor import/formatting tweaks, typed `prof` signature, and a comment punctuation fix.

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant Engine
    participant Backend
    participant FS
    CLI->>Engine: request benchmark (models, audio, runs)
    Engine->>FS: load_wav_mono_int16(path)
    FS-->>Engine: Int16Audio
    Engine->>Backend: instantiate FasterWhisperBackend / load()
    Backend-->>Engine: backend ready
    Engine->>Backend: transcribe(audio, TranscriptionOptions)
    Backend-->>Engine: TranscriptionResult(text, avg_logprob)
    Engine->>Engine: summarize latency, compute WER
    Engine->>CLI: print per-model JSON results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

PRD: Complete live modern dictation engine evaluation #3: Implements live-evaluation and benchmarking features (model-matrix CLI, dependency bumps, scoring) referenced by the issue.
PRD: Improve dictation latency, efficiency, and model stack #2: Implements the PRD objectives (backend adapter, deterministic post-processing, VAD controls, measurement infra) described in the issue.

Poem

🐰 Soft hops of code, benchmarks in tow,

WAVs align, and latencies grow—then slow.
Regex prunes, bullets find their place,
Models load, transcribe, and yield neat grace,
Hop—large-v3-turbo leads the race.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 38.31% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Complete modern dictation model evaluation' is directly related to the main changes: adding benchmarking for modern faster-whisper engines, evaluating model performance, and changing the default model to large-v3-turbo based on benchmark evidence.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/dictation-performance-foundation


greptile-apps · 2026-05-02T06:17:27Z

Greptile Summary

This PR completes the production benchmark infrastructure — real-audio WAV benchmarking, WER scoring, a multi-model matrix runner, JSONL profiling, and a --bench CLI path. It also extracts transcription and post-processing into dedicated modules, migrates Config cleanly to Pydantic v2, and switches the default English model to large-v3-turbo.

All five bugs previously flagged (asyncio.Event thread-unsafety in _hold_start/_hold_stop, shared _f32 buffer race in run_in_executor, avg_logprob always -1.0, stale-batch duplicate transcription, VAD frame-size mismatch in the resample fallback) are corrected in this revision. The one new observation (see inline) is P2-level dead code with no runtime impact.
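The first item in that list (asyncio.Event thread-unsafety) can be pictured with a minimal sketch. It assumes the hold callbacks fire on a listener thread outside the event loop. `wait_for_hold` and `on_mouse_down` are hypothetical names for illustration, not the engine's `_hold_start`/`_hold_stop`. Because `asyncio.Event` is not thread-safe, the callback hands `set` to the loop with `loop.call_soon_threadsafe` instead of calling it directly.

```python
import asyncio
import threading


async def wait_for_hold(timeout: float = 2.0) -> bool:
    """Return True once a (simulated) mouse-down arrives from another thread."""
    loop = asyncio.get_running_loop()
    hold_started = asyncio.Event()

    def on_mouse_down() -> None:
        # Hypothetical listener callback: it runs on a non-loop thread, so the
        # Event mutation is scheduled onto the loop thread rather than done here.
        loop.call_soon_threadsafe(hold_started.set)

    listener = threading.Thread(target=on_mouse_down)
    listener.start()
    try:
        await asyncio.wait_for(hold_started.wait(), timeout)
        return True
    except asyncio.TimeoutError:
        return False
    finally:
        listener.join()


if __name__ == "__main__":
    print(asyncio.run(wait_for_hold()))
```

Calling `hold_started.set()` straight from the listener thread may appear to work in casual testing, which is why this class of bug tends to surface only under load.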

Confidence Score: 5/5

Safe to merge — all previously-reported P1 bugs are resolved and no new P1 or P0 issues were found.

The five confirmed bugs from the prior review (asyncio.Event thread-unsafety, shared _f32 race, avg_logprob always -1.0, stale-batch duplicates, VAD frame-size mismatch) are all corrected. The only new finding is a one-line P2 dead code path in _config_with_model. No P1 or P0 issues remain.

No files require special attention. dictation_tool/__main__.py has a trivial dead-code cleanup opportunity.

Important Files Changed

Filename	Overview
dictation_tool/engine.py	Thread-safety fixes applied throughout (call_soon_threadsafe, _audio_lock, inference_audio.copy()); stale-batch duplicate transcription resolved; asyncio.Event properly guarded; model teardown moved to stop().
dictation_tool/transcription.py	New module; materializes segment generator and computes mean avg_logprob correctly, fixing the always-(-1.0) context-update bug from the previous review.
dictation_tool/io.py	Resample fallback now sizes native_frames to at least one VAD frame (fixing the silent-drop bug); device sort key corrected to use enumerate index; type annotations tightened.
dictation_tool/postprocess.py	New module; extracts all regex post-processing from engine; clean split between clean_model_text (email/URL/command) and format_clipboard_text (spoken punctuation + trim).
dictation_tool/benchmark.py	New module; WER via Levenshtein, WAV loader with linear resample, latency percentile summary, model matrix parser — all correct.
dictation_tool/__main__.py	Benchmark matrix runner wired correctly; _config_with_model contains a dead Pydantic v1 fallback (config.py now hard-requires pydantic_settings v2).
dictation_tool/config.py	Clean Pydantic v2 migration: BaseSettings via pydantic_settings, field_validator/@classmethod, model_validator(mode="after"), object.__setattr__ for frozen mutation in validators.
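The "latency percentile summary" mentioned for benchmark.py is not shown anywhere in this thread. The sketch below is only an illustration of what such a helper might compute, assuming nearest-rank percentiles and a real-time factor defined as inference time divided by audio duration. The function bodies, the returned keys, and `real_time_factor` are assumptions, not the project's `summarize_latency_ms`.

```python
import math
import statistics


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile over an already-sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_latency_ms(samples_ms: list[float]) -> dict[str, float]:
    """Hypothetical summary: run count, mean, p50, p95 and max in milliseconds."""
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples_ms)
    return {
        "runs": float(len(ordered)),
        "mean_ms": statistics.fmean(ordered),
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "max_ms": ordered[-1],
    }


def real_time_factor(latency_ms: float, audio_seconds: float) -> float:
    """Inference time divided by audio duration; below 1.0 is faster than real time."""
    return (latency_ms / 1000.0) / audio_seconds
```

Deriving the real-time factor from the same latency figure that is reported keeps the two numbers consistent, which is the property one of the commits in this PR describes restoring.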

Sequence Diagram

sequenceDiagram
    participant Mic as AudioStream
    participant Shadow as Ring/Shadow Buffer
    participant EventLoop as Event Loop
    participant Pool as model_pool
    participant Backend as FasterWhisperBackend
    participant Post as DictationPostProcessor
    participant Clip as Clipboard

    Mic->>Shadow: on_raw_chunk(chunk) via _audio_lock
    Mic->>EventLoop: _q.put_nowait(chunk)
    EventLoop->>EventLoop: mic.chunks yields VAD segment
    EventLoop->>Pool: run_in_executor(backend.transcribe, audio.copy(), options)
    Pool->>Backend: model.transcribe(audio)
    Backend-->>Pool: TranscriptionResult(text, avg_logprob)
    Pool-->>EventLoop: result
    EventLoop->>Post: clean_model_text(result.text)
    Post-->>EventLoop: cleaned text
    EventLoop->>Post: format_clipboard_text(cleaned)
    Post-->>EventLoop: paste-ready text
    EventLoop->>Clip: _clip_q.put(text)
    Clip->>Clip: copy + optional auto-paste

Reviews (9): Last reviewed commit: "fix(transcription): propagate segment co..."

Use live dictation feedback to prefer large-v3-turbo as the balanced English default while keeping distil-large-v3 documented as the speed option. Co-authored-by: Cursor <cursoragent@cursor.com>

Copy inference audio before handing it to the model executor, avoid empty hold-release clipboard writes, and remove the stale duplicate utils module. Tighten the smallest related benchmark, config, VAD, and test assertions flagged in review. Co-authored-by: Cursor <cursoragent@cursor.com>

Preserve the config default by keeping mouse triggering disabled unless --mouse is passed. Co-authored-by: Cursor <cursoragent@cursor.com>

Recognize spoken email domains like "at gmail.com" and strip repeated cc/bcc hallucination tails after email addresses so pasted dictation preserves the intended address. Co-authored-by: Cursor <cursoragent@cursor.com>

Guard shared audio buffers across callback and event-loop threads, and report benchmark input errors through argparse instead of raw tracebacks. Co-authored-by: Cursor <cursoragent@cursor.com>

Preserve no-VAD ring contents across wrap and exact-fill cases so buffered audio is popped in order. Co-authored-by: Cursor <cursoragent@cursor.com>

Keep the documented middle-mouse hold-to-record workflow enabled by default while preserving --no-mouse as the explicit opt-out. Co-authored-by: Cursor <cursoragent@cursor.com>

Clear stale VAD batches while mouse hold recording owns audio, and align benchmark real-time factor with the reported inference latency. Co-authored-by: Cursor <cursoragent@cursor.com>

Align native resample fallback with VAD frame sizing and schedule mouse-hold recording event mutations on the event loop thread. Co-authored-by: Cursor <cursoragent@cursor.com>

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

greptile-apps · 2026-05-02T08:40:19Z

+    def transcribe(
+        self, audio: NDArray[np.float32], options: TranscriptionOptions
+    ) -> TranscriptionResult:
+        if self._model is None:
+            raise RuntimeError("transcription backend is not loaded")
+
+        segs, info = self._model.transcribe(
+            audio,
+            language=options.language,
+            initial_prompt=options.initial_prompt,
+            beam_size=options.beam_size,
+            best_of=options.best_of,
+            temperature=options.temperature,
+            vad_filter=options.vad_filter,
+            word_timestamps=False,
+        )
+        return TranscriptionResult(
+            text="".join(segment.text for segment in segs).strip(),
+            avg_logprob=float(getattr(info, "avg_logprob", -1.0)),
+        )


avg_logprob is always -1.0 — context buffer never updates

getattr(info, "avg_logprob", -1.0) looks up avg_logprob on the TranscriptionInfo object returned as the second element of model.transcribe(). TranscriptionInfo does not have this field — it lives on each Segment dataclass. So getattr always falls through to the -1.0 default.

Back in engine._transcribe, the guard is if txt and result.avg_logprob > -0.6, which is never True, so _ctx.push() is never called and the rolling context prompt (_ctx.prompt()) always returns "". Any accuracy improvement from contextual priming is silently lost.

The fix requires materializing the segments generator (which is already exhausted when building text) and computing the mean across segment logprobs:

def transcribe(
    self, audio: NDArray[np.float32], options: TranscriptionOptions
) -> TranscriptionResult:
    if self._model is None:
        raise RuntimeError("transcription backend is not loaded")
    segs, _info = self._model.transcribe(
        audio,
        language=options.language,
        initial_prompt=options.initial_prompt,
        beam_size=options.beam_size,
        best_of=options.best_of,
        temperature=options.temperature,
        vad_filter=options.vad_filter,
        word_timestamps=False,
    )
    segments = list(segs)
    text = "".join(s.text for s in segments).strip()
    avg_logprob = (
        float(sum(s.avg_logprob for s in segments) / len(segments))
        if segments
        else -1.0
    )
    return TranscriptionResult(text=text, avg_logprob=avg_logprob)

@cubic-dev-ai is this legit?

@future3OOO Yes, that's a legitimate catch. I suggested this because the current implementation in transcription.py will result in your context buffer never updating, which degrades transcription accuracy over time.

In faster-whisper, the transcribe() method returns a tuple: a generator of segments and a TranscriptionInfo object. The avg_logprob attribute exists on the individual segments, not on the TranscriptionInfo object.

Because FasterWhisperBackend.transcribe does this:

# info is a TranscriptionInfo object
return TranscriptionResult(
    text="".join(segment.text for segment in segs).strip(),
    avg_logprob=float(getattr(info, "avg_logprob", -1.0)),
)

The getattr call will always fall back to -1.0. This is problematic because engine.py contains this logic:

txt = result.text
if txt and result.avg_logprob > -0.6:
    self._ctx.push(txt)

Since avg_logprob is always -1.0, it will never be greater than -0.6, and self._ctx.push(txt) will never be called. This means the rolling prompt history (which Whisper uses to maintain context/vocabulary across segments) stays empty.
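To make the effect of that guard concrete, here is a minimal sketch of a rolling context gated on confidence. `RollingContext`, `update_context`, and the three-item window are hypothetical stand-ins for the engine's `_ctx`, which is not shown in this thread. Only the `-0.6` threshold and the `push`/`prompt` method names come from the discussion above.

```python
from collections import deque
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = -0.6  # threshold quoted in the review comment


@dataclass(frozen=True)
class TranscriptionResult:
    text: str
    avg_logprob: float


class RollingContext:
    """Hypothetical stand-in for the engine's context buffer."""

    def __init__(self, max_items: int = 3) -> None:
        # Keep only the most recent utterances; older ones fall off the left.
        self._items: deque[str] = deque(maxlen=max_items)

    def push(self, text: str) -> None:
        self._items.append(text)

    def prompt(self) -> str:
        return " ".join(self._items)


def update_context(ctx: RollingContext, result: TranscriptionResult) -> None:
    """Push text into the context only when the transcript is confident enough."""
    if result.text and result.avg_logprob > CONFIDENCE_THRESHOLD:
        ctx.push(result.text)


if __name__ == "__main__":
    ctx = RollingContext()
    update_context(ctx, TranscriptionResult("hello there", -1.0))  # rejected
    update_context(ctx, TranscriptionResult("hello there", -0.3))  # accepted
    print(repr(ctx.prompt()))
```

With a result whose `avg_logprob` is pinned at `-1.0`, the first call is the only path ever taken, so `prompt()` stays empty no matter how many utterances are processed.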

How to fix it

Collect the segments into a list first. The generator can only be consumed once, so both the text and the per-segment scores have to be read from the materialized list:

def transcribe(
    self, audio: NDArray[np.float32], options: TranscriptionOptions
) -> TranscriptionResult:
    segs, _info = self._model.transcribe(audio, ...)
    segments = list(segs)
    text = "".join(s.text for s in segments).strip()
    # Calculate the average logprob across all segments in this utterance
    avg_logprob = (
        sum(s.avg_logprob for s in segments) / len(segments)
        if segments
        else -1.0
    )
    return TranscriptionResult(text=text, avg_logprob=float(avg_logprob))

For more details on Whisper model settings or how to manage these learnings, check out the Cubic documentation.

future3OOO · 2026-05-02T08:53:36Z

@cubic-dev-ai review this

cubic-dev-ai · 2026-05-02T08:53:42Z

@cubic-dev-ai review this

@future3OOO I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

3 issues found across 26 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dictation_tool/benchmark.py">

<violation number="1" location="dictation_tool/benchmark.py:23">
P2: WER tokenization is ASCII-only, which can report incorrect (often zero) error rates for non-English benchmark transcripts.</violation>
</file>

<file name="dictation_tool/postprocess.py">

<violation number="1" location="dictation_tool/postprocess.py:63">
P2: Including `.` in the global space-collapsing regex removes normal sentence spacing around periods (e.g., `word . next` becomes `word.next`). Restrict this normalization to `@` so prose punctuation formatting is preserved.</violation>
</file>

<file name="dictation_tool/io.py">

<violation number="1" location="dictation_tool/io.py:337">
P1: Truncating resampled frame count with `int(...)` can produce sub-frame chunks for VAD, causing speech frames to be dropped in native-rate fallback on some device sample rates.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

cubic-dev-ai · 2026-05-02T08:59:28Z

+        return candidates
+
+    def _setup_resampler(self, native_sr: int, native_frames: int) -> None:
+        target_frames = int(native_frames * self._sr / native_sr)


P1: Truncating resampled frame count with int(...) can produce sub-frame chunks for VAD, causing speech frames to be dropped in native-rate fallback on some device sample rates.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At dictation_tool/io.py, line 337:
<comment>Truncating resampled frame count with `int(...)` can produce sub-frame chunks for VAD, causing speech frames to be dropped in native-rate fallback on some device sample rates.</comment>
<file context>
@@ -140,112 +156,266 @@ def _process_frame(self, frame: bytes, is_speech: bool) -> np.ndarray | None:
+        return candidates
+
+    def _setup_resampler(self, native_sr: int, native_frames: int) -> None:
+        target_frames = int(native_frames * self._sr / native_sr)
+        self._native_sr = native_sr
+        self._resample_idx = np.linspace(
</file context>

Suggested change

-        target_frames = int(native_frames * self._sr / native_sr)
+        target_frames = max(1, round(native_frames * self._sr / native_sr))
+        if self._gate is not None:
+            target_frames = max(target_frames, self._sr * self._gate.frame_duration_ms // 1000)
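The arithmetic behind this suggestion can be checked with a small standalone sketch. The function names and the example sample rates (22050 Hz and 11025 Hz devices, a 16 kHz target, 30 ms VAD frames) are assumptions chosen for illustration. Only the two formulas mirror the original line and the suggested change.

```python
from typing import Optional


def truncated_target_frames(native_frames: int, native_sr: int, target_sr: int) -> int:
    """Original behaviour: truncate the resampled frame count toward zero."""
    return int(native_frames * target_sr / native_sr)


def clamped_target_frames(
    native_frames: int,
    native_sr: int,
    target_sr: int,
    vad_frame_ms: Optional[int] = None,
) -> int:
    """Suggested behaviour: round, then never go below one full VAD frame."""
    frames = max(1, round(native_frames * target_sr / native_sr))
    if vad_frame_ms is not None:
        frames = max(frames, target_sr * vad_frame_ms // 1000)
    return frames


if __name__ == "__main__":
    vad_frame = 16000 * 30 // 1000  # 480 samples per 30 ms VAD frame at 16 kHz
    # A 30 ms block at 22050 Hz is 661 native frames after truncation.
    print(truncated_target_frames(661, 22050, 16000), "vs VAD frame", vad_frame)
    print(clamped_target_frames(661, 22050, 16000, vad_frame_ms=30))
```

In the 22050 Hz case rounding alone recovers the full frame. In the 11025 Hz case (330 native frames) rounding still lands one sample short, which is why the suggestion also clamps to the VAD frame size.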

cubic-dev-ai · 2026-05-02T08:59:28Z

+)
+
+Int16Audio = NDArray[np.int16]
+TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")


P2: WER tokenization is ASCII-only, which can report incorrect (often zero) error rates for non-English benchmark transcripts.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At dictation_tool/benchmark.py, line 23:
<comment>WER tokenization is ASCII-only, which can report incorrect (often zero) error rates for non-English benchmark transcripts.</comment>
<file context>
@@ -0,0 +1,103 @@
+)
+
+Int16Audio = NDArray[np.int16]
+TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")
+
+
</file context>
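One way to picture the concern is a word-level Levenshtein WER with a Unicode-aware tokenizer. The sketch below is not the project's `word_error_rate`, whose body does not appear in this thread. The `[^\W_]` token pattern, the use of `casefold()`, and the handling of an empty reference are all assumptions made for illustration.

```python
import re

# Pattern quoted in the review comment: ASCII letters and digits only.
ASCII_TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")
# Assumed alternative: any Unicode letter or digit, with an optional apostrophe part.
UNICODE_TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)?")


def tokenize(text: str) -> list[str]:
    return UNICODE_TOKEN_RE.findall(text.casefold())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        # Assumed convention: an empty reference scores 0.0 only if nothing was heard.
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)
```

With the ASCII pattern, a Cyrillic reference produces no tokens at all, so any scoring built on it has nothing to compare. That is consistent with the "often zero" behaviour the comment describes.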

cubic-dev-ai · 2026-05-02T08:59:28Z

+    )
+    _SPACES_AROUND_DOT_AT: ClassVar[re.Pattern[str]] = re.compile(
+        r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"
+        r"([@.])"


P2: Including . in the global space-collapsing regex removes normal sentence spacing around periods (e.g., word . next becomes word.next). Restrict this normalization to @ so prose punctuation formatting is preserved.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At dictation_tool/postprocess.py, line 63:
<comment>Including `.` in the global space-collapsing regex removes normal sentence spacing around periods (e.g., `word . next` becomes `word.next`). Restrict this normalization to `@` so prose punctuation formatting is preserved.</comment>
<file context>
@@ -0,0 +1,134 @@
+    )
+    _SPACES_AROUND_DOT_AT: ClassVar[re.Pattern[str]] = re.compile(
+        r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"
+        r"([@.])"
+        r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"
+    )
</file context>
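The behaviour described in this comment can be reproduced in isolation. The sketch below applies the quoted pattern shape globally, which is the scenario the reviewer describes and may not match how the project actually scopes it. The whitespace class is shortened, and `collapse` plus the `@`-only variant are hypothetical names used for comparison.

```python
import re

WS = r"[\t \u00A0]*"  # shortened whitespace class, for illustration only

SPACES_AROUND_DOT_AT = re.compile(WS + r"([@.])" + WS)
SPACES_AROUND_AT = re.compile(WS + r"(@)" + WS)


def collapse(pattern: re.Pattern[str], text: str) -> str:
    """Drop optional whitespace on both sides of the captured character."""
    return pattern.sub(r"\1", text)


if __name__ == "__main__":
    for sample in ("jane @ example . com", "Done. Next item"):
        print(repr(collapse(SPACES_AROUND_DOT_AT, sample)))
        print(repr(collapse(SPACES_AROUND_AT, sample)))
```

Applied to whole transcripts, the `[@.]` variant repairs a spoken email address but also glues sentences together. The `@`-only variant leaves prose alone at the cost of not closing gaps around the dot in a domain, so that case would need separate handling.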

This should have been resolved.

Thanks for the heads-up—I’ll re-check the latest changes and withdraw the comment if it’s already fixed.

Derive avg_logprob from faster-whisper segments so rolling context updates for high-confidence transcripts. Co-authored-by: Cursor <cursoragent@cursor.com>

future3OOO · 2026-05-02T12:13:31Z

@cubic-dev-ai

cubic-dev-ai · 2026-05-02T12:13:42Z

@cubic-dev-ai

@future3OOO I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

1 issue found across 26 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dictation_tool/io.py">

<violation number="1" location="dictation_tool/io.py:263">
P1: When VAD is enabled, pass-1 stream opening still uses `chunk_ms` blocksize, which can be shorter than `VADGate.frame_duration_ms` and causes VAD to drop chunks.</violation>
</file>


cubic-dev-ai · 2026-05-02T12:18:44Z

+        last_err: Exception | None = None
+        for dev in candidates:
+            try:
+                stream = self._try_open(dev, self._sr)


P1: When VAD is enabled, pass-1 stream opening still uses chunk_ms blocksize, which can be shorter than VADGate.frame_duration_ms and causes VAD to drop chunks.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At dictation_tool/io.py, line 263:
<comment>When VAD is enabled, pass-1 stream opening still uses `chunk_ms` blocksize, which can be shorter than `VADGate.frame_duration_ms` and causes VAD to drop chunks.</comment>
<file context>
@@ -140,112 +156,266 @@ def _process_frame(self, frame: bytes, is_speech: bool) -> np.ndarray | None:
+        last_err: Exception | None = None
+        for dev in candidates:
+            try:
+                stream = self._try_open(dev, self._sr)
+                self._native_sr = self._sr
+                self._resample_idx = None
</file context>

Suggested change

-                stream = self._try_open(dev, self._sr)
+                blocksize = self._frames
+                if self._gate is not None:
+                    blocksize = max(
+                        blocksize,
+                        int(self._sr * self._gate.frame_duration_ms / 1000),
+                    )
+                stream = self._try_open(dev, self._sr, blocksize=blocksize)
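The same clamp can be checked in isolation. `stream_blocksize` is a hypothetical helper, and the 16 kHz rate, 10 ms chunk, and 30 ms VAD frame are assumed example values. Only the `max(...)` expression mirrors the suggested change.

```python
from typing import Optional


def stream_blocksize(
    sample_rate: int, chunk_ms: int, vad_frame_ms: Optional[int] = None
) -> int:
    """Frames per callback: the configured chunk, but at least one VAD frame."""
    blocksize = sample_rate * chunk_ms // 1000
    if vad_frame_ms is not None:
        blocksize = max(blocksize, int(sample_rate * vad_frame_ms / 1000))
    return blocksize


if __name__ == "__main__":
    print(stream_blocksize(16000, chunk_ms=10))                   # no VAD gate
    print(stream_blocksize(16000, chunk_ms=10, vad_frame_ms=30))  # clamped up
    print(stream_blocksize(16000, chunk_ms=40, vad_frame_ms=30))  # already large enough
```

The clamp only changes anything when the configured chunk is shorter than the VAD frame, so configurations that already use larger chunks are unaffected.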

future3OOO and others added 2 commits May 2, 2026 14:38