Complete modern dictation model evaluation #4
Prevent mouse-hold dictation from transcribing overlapping VAD batches before release, and harden Windows microphone fallback so default devices are portable while explicit devices fail closed. Co-authored-by: Cursor <cursoragent@cursor.com>
Add the benchmarked faster-whisper model matrix path, real-audio scoring, and dedicated model-thread execution so modern CUDA models can be evaluated reliably. Switch the English default to distil-large-v3 based on recorded RTX 3080 benchmark evidence and document the remaining model guidance. Co-authored-by: Cursor <cursoragent@cursor.com>
Walkthrough
Adds a faster-whisper transcription backend, benchmark and profiling CLI (model matrix, real-audio runs, WER), deterministic post-processing, typed audio I/O, Pydantic v2 config migration, dependency constraint updates, engine/model lifecycle refactor, and corresponding tests and docs updates.
Changes
Dictation pipeline, benchmarking, and typing modernization
Sequence Diagram(s)
sequenceDiagram
participant CLI
participant Engine
participant Backend
participant FS
CLI->>Engine: request benchmark (models, audio, runs)
Engine->>FS: load_wav_mono_int16(path)
FS-->>Engine: Int16Audio
Engine->>Backend: instantiate FasterWhisperBackend / load()
Backend-->>Engine: backend ready
Engine->>Backend: transcribe(audio, TranscriptionOptions)
Backend-->>Engine: TranscriptionResult(text, avg_logprob)
Engine->>Engine: summarize latency, compute WER
Engine->>CLI: print per-model JSON results
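The "compute WER" step in the diagram above can be illustrated with a minimal sketch. This is not the code from dictation_tool/benchmark.py; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. It computes word-level edit distance divided by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Hypothetical helper for illustration; tokenization here is a plain
    lowercase whitespace split, which is an assumption.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention chosen for this sketch: empty reference scores 0.0
        # only when the hypothesis is empty too.
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


exact = word_error_rate("the quick brown fox", "the quick brown fox")
one_deleted = word_error_rate("the quick brown fox", "the quick fox")
```

Dropping one of four reference words gives a rate of 0.25, while an identical transcript gives 0.0.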
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Greptile Summary
This PR completes the production benchmark infrastructure: real-audio WAV benchmarking, WER scoring, a multi-model matrix runner, JSONL profiling, and a All five bugs previously flagged (asyncio.Event thread-unsafety in
Confidence Score: 5/5
Safe to merge: all previously reported P1 bugs are resolved and no new P1 or P0 issues were found. The five confirmed bugs from the prior review (asyncio.Event thread-unsafety, shared _f32 race, avg_logprob always -1.0, stale-batch duplicates, VAD frame-size mismatch) are all corrected. The only new finding is a one-line P2 dead code path in _config_with_model. No P1 or P0 issues remain.
No files require special attention. dictation_tool/main.py has a trivial dead-code cleanup opportunity.
Sequence Diagram
sequenceDiagram
participant Mic as AudioStream
participant Shadow as Ring/Shadow Buffer
participant EventLoop as Event Loop
participant Pool as model_pool
participant Backend as FasterWhisperBackend
participant Post as DictationPostProcessor
participant Clip as Clipboard
Mic->>Shadow: on_raw_chunk(chunk) via _audio_lock
Mic->>EventLoop: _q.put_nowait(chunk)
EventLoop->>EventLoop: mic.chunks yields VAD segment
EventLoop->>Pool: run_in_executor(backend.transcribe, audio.copy(), options)
Pool->>Backend: model.transcribe(audio)
Backend-->>Pool: TranscriptionResult(text, avg_logprob)
Pool-->>EventLoop: result
EventLoop->>Post: clean_model_text(result.text)
Post-->>EventLoop: cleaned text
EventLoop->>Post: format_clipboard_text(cleaned)
Post-->>EventLoop: paste-ready text
EventLoop->>Clip: _clip_q.put(text)
Clip->>Clip: copy + optional auto-paste
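The `run_in_executor(backend.transcribe, audio.copy(), options)` hop in the diagram above can be sketched with the standard library only. This is an illustrative stand-in, not the project's engine code: `fake_transcribe`, the `array` buffer, and the `release` event are assumptions. It shows why the audio is copied before it crosses to the single model thread: the caller can keep mutating its shared buffer without affecting the in-flight job.

```python
import asyncio
import threading
from array import array
from concurrent.futures import ThreadPoolExecutor


def fake_transcribe(samples: array, release: threading.Event) -> float:
    """Stand-in for a blocking model call; waits so the caller can mutate first."""
    release.wait(timeout=5.0)
    return sum(samples)


async def run_once() -> float:
    loop = asyncio.get_running_loop()
    release = threading.Event()
    shared = array("f", [1.0, 2.0, 3.0])  # buffer the audio side keeps reusing

    # One worker thread serializes all model calls, like a dedicated model pool.
    with ThreadPoolExecutor(max_workers=1, thread_name_prefix="model") as pool:
        snapshot = array("f", shared)  # copy before handing off to the worker
        future = loop.run_in_executor(pool, fake_transcribe, snapshot, release)
        shared[0] = 100.0  # later audio overwrites the shared buffer
        release.set()
        return await future


result = asyncio.run(run_once())
```

The worker sums the snapshot taken before the overwrite, so the result reflects the original three samples rather than the mutated buffer.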
Reviews (9): Last reviewed commit: "fix(transcription): propagate segment co..."
Use live dictation feedback to prefer large-v3-turbo as the balanced English default while keeping distil-large-v3 documented as the speed option. Co-authored-by: Cursor <cursoragent@cursor.com>
Copy inference audio before handing it to the model executor, avoid empty hold-release clipboard writes, and remove the stale duplicate utils module. Tighten the smallest related benchmark, config, VAD, and test assertions flagged in review. Co-authored-by: Cursor <cursoragent@cursor.com>
Preserve the config default by keeping mouse triggering disabled unless --mouse is passed. Co-authored-by: Cursor <cursoragent@cursor.com>
Recognize spoken email domains like "at gmail.com" and strip repeated cc/bcc hallucination tails after email addresses so pasted dictation preserves the intended address. Co-authored-by: Cursor <cursoragent@cursor.com>
Guard shared audio buffers across callback and event-loop threads, and report benchmark input errors through argparse instead of raw tracebacks. Co-authored-by: Cursor <cursoragent@cursor.com>
Preserve no-VAD ring contents across wrap and exact-fill cases so buffered audio is popped in order. Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the documented middle-mouse hold-to-record workflow enabled by default while preserving --no-mouse as the explicit opt-out. Co-authored-by: Cursor <cursoragent@cursor.com>
Clear stale VAD batches while mouse hold recording owns audio, and align benchmark real-time factor with the reported inference latency. Co-authored-by: Cursor <cursoragent@cursor.com>
Align native resample fallback with VAD frame sizing and schedule mouse-hold recording event mutations on the event loop thread. Co-authored-by: Cursor <cursoragent@cursor.com>
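The last commit above schedules event mutations on the event loop thread because `asyncio.Event` is not thread-safe. A minimal sketch of that pattern follows; the `mouse_listener` thread and the `released` event are hypothetical names, not the project's identifiers. The non-loop thread never touches the event directly and instead asks the loop to call `set`.

```python
import asyncio
import threading


async def wait_for_release() -> bool:
    loop = asyncio.get_running_loop()
    released = asyncio.Event()

    def mouse_listener() -> None:
        # Runs on a non-loop thread (for example an input-hook callback).
        # Calling released.set() here directly would not be thread-safe,
        # so the mutation is scheduled onto the loop thread instead.
        loop.call_soon_threadsafe(released.set)

    listener = threading.Thread(target=mouse_listener)
    listener.start()
    await asyncio.wait_for(released.wait(), timeout=5.0)
    listener.join()
    return released.is_set()


released_ok = asyncio.run(wait_for_release())
```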
    def transcribe(
        self, audio: NDArray[np.float32], options: TranscriptionOptions
    ) -> TranscriptionResult:
        if self._model is None:
            raise RuntimeError("transcription backend is not loaded")

        segs, info = self._model.transcribe(
            audio,
            language=options.language,
            initial_prompt=options.initial_prompt,
            beam_size=options.beam_size,
            best_of=options.best_of,
            temperature=options.temperature,
            vad_filter=options.vad_filter,
            word_timestamps=False,
        )
        return TranscriptionResult(
            text="".join(segment.text for segment in segs).strip(),
            avg_logprob=float(getattr(info, "avg_logprob", -1.0)),
        )
avg_logprob is always -1.0 — context buffer never updates
getattr(info, "avg_logprob", -1.0) looks up avg_logprob on the TranscriptionInfo object returned as the second element of model.transcribe(). TranscriptionInfo does not have this field — it lives on each Segment dataclass. So getattr always falls through to the -1.0 default.
Back in engine._transcribe, the guard is if txt and result.avg_logprob > -0.6, which is never True, so _ctx.push() is never called and the rolling context prompt (_ctx.prompt()) always returns "". Any accuracy improvement from contextual priming is silently lost.
The fix requires materializing the segments generator (which is already exhausted when building text) and computing the mean across segment logprobs:
def transcribe(
    self, audio: NDArray[np.float32], options: TranscriptionOptions
) -> TranscriptionResult:
    if self._model is None:
        raise RuntimeError("transcription backend is not loaded")
    segs, _info = self._model.transcribe(
        audio,
        language=options.language,
        initial_prompt=options.initial_prompt,
        beam_size=options.beam_size,
        best_of=options.best_of,
        temperature=options.temperature,
        vad_filter=options.vad_filter,
        word_timestamps=False,
    )
    segments = list(segs)
    text = "".join(s.text for s in segments).strip()
    avg_logprob = (
        float(sum(s.avg_logprob for s in segments) / len(segments))
        if segments
        else -1.0
    )
    return TranscriptionResult(text=text, avg_logprob=avg_logprob)
@future3OOO Yes, that's a legitimate catch. I suggested this because the current implementation in transcription.py will result in your context buffer never updating, which degrades transcription accuracy over time.
In faster-whisper, the transcribe() method returns a tuple: a generator of segments and a TranscriptionInfo object. The avg_logprob attribute exists on the individual segments, not on the TranscriptionInfo object.
Because FasterWhisperBackend.transcribe does this:
    # info is a TranscriptionInfo object
    return TranscriptionResult(
        text="".join(segment.text for segment in segs).strip(),
        avg_logprob=float(getattr(info, "avg_logprob", -1.0)),
    )
The getattr call will always fall back to -1.0. This is problematic because engine.py contains this logic:
    txt = result.text
    if txt and result.avg_logprob > -0.6:
        self._ctx.push(txt)
Since avg_logprob is always -1.0, it will never be greater than -0.6, and self._ctx.push(txt) will never be called. This means the rolling prompt history (which Whisper uses to maintain context/vocabulary across segments) stays empty.
How to fix it
You need to collect the segments into a list so you can access their individual scores before the generator is exhausted:
def transcribe(
    self, audio: NDArray[np.float32], options: TranscriptionOptions
) -> TranscriptionResult:
    segs, _info = self._model.transcribe(audio, ...)
    segments = list(segs)
    text = "".join(s.text for s in segments).strip()
    # Calculate the average logprob across all segments in this utterance
    avg_logprob = (
        sum(s.avg_logprob for s in segments) / len(segments)
        if segments else -1.0
    )
    return TranscriptionResult(text=text, avg_logprob=float(avg_logprob))
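The failure mode described in this thread can be reproduced without loading a model. The sketch below uses stand-in dataclasses (`FakeSegment` and `FakeInfo` are hypothetical, not faster-whisper types) that mirror only the detail the review relies on: the per-segment objects carry `avg_logprob` and the info object does not.

```python
from dataclasses import dataclass


@dataclass
class FakeSegment:
    """Stand-in for a transcription segment (illustrative only)."""
    text: str
    avg_logprob: float


@dataclass
class FakeInfo:
    """Stand-in for the info object; deliberately has no avg_logprob."""
    language: str


def mean_segment_logprob(segments: list[FakeSegment]) -> float:
    # Same fallback as the suggested fix: -1.0 when nothing was decoded.
    if not segments:
        return -1.0
    return sum(s.avg_logprob for s in segments) / len(segments)


segments = [FakeSegment(" hello", -0.25), FakeSegment(" world", -0.75)]
info = FakeInfo(language="en")

# Original code path: the attribute is missing, so the default always wins.
from_info = float(getattr(info, "avg_logprob", -1.0))
# Suggested code path: average the per-segment values.
from_segments = mean_segment_logprob(segments)

# Mirrors the engine guard `result.avg_logprob > -0.6`.
context_updates_before = from_info > -0.6
context_updates_after = from_segments > -0.6
```

With the info-based lookup the guard is never satisfied; with the segment mean of -0.5 it is, so the rolling context would be updated.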
@cubic-dev-ai review this
@future3OOO I have started the AI code review. It will take a few minutes to complete.
3 issues found across 26 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dictation_tool/benchmark.py">
<violation number="1" location="dictation_tool/benchmark.py:23">
P2: WER tokenization is ASCII-only, which can report incorrect (often zero) error rates for non-English benchmark transcripts.</violation>
</file>
<file name="dictation_tool/postprocess.py">
<violation number="1" location="dictation_tool/postprocess.py:63">
P2: Including `.` in the global space-collapsing regex removes normal sentence spacing around periods (e.g., `word . next` becomes `word.next`). Restrict this normalization to `@` so prose punctuation formatting is preserved.</violation>
</file>
<file name="dictation_tool/io.py">
<violation number="1" location="dictation_tool/io.py:337">
P1: Truncating resampled frame count with `int(...)` can produce sub-frame chunks for VAD, causing speech frames to be dropped in native-rate fallback on some device sample rates.</violation>
</file>
        return candidates

    def _setup_resampler(self, native_sr: int, native_frames: int) -> None:
        target_frames = int(native_frames * self._sr / native_sr)
P1: Truncating resampled frame count with int(...) can produce sub-frame chunks for VAD, causing speech frames to be dropped in native-rate fallback on some device sample rates.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dictation_tool/io.py, line 337:
<comment>Truncating resampled frame count with `int(...)` can produce sub-frame chunks for VAD, causing speech frames to be dropped in native-rate fallback on some device sample rates.</comment>
<file context>
@@ -140,112 +156,266 @@ def _process_frame(self, frame: bytes, is_speech: bool) -> np.ndarray | None:
+ return candidates
+
+ def _setup_resampler(self, native_sr: int, native_frames: int) -> None:
+ target_frames = int(native_frames * self._sr / native_sr)
+ self._native_sr = native_sr
+ self._resample_idx = np.linspace(
</file context>
-        target_frames = int(native_frames * self._sr / native_sr)
+        target_frames = max(1, round(native_frames * self._sr / native_sr))
+        if self._gate is not None:
+            target_frames = max(target_frames, self._sr * self._gate.frame_duration_ms // 1000)
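A small arithmetic sketch makes the truncation issue concrete. The function names and the example device rates (22050 Hz and 11025 Hz with 30 ms native blocks) are assumptions chosen for illustration; only the formulas come from the suggestion above.

```python
from typing import Optional


def truncated_target_frames(native_frames: int, native_sr: int, target_sr: int) -> int:
    """Original behaviour: int() truncates toward zero."""
    return int(native_frames * target_sr / native_sr)


def safe_target_frames(
    native_frames: int,
    native_sr: int,
    target_sr: int,
    vad_frame_ms: Optional[int] = None,
) -> int:
    """Suggested behaviour: round, then never go below one VAD frame."""
    frames = max(1, round(native_frames * target_sr / native_sr))
    if vad_frame_ms is not None:
        frames = max(frames, target_sr * vad_frame_ms // 1000)
    return frames


# A 30 ms VAD frame at 16 kHz needs 480 samples.
# 22050 Hz device, 661-frame native block: 661 * 16000 / 22050 is about 479.6.
short_by_one = truncated_target_frames(661, 22050, 16000)
rounded_ok = safe_target_frames(661, 22050, 16000, vad_frame_ms=30)

# 11025 Hz device, 330-frame native block: about 478.9, so rounding alone
# still falls short and the VAD-frame floor is what restores 480.
floor_needed = safe_target_frames(330, 11025, 16000, vad_frame_ms=30)
```

In the first case truncation yields 479 samples, one short of a full frame; in the second, rounding gives 479 and the `max` against the gate frame size lifts it to 480.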
)

Int16Audio = NDArray[np.int16]
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")
P2: WER tokenization is ASCII-only, which can report incorrect (often zero) error rates for non-English benchmark transcripts.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dictation_tool/benchmark.py, line 23:
<comment>WER tokenization is ASCII-only, which can report incorrect (often zero) error rates for non-English benchmark transcripts.</comment>
<file context>
@@ -0,0 +1,103 @@
+)
+
+Int16Audio = NDArray[np.int16]
+TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")
+
+
</file context>
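To see why ASCII-only tokenization can report a zero error rate, consider the sketch below. `ASCII_TOKEN_RE` repeats the pattern quoted above; `UNICODE_TOKEN_RE` and `tokenize` are one possible alternative assumed here for illustration, not the fix that was actually committed.

```python
import re

# Pattern quoted in the review: only a-z and 0-9 count as token characters.
ASCII_TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")

# Assumed alternative: [^\W_] matches any Unicode letter or digit.
UNICODE_TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)?")


def tokenize(text: str, pattern: re.Pattern[str]) -> list[str]:
    """Lowercase with casefold() and return all pattern matches."""
    return pattern.findall(text.casefold())


ascii_tokens = tokenize("Привет мир", ASCII_TOKEN_RE)
unicode_tokens = tokenize("Привет мир", UNICODE_TOKEN_RE)
english_tokens = tokenize("Don't stop", UNICODE_TOKEN_RE)
```

With the ASCII pattern, a Cyrillic reference and any Cyrillic hypothesis both tokenize to an empty list, so an edit-distance comparison sees no differences at all.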
    )
    _SPACES_AROUND_DOT_AT: ClassVar[re.Pattern[str]] = re.compile(
        r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"
        r"([@.])"
P2: Including . in the global space-collapsing regex removes normal sentence spacing around periods (e.g., word . next becomes word.next). Restrict this normalization to @ so prose punctuation formatting is preserved.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dictation_tool/postprocess.py, line 63:
<comment>Including `.` in the global space-collapsing regex removes normal sentence spacing around periods (e.g., `word . next` becomes `word.next`). Restrict this normalization to `@` so prose punctuation formatting is preserved.</comment>
<file context>
@@ -0,0 +1,134 @@
+ )
+ _SPACES_AROUND_DOT_AT: ClassVar[re.Pattern[str]] = re.compile(
+ r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"
+ r"([@.])"
+ r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"
+ )
</file context>
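The effect described in this comment can be checked with a reduced version of the pattern. The whitespace class below is shortened to tab, space, and no-break space for readability, and `AT_ONLY` is the restriction the reviewer proposes; both compiled patterns and the `collapse` helper are illustrative assumptions rather than the project's code.

```python
import re

WS = r"[\t \u00A0]*"  # shortened whitespace class, for illustration only

DOT_AT = re.compile(WS + r"([@.])" + WS)   # shape of the flagged pattern
AT_ONLY = re.compile(WS + r"(@)" + WS)     # reviewer's suggested restriction


def collapse(text: str, pattern: re.Pattern[str]) -> str:
    """Remove whitespace on both sides of each matched symbol."""
    return pattern.sub(r"\1", text)


prose_with_dot = collapse("word . next", DOT_AT)
sentence_with_dot = collapse("Done. Next step", DOT_AT)
prose_with_at_only = collapse("word . next", AT_ONLY)
email_with_at_only = collapse("jane @ example.com", AT_ONLY)
```

The dot-and-at pattern glues ordinary sentences together, while the at-only pattern leaves prose untouched and still tightens the address. A spoken domain such as "example . com" would then need separate, narrower handling.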
This should have been resolved.
Thanks for the heads-up—I’ll re-check the latest changes and withdraw the comment if it’s already fixed.
Derive avg_logprob from faster-whisper segments so rolling context updates for high-confidence transcripts. Co-authored-by: Cursor <cursoragent@cursor.com>
@future3OOO I have started the AI code review. It will take a few minutes to complete.
1 issue found across 26 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dictation_tool/io.py">
<violation number="1" location="dictation_tool/io.py:263">
P1: When VAD is enabled, pass-1 stream opening still uses `chunk_ms` blocksize, which can be shorter than `VADGate.frame_duration_ms` and causes VAD to drop chunks.</violation>
</file>
        last_err: Exception | None = None
        for dev in candidates:
            try:
                stream = self._try_open(dev, self._sr)
P1: When VAD is enabled, pass-1 stream opening still uses chunk_ms blocksize, which can be shorter than VADGate.frame_duration_ms and causes VAD to drop chunks.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dictation_tool/io.py, line 263:
<comment>When VAD is enabled, pass-1 stream opening still uses `chunk_ms` blocksize, which can be shorter than `VADGate.frame_duration_ms` and causes VAD to drop chunks.</comment>
<file context>
@@ -140,112 +156,266 @@ def _process_frame(self, frame: bytes, is_speech: bool) -> np.ndarray | None:
+ last_err: Exception | None = None
+ for dev in candidates:
+ try:
+ stream = self._try_open(dev, self._sr)
+ self._native_sr = self._sr
+ self._resample_idx = None
</file context>
-                stream = self._try_open(dev, self._sr)
+                blocksize = self._frames
+                if self._gate is not None:
+                    blocksize = max(
+                        blocksize,
+                        int(self._sr * self._gate.frame_duration_ms / 1000),
+                    )
+                stream = self._try_open(dev, self._sr, blocksize=blocksize)
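The blocksize rule in the suggestion above can be sketched as plain arithmetic. The helper names are hypothetical, and the sketch assumes that `self._frames` corresponds to `chunk_ms` worth of samples and that the VAD only consumes whole frames; neither is confirmed by the snippet shown.

```python
from typing import Optional


def vad_safe_blocksize(
    sample_rate: int, chunk_ms: int, vad_frame_ms: Optional[int] = None
) -> int:
    """Blocksize in samples, widened to at least one VAD frame when gated."""
    blocksize = sample_rate * chunk_ms // 1000
    if vad_frame_ms is not None:
        blocksize = max(blocksize, int(sample_rate * vad_frame_ms / 1000))
    return blocksize


def whole_vad_frames(block_len: int, sample_rate: int, vad_frame_ms: int) -> int:
    """How many complete VAD frames fit into one delivered block."""
    frame_len = sample_rate * vad_frame_ms // 1000
    return block_len // frame_len


# 20 ms chunks at 16 kHz are 320 samples, shorter than a 30 ms (480-sample) frame.
ungated = vad_safe_blocksize(16000, chunk_ms=20)
frames_in_ungated = whole_vad_frames(ungated, 16000, vad_frame_ms=30)

gated = vad_safe_blocksize(16000, chunk_ms=20, vad_frame_ms=30)
frames_in_gated = whole_vad_frames(gated, 16000, vad_frame_ms=30)
```

A 320-sample block holds no complete 30 ms frame, which matches the dropped-chunk symptom in the review; widening the block to 480 samples yields exactly one frame per callback.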
Summary
faster-whisper 1.2.1/ctranslate2 4.7.1.
large-v3-turbo based on recorded RTX 3080 benchmark evidence plus live dictation feedback showing better accuracy than distil-large-v3 while remaining fast.
Test plan
pytest tests -q -m "not gpu" -> 118 passed, 3 deselected
pytest tests/test_gpu_smoke.py -q -m gpu -> 3 passed
black --check dictation_tool tests
ruff check dictation_tool tests pyproject.toml
mypy dictation_tool
pip check
Notes
--device cpu, but the benchmarked production recommendation is GPU/CUDA. CPU use of large-v3-turbo is expected to be much slower than GPU for this workflow.
Made with Cursor