Complete modern dictation model evaluation #4

Open
future3OOO wants to merge 12 commits into master from feature/dictation-performance-foundation

@future3OOO
Owner

@future3OOO future3OOO commented May 2, 2026

Summary

  • Adds the production benchmark path for modern faster-whisper engines, including real-audio WAV benchmarking, WER scoring, model matrix runs, JSON output, and model-VAD comparison.
  • Moves transcription and post-processing behind focused modules, keeps CUDA model load/inference/close on one model thread, and updates runtime dependencies to faster-whisper 1.2.1 / ctranslate2 4.7.1.
  • Switches the English default to large-v3-turbo based on recorded RTX 3080 benchmark evidence plus live dictation feedback showing better accuracy than distil-large-v3 while remaining fast.

Test plan

  • pytest tests -q -m "not gpu" -> 118 passed, 3 deselected
  • pytest tests/test_gpu_smoke.py -q -m gpu -> 3 passed
  • black --check dictation_tool tests
  • ruff check dictation_tool tests pyproject.toml
  • mypy dictation_tool
  • pip check

Notes

  • CPU mode is still available with --device cpu, but the benchmarked production recommendation is GPU/CUDA. CPU use of large-v3-turbo is expected to be much slower than GPU for this workflow.

Made with Cursor

Summary by CodeRabbit

  • New Features
    • CLI benchmark matrix, real-audio benchmarking, per-model runs, model-VAD comparison, JSONL profiling, WER scoring, and deterministic postprocessing helpers.
  • Changed
    • Default English model -> large-v3-turbo; updated runtime dependency constraints for transcription backends and setuptools pin for compatibility.
  • Bug Fixes
    • Fixed No-VAD buffering, benchmark/CLI chunk wiring, clipboard behavior, bullet/spacing rendering, and Windows model teardown.
  • Documentation
    • Expanded README, CHANGELOG, and performance plan with benchmarks and usage guidance.
  • Tests
    • Added/updated unit and golden tests covering benchmarking, postprocessing, transcription, I/O, engine, and GPU smoke.

future3OOO and others added 2 commits May 2, 2026 14:38
Prevent mouse-hold dictation from transcribing overlapping VAD batches before release, and harden Windows microphone fallback so default devices are portable while explicit devices fail closed.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add the benchmarked faster-whisper model matrix path, real-audio scoring, and dedicated model-thread execution so modern CUDA models can be evaluated reliably. Switch the English default to distil-large-v3 based on recorded RTX 3080 benchmark evidence and document the remaining model guidance.

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai

coderabbitai Bot commented May 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a faster-whisper transcription backend, benchmark and profiling CLI (model matrix, real-audio runs, WER), deterministic post-processing, typed audio I/O, Pydantic v2 config migration, dependency constraint updates, engine/model lifecycle refactor, and corresponding tests and docs updates.

Changes

Dictation pipeline, benchmarking, and typing modernization

| Layer / File(s) | Summary |
|---|---|
| Data Shape & Types: dictation_tool/benchmark.py, dictation_tool/transcription.py, dictation_tool/io.py, dictation_tool/engine.py | Introduce numpy audio type aliases (Int16Audio, Float32Audio, Int32Audio, Float64Array), TranscriptionOptions/TranscriptionResult, and DEFAULT_MODEL_MATRIX. |
| Core Implementation: dictation_tool/transcription.py, dictation_tool/postprocess.py, dictation_tool/benchmark.py | Add FasterWhisperBackend (lazy load/close/transcribe), deterministic DictationPostProcessor (clean/format pipeline), and benchmark utilities (parse_model_matrix, summarize_latency_ms, load_wav_mono_int16, word_error_rate). |
| Engine & I/O Behavior: dictation_tool/engine.py, dictation_tool/io.py | Engine uses a lazily-loaded backend and model pool, routes model output through postprocessor, refactors _Ring to typed int16 buffer, updates VAD pre/post-buffering semantics and AudioStream resampling, and implements benchmark timing/WER reporting. |
| Configuration & CLI Wiring: dictation_tool/config.py, dictation_tool/__main__.py | Migrate to Pydantic v2 (model_config, @field_validator, @model_validator), add profile field, change default model to large-v3-turbo, normalize vad/aggr/chunk fields; extend CLI with --bench-models, --bench-audio, --bench-reference, --bench-runs, --no-model-vad, --profile, and a benchmark-matrix runner that returns early. |
| Dependencies & Packaging: pyproject.toml, requirements.txt | Adjust dependency constraints: faster-whisper>=1.2.1,<2, add ctranslate2>=4.7.1,<5, add setuptools<81, and update tooling/pytest markers. |
| Tests / Documentation: tests/*, CHANGELOG.md, README.md, docs/plans/* | Add and update tests for benchmark helpers, FasterWhisper adapter, post-processing golden tests, engine/VAD/IO behavior, and CLI benchmark flags; update README, CHANGELOG, and add a PRD doc. |
| Utilities / Minor: dictation_tool/utils.py, dictation_tool/utils/__init__.py, dictation_tool/utils/profile.py, dictation_tool/prompts.py | Minor import/formatting tweaks, typed prof signature, and a comment punctuation fix. |
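
As a rough illustration of the latency summary and real-time factor that the benchmark utilities report, here is a minimal sketch. The function name `summarize_latency_sketch`, the returned keys, and the choice of nearest-rank p50/p95 are assumptions made for this example; the actual `summarize_latency_ms` signature and output are not shown in this conversation.

```python
import math


def nearest_rank(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_latency_sketch(latencies_ms: list[float], audio_seconds: float) -> dict[str, float]:
    """Hypothetical summary: run count, mean, p50, p95, and real-time factor."""
    if not latencies_ms or audio_seconds <= 0:
        raise ValueError("need at least one latency and a positive audio duration")
    ordered = sorted(latencies_ms)
    mean_ms = sum(ordered) / len(ordered)
    return {
        "runs": float(len(ordered)),
        "mean_ms": mean_ms,
        "p50_ms": nearest_rank(ordered, 50),
        "p95_ms": nearest_rank(ordered, 95),
        # Real-time factor: inference seconds per second of audio (below 1.0 is faster than real time).
        "rtf": (mean_ms / 1000.0) / audio_seconds,
    }
```

Deriving the real-time factor from the same latency figure that is reported keeps the two numbers consistent with each other, which is the property one of the later commits in this PR describes.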

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant Engine
    participant Backend
    participant FS
    CLI->>Engine: request benchmark (models, audio, runs)
    Engine->>FS: load_wav_mono_int16(path)
    FS-->>Engine: Int16Audio
    Engine->>Backend: instantiate FasterWhisperBackend / load()
    Backend-->>Engine: backend ready
    Engine->>Backend: transcribe(audio, TranscriptionOptions)
    Backend-->>Engine: TranscriptionResult(text, avg_logprob)
    Engine->>Engine: summarize latency, compute WER
    Engine->>CLI: print per-model JSON results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Soft hops of code, benchmarks in tow,

WAVs align, and latencies grow—then slow.
Regex prunes, bullets find their place,
Models load, transcribe, and yield neat grace,
Hop—large-v3-turbo leads the race.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 38.31%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Complete modern dictation model evaluation' is directly related to the main changes: adding benchmarking for modern faster-whisper engines, evaluating model performance, and changing the default model to large-v3-turbo based on benchmark evidence. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps

greptile-apps Bot commented May 2, 2026

Greptile Summary

This PR completes the production benchmark infrastructure — real-audio WAV benchmarking, WER scoring, a multi-model matrix runner, JSONL profiling, and a --bench CLI path. It also extracts transcription and post-processing into dedicated modules, migrates Config cleanly to Pydantic v2, and switches the default English model to large-v3-turbo.

All five bugs previously flagged (asyncio.Event thread-unsafety in _hold_start/_hold_stop, shared _f32 buffer race in run_in_executor, avg_logprob always -1.0, stale-batch duplicate transcription, VAD frame-size mismatch in the resample fallback) are corrected in this revision. The one new observation (see inline) is P2-level dead code with no runtime impact.
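
To illustrate the first item in that list: an `asyncio.Event` is not thread-safe, so a callback that runs on another thread (for example a mouse listener) should hand the mutation to the event loop instead of calling `set()` directly. The sketch below uses hypothetical names (`wait_for_hold`, `on_mouse_down`) and is not the engine's actual `_hold_start`/`_hold_stop` code.

```python
import asyncio
import threading


async def wait_for_hold(timeout: float = 1.0) -> bool:
    """Return True once a listener thread has signalled the event via the loop."""
    loop = asyncio.get_running_loop()
    hold_started = asyncio.Event()  # should only be mutated on the loop thread

    def on_mouse_down() -> None:
        # Runs on a separate (hypothetical) listener thread. Calling
        # hold_started.set() here would touch loop state from the wrong
        # thread, so schedule it on the loop instead.
        loop.call_soon_threadsafe(hold_started.set)

    listener = threading.Thread(target=on_mouse_down)
    listener.start()
    await asyncio.wait_for(hold_started.wait(), timeout=timeout)
    listener.join()
    return hold_started.is_set()
```

`loop.call_soon_threadsafe` also wakes the loop if it is idle, which a direct `set()` from another thread is not guaranteed to do.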

Confidence Score: 5/5

Safe to merge — all previously-reported P1 bugs are resolved and no new P1 or P0 issues were found.

The five confirmed bugs from the prior review (asyncio.Event thread-unsafety, shared _f32 race, avg_logprob always -1.0, stale-batch duplicates, VAD frame-size mismatch) are all corrected. The only new finding is a one-line P2 dead code path in _config_with_model. No P1 or P0 issues remain.

No files require special attention. dictation_tool/__main__.py has a trivial dead-code cleanup opportunity.

Important Files Changed

| Filename | Overview |
|---|---|
| dictation_tool/engine.py | Thread-safety fixes applied throughout (call_soon_threadsafe, _audio_lock, inference_audio.copy()); stale-batch duplicate transcription resolved; asyncio.Event properly guarded; model teardown moved to stop(). |
| dictation_tool/transcription.py | New module; materializes segment generator and computes mean avg_logprob correctly, fixing the always-(-1.0) context-update bug from the previous review. |
| dictation_tool/io.py | Resample fallback now sizes native_frames to at least one VAD frame (fixing the silent-drop bug); device sort key corrected to use enumerate index; type annotations tightened. |
| dictation_tool/postprocess.py | New module; extracts all regex post-processing from engine; clean split between clean_model_text (email/URL/command) and format_clipboard_text (spoken punctuation + trim). |
| dictation_tool/benchmark.py | New module; WER via Levenshtein, WAV loader with linear resample, latency percentile summary, model matrix parser, all correct. |
| dictation_tool/__main__.py | Benchmark matrix runner wired correctly; _config_with_model contains a dead Pydantic v1 fallback (config.py now hard-requires pydantic_settings v2). |
| dictation_tool/config.py | Clean Pydantic v2 migration: BaseSettings via pydantic_settings, field_validator/@classmethod, model_validator(mode="after"), object.__setattr__ for frozen mutation in validators. |

Sequence Diagram

sequenceDiagram
    participant Mic as AudioStream
    participant Shadow as Ring/Shadow Buffer
    participant EventLoop as Event Loop
    participant Pool as model_pool
    participant Backend as FasterWhisperBackend
    participant Post as DictationPostProcessor
    participant Clip as Clipboard

    Mic->>Shadow: on_raw_chunk(chunk) via _audio_lock
    Mic->>EventLoop: _q.put_nowait(chunk)
    EventLoop->>EventLoop: mic.chunks yields VAD segment
    EventLoop->>Pool: run_in_executor(backend.transcribe, audio.copy(), options)
    Pool->>Backend: model.transcribe(audio)
    Backend-->>Pool: TranscriptionResult(text, avg_logprob)
    Pool-->>EventLoop: result
    EventLoop->>Post: clean_model_text(result.text)
    Post-->>EventLoop: cleaned text
    EventLoop->>Post: format_clipboard_text(cleaned)
    Post-->>EventLoop: paste-ready text
    EventLoop->>Clip: _clip_q.put(text)
    Clip->>Clip: copy + optional auto-paste
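
The `audio.copy()` step in the diagram matters because the capture side may reuse its buffer while the worker thread is still reading it. A minimal standard-library sketch of that idea follows; a plain list stands in for the NumPy buffer, and `slow_sum` is a hypothetical stand-in for `backend.transcribe`.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_sum(samples: list[float]) -> float:
    """Stand-in for a transcription call that reads the buffer after a delay."""
    time.sleep(0.05)
    return sum(samples)


async def transcribe_snapshot() -> float:
    loop = asyncio.get_running_loop()
    shared = [1.0, 1.0, 1.0, 1.0]  # buffer that the capture side will reuse
    with ThreadPoolExecutor(max_workers=1) as pool:
        snapshot = list(shared)  # copy before handing off to the worker
        future = loop.run_in_executor(pool, slow_sum, snapshot)
        # The capture side overwrites the shared buffer right away; the
        # worker is unaffected because it owns its own snapshot.
        for i in range(len(shared)):
            shared[i] = 0.0
        return await future
```

Passing `shared` itself instead of `snapshot` would let the worker observe the overwritten samples, which is the race the review describes.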

Reviews (9): Last reviewed commit: "fix(transcription): propagate segment co..."

greptile-apps[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

Use live dictation feedback to prefer large-v3-turbo as the balanced English default while keeping distil-large-v3 documented as the speed option.

Co-authored-by: Cursor <cursoragent@cursor.com>
coderabbitai[bot]

This comment was marked as resolved.

future3OOO and others added 2 commits May 2, 2026 18:26
Copy inference audio before handing it to the model executor, avoid empty hold-release clipboard writes, and remove the stale duplicate utils module. Tighten the smallest related benchmark, config, VAD, and test assertions flagged in review.

Co-authored-by: Cursor <cursoragent@cursor.com>
Preserve the config default by keeping mouse triggering disabled unless --mouse is passed.

Co-authored-by: Cursor <cursoragent@cursor.com>
coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

future3OOO and others added 2 commits May 2, 2026 19:12
Recognize spoken email domains like "at gmail.com" and strip repeated cc/bcc hallucination tails after email addresses so pasted dictation preserves the intended address.

Co-authored-by: Cursor <cursoragent@cursor.com>
Guard shared audio buffers across callback and event-loop threads, and report benchmark input errors through argparse instead of raw tracebacks.

Co-authored-by: Cursor <cursoragent@cursor.com>
coderabbitai[bot]

This comment was marked as resolved.

Preserve no-VAD ring contents across wrap and exact-fill cases so buffered audio is popped in order.

Co-authored-by: Cursor <cursoragent@cursor.com>
greptile-apps[bot]

This comment was marked as resolved.

Keep the documented middle-mouse hold-to-record workflow enabled by default while preserving --no-mouse as the explicit opt-out.

Co-authored-by: Cursor <cursoragent@cursor.com>
greptile-apps[bot]

This comment was marked as resolved.

future3OOO and others added 2 commits May 2, 2026 20:09
Clear stale VAD batches while mouse hold recording owns audio, and align benchmark real-time factor with the reported inference latency.

Co-authored-by: Cursor <cursoragent@cursor.com>
Align native resample fallback with VAD frame sizing and schedule mouse-hold recording event mutations on the event loop thread.

Co-authored-by: Cursor <cursoragent@cursor.com>

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Comment on lines +65 to +84
def transcribe(
    self, audio: NDArray[np.float32], options: TranscriptionOptions
) -> TranscriptionResult:
    if self._model is None:
        raise RuntimeError("transcription backend is not loaded")

    segs, info = self._model.transcribe(
        audio,
        language=options.language,
        initial_prompt=options.initial_prompt,
        beam_size=options.beam_size,
        best_of=options.best_of,
        temperature=options.temperature,
        vad_filter=options.vad_filter,
        word_timestamps=False,
    )
    return TranscriptionResult(
        text="".join(segment.text for segment in segs).strip(),
        avg_logprob=float(getattr(info, "avg_logprob", -1.0)),
    )

P1 avg_logprob is always -1.0 — context buffer never updates

getattr(info, "avg_logprob", -1.0) looks up avg_logprob on the TranscriptionInfo object returned as the second element of model.transcribe(). TranscriptionInfo does not have this field — it lives on each Segment dataclass. So getattr always falls through to the -1.0 default.

Back in engine._transcribe, the guard is if txt and result.avg_logprob > -0.6, which is never True, so _ctx.push() is never called and the rolling context prompt (_ctx.prompt()) always returns "". Any accuracy improvement from contextual priming is silently lost.

The fix requires materializing the segments generator into a list (it can only be iterated once, and building text already consumes it) and computing the mean across segment logprobs:

def transcribe(
    self, audio: NDArray[np.float32], options: TranscriptionOptions
) -> TranscriptionResult:
    if self._model is None:
        raise RuntimeError("transcription backend is not loaded")

    segs, _info = self._model.transcribe(
        audio,
        language=options.language,
        initial_prompt=options.initial_prompt,
        beam_size=options.beam_size,
        best_of=options.best_of,
        temperature=options.temperature,
        vad_filter=options.vad_filter,
        word_timestamps=False,
    )
    segments = list(segs)
    text = "".join(s.text for s in segments).strip()
    avg_logprob = (
        float(sum(s.avg_logprob for s in segments) / len(segments))
        if segments
        else -1.0
    )
    return TranscriptionResult(text=text, avg_logprob=avg_logprob)

Owner Author

@cubic-dev-ai is this legit?

@future3OOO Yes, that's a legitimate catch. I suggested this because the current implementation in transcription.py will result in your context buffer never updating, which degrades transcription accuracy over time.

In faster-whisper, the transcribe() method returns a tuple: a generator of segments and a TranscriptionInfo object. The avg_logprob attribute exists on the individual segments, not on the TranscriptionInfo object.

Because FasterWhisperBackend.transcribe does this:

# info is a TranscriptionInfo object
return TranscriptionResult(
    text="".join(segment.text for segment in segs).strip(),
    avg_logprob=float(getattr(info, "avg_logprob", -1.0)),
)

The getattr call will always fall back to -1.0. This is problematic because engine.py contains this logic:

txt = result.text
if txt and result.avg_logprob > -0.6:
    self._ctx.push(txt)

Since avg_logprob is always -1.0, it will never be greater than -0.6, and self._ctx.push(txt) will never be called. This means the rolling prompt history (which Whisper uses to maintain context/vocabulary across segments) stays empty.
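
The effect can be reproduced without loading a model. In the sketch below, `FakeInfo` and `FakeSegment` are hypothetical stand-ins for faster-whisper's `TranscriptionInfo` and `Segment` objects, and the -0.6 threshold is taken from the engine guard quoted above.

```python
from dataclasses import dataclass

CONTEXT_THRESHOLD = -0.6  # from the engine guard quoted above


@dataclass
class FakeInfo:
    """Stand-in for TranscriptionInfo: note there is no avg_logprob field."""
    language: str = "en"


@dataclass
class FakeSegment:
    """Stand-in for a Segment, which does carry avg_logprob."""
    text: str
    avg_logprob: float


def logprob_from_info(info: object) -> float:
    # Mirrors the original code path: the attribute is missing, so the
    # default is always returned.
    return float(getattr(info, "avg_logprob", -1.0))


def logprob_from_segments(segments: list[FakeSegment]) -> float:
    # Mirrors the proposed fix: mean over per-segment scores.
    if not segments:
        return -1.0
    return sum(s.avg_logprob for s in segments) / len(segments)
```

With two confident segments (say -0.25 and -0.5), the info-based value stays at -1.0 and fails the `> -0.6` guard, while the segment mean of -0.375 passes it.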

How to fix it

Collect the segments into a list first, because the generator can only be consumed once and you need both the text and the per-segment scores:

def transcribe(
    self, audio: NDArray[np.float32], options: TranscriptionOptions
) -> TranscriptionResult:
    segs, _info = self._model.transcribe(audio, ...)
    
    segments = list(segs)
    text = "".join(s.text for s in segments).strip()
    
    # Calculate the average logprob across all segments in this utterance
    avg_logprob = (
        sum(s.avg_logprob for s in segments) / len(segments)
        if segments else -1.0
    )
    
    return TranscriptionResult(text=text, avg_logprob=float(avg_logprob))

For more details on Whisper model settings or how to manage these learnings, check out the Cubic documentation.

@future3OOO
Owner Author

@cubic-dev-ai review this

@cubic-dev-ai

cubic-dev-ai Bot commented May 2, 2026

@cubic-dev-ai review this

@future3OOO I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

3 issues found across 26 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dictation_tool/benchmark.py">

<violation number="1" location="dictation_tool/benchmark.py:23">
P2: WER tokenization is ASCII-only, which can report incorrect (often zero) error rates for non-English benchmark transcripts.</violation>
</file>

<file name="dictation_tool/postprocess.py">

<violation number="1" location="dictation_tool/postprocess.py:63">
P2: Including `.` in the global space-collapsing regex removes normal sentence spacing around periods (e.g., `word . next` becomes `word.next`). Restrict this normalization to `@` so prose punctuation formatting is preserved.</violation>
</file>

<file name="dictation_tool/io.py">

<violation number="1" location="dictation_tool/io.py:337">
P1: Truncating resampled frame count with `int(...)` can produce sub-frame chunks for VAD, causing speech frames to be dropped in native-rate fallback on some device sample rates.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread dictation_tool/io.py
        return candidates

    def _setup_resampler(self, native_sr: int, native_frames: int) -> None:
        target_frames = int(native_frames * self._sr / native_sr)

@cubic-dev-ai cubic-dev-ai Bot May 2, 2026

P1: Truncating resampled frame count with int(...) can produce sub-frame chunks for VAD, causing speech frames to be dropped in native-rate fallback on some device sample rates.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dictation_tool/io.py, line 337:

<comment>Truncating resampled frame count with `int(...)` can produce sub-frame chunks for VAD, causing speech frames to be dropped in native-rate fallback on some device sample rates.</comment>

<file context>
@@ -140,112 +156,266 @@ def _process_frame(self, frame: bytes, is_speech: bool) -> np.ndarray | None:
+        return candidates
+
+    def _setup_resampler(self, native_sr: int, native_frames: int) -> None:
+        target_frames = int(native_frames * self._sr / native_sr)
+        self._native_sr = native_sr
+        self._resample_idx = np.linspace(
</file context>
Suggested change
-        target_frames = int(native_frames * self._sr / native_sr)
+        target_frames = max(1, round(native_frames * self._sr / native_sr))
+        if self._gate is not None:
+            target_frames = max(target_frames, self._sr * self._gate.frame_duration_ms // 1000)
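
A worked example of the truncation, as a sketch only: the 22050 Hz device rate, the 30 ms VAD frame, and the helper names are assumptions chosen to show the arithmetic, not values taken from the code under review.

```python
def truncated_target_frames(native_frames: int, native_sr: int, target_sr: int) -> int:
    """Original behaviour: int() truncates the resampled frame count."""
    return int(native_frames * target_sr / native_sr)


def vad_safe_target_frames(
    native_frames: int, native_sr: int, target_sr: int, vad_frame_ms: int
) -> int:
    """Suggested behaviour: round, then never go below one full VAD frame."""
    target = max(1, round(native_frames * target_sr / native_sr))
    return max(target, target_sr * vad_frame_ms // 1000)


# Assumed scenario: a 22050 Hz device delivering 30 ms blocks (661 frames
# after truncation), resampled to 16000 Hz for a VAD that needs 480-sample frames.
NATIVE_SR, TARGET_SR, VAD_FRAME_MS = 22050, 16000, 30
NATIVE_FRAMES = int(NATIVE_SR * VAD_FRAME_MS / 1000)  # 661
```

Here 661 * 16000 / 22050 is roughly 479.64, so truncation yields 479 samples, one short of the 480 a 30 ms frame needs at 16 kHz, while the rounded and clamped version yields 480.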

)

Int16Audio = NDArray[np.int16]
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")

@cubic-dev-ai cubic-dev-ai Bot May 2, 2026

P2: WER tokenization is ASCII-only, which can report incorrect (often zero) error rates for non-English benchmark transcripts.
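
A small sketch of why the ASCII-only pattern can yield a zero error rate. The first pattern is the `TOKEN_RE` from the diff; the Unicode-aware pattern, the `wer_sketch` function, and its handling of an empty reference are assumptions for illustration and may differ from the project's `word_error_rate`.

```python
import re

ASCII_TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")  # pattern from the diff
UNICODE_TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)?")  # assumed alternative


def wer_sketch(reference: str, hypothesis: str, pattern: re.Pattern[str]) -> float:
    """Word error rate via word-level Levenshtein distance (single-row DP)."""
    ref = pattern.findall(reference.lower())
    hyp = pattern.findall(hypothesis.lower())
    if not ref:
        # Assumed convention: nothing to compare against counts as no error
        # unless the hypothesis contains words.
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)
```

For a Cyrillic pair such as "привет мир" versus "пока мир", the ASCII pattern finds no tokens on either side, so this sketch reports 0.0, whereas the Unicode-aware pattern sees one substitution in two words and reports 0.5.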

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dictation_tool/benchmark.py, line 23:

<comment>WER tokenization is ASCII-only, which can report incorrect (often zero) error rates for non-English benchmark transcripts.</comment>

<file context>
@@ -0,0 +1,103 @@
+)
+
+Int16Audio = NDArray[np.int16]
+TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")
+
+
</file context>

    )
    _SPACES_AROUND_DOT_AT: ClassVar[re.Pattern[str]] = re.compile(
        r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"
        r"([@.])"

@cubic-dev-ai cubic-dev-ai Bot May 2, 2026

P2: Including . in the global space-collapsing regex removes normal sentence spacing around periods (e.g., word . next becomes word.next). Restrict this normalization to @ so prose punctuation formatting is preserved.
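
The behaviour is easy to reproduce in isolation. The sketch below rebuilds the pattern from the diff and compares it with an `@`-only variant; how the project applies the pattern is not shown here, so substituting with `\1` is an assumption.

```python
import re

# Whitespace class copied from the diff context.
_WS = r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"

SPACES_AROUND_DOT_AT = re.compile(_WS + r"([@.])" + _WS)  # pattern under review
SPACES_AROUND_AT = re.compile(_WS + r"(@)" + _WS)  # suggested restriction


def collapse(pattern: re.Pattern[str], text: str) -> str:
    """Drop whitespace around the captured character (assumed usage)."""
    return pattern.sub(r"\1", text)
```

Applied to "See you soon. Call me", the original pattern glues the sentences together as "See you soon.Call me", while the `@`-only variant leaves the text unchanged. The trade-off is that "user @ example . com" then becomes "user@example . com", so spoken domains would need separate handling.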

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dictation_tool/postprocess.py, line 63:

<comment>Including `.` in the global space-collapsing regex removes normal sentence spacing around periods (e.g., `word . next` becomes `word.next`). Restrict this normalization to `@` so prose punctuation formatting is preserved.</comment>

<file context>
@@ -0,0 +1,134 @@
+    )
+    _SPACES_AROUND_DOT_AT: ClassVar[re.Pattern[str]] = re.compile(
+        r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"
+        r"([@.])"
+        r"[\t \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]*"
+    )
</file context>

Owner Author

This should have been resolved.

Thanks for the heads-up—I’ll re-check the latest changes and withdraw the comment if it’s already fixed.

Derive avg_logprob from faster-whisper segments so rolling context updates for high-confidence transcripts.

Co-authored-by: Cursor <cursoragent@cursor.com>
@future3OOO
Owner Author

@cubic-dev-ai

@cubic-dev-ai

cubic-dev-ai Bot commented May 2, 2026

@cubic-dev-ai

@future3OOO I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 26 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dictation_tool/io.py">

<violation number="1" location="dictation_tool/io.py:263">
P1: When VAD is enabled, pass-1 stream opening still uses `chunk_ms` blocksize, which can be shorter than `VADGate.frame_duration_ms` and causes VAD to drop chunks.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread dictation_tool/io.py
        last_err: Exception | None = None
        for dev in candidates:
            try:
                stream = self._try_open(dev, self._sr)

@cubic-dev-ai cubic-dev-ai Bot May 2, 2026

P1: When VAD is enabled, pass-1 stream opening still uses chunk_ms blocksize, which can be shorter than VADGate.frame_duration_ms and causes VAD to drop chunks.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dictation_tool/io.py, line 263:

<comment>When VAD is enabled, pass-1 stream opening still uses `chunk_ms` blocksize, which can be shorter than `VADGate.frame_duration_ms` and causes VAD to drop chunks.</comment>

<file context>
@@ -140,112 +156,266 @@ def _process_frame(self, frame: bytes, is_speech: bool) -> np.ndarray | None:
+        last_err: Exception | None = None
+        for dev in candidates:
+            try:
+                stream = self._try_open(dev, self._sr)
+                self._native_sr = self._sr
+                self._resample_idx = None
</file context>
Suggested change
-                stream = self._try_open(dev, self._sr)
+                blocksize = self._frames
+                if self._gate is not None:
+                    blocksize = max(
+                        blocksize,
+                        int(self._sr * self._gate.frame_duration_ms / 1000),
+                    )
+                stream = self._try_open(dev, self._sr, blocksize=blocksize)