Cohere Transcribe Support by nenad1002 · Pull Request #2133 · microsoft/onnxruntime-genai

nenad1002 · 2026-05-06T20:18:22Z

This PR introduces the Cohere Transcribe model alongside support for long-form transcription on both CUDA and CPU backends.

Although the model was trained with long-form transcription in mind, the supported sequence length remains limited. Additionally, the current export uses a fixed 30-second input window, similar to Whisper.

For longer audio, we use Silero VAD to split speech into smaller segments at silence boundaries before transcription. The current model export does not support word-level timestamps. These are not required for the intended Foundry Local integration scenarios, because Silero VAD already provides sufficiently accurate segmentation for long-form transcription. Future work will focus on improving VAD robustness and segmentation quality in noisy acoustic environments.
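As a rough illustration of silence-based splitting, the sketch below turns per-window speech flags (such as the `is_speech` vector in the diff) into padded sample ranges. It is not the PR's implementation: `Segment`, `FlagsToSegments`, and the rule of cutting over-long speech at a hard sample limit are assumptions made for the example, and the parameters only loosely mirror the `cohere_vad_*` config fields.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Half-open sample range [begin, end) that contains speech.
struct Segment {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical helper (not from the PR): converts per-window speech flags
// into padded segments. Silent runs shorter than min_silence_windows are
// bridged; segments longer than max_speech_samples are cut at that limit
// (0 means "no limit"). A real splitter would likely prefer to cut at silence.
std::vector<Segment> FlagsToSegments(const std::vector<std::uint8_t>& is_speech,
                                     std::size_t window, std::size_t num_samples,
                                     std::size_t min_silence_windows,
                                     std::size_t pad_samples,
                                     std::size_t max_speech_samples) {
  const std::size_t n = is_speech.size();
  const std::size_t min_gap = std::max<std::size_t>(1, min_silence_windows);
  const std::size_t limit = max_speech_samples > 0
                                ? max_speech_samples
                                : std::numeric_limits<std::size_t>::max();
  std::vector<Segment> out;
  std::size_t i = 0;
  while (i < n) {
    if (!is_speech[i]) {
      ++i;
      continue;
    }
    // Extend the segment while the next speech window is close enough.
    std::size_t last = i;
    for (std::size_t j = i + 1; j < n && j - last <= min_gap; ++j) {
      if (is_speech[j]) last = j;
    }
    // Pad both sides, clamp to the audio, and avoid overlapping the previous segment.
    std::size_t begin = i * window;
    begin = begin > pad_samples ? begin - pad_samples : 0;
    const std::size_t end = std::min(num_samples, (last + 1) * window + pad_samples);
    if (!out.empty()) begin = std::max(begin, out.back().end);
    // Enforce the maximum segment length with hard cuts.
    while (begin < end) {
      const std::size_t cut = (end - begin > limit) ? begin + limit : end;
      out.push_back({begin, cut});
      begin = cut;
    }
    i = last + 1;
  }
  return out;
}
```

Each returned range would then be transcribed as its own chunk that fits the fixed 30-second window.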

While this results in exceptional WER across standard benchmarks, punctuation and capitalization can sometimes become inconsistent across chunk boundaries because the current model has limited cross-chunk context. We are looking to work with Cohere to improve this behavior.
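For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below uses that standard definition. It is not the WER check from the PR's tests or example script, and the helper names `SplitWords` and `WordErrorRate` are made up for illustration. It applies no text normalization, so a casing or punctuation difference counts as a substitution.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Splits text on whitespace. No case folding or punctuation stripping is applied.
std::vector<std::string> SplitWords(const std::string& text) {
  std::istringstream in(text);
  std::vector<std::string> words;
  std::string word;
  while (in >> word) words.push_back(word);
  return words;
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a two-row Levenshtein table over words.
double WordErrorRate(const std::string& reference, const std::string& hypothesis) {
  const std::vector<std::string> ref = SplitWords(reference);
  const std::vector<std::string> hyp = SplitWords(hypothesis);
  if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;

  std::vector<std::size_t> prev(hyp.size() + 1), cur(hyp.size() + 1);
  for (std::size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;
  for (std::size_t i = 1; i <= ref.size(); ++i) {
    cur[0] = i;
    for (std::size_t j = 1; j <= hyp.size(); ++j) {
      const std::size_t substitute = prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
      cur[j] = std::min({substitute, prev[j] + 1, cur[j - 1] + 1});
    }
    std::swap(prev, cur);
  }
  return static_cast<double>(prev[hyp.size()]) / static_cast<double>(ref.size());
}
```

Because nothing is normalized here, `WordErrorRate("hello world", "Hello world")` reports one error in two words. That is why inconsistent capitalization at chunk seams can show up in an unnormalized WER.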

Copilot

Pull request overview

This PR adds first-class support for the Cohere Transcribe ASR model to ONNX Runtime GenAI, including long-form transcription via Silero VAD-based chunking and multi-chunk decoding support integrated into the existing audio/model/generator pipeline.

Changes:

Introduces Cohere Transcribe model implementation (processor + model state) with VAD-driven chunking and cross-cache decoder reuse.
Extends Generator to support Cohere-specific multi-chunk decoding while presenting a single continuous token stream to callers.
Adds E2E + C API tests and a Python example script; updates config schema and bumps onnxruntime-extensions dependency.

Reviewed changes

Copilot reviewed 33 out of 35 changed files in this pull request and generated 7 comments.

File	Description
test/python/test_onnxruntime_genai_e2e.py	Adds Cohere Transcribe Python E2E coverage (short + long audio).
test/c_api_tests.cpp	Adds Cohere Transcribe C API creation + E2E WER sanity test helpers.
src/models/whisper.h	Allows CohereState to access WhisperDecoderState internals (friend) and adds pragma once.
src/models/whisper_processor.h	Updates processor constructor signature to accept `Model&`.
src/models/whisper_processor.cpp	Updates constructor implementation to accept `Model&` (unused).
src/models/silero_vad.h	Moves SileroVadState out of the header (encapsulation cleanup).
src/models/silero_vad.cpp	Defines SileroVadState in the .cpp (no longer in header).
src/models/qwen2_5_vl_image_processor.h	Updates processor constructor signature to accept `Model&`.
src/models/qwen2_5_vl_image_processor.cpp	Updates constructor implementation to accept `Model&` (unused).
src/models/processor.h	Updates `Processor::Create` factory to pass `Model&` into processors.
src/models/phi_multimodal_processor.h	Updates processor constructor signature to accept `Model&`.
src/models/phi_multimodal_processor.cpp	Updates constructor implementation to accept `Model&` (unused).
src/models/phi_image_processor.h	Updates processor constructor signature to accept `Model&`.
src/models/phi_image_processor.cpp	Updates constructor implementation to accept `Model&` (unused).
src/models/model.h	Registers CohereProcessor; updates MultiModalProcessor factory type; adds State::IsFirstRun.
src/models/model.cpp	Wires Cohere model selection; updates MultiModalProcessor creation + factory invocation.
src/models/model_type.h	Registers `cohere_transcribe` as an ALM model type.
src/models/mistral3_image_processor.h	Updates processor constructor signature to accept `Model&`.
src/models/mistral3_image_processor.cpp	Updates constructor implementation to accept `Model&` (unused).
src/models/gemma4_multimodal_processor.h	Updates processor constructor signature to accept `Model&`.
src/models/gemma4_multimodal_processor.cpp	Updates constructor implementation to accept `Model&` (unused).
src/models/gemma_image_processor.h	Updates processor constructor signature to accept `Model&`.
src/models/gemma_image_processor.cpp	Updates constructor implementation to accept `Model&` (unused).
src/models/cohere_processor.h	Adds CohereProcessor interface (mel computation + VAD splitting).
src/models/cohere_processor.cpp	Implements Cohere mel feature generation, VAD chunking, and multi-chunk tensor packing.
src/models/cohere_model.h	Adds Cohere encoder/decoder state and CohereModel wiring (multi-chunk, commit/stream).
src/models/cohere_model.cpp	Implements multi-chunk decoding, seam cleanup, and encoder/decoder wiring.
src/generators.h	Adds Cohere-specific generator fields/entry points (prompt caching, chunk loop).
src/generators.cpp	Implements Cohere multi-chunk decoding loop and user-visible committed-token streaming.
src/config.h	Adds Cohere mel normalization + VAD chunking config fields and encoder audio_stride.
src/config.cpp	Parses newly-added Cohere-related config fields from JSON.
examples/python/cohere_transcribe.py	Adds a Cohere Transcribe example (streaming output + optional WER check).
cmake/deps.txt	Updates onnxruntime-extensions revision.
.gitignore	Allows committing test audio assets under test/test_models/audios.


 std::shared_ptr<MultiModalProcessor> Model::CreateMultiModalProcessor() const {
-  return std::make_shared<MultiModalProcessor>(*config_, session_info_);
+  return std::make_shared<MultiModalProcessor>(*config_, session_info_, const_cast<Model&>(*this));


+  if (encoder_state_->IsFirstRun()) {
+    encoder_state_->Run(current_length, next_tokens, next_indices);
+    decoder_state_->UpdateInputsOutputs(next_tokens, next_indices, current_length, first_run_);
+    return decoder_state_->Run(current_length, next_tokens, next_indices);


+  const size_t pad_samples = static_cast<size_t>(config_->model.cohere_vad_speech_pad_ms) * samples_per_ms;
+  const size_t max_speech_samples =
+      static_cast<size_t>(config_->model.cohere_vad_max_speech_s * static_cast<float>(sample_rate));
+


+  for (size_t off = 0; off + window <= num_samples; off += window) {
+    float p = vad_->ProcessWindow(samples + off, window);
+    is_speech.push_back(p >= threshold ? 1u : 0u);


+  // Decode audio to PCM, resampled to the model's expected sample rate.
+  ort_extensions::OrtxObjectPtr<OrtxTensorResult> decode_result;
+  int sample_rate = 0;
+  auto [pcm_data, num_samples] = GetDecodedPCM(audios->audios_.get(), 0, mel_cfg_.sample_rate, decode_result, sample_rate);
+


nenad1002 · 2026-05-06T23:35:25Z

+  is_speech.reserve(num_samples / window);
+  for (size_t off = 0; off + window <= num_samples; off += window) {
+    float p = vad_->ProcessWindow(samples + off, window);
+    is_speech.push_back(p >= threshold ? 1u : 0u);
+  }


Low-value comment: at most 32 ms of audio can be skipped, so there is no real-world issue.
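The 32 ms bound can be checked with a little arithmetic. The sketch assumes a 512-sample VAD window at 16 kHz, which is consistent with the figure quoted in the reply but is not shown in the diff excerpt. The helper names are hypothetical.

```cpp
#include <cstddef>

// Number of full windows visited by a loop of the form
// `for (off = 0; off + window <= num_samples; off += window)`.
constexpr std::size_t FullWindows(std::size_t num_samples, std::size_t window) {
  return window == 0 ? 0 : num_samples / window;
}

// Samples left in the trailing partial window, which that loop never scores.
constexpr std::size_t SkippedTailSamples(std::size_t num_samples, std::size_t window) {
  return window == 0 ? num_samples : num_samples % window;
}

// Converts a sample count to milliseconds at the given sample rate.
constexpr double SamplesToMs(std::size_t samples, int sample_rate) {
  return 1000.0 * static_cast<double>(samples) / static_cast<double>(sample_rate);
}

// Assumed values: 512-sample window at 16 kHz, i.e. exactly 32 ms per window.
static_assert(SamplesToMs(512, 16000) == 32.0, "one window is 32 ms");
// The tail is a remainder, so it is always shorter than one window.
static_assert(SkippedTailSamples(160300, 512) == 44, "44 samples left over");
static_assert(FullWindows(160300, 512) == 313, "313 full windows");
```

Since the skipped tail is a remainder modulo the window size, it is always strictly shorter than one window, and so under 32 ms given these assumed values.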

+  if (is_cohere_model_) {
+    // Return only the committed tokens that have already been yielded to
+    // the caller via GenerateNextToken. The underlying search sequence holds
+    // raw per-chunk decoder output (with prompt prefix and pre-cleanup
+    // boundary tokens), which is not what the user should observe.


kunal-vaishnavi · 2026-05-18T10:07:14Z

@@ -0,0 +1,137 @@
+"""
+Audio chunking is handled internally by CohereProcessor + Generator.


Rather than creating new example files for each speech model, let's find a way to support all of them in a more generic and scalable way. Otherwise, we will need to keep creating, maintaining, and validating each model's example in each language.

kunal-vaishnavi · 2026-05-18T10:07:22Z

+  for (const auto& [name, value] : extra_inputs) {
+    if (name == audio_name) {
+      SetOrReplaceInput(audio_name.c_str(), value->ort_tensor_, audio_features_);
+    } else if (name == "mel_length") {


If this is an extra named input that comes from the preprocessor, can we compare its name to the one set in the GenAI config?
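One way to read this suggestion is sketched below: the input names live in a small struct that would be filled from the GenAI config, and the dispatch compares against those values instead of string literals. `EncoderInputNames`, `InputRole`, `ClassifyInput`, and the default values are all hypothetical and are not existing onnxruntime-genai types or config keys.

```cpp
#include <string>

// Hypothetical holder for input names that would be read from the GenAI
// config. The defaults shown here are placeholders for illustration only.
struct EncoderInputNames {
  std::string audio_features = "audio_features";
  std::string mel_length = "mel_length";
};

enum class InputRole { kAudioFeatures, kMelLength, kUnknown };

// Classifies an extra input by comparing its name against the configured
// names, so no model-specific literal appears at the call site.
InputRole ClassifyInput(const std::string& name, const EncoderInputNames& names) {
  if (name == names.audio_features) return InputRole::kAudioFeatures;
  if (name == names.mel_length) return InputRole::kMelLength;
  return InputRole::kUnknown;
}
```

With this shape, an export that renames the length input only needs a config change rather than a code change.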

kunal-vaishnavi · 2026-05-18T10:07:43Z

+  std::map<int, std::shared_ptr<Tensor>> indexed_mel_lengths;
+
+  for (const auto& [name, tensor] : extra_inputs) {
+    if (name == "cohere_chunk_count") {


We should remove the cohere prefixes in these names. The same comment as in the linked thread applies here too.

kunal-vaishnavi · 2026-05-18T10:19:28Z

+// Returns true if the rendered text of `token_id` does not begin with
+// whitespace, i.e. a connecting space must be injected before it at a
+// chunk seam to avoid "scamsIn" / "thisInto" artifacts.
+bool TokenLacksLeadingSpace(int32_t token_id, const Tokenizer& tokenizer) {


We could get/set a logits mask during the generation loop using the get_logits/set_logits APIs so that we can customize this logic as needed for future models (e.g. fine-tuned variants that don't have this issue). It also has the extra benefit that we don't have to hardcode this logic beforehand.
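For context, the seam rule described in the code comment under review can be sketched on already-rendered text. This version works on strings rather than token IDs and does not use the PR's `Tokenizer`. `LacksLeadingSpace` and `JoinChunks` are hypothetical names used only for illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// True if the text is non-empty and does not begin with whitespace.
bool LacksLeadingSpace(const std::string& text) {
  return !text.empty() && !std::isspace(static_cast<unsigned char>(text.front()));
}

// Concatenates per-chunk transcripts, inserting one connecting space at a
// seam when neither side already provides whitespace.
std::string JoinChunks(const std::vector<std::string>& chunks) {
  std::string out;
  for (const std::string& chunk : chunks) {
    if (chunk.empty()) continue;
    const bool ends_with_space =
        !out.empty() && std::isspace(static_cast<unsigned char>(out.back()));
    if (!out.empty() && !ends_with_space && LacksLeadingSpace(chunk)) {
      out.push_back(' ');
    }
    out += chunk;
  }
  return out;
}
```

Joining `"these are scams"` and `"In this"` yields `"these are scams In this"` instead of the `"scamsIn"` artifact mentioned in the comment.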

kunal-vaishnavi · 2026-05-18T11:01:14Z

  std::vector<float> input_data_;
 };

-/// Internal State for SileroVad — uses State::Run for proper EP/run-options support.


Why move from header file to implementation file?

kunal-vaishnavi · 2026-05-18T11:01:52Z

      v_.blank_id = static_cast<int>(JSON::Get<double>(value));
    } else if (name == "max_symbols_per_step") {
      v_.max_symbols_per_step = static_cast<int>(JSON::Get<double>(value));
+    } else if (name == "cohere_vad_min_silence_ms") {


Can we remove the cohere_vad prefix for these names and re-use any existing names where possible?

kunal-vaishnavi · 2026-05-18T11:06:11Z

  state_ = model.CreateState(search_->GetSequenceLengths(), params);    // Search sequence lengths set when creating state
  guidance_logits_processor_ = CreateGuidanceLogitsProcessor(*state_);  // Could be nullptr if use_guidance (constrained decoding) is not used

+  is_cohere_model_ = model.config_->model.type == "cohere_transcribe";


There is a lot of "if cohere, do X, else do Y" logic that is being added in this file. This pattern was already done with Nemotron speech, but ideally this shouldn't be happening. This file is meant to be generic for all model types and not have model-specific logic as much as possible. Let's find another way to add these changes or introduce new higher-level internal methods for better abstraction.
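One possible shape for such an abstraction is sketched below: the generic generator delegates model-specific behavior to a strategy object with a pass-through default, so it contains no `if (is_cohere_model_)` branches. All class names here (`DecodeStrategy`, `ChunkedDecodeStrategy`, `GeneratorSketch`) are hypothetical and are not existing onnxruntime-genai types. Dropping a fixed-length prompt prefix is a simplified stand-in for the committed-token logic in the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical hook: maps the raw search sequence to what the caller sees.
class DecodeStrategy {
 public:
  virtual ~DecodeStrategy() = default;
  // Default behavior for ordinary models: expose the sequence unchanged.
  virtual std::vector<std::int32_t> VisibleSequence(
      const std::vector<std::int32_t>& raw) const {
    return raw;
  }
};

// Illustrative override for a chunked ASR model: hide a prompt prefix.
class ChunkedDecodeStrategy : public DecodeStrategy {
 public:
  explicit ChunkedDecodeStrategy(std::size_t prompt_len) : prompt_len_(prompt_len) {}
  std::vector<std::int32_t> VisibleSequence(
      const std::vector<std::int32_t>& raw) const override {
    if (raw.size() <= prompt_len_) return {};
    return std::vector<std::int32_t>(
        raw.begin() + static_cast<std::ptrdiff_t>(prompt_len_), raw.end());
  }

 private:
  std::size_t prompt_len_;
};

// The generic generator only talks to the interface.
class GeneratorSketch {
 public:
  explicit GeneratorSketch(std::unique_ptr<DecodeStrategy> strategy)
      : strategy_(std::move(strategy)) {}
  void Append(std::int32_t token) { raw_.push_back(token); }
  std::vector<std::int32_t> GetSequence() const {
    return strategy_->VisibleSequence(raw_);
  }

 private:
  std::unique_ptr<DecodeStrategy> strategy_;
  std::vector<std::int32_t> raw_;
};
```

In a design like this, the model would supply the strategy when the generator is created, which keeps the model type check out of the generator itself.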

nenad1002 added 30 commits April 11, 2026 20:09

Init Cohere Transcribe

aaee178

More changes

93d9a0a

more changes

352560a

Add streaming

9e5f94b

More updates

770e6d3

Introduce pipeline for execution

21f85e5

More changes

c1c78c5

More refactoring

800ad35

cohere transcribe py change

7e99070

Merge branch 'main' into nebanfic/cohere-transcribe

4a93541

Support long audio

b218564

Decoding through extensions

15263f2

Remove manual wav conversions

f5ddeb4

Remove logs

1156739

Streaming on output

44e7b2c

Use extensions perf feature normalize

56ac0a3

moe chanes

e3cec74

Fix long form

f5c168a

Fix eos issue

873bfad

Change samples

114a78c

More changes 2

adbd029

Fix CUDA

c8bfbcb

Move from audio_processor

308152b

Add tests

8253b7c

More processor changes

d1a4a49

remove boundary punc

78644a6

A little bit of cleaning

d3e4336

Refactor generaotrs.cpp

6fb7be5

Remove some dead code

ad385bd

More fixes

c3fb62b

nenad1002 and others added 19 commits May 2, 2026 00:21

more cleanup

ac2382c

more fixes

61bdaaf

Update special token comment

c6e7462

Copilot comments

6939c19

Integrate silero VAD

f760e5d

Cleanup code

0b497b8

More changes

9904d1a

more changes 2

185db59

Refactor processor

105c461

Some processor cleaning

939e089

New extensions

30a9c03

Connect to old extensions

616ca41

Merge branch 'main' into nebanfic/cohere-transcribe-2

252bf07

Fix stereo to mono downmixing

23ea4c8

enable processor to take model globally

cd7b7d3

Clang

48e6dff

Cleanup

7bcfb17

More changes

a49b0e7

long form audio test

a13d7ff

Copilot AI review requested due to automatic review settings May 6, 2026 20:18

nenad1002 requested a review from a team as a code owner May 6, 2026 20:18

Copilot started reviewing on behalf of nenad1002 May 6, 2026 20:19

Copilot AI reviewed May 6, 2026

kunal-vaishnavi reviewed May 18, 2026

nenad1002 wants to merge 51 commits into main from nebanfic/cohere-transcribe-pr

3 participants
