Cohere Transcribe Support #2133
Pull request overview
This PR adds first-class support for the Cohere Transcribe ASR model to ONNX Runtime GenAI, including long-form transcription via Silero VAD-based chunking and multi-chunk decoding support integrated into the existing audio/model/generator pipeline.
Changes:
- Introduces Cohere Transcribe model implementation (processor + model state) with VAD-driven chunking and cross-cache decoder reuse.
- Extends Generator to support Cohere-specific multi-chunk decoding while presenting a single continuous token stream to callers.
- Adds E2E + C API tests and a Python example script; updates config schema and bumps onnxruntime-extensions dependency.
Reviewed changes
Copilot reviewed 33 out of 35 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| test/python/test_onnxruntime_genai_e2e.py | Adds Cohere Transcribe Python E2E coverage (short + long audio). |
| test/c_api_tests.cpp | Adds Cohere Transcribe C API creation + E2E WER sanity test helpers. |
| src/models/whisper.h | Allows CohereState to access WhisperDecoderState internals (friend) and adds pragma once. |
| src/models/whisper_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/whisper_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/silero_vad.h | Moves SileroVadState out of the header (encapsulation cleanup). |
| src/models/silero_vad.cpp | Defines SileroVadState in the .cpp (no longer in header). |
| src/models/qwen2_5_vl_image_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/qwen2_5_vl_image_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/processor.h | Updates Processor::Create factory to pass Model& into processors. |
| src/models/phi_multimodal_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/phi_multimodal_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/phi_image_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/phi_image_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/model.h | Registers CohereProcessor; updates MultiModalProcessor factory type; adds State::IsFirstRun. |
| src/models/model.cpp | Wires Cohere model selection; updates MultiModalProcessor creation + factory invocation. |
| src/models/model_type.h | Registers cohere_transcribe as an ALM model type. |
| src/models/mistral3_image_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/mistral3_image_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/gemma4_multimodal_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/gemma4_multimodal_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/gemma_image_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/gemma_image_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/cohere_processor.h | Adds CohereProcessor interface (mel computation + VAD splitting). |
| src/models/cohere_processor.cpp | Implements Cohere mel feature generation, VAD chunking, and multi-chunk tensor packing. |
| src/models/cohere_model.h | Adds Cohere encoder/decoder state and CohereModel wiring (multi-chunk, commit/stream). |
| src/models/cohere_model.cpp | Implements multi-chunk decoding, seam cleanup, and encoder/decoder wiring. |
| src/generators.h | Adds Cohere-specific generator fields/entry points (prompt caching, chunk loop). |
| src/generators.cpp | Implements Cohere multi-chunk decoding loop and user-visible committed-token streaming. |
| src/config.h | Adds Cohere mel normalization + VAD chunking config fields and encoder audio_stride. |
| src/config.cpp | Parses newly-added Cohere-related config fields from JSON. |
| examples/python/cohere_transcribe.py | Adds a Cohere Transcribe example (streaming output + optional WER check). |
| cmake/deps.txt | Updates onnxruntime-extensions revision. |
| .gitignore | Allows committing test audio assets under test/test_models/audios. |
     std::shared_ptr<MultiModalProcessor> Model::CreateMultiModalProcessor() const {
    -  return std::make_shared<MultiModalProcessor>(*config_, session_info_);
    +  return std::make_shared<MultiModalProcessor>(*config_, session_info_, const_cast<Model&>(*this));
    if (encoder_state_->IsFirstRun()) {
      encoder_state_->Run(current_length, next_tokens, next_indices);
      decoder_state_->UpdateInputsOutputs(next_tokens, next_indices, current_length, first_run_);
      return decoder_state_->Run(current_length, next_tokens, next_indices);
    const size_t pad_samples = static_cast<size_t>(config_->model.cohere_vad_speech_pad_ms) * samples_per_ms;
    const size_t max_speech_samples =
        static_cast<size_t>(config_->model.cohere_vad_max_speech_s * static_cast<float>(sample_rate));
    for (size_t off = 0; off + window <= num_samples; off += window) {
      float p = vad_->ProcessWindow(samples + off, window);
      is_speech.push_back(p >= threshold ? 1u : 0u);
    // Decode audio to PCM, resampled to the model's expected sample rate.
    ort_extensions::OrtxObjectPtr<OrtxTensorResult> decode_result;
    int sample_rate = 0;
    auto [pcm_data, num_samples] = GetDecodedPCM(audios->audios_.get(), 0, mel_cfg_.sample_rate, decode_result, sample_rate);
    is_speech.reserve(num_samples / window);
    for (size_t off = 0; off + window <= num_samples; off += window) {
      float p = vad_->ProcessWindow(samples + off, window);
      is_speech.push_back(p >= threshold ? 1u : 0u);
    }
Low-value comment: at most 32 ms of audio can be skipped, which is not a real-world issue.
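For context, the bound follows from the loop condition `off + window <= num_samples`, which never scores a trailing partial window. A minimal sketch of that arithmetic, assuming a 512-sample window at 16 kHz (32 ms per window; this window size is an assumption, not something shown in the diff) and hypothetical helper names:

```cpp
#include <cstddef>

// Hypothetical helper (not from the PR): number of trailing samples that a
// fixed-window loop of the form
//   for (off = 0; off + window <= num_samples; off += window)
// never passes to the VAD. Requires window > 0.
constexpr std::size_t SkippedTailSamples(std::size_t num_samples, std::size_t window) {
  return num_samples % window;
}

// Converts a sample count to milliseconds at the given sample rate.
constexpr double SamplesToMs(std::size_t samples, int sample_rate) {
  return 1000.0 * static_cast<double>(samples) / static_cast<double>(sample_rate);
}
```

Under those assumptions the skipped tail is at most 511 samples, which is just under 32 ms and only ever occurs at the very end of the audio.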
    if (is_cohere_model_) {
      // Return only the committed tokens that have already been yielded to
      // the caller via GenerateNextToken. The underlying search sequence holds
      // raw per-chunk decoder output (with prompt prefix and pre-cleanup
      // boundary tokens), which is not what the user should observe.
    @@ -0,0 +1,137 @@
    """
    Audio chunking is handled internally by CohereProcessor + Generator.
Rather than creating new example files for each speech model, let's find a way to support all of them in a more generic and scalable way. Otherwise, we will need to keep creating, maintaining, and validating each model's example in each language.
    for (const auto& [name, value] : extra_inputs) {
      if (name == audio_name) {
        SetOrReplaceInput(audio_name.c_str(), value->ort_tensor_, audio_features_);
      } else if (name == "mel_length") {
If this is an extra named input that comes from the preprocessor, can we compare its name to the one set in the GenAI config?
    std::map<int, std::shared_ptr<Tensor>> indexed_mel_lengths;

    for (const auto& [name, tensor] : extra_inputs) {
      if (name == "cohere_chunk_count") {
We should remove the cohere prefixes from these names. The same comment as here applies as well.
    // Returns true if the rendered text of `token_id` does not begin with
    // whitespace, i.e. a connecting space must be injected before it at a
    // chunk seam to avoid "scamsIn" / "thisInto" artifacts.
    bool TokenLacksLeadingSpace(int32_t token_id, const Tokenizer& tokenizer) {
We could get/set a logits mask during the generation loop using the get_logits/set_logits APIs so that we can customize this logic as needed for future models (e.g. fine-tuned variants that don't have this issue). It also has the extra benefit that we don't have to hardcode this logic beforehand.
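To make the seam rule from the code comment concrete, here is a minimal sketch of the same idea applied to already-decoded chunk text rather than token ids. `LacksLeadingSpace` and `JoinChunks` are hypothetical names for illustration, not functions from the PR, and the sketch does not special-case punctuation at the start of a chunk:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: true if non-empty text does not start with whitespace.
bool LacksLeadingSpace(const std::string& text) {
  return !text.empty() && !std::isspace(static_cast<unsigned char>(text.front()));
}

// Hypothetical helper: concatenates per-chunk transcripts and injects one
// connecting space at a seam only when neither side already provides one.
std::string JoinChunks(const std::vector<std::string>& chunks) {
  std::string out;
  for (const std::string& chunk : chunks) {
    if (chunk.empty()) continue;
    const bool prev_ends_with_space =
        out.empty() || std::isspace(static_cast<unsigned char>(out.back()));
    if (!prev_ends_with_space && LacksLeadingSpace(chunk)) out += ' ';
    out += chunk;
  }
  return out;
}
```

With this rule, the chunks "avoid scams" and "In this talk" join as "avoid scams In this talk" instead of "avoid scamsIn this talk". A logits-mask approach, as suggested above, would move this decision into the generation loop instead of post-processing.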
      std::vector<float> input_data_;
    };

    /// Internal State for SileroVad — uses State::Run for proper EP/run-options support.
Why move from header file to implementation file?
      v_.blank_id = static_cast<int>(JSON::Get<double>(value));
    } else if (name == "max_symbols_per_step") {
      v_.max_symbols_per_step = static_cast<int>(JSON::Get<double>(value));
    } else if (name == "cohere_vad_min_silence_ms") {
Can we remove the cohere_vad prefix for these names and re-use any existing names where possible?
    state_ = model.CreateState(search_->GetSequenceLengths(), params);    // Search sequence lengths set when creating state
    guidance_logits_processor_ = CreateGuidanceLogitsProcessor(*state_);  // Could be nullptr if use_guidance (constrained decoding) is not used

    is_cohere_model_ = model.config_->model.type == "cohere_transcribe";
A lot of "if cohere, do X, else do Y" logic is being added in this file. This pattern was already used for Nemotron speech, but ideally it shouldn't happen. This file is meant to be generic across all model types and should avoid model-specific logic as much as possible. Let's find another way to add these changes, or introduce new higher-level internal methods for better abstraction.
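One possible shape for such an abstraction, as a rough sketch only: the generator calls a hook owned by the model state, and the default implementation is a pass-through. `SequencePolicy`, `StripPrefixPolicy`, and `GetSequence` below are hypothetical names for illustration and do not exist in ONNX Runtime GenAI:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hook interface: maps the raw search sequence to the tokens
// the caller should observe. The default is a pass-through.
class SequencePolicy {
 public:
  virtual ~SequencePolicy() = default;
  virtual std::vector<int32_t> VisibleTokens(const std::vector<int32_t>& raw) const {
    return raw;
  }
};

// Hypothetical policy for a chunked speech model: hide a fixed-length
// prompt prefix from the caller.
class StripPrefixPolicy : public SequencePolicy {
 public:
  explicit StripPrefixPolicy(std::size_t prefix_len) : prefix_len_(prefix_len) {}
  std::vector<int32_t> VisibleTokens(const std::vector<int32_t>& raw) const override {
    if (raw.size() <= prefix_len_) return {};
    return std::vector<int32_t>(raw.begin() + static_cast<std::ptrdiff_t>(prefix_len_), raw.end());
  }

 private:
  std::size_t prefix_len_;
};

// Generic caller: no `if (model_type == ...)` branch is needed here.
std::vector<int32_t> GetSequence(const SequencePolicy& policy, const std::vector<int32_t>& raw) {
  return policy.VisibleTokens(raw);
}
```

The point of the sketch is only that model-specific behavior lives behind a virtual call chosen at state creation, so the generic file does not need to know which model is active.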
This PR introduces the Cohere Transcribe model alongside support for long-form transcription on both CUDA and CPU backends.
Although the model was trained with long-form transcription in mind, the supported sequence length remains limited. Additionally, the current export uses a fixed 30-second input window, similar to Whisper.
For longer audio, we use Silero VAD to split speech into smaller segments at silence boundaries before transcription. The current model export does not support word-level timestamps. These are not required for the intended Foundry Local integration scenarios, because Silero VAD already provides sufficiently accurate segmentation for long-form transcription. Future work will focus on improving VAD robustness and segmentation quality in noisy acoustic environments.
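As an illustration of the splitting step, the sketch below turns per-window speech flags into padded, length-capped chunks. It is a simplified sketch, not the PR's implementation: `Segment` and `SplitOnSilence` are hypothetical names, the parameters only loosely correspond to the minimum-silence, speech-padding, and maximum-speech settings, and over-long speech is cut at a hard sample boundary rather than at the best available silence.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open sample range [begin, end) of one speech chunk.
struct Segment {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical sketch: is_speech holds one flag per VAD window of `window`
// samples. A segment closes after `min_silence_windows` consecutive silent
// windows, is padded by `pad_samples` on both sides, and is cut into pieces
// of at most `max_speech_samples` samples.
std::vector<Segment> SplitOnSilence(const std::vector<std::uint8_t>& is_speech,
                                    std::size_t window, std::size_t num_samples,
                                    std::size_t min_silence_windows,
                                    std::size_t pad_samples,
                                    std::size_t max_speech_samples) {
  std::vector<Segment> raw;
  bool in_speech = false;
  std::size_t start = 0, last_speech_end = 0, silence_run = 0;
  for (std::size_t i = 0; i < is_speech.size(); ++i) {
    if (is_speech[i]) {
      if (!in_speech) {
        in_speech = true;
        start = i * window;
      }
      last_speech_end = (i + 1) * window;
      silence_run = 0;
    } else if (in_speech && ++silence_run >= min_silence_windows) {
      raw.push_back({start, last_speech_end});
      in_speech = false;
    }
  }
  if (in_speech) raw.push_back({start, last_speech_end});

  const std::size_t step = std::max<std::size_t>(max_speech_samples, 1);
  std::vector<Segment> out;
  for (const Segment& s : raw) {
    // Pad, then clamp to the audio bounds and to the previous chunk's end.
    std::size_t b = s.begin > pad_samples ? s.begin - pad_samples : 0;
    const std::size_t e = std::min(num_samples, s.end + pad_samples);
    if (!out.empty()) b = std::max(b, out.back().end);
    for (std::size_t p = b; p < e; p += step) {
      out.push_back({p, std::min(e, p + step)});
    }
  }
  return out;
}
```

Each resulting chunk would then be transcribed independently within the fixed 30-second input window, which is why chunk length is capped.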
While this approach achieves exceptional WER quality across standard benchmarks, punctuation and capitalization can sometimes become inconsistent across chunk boundaries because the current model has limited cross-chunk context. We plan to work with Cohere to improve this behavior.