
Cohere Transcribe Support#2133

Open
nenad1002 wants to merge 51 commits into main from nebanfic/cohere-transcribe-pr

Conversation

@nenad1002
Contributor

This PR introduces the Cohere Transcribe model alongside support for long-form transcription on both CUDA and CPU backends.

Although the model was trained with long-form transcription in mind, the supported sequence length remains limited. Additionally, the current export uses a fixed 30-second input window, similar to Whisper.

For longer audio, we utilize Silero VAD to split speech into smaller segments based on silence boundaries before transcription. The current model export also does not support word-level timestamps, although this is not required for the intended Foundry Local integration scenarios, as Silero VAD already provides sufficiently accurate segmentation for long-form transcription. Future work will focus on improving VAD robustness and segmentation quality in noisy acoustic environments.
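The silence-based splitting described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `SplitOnSilence`, its parameters, and the per-window boolean input are all hypothetical names, and the sketch assumes the VAD has already produced one speech/non-speech decision per fixed-size window.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from the PR): groups per-window VAD decisions into
// [start, end) sample ranges. A segment is closed once `min_silence_windows`
// consecutive non-speech windows follow it, or when it would otherwise grow
// beyond `max_windows` windows.
std::vector<std::pair<size_t, size_t>> SplitOnSilence(
    const std::vector<bool>& is_speech, size_t window_samples,
    size_t min_silence_windows, size_t max_windows) {
  std::vector<std::pair<size_t, size_t>> segments;
  bool in_segment = false;
  size_t start = 0;        // first window of the open segment
  size_t last_speech = 0;  // last speech window seen in the open segment
  size_t silence_run = 0;  // consecutive non-speech windows since last_speech

  for (size_t i = 0; i < is_speech.size(); ++i) {
    if (is_speech[i]) {
      if (!in_segment) {
        in_segment = true;
        start = i;
      } else if (i - start >= max_windows) {
        // Forced split: the open segment is full, so close it at the last
        // speech window and start a new segment here.
        segments.emplace_back(start * window_samples,
                              (last_speech + 1) * window_samples);
        start = i;
      }
      last_speech = i;
      silence_run = 0;
    } else if (in_segment && ++silence_run >= min_silence_windows) {
      // Enough silence: close the segment at the last speech window.
      segments.emplace_back(start * window_samples,
                            (last_speech + 1) * window_samples);
      in_segment = false;
      silence_run = 0;
    }
  }
  if (in_segment) {
    segments.emplace_back(start * window_samples,
                          (last_speech + 1) * window_samples);
  }
  return segments;
}
```

In a real pipeline the forced split would be sized so that every segment fits the fixed 30-second export window; padding around speech is left out here to keep the sketch short.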

While this results in exceptional WER quality across standard benchmarks, punctuation and capitalization can sometimes become inconsistent across chunk boundaries due to limited cross-chunk context in the current model. We are looking to work with Cohere to improve this behavior.
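For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a generic implementation with hypothetical names (`SplitWords`, `WordErrorRate`); the PR does not state how its WER numbers were computed or which text normalization was applied. Because this version compares words case-sensitively and keeps punctuation attached, seam inconsistencies like the ones described above would count as errors unless the text is normalized first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Splits on whitespace only; no case folding or punctuation stripping.
std::vector<std::string> SplitWords(const std::string& text) {
  std::istringstream in(text);
  std::vector<std::string> words;
  for (std::string w; in >> w;) words.push_back(w);
  return words;
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a two-row Levenshtein distance over words.
double WordErrorRate(const std::string& reference, const std::string& hypothesis) {
  const auto ref = SplitWords(reference);
  const auto hyp = SplitWords(hypothesis);
  if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;

  std::vector<size_t> prev(hyp.size() + 1), cur(hyp.size() + 1);
  for (size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;
  for (size_t i = 1; i <= ref.size(); ++i) {
    cur[0] = i;
    for (size_t j = 1; j <= hyp.size(); ++j) {
      const size_t substitute = prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
      cur[j] = std::min({substitute, prev[j] + 1, cur[j - 1] + 1});
    }
    std::swap(prev, cur);
  }
  return static_cast<double>(prev[hyp.size()]) / static_cast<double>(ref.size());
}
```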

Copilot AI review requested due to automatic review settings May 6, 2026 20:18
@nenad1002 nenad1002 requested a review from a team as a code owner May 6, 2026 20:18
Contributor

Copilot AI left a comment


Pull request overview

This PR adds first-class support for the Cohere Transcribe ASR model to ONNX Runtime GenAI, including long-form transcription via Silero VAD-based chunking and multi-chunk decoding support integrated into the existing audio/model/generator pipeline.

Changes:

  • Introduces Cohere Transcribe model implementation (processor + model state) with VAD-driven chunking and cross-cache decoder reuse.
  • Extends Generator to support Cohere-specific multi-chunk decoding while presenting a single continuous token stream to callers.
  • Adds E2E + C API tests and a Python example script; updates config schema and bumps onnxruntime-extensions dependency.
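As a rough illustration of what "a single continuous stream" implies at chunk seams, the sketch below joins per-chunk decoded text and inserts a connecting space only when neither side of the seam already has whitespace. It works on strings rather than tokens, and `JoinChunkTexts` is a hypothetical name; the PR's seam cleanup operates on token IDs inside the generator.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): concatenates per-chunk text, adding
// one space at a seam when the previous chunk does not end with whitespace
// and the next chunk does not begin with it.
std::string JoinChunkTexts(const std::vector<std::string>& chunks) {
  std::string out;
  for (const auto& chunk : chunks) {
    if (chunk.empty()) continue;
    const bool prev_ends_with_space =
        out.empty() || std::isspace(static_cast<unsigned char>(out.back())) != 0;
    const bool next_starts_with_space =
        std::isspace(static_cast<unsigned char>(chunk.front())) != 0;
    if (!prev_ends_with_space && !next_starts_with_space) out += ' ';
    out += chunk;
  }
  return out;
}
```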

Reviewed changes

Copilot reviewed 33 out of 35 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| test/python/test_onnxruntime_genai_e2e.py | Adds Cohere Transcribe Python E2E coverage (short + long audio). |
| test/c_api_tests.cpp | Adds Cohere Transcribe C API creation + E2E WER sanity test helpers. |
| src/models/whisper.h | Allows CohereState to access WhisperDecoderState internals (friend) and adds pragma once. |
| src/models/whisper_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/whisper_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/silero_vad.h | Moves SileroVadState out of the header (encapsulation cleanup). |
| src/models/silero_vad.cpp | Defines SileroVadState in the .cpp (no longer in header). |
| src/models/qwen2_5_vl_image_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/qwen2_5_vl_image_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/processor.h | Updates Processor::Create factory to pass Model& into processors. |
| src/models/phi_multimodal_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/phi_multimodal_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/phi_image_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/phi_image_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/model.h | Registers CohereProcessor; updates MultiModalProcessor factory type; adds State::IsFirstRun. |
| src/models/model.cpp | Wires Cohere model selection; updates MultiModalProcessor creation + factory invocation. |
| src/models/model_type.h | Registers cohere_transcribe as an ALM model type. |
| src/models/mistral3_image_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/mistral3_image_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/gemma4_multimodal_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/gemma4_multimodal_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/gemma_image_processor.h | Updates processor constructor signature to accept Model&. |
| src/models/gemma_image_processor.cpp | Updates constructor implementation to accept Model& (unused). |
| src/models/cohere_processor.h | Adds CohereProcessor interface (mel computation + VAD splitting). |
| src/models/cohere_processor.cpp | Implements Cohere mel feature generation, VAD chunking, and multi-chunk tensor packing. |
| src/models/cohere_model.h | Adds Cohere encoder/decoder state and CohereModel wiring (multi-chunk, commit/stream). |
| src/models/cohere_model.cpp | Implements multi-chunk decoding, seam cleanup, and encoder/decoder wiring. |
| src/generators.h | Adds Cohere-specific generator fields/entry points (prompt caching, chunk loop). |
| src/generators.cpp | Implements Cohere multi-chunk decoding loop and user-visible committed-token streaming. |
| src/config.h | Adds Cohere mel normalization + VAD chunking config fields and encoder audio_stride. |
| src/config.cpp | Parses newly-added Cohere-related config fields from JSON. |
| examples/python/cohere_transcribe.py | Adds a Cohere Transcribe example (streaming output + optional WER check). |
| cmake/deps.txt | Updates onnxruntime-extensions revision. |
| .gitignore | Allows committing test audio assets under test/test_models/audios. |

Comment thread src/models/model.cpp

std::shared_ptr<MultiModalProcessor> Model::CreateMultiModalProcessor() const {
- return std::make_shared<MultiModalProcessor>(*config_, session_info_);
+ return std::make_shared<MultiModalProcessor>(*config_, session_info_, const_cast<Model&>(*this));
if (encoder_state_->IsFirstRun()) {
encoder_state_->Run(current_length, next_tokens, next_indices);
decoder_state_->UpdateInputsOutputs(next_tokens, next_indices, current_length, first_run_);
return decoder_state_->Run(current_length, next_tokens, next_indices);
Comment on lines +167 to +170
const size_t pad_samples = static_cast<size_t>(config_->model.cohere_vad_speech_pad_ms) * samples_per_ms;
const size_t max_speech_samples =
static_cast<size_t>(config_->model.cohere_vad_max_speech_s * static_cast<float>(sample_rate));

Comment on lines +101 to +103
for (size_t off = 0; off + window <= num_samples; off += window) {
float p = vad_->ProcessWindow(samples + off, window);
is_speech.push_back(p >= threshold ? 1u : 0u);
Comment on lines +258 to +262
// Decode audio to PCM, resampled to the model's expected sample rate.
ort_extensions::OrtxObjectPtr<OrtxTensorResult> decode_result;
int sample_rate = 0;
auto [pcm_data, num_samples] = GetDecodedPCM(audios->audios_.get(), 0, mel_cfg_.sample_rate, decode_result, sample_rate);

Comment on lines +100 to +104
is_speech.reserve(num_samples / window);
for (size_t off = 0; off + window <= num_samples; off += window) {
float p = vad_->ProcessWindow(samples + off, window);
is_speech.push_back(p >= threshold ? 1u : 0u);
}
Contributor Author


Low-value comment: at most 32 ms of audio can be skipped, so this is not a real-world issue.
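For context, the loop quoted in the review only visits full windows, so any trailing samples shorter than one window never reach the VAD. If that tail ever did matter, one option would be to zero-pad it into a final window, as in this sketch. `ClassifyWindows` and the `WindowScorer` callback are hypothetical stand-ins for the PR's `vad_->ProcessWindow` call, not its actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the VAD: returns a speech probability for one window.
using WindowScorer = std::function<float(const float*, size_t)>;

// Classifies every full window, then zero-pads and classifies the partial
// tail (if any) instead of dropping it.
std::vector<uint8_t> ClassifyWindows(const float* samples, size_t num_samples,
                                     size_t window, float threshold,
                                     const WindowScorer& score) {
  std::vector<uint8_t> is_speech;
  if (window == 0) return is_speech;
  is_speech.reserve((num_samples + window - 1) / window);

  size_t off = 0;
  for (; off + window <= num_samples; off += window) {
    is_speech.push_back(score(samples + off, window) >= threshold ? 1u : 0u);
  }
  if (off < num_samples) {
    // Copy the remaining samples into a zero-filled buffer of one full window.
    std::vector<float> tail(window, 0.0f);
    std::copy(samples + off, samples + num_samples, tail.begin());
    is_speech.push_back(score(tail.data(), window) >= threshold ? 1u : 0u);
  }
  return is_speech;
}
```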

Comment thread src/generators.cpp
Comment on lines +723 to +727
if (is_cohere_model_) {
// Return only the committed tokens that have already been yielded to
// the caller via GenerateNextToken. The underlying search sequence holds
// raw per-chunk decoder output (with prompt prefix and pre-cleanup
// boundary tokens), which is not what the user should observe.
@@ -0,0 +1,137 @@
"""
Audio chunking is handled internally by CohereProcessor + Generator.
Contributor


Rather than creating new example files for each speech model, let's find a way to support all of them in a more generic and scalable way. Otherwise, we will need to keep creating, maintaining, and validating each model's example in each language.

for (const auto& [name, value] : extra_inputs) {
if (name == audio_name) {
SetOrReplaceInput(audio_name.c_str(), value->ort_tensor_, audio_features_);
} else if (name == "mel_length") {
Contributor


If this is an extra named input that comes from the preprocessor, can we compare its name to the one set in the GenAI config?
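One way this suggestion could look is sketched below: the input names live in a config-backed struct, and the dispatch compares against those fields instead of string literals. Everything here (`EncoderInputNames`, `ExtraInputKind`, `ClassifyExtraInput`, and the default name values) is hypothetical and only illustrates the shape of the change, not the existing GenAI config schema.

```cpp
#include <cassert>
#include <string>

// Hypothetical config section holding the names of preprocessor outputs.
// The defaults are placeholders; real values would come from the GenAI config.
struct EncoderInputNames {
  std::string audio_features = "audio_features";
  std::string mel_length = "mel_length";
};

enum class ExtraInputKind { AudioFeatures, MelLength, Unknown };

// Maps an extra input name to its role using the configured names only.
ExtraInputKind ClassifyExtraInput(const std::string& name,
                                  const EncoderInputNames& names) {
  if (name == names.audio_features) return ExtraInputKind::AudioFeatures;
  if (name == names.mel_length) return ExtraInputKind::MelLength;
  return ExtraInputKind::Unknown;
}
```

With this shape, a model whose export names the tensor differently only needs a config change, not a code change.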

std::map<int, std::shared_ptr<Tensor>> indexed_mel_lengths;

for (const auto& [name, tensor] : extra_inputs) {
if (name == "cohere_chunk_count") {
Contributor


We should remove the cohere prefixes in these names. Same comment as here in addition.

// Returns true if the rendered text of `token_id` does not begin with
// whitespace, i.e. a connecting space must be injected before it at a
// chunk seam to avoid "scamsIn" / "thisInto" artifacts.
bool TokenLacksLeadingSpace(int32_t token_id, const Tokenizer& tokenizer) {
Contributor


We could get/set a logits mask during the generation loop using the get_logits/set_logits APIs so that we can customize this logic as needed for future models (e.g. fine-tuned variants that don't have this issue). It also has the extra benefit that we don't have to hardcode this logic beforehand.
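The masking idea can be sketched independently of the GenAI API: given a logits vector and a per-token allow list, disallowed entries are set to negative infinity so they can never be selected. `ApplyLogitsMask` and `ArgMax` are hypothetical helpers operating on a plain `std::vector<float>`; in practice the logits would be read and written through the generator's get_logits/set_logits calls mentioned above, and the allow list (for example, tokens whose text starts with a space) would be built from the tokenizer.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Sets every logit whose token is not allowed to -infinity. Tokens beyond the
// end of `allowed` are treated as disallowed.
void ApplyLogitsMask(std::vector<float>& logits, const std::vector<bool>& allowed) {
  const float neg_inf = -std::numeric_limits<float>::infinity();
  for (size_t i = 0; i < logits.size(); ++i) {
    if (i >= allowed.size() || !allowed[i]) logits[i] = neg_inf;
  }
}

// Greedy selection, used here only to show the effect of the mask.
size_t ArgMax(const std::vector<float>& logits) {
  size_t best = 0;
  for (size_t i = 1; i < logits.size(); ++i) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}
```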

Comment thread src/models/silero_vad.h
std::vector<float> input_data_;
};

/// Internal State for SileroVad — uses State::Run for proper EP/run-options support.
Contributor


Why move from header file to implementation file?

Comment thread src/config.cpp
v_.blank_id = static_cast<int>(JSON::Get<double>(value));
} else if (name == "max_symbols_per_step") {
v_.max_symbols_per_step = static_cast<int>(JSON::Get<double>(value));
} else if (name == "cohere_vad_min_silence_ms") {
Contributor


Can we remove the cohere_vad prefix for these names and re-use any existing names where possible?

Comment thread src/generators.cpp
state_ = model.CreateState(search_->GetSequenceLengths(), params); // Search sequence lengths set when creating state
guidance_logits_processor_ = CreateGuidanceLogitsProcessor(*state_); // Could be nullptr if use_guidance (constrained decoding) is not used

is_cohere_model_ = model.config_->model.type == "cohere_transcribe";
Contributor


There is a lot of "if cohere, do X, else do Y" logic that is being added in this file. This pattern was already done with Nemotron speech, but ideally this shouldn't be happening. This file is meant to be generic for all model types and not have model-specific logic as much as possible. Let's find another way to add these changes or introduce new higher-level internal methods for better abstraction.
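One possible direction, sketched under heavy assumptions: model-specific behavior moves behind a small virtual interface chosen once at construction, so the generic generator code calls hooks instead of branching on the model type. `DecodingHooks`, `PrefixStrippingHooks`, `MakeHooks`, and the prompt length of 2 are all hypothetical; none of them exist in the codebase.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical hook interface: the default behavior exposes the raw sequence.
struct DecodingHooks {
  virtual ~DecodingHooks() = default;
  virtual std::vector<int32_t> VisibleTokens(const std::vector<int32_t>& raw) const {
    return raw;
  }
};

// Hypothetical model-specific hooks: hides a fixed-length prompt prefix.
struct PrefixStrippingHooks : DecodingHooks {
  explicit PrefixStrippingHooks(size_t prompt_length) : prompt_length_(prompt_length) {}
  std::vector<int32_t> VisibleTokens(const std::vector<int32_t>& raw) const override {
    const size_t skip = std::min(prompt_length_, raw.size());
    return std::vector<int32_t>(raw.begin() + static_cast<std::ptrdiff_t>(skip), raw.end());
  }

 private:
  size_t prompt_length_;
};

// The only place that inspects the model type; the value 2 is a placeholder.
std::unique_ptr<DecodingHooks> MakeHooks(const std::string& model_type) {
  if (model_type == "cohere_transcribe") {
    return std::make_unique<PrefixStrippingHooks>(/*prompt_length=*/2);
  }
  return std::make_unique<DecodingHooks>();
}
```

The trade-off is an extra indirection per call in exchange for keeping the generator loop free of per-model conditionals.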


3 participants