Pipeline-as-Config: Declarative model dispatch replacing model_type string registry by justinchuby · Pull Request #2115 · microsoft/onnxruntime-genai

justinchuby · 2026-05-02T03:18:45Z

Summary

Introduces a Pipeline-as-Config architecture that enables new model support through JSON configuration instead of C++ source changes.

Motivation

Currently, every new model architecture requires adding a string to the model_type whitelist and potentially new C++ model/state classes. This creates a bottleneck: the ORT GenAI team must make source changes for every new HuggingFace model.

With Pipeline-as-Config, the runtime becomes a generic pipeline executor. New models are supported by generating ONNX graphs + a v2 pipeline config — zero C++ changes needed.

Design Principle

"Detect, don't declare" — the runtime infers model capabilities from config structure and ONNX session I/O, not from hardcoded model_type strings.

What This PR Adds

1. Config v2 Schema (pipeline_config_schema.h/cpp)

- Pipeline config with sessions, flow, dataflow, state sections
- extends mechanism for preset inheritance
- Built-in presets: autoregressive-decoder, vision-language, encoder-decoder

2. PipelineConfigModel (pipeline_config.h/cpp)

- Decoder-only: inherits DecoderOnly_Model (zero code duplication)
- Multi-session: FlowInterpreter orchestrates vision/embedding/decoder sessions
- Decoder session delegates to DecoderOnly_State via composition

3. v1-to-v2 Translator (v1_translator.h/cpp)

- Auto-translates existing genai_config.json to v2 pipeline format at load time
- Full backward compatibility: all existing models continue to work unchanged

4. Config-driven dispatch (model.cpp)

- Replaces model_type string checks with pipeline config fields
- position_ids.strategy instead of IsQwen25VL()
- Encoder session presence instead of model_type=="whisper"

Config Examples

Minimal decoder-only LLM (4 lines):

{"version": 2, "pipeline": {"extends": "autoregressive-decoder", "sessions": {"decoder": {"file": "model.onnx"}}}, "tokens": {"eos": [2]}}

Testing

- 78+ tests pass (all existing + new pipeline tests)
- All existing model types tested via v1-to-v2 translator
- Backward compatible: v1 configs unchanged

Competitive Advantage

This makes ORT GenAI the only inference runtime where adding a new model — including multimodal, multi-session models — requires zero code in any language. Just ONNX graphs + JSON config.

Introduce Pipeline-as-Config: a new model dispatch path that uses config version instead of model_type string matching. When a config file has version >= 2, CreateModel() routes to PipelineConfigModel instead of the string-based dispatch chain.
New files:
- pipeline_config.h: PipelineConfigModel and PipelineConfigState
- pipeline_config.cpp: Implementation reusing existing components (DefaultKeyValueCache, DefaultPositionInputs, Logits, etc.)
Modified files:
- config.h: Add version field (default 1 for backward compat)
- config.cpp: Parse top-level 'version' field from JSON
- model.cpp: Add v2 dispatch before existing model_type logic
The PipelineConfigModel supports decoder-only models and produces identical output to DecoderOnly_Model. All existing v1 configs continue to route through the existing string dispatch unchanged.
Build: 37 tests passed, 9 skipped (GPU-only)
Net: +150 lines (3 new files, 2 modified)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
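The routing rule in this commit can be pictured with a small sketch. It is not the code in model.cpp: `ConfigSketch`, `DispatchPath`, and `ChooseDispatchPath` are hypothetical names, and the only behavior taken from the commit message is that `version` defaults to 1 and that `version >= 2` is checked before any model_type logic.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for Config; only the fields this sketch needs.
struct ConfigSketch {
  int version = 1;         // default 1 keeps legacy config files on the v1 path
  std::string model_type;  // still consulted by the legacy path
};

// Hypothetical labels for the two dispatch paths described in the commit.
enum class DispatchPath { kPipelineConfig, kLegacyModelType };

// The version check runs first, so a v2 config never reaches the
// model_type string comparisons.
inline DispatchPath ChooseDispatchPath(const ConfigSketch& config) {
  if (config.version >= 2) return DispatchPath::kPipelineConfig;
  return DispatchPath::kLegacyModelType;
}
```

A config file without a `version` field therefore behaves exactly as before, which is what keeps the change backward compatible.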

Address review findings by inheriting from DecoderOnly_Model instead of reimplementing State from scratch. This fixes:
1. Missing logits_.Update() in UpdateInputsOutputs
2. Missing sliding window handling
3. Missing chunking support
By inheriting from DecoderOnly_Model, PipelineConfigModel gets all current and future behavior for free: no code duplication, no drift. The class becomes a thin dispatch entry point (3 lines of code) that future PRs will extend for multi-session pipeline support.
Net: -109 lines (14 added, 123 removed)
Build: 37 tests passed, 9 skipped (GPU-only)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>

Implement PR 2 of Pipeline-as-Config: full v2 schema support with sessions, flow, dataflow, state, and extends mechanism.
New files:
- pipeline_config_schema.h/.cpp: PipelineConfig struct definitions and validation (sessions, flow steps, dataflow wires, state config)
- pipeline_presets.h/.cpp: Built-in presets (autoregressive-decoder, vision-language, encoder-decoder) with override/merge semantics
- v1_translator.h/.cpp: Auto-translates legacy v1 configs to PipelineConfig so downstream code has a single representation
Modified files:
- config.h: Add PipelineConfig field, include schema header
- config.cpp: Add Pipeline_V2_Element JSON parser for pipeline object; resolve presets via extends; auto-translate v1 configs
Tests: 22 new GTest cases covering presets, overrides, validation, and v1 translation for all model categories.
Signed-off-by: Justin Chu <justinchu@microsoft.com>

1. ApplyOverrides sentinel bug: Changed state.kv_cache.format and state.position_ids.strategy from std::string with 'auto' default to std::optional<std::string>. This ensures ApplyOverrides only copies fields that were explicitly set, not default-initialized. Previously, explicitly setting 'auto' to override a non-auto preset value would silently fail.
2. Hardcoded dataflow index: v1_translator now searches for the vision→embedding wire by session name instead of using config.dataflow[0]. Robust against reordering of preset wires.
3. New regression tests:
- OverrideAutoValueApplies: setting 'auto' overrides non-auto
- UnsetOverrideDoesNotClobber: unset fields preserve base values
Signed-off-by: Justin Chu <justinchu@microsoft.com>
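The sentinel fix in item 1 is easiest to see in a reduced form. The sketch below is an assumption-laden illustration, not the code in pipeline_presets.cpp: `StateSketch` and `ApplyOverridesSketch` are hypothetical, and only the two field names and the switch to `std::optional<std::string>` come from the commit message. It requires C++17.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical flattened view of the two state fields named in the commit.
struct StateSketch {
  std::optional<std::string> kv_cache_format;
  std::optional<std::string> position_ids_strategy;
};

// Copy only the fields the user explicitly set. With std::optional,
// "unset" and "explicitly set to auto" are distinguishable, which a
// std::string defaulting to "auto" could not express.
inline void ApplyOverridesSketch(StateSketch& base, const StateSketch& overrides) {
  if (overrides.kv_cache_format) base.kv_cache_format = overrides.kv_cache_format;
  if (overrides.position_ids_strategy) {
    base.position_ids_strategy = overrides.position_ids_strategy;
  }
}
```

With this shape, the two regression tests named above correspond to one override that sets "auto" over a preset's non-auto value and one override that leaves a field empty.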

- Add tests for phi4mm (MMM→vision-language) and marian-ssru (→encoder-decoder) translator branches
- Add comprehensive LLM/VLM type coverage (all 21 LLM + all 4 VLM)
- Extract PropagateKVCachePatterns() helper to eliminate copy-paste between LLM and VLM translator branches
Signed-off-by: Justin Chu <justinchu@microsoft.com>

Implement PR 3 of Pipeline-as-Config: extend PipelineConfigModel to handle multi-session pipelines (vision + embedding + decoder).
New components:
- FlowInterpreter: thin orchestration layer that partitions flow steps into prompt-phase and decode-phase groups, stores intermediate tensors between sessions, and wires outputs to downstream inputs based on dataflow config.
- PipelineConfigModel: loads named sessions from pipeline_config, creates per-session ORT options (graph capture disabled for non-decoder sessions), and constructs FlowInterpreter.
- PipelineConfigState: orchestrates multi-session execution. Decoder session reuses existing components (KV cache, position inputs, logits) for full parity with DecoderOnly_State. Non-decoder sessions are run directly with I/O built from extra inputs + wired intermediates.
Flow execution model:
- Prompt phase: prompt_steps (when=prompt/once) then always_steps
- Decode phase: only always_steps
- Prompt-only intermediates are cleared after prompt completes
Dataflow wiring:
- Intermediate tensors from upstream sessions are automatically wired to downstream session inputs based on dataflow[] config entries.
- Decoder inputs can be dynamically extended with wired intermediates (e.g., inputs_embeds from embedding session).
Tests: 14 new FlowInterpreter tests covering flow partitioning, intermediate storage, dataflow wiring, and custom pipeline configs. All 79 existing tests pass (9 skipped, GPU-only).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
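The flow execution model in this commit can be sketched as a partition followed by a per-phase ordering. Everything here is hypothetical (`FlowStepSketch`, `PartitionedFlow`, `PartitionFlow`, `StepsForPhase`); the only facts taken from the commit message are that `when=prompt` and `when=once` steps run only in the prompt phase and that the prompt phase runs prompt steps before always steps. Treating every other `when` value as "always" is an assumption of the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical reduction of a flow step: which session to run and when.
struct FlowStepSketch {
  std::string run;
  std::string when;
};

struct PartitionedFlow {
  std::vector<std::string> prompt_steps;  // when=prompt or when=once
  std::vector<std::string> always_steps;  // everything else (assumption)
};

inline PartitionedFlow PartitionFlow(const std::vector<FlowStepSketch>& flow) {
  PartitionedFlow out;
  for (const auto& step : flow) {
    if (step.when == "prompt" || step.when == "once") {
      out.prompt_steps.push_back(step.run);
    } else {
      out.always_steps.push_back(step.run);
    }
  }
  return out;
}

// Prompt phase: prompt steps, then always steps. Decode phase: always steps only.
inline std::vector<std::string> StepsForPhase(const PartitionedFlow& flow,
                                              bool is_prompt) {
  std::vector<std::string> order;
  if (is_prompt) order = flow.prompt_steps;
  order.insert(order.end(), flow.always_steps.begin(), flow.always_steps.end());
  return order;
}
```

For a vision-language style flow, a vision step marked `prompt` would run once on the first call and be skipped on every later decode step.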

Clear owned OrtValue unique_ptrs from intermediate_store_ alongside the non-owning pointers in FlowInterpreter when prompt phase ends. Without this, prompt-only session outputs (e.g., vision features) would leak memory after the prompt completes. Expose prompt_only_sessions() accessor on FlowInterpreter so the state can identify which intermediates to release. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Justin Chu <justinchu@microsoft.com>

Replace model_type string dispatch with config-driven fields:
- CreatePositionInputs: check pipeline_config.state.position_ids.strategy instead of ModelType::IsQwen25VL(). The v1 translator already sets strategy='mrope_3d' for Qwen2.5-VL models, so existing configs work.
- IsPastPresentShareBufferEnabled: check for encoder session in pipeline_config.sessions instead of model_type=='whisper'. Any encoder-decoder model (not just Whisper) gets the share buffer exception for beam search.
- Remove model_type.h include from position_inputs.cpp (no longer needed).
This eliminates 2 of the 6 model_type coupling points identified in the architecture analysis. The remaining checks in model.cpp (factory dispatch) and generators.cpp (error messages) are preserved for v1 backward compatibility.
Build: 79 tests passed, 9 skipped (GPU-only)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>

Address review findings: PipelineConfigState::UpdateInputsOutputs was missing sliding window length clamping, and Run() was missing chunked context processing, both behaviors present in DecoderOnly_State. Fixes:
1. UpdateInputsOutputs: add sliding window config checks for position_length and kv_cache_length clamping, matching DecoderOnly_State::UpdateInputsOutputs exactly.
2. Run: add chunking support for single-session decoder path via RunDecoderWithChunking, matching DecoderOnly_State::RunWithChunking.
3. Multi-session path: move UpdateInputsOutputs call after the single-session early return for cleaner control flow.
All 79 tests pass (9 skipped, GPU-only).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>

Restructure per review: PipelineConfigModel now inherits from DecoderOnly_Model (same pattern as PR 1). This gives us:
1. Single-session (decoder-only) configs: CreateState() returns DecoderOnly_State directly: zero custom code, full parity with sliding window, chunking, graph capture, and all future improvements.
2. Multi-session (VLM/encoder-decoder) configs: PipelineConfigState holds a DecoderOnly_State internally and delegates all decoder operations to it. FlowInterpreter orchestrates non-decoder sessions around the decoder, wiring intermediates via dataflow config.
This eliminates ~80 lines of duplicated decoder logic (UpdateInputsOutputs, RunDecoderWithChunking, sliding window clamping, graph capture gating) and ensures the decoder path can never drift from DecoderOnly_State.
Net: pipeline_config.cpp drops from 290 LOC to 220 LOC. All 79 tests pass (9 skipped, GPU-only).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
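The split between the two CreateState() outcomes can be sketched with toy types. None of these classes exist in the PR: `StateSketch`, `DecoderStateSketch`, `PipelineStateSketch`, and `CreateStateSketch` are hypothetical, and `Run()` returns a trace string here purely so the delegation order is observable.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct StateSketch {
  virtual ~StateSketch() = default;
  virtual std::string Run() = 0;
};

// Stand-in for the existing decoder state; all decoder behavior lives here.
struct DecoderStateSketch : StateSketch {
  std::string Run() override { return "decoder"; }
};

// Multi-session state: runs upstream sessions, then delegates to an
// internally held decoder state instead of reimplementing it.
class PipelineStateSketch : public StateSketch {
 public:
  explicit PipelineStateSketch(std::vector<std::string> upstream)
      : upstream_(std::move(upstream)) {}

  std::string Run() override {
    std::string trace;
    for (const auto& name : upstream_) trace += name + ">";
    return trace + decoder_.Run();
  }

 private:
  std::vector<std::string> upstream_;
  DecoderStateSketch decoder_;  // composition, not duplication
};

// Single-session configs get the decoder state directly; anything with
// extra sessions gets the orchestrating wrapper.
inline std::unique_ptr<StateSketch> CreateStateSketch(
    const std::vector<std::string>& sessions) {
  std::vector<std::string> upstream;
  for (const auto& name : sessions) {
    if (name != "decoder") upstream.push_back(name);
  }
  if (upstream.empty()) return std::make_unique<DecoderStateSketch>();
  return std::make_unique<PipelineStateSketch>(std::move(upstream));
}
```

The point of the design is visible in the sketch: the decoder-only path contains no pipeline code at all, so it cannot diverge from the existing decoder behavior.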

…ntermediates
Fix 2 remaining review findings:
1. FlowInterpreter is now fully stateless: no mutable intermediates map. GetWiredInputs takes an externally-owned intermediates map as a parameter. This prevents clobbering when multiple States share a Model (finding microsoft#3).
2. NonDecoderSessionIO stores owned std::string copies instead of raw c_str() pointers from ExtraInput, preventing dangling pointer bugs if the ExtraInput vector is relocated (finding microsoft#4).
Intermediate tensor ownership is now fully in PipelineConfigState:
- intermediate_owned_: unique_ptr<OrtValue> map (memory ownership)
- intermediates_: raw OrtValue* map (fast lookup for wiring)
Both are per-State, not per-Model. Tests updated to match stateless FlowInterpreter API. All 78 tests pass (9 skipped, GPU-only).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
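A stateless wiring lookup of the kind described in item 1 might look like the sketch below. The names `TensorSketch`, `WireSketch`, and `StatelessWiring` are hypothetical, as is the wire layout (a "session.output" source key plus a destination session and input name); the PR's actual dataflow schema is not shown on this page. What the sketch takes from the commit is only the shape of the API: the intermediates map is passed in by the caller, so the shared object holds no per-generation state.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct TensorSketch { int id; };  // stand-in for OrtValue

// Hypothetical wire: "session.output" feeds to_session's input to_input.
struct WireSketch {
  std::string from;
  std::string to_session;
  std::string to_input;
};

class StatelessWiring {
 public:
  explicit StatelessWiring(std::vector<WireSketch> wires)
      : wires_(std::move(wires)) {}

  // Const method over caller-owned intermediates: two states sharing one
  // StatelessWiring cannot overwrite each other's tensors.
  std::map<std::string, TensorSketch*> GetWiredInputs(
      const std::string& session,
      const std::map<std::string, TensorSketch*>& intermediates) const {
    std::map<std::string, TensorSketch*> wired;
    for (const auto& wire : wires_) {
      if (wire.to_session != session) continue;
      const auto it = intermediates.find(wire.from);
      if (it != intermediates.end()) wired[wire.to_input] = it->second;
    }
    return wired;
  }

 private:
  const std::vector<WireSketch> wires_;  // immutable after construction
};
```

Because the only member is immutable configuration, one instance can live on the model and be queried by any number of states with their own maps.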

When multiple sessions produce outputs with the same tensor name, GetOutput now returns the last-stored match (most-downstream session in flow order) rather than the first match. This ensures that e.g. GetOutput('hidden_states') returns the decoder's output, not the encoder's, when both exist. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Justin Chu <justinchu@microsoft.com>

Copilot

Pull request overview

This PR introduces a “Pipeline-as-Config” v2 architecture where model dispatch and multi-session orchestration are driven by a declarative pipeline config (with presets + inheritance), and legacy v1 configs are translated into the v2 representation for backward compatibility.

Changes:

- Add v2 pipeline schema + preset system (extends, sessions/flow/dataflow/state) with validation.
- Introduce v1→v2 translation and config loading logic to populate Config::pipeline_config for both v1 and v2.
- Add PipelineConfigModel + FlowInterpreter to execute multi-session pipelines and switch position-id behavior to be config-driven.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

File	Description
`src/pipeline_config_schema.h`	Defines v2 `PipelineConfig` schema (sessions/flow/dataflow/state) and validation API.
`src/pipeline_config_schema.cpp`	Implements `ValidatePipelineConfig` for internal consistency checks.
`src/pipeline_presets.h`	Declares built-in presets and an overrides-merge API.
`src/pipeline_presets.cpp`	Implements presets (`autoregressive-decoder`, `vision-language`, `encoder-decoder`) and override application.
`src/v1_translator.h`	Declares v1→v2 translation API and documents mapping behavior.
`src/v1_translator.cpp`	Implements v1→v2 translation using `ModelType` categorization and preset wiring.
`src/config.h`	Adds `Config::version` and `Config::pipeline_config` fields.
`src/config.cpp`	Parses `version` and `pipeline` JSON; resolves presets/overrides; validates v2; translates v1 into v2.
`src/models/flow_interpreter.h`	Adds stateless flow partitioning and dataflow wiring resolution.
`src/models/flow_interpreter.cpp`	Implements flow partitioning and wired-input resolution from intermediates.
`src/models/pipeline_config.h`	Declares `PipelineConfigModel`/`PipelineConfigState` for config-driven execution.
`src/models/pipeline_config.cpp`	Implements multi-session orchestration and intermediate tensor management.
`src/models/model.cpp`	Dispatches v2 configs to `PipelineConfigModel` (v1 remains legacy string dispatch).
`src/models/position_inputs.cpp`	Switches Qwen 2.5 VL 3D position-id selection to config-driven strategy.
`src/models/kv_cache.cpp`	Updates past/present share-buffer enablement call-site to use the new signature.
`src/generators.h`	Changes `IsPastPresentShareBufferEnabled` signature to accept `Config&`.
`src/generators.cpp`	Implements new share-buffer gating using `pipeline_config` encoder presence.
`test/pipeline_config_schema_test.cpp`	Adds unit tests for presets, overrides, validation, and v1→v2 translation.
`test/flow_interpreter_test.cpp`	Adds unit tests for flow partitioning and dataflow wiring behavior.

+  if (is_prompt_) {
+    is_prompt_ = false;
+    // Clear prompt-only intermediates from both maps
+    const auto& prompt_sessions = model_.flow_interpreter_->prompt_only_sessions();
+    for (auto it = intermediates_.begin(); it != intermediates_.end();) {
+      auto dot = it->first.find('.');
+      if (dot != std::string::npos &&
+          prompt_sessions.count(it->first.substr(0, dot))) {
+        intermediate_owned_.erase(it->first);
+        it = intermediates_.erase(it);
+      } else {
+        ++it;
+      }
+    }
+  }
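The hunk above depends on two maps staying in sync: one that owns the tensors and one that holds raw pointers for lookup. A self-contained sketch of that pairing follows. `TensorSketch` and `IntermediateStore` are hypothetical; the "session.tensor" key format and the erase-while-iterating pattern are taken from the hunk.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>

struct TensorSketch { int id; };  // stand-in for OrtValue

struct IntermediateStore {
  std::map<std::string, std::unique_ptr<TensorSketch>> owned;  // ownership
  std::map<std::string, TensorSketch*> lookup;                 // non-owning

  void Store(const std::string& key, int id) {
    owned[key] = std::make_unique<TensorSketch>(TensorSketch{id});
    lookup[key] = owned[key].get();
  }

  // Drop every "session.tensor" entry whose session is prompt-only.
  // Erasing from both maps releases the memory and avoids leaving a
  // dangling raw pointer behind in the lookup map.
  void ErasePromptOnly(const std::set<std::string>& prompt_sessions) {
    for (auto it = lookup.begin(); it != lookup.end();) {
      const auto dot = it->first.find('.');
      if (dot != std::string::npos &&
          prompt_sessions.count(it->first.substr(0, dot)) > 0) {
        owned.erase(it->first);
        it = lookup.erase(it);  // erase returns the next valid iterator
      } else {
        ++it;
      }
    }
  }
};
```

Note the order inside the loop: the owning entry is erased using the key while the lookup iterator is still valid, and only then is the lookup entry removed.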


+void PipelineConfigState::RunFlowStep(
+    const PipelineConfig::FlowStep& step,
+    int total_length,
+    DeviceSpan<int32_t>& next_tokens,
+    DeviceSpan<int32_t> next_indices) {
+  if (step.run == "decoder") {
+    // Wire any intermediate inputs (e.g. inputs_embeds from embedding session)
+    // into the decoder state's input bindings before running.
+    auto wired = model_.flow_interpreter_->GetWiredInputs(
+        "decoder", intermediates_);
+    for (const auto& [input_name, value] : wired) {
+      bool found = false;
+      for (size_t i = 0; i < decoder_state_->input_names_.size(); ++i) {
+        if (std::strcmp(decoder_state_->input_names_[i],
+                        input_name.c_str()) == 0) {
+          decoder_state_->inputs_[i] = value;
+          found = true;
+          break;
+        }
+      }
+      if (!found) {
+        wired_decoder_input_names_.push_back(input_name);
+        decoder_state_->input_names_.push_back(
+            wired_decoder_input_names_.back().c_str());
+        decoder_state_->inputs_.push_back(value);
+      }
+    }
+
+    // Delegate to DecoderOnly_State::Run which handles everything:
+    // KV cache, position inputs, logits, sliding window, chunking,
+    // graph capture, run options.
+    last_logits_ = decoder_state_->Run(total_length, next_tokens, next_indices);
+    return;
+  }
+
+  RunNonDecoderSession(step.run);
+}


+  // Search intermediates by tensor name (suffix after "session.").
+  // If multiple sessions produce the same tensor name, the last-stored
+  // one wins (which is the most-downstream session in flow order).
+  OrtValue* result = nullptr;
+  for (const auto& [key, value] : intermediates_) {
+    auto dot = key.find('.');
+    if (dot != std::string::npos && key.substr(dot + 1) == name) {
+      result = value;
+    }
+  }
+  if (result) return result;
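One caveat with the loop above: iterating a std::map visits keys in sorted order, which need not match the order in which sessions ran. The later fix commit in this PR mentions an output_by_tensor_name_ map updated at store time so the last-executed session wins. The sketch below illustrates that idea with hypothetical names (`TensorSketch`, `OutputIndexSketch`); it is not the PR's implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

struct TensorSketch { int id; };  // stand-in for OrtValue

struct OutputIndexSketch {
  std::map<std::string, TensorSketch*> intermediates;   // "session.tensor"
  std::map<std::string, TensorSketch*> by_tensor_name;  // "tensor" -> latest

  // Recording the tensor name at store time makes "last stored" mean
  // "last executed", independent of how session names happen to sort.
  void Store(const std::string& session, const std::string& tensor,
             TensorSketch* value) {
    intermediates[session + "." + tensor] = value;
    by_tensor_name[tensor] = value;
  }

  TensorSketch* GetOutput(const std::string& name) const {
    const auto it = by_tensor_name.find(name);
    return it == by_tensor_name.end() ? nullptr : it->second;
  }
};
```

For example, if a session named "vision" runs before one named "adapter" and both emit "features", a key-ordered scan would end on "vision.features", while the store-time index returns the adapter's tensor.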


+// Throws std::runtime_error if the model_type is not supported for
+// translation (e.g. unknown or highly custom model types).


  guidance_ff_tokens_enabled = enable_ff_tokens;
 }

-bool GeneratorParams::IsPastPresentShareBufferEnabled(const std::string& model_type) const {
+bool GeneratorParams::IsPastPresentShareBufferEnabled(const Config& config) const {
  // past_present_share_buffer is only actually enabled when:
  // 1. The config option is set to true, AND
-  // 2. Either num_beams == 1 OR the model is Whisper
+  // 2. Either num_beams == 1 OR the model is encoder-decoder (has encoder session)
+  bool is_encoder_decoder = config.pipeline_config.sessions.count("encoder") > 0;
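The condition described in the comments of this hunk reduces to a small predicate. The sketch below uses hypothetical names (`ShareBufferInputsSketch`, `IsShareBufferEnabledSketch`) and a plain set of session names in place of the real Config; the rest of the actual function body is not shown on this page, so this follows only the two numbered conditions in the comment.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical inputs: the config flag, the beam count, and the names of
// the sessions declared in the pipeline config.
struct ShareBufferInputsSketch {
  bool share_buffer_requested = false;
  int num_beams = 1;
  std::set<std::string> session_names;
};

// Enabled only when requested AND (single beam OR an encoder session
// exists). The encoder check replaces the old model_type == "whisper" test.
inline bool IsShareBufferEnabledSketch(const ShareBufferInputsSketch& in) {
  const bool is_encoder_decoder = in.session_names.count("encoder") > 0;
  return in.share_buffer_requested &&
         (in.num_beams == 1 || is_encoder_decoder);
}
```

The behavioral change relative to the string check is that any pipeline declaring an "encoder" session qualifies for the beam-search exception, not only Whisper.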


+PipelineConfigModel::PipelineConfigModel(
+    std::unique_ptr<Config> config, OrtEnv& ort_env)
+    : DecoderOnly_Model{std::move(config), ort_env} {
+  // DecoderOnly_Model already loaded session_decoder_ and registered it
+  // with session_info_.  Now load any additional sessions for multi-session
+  // pipelines (vision, embedding, encoder, etc.).
+  const auto& pipeline_config = config_->pipeline_config;
+
+  for (const auto& [name, session_config] : pipeline_config.sessions) {


Fix all 7 review findings from Copilot review:
1. Use-after-free in wired decoder inputs: save/restore decoder input_names_/inputs_ size around wiring so prompt intermediates can be safely freed without leaving dangling pointers.
2. FlowStep.loop ignored: add loop awareness in RunFlowStep with TODO for per_image implementation (currently runs once).
3. GetOutput map order vs flow order: add output_by_tensor_name_ map updated at store time so last-executed session wins.
4. v1 translator doc comment: document fallback to autoregressive-decoder preset for unknown model types (matches implementation).
5. IsPastPresentShareBufferEnabled: keep encoder-session check (matches original Whisper behavior) with documented assumption that encoder-decoder models use cache_indirection for beams.
6. v2 decoder filename sync: after preset resolution, sync pipeline_config.sessions["decoder"].file to model.decoder.filename so DecoderOnly_Model loads the correct session.
7. c_str() invalidation: change wired_decoder_input_names_ from vector to deque for stable string addresses on push_back.
Build: 82 tests passed, 18 skipped (GPU-only)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
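Item 7 relies on a standard-library guarantee: std::deque::push_back does not invalidate references or pointers to existing elements, whereas std::vector may reallocate and move its strings. The sketch below checks that property directly; `PointersStayValid` is a hypothetical helper, not code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Appends each name to a deque, recording the c_str() pointer at the time
// of insertion, then verifies that every recorded pointer still refers to
// the same buffer after all insertions.
inline bool PointersStayValid(const std::vector<std::string>& names) {
  std::deque<std::string> storage;
  std::vector<const char*> views;
  for (const auto& name : names) {
    storage.push_back(name);
    views.push_back(storage.back().c_str());
  }
  for (std::size_t i = 0; i < storage.size(); ++i) {
    if (views[i] != storage[i].c_str()) return false;
    if (std::string(views[i]) != names[i]) return false;
  }
  return true;
}
```

Doing the same with a std::vector<std::string> as storage would be undefined behavior once a reallocation occurs, since short strings keep their characters inside the moved object.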

16 new test cases covering:
- Complex multi-session dataflow (5-session chain)
- Diamond dataflow topology (A→B, A→C, B→D, C→D)
- Standalone session with no dataflow wires
- Flow with only once/prompt/always steps
- Large flow (10 sessions in chain)
- Preset override with conflicting roles and state
- Dataflow replacement semantics
- KV cache pattern override propagation
- Duplicate session in flow (valid; runs multiple times)
- Empty loop string validation
- VLM KV cache pattern propagation
- Fara position strategy (mrope_3d)
Total: 44 pipeline tests, 96 overall tests pass.
Signed-off-by: Justin Chu <justinchu@microsoft.com>

Design updates:
- Rename when vocabulary: prompt→init, always→step, add final
- Backward-compat aliases auto-normalized via NormalizePipelineConfig()
- generation_loop as std::optional (autoregressive/single_pass/denoising)
- FlowInterpreter: init_steps_/step_steps_/final_steps_
- Throw clear error for unsupported 'when: final'
- Unrecognized when values produce clear error messages
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
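The alias normalization can be sketched for a single `when` value. `NormalizeWhenSketch` is hypothetical and covers only the two renames named in this commit message (prompt→init, always→step); how NormalizePipelineConfig() treats other legacy values such as `once` is not stated here, so the sketch rejects anything outside the five names it knows.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Maps legacy aliases to the new vocabulary, passes new names through,
// and raises a descriptive error for anything else.
inline std::string NormalizeWhenSketch(const std::string& when) {
  if (when == "prompt") return "init";  // legacy alias
  if (when == "always") return "step";  // legacy alias
  if (when == "init" || when == "step" || when == "final") return when;
  throw std::invalid_argument("unrecognized 'when' value: " + when);
}
```

Normalizing once at load time means downstream code such as the interpreter only has to understand the new names.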

Copilot AI review requested due to automatic review settings May 2, 2026 03:18

Copilot started reviewing on behalf of justinchuby May 2, 2026 03:19

justinchuby and others added 12 commits May 2, 2026 03:21

justinchuby linked an issue May 2, 2026 that may be closed by this pull request

Copilot design of Pipeline-as-Config #2114

Open

justinchuby force-pushed the pipeline-as-config branch from e749c21 to ab971aa Compare May 2, 2026 03:23

Copilot AI reviewed May 2, 2026

justinchuby and others added 3 commits May 2, 2026 03:50

justinchuby force-pushed the pipeline-as-config branch from a8e46c5 to 52dee03 Compare May 2, 2026 04:16

justinchuby marked this pull request as draft May 5, 2026 14:23

justinchuby wants to merge 15 commits into microsoft:main from justinchuby:pipeline-as-config
