Pipeline-as-Config: Declarative model dispatch replacing model_type string registry #2115
Draft
justinchuby wants to merge 15 commits into
Introduce Pipeline-as-Config: a new model dispatch path that uses config version instead of model_type string matching. When a config file has version >= 2, CreateModel() routes to PipelineConfigModel instead of the string-based dispatch chain.

New files:
- pipeline_config.h: PipelineConfigModel and PipelineConfigState
- pipeline_config.cpp: Implementation reusing existing components (DefaultKeyValueCache, DefaultPositionInputs, Logits, etc.)

Modified files:
- config.h: Add version field (default 1 for backward compat)
- config.cpp: Parse top-level 'version' field from JSON
- model.cpp: Add v2 dispatch before existing model_type logic

The PipelineConfigModel supports decoder-only models and produces identical output to DecoderOnly_Model. All existing v1 configs continue to route through the existing string dispatch unchanged.

Build: 37 tests passed, 9 skipped (GPU-only)
Net: +150 lines (3 new files, 2 modified)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
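The version-gated routing described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Config`, `Model`, `LegacyModel`, and the `Kind()` method are simplified stand-ins invented for the sketch, and the real `CreateModel()` takes different arguments.

```cpp
#include <memory>
#include <string>

// Simplified stand-in for the real Config (assumption: only these two fields matter here).
struct Config {
  int version = 1;         // default 1 keeps existing configs on the legacy path
  std::string model_type;  // only consulted on the v1 path
};

// Hypothetical minimal model interface used to observe which path was taken.
struct Model {
  virtual ~Model() = default;
  virtual std::string Kind() const = 0;
};

struct LegacyModel : Model {
  std::string Kind() const override { return "legacy"; }
};

struct PipelineConfigModel : Model {
  std::string Kind() const override { return "pipeline"; }
};

// Check the config version first; v1 configs fall through to string dispatch.
std::unique_ptr<Model> CreateModel(const Config& config) {
  if (config.version >= 2) {
    return std::make_unique<PipelineConfigModel>();
  }
  // v1: the existing model_type comparison chain would run here.
  return std::make_unique<LegacyModel>();
}
```

Because the version check sits in front of the legacy chain, a config that omits `version` behaves exactly as before.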
Address review findings by inheriting from DecoderOnly_Model instead of reimplementing State from scratch. This fixes:
1. Missing logits_.Update() in UpdateInputsOutputs
2. Missing sliding window handling
3. Missing chunking support

By inheriting from DecoderOnly_Model, PipelineConfigModel gets all current and future behavior for free: no code duplication, no drift. The class becomes a thin dispatch entry point (3 lines of code) that future PRs will extend for multi-session pipeline support.

Net: -109 lines (14 added, 123 removed)
Build: 37 tests passed, 9 skipped (GPU-only)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Implement PR 2 of Pipeline-as-Config: full v2 schema support with sessions, flow, dataflow, state, and extends mechanism.

New files:
- pipeline_config_schema.h/.cpp: PipelineConfig struct definitions and validation (sessions, flow steps, dataflow wires, state config)
- pipeline_presets.h/.cpp: Built-in presets (autoregressive-decoder, vision-language, encoder-decoder) with override/merge semantics
- v1_translator.h/.cpp: Auto-translates legacy v1 configs to PipelineConfig so downstream code has a single representation

Modified files:
- config.h: Add PipelineConfig field, include schema header
- config.cpp: Add Pipeline_V2_Element JSON parser for pipeline object; resolve presets via extends; auto-translate v1 configs

Tests: 22 new GTest cases covering presets, overrides, validation, and v1 translation for all model categories.

Signed-off-by: Justin Chu <justinchu@microsoft.com>
1. ApplyOverrides sentinel bug: Changed state.kv_cache.format and state.position_ids.strategy from std::string with 'auto' default to std::optional<std::string>. This ensures ApplyOverrides only copies fields that were explicitly set, not default-initialized. Previously, explicitly setting 'auto' to override a non-auto preset value would silently fail.
2. Hardcoded dataflow index: v1_translator now searches for the vision→embedding wire by session name instead of using config.dataflow[0]. Robust against reordering of preset wires.
3. New regression tests:
   - OverrideAutoValueApplies: setting 'auto' overrides non-auto
   - UnsetOverrideDoesNotClobber: unset fields preserve base values

Signed-off-by: Justin Chu <justinchu@microsoft.com>
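The sentinel fix in item 1 comes down to distinguishing "not specified" from "specified as 'auto'". A minimal sketch of that idea, assuming a trimmed-down struct and a hypothetical `ApplyOverride` helper (the real `ApplyOverrides` operates on the whole `PipelineConfig`; the value `"custom"` in the usage note is purely illustrative):

```cpp
#include <optional>
#include <string>

// Hypothetical, trimmed-down state config; the field name follows the commit message.
struct KvCacheState {
  std::optional<std::string> format;  // std::nullopt means "not specified"
};

// Copy the override onto the base only when the override was explicitly set.
// With a plain std::string defaulting to "auto", an explicit "auto" could not
// be told apart from an untouched default, so it was silently dropped.
void ApplyOverride(std::optional<std::string>& base,
                   const std::optional<std::string>& override_value) {
  if (override_value.has_value()) {
    base = override_value;
  }
}
```

With this shape, overriding a base of `"custom"` with an explicit `"auto"` yields `"auto"`, while passing `std::nullopt` leaves `"custom"` in place. Those are the two behaviors the regression tests in item 3 are described as covering.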
- Add tests for phi4mm (MMM→vision-language) and marian-ssru (→encoder-decoder) translator branches
- Add comprehensive LLM/VLM type coverage (all 21 LLM + all 4 VLM)
- Extract PropagateKVCachePatterns() helper to eliminate copy-paste between LLM and VLM translator branches

Signed-off-by: Justin Chu <justinchu@microsoft.com>
Implement PR 3 of Pipeline-as-Config: extend PipelineConfigModel to handle multi-session pipelines (vision + embedding + decoder).

New components:
- FlowInterpreter: thin orchestration layer that partitions flow steps into prompt-phase and decode-phase groups, stores intermediate tensors between sessions, and wires outputs to downstream inputs based on dataflow config.
- PipelineConfigModel: loads named sessions from pipeline_config, creates per-session ORT options (graph capture disabled for non-decoder sessions), and constructs FlowInterpreter.
- PipelineConfigState: orchestrates multi-session execution. Decoder session reuses existing components (KV cache, position inputs, logits) for full parity with DecoderOnly_State. Non-decoder sessions are run directly with I/O built from extra inputs + wired intermediates.

Flow execution model:
- Prompt phase: prompt_steps (when=prompt/once) then always_steps
- Decode phase: only always_steps
- Prompt-only intermediates are cleared after prompt completes

Dataflow wiring:
- Intermediate tensors from upstream sessions are automatically wired to downstream session inputs based on dataflow[] config entries.
- Decoder inputs can be dynamically extended with wired intermediates (e.g., inputs_embeds from embedding session).

Tests: 14 new FlowInterpreter tests covering flow partitioning, intermediate storage, dataflow wiring, and custom pipeline configs. All 79 existing tests pass (9 skipped, GPU-only).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
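The flow execution model above (prompt phase runs prompt steps then always steps; decode phase runs always steps only) can be sketched as below. The types `FlowStep` and `PartitionedFlow` and both functions are hypothetical simplifications for illustration; rejecting unknown `when` values at partition time is an assumption of this sketch, not something this commit states.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified flow step: which session to run and when.
struct FlowStep {
  std::string run;
  std::string when;  // "prompt", "once", or "always"
};

struct PartitionedFlow {
  std::vector<FlowStep> prompt_steps;  // executed only during the prompt phase
  std::vector<FlowStep> always_steps;  // executed in every phase
};

// Split the declared flow into the two groups, preserving declaration order.
PartitionedFlow PartitionFlow(const std::vector<FlowStep>& flow) {
  PartitionedFlow out;
  for (const auto& step : flow) {
    if (step.when == "prompt" || step.when == "once") {
      out.prompt_steps.push_back(step);
    } else if (step.when == "always") {
      out.always_steps.push_back(step);
    } else {
      throw std::invalid_argument("unrecognized when value: " + step.when);
    }
  }
  return out;
}

// Session names to run for one phase: prompt_steps then always_steps during
// the prompt phase, always_steps only during decode.
std::vector<std::string> StepsForPhase(const PartitionedFlow& flow, bool is_prompt) {
  std::vector<std::string> order;
  if (is_prompt) {
    for (const auto& s : flow.prompt_steps) order.push_back(s.run);
  }
  for (const auto& s : flow.always_steps) order.push_back(s.run);
  return order;
}
```

For a vision-language style flow of `vision` (prompt), `embedding` (always), `decoder` (always), the prompt phase runs all three in order and each decode iteration runs only `embedding` and `decoder`.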
Clear owned OrtValue unique_ptrs from intermediate_store_ alongside the non-owning pointers in FlowInterpreter when prompt phase ends. Without this, prompt-only session outputs (e.g., vision features) would leak memory after the prompt completes.

Expose prompt_only_sessions() accessor on FlowInterpreter so the state can identify which intermediates to release.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Replace model_type string dispatch with config-driven fields:
- CreatePositionInputs: check pipeline_config.state.position_ids.strategy instead of ModelType::IsQwen25VL(). The v1 translator already sets strategy='mrope_3d' for Qwen2.5-VL models, so existing configs work.
- IsPastPresentShareBufferEnabled: check for encoder session in pipeline_config.sessions instead of model_type=='whisper'. Any encoder-decoder model (not just Whisper) gets the share buffer exception for beam search.
- Remove model_type.h include from position_inputs.cpp (no longer needed).

This eliminates 2 of the 6 model_type coupling points identified in the architecture analysis. The remaining checks in model.cpp (factory dispatch) and generators.cpp (error messages) are preserved for v1 backward compatibility.

Build: 79 tests passed, 9 skipped (GPU-only)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
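The share-buffer rule in the second bullet replaces a string comparison with a structural check. A rough sketch, assuming simplified `SessionConfig` and `PipelineConfig` stand-ins and a free function whose parameters (`option_enabled`, `num_beams`) are made explicit for illustration; in the PR this is a `GeneratorParams` member taking `const Config&`:

```cpp
#include <map>
#include <string>

// Simplified stand-ins for the real schema types (assumption).
struct SessionConfig {
  std::string file;
};

struct PipelineConfig {
  std::map<std::string, SessionConfig> sessions;
};

// Share-buffer is effective only when the option is on AND either there is a
// single beam or the pipeline declares an "encoder" session.
bool IsPastPresentShareBufferEnabled(bool option_enabled, int num_beams,
                                     const PipelineConfig& pipeline) {
  const bool is_encoder_decoder = pipeline.sessions.count("encoder") > 0;
  return option_enabled && (num_beams == 1 || is_encoder_decoder);
}
```

The design consequence is that any config with a session named `encoder` gets the beam-search exception, so the check depends on that naming convention rather than on a model name.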
Address review findings: PipelineConfigState::UpdateInputsOutputs was missing sliding window length clamping, and Run() was missing chunked context processing. Both behaviors are present in DecoderOnly_State.

Fixes:
1. UpdateInputsOutputs: add sliding window config checks for position_length and kv_cache_length clamping, matching DecoderOnly_State::UpdateInputsOutputs exactly.
2. Run: add chunking support for single-session decoder path via RunDecoderWithChunking, matching DecoderOnly_State::RunWithChunking.
3. Multi-session path: move UpdateInputsOutputs call after the single-session early return for cleaner control flow.

All 79 tests pass (9 skipped, GPU-only).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Restructure per review: PipelineConfigModel now inherits from DecoderOnly_Model (same pattern as PR 1). This gives us:
1. Single-session (decoder-only) configs: CreateState() returns DecoderOnly_State directly, giving zero custom code and full parity with sliding window, chunking, graph capture, and all future improvements.
2. Multi-session (VLM/encoder-decoder) configs: PipelineConfigState holds a DecoderOnly_State internally and delegates all decoder operations to it. FlowInterpreter orchestrates non-decoder sessions around the decoder, wiring intermediates via dataflow config.

This eliminates ~80 lines of duplicated decoder logic (UpdateInputsOutputs, RunDecoderWithChunking, sliding window clamping, graph capture gating) and ensures the decoder path can never drift from DecoderOnly_State.

Net: pipeline_config.cpp drops from 290 LOC to 220 LOC. All 79 tests pass (9 skipped, GPU-only).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
…ntermediates

Fix 2 remaining review findings:
1. FlowInterpreter is now fully stateless, with no mutable intermediates map. GetWiredInputs takes an externally-owned intermediates map as a parameter. This prevents clobbering when multiple States share a Model (finding microsoft#3).
2. NonDecoderSessionIO stores owned std::string copies instead of raw c_str() pointers from ExtraInput, preventing dangling pointer bugs if the ExtraInput vector is relocated (finding microsoft#4).

Intermediate tensor ownership is now fully in PipelineConfigState:
- intermediate_owned_: unique_ptr<OrtValue> map (memory ownership)
- intermediates_: raw OrtValue* map (fast lookup for wiring)

Both are per-State, not per-Model.

Tests updated to match stateless FlowInterpreter API. All 78 tests pass (9 skipped, GPU-only).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
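A stateless wiring resolver of the kind described in item 1 might look like the sketch below. Everything here is illustrative: the `Wire` field names are guesses at a plausible dataflow entry, `FakeTensor` stands in for `OrtValue` so the example compiles without ONNX Runtime, and the `"session.output"` key format is inferred from the review excerpts rather than confirmed.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical dataflow wire; the real schema's field names may differ.
struct Wire {
  std::string from_session;
  std::string from_output;
  std::string to_session;
  std::string to_input;
};

// Stand-in for OrtValue (assumption, keeps the sketch dependency-free).
struct FakeTensor {
  int id;
};

// Holds only immutable config, so one instance can be shared by many states.
class FlowInterpreter {
 public:
  explicit FlowInterpreter(std::vector<Wire> dataflow)
      : dataflow_(std::move(dataflow)) {}

  // Resolve inputs wired into `session` from a caller-owned map keyed by
  // "session.output". Wires whose source has not produced a value are skipped.
  std::map<std::string, FakeTensor*> GetWiredInputs(
      const std::string& session,
      const std::map<std::string, FakeTensor*>& intermediates) const {
    std::map<std::string, FakeTensor*> wired;
    for (const auto& wire : dataflow_) {
      if (wire.to_session != session) continue;
      auto it = intermediates.find(wire.from_session + "." + wire.from_output);
      if (it != intermediates.end()) wired[wire.to_input] = it->second;
    }
    return wired;
  }

 private:
  const std::vector<Wire> dataflow_;
};
```

Since `GetWiredInputs` is `const` and reads only its arguments plus the immutable wire list, two states sharing one interpreter cannot overwrite each other's intermediates.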
When multiple sessions produce outputs with the same tensor name,
GetOutput now returns the last-stored match (most-downstream session
in flow order) rather than the first match. This ensures that e.g.
GetOutput('hidden_states') returns the decoder's output, not the
encoder's, when both exist.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
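One way to get "last-stored wins" behavior is to update a per-tensor-name index at store time, so lookup never depends on map iteration order. The following is only a sketch under that assumption: `IntermediateStore` is a hypothetical class and values are plain `int`s rather than `OrtValue` pointers.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical store; values are ints here instead of OrtValue pointers.
class IntermediateStore {
 public:
  // Record an output under "session.tensor" and remember it as the latest
  // producer of that tensor name.
  void Store(const std::string& session, const std::string& tensor, int value) {
    by_key_[session + "." + tensor] = value;
    latest_by_tensor_[tensor] = value;
  }

  // Pointer to the most recently stored value with this tensor name, or
  // nullptr when no session has produced it.
  const int* GetOutput(const std::string& tensor) const {
    auto it = latest_by_tensor_.find(tensor);
    return it == latest_by_tensor_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<std::string, int> by_key_;
  std::unordered_map<std::string, int> latest_by_tensor_;
};
```

If an encoder and then a decoder both store `hidden_states`, `GetOutput("hidden_states")` returns the decoder's value because it was stored last.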
e749c21 to ab971aa
Pull request overview
This PR introduces a “Pipeline-as-Config” v2 architecture where model dispatch and multi-session orchestration are driven by a declarative pipeline config (with presets + inheritance), and legacy v1 configs are translated into the v2 representation for backward compatibility.
Changes:
- Add v2 pipeline schema + preset system (extends, sessions/flow/dataflow/state) with validation.
- Introduce v1→v2 translation and config loading logic to populate Config::pipeline_config for both v1 and v2.
- Add PipelineConfigModel + FlowInterpreter to execute multi-session pipelines and switch position-id behavior to be config-driven.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/pipeline_config_schema.h | Defines v2 PipelineConfig schema (sessions/flow/dataflow/state) and validation API. |
| src/pipeline_config_schema.cpp | Implements ValidatePipelineConfig for internal consistency checks. |
| src/pipeline_presets.h | Declares built-in presets and an overrides-merge API. |
| src/pipeline_presets.cpp | Implements presets (autoregressive-decoder, vision-language, encoder-decoder) and override application. |
| src/v1_translator.h | Declares v1→v2 translation API and documents mapping behavior. |
| src/v1_translator.cpp | Implements v1→v2 translation using ModelType categorization and preset wiring. |
| src/config.h | Adds Config::version and Config::pipeline_config fields. |
| src/config.cpp | Parses version and pipeline JSON; resolves presets/overrides; validates v2; translates v1 into v2. |
| src/models/flow_interpreter.h | Adds stateless flow partitioning and dataflow wiring resolution. |
| src/models/flow_interpreter.cpp | Implements flow partitioning and wired-input resolution from intermediates. |
| src/models/pipeline_config.h | Declares PipelineConfigModel/PipelineConfigState for config-driven execution. |
| src/models/pipeline_config.cpp | Implements multi-session orchestration and intermediate tensor management. |
| src/models/model.cpp | Dispatches v2 configs to PipelineConfigModel (v1 remains legacy string dispatch). |
| src/models/position_inputs.cpp | Switches Qwen 2.5 VL 3D position-id selection to config-driven strategy. |
| src/models/kv_cache.cpp | Updates past/present share-buffer enablement call-site to use the new signature. |
| src/generators.h | Changes IsPastPresentShareBufferEnabled signature to accept Config&. |
| src/generators.cpp | Implements new share-buffer gating using pipeline_config encoder presence. |
| test/pipeline_config_schema_test.cpp | Adds unit tests for presets, overrides, validation, and v1→v2 translation. |
| test/flow_interpreter_test.cpp | Adds unit tests for flow partitioning and dataflow wiring behavior. |
Comment on lines +205 to +219

    if (is_prompt_) {
      is_prompt_ = false;
      // Clear prompt-only intermediates from both maps
      const auto& prompt_sessions = model_.flow_interpreter_->prompt_only_sessions();
      for (auto it = intermediates_.begin(); it != intermediates_.end();) {
        auto dot = it->first.find('.');
        if (dot != std::string::npos &&
            prompt_sessions.count(it->first.substr(0, dot))) {
          intermediate_owned_.erase(it->first);
          it = intermediates_.erase(it);
        } else {
          ++it;
        }
      }
    }
Comment on lines +153 to +189

    void PipelineConfigState::RunFlowStep(
        const PipelineConfig::FlowStep& step,
        int total_length,
        DeviceSpan<int32_t>& next_tokens,
        DeviceSpan<int32_t> next_indices) {
      if (step.run == "decoder") {
        // Wire any intermediate inputs (e.g. inputs_embeds from embedding session)
        // into the decoder state's input bindings before running.
        auto wired = model_.flow_interpreter_->GetWiredInputs(
            "decoder", intermediates_);
        for (const auto& [input_name, value] : wired) {
          bool found = false;
          for (size_t i = 0; i < decoder_state_->input_names_.size(); ++i) {
            if (std::strcmp(decoder_state_->input_names_[i],
                            input_name.c_str()) == 0) {
              decoder_state_->inputs_[i] = value;
              found = true;
              break;
            }
          }
          if (!found) {
            wired_decoder_input_names_.push_back(input_name);
            decoder_state_->input_names_.push_back(
                wired_decoder_input_names_.back().c_str());
            decoder_state_->inputs_.push_back(value);
          }
        }

        // Delegate to DecoderOnly_State::Run which handles everything:
        // KV cache, position inputs, logits, sliding window, chunking,
        // graph capture, run options.
        last_logits_ = decoder_state_->Run(total_length, next_tokens, next_indices);
        return;
      }

      RunNonDecoderSession(step.run);
    }
Comment on lines +241 to +251

    // Search intermediates by tensor name (suffix after "session.").
    // If multiple sessions produce the same tensor name, the last-stored
    // one wins (which is the most-downstream session in flow order).
    OrtValue* result = nullptr;
    for (const auto& [key, value] : intermediates_) {
      auto dot = key.find('.');
      if (dot != std::string::npos && key.substr(dot + 1) == name) {
        result = value;
      }
    }
    if (result) return result;
Comment on lines +30 to +31

    // Throws std::runtime_error if the model_type is not supported for
    // translation (e.g. unknown or highly custom model types).
Comment on lines 283 to +290

       guidance_ff_tokens_enabled = enable_ff_tokens;
     }

    -bool GeneratorParams::IsPastPresentShareBufferEnabled(const std::string& model_type) const {
    +bool GeneratorParams::IsPastPresentShareBufferEnabled(const Config& config) const {
       // past_present_share_buffer is only actually enabled when:
       // 1. The config option is set to true, AND
    -  // 2. Either num_beams == 1 OR the model is Whisper
    +  // 2. Either num_beams == 1 OR the model is encoder-decoder (has encoder session)
    +  bool is_encoder_decoder = config.pipeline_config.sessions.count("encoder") > 0;
Comment on lines +13 to +21

    PipelineConfigModel::PipelineConfigModel(
        std::unique_ptr<Config> config, OrtEnv& ort_env)
        : DecoderOnly_Model{std::move(config), ort_env} {
      // DecoderOnly_Model already loaded session_decoder_ and registered it
      // with session_info_. Now load any additional sessions for multi-session
      // pipelines (vision, embedding, encoder, etc.).
      const auto& pipeline_config = config_->pipeline_config;

      for (const auto& [name, session_config] : pipeline_config.sessions) {
Comment on lines +163 to +178

    for (const auto& [input_name, value] : wired) {
      bool found = false;
      for (size_t i = 0; i < decoder_state_->input_names_.size(); ++i) {
        if (std::strcmp(decoder_state_->input_names_[i],
                        input_name.c_str()) == 0) {
          decoder_state_->inputs_[i] = value;
          found = true;
          break;
        }
      }
      if (!found) {
        wired_decoder_input_names_.push_back(input_name);
        decoder_state_->input_names_.push_back(
            wired_decoder_input_names_.back().c_str());
        decoder_state_->inputs_.push_back(value);
      }
Fix all 7 review findings from Copilot review:
1. Use-after-free in wired decoder inputs: save/restore decoder input_names_/inputs_ size around wiring so prompt intermediates can be safely freed without leaving dangling pointers.
2. FlowStep.loop ignored: add loop awareness in RunFlowStep with TODO for per_image implementation (currently runs once).
3. GetOutput map order vs flow order: add output_by_tensor_name_ map updated at store time so last-executed session wins.
4. v1 translator doc comment: document fallback to autoregressive-decoder preset for unknown model types (matches implementation).
5. IsPastPresentShareBufferEnabled: keep encoder-session check (matches original Whisper behavior) with documented assumption that encoder-decoder models use cache_indirection for beams.
6. v2 decoder filename sync: after preset resolution, sync pipeline_config.sessions["decoder"].file to model.decoder.filename so DecoderOnly_Model loads the correct session.
7. c_str() invalidation: change wired_decoder_input_names_ from vector to deque for stable string addresses on push_back.

Build: 82 tests passed, 18 skipped (GPU-only)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
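Item 7 relies on a standard-library guarantee: `std::deque::push_back` invalidates iterators but leaves references and pointers to existing elements valid, whereas `std::vector::push_back` may reallocate and move every string, leaving earlier `c_str()` pointers dangling. A small sketch of the pattern, using a hypothetical `WiredInputNames` holder rather than the PR's actual members:

```cpp
#include <deque>
#include <string>
#include <vector>

// Hypothetical holder: owns the names and exposes raw C-string views of them,
// as a C API that takes const char* const* would require.
struct WiredInputNames {
  std::deque<std::string> owned;     // push_back never relocates existing elements
  std::vector<const char*> c_names;  // non-owning views into `owned`

  void Add(const std::string& name) {
    owned.push_back(name);
    // Safe to keep: later push_back calls on the deque do not move this string.
    c_names.push_back(owned.back().c_str());
  }
};
```

The views stay valid only while elements are appended at the ends; erasing from the middle of the deque would still invalidate them.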
16 new test cases covering:
- Complex multi-session dataflow (5-session chain)
- Diamond dataflow topology (A→B, A→C, B→D, C→D)
- Standalone session with no dataflow wires
- Flow with only once/prompt/always steps
- Large flow (10 sessions in chain)
- Preset override with conflicting roles and state
- Dataflow replacement semantics
- KV cache pattern override propagation
- Duplicate session in flow (valid; runs multiple times)
- Empty loop string validation
- VLM KV cache pattern propagation
- Fara position strategy (mrope_3d)

Total: 44 pipeline tests, 96 overall tests pass.

Signed-off-by: Justin Chu <justinchu@microsoft.com>
Design updates:
- Rename when vocabulary: prompt→init, always→step, add final
- Backward-compat aliases auto-normalized via NormalizePipelineConfig()
- generation_loop as std::optional (autoregressive/single_pass/denoising)
- FlowInterpreter: init_steps_/step_steps_/final_steps_
- Throw clear error for unsupported 'when: final'
- Unrecognized when values produce clear error messages

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
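The alias normalization in the first two bullets could be sketched per field as below. `NormalizeWhen` is a hypothetical helper (the commit names only `NormalizePipelineConfig()`), and the handling of the earlier `once` value is deliberately left out because the commit does not say how it maps. In this sketch `final` passes normalization; the "unsupported" error the commit mentions would be raised later, at execution time.

```cpp
#include <stdexcept>
#include <string>

// Map legacy `when` spellings onto the renamed vocabulary (init/step/final).
// Hypothetical helper; unknown values are rejected with a descriptive error.
std::string NormalizeWhen(const std::string& when) {
  if (when == "init" || when == "prompt") return "init";
  if (when == "step" || when == "always") return "step";
  if (when == "final") return "final";
  throw std::invalid_argument(
      "unrecognized when value: '" + when + "' (expected init, step, or final)");
}
```

Normalizing once at load time means downstream code such as the flow partitioner only has to handle the new names.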
a8e46c5 to 52dee03
Summary
Introduces a Pipeline-as-Config architecture that enables new model support through JSON configuration instead of C++ source changes.
Motivation
Currently, every new model architecture requires adding a string to the model_type whitelist and potentially new C++ model/state classes. This creates a bottleneck: the ORT GenAI team must make source changes for every new HuggingFace model.
With Pipeline-as-Config, the runtime becomes a generic pipeline executor. New models are supported by generating ONNX graphs + a v2 pipeline config — zero C++ changes needed.
Design Principle
"Detect, don't declare" — the runtime infers model capabilities from config structure and ONNX session I/O, not from hardcoded model_type strings.
What This PR Adds
1. Config v2 Schema (pipeline_config_schema.h/cpp)
   - extends mechanism for preset inheritance
2. PipelineConfigModel (pipeline_config.h/cpp)
3. v1-to-v2 Translator (v1_translator.h/cpp)
4. Config-driven dispatch (model.cpp)
Config Examples
Minimal decoder-only LLM (4 lines):
{"version": 2, "pipeline": {"extends": "autoregressive-decoder", "sessions": {"decoder": {"file": "model.onnx"}}}, "tokens": {"eos": [2]}}
Testing
Competitive Advantage
This makes ORT GenAI the only inference runtime where adding a new model — including multimodal, multi-session models — requires zero code in any language. Just ONNX graphs + JSON config.