feat(vllm_batch): add reasoning_parser support for batch inference#804
Open
Conversation
- Add `reasoning_parser` field to `VLLMModelConfig`
- Pass it to `structured_outputs_config` in `init_engine` (async path)
- Filter it from `AsyncEngineArgs` (it's a serving-layer arg, not an engine arg)
- `init_vllm` subprocess path already handles it via CLI arg conversion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`tool_call_parser`, `enable_auto_tool_choice`, `chat_template`, and `reasoning_parser` are app-state args, not `AsyncEngineArgs` params. Pass them via `app_state_args` only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- Adds a `reasoning_parser` field to `VLLMModelConfig` so batch jobs can configure reasoning parsers (e.g. `nemotron_v3`, `deepseek_r1`)
- Passes it to `structured_outputs_config` in `init_engine` (async path); previously this was hardcoded to `None`
- Filters it from `AsyncEngineArgs` (it's a serving-layer arg, not an engine arg)
- The `init_vllm` subprocess path already handles it correctly via CLI arg conversion

Motivation
Needed to deploy NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 to batch inference with `reasoning_parser=nemotron_v3`, `tool_call_parser=qwen3_coder`, and `enable_auto_tool_choice=True`.

🤖 Generated with Claude Code
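The arg-splitting described in the summary above can be sketched as follows. This is a simplified, hypothetical stand-in: the `VLLMModelConfig` here is a plain dataclass rather than the real DTO, and `split_engine_args` / `build_structured_outputs_config` are illustrative helper names, not functions from the codebase.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the PR's DTO; field names mirror the PR
# description, but this is not the real VLLMModelConfig class.
@dataclass
class VLLMModelConfig:
    model: str
    reasoning_parser: Optional[str] = None
    tool_call_parser: Optional[str] = None
    enable_auto_tool_choice: Optional[bool] = None
    chat_template: Optional[str] = None


# Serving-layer args that AsyncEngineArgs does not accept; passing any of
# them through **engine_args_dict would raise a TypeError at construction.
_serving_only_keys = {
    "reasoning_parser",
    "tool_call_parser",
    "enable_auto_tool_choice",
    "chat_template",
}


def split_engine_args(config: VLLMModelConfig) -> tuple[dict, dict]:
    """Split the parsed config into engine args and serving-only args,
    mimicking model_dump(exclude_none=True) plus the serving-key filter."""
    all_args = {k: v for k, v in vars(config).items() if v is not None}
    engine_args = {k: v for k, v in all_args.items() if k not in _serving_only_keys}
    serving_args = {k: v for k, v in all_args.items() if k in _serving_only_keys}
    return engine_args, serving_args


def build_structured_outputs_config(serving_args: dict) -> Optional[dict]:
    """Forward reasoning_parser to the async engine path's structured-outputs
    config instead of the previously hardcoded None."""
    parser = serving_args.get("reasoning_parser")
    return {"reasoning_parser": parser} if parser is not None else None
```

With the motivating config, `split_engine_args` keeps only `model` for the engine and routes the three serving-layer values (including `reasoning_parser=nemotron_v3`) to the serving side.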
Greptile Summary
This PR adds `reasoning_parser` support for vLLM batch inference by introducing the field in `VLLMModelConfig` and wiring it through the async `init_engine` path's `structured_outputs_config`. It also correctly expands `_serving_only_keys` to filter `tool_call_parser`, `enable_auto_tool_choice`, and `chat_template` from `AsyncEngineArgs`, fixing a latent `TypeError` for any batch job using those fields.

Key changes:

- `VLLMModelConfig` gains a new `reasoning_parser: Optional[str]` field alongside the existing serving-layer fields.
- `_serving_only_keys` now covers all four serving-layer args so they are safely excluded when constructing `AsyncEngineArgs`.
- `structured_outputs_config` is populated with `parsed_configs.reasoning_parser` instead of the previously hardcoded `None`.

Issue found:
`tool_call_parser` and `enable_auto_tool_choice` are correctly removed from `engine_args_dict`, but their user-configured values are never forwarded to `app_state_args`. `default_app_state_args` hardcodes them as `None`/`False`, and since they no longer arrive via the `**engine_args_dict` merge, any batch job configured with `tool_call_parser=qwen3_coder` or `enable_auto_tool_choice=True` will silently use the defaults: the exact use case motivating this PR.

Confidence Score: 4/5
Not safe to merge as-is: the primary use case (`tool_call_parser` + `enable_auto_tool_choice` for the Nemotron deployment) is silently broken.
One P1 bug:
`tool_call_parser` and `enable_auto_tool_choice` are filtered from `engine_args_dict` but never explicitly propagated to `app_state_args`, so user-configured values are silently ignored. This directly undermines the stated motivation of the PR. The fix is a two-line change in `default_app_state_args`. The `reasoning_parser` wiring and the DTO addition are correct.

Affected file: `model-engine/model_engine_server/inference/vllm/vllm_batch.py`, specifically `default_app_state_args`, where `tool_call_parser` and `enable_auto_tool_choice` need to reference `parsed_configs` instead of being hardcoded.

Important Files Changed
- Adds the `reasoning_parser: Optional[str]` field to `VLLMModelConfig`; straightforward, well-placed alongside the other serving-layer fields (`tool_call_parser`, `enable_auto_tool_choice`).
- Expands `_serving_only_keys` to filter serving-layer args from `AsyncEngineArgs` and correctly passes `reasoning_parser` through `structured_outputs_config`, but `tool_call_parser` and `enable_auto_tool_choice` from `parsed_configs` are silently dropped: they're filtered out of `engine_args_dict` but hardcoded as `None`/`False` in `default_app_state_args` instead of being sourced from `parsed_configs`.

Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[VLLMModelConfig\nreasoning_parser\ntool_call_parser\nenable_auto_tool_choice\nchat_template] --> B[parsed_configs.model_dump\nexclude_none=True]
    B --> C{k in _serving_only_keys?}
    C -- No --> D[engine_args_dict]
    C -- Yes --> E[Filtered out]
    D --> F[AsyncEngineArgs **engine_args_dict]
    D --> G[app_state_args merge\n**default_app_state_args\n**engine_args_dict]
    E --> H[reasoning_parser\n→ structured_outputs_config ✅]
    E --> I[chat_template\n→ default_app_state_args ✅]
    E --> J[tool_call_parser\n→ hardcoded None ❌]
    E --> K[enable_auto_tool_choice\n→ hardcoded False ❌]
    H --> G
    I --> G
    G --> L[init_app_state]
```
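The two-line fix the review calls for could look like the sketch below. The signature is hypothetical: the real `default_app_state_args` lives in `vllm_batch.py`, and the parsed `VLLMModelConfig` is approximated here by any object carrying the serving-layer attributes.

```python
from types import SimpleNamespace


# Sketch of the suggested fix (hypothetical signature, not the actual file
# contents): source the serving-layer values from parsed_configs instead of
# hardcoding them.
def default_app_state_args(parsed_configs) -> dict:
    return {
        # Previously hardcoded to None / False, which silently dropped the
        # user-configured values once they stopped arriving via the
        # **engine_args_dict merge.
        "tool_call_parser": getattr(parsed_configs, "tool_call_parser", None),
        "enable_auto_tool_choice": getattr(parsed_configs, "enable_auto_tool_choice", None) or False,
        "chat_template": getattr(parsed_configs, "chat_template", None),
    }


# Example: the Nemotron batch-job config from the PR's motivation now
# survives into the app state instead of falling back to the defaults.
cfg = SimpleNamespace(tool_call_parser="qwen3_coder", enable_auto_tool_choice=True)
app_state = default_app_state_args(cfg)
```

A config with neither field set still yields the old defaults (`None`/`False`), so existing batch jobs are unaffected.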
Reviews (2): Last reviewed commit: "fix(vllm_batch): filter serving-only arg..."