
feat(vllm_batch): add reasoning_parser support for batch inference #804

Open
lilyz-ai wants to merge 2 commits into main from lilyzhu/nemotron-reasoning-parser

Conversation


@lilyz-ai lilyz-ai commented Apr 3, 2026

Summary

  • Adds reasoning_parser field to VLLMModelConfig so batch jobs can configure reasoning parsers (e.g. nemotron_v3, deepseek_r1)
  • Passes it to structured_outputs_config in init_engine (async path) — previously hardcoded to None
  • Filters it from AsyncEngineArgs (it's a serving-layer arg, not an engine arg)
  • init_vllm subprocess path already handles it correctly via CLI arg conversion

Motivation

Needed to deploy NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 to batch inference with reasoning_parser=nemotron_v3, tool_call_parser=qwen3_coder, and enable_auto_tool_choice=True.

🤖 Generated with Claude Code

Greptile Summary

This PR adds reasoning_parser support for vLLM batch inference by introducing the field in VLLMModelConfig and wiring it through the async init_engine path's structured_outputs_config. It also correctly expands _serving_only_keys to filter tool_call_parser, enable_auto_tool_choice, and chat_template from AsyncEngineArgs — fixing a latent TypeError for any batch job using those fields.

Key changes:

  • VLLMModelConfig gains a new reasoning_parser: Optional[str] field alongside the existing serving-layer fields.
  • _serving_only_keys now covers all four serving-layer args so they are safely excluded when constructing AsyncEngineArgs.
  • structured_outputs_config is populated with parsed_configs.reasoning_parser instead of the previously hardcoded None.
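The filtering described above can be sketched as follows. This is a minimal illustration, not the actual `vllm_batch.py`: the names `parsed_configs`, `_serving_only_keys`, `engine_args_dict`, and the field names come from the PR description, while the dataclass stand-in for the pydantic `VLLMModelConfig` and the example values are assumptions for demonstration.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for the pydantic VLLMModelConfig (field names per the PR).
@dataclass
class VLLMModelConfig:
    model: str
    reasoning_parser: Optional[str] = None
    tool_call_parser: Optional[str] = None
    enable_auto_tool_choice: bool = False
    chat_template: Optional[str] = None

    def model_dump(self, exclude_none: bool = False) -> dict:
        d = vars(self).copy()
        if exclude_none:
            d = {k: v for k, v in d.items() if v is not None}
        return d

# Serving-layer args that AsyncEngineArgs does not accept.
_serving_only_keys = {"reasoning_parser", "tool_call_parser",
                      "enable_auto_tool_choice", "chat_template"}

parsed_configs = VLLMModelConfig(
    model="nemotron",
    reasoning_parser="nemotron_v3",
    tool_call_parser="qwen3_coder",
    enable_auto_tool_choice=True,
)

# Keep only true engine args; the serving-layer keys would raise a TypeError
# if passed through AsyncEngineArgs(**engine_args_dict).
engine_args_dict = {
    k: v for k, v in parsed_configs.model_dump(exclude_none=True).items()
    if k not in _serving_only_keys
}

# reasoning_parser is wired separately into structured_outputs_config,
# replacing the previously hardcoded None.
structured_outputs_config = {"reasoning_parser": parsed_configs.reasoning_parser}
```

After filtering, `engine_args_dict` contains only engine-level keys (here just `model`), while `reasoning_parser` travels via `structured_outputs_config`.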

Issue found:

  • tool_call_parser and enable_auto_tool_choice are correctly removed from engine_args_dict but their user-configured values are never forwarded to app_state_args. default_app_state_args hardcodes them as None/False, and since they no longer arrive via the **engine_args_dict merge, any batch job configured with tool_call_parser=qwen3_coder or enable_auto_tool_choice=True will silently use the defaults — the exact use case motivating this PR.
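The silent fallback follows from Python's dict-merge semantics: a later `**` expansion only overrides keys it actually contains, so once the keys are filtered out of `engine_args_dict`, the hardcoded defaults win. A minimal sketch of the behavior and the proposed fix (variable names mirror the review; `parsed_configs` values are simulated with plain variables):

```python
# User configured these in VLLMModelConfig, but they were filtered out of
# engine_args_dict so that AsyncEngineArgs(**engine_args_dict) doesn't raise.
engine_args_dict = {"model": "nemotron"}  # tool_call_parser et al. already removed

# Hardcoded defaults in default_app_state_args (per the review comment).
default_app_state_args = {"tool_call_parser": None, "enable_auto_tool_choice": False}

# A later **expansion only overrides keys it actually contains. The filtered
# keys are absent from engine_args_dict, so the defaults silently win.
broken_app_state_args = {**default_app_state_args, **engine_args_dict}

# The proposed fix: source the values from parsed_configs when building
# default_app_state_args (simulated here with plain variables).
parsed_tool_call_parser = "qwen3_coder"  # stands in for parsed_configs.tool_call_parser
parsed_enable_auto_tool_choice = True    # stands in for parsed_configs.enable_auto_tool_choice
fixed_defaults = {
    "tool_call_parser": parsed_tool_call_parser,
    "enable_auto_tool_choice": parsed_enable_auto_tool_choice or False,
}
fixed_app_state_args = {**fixed_defaults, **engine_args_dict}
```

With the fix, the user-configured values survive the merge instead of reverting to `None`/`False`.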

Confidence Score: 4/5

Not safe to merge as-is — the primary use case (tool_call_parser + enable_auto_tool_choice for the Nemotron deployment) is silently broken.

One P1 bug: tool_call_parser and enable_auto_tool_choice are filtered from engine_args_dict but never explicitly propagated to app_state_args, so user-configured values are silently ignored. This directly undermines the stated motivation of the PR. The fix is a two-line change in default_app_state_args. The reasoning_parser wiring and the DTO addition are correct.

model-engine/model_engine_server/inference/vllm/vllm_batch.py — specifically default_app_state_args where tool_call_parser and enable_auto_tool_choice need to reference parsed_configs instead of being hardcoded.

Important Files Changed

| Filename | Overview |
|----------|----------|
| model-engine/model_engine_server/common/dtos/llms/vllm.py | Adds `reasoning_parser: Optional[str]` field to VLLMModelConfig — straightforward, well-placed alongside the other serving-layer fields (`tool_call_parser`, `enable_auto_tool_choice`). |
| model-engine/model_engine_server/inference/vllm/vllm_batch.py | Extends `_serving_only_keys` to filter serving-layer args from AsyncEngineArgs and correctly passes `reasoning_parser` through `structured_outputs_config`, but `tool_call_parser` and `enable_auto_tool_choice` from `parsed_configs` are silently dropped — they're filtered out of `engine_args_dict` but hardcoded as None/False in `default_app_state_args` instead of being sourced from `parsed_configs`. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[VLLMModelConfig\nreasoning_parser\ntool_call_parser\nenable_auto_tool_choice\nchat_template] --> B[parsed_configs.model_dump\nexclude_none=True]
    B --> C{k in _serving_only_keys?}
    C -- No --> D[engine_args_dict]
    C -- Yes --> E[Filtered out]
    D --> F[AsyncEngineArgs **engine_args_dict]
    D --> G[app_state_args merge\n**default_app_state_args\n**engine_args_dict]
    E --> H[reasoning_parser\n→ structured_outputs_config ✅]
    E --> I[chat_template\n→ default_app_state_args ✅]
    E --> J[tool_call_parser\n→ hardcoded None ❌]
    E --> K[enable_auto_tool_choice\n→ hardcoded False ❌]
    H --> G
    I --> G
    G --> L[init_app_state]
```
Review Comments
This is a comment left during a code review.
Path: model-engine/model_engine_server/inference/vllm/vllm_batch.py
Line: 327-329

Comment:
**`tool_call_parser` and `enable_auto_tool_choice` silently ignored**

`tool_call_parser` and `enable_auto_tool_choice` are now excluded from `engine_args_dict` (correctly, to prevent `TypeError` in `AsyncEngineArgs`), but they are also **not** propagated to `app_state_args`. Since `engine_args_dict` no longer carries them into the `**engine_args_dict` merge on line 351, the final `app_state_args` will always use these hardcoded defaults — ignoring whatever the user configured in `VLLMModelConfig`.

This contradicts the PR's stated motivation of deploying with `tool_call_parser=qwen3_coder` and `enable_auto_tool_choice=True`.

Compare how `chat_template` (line 343) and `reasoning_parser` (lines 330–332) are handled — both are sourced directly from `parsed_configs` inside `default_app_state_args`. The same pattern needs to be applied here:

```python
        enable_auto_tool_choice=parsed_configs.enable_auto_tool_choice or False,
        tool_call_parser=parsed_configs.tool_call_parser,
```


---

Path: model-engine/model_engine_server/inference/vllm/vllm_batch.py
Line: 293

Comment:
**Add a comment explaining why these keys are excluded**

The `_serving_only_keys` set is non-obvious — it's not immediately clear why these specific keys must be excluded from `engine_args_dict`. A short inline comment would make the intent clear for future contributors who add new serving-layer fields to `VLLMModelConfig`.

```suggestion
    # These are vLLM serving-layer args that are not accepted by AsyncEngineArgs.
    # They are handled separately: chat_template and tool_call_parser/enable_auto_tool_choice
    # go into default_app_state_args directly, and reasoning_parser goes into structured_outputs_config.
    _serving_only_keys = {"reasoning_parser", "tool_call_parser", "enable_auto_tool_choice", "chat_template"}
```

**Rule Used:** Add comments to explain complex or non-obvious log... ([source](https://app.greptile.com/review/custom-context?memory=928586f9-9432-435e-a385-026fa49318a2))

**Learnt From**
[scaleapi/scaleapi#126958](https://github.com/scaleapi/scaleapi/pull/126958)


Reviews (2): Last reviewed commit: "fix(vllm_batch): filter serving-only arg..."

Commits

- Add `reasoning_parser` field to VLLMModelConfig
- Pass it to `structured_outputs_config` in init_engine (async path)
- Filter it from AsyncEngineArgs (it's a serving-layer arg, not engine arg)
- init_vllm subprocess path already handles it via CLI arg conversion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tool_call_parser, enable_auto_tool_choice, chat_template, and
reasoning_parser are app-state args, not AsyncEngineArgs params.
Pass them via app_state_args only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>