
[serving] Fix continuous batching JSON response serialization#45057

Merged
NathanHB merged 10 commits intohuggingface:mainfrom
NathanHB:fix-continuous-batching-json-response
Mar 31, 2026

Conversation

@NathanHB
Member

@NathanHB NathanHB commented Mar 27, 2026

Change model_dump_json() to model_dump() to avoid double JSON encoding. When using continuous batching with stream=false, the response was being double-encoded as a string instead of returning a proper JSON object.
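The double-encoding the fix addresses can be reproduced with a minimal pydantic sketch (the model class here is a hypothetical stand-in, not the actual transformers response type): `model_dump_json()` already returns a JSON string, so serializing its result again wraps the whole payload in quotes.

```python
# Minimal sketch of the bug: serializing the output of model_dump_json()
# a second time yields a JSON string, not a JSON object.
import json

from pydantic import BaseModel


class ChatCompletion(BaseModel):  # hypothetical stand-in response model
    id: str
    content: str


resp = ChatCompletion(id="cmpl-1", content="hello")

# Bug: model_dump_json() is already a JSON string; encoding it again
# double-encodes, so clients receive a quoted string instead of an object.
double_encoded = json.dumps(resp.model_dump_json())

# Fix: model_dump() returns a plain dict, which encodes to a proper object.
correct = json.dumps(resp.model_dump())

# json.loads(double_encoded) is a str; json.loads(correct) is a dict.
```

Clients parsing the buggy response would see a string and have to call `json.loads` twice, which is what broke non-streaming continuous-batching responses.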

Added a UV script to run GPQA using `transformers serve`:

hf jobs uv run examples/pytorch/transformers_serve_cb_eval_job.py \
--model {model} \
--flavor l4x1 \
--timeout 1h \
--secrets HF_TOKEN \
-e TRANSFORMERS_SERVE_API_KEY="1234"
| Model | Job ID | Issue | Output Directory |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 69c695adf900226fc14ae1ab | `AttributeError: 'NoneType' object has no attribute 'generated_tokens'` - Server starts but all requests fail | hf/SaylorTwift/llama-3.1-8b-instruct-cb |
| Qwen/Qwen2.5-0.5B-Instruct | 69c696bcbf20ec90acee2ffa | ✅ Working - Evaluation running successfully | hf/SaylorTwift/qwen2.5-0.5b-instruct-cb |

Change model_dump_json() to model_dump() to avoid double JSON encoding.
When using continuous batching with stream=false, the response was being
double-encoded as a string instead of returning a proper JSON object.
Member

@LysandreJik LysandreJik left a comment


Thank you! Can you add a test so that this gets caught?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Test verifies that non-streaming responses with continuous batching
return proper JSON objects rather than double-encoded JSON strings.
This is a regression test for the fix where model_dump_json() was
changed to model_dump() in the continuous batching response handler.
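The shape of such a regression test can be sketched as follows; `build_response` is an illustrative stand-in for the continuous-batching response handler, not the actual transformers function.

```python
# Sketch of the regression test idea: a non-streaming response body must
# parse to a JSON object (dict), not to a double-encoded JSON string.
import json

from pydantic import BaseModel


class Completion(BaseModel):  # hypothetical stand-in response model
    id: str
    text: str


def build_response(completion: Completion) -> str:
    # Post-fix behavior: dump to a dict, then encode exactly once.
    return json.dumps(completion.model_dump())


body = build_response(Completion(id="cmpl-1", text="ok"))
parsed = json.loads(body)
assert isinstance(parsed, dict), "response must be a JSON object, not a string"
assert parsed["text"] == "ok"
```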
@NathanHB
Member Author

@bot /style

Changed dependency from personal fork to official huggingface/transformers@main
for production use of the evaluation script.
Collaborator

@ArthurZucker ArthurZucker left a comment


Nice!!! fyi @remi-or

@NathanHB NathanHB force-pushed the fix-continuous-batching-json-response branch from ec55108 to 5fddeb2 Compare March 30, 2026 14:36
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: clap, deit, depth_anything, dpt, ijepa, llama4, marian, mimi

NathanHB and others added 3 commits March 30, 2026 16:37
- Add --cb-block-size, --cb-num-blocks, --cb-max-batch-tokens, --cb-max-memory-percent, and --cb-use-cuda-graph flags
- Flags allow users to customize KV cache and performance settings for continuous batching
- Update transformers_serve_cb_eval_job.py to support and pass through CB config arguments
- Update transformers dependency to use NathanHB/transformers@fix-continuous-batching-json-response branch
- All arguments use auto-inference defaults when not specified (backward compatible)
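How flags like these could be wired is sketched below with `argparse`; this is an illustration of the "None means auto-infer" default behavior described above, not the actual `transformers serve` CLI code.

```python
# Illustrative sketch of the continuous-batching flags: each defaults to
# None so the server can auto-infer a value when the flag is omitted.
import argparse

parser = argparse.ArgumentParser(prog="transformers-serve-sketch")
parser.add_argument("--cb-block-size", type=int, default=None,
                    help="KV cache block size (auto-inferred if omitted)")
parser.add_argument("--cb-num-blocks", type=int, default=None,
                    help="number of KV cache blocks (auto-inferred if omitted)")
parser.add_argument("--cb-max-batch-tokens", type=int, default=None,
                    help="maximum tokens per continuous batch")
parser.add_argument("--cb-max-memory-percent", type=float, default=None,
                    help="fraction of GPU memory the KV cache may use")
parser.add_argument("--cb-use-cuda-graph", action="store_true",
                    help="capture decoding steps in a CUDA graph")

# Only explicitly passed flags override the auto-inference defaults.
args = parser.parse_args(["--cb-block-size", "128",
                          "--cb-max-memory-percent", "0.9"])
```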
Member

@LysandreJik LysandreJik left a comment


ok good!

@NathanHB NathanHB enabled auto-merge March 31, 2026 12:48
@NathanHB NathanHB added this pull request to the merge queue Mar 31, 2026
Merged via the queue into huggingface:main with commit a91232a Mar 31, 2026
18 checks passed
@NathanHB NathanHB deleted the fix-continuous-batching-json-response branch March 31, 2026 13:04
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Mar 31, 2026
…gface#45057)

* Fix continuous batching JSON response serialization

Change model_dump_json() to model_dump() to avoid double JSON encoding.
When using continuous batching with stream=false, the response was being
double-encoded as a string instead of returning a proper JSON object.

* add example script eval-job

* fix script

* Add test for continuous batching non-streaming JSON response

Test verifies that non-streaming responses with continuous batching
return proper JSON objects rather than double-encoded JSON strings.
This is a regression test for the fix where model_dump_json() was
changed to model_dump() in the continuous batching response handler.

* fix ci

* Update eval script to use official transformers repo main branch

Changed dependency from personal fork to official huggingface/transformers@main
for production use of the evaluation script.

* add kernels and flash attn 2

* Add continuous batching configuration CLI arguments to serve command

- Add --cb-block-size, --cb-num-blocks, --cb-max-batch-tokens, --cb-max-memory-percent, and --cb-use-cuda-graph flags
- Flags allow users to customize KV cache and performance settings for continuous batching
- Update transformers_serve_cb_eval_job.py to support and pass through CB config arguments
- Update transformers dependency to use NathanHB/transformers@fix-continuous-batching-json-response branch
- All arguments use auto-inference defaults when not specified (backward compatible)

* Add thread lock for manager creation to avoid double manager

* change transformers dep

---------

Co-authored-by: remi-or <remi.pierre_o@orange.fr>
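The "thread lock for manager creation" commit in the list above can be sketched as double-checked locking; the class and names here are illustrative, not the transformers API.

```python
# Sketch: lazily create a single batch-manager instance under a lock so
# two concurrent requests cannot each construct their own manager.
import threading


class Server:
    def __init__(self):
        self._manager = None
        self._manager_lock = threading.Lock()

    def get_manager(self):
        # Double-checked locking: cheap check without the lock, then
        # re-check under the lock so only one thread ever constructs it.
        if self._manager is None:
            with self._manager_lock:
                if self._manager is None:
                    self._manager = object()  # stand-in for the CB manager
        return self._manager


server = Server()
results = []
threads = [threading.Thread(target=lambda: results.append(server.get_manager()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All threads observe the same single manager instance.
```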
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Apr 4, 2026
…gface#45057)

(same commit list as the push above)
5 participants