
[serving] Fix continuous batching JSON response serialization#45057

Merged
NathanHB merged 10 commits intohuggingface:mainfrom
NathanHB:fix-continuous-batching-json-response
Mar 31, 2026

Conversation

@NathanHB
Member

@NathanHB NathanHB commented Mar 27, 2026

Change model_dump_json() to model_dump() to avoid double JSON encoding. When using continuous batching with stream=false, the response was being double-encoded as a string instead of returning a proper JSON object.
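The double-encoding the fix addresses can be reproduced with a minimal pydantic sketch (the model class here is a hypothetical stand-in, not the actual transformers response type): `model_dump_json()` already returns a JSON string, so serializing its result again wraps the whole payload in quotes.

```python
# Minimal sketch of the bug: serializing the output of model_dump_json()
# a second time yields a JSON string, not a JSON object.
import json

from pydantic import BaseModel


class ChatCompletion(BaseModel):  # hypothetical stand-in response model
    id: str
    content: str


resp = ChatCompletion(id="cmpl-1", content="hello")

# Bug: model_dump_json() is already a JSON string; encoding it again
# double-encodes, so clients receive a quoted string instead of an object.
double_encoded = json.dumps(resp.model_dump_json())

# Fix: model_dump() returns a plain dict, which encodes to a proper object.
correct = json.dumps(resp.model_dump())

# json.loads(double_encoded) is a str; json.loads(correct) is a dict.
```

Clients parsing the buggy response would see a string and have to call `json.loads` twice, which is what broke non-streaming continuous-batching responses.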

Added a UV script to run GPQA using `transformers serve`:

hf jobs uv run examples/pytorch/transformers_serve_cb_eval_job.py \
--model {model} \
--flavor l4x1 \
--timeout 1h \
--secrets HF_TOKEN \
-e TRANSFORMERS_SERVE_API_KEY="1234"
| Model | Job ID | Issue | Output Directory |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 69c695adf900226fc14ae1ab | `AttributeError: 'NoneType' object has no attribute 'generated_tokens'` - Server starts but all requests fail | hf/SaylorTwift/llama-3.1-8b-instruct-cb |
| Qwen/Qwen2.5-0.5B-Instruct | 69c696bcbf20ec90acee2ffa | ✅ Working - Evaluation running successfully | hf/SaylorTwift/qwen2.5-0.5b-instruct-cb |

Change model_dump_json() to model_dump() to avoid double JSON encoding.
When using continuous batching with stream=false, the response was being
double-encoded as a string instead of returning a proper JSON object.
Member

@LysandreJik LysandreJik left a comment


Thank you! Can you add a test so that this gets caught?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Test verifies that non-streaming responses with continuous batching
return proper JSON objects rather than double-encoded JSON strings.
This is a regression test for the fix where model_dump_json() was
changed to model_dump() in the continuous batching response handler.
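The shape of such a regression test can be sketched as follows; `build_response` is an illustrative stand-in for the continuous-batching response handler, not the actual transformers function.

```python
# Sketch of the regression test idea: a non-streaming response body must
# parse to a JSON object (dict), not to a double-encoded JSON string.
import json

from pydantic import BaseModel


class Completion(BaseModel):  # hypothetical stand-in response model
    id: str
    text: str


def build_response(completion: Completion) -> str:
    # Post-fix behavior: dump to a dict, then encode exactly once.
    return json.dumps(completion.model_dump())


body = build_response(Completion(id="cmpl-1", text="ok"))
parsed = json.loads(body)
assert isinstance(parsed, dict), "response must be a JSON object, not a string"
assert parsed["text"] == "ok"
```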
@NathanHB
Member Author

@bot /style

Changed dependency from personal fork to official huggingface/transformers@main
for production use of the evaluation script.
Collaborator

@ArthurZucker ArthurZucker left a comment


Nice!!! fyi @remi-or

@NathanHB NathanHB force-pushed the fix-continuous-batching-json-response branch from ec55108 to 5fddeb2 Compare March 30, 2026 14:36
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: clap, deit, depth_anything, dpt, ijepa, llama4, marian, mimi

NathanHB and others added 3 commits March 30, 2026 16:37
- Add --cb-block-size, --cb-num-blocks, --cb-max-batch-tokens, --cb-max-memory-percent, and --cb-use-cuda-graph flags
- Flags allow users to customize KV cache and performance settings for continuous batching
- Update transformers_serve_cb_eval_job.py to support and pass through CB config arguments
- Update transformers dependency to use NathanHB/transformers@fix-continuous-batching-json-response branch
- All arguments use auto-inference defaults when not specified (backward compatible)
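How flags like these could be wired is sketched below with `argparse`; this is an illustration of the "None means auto-infer" default behavior described above, not the actual `transformers serve` CLI code.

```python
# Illustrative sketch of the continuous-batching flags: each defaults to
# None so the server can auto-infer a value when the flag is omitted.
import argparse

parser = argparse.ArgumentParser(prog="transformers-serve-sketch")
parser.add_argument("--cb-block-size", type=int, default=None,
                    help="KV cache block size (auto-inferred if omitted)")
parser.add_argument("--cb-num-blocks", type=int, default=None,
                    help="number of KV cache blocks (auto-inferred if omitted)")
parser.add_argument("--cb-max-batch-tokens", type=int, default=None,
                    help="maximum tokens per continuous batch")
parser.add_argument("--cb-max-memory-percent", type=float, default=None,
                    help="fraction of GPU memory the KV cache may use")
parser.add_argument("--cb-use-cuda-graph", action="store_true",
                    help="capture decoding steps in a CUDA graph")

# Only explicitly passed flags override the auto-inference defaults.
args = parser.parse_args(["--cb-block-size", "128",
                          "--cb-max-memory-percent", "0.9"])
```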
Member

@LysandreJik LysandreJik left a comment


ok good!

@NathanHB NathanHB enabled auto-merge March 31, 2026 12:48
@NathanHB NathanHB added this pull request to the merge queue Mar 31, 2026
Merged via the queue into huggingface:main with commit a91232a Mar 31, 2026
18 checks passed
@NathanHB NathanHB deleted the fix-continuous-batching-json-response branch March 31, 2026 13:04
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Mar 31, 2026
…gface#45057)

* Fix continuous batching JSON response serialization

Change model_dump_json() to model_dump() to avoid double JSON encoding.
When using continuous batching with stream=false, the response was being
double-encoded as a string instead of returning a proper JSON object.

* add example script eval-job

* fix script

* Add test for continuous batching non-streaming JSON response

Test verifies that non-streaming responses with continuous batching
return proper JSON objects rather than double-encoded JSON strings.
This is a regression test for the fix where model_dump_json() was
changed to model_dump() in the continuous batching response handler.

* fix ci

* Update eval script to use official transformers repo main branch

Changed dependency from personal fork to official huggingface/transformers@main
for production use of the evaluation script.

* add kernels and flash attn 2

* Add continuous batching configuration CLI arguments to serve command

- Add --cb-block-size, --cb-num-blocks, --cb-max-batch-tokens, --cb-max-memory-percent, and --cb-use-cuda-graph flags
- Flags allow users to customize KV cache and performance settings for continuous batching
- Update transformers_serve_cb_eval_job.py to support and pass through CB config arguments
- Update transformers dependency to use NathanHB/transformers@fix-continuous-batching-json-response branch
- All arguments use auto-inference defaults when not specified (backward compatible)

* Add thread lock for manager creation to avoid double manager

* change transformers dep

---------

Co-authored-by: remi-or <remi.pierre_o@orange.fr>
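The "thread lock for manager creation" commit in the list above can be sketched as double-checked locking; the class and names here are illustrative, not the transformers API.

```python
# Sketch: lazily create a single batch-manager instance under a lock so
# two concurrent requests cannot each construct their own manager.
import threading


class Server:
    def __init__(self):
        self._manager = None
        self._manager_lock = threading.Lock()

    def get_manager(self):
        # Double-checked locking: cheap check without the lock, then
        # re-check under the lock so only one thread ever constructs it.
        if self._manager is None:
            with self._manager_lock:
                if self._manager is None:
                    self._manager = object()  # stand-in for the CB manager
        return self._manager


server = Server()
results = []
threads = [threading.Thread(target=lambda: results.append(server.get_manager()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All threads observe the same single manager instance.
```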
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Apr 4, 2026
…gface#45057)

(same commit list as the push above)
5 participants