feat(client): add dynamo_chat transport + routed_experts to renderer generate#79
Open
biswapanda wants to merge 26 commits into
Open
feat(client): add dynamo_chat transport + routed_experts to renderer generate#79biswapanda wants to merge 26 commits into
biswapanda wants to merge 26 commits into
Conversation
1 task
…ols from dynamo body, raise on missing ids; rename transport to dynamo_chat
1 task
…d, drop routed_experts on dynamo (codex round 2)
…ake); docstring fix
…erge nvext, canonical completion-ids, logprobs alignment)
…payload to contract
…first-turn stays full)
…trim is now a back-compat fallback
3 tasks
…oid event-loop json.loads)
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit fec0a81. Configure here.
AmeenP
reviewed
Jun 18, 2026
AmeenP
reviewed
Jun 18, 2026
AmeenP
reviewed
Jun 18, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Description
Adds a
dynamo_chattransport to the renderer-basedgenerate()client so it can run against NVIDIA Dynamo, which serves no/inference/v1/generateroute. Selected per-call viatransport=; defaults to the existing vLLM path, so behavior is unchanged unless opted in.Two transports:
vllm_generate(default): unchanged —messages → render_ids() → POST /inference/v1/generate → parse_response()(vLLM TITO surface).dynamo_chat:messages → render_ids() → POST /v1/chat/completionswithnvext.token_data(pre-tokenized prompt) +nvext.extra_fields=["engine_data"]. Completion token IDs and logprobs are read back fromnvext.engine_data.Dynamo wire shape (
_post_dynamo_chat)Mirrors the verifiers token client so the payload is identical whether a rollout goes through the token client or the renderer client.
nvext.token_data(Dynamo skips tokenization when present);cache_salt→nvext.cache_salt,priority→nvext.agent_hints.priority; a single placeholder user message; sampling remap (max_tokens→max_completion_tokens,logprobs=N→logprobs=true+top_logprobs=N); passthrough fields ride the Dynamo allowlist. Tools are baked intotoken_databy the renderer (not sent on the wire).routed_experts (MoE expert replay) — now surfaced on dynamo_chat
(Supersedes the earlier "routed_experts intentionally NOT surfaced" note — it now is.)
parsereads routed_experts fromnvext.routed_experts(ornvext.engine_data.routed_experts) and maps it to the downstreamRoutedExpertsPayload{data, shape, start, dtype}. The Dynamo worker returns full-sequence routing withstart=0; the renderer row-trims the leading prompt rows only when the caller explicitly setsrouted_experts_prompt_start— a first-turn request with no caller start stays full-sequence withstart=0(no phantom prefix). Completion logprobs prefernvext.engine_data.completion_logprobs(the same authoritative source as the engine token IDs) over the chat echo; a present-but-empty engine list is authoritative and does not fall back to chat.Other
RendererTransport = Literal["vllm_generate", "dynamo_chat"]alias. A present-but-emptycompletion_token_idsis a valid zero-token completion; only a fully absent field raises. Multimodal renderers raiseNotImplementedErrorondynamo_chat(vLLM path / token-client TITO remain available for VLMs).Type of Change
Review
Codex adversarial review: SIGN-OFF (F1/F2/F3 + the N1 logprob-presence finding resolved; head
5f2a914). All review threads resolved.Testing
tests/test_client.pycovers the Dynamo request body shape (priority/detokenize/sampling remap), routed_experts parse + row-trim (explicitprompt_startvs first-turn full-sequence), engine-logprob preference incl. present-but-empty, and missing/empty completion IDs.Note
Medium Risk
New Dynamo wire/parse path affects RL-critical completion IDs, logprobs, and MoE
routed_experts; strict runtime errors and no Dynamo multimodal are new failure modes for opted-in rollouts.Overview
Adds a per-call
transportparameter togenerate()("vllm"default,"dynamo"opt-in). The existing vLLM TITO flow is moved into_VllmGenerateTransport; behavior stays the same whentransportis omitted.Dynamo uses
_DynamoChatTransport: pre-tokenized prompts go toPOST /v1/chat/completionsvianvext.token_data, withcache_salt,priority, androuted_experts_prompt_startmapped intonvextand vLLM-only sampling keys dropped. Responses readnvext.engine_datafor completion IDs and logprobs (not chat echo), normalizerouted_experts, keep large blobs as zero-copymemoryview, and optionally client-trim prompt rows when an older worker returns full-sequence routing.Multimodal on Dynamo raises
NotImplementedError; missing engine completion IDs or logprob length mismatches raiseRuntimeError. Tests cover wire shape,nvextmerge, and parse edge cases.Reviewed by Cursor Bugbot for commit 57846ec. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Add
dynamotransport and routed-experts support togenerate()in the renderer clienttransportparameter togenerate()in renderers/client.py, defaulting to'vllm'(existing/inference/v1/generatepath); passing'dynamo'routes to OpenAI-compatible/v1/chat/completionswith NVIDIA Dynamonvextfields._TransportABC with_VllmGenerateTransportand_DynamoChatTransportimplementations; each handles body construction, POST, and response normalization into a common_WireResult.cache_salt/priorityintonvext, and prefersengine_datafields over chat-echo fields when parsing responses.routed_expertsvia_trim_dynamo_routed_expertswhenrouted_experts_prompt_startis set and the worker has not already trimmed.generate()withtransport='dynamo'raisesRuntimeErroron missingcompletion_token_idsor logprob/token-ID length mismatches, where the vLLM path does not.Macroscope summarized 57846ec.