Add Qwen3.6-27B contrib model with vLLM APC baseline by m-deepankar-singh · Pull Request #164 · aws-neuron/neuronx-distributed-inference

m-deepankar-singh · 2026-05-13T05:25:34Z

Summary

Adds a contrib implementation of Qwen3.6-27B, a 27B dense hybrid DeltaNet + GQA model.
Builds on Jim Burtoft's Qwen3.6-27B contrib work in PR #140 ("Contrib: Add Qwen3.6-27B (post-training update of Qwen3.5-27B)") and the shared Qwen3.5/Qwen3.6 hybrid architecture pattern.
Adds the validated Qwen3.6 dense text/VL model code, DeltaNet NKI kernels, hybrid cache manager, MLP-only FP8 compile path, vLLM Neuron launch helpers, and OpenAI-compatible serving utilities.
Adds a vLLM/APC long-context baseline for 128K serving on Trn2 with chunked prefill and prefix-cache validation.

Relationship to PR #140

This PR keeps the same high-level Qwen3.6 architecture described in PR #140: Qwen3.6-27B is a post-training update of Qwen3.5-27B with the same qwen3_5 architecture, 64 layers, and [3 DeltaNet + 1 GQA] pattern. The additional focus here is the production serving path:

hybrid cache manager for DeltaNet recurrent/conv state plus GQA KV cache;
chunked prefill for long prompts;
MLP-only FP8 artifact path for 128K support on trn2.3xlarge;
vLLM Neuron registry/launcher support;
native vLLM APC validation for repeated and partial-prefix reuse.
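
As a rough illustration of the chunked-prefill item above, here is a minimal sketch that splits a long prompt into fixed-size chunks. It assumes the chunk size equals the CTE bucket of 512 quoted in this PR; `chunk_prompt` and `num_prefill_steps` are hypothetical names, not functions from the contrib code.

```python
from typing import Iterator, List, Tuple

CTE_BUCKET = 512  # context-encoding bucket size quoted in this PR


def chunk_prompt(
    token_ids: List[int], chunk_size: int = CTE_BUCKET
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (start_position, chunk) pairs that cover the prompt in order.

    Assumption: each chunk would be prefilled in sequence, with the GQA KV
    cache growing and the DeltaNet recurrent/conv state carried forward
    from one chunk to the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        yield start, token_ids[start:start + chunk_size]


def num_prefill_steps(prompt_len: int, chunk_size: int = CTE_BUCKET) -> int:
    """Number of prefill passes needed for a prompt of prompt_len tokens."""
    return -(-prompt_len // chunk_size)  # ceiling division
```

Under these assumptions a full 131,072-token prompt takes 256 prefill passes, and only the last chunk of a prompt can be shorter than the bucket.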

Architecture

| Feature | Value |
| --- | --- |
| Layers | 64 |
| Layer pattern | `[3 DeltaNet + 1 GQA] x 16` |
| Hidden size | 5120 |
| GQA attention | 24 heads, 4 KV heads, head_dim=256 |
| DeltaNet | 48 value heads, 16 key heads, k_dim=v_dim=128 |
| Position encoding | Partial RoPE, mRoPE-compatible |
| Vocabulary | 248,320 |
| Long-context artifact | 131,072 tokens, CTE bucket 512 |
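
The layer pattern and cache dimensions above can be turned into a back-of-the-envelope sizing sketch. The figures below are not measurements from this PR. They assume three DeltaNet layers precede each GQA layer within a block, a BF16 KV cache, an FP32 DeltaNet recurrent state of shape k_dim x v_dim per value head, and batch size 1, and they ignore the DeltaNet conv state.

```python
LAYERS = 64
PATTERN = ["deltanet", "deltanet", "deltanet", "gqa"]  # [3 DeltaNet + 1 GQA]

# Repeat the 4-layer block 16 times to cover all 64 layers.
layer_types = [PATTERN[i % len(PATTERN)] for i in range(LAYERS)]
num_gqa = layer_types.count("gqa")
num_deltanet = layer_types.count("deltanet")

# GQA KV cache: K and V tensors per layer, growing with context length.
KV_HEADS, HEAD_DIM, CONTEXT = 4, 256, 131072
BYTES_BF16 = 2  # assumption: KV cache stored in BF16
kv_cache_bytes = num_gqa * 2 * KV_HEADS * HEAD_DIM * CONTEXT * BYTES_BF16

# DeltaNet recurrent state: fixed size, independent of context length.
V_HEADS, K_DIM, V_DIM = 48, 128, 128
BYTES_FP32 = 4  # assumption: recurrent state kept in FP32
deltanet_state_bytes = num_deltanet * V_HEADS * K_DIM * V_DIM * BYTES_FP32

print(f"GQA layers: {num_gqa}, DeltaNet layers: {num_deltanet}")
print(f"GQA KV cache at 128K: {kv_cache_bytes / 2**30:.1f} GiB")
print(f"DeltaNet recurrent state: {deltanet_state_bytes / 2**20:.0f} MiB")
```

With these assumptions the 16 GQA layers need 8 GiB of KV cache at 131,072 tokens (summed across all TP ranks), while the 48 DeltaNet layers need a constant 144 MiB of recurrent state.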

Files

contrib/models/Qwen3.6-27B/
├── README.md
├── scripts/
│   └── openai_compat_server.py
├── src/
│   ├── __init__.py
│   ├── modeling_qwen35.py
│   ├── modeling_qwen35_vision.py
│   ├── modeling_qwen35_vl.py
│   └── nki_kernels/
│       ├── __init__.py
│       ├── nki_deltanet.py
│       ├── nki_deltanet_chunked.py
│       └── nki_deltanet_fused.py
├── test/
│   ├── integration/
│   │   ├── qwen36_27b_compile_fp8.py
│   │   └── test_model.py
│   └── unit/
│       ├── test_config.py
│       ├── test_deltanet_decay.py
│       ├── test_hybrid_cache_manager.py
│       └── test_weight_conversion.py
└── vllm/
    ├── README.md
    ├── hf_qwen35_config.py
    ├── install_qwen36_vllm.sh
    ├── patch_nxdi_registry.py
    ├── qwen36_chat_proxy.py
    ├── run_offline_inference.py
    ├── serve_qwen36.py
    ├── sitecustomize.py
    └── start_vllm_server.sh

Test Results

Static Checks

git diff --check: PASS
python3 -m py_compile on model, kernel, vLLM helper, and OpenAI server Python files: PASS
bash -n on vLLM shell helpers: PASS
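
For readers who want to reproduce the syntax check across a whole directory, a minimal sketch using the standard-library `py_compile` module follows. `check_syntax` is a hypothetical helper, not a script shipped in this PR.

```python
import py_compile
from pathlib import Path
from typing import List, Tuple


def check_syntax(root: str) -> List[Tuple[str, str]]:
    """Byte-compile every .py file under root and collect failures.

    Returns a list of (path, error message) pairs. An empty list means
    every file compiled cleanly.
    """
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            # doraise=True raises PyCompileError instead of printing to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append((str(path), str(exc)))
    return failures
```

Calling `check_syntax("contrib/models/Qwen3.6-27B")` from the repository root would cover the model, kernel, vLLM helper, and server files in one pass. Like `python3 -m py_compile`, this writes bytecode into `__pycache__` directories as a side effect.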

Local unit-test execution on the Mac checkout is blocked because the Neuron runtime packages (neuronx_distributed) are not installed there. Hardware/unit validation below was run on Trn2 with the Neuron inference environment.

Unit Coverage

The contrib includes 57 CPU unit tests:

| Test module | Tests |
| --- | --- |
| `test_config.py` | 26 |
| `test_weight_conversion.py` | 16 |
| `test_hybrid_cache_manager.py` | 13 |
| `test_deltanet_decay.py` | 2 |

Coverage includes config parsing, Qwen3.6/Qwen3.5 architectural compatibility, weight conversion, q/gate split handling, RMSNorm +1 conversion, hybrid cache allocation, DeltaNet state cache shapes, and decay handling.
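
The "RMSNorm +1 conversion" is not spelled out in this description. One plausible reading, stated here as an assumption, is that the checkpoint stores a zero-centered norm weight applied as `(1 + w)`, and that conversion folds the `+1` into the stored weight so a standard RMSNorm can be used. The sketch below checks that equivalence with plain Python lists. All function names are hypothetical.

```python
import math
from typing import List


def rms_normalize(x: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root mean square."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv for v in x]


def rmsnorm_zero_centered(x: List[float], w: List[float]) -> List[float]:
    """Assumed checkpoint convention: output = normalize(x) * (1 + w)."""
    return [n * (1.0 + wi) for n, wi in zip(rms_normalize(x), w)]


def rmsnorm_standard(x: List[float], w: List[float]) -> List[float]:
    """Standard convention: output = normalize(x) * w."""
    return [n * wi for n, wi in zip(rms_normalize(x), w)]


def fold_plus_one(w: List[float]) -> List[float]:
    """Convert a zero-centered weight into a standard RMSNorm weight."""
    return [wi + 1.0 for wi in w]
```

If this reading is right, `rmsnorm_standard(x, fold_plus_one(w))` matches `rmsnorm_zero_centered(x, w)` for any input, which is the kind of property a weight-conversion unit test could assert.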

Long-Context vLLM/APC Validation

Hardware: trn2.3xlarge, TP=4, LNC=2, SDK 2.29, vLLM Neuron plugin path, MLP-only FP8 artifact, CTE bucket 512.

| Scenario | Result |
| --- | --- |
| 128K artifact compile/load | PASS |
| 32K and 64K short-after-long state reset | PASS |
| 32K and 64K needle retrieval prompts | PASS |
| Prefill throughput | 404-428 tok/s from 512 through 64K prompt tokens |
| Decode throughput | 26.3-26.6 tok/s |
| Decode TPOT | ~37.6-38.0 ms/token |
| Cold 512-token TTFT | ~1.2-1.3 s, derived from measured prefill plus one decode step |
| Cold 32K-token TTFT | ~76.6-81.1 s, derived from measured prefill plus one decode step |
| Cold 64K-token TTFT | ~153-162 s, derived from measured prefill plus one decode step |
| Peak Neuron device memory | ~53.25 GB (decimal units) during 64K eval |
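
The derived TTFT rows can be reproduced from the throughput and TPOT rows. The sketch below follows the stated method (prefill time plus one decode step) and assumes "32K" and "64K" mean 32,768 and 65,536 prompt tokens, which this PR does not state explicitly. The function names are hypothetical.

```python
from typing import Tuple

PREFILL_TOK_S = (404.0, 428.0)  # measured prefill throughput range
TPOT_MS = (37.6, 38.0)          # measured decode time per output token


def estimate_ttft_s(prompt_tokens: int, prefill_tok_s: float, tpot_ms: float) -> float:
    """TTFT estimate: time to prefill the prompt plus one decode step."""
    return prompt_tokens / prefill_tok_s + tpot_ms / 1000.0


def ttft_range_s(prompt_tokens: int) -> Tuple[float, float]:
    """Best-case and worst-case TTFT from the measured ranges."""
    best = estimate_ttft_s(prompt_tokens, max(PREFILL_TOK_S), min(TPOT_MS))
    worst = estimate_ttft_s(prompt_tokens, min(PREFILL_TOK_S), max(TPOT_MS))
    return best, worst


for n in (512, 32768, 65536):
    best, worst = ttft_range_s(n)
    print(f"{n:>6} tokens: {best:.1f}-{worst:.1f} s")
```

At these prompt lengths the decode step contributes under 40 ms, so cold TTFT is dominated by prefill and grows roughly linearly with prompt length.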

APC / Prefix Cache Validation

Native vLLM APC was validated with exact greedy output matches.

| APC scenario | Cold | Warm | Speedup | Result |
| --- | --- | --- | --- | --- |
| Server exact-repeat, ~10.8K prompt tokens | 26.68s | 1.67s | 16.0x | exact text match |
| Offline exact-repeat | 26.19s | 2.38s | 11.0x | exact token-ID match |
| Offline partial-prefix reuse | 25.52s | 1.70s | 15.0x | exact token-ID match |
| Server cross-prefix reuse | 25.17s | 1.36s | 18.5x | exact text match |

Shared-prefix concurrency at 1/2/4 requests returned all requested markers exactly. The current artifact is compiled for max_num_seqs=1, so concurrent requests are queued and served one at a time rather than batched as multiple sequences.
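
To make exact-repeat and partial-prefix reuse concrete, here is a toy sketch of block-level prefix matching with chained hashes. It is an illustration of the general idea, not vLLM's implementation: the block size of 16, the hashing scheme, and the `PrefixCache` class are all hypothetical, and the sketch does not model how DeltaNet recurrent state would be restored for a reused prefix.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 16  # hypothetical block size, for illustration only


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with its parent hash.

    Chaining means a block hash identifies the entire prefix up to and
    including that block, not just the block's own tokens.
    """
    hashes = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Tracks which prefix blocks have already been computed."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def insert(self, token_ids: List[int]) -> None:
        self._seen.update(block_hashes(token_ids))

    def cached_prefix_len(self, token_ids: List[int]) -> int:
        """Number of leading tokens whose blocks are already cached."""
        matched = 0
        for h in block_hashes(token_ids):
            if h not in self._seen:
                break
            matched += 1
        return matched * BLOCK_SIZE
```

In this toy model an exact repeat reuses every full block, while a prompt that diverges partway reuses only the blocks before the first differing token. Only the uncached remainder then needs prefill.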

Notes and Limitations

This PR is contrib-scoped and does not modify core NxDI files.
The validated 128K serving path uses MLP-only FP8; sensitive modules and DeltaNet recurrent state remain BF16/FP32 where required for stability.
Qwen3.6 full-attention head_dim is 256, so the stock head_dim<=128 CTE flash-attention path is not used.
vLLM APC is validated for exact-repeat and partial-prefix reuse with the current artifact, but continuous batching and speculative decoding are follow-up work.
Native Qwen MTP speculative decoding is intentionally not included in this baseline PR.

Checklist

aws-reutermj · 2026-05-14T22:16:46Z

Working with our team to evaluate.

m-deepankar-singh added 2 commits May 13, 2026 10:49

Contrib: add Qwen3.6-27B vLLM APC baseline

22d58b7

Docs: align Qwen3.6 README with contrib guidelines

bb94c28

m-deepankar-singh marked this pull request as ready for review May 13, 2026 05:31

Docs: add Qwen3.6 TTFT and TPOT benchmarks

6d6ae62

m-deepankar-singh wants to merge 3 commits into aws-neuron:main from m-deepankar-singh:contrib/qwen36-27b-vllm-apc-pr