Add Qwen3.6-27B contrib model with vLLM APC baseline #164

Open
m-deepankar-singh wants to merge 3 commits into aws-neuron:main from m-deepankar-singh:contrib/qwen36-27b-vllm-apc-pr

@m-deepankar-singh m-deepankar-singh commented May 13, 2026

Summary

  • Adds a contrib implementation of Qwen3.6-27B, a 27B dense hybrid DeltaNet + GQA model.
  • Builds on Jim Burtoft's Qwen3.6-27B contrib work in PR #140 ("Contrib: Add Qwen3.6-27B (post-training update of Qwen3.5-27B)") and on the hybrid architecture pattern shared by Qwen3.5 and Qwen3.6.
  • Adds the validated Qwen3.6 dense text/VL model code, DeltaNet NKI kernels, hybrid cache manager, MLP-only FP8 compile path, vLLM Neuron launch helpers, and OpenAI-compatible serving utilities.
  • Adds a vLLM/APC long-context baseline for 128K serving on Trn2 with chunked prefill and prefix-cache validation.

Relationship to PR #140

This PR keeps the same high-level Qwen3.6 architecture described in PR #140: Qwen3.6-27B is a post-training update of Qwen3.5-27B with the same qwen3_5 architecture, 64 layers, and [3 DeltaNet + 1 GQA] pattern. The additional focus here is the production serving path:

  • hybrid cache manager for DeltaNet recurrent/conv state plus GQA KV cache;
  • chunked prefill for long prompts;
  • MLP-only FP8 artifact path for 128K support on trn2.3xlarge;
  • vLLM Neuron registry/launcher support;
  • native vLLM APC validation for repeated and partial-prefix reuse.
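
As a rough illustration of the chunked-prefill item above, the sketch below splits a long prompt into fixed-size token ranges. It assumes that one chunk corresponds to one 512-token CTE bucket (the bucket size listed in the Architecture table); `prefill_chunks` and `CHUNK_TOKENS` are hypothetical names and are not taken from the contrib code.

```python
from typing import Iterator, Tuple

CHUNK_TOKENS = 512  # assumption: one chunk per 512-token CTE bucket


def prefill_chunks(prompt_len: int, chunk: int = CHUNK_TOKENS) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) token ranges that cover the prompt in order."""
    if prompt_len < 0 or chunk <= 0:
        raise ValueError("prompt_len must be >= 0 and chunk must be > 0")
    for start in range(0, prompt_len, chunk):
        # The final chunk may be shorter than `chunk`.
        yield start, min(start + chunk, prompt_len)


# Hypothetical usage: each range would be fed to the model in sequence, with the
# DeltaNet recurrent/conv state and the GQA KV cache carried from chunk to chunk.
ranges = list(prefill_chunks(131_072))
print(len(ranges), ranges[0], ranges[-1])  # 256 (0, 512) (130560, 131072)
```

Under this assumption a full 131,072-token prompt is processed as 256 sequential chunks.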

Architecture

| Feature | Value |
|---|---|
| Layers | 64 |
| Layer pattern | [3 DeltaNet + 1 GQA] x 16 |
| Hidden size | 5120 |
| GQA attention | 24 heads, 4 KV heads, head_dim=256 |
| DeltaNet | 48 value heads, 16 key heads, k_dim=v_dim=128 |
| Position encoding | Partial RoPE, mRoPE-compatible |
| Vocabulary | 248,320 |
| Long-context artifact | 131,072 tokens, CTE bucket 512 |
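
To make the hybrid layout concrete, the sketch below expands the layer pattern and estimates cache sizes from the figures in the table. Several things here are assumptions rather than facts from this PR: that the GQA layer is the last layer of each block, that the KV cache is stored in BF16 (2 bytes), that the DeltaNet recurrent state is one FP32 k_dim x v_dim matrix per value head, and that the conv state is ignored. Totals are for the whole model, before tensor-parallel sharding.

```python
# Values taken from the Architecture table.
NUM_BLOCKS = 16
DELTANET_PER_BLOCK, GQA_PER_BLOCK = 3, 1
KV_HEADS, HEAD_DIM = 4, 256
DN_VALUE_HEADS, DN_K_DIM, DN_V_DIM = 48, 128, 128
MAX_TOKENS = 131_072

# Assumptions (not stated in the PR): BF16 KV cache, FP32 recurrent state.
KV_BYTES = 2
STATE_BYTES = 4

# Assumption: the GQA layer closes each [3 DeltaNet + 1 GQA] block.
layer_types = (["deltanet"] * DELTANET_PER_BLOCK + ["gqa"] * GQA_PER_BLOCK) * NUM_BLOCKS
n_gqa = layer_types.count("gqa")
n_deltanet = layer_types.count("deltanet")

# GQA KV cache grows linearly with context: K and V per KV head per GQA layer.
kv_bytes_per_token = n_gqa * 2 * KV_HEADS * HEAD_DIM * KV_BYTES
kv_bytes_total = kv_bytes_per_token * MAX_TOKENS

# DeltaNet recurrent state is a fixed size, independent of context length.
dn_state_bytes = n_deltanet * DN_VALUE_HEADS * DN_K_DIM * DN_V_DIM * STATE_BYTES

print(f"layers: {len(layer_types)} ({n_deltanet} DeltaNet, {n_gqa} GQA)")
print(f"KV cache: {kv_bytes_per_token / 2**10:.0f} KiB/token, "
      f"{kv_bytes_total / 2**30:.0f} GiB at {MAX_TOKENS} tokens")
print(f"DeltaNet recurrent state: {dn_state_bytes / 2**20:.0f} MiB")
```

Under these assumptions only 16 of the 64 layers carry a per-token KV cache, which is why the recurrent state stays small while the KV cache dominates at 128K.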

Files

contrib/models/Qwen3.6-27B/
├── README.md
├── scripts/
│   └── openai_compat_server.py
├── src/
│   ├── __init__.py
│   ├── modeling_qwen35.py
│   ├── modeling_qwen35_vision.py
│   ├── modeling_qwen35_vl.py
│   └── nki_kernels/
│       ├── __init__.py
│       ├── nki_deltanet.py
│       ├── nki_deltanet_chunked.py
│       └── nki_deltanet_fused.py
├── test/
│   ├── integration/
│   │   ├── qwen36_27b_compile_fp8.py
│   │   └── test_model.py
│   └── unit/
│       ├── test_config.py
│       ├── test_deltanet_decay.py
│       ├── test_hybrid_cache_manager.py
│       └── test_weight_conversion.py
└── vllm/
    ├── README.md
    ├── hf_qwen35_config.py
    ├── install_qwen36_vllm.sh
    ├── patch_nxdi_registry.py
    ├── qwen36_chat_proxy.py
    ├── run_offline_inference.py
    ├── serve_qwen36.py
    ├── sitecustomize.py
    └── start_vllm_server.sh

Test Results

Static Checks

  • git diff --check: PASS
  • python3 -m py_compile on model, kernel, vLLM helper, and OpenAI server Python files: PASS
  • bash -n on vLLM shell helpers: PASS
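
For reviewers who want to repeat the Python syntax check without the shell, a minimal sketch using the standard-library `py_compile` module follows. `syntax_check` is a hypothetical helper, not part of this PR, and the directory path in the usage comment assumes the checkout layout shown under Files.

```python
import py_compile
from pathlib import Path
from typing import List


def syntax_check(root: str) -> List[str]:
    """Byte-compile every .py file under `root`; return the paths that fail."""
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            # doraise=True turns syntax errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError:
            failures.append(str(path))
    return failures


# Hypothetical usage from the repository root:
#   bad = syntax_check("contrib/models/Qwen3.6-27B")
#   print("PASS" if not bad else "FAIL: " + ", ".join(bad))
```

Like `python3 -m py_compile`, this only checks that each file parses; it does not import the Neuron packages, so it can run on a machine without them.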

Unit tests could not be run on the local Mac checkout because the Neuron runtime packages (neuronx_distributed) are not installed there. The unit and hardware validation below was run on Trn2 with the Neuron inference environment.

Unit Coverage

The contrib includes 57 CPU unit tests:

| Test module | Tests |
|---|---|
| test_config.py | 26 |
| test_weight_conversion.py | 16 |
| test_hybrid_cache_manager.py | 13 |
| test_deltanet_decay.py | 2 |

Coverage includes config parsing, Qwen3.6/Qwen3.5 architectural compatibility, weight conversion, q/gate split handling, RMSNorm +1 conversion, hybrid cache allocation, DeltaNet state cache shapes, and decay handling.
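
One plausible reading of the "RMSNorm +1 conversion" item is sketched below. It assumes the checkpoint stores the norm scale as an offset from 1 (the output is multiplied by `1 + w`), so that adding 1 to the stored weight lets a standard RMSNorm consume it. The PR text does not spell this out, and all function names here are hypothetical.

```python
import math
from typing import List


def _inv_rms(x: List[float], eps: float) -> float:
    """Reciprocal root-mean-square of x, stabilised by eps."""
    return 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)


def rmsnorm(x: List[float], scale: List[float], eps: float = 1e-6) -> List[float]:
    """Standard RMSNorm: x / rms(x) * scale."""
    r = _inv_rms(x, eps)
    return [v * r * s for v, s in zip(x, scale)]


def rmsnorm_offset(x: List[float], w: List[float], eps: float = 1e-6) -> List[float]:
    """Offset-form RMSNorm (assumed checkpoint convention): x / rms(x) * (1 + w)."""
    r = _inv_rms(x, eps)
    return [v * r * (1.0 + wi) for v, wi in zip(x, w)]


def convert_offset_weight(w: List[float]) -> List[float]:
    """Fold the +1 into the stored weight once, at conversion time."""
    return [1.0 + wi for wi in w]


x = [0.5, -1.0, 2.0, 0.25]
w = [0.0, 0.1, -0.2, 0.05]
converted = rmsnorm(x, convert_offset_weight(w))
reference = rmsnorm_offset(x, w)
print(all(math.isclose(a, b) for a, b in zip(converted, reference)))
```

A unit test for this kind of conversion would typically compare the two forms on random inputs, as the last lines do for one fixed vector.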

Long-Context vLLM/APC Validation

Hardware: trn2.3xlarge, TP=4, LNC=2, SDK 2.29, vLLM Neuron plugin path, MLP-only FP8 artifact, CTE bucket 512.

| Scenario | Result |
|---|---|
| 128K artifact compile/load | PASS |
| 32K and 64K short-after-long state reset | PASS |
| 32K and 64K needle retrieval prompts | PASS |
| Prefill throughput | 404-428 tok/s from 512 through 64K prompt tokens |
| Decode throughput | 26.3-26.6 tok/s |
| Decode TPOT | ~37.6-38.0 ms/token |
| Cold 512-token TTFT | ~1.2-1.3 s, derived from measured prefill plus one decode step |
| Cold 32K-token TTFT | ~76.6-81.1 s, derived from measured prefill plus one decode step |
| Cold 64K-token TTFT | ~153-162 s, derived from measured prefill plus one decode step |
| Peak Neuron device memory | ~53.25 GB (decimal) during 64K eval |
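
The derived TTFT rows can be reproduced from the throughput rows with the short sketch below. It assumes that "32K" and "64K" mean 32,768 and 65,536 prompt tokens and that the low end of each range pairs the fastest prefill with the fastest decode; `derived_ttft_s` is a hypothetical helper name.

```python
def derived_ttft_s(prompt_tokens: int, prefill_tok_s: float, decode_tok_s: float) -> float:
    """Cold TTFT estimate: full prefill time plus one decode step."""
    return prompt_tokens / prefill_tok_s + 1.0 / decode_tok_s


# Measured ranges from the table above.
PREFILL_TOK_S = (404.0, 428.0)  # (slowest, fastest)
DECODE_TOK_S = (26.3, 26.6)     # (slowest, fastest)

for tokens in (512, 32 * 1024, 64 * 1024):
    best = derived_ttft_s(tokens, PREFILL_TOK_S[1], DECODE_TOK_S[1])
    worst = derived_ttft_s(tokens, PREFILL_TOK_S[0], DECODE_TOK_S[0])
    print(f"{tokens:>6} tokens: {best:.1f}-{worst:.1f} s")

# Decode TPOT is the reciprocal of decode throughput.
print(f"TPOT: {1000 / DECODE_TOK_S[1]:.1f}-{1000 / DECODE_TOK_S[0]:.1f} ms/token")
```

Prefill dominates at long context: the single decode step adds under 40 ms to a figure measured in tens of seconds.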

APC / Prefix Cache Validation

Native vLLM APC was validated with exact greedy output matches.

| APC scenario | Cold | Warm | Speedup | Result |
|---|---|---|---|---|
| Server exact-repeat, ~10.8K prompt tokens | 26.68s | 1.67s | 16.0x | exact text match |
| Offline exact-repeat | 26.19s | 2.38s | 11.0x | exact token-ID match |
| Offline partial-prefix reuse | 25.52s | 1.70s | 15.0x | exact token-ID match |
| Server cross-prefix reuse | 25.17s | 1.36s | 18.5x | exact text match |
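
The Speedup column is the cold latency divided by the warm latency. The sketch below recomputes it from the table; `speedup` and `RUNS` are hypothetical names used only for this check.

```python
def speedup(cold_s: float, warm_s: float) -> float:
    """Ratio of cold (cache miss) latency to warm (prefix cache hit) latency."""
    if warm_s <= 0:
        raise ValueError("warm_s must be positive")
    return cold_s / warm_s


# (cold seconds, warm seconds) copied from the APC table.
RUNS = {
    "Server exact-repeat": (26.68, 1.67),
    "Offline exact-repeat": (26.19, 2.38),
    "Offline partial-prefix reuse": (25.52, 1.70),
    "Server cross-prefix reuse": (25.17, 1.36),
}

for name, (cold, warm) in RUNS.items():
    print(f"{name}: {speedup(cold, warm):.1f}x")
```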

Shared-prefix concurrency tests with 1, 2, and 4 concurrent requests returned all requested markers exactly. The current artifact is compiled for max_num_seqs=1, so concurrent requests are queued and served one at a time rather than batched as multiple sequences.

Notes and Limitations

  • This PR is contrib-scoped and does not modify core NxDI files.
  • The validated 128K serving path uses MLP-only FP8; sensitive modules and DeltaNet recurrent state remain BF16/FP32 where required for stability.
  • Qwen3.6 full-attention head_dim is 256, so the stock head_dim<=128 CTE flash-attention path is not used.
  • vLLM APC is validated for exact-repeat and partial-prefix reuse with the current artifact, but continuous batching and speculative decoding are follow-up work.
  • Native Qwen MTP speculative decoding is intentionally not included in this baseline PR.

Checklist

  • Contrib-only changes under contrib/models/Qwen3.6-27B/
  • Qwen3.6-27B text model implementation
  • DeltaNet NKI kernels
  • Hybrid cache manager tests
  • MLP-only FP8 compile path
  • vLLM Neuron registry and launcher helpers
  • OpenAI-compatible guarded serving helper
  • 128K vLLM/APC hardware validation on Trn2
  • PR body credits Jim Burtoft's PR #140 ("Contrib: Add Qwen3.6-27B (post-training update of Qwen3.5-27B)") as the baseline/reference
  • CI / reviewer hardware validation

@m-deepankar-singh m-deepankar-singh marked this pull request as ready for review May 13, 2026 05:31
@aws-reutermj

Working with our team to evaluate.
