Add SM70/SM75 forward GDN backend by weicj · Pull Request #17 · QwenLM/FlashQLA

weicj · 2026-05-18T13:30:48Z

Summary

This PR adds an explicit SM70/SM75 forward inference backend entry point for Qwen-style Gated DeltaNet.

The Hopper/SM90 TileLang path remains unchanged and stays the default. The new backend targets the Volta/Turing legacy family through a standard CUDA warp-shuffle implementation. Runtime validation of the standalone kernel was performed on RTX 2080 Ti / SM75; SM70 has compile coverage and still needs V100-class runtime validation.
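For readers unfamiliar with the warp-shuffle pattern mentioned above, here is a minimal pure-Python emulation of a butterfly all-reduce in the style of CUDA's `__shfl_xor_sync`. It is only an illustration of the technique, not the kernel code from this PR. The helper names (`shfl_xor`, `warp_all_reduce_sum`, `warp_dot`) and the layout where each of the 32 lanes owns D/32 contiguous elements are assumptions made for the sketch.

```python
WARP_SIZE = 32  # lanes per warp on SM70/SM75


def shfl_xor(values, lane_mask):
    """Emulate __shfl_xor_sync: lane i reads the value held by lane i ^ lane_mask."""
    return [values[i ^ lane_mask] for i in range(len(values))]


def warp_all_reduce_sum(values):
    """Butterfly reduction: after log2(32) = 5 exchange steps every lane holds the sum."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        partner = shfl_xor(vals, offset)
        vals = [a + b for a, b in zip(vals, partner)]
        offset //= 2
    return vals


def warp_dot(x, y):
    """Dot product where each lane first reduces its own slice (assumed layout)."""
    per_lane = len(x) // WARP_SIZE  # 4 elements per lane when D=128
    partial = [
        sum(x[lane * per_lane + j] * y[lane * per_lane + j] for j in range(per_lane))
        for lane in range(WARP_SIZE)
    ]
    return warp_all_reduce_sum(partial)[0]


lane_values = [float(i) for i in range(WARP_SIZE)]
reduced = warp_all_reduce_sum(lane_values)
dot_128 = warp_dot([1.0] * 128, [1.0] * 128)
```

On a real device the exchange happens in registers, with no shared memory round trip, which is why this pattern suits per-row dot products at D=128.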

Changes

Add flash_qla.ops.gated_delta_rule.legacy.chunk_gated_delta_rule_fwd_legacy.
Add a lazy-built CUDA extension for a forward-only legacy Gated DeltaNet backend.
Keep the upstream Hopper/SM90 high-level API unchanged.
Keep the legacy backend explicit instead of silently replacing the existing dispatch path.
Add correctness coverage against a torch reference recurrence for the supported legacy path.
Document the backend scope, limitations, opt-in behavior, and benchmark caveats.
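As an illustration of the kind of reference recurrence the correctness tests compare against, here is a minimal single-head sketch in pure Python. It assumes the commonly published scalar-gate Gated DeltaNet update, S_t = alpha_t * S_{t-1} * (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T with o_t = S_t q_t. The actual torch reference in this PR may differ in details such as query scaling, q/k normalization, or how the gate is parameterized. The function name is hypothetical.

```python
def gated_delta_rule_reference(q, k, v, alpha, beta):
    """Step-by-step recurrence for one head (illustrative, not the PR's code).

    q, k: T vectors of length Dk; v: T vectors of length Dv;
    alpha, beta: T scalars (one decay gate and one write strength per step).
    Returns (outputs, final_state) with the state stored as a Dv x Dk matrix.
    """
    dk, dv = len(k[0]), len(v[0])
    state = [[0.0] * dk for _ in range(dv)]
    outputs = []
    for q_t, k_t, v_t, a_t, b_t in zip(q, k, v, alpha, beta):
        # 1. Scalar gate: decay the whole state.
        state = [[a_t * s for s in row] for row in state]
        # 2. Delta rule: what the decayed state currently predicts for key k_t.
        predicted = [sum(s * kj for s, kj in zip(row, k_t)) for row in state]
        delta = [b_t * (vi - pi) for vi, pi in zip(v_t, predicted)]
        # 3. Rank-1 correction toward the new value.
        for i in range(dv):
            for j in range(dk):
                state[i][j] += delta[i] * k_t[j]
        # 4. Read out with the query.
        outputs.append([sum(s * qj for s, qj in zip(row, q_t)) for row in state])
    return outputs, state


# Two steps writing to the same unit key with beta = 1: the second value
# replaces the first, which is the defining behaviour of the delta rule.
outs, final_state = gated_delta_rule_reference(
    q=[[1.0, 0.0], [1.0, 0.0]],
    k=[[1.0, 0.0], [1.0, 0.0]],
    v=[[3.0, 4.0], [1.0, 1.0]],
    alpha=[0.5, 0.5],
    beta=[1.0, 1.0],
)
```

A sequential loop like this is slow but easy to audit, which makes it a reasonable baseline for checking a fused kernel within a numeric tolerance.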

Scope

Supported:

forward inference only
SM70/SM75-class CUDA devices as the intended legacy target family
scalar-gate Gated DeltaNet
Qwen-style grouped-query head mapping
primary optimized shape: D=128
explicit legacy API entry point
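To make the grouped-query head mapping concrete, the sketch below shows one plausible indexing scheme for the Hq=16, Hv=32 shape used in the benchmarks. It assumes contiguous grouping, where consecutive value heads share one query/key head (the repeat-interleave convention). The PR text does not spell out the mapping, so treat both the convention and the function name as assumptions.

```python
def qk_head_for_value_head(value_head, num_qk_heads, num_value_heads):
    """Return the query/key head index that a value head reads from.

    Assumes contiguous grouping: value heads [g*r, (g+1)*r) share q/k head g,
    where r = num_value_heads // num_qk_heads.
    """
    if num_value_heads % num_qk_heads != 0:
        raise ValueError("num_value_heads must be a multiple of num_qk_heads")
    if not 0 <= value_head < num_value_heads:
        raise IndexError("value head index out of range")
    group_size = num_value_heads // num_qk_heads
    return value_head // group_size


# Hq=16, Hv=32 gives a group size of 2.
mapping = [qk_head_for_value_head(h, 16, 32) for h in range(32)]
```

With this convention a kernel can index q and k directly per value head instead of materializing repeated copies.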

Not supported:

backward kernels or training
automatic dispatch from the upstream high-level API
runtime performance claims for SM70 before V100-class validation
SM80/SM86/SM89 support claims
generic support for all pre-Hopper GPUs
automatic default dispatch for non-Hopper devices
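The opt-in behavior described in the two lists above can be sketched as a small guard. This is a hypothetical illustration, not the wrapper added in this PR: `select_backend`, `use_legacy`, and the returned labels are made-up names. The capability tuple has the `(major, minor)` shape that `torch.cuda.get_device_capability()` returns.

```python
# Compute capabilities in the intended legacy family (SM70 and SM75 only).
LEGACY_CAPABILITIES = {(7, 0), (7, 5)}


def select_backend(capability, use_legacy=False):
    """Pick a backend label for a (major, minor) compute capability.

    The legacy path is only taken when the caller asks for it explicitly;
    nothing here redirects the upstream path on the caller's behalf.
    """
    major, minor = capability
    if not use_legacy:
        # Upstream behaviour is left untouched, whatever the device is.
        return "upstream"
    if (major, minor) not in LEGACY_CAPABILITIES:
        raise ValueError(
            f"legacy backend targets SM70/SM75 only, got SM{major}{minor}"
        )
    return "legacy"


turing_opt_in = select_backend((7, 5), use_legacy=True)
turing_default = select_backend((7, 5))
```

Raising on SM80/SM86/SM89 instead of falling through keeps the guard consistent with the "no support claims" items above.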

Evidence

Runtime validation of the standalone kernel was performed on RTX 2080 Ti / SM75. The same kernel code compiles for SM70, but SM70 runtime and performance validation is not claimed yet.

Standalone kernel timing for a Qwen-like shape:

B=1, T=512, Hq=16, Hv=32, D=128
control recurrent path: about 1.126 ms
optimized legacy path on SM75: about 0.520-0.533 ms
GDN-stage speedup: about 2.1x to 2.2x (1.126 ms over 0.533 ms and 0.520 ms respectively)

GGUF runtime profiling on SM75:

default fused GDN: 406.656 ms
legacy fast path: 195.105 ms
GDN-stage speedup: about 2.08x
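The speedup figures follow directly from the reported timings. The short check below reproduces the arithmetic. Only the timings come from this PR. The helper name is illustrative.

```python
def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline time to optimized time (values above 1 mean faster)."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("timings must be positive")
    return baseline_ms / optimized_ms


# Standalone kernel, B=1, T=512, Hq=16, Hv=32, D=128 (optimized range 0.520-0.533 ms).
standalone_low = speedup(1.126, 0.533)   # about 2.11
standalone_high = speedup(1.126, 0.520)  # about 2.17

# GGUF runtime profiling on SM75.
gguf = speedup(406.656, 195.105)         # about 2.08
```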

Whole-request impact under the same server parameters:

prefill: +7.17%
decode: +0.61%
wall time: -3.49%
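A roughly 2.08x stage speedup producing only a few percent of wall-time change is what Amdahl's law predicts when the stage is a small share of the request. The sketch below back-solves for that share. It assumes the whole -3.49% wall-time change is caused by the GDN stage and that the profiled stage timings describe the same kind of run. Neither assumption is stated in the PR, so the resulting fraction is an estimate, not a measurement.

```python
def implied_stage_fraction(wall_time_reduction, stage_speedup):
    """Fraction f of the original wall time spent in the stage, from
    wall_time_reduction = f * (1 - 1 / stage_speedup)."""
    if not 0.0 <= wall_time_reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    if stage_speedup <= 1.0:
        raise ValueError("stage speedup must exceed 1")
    return wall_time_reduction / (1.0 - 1.0 / stage_speedup)


def overall_speedup(stage_fraction, stage_speedup):
    """Amdahl's law: only the stage's share of the time gets faster."""
    return 1.0 / ((1.0 - stage_fraction) + stage_fraction / stage_speedup)


stage_speedup = 406.656 / 195.105  # GGUF profiling figures from above
fraction = implied_stage_fraction(0.0349, stage_speedup)  # roughly 0.067
whole_request = overall_speedup(fraction, stage_speedup)
```

Under these assumptions the GDN stage would account for roughly 7% of the original wall time, which is consistent with the PR not claiming an end-to-end headline result.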

The intended claim is limited to an explicit legacy forward path with measured GDN-stage improvement on SM75 and compile coverage for SM70. This PR does not claim a new end-to-end FlashQLA headline result.

Validation

CUDA correctness tests added for supported legacy shapes.
Python syntax check for the legacy wrapper and tests.
Historical standalone RTX 2080 Ti / SM75 smoke benchmark for the source kernel.
SM70 compile check.
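For reference, a compile check that covers both targets usually means passing one `-gencode` entry per architecture to nvcc. The helper below builds those flags. The actual build flags used by the lazy-built extension are not shown in this PR, so the function name and the idea of forwarding the result as `extra_cuda_cflags` to `torch.utils.cpp_extension.load` are assumptions.

```python
def gencode_flags(capabilities):
    """Build one nvcc -gencode flag per (major, minor) compute capability."""
    flags = []
    for major, minor in capabilities:
        arch = f"{major}{minor}"
        # Emit native code (SASS) for exactly this architecture.
        flags.append(f"-gencode=arch=compute_{arch},code=sm_{arch}")
    return flags


LEGACY_TARGETS = [(7, 0), (7, 5)]  # SM70 (Volta) and SM75 (Turing)
flags = gencode_flags(LEGACY_TARGETS)
```

A successful build with the SM70 flag only shows that the kernel compiles for that target. It says nothing about runtime correctness or performance on V100-class hardware, which is why SM70 is listed as compile coverage only.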

weicj added 2 commits May 18, 2026 21:17

eca3e2c docs: describe SM70 SM75 fork scope
3ab27d7 ops: add SM70 SM75 legacy GDN forward backend

weicj marked this pull request as ready for review May 18, 2026 16:28

weicj wants to merge 2 commits into QwenLM:main from weicj:sm70-sm75-gdn-forward
