Add SM70/SM75 forward GDN backend #17

Open
weicj wants to merge 2 commits into QwenLM:main from weicj:sm70-sm75-gdn-forward

@weicj weicj commented May 18, 2026

Summary

This PR adds an explicit SM70/SM75 forward inference backend entry point for Qwen-style Gated DeltaNet.

The Hopper/SM90 TileLang path remains unchanged and stays the default. The new backend targets the Volta/Turing legacy family through a standard CUDA warp-shuffle implementation. Runtime validation of the standalone kernel was performed on RTX 2080 Ti / SM75; SM70 has compile coverage and still needs V100-class runtime validation.
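
For readers unfamiliar with the warp-shuffle pattern mentioned above, here is a minimal Python model of a butterfly all-reduce across one 32-lane warp. It is only an illustration of the general technique: the PR's actual CUDA kernel is not shown in this description, and the helper names `shfl_xor` and `warp_all_reduce_sum` are hypothetical. On a GPU the exchange step would be done by the `__shfl_xor_sync` intrinsic.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs, including SM70/SM75


def shfl_xor(values, lane_mask):
    """Model of a shuffle-xor exchange: lane i reads the value held by lane i ^ lane_mask."""
    return [values[i ^ lane_mask] for i in range(len(values))]


def warp_all_reduce_sum(values):
    """Butterfly reduction: after 5 exchange steps every lane holds the full sum."""
    assert len(values) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        partner = shfl_xor(values, offset)
        values = [a + b for a, b in zip(values, partner)]
        offset //= 2
    return values


# Each lane starts with its own lane index as a partial result.
lanes = [float(i) for i in range(WARP_SIZE)]
reduced = warp_all_reduce_sum(lanes)
```

The point of the pattern is that partial sums move between registers of neighbouring lanes, so a reduction over a key or value dimension needs no shared-memory round trip.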

Changes

  • Add flash_qla.ops.gated_delta_rule.legacy.chunk_gated_delta_rule_fwd_legacy.
  • Add a lazy-built CUDA extension for a forward-only legacy Gated DeltaNet backend.
  • Keep the upstream Hopper/SM90 high-level API unchanged.
  • Keep the legacy backend explicit instead of silently replacing the existing dispatch path.
  • Add correctness coverage against a torch reference recurrence for the supported legacy path.
  • Document the backend scope, limitations, opt-in behavior, and benchmark caveats.
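
As a rough illustration of what a reference recurrence for scalar-gate Gated DeltaNet looks like, here is a single-head sketch in plain Python. It is not the test code from this PR (which uses torch); the function name, argument layout, and the omission of any query scaling are assumptions made for the example. The update it implements is the usual gated delta rule: decay the state by the scalar gate `alpha`, then overwrite the value stored under key `k` with strength `beta`.

```python
def gated_delta_rule_reference(q, k, v, alpha, beta):
    """Naive single-head recurrence (hypothetical helper, not the PR's API).

    q, k: T vectors of length Dk; v: T vectors of length Dv;
    alpha, beta: T scalars (decay gate and write strength).
    Returns T output vectors of length Dv.
    """
    dk, dv = len(k[0]), len(v[0])
    # State S has shape Dv x Dk and maps a key to a value.
    state = [[0.0] * dk for _ in range(dv)]
    outputs = []
    for q_t, k_t, v_t, a_t, b_t in zip(q, k, v, alpha, beta):
        # 1. Scalar gate: decay the whole state.
        state = [[a_t * s for s in row] for row in state]
        # 2. Delta rule: compare what is stored under k_t with the new value.
        predicted = [sum(s * kk for s, kk in zip(row, k_t)) for row in state]
        delta = [b_t * (vv - pp) for vv, pp in zip(v_t, predicted)]
        # 3. Rank-1 update: S += delta * k_t^T.
        state = [[s + d * kk for s, kk in zip(row, k_t)]
                 for row, d in zip(state, delta)]
        # 4. Read out: o_t = S q_t.
        outputs.append([sum(s * qq for s, qq in zip(row, q_t)) for row in state])
    return outputs


# Writing twice under the same unit key with beta = 1 replaces the old value.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[2.0, 3.0], [5.0, 7.0]]
out = gated_delta_rule_reference(keys, keys, values, [1.0, 1.0], [1.0, 1.0])
```

A correctness test of the kind described above would compare the kernel's output against such a recurrence within a floating-point tolerance.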

Scope

Supported:

  • forward inference only
  • SM70/SM75-class CUDA devices as the intended legacy target family
  • scalar-gate Gated DeltaNet
  • Qwen-style grouped-query head mapping
  • primary optimized shape: D=128
  • explicit legacy API entry point

Not supported:

  • backward kernels or training
  • automatic dispatch from the upstream high-level API
  • runtime performance claims for SM70 before V100-class validation
  • SM80/SM86/SM89 support claims
  • generic support for all pre-Hopper GPUs
  • automatic default dispatch for non-Hopper devices
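
The scope above can be sketched as two small checks. Both helpers are hypothetical and are not part of the PR's API: one tests whether a device's compute capability falls in the SM70/SM75 target family, and the other shows one plausible grouped-query head mapping, assuming consecutive value heads share a single query/key head. In practice the capability tuple would come from something like `torch.cuda.get_device_capability()`.

```python
# SM70 (Volta) and SM75 (Turing), expressed as (major, minor) pairs.
SUPPORTED_LEGACY_CAPABILITIES = {(7, 0), (7, 5)}


def is_legacy_target(capability):
    """Return True only for the SM70/SM75 family; SM80+ and SM90 are excluded."""
    return tuple(capability) in SUPPORTED_LEGACY_CAPABILITIES


def qk_head_for_value_head(value_head, num_qk_heads, num_v_heads):
    """Assumed mapping: each query/key head serves a contiguous group of value heads."""
    if num_v_heads % num_qk_heads != 0:
        raise ValueError("num_v_heads must be a multiple of num_qk_heads")
    group_size = num_v_heads // num_qk_heads
    return value_head // group_size


# With Hq=16 and Hv=32, every query/key head is shared by two value heads.
mapping = [qk_head_for_value_head(h, 16, 32) for h in range(6)]
```

Because the backend is opt-in, a check like `is_legacy_target` would be run by the caller before choosing the legacy entry point, not inside the upstream dispatch path.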

Evidence

Runtime validation of the standalone kernel was performed on RTX 2080 Ti / SM75. The same kernel code compiles for SM70, but SM70 runtime and performance validation is not claimed yet.

Standalone kernel timing for a Qwen-like shape:

  • B=1, T=512, Hq=16, Hv=32, D=128
  • control recurrent path: about 1.126 ms
  • optimized legacy path on SM75: about 0.520-0.533 ms
  • GDN-stage speedup: about 2.1-2.2x (1.126 ms divided by 0.520-0.533 ms)

GGUF runtime profiling on SM75:

  • default fused GDN: 406.656 ms
  • legacy fast path: 195.105 ms
  • GDN-stage speedup: about 2.08x
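
The speedup figures follow directly from the timings listed in this section; a short check, using only the numbers reported here:

```python
# Standalone kernel timings for B=1, T=512, Hq=16, Hv=32, D=128 (milliseconds).
control_ms = 1.126
legacy_ms_range = (0.520, 0.533)

# GGUF runtime profiling on SM75 (milliseconds).
default_fused_ms = 406.656
legacy_fast_ms = 195.105

standalone_speedups = [control_ms / t for t in legacy_ms_range]
gguf_speedup = default_fused_ms / legacy_fast_ms

print("standalone:", [round(s, 2) for s in standalone_speedups])
print("gguf:", round(gguf_speedup, 2))
```

The two measurements agree to within about 0.1x, which suggests the standalone kernel gain carries over to the GDN stage inside the GGUF runtime.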

Whole-request impact under the same server parameters:

  • prefill: +7.17%
  • decode: +0.61%
  • wall time: -3.49%

The intended claim is limited to an explicit legacy forward path with measured GDN-stage improvement on SM75 and compile coverage for SM70. This PR does not claim a new end-to-end FlashQLA headline result.

Validation

  • CUDA correctness tests added for supported legacy shapes.
  • Python syntax check for the legacy wrapper and tests.
  • Historical standalone RTX 2080 Ti / SM75 smoke benchmark for the source kernel.
  • SM70 compile check.

@weicj weicj marked this pull request as ready for review May 18, 2026 16:28