Benchmark Results — Hopper (SM90)

Auto-generated by benchmarks/generate_benchmark_hopper_md.py on 2026-04-05.

GPU: NVIDIA H200 | CUDA: 12.9 | PyTorch: 2.9.1+cu129

FLA baseline: flash-linear-attention v0.4.2

KDA Fused Forward (Kimi Delta Attention)

Fully-fused KDA forward prefill kernel (sm90).

Fixed-Length (H=64, D=128, bf16)

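| B | T | FLA Triton (ms) | cuLA Fused (ms) | Speedup |
|---|---|---|---|---|
| 1 | 512 | 0.576 | 0.230 | 2.51x |
| 1 | 1024 | 0.572 | 0.248 | 2.31x |
| 1 | 4096 | 0.936 | 0.899 | 1.04x |
| 1 | 8192 | 1.819 | 1.758 | 1.03x |
| 1 | 16384 | 3.599 | 3.521 | 1.02x |
| 2 | 512 | 0.569 | 0.228 | 2.49x |
| 2 | 1024 | 0.572 | 0.306 | 1.87x |
| 2 | 4096 | 1.818 | 1.108 | 1.64x |
| 2 | 8192 | 3.605 | 2.210 | 1.63x |
| 2 | 16384 | 7.173 | 4.485 | 1.60x |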
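
The Speedup column reads as the FLA Triton time divided by the cuLA fused time. The sketch below recomputes it for a few rows; `speedup` is a hypothetical helper, not part of the benchmark script. The published ratios were presumably derived from unrounded timings, so recomputing from the rounded milliseconds shown here can differ in the last digit (for example, B=1, T=512).

```python
def speedup(baseline_ms: float, fused_ms: float) -> float:
    """Ratio of baseline latency to fused-kernel latency (higher is better)."""
    if fused_ms <= 0:
        raise ValueError("fused_ms must be positive")
    return baseline_ms / fused_ms


# (B, T, FLA Triton ms, cuLA fused ms) copied from the fixed-length table.
rows = [
    (1, 4096, 0.936, 0.899),
    (2, 4096, 1.818, 1.108),
    (2, 16384, 7.173, 4.485),
]

for b, t, fla_ms, cula_ms in rows:
    print(f"B={b} T={t}: {speedup(fla_ms, cula_ms):.2f}x")
```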

Variable-Length (H=64, D=128, bf16)

| Config | FLA Triton (ms) | cuLA Fused (ms) | Speedup |
|---|---|---|---|
| uniform 10seqs T=4096 [409..415] avg=409 | 1.016 | 0.707 | 1.44x |
| random 10seqs T=4096 [24..1201] avg=409 | 1.008 | 0.660 | 1.53x |
| skewed 10seqs T=4096 [227..2053] avg=409 | 1.005 | 0.668 | 1.50x |
| uniform 20seqs T=4096 [204..220] avg=204 | 1.087 | 0.919 | 1.18x |
| random 20seqs T=4096 [5..787] avg=204 | 1.066 | 0.736 | 1.45x |
| skewed 20seqs T=4096 [107..2063] avg=204 | 1.038 | 0.724 | 1.43x |
| uniform 10seqs T=8192 [819..821] avg=819 | 1.855 | 1.179 | 1.57x |
| random 10seqs T=8192 [48..2401] avg=819 | 1.893 | 1.215 | 1.56x |
| skewed 10seqs T=8192 [455..4097] avg=819 | 1.906 | 1.209 | 1.58x |
| uniform 20seqs T=8192 [409..421] avg=409 | 1.961 | 1.406 | 1.39x |
| random 20seqs T=8192 [9..1574] avg=409 | 1.954 | 1.283 | 1.52x |
| skewed 20seqs T=8192 [215..4107] avg=409 | 1.957 | 1.300 | 1.51x |
| uniform 10seqs T=16384 [1638..1642] avg=1638 | 3.646 | 2.188 | 1.67x |
| random 10seqs T=16384 [95..4802] avg=1638 | 3.646 | 2.306 | 1.58x |
| skewed 10seqs T=16384 [910..8194] avg=1638 | 3.656 | 2.335 | 1.57x |
| uniform 20seqs T=16384 [819..823] avg=819 | 3.679 | 2.355 | 1.56x |
| random 20seqs T=16384 [19..3147] avg=819 | 3.713 | 2.323 | 1.60x |
| skewed 20seqs T=16384 [431..8195] avg=819 | 3.670 | 2.384 | 1.54x |
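
In each Config label, the bracketed range is the shortest and longest sequence and `avg` is the integer mean length. The ranges in the uniform rows are consistent with splitting T evenly and giving the remainder to one sequence. That is an inference from the numbers, not something taken from the benchmark script. The sketch below illustrates it; `uniform_lengths` and `cumulative_offsets` are hypothetical names, and the cumulative-offset layout (often called `cu_seqlens`) is an assumption about how a packed variable-length batch would be described.

```python
from itertools import accumulate


def uniform_lengths(total_tokens: int, num_seqs: int) -> list[int]:
    """Split total_tokens into num_seqs near-equal lengths.

    Assumption: every sequence gets total_tokens // num_seqs tokens and the
    last sequence absorbs the remainder.
    """
    base = total_tokens // num_seqs
    lengths = [base] * num_seqs
    lengths[-1] += total_tokens - base * num_seqs
    return lengths


def cumulative_offsets(lengths: list[int]) -> list[int]:
    """Start offset of each sequence in the packed batch, plus the total."""
    return [0, *accumulate(lengths)]


lengths = uniform_lengths(4096, 10)
offsets = cumulative_offsets(lengths)
print(f"[{min(lengths)}..{max(lengths)}] avg={sum(lengths) // len(lengths)}")
print(offsets)
```

For T=4096 and 10 sequences this yields nine sequences of 409 tokens and one of 415, which matches the `[409..415] avg=409` label.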

Summary (28 configs): avg=1.58x, min=1.02x, max=2.51x.
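
The summary line can be cross-checked against the Speedup columns of both tables (10 fixed-length and 18 variable-length configs). This is a minimal sketch that works from the rounded values shown in this file, not the generator script's own code.

```python
from statistics import mean

# Speedup columns copied from the two tables, in row order.
fixed = [2.51, 2.31, 1.04, 1.03, 1.02, 2.49, 1.87, 1.64, 1.63, 1.60]
varlen = [
    1.44, 1.53, 1.50, 1.18, 1.45, 1.43,
    1.57, 1.56, 1.58, 1.39, 1.52, 1.51,
    1.67, 1.58, 1.57, 1.56, 1.60, 1.54,
]

speedups = fixed + varlen
summary = (
    f"Summary ({len(speedups)} configs): "
    f"avg={mean(speedups):.2f}x, "
    f"min={min(speedups):.2f}x, "
    f"max={max(speedups):.2f}x."
)
print(summary)
```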

To reproduce:

python benchmarks/bench_kda_fused_fwd.py --mode both