Auto-generated by benchmarks/generate_benchmark_hopper_md.py on 2026-04-05.
GPU: NVIDIA H200 | CUDA: 12.9 | PyTorch: 2.9.1+cu129
FLA baseline: flash-linear-attention v0.4.2
Fully-fused KDA forward prefill kernel (sm90).

Fixed-length batches (B = batch size, T = tokens per sequence):

| B | T | FLA Triton (ms) | cuLA Fused (ms) | Speedup |
|---|---|---|---|---|
| 1 | 512 | 0.576 | 0.230 | 2.51x |
| 1 | 1024 | 0.572 | 0.248 | 2.31x |
| 1 | 4096 | 0.936 | 0.899 | 1.04x |
| 1 | 8192 | 1.819 | 1.758 | 1.03x |
| 1 | 16384 | 3.599 | 3.521 | 1.02x |
| 2 | 512 | 0.569 | 0.228 | 2.49x |
| 2 | 1024 | 0.572 | 0.306 | 1.87x |
| 2 | 4096 | 1.818 | 1.108 | 1.64x |
| 2 | 8192 | 3.605 | 2.210 | 1.63x |
| 2 | 16384 | 7.173 | 4.485 | 1.60x |
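
Variable-length batches (T = total tokens across all sequences; [min..max] and avg are per-sequence lengths):

| Config | FLA Triton (ms) | cuLA Fused (ms) | Speedup |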
|---|---|---|---|
| uniform 10seqs T=4096 [409..415] avg=409 | 1.016 | 0.707 | 1.44x |
| random 10seqs T=4096 [24..1201] avg=409 | 1.008 | 0.660 | 1.53x |
| skewed 10seqs T=4096 [227..2053] avg=409 | 1.005 | 0.668 | 1.50x |
| uniform 20seqs T=4096 [204..220] avg=204 | 1.087 | 0.919 | 1.18x |
| random 20seqs T=4096 [5..787] avg=204 | 1.066 | 0.736 | 1.45x |
| skewed 20seqs T=4096 [107..2063] avg=204 | 1.038 | 0.724 | 1.43x |
| uniform 10seqs T=8192 [819..821] avg=819 | 1.855 | 1.179 | 1.57x |
| random 10seqs T=8192 [48..2401] avg=819 | 1.893 | 1.215 | 1.56x |
| skewed 10seqs T=8192 [455..4097] avg=819 | 1.906 | 1.209 | 1.58x |
| uniform 20seqs T=8192 [409..421] avg=409 | 1.961 | 1.406 | 1.39x |
| random 20seqs T=8192 [9..1574] avg=409 | 1.954 | 1.283 | 1.52x |
| skewed 20seqs T=8192 [215..4107] avg=409 | 1.957 | 1.300 | 1.51x |
| uniform 10seqs T=16384 [1638..1642] avg=1638 | 3.646 | 2.188 | 1.67x |
| random 10seqs T=16384 [95..4802] avg=1638 | 3.646 | 2.306 | 1.58x |
| skewed 10seqs T=16384 [910..8194] avg=1638 | 3.656 | 2.335 | 1.57x |
| uniform 20seqs T=16384 [819..823] avg=819 | 3.679 | 2.355 | 1.56x |
| random 20seqs T=16384 [19..3147] avg=819 | 3.713 | 2.323 | 1.60x |
| skewed 20seqs T=16384 [431..8195] avg=819 | 3.670 | 2.384 | 1.54x |
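
The `uniform` rows are consistent with splitting T evenly and giving the remainder to the last sequence (for example, 9 × 409 + 415 = 4096). The sketch below reproduces that split and the cumulative offsets commonly used to describe a packed variable-length batch. It is an assumption inferred from the table, not the benchmark script's actual code, and the helper names `uniform_lengths` and `cumulative_offsets` are hypothetical.

```python
from itertools import accumulate


def uniform_lengths(total_tokens: int, num_seqs: int) -> list[int]:
    """Split total_tokens into num_seqs lengths; the last one absorbs the remainder."""
    base = total_tokens // num_seqs
    lengths = [base] * num_seqs
    lengths[-1] += total_tokens - base * num_seqs
    return lengths


def cumulative_offsets(lengths: list[int]) -> list[int]:
    """Start offset of each sequence in the packed buffer, plus the final end offset."""
    return [0, *accumulate(lengths)]


lengths = uniform_lengths(4096, 10)
offsets = cumulative_offsets(lengths)
print(min(lengths), max(lengths), offsets[-1])  # 409 415 4096
```

The printed range matches the `[409..415]` label on the first `uniform` row, and the final offset equals T.
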

Summary (28 configs: 10 fixed-length, 18 variable-length): avg=1.58x, min=1.02x, max=2.51x.

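The summary figures can be checked directly from the Speedup columns. The following is a minimal sketch that recomputes them from the rounded values printed in the tables. The generator script presumably works from unrounded timings, so this is a sanity check and not its actual implementation.

```python
from statistics import mean

# Speedup column of the fixed-length table (B, T configs), in row order.
fixed = [2.51, 2.31, 1.04, 1.03, 1.02, 2.49, 1.87, 1.64, 1.63, 1.60]

# Speedup column of the variable-length table, in row order.
varlen = [
    1.44, 1.53, 1.50, 1.18, 1.45, 1.43,
    1.57, 1.56, 1.58, 1.39, 1.52, 1.51,
    1.67, 1.58, 1.57, 1.56, 1.60, 1.54,
]

speedups = fixed + varlen
summary = (
    f"Summary ({len(speedups)} configs): "
    f"avg={mean(speedups):.2f}x, min={min(speedups):.2f}x, max={max(speedups):.2f}x"
)
print(summary)  # Summary (28 configs): avg=1.58x, min=1.02x, max=2.51x
```
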
To reproduce:
python benchmarks/bench_kda_fused_fwd.py --mode both