🤖FFPA: extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large headdim and a 1.8x~3x speedup 🎉 over SDPA EA.
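For intuition, here is a minimal PyTorch-level sketch of the Split-D idea: the QK^T logits are accumulated over head-dimension tiles, and each output tile depends only on the matching tile of V, so no step needs the full headdim resident at once. This illustrates the decomposition only; it is not FFPA's CUDA kernel, and the tile size and shapes below are assumptions.

```python
# Illustration of a Split-D attention decomposition (not FFPA's kernel).
import torch

def attention_split_d(q, k, v, d_chunk=32):
    d = q.shape[-1]
    scale = d ** -0.5
    # Phase 1: accumulate QK^T logits over head-dim tiles.
    scores = q[..., :d_chunk] @ k[..., :d_chunk].transpose(-2, -1)
    for s in range(d_chunk, d, d_chunk):
        scores = scores + q[..., s:s + d_chunk] @ k[..., s:s + d_chunk].transpose(-2, -1)
    p = torch.softmax(scores * scale, dim=-1)
    # Phase 2: each output head-dim tile uses only the matching V tile.
    tiles = [p @ v[..., s:s + d_chunk] for s in range(0, d, d_chunk)]
    return torch.cat(tiles, dim=-1)

q = torch.randn(1, 8, 256, 256)  # deliberately large headdim (256)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attention_split_d(q, k, v)
# Matches the fused reference within numerical tolerance:
ref = torch.softmax((q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5, -1) @ v
assert torch.allclose(out, ref, atol=1e-4)
```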
An open-source interface for using the multiple-precision solver SDPA-GMP with YALMIP.
PyTorch implementation of YOLOv12 with Scaled Dot-Product Attention (SDPA) optimized by FlashAttention for fast and efficient object detection.
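For context, a minimal sketch of routing attention through PyTorch's scaled_dot_product_attention while pinning dispatch to the FlashAttention backend (PyTorch ≥ 2.3); the shapes and head counts below are illustrative assumptions, not taken from the repo.

```python
# Minimal sketch: forcing SDPA onto the FlashAttention backend in PyTorch.
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend  # PyTorch >= 2.3

# Illustrative shapes; FlashAttention needs fp16/bf16 on a CUDA GPU.
batch, heads, seq, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict dispatch to FlashAttention; raises if the backend is unavailable
# (e.g. unsupported dtype or head_dim on the current GPU).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)  # (batch, heads, seq, head_dim)
```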
Memory-optimized SongGeneration (v2 Large) for 16GB VRAM GPUs. Features 8-bit µ-law KV-caching, fused layers, and SDPA/Triton integration.
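The µ-law companding behind such an 8-bit KV cache can be sketched as follows, using the standard µ = 255 curve; the per-tensor scaling and the function names are assumptions for illustration, not this repo's actual scheme.

```python
# Sketch of 8-bit µ-law compression/decompression for KV-cache tensors.
import math
import torch

MU = 255.0  # classic µ-law constant for 8-bit companding

def kv_compress(x: torch.Tensor):
    scale = x.abs().amax().clamp(min=1e-8)      # per-tensor scale (assumed)
    y = x / scale                               # map into [-1, 1]
    y = torch.sign(y) * torch.log1p(MU * y.abs()) / math.log1p(MU)
    return torch.round(y * 127).to(torch.int8), scale

def kv_decompress(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    y = q.float() / 127
    x = torch.sign(y) * ((1 + MU) ** y.abs() - 1) / MU  # inverse µ-law
    return x * scale

q8, s = kv_compress(torch.randn(4, 16, 128, 64))  # e.g. cached K or V
x_hat = kv_decompress(q8, s)
```

µ-law spends its 8 bits logarithmically, keeping small-magnitude entries precise at the cost of large ones, which suits the roughly zero-centered distributions of cached keys and values.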
A 66M parameter decoder-only transformer language model implemented from scratch in PyTorch. Features a custom SentencePiece tokenizer, RoPE positional embeddings, SwiGLU feed-forward network, per-layer KV cache for efficient autoregressive inference, and a Svelte-based streaming chat interface.
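A minimal sketch of the SwiGLU feed-forward block the description mentions; the layer names and hidden-size ratio are illustrative assumptions, since the model's actual dimensions are not given here.

```python
# Sketch of a SwiGLU feed-forward block: silu(x W_gate) * (x W_up), projected
# back down by W_down. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated linear unit, then project back to d_model.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLU(d_model=512, d_hidden=1536)  # ~ the common 8/3 * d_model ratio
y = ffn(torch.randn(1, 16, 512))          # (batch, seq, d_model)
```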