Efficient Triton Kernels for LLM Training
FlagGems is an operator library for large language models implemented in the Triton Language.
Notes on LLMs, covering model inference, transformer model structure, and code analysis of LLM frameworks.
A lightweight LLaMA-like LLM inference framework built on Triton kernels.
Tiled Flash Linear Attention library for fast and efficient mLSTM kernels.
Production inference for encoder models (ColBERT, GLiNER, ColPali, embeddings, etc.) as vLLM plugins for online and in-process deployment.
A "standard library" of Triton kernels.
Manifold-Constrained Hyper-Connections with fused Triton kernels for efficient training
Educational resource demonstrating common GPU programming pitfalls and solutions using Triton kernels.
High-performance late-interaction retrieval engine for on-prem AI. ColBERT/ColPali multi-vector search with Rust fused MaxSim, Triton GPU kernels, ROQ quantization, LEMUR routing, WAL-backed CRUD, and a FastAPI server — single machine, CPU or GPU.
Universal AI Runtime — Execute any model on any hardware
Official Code for the paper ELMO : Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces (in ICML 2025)
KernelHeim – development ground of custom Triton and CUDA kernel functions designed to optimize and accelerate machine learning workloads on NVIDIA GPUs. Inspired by the mythical stronghold of the gods, KernelHeim is a forge where high-performance kernels are crafted to unlock the full potential of the hardware.
🚀 Learn Triton GPU programming by implementing FlashAttention from scratch.
A container of various PyTorch neural network modules written in Triton.
💥 Optimize linear attention models with efficient Triton-based implementations in PyTorch, compatible across NVIDIA, AMD, and Intel platforms.
FlashAttention2 Analysis in Triton
Collection of Triton operators for transformer models.
Repository for learning Triton GPU programming
High-performance Triton kernel library for LLM training with 12 fused operators (AttnRes, RMSNorm, RoPE, CrossEntropy, GRPO, JSD, FusedLinear, etc.) — up to 24x faster than PyTorch with 78% memory savings, outperforming Liger-Kernel on RTX 5090
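Several of the libraries above ship fused RMSNorm kernels, the normalization used in LLaMA-style models. As a plain reference for what such a kernel computes, here is a minimal NumPy sketch (illustrative only, not code from any of the listed repositories):

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reference RMSNorm: scale each row by the inverse of its
    root-mean-square; unlike LayerNorm, no mean is subtracted."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# Example: a batch of 2 rows with hidden size 4
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.5, -0.5, 0.5, -0.5]])
w = np.ones(4)  # learned per-channel scale, here identity
y = rms_norm(x, w)
```

A fused Triton version computes the same result in a single kernel launch, avoiding the intermediate reads and writes of the unfused PyTorch graph; that fusion is where the speedups and memory savings claimed by these libraries come from.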