This repository contains benchmarks and examples for Triton's Gluon experimental features, focusing on GEMM (General Matrix Multiply) kernels for NVIDIA Blackwell GPUs.
Our video demo can be found here: https://drive.google.com/drive/folders/1XJUQBkOG8Ufg3N74uQZSvWUHqv9QyRSX?usp=sharing
- NVIDIA Blackwell GPU (Compute Capability 10.0)
- Python 3.12+
- PyTorch with CUDA support
- Triton built from source (see below)
git clone --recurse-submodules <repository-url>
cd gluon-survey
python -m venv .venv --prompt triton
source .venv/bin/activate
pip install -r requirements.txt
IMPORTANT: As of 2025/12/15, you MUST build Triton from source to use Gluon kernels. The pre-built pip packages do not include the experimental Gluon features required to compile these kernels.
To build and install Triton from source:
cd triton
pip install -r python/requirements.txt # build-time dependencies
pip install -e .
This will build the Triton compiler with support for:
- triton.experimental.gluon module
- Blackwell-specific features (triton.experimental.gluon.language.nvidia.blackwell)
- TMA (Tensor Memory Accelerator) operations
- Gluon JIT compiler
Note: Building from source may take several minutes and requires a C++ compiler and CUDA toolkit installed on your system.
After building Triton, verify that the Gluon module is available:
python -c "from triton.experimental import gluon; print('Gluon is available!')"
The gemm/ directory contains several GEMM kernel implementations with varying optimization levels:
- 0-gemm.py: Blocked GEMM kernel
- 1-gemm.py: Pipelined GEMM kernel
- 2-gemm.py: Persistent GEMM kernel
- 3-gemm.py: Warp-specialized GEMM kernel
- tt-gemm.py: Triton-TMA GEMM implementation
- torch-gemm.py: PyTorch/cuBLAS baseline
Run all kernels with the main driver:
cd gluon-survey/gemm
python main.py
You can run individual kernel modes by passing them as arguments:
# Run blocked GEMM only
python main.py Blocked
# Run Triton-TMA GEMM
python main.py triton
# Run PyTorch baseline
python main.py torch
# Run multiple modes (comma-separated)
python main.py Blocked,triton,torch
Available modes:
- Blocked: Basic blocked GEMM kernel
- Pipelined: Pipelined GEMM with async operations
- triton: Triton-TMA GEMM implementation
- persistent: Persistent GEMM kernel
- warpspec: Warp-specialized GEMM
- torch: PyTorch cuBLAS baseline (eager mode)
- torch.compile: PyTorch compiled with torch.compile
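To illustrate the comma-separated mode syntax, here is a minimal sketch of how such an argument could be split and validated. It is not the repository's main.py: the names AVAILABLE_MODES and parse_modes are hypothetical, and the case-sensitive matching and the "no argument runs every mode" behavior are assumptions based on the usage shown above.

```python
# Hypothetical sketch; the real main.py may parse its arguments differently.
AVAILABLE_MODES = (
    "Blocked", "Pipelined", "triton", "persistent",
    "warpspec", "torch", "torch.compile",
)


def parse_modes(arg=None):
    """Split a comma-separated mode string and validate each name."""
    if not arg:
        # Assumption: no argument means "run every mode".
        return list(AVAILABLE_MODES)
    modes = [m.strip() for m in arg.split(",") if m.strip()]
    unknown = [m for m in modes if m not in AVAILABLE_MODES]
    if unknown:
        raise ValueError(f"unknown mode(s): {', '.join(unknown)}")
    return modes


print(parse_modes("Blocked,triton,torch"))
```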
The benchmarks will sweep across different K dimensions (512 to 16384) and report:
- Block dimensions (BLOCK_M, BLOCK_N, BLOCK_K)
- Execution time in milliseconds
- Performance in TFLOPs/sec
Mode: Blocked
K, BLOCK_M, BLOCK_N, BLOCK_K, ms, TFLOPs/sec
512, 128, 128, 128, 0.1234, 543.21
1024, 128, 128, 128, 0.2456, 689.45
...
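For reference, the ms and TFLOPs/sec columns are related through the usual GEMM operation count of 2 * M * N * K floating-point operations. The sketch below assumes that convention; the function name and the M, N, and timing values are hypothetical, since the benchmark's M and N are not stated here.

```python
def tflops_per_sec(m, n, k, ms):
    """Convert a GEMM runtime in milliseconds to TFLOPs/sec.

    Assumes the conventional count of 2*M*N*K floating-point
    operations (one multiply and one add per inner-product term).
    """
    flops = 2.0 * m * n * k
    seconds = ms * 1e-3
    return flops / seconds / 1e12


# Hypothetical shape and timing, not measured data.
print(f"{tflops_per_sec(8192, 8192, 1024, 0.25):.2f} TFLOPs/sec")
```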
These kernels are designed specifically for NVIDIA Blackwell GPUs (Compute Capability 10.0). The code includes runtime checks and will raise an error if run on incompatible hardware.
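A runtime check of this kind could look like the sketch below. It is not the repository's actual check: the helper names are hypothetical, and matching on major version 10 (which covers Compute Capability 10.0) is an assumption. The only PyTorch calls used are torch.cuda.is_available() and torch.cuda.get_device_capability().

```python
def is_blackwell(capability):
    """Return True for a (major, minor) compute capability of 10.x.

    Assumption: any 10.x device counts as a supported Blackwell GPU.
    """
    major, _minor = capability
    return major == 10


def require_blackwell():
    """Raise RuntimeError unless the current CUDA device is Blackwell."""
    import torch  # imported lazily so is_blackwell() works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device is available.")
    cap = torch.cuda.get_device_capability()
    if not is_blackwell(cap):
        raise RuntimeError(
            f"Blackwell GPU (Compute Capability 10.x) required, "
            f"found {cap[0]}.{cap[1]}."
        )


# Usage: call require_blackwell() once before launching any kernel.
```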
If importing the Gluon module fails, Triton was either not built from source or the build did not complete successfully. Make sure to:
- Build Triton from source using cd triton && pip install -e .
- Verify the build completed without errors
- Check that you can import the Gluon module
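One way to check the imports is sketched below, using only the standard library's importlib.util.find_spec. The module names come from the build section above; the helper name module_available is hypothetical. Note that find_spec imports parent packages, so a broken Triton build is reported as MISSING.

```python
import importlib.util

# Module paths taken from the build notes in this README.
GLUON_MODULES = (
    "triton.experimental.gluon",
    "triton.experimental.gluon.language.nvidia.blackwell",
)


def module_available(name):
    """Return True if the dotted module name can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package is missing or fails to import.
        return False


for name in GLUON_MODULES:
    print(name, "OK" if module_available(name) else "MISSING")
```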
These kernels require a Blackwell GPU. If you see this error, your GPU is not supported. The kernels use Blackwell-specific features like:
- TMA (Tensor Memory Accelerator)
- TCGEN05 MMA operations
- Blackwell tensor memory layout
See LICENSE file for details.