This repository collects CUDA matrix multiplication experiments, from CUDA-core SGEMM kernels to WMMA / TF32 tensor-core kernels.
The main goal is to keep each optimization step measurable: every kernel variant should have a clear hypothesis, benchmark result, and profiling note when useful.
| Path | Description |
|---|---|
| `matmul/` | CUDA-core SGEMM optimization experiments. Includes naive, tiled, register-tiled, vectorized, double-buffered, and `cp.async` versions. |
| `matmul/README.md` | Detailed CUDA-core benchmark notes, optimization log, summary table, and shape experiments. |
| `matmul_tensor/` | Tensor-core matmul experiment using WMMA with TF32 tensor-core compute. |
| `matmul_tensor/README.md` | Tensor-core benchmark notes and current profiling summary. |
See matmul/.
This project tracks a step-by-step CUDA-core SGEMM optimization path:
- global-memory baseline
- shared-memory tiling
- per-thread micro tiling
- shared-memory layout experiments
- column-major layout
- warp tiling and lane mapping
- vectorized memory access
- double buffering
- `cp.async` prefetching
- tall/wide shape experiments
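
As an illustration of the shared-memory tiling step in the list above, here is a minimal sketch. It is not the repository's kernel: the tile size `TILE`, the row-major layout, and the names `sgemm_tiled` and `launch_sgemm_tiled` are assumptions made for this example only.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 16;  // hypothetical tile size, not taken from the repository

// C = A * B with row-major float matrices (an assumption for this sketch):
// A is M x K, B is K x N, C is M x N.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        const int a_col = k0 + threadIdx.x;
        const int b_row = k0 + threadIdx.y;
        // Stage one tile of A and one tile of B in shared memory.
        // Out-of-range elements are padded with zeros.
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        // Each thread accumulates one output element from the staged tiles.
        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // do not overwrite tiles that other threads still read
    }

    if (row < M && col < N) {
        C[row * N + col] = acc;
    }
}

// Hypothetical launch helper: one thread per output element.
inline void launch_sgemm_tiled(const float* A, const float* B, float* C,
                               int M, int N, int K) {
    const dim3 block(TILE, TILE);
    const dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    sgemm_tiled<<<grid, block>>>(A, B, C, M, N, K);
}
```

Each element of A and B is read from global memory once per tile instead of once per multiply-add, which is the effect the later steps in the list build on.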
See matmul_tensor/.
This project currently has a WMMA kernel with:
- `float` input matrices
- TF32 tensor core computation
- FP32 accumulation
- column-major matrix layout
- `128x128x32` CTA tile
- `16x16x8` WMMA fragments
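
To show how the pieces listed above fit together, here is a minimal sketch of a `16x16x8` TF32 WMMA loop with FP32 accumulation on column-major matrices. It is deliberately simplified and is not the repository's kernel: it uses one warp per block and one 16x16 output tile per warp instead of a `128x128x32` CTA tile, it assumes `M` and `N` are multiples of 16 and `K` is a multiple of 8, and the names `wmma_tf32_tile` and `launch_wmma_tf32_tile` are hypothetical. TF32 WMMA requires compute capability 8.0 or newer (for example, `nvcc -arch=sm_80`).

```cuda
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B.
// All matrices are column-major: A is M x K (ld = M), B is K x N (ld = K),
// C is M x N (ld = M). Assumes M % 16 == 0, N % 16 == 0, K % 8 == 0.
__global__ void wmma_tf32_tile(const float* A, const float* B, float* C,
                               int M, int N, int K) {
    const int tile_m = blockIdx.x * 16;  // first row of this warp's C tile
    const int tile_n = blockIdx.y * 16;  // first column of this warp's C tile

    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32,
                   wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32,
                   wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 8) {
        // Element (tile_m, k) of column-major A lives at A[tile_m + k * M].
        wmma::load_matrix_sync(a_frag, A + tile_m + k * M, M);
        // Element (k, tile_n) of column-major B lives at B[k + tile_n * K].
        wmma::load_matrix_sync(b_frag, B + k + tile_n * K, K);

        // float inputs are converted to TF32 before the tensor-core multiply.
        for (int i = 0; i < a_frag.num_elements; ++i) {
            a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
        }
        for (int i = 0; i < b_frag.num_elements; ++i) {
            b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);
        }

        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // FP32 accumulation
    }

    wmma::store_matrix_sync(C + tile_m + tile_n * M, c_frag, M,
                            wmma::mem_col_major);
}

// Hypothetical launch helper: one warp (32 threads) per 16x16 output tile.
inline void launch_wmma_tf32_tile(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
    const dim3 grid(M / 16, N / 16);
    wmma_tf32_tile<<<grid, 32>>>(A, B, C, M, N, K);
}
```

A CTA-tiled kernel would additionally stage A and B tiles in shared memory and assign several fragments to each warp; this sketch leaves that out to keep the fragment API visible.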
For each experiment:
- Implement or adjust a kernel variant.
- Run correctness checks.
- Compare against cuBLAS when relevant.
- Profile with Nsight Compute or Nsight Systems.
- Record the result and takeaway in the project README.
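
The correctness and comparison steps above can be supported by a few small host-side helpers. The sketch below is one possible shape for them, not the repository's actual tooling: `max_scaled_error`, `gemm_gflops`, and `time_ms` are hypothetical names, the error metric is an assumption, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <cstddef>

// Largest |got - ref| / max(|ref|, 1) over n elements.
// `ref` could come from a CPU reference or from cuBLAS output copied to host.
inline float max_scaled_error(const float* got, const float* ref,
                              std::size_t n) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float scale = std::max(std::fabs(ref[i]), 1.0f);
        worst = std::max(worst, std::fabs(got[i] - ref[i]) / scale);
    }
    return worst;
}

// Throughput in GFLOP/s, counting 2 * M * N * K floating-point operations.
inline double gemm_gflops(int M, int N, int K, double milliseconds) {
    return 2.0 * M * N * K / (milliseconds * 1e-3) / 1e9;
}

// Average time in milliseconds of `launch()` over `iters` runs,
// measured with CUDA events after one warm-up run.
template <typename Launch>
float time_ms(Launch launch, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch();  // warm-up, not timed
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        launch();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / iters;
}
```

A TF32 kernel generally needs a looser error threshold than a pure FP32 kernel, because TF32 keeps only 10 explicit mantissa bits for the inputs.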