A CUDA matrix operations library focused on high-performance GEMM, mixed-precision tensor core execution, and reproducible numerical evaluation.
This project implements a practical BLAS-style GPU library with:
- Dense linear algebra kernels (GEMM, GEMV, TRSM)
- Multiple GEMM strategies (naive, tiled, register-blocked, tensor core)
- Mixed-precision paths (FP16 and BF16 with FP32 accumulation)
- Sparse operations (CSR SpMV, merge-path SpMV, cuSPARSE-backed SpGEMM)
- LAPACK wrappers via cuSOLVER (LU, QR, SVD, eigenvalue routines)
- Python bindings via pybind11
- Benchmark and test suites for correctness and performance studies
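To make the sparse-kernel entries concrete, this is the CSR SpMV computation those kernels parallelize, sketched host-side in plain Python (the hand-written CSR arrays here are illustrations, not matlib data structures):

```python
# CSR SpMV: y = A @ x, with A stored as (data, indices, indptr).
# Row i's nonzeros live in data[indptr[i]:indptr[i+1]], and their
# column ids in indices[indptr[i]:indptr[i+1]].

def csr_spmv(data, indices, indptr, x):
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):          # a GPU kernel assigns rows (or row
        acc = 0.0                    # segments) to threads/warps instead
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] = acc
    return y

# 3x3 example:  [[4, 0, 9],
#                [0, 7, 0],
#                [0, 0, 5]]
data    = [4.0, 9.0, 7.0, 5.0]
indices = [0, 2, 1, 2]
indptr  = [0, 2, 3, 4]
print(csr_spmv(data, indices, indptr, [1.0, 2.0, 3.0]))  # [31.0, 14.0, 15.0]
```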
The goal is to bridge implementation-level GPU kernel engineering with scientific-grade reproducibility. The code is intended as both:
- A production-style baseline for custom CUDA linear algebra work
- A teaching/research artifact for understanding modern GEMM design
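In that teaching spirit, the tiling strategy behind the GEMM kernels can be illustrated host-side: partition C into TILE x TILE blocks and accumulate partial products over K-tiles, which is the loop structure a shared-memory CUDA kernel maps onto thread blocks. A NumPy sketch of the idea (illustrative only, not the library's kernel code):

```python
import numpy as np

def tiled_gemm(A, B, tile=32):
    """C = A @ B computed block-by-block, mirroring a tiled GPU kernel."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):          # one C tile per "thread block"
        for j in range(0, N, tile):
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):  # march K-tiles through "shared memory"
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((96, 80)).astype(np.float32)
B = rng.standard_normal((80, 64)).astype(np.float32)
assert np.allclose(tiled_gemm(A, B), A @ B, atol=1e-3)
```

The payoff on a GPU comes from reusing each loaded tile across many output elements, which this host-side loop nest only hints at.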
```
matrix-ops-lib/
  include/matlib/   Public headers
  src/              CUDA/C++ implementation
  kernels/          Device-level kernel helpers
  tests/            Unit, numerical, and integration tests
  benchmarks/       Throughput and comparison benchmarks
  python/           Python package and examples
  docs/             API, theory, and performance notes
  cmake/            Build helper modules
```
- NVIDIA GPU (SM 7.0+ for FP16 tensor cores, SM 8.0+ for BF16 tensor cores)
- CUDA Toolkit 12.x
- CMake >= 3.18
- C++17-compatible host compiler
- Ninja (recommended)
```sh
cmake -S . -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES="80;90" \
  -DMATLIB_BUILD_TESTS=ON \
  -DMATLIB_BUILD_BENCHMARKS=ON
cmake --build build --parallel 8
ctest --test-dir build --output-on-failure -j4
```

```sh
./build/gemm_bench
./build/compare_cublas
./build/gemm_sweep gemm_results.csv
./build/kernel_search 4096 4096 4096
```

FP16 tensor core utilization was validated with Nsight Compute:

```sh
ncu --metrics sm__inst_executed_pipe_tensor.avg.pct_of_peak_sustained_active \
    ./test_tensor_gemm
```

Result: 84% tensor pipe utilization and 6.8x throughput vs. FP32 CUDA cores on an RTX 4090, with mixed-precision error < 0.01% against the FP32 reference.
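The error figure above depends on the metric; one common choice is the relative Frobenius error of an FP16-input, FP32-accumulate product against an FP32 reference. A NumPy sketch of that metric (a host-side emulation of the tensor-core data path, not a measurement of the library's kernels):

```python
import numpy as np

def rel_frobenius_error(M, K, N, seed=0):
    """||C_mixed - C_ref||_F / ||C_ref||_F for random A, B."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((M, K)).astype(np.float32)
    B = rng.standard_normal((K, N)).astype(np.float32)
    # Tensor-core-style path: round inputs to FP16, accumulate in FP32.
    A16 = A.astype(np.float16).astype(np.float32)
    B16 = B.astype(np.float16).astype(np.float32)
    C_mixed = A16 @ B16
    C_ref = A @ B
    return np.linalg.norm(C_mixed - C_ref) / np.linalg.norm(C_ref)

print(f"{rel_frobenius_error(256, 256, 256):.2e}")
```

With well-scaled random inputs the error is dominated by FP16 rounding of A and B (unit roundoff about 4.9e-4), consistent with the sub-0.01% figure reported above.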
```sh
pip install pybind11 numpy
cmake -S . -B build -DMATLIB_BUILD_PYTHON=ON
cmake --build build --target _matlib_ext
python python/examples/basic_gemm.py
```

When reporting performance or accuracy numbers, include:
- GPU model, driver, and CUDA version
- Build type and CMAKE_CUDA_ARCHITECTURES
- Matrix sizes and dtype
- Baseline library and version (cuBLAS/cuSPARSE)
- Error metric definition and tolerance
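The checklist above can be stored as structured metadata alongside each result file; a minimal standard-library sketch (field names and example values are illustrative, not a matlib convention):

```python
import json

def benchmark_record(gpu, driver, cuda, build_type, cuda_archs,
                     m, k, n, dtype, baseline, error_metric, tolerance):
    """Bundle the reporting checklist into one JSON-serializable record."""
    return {
        "gpu": gpu,
        "driver": driver,
        "cuda": cuda,
        "build_type": build_type,
        "cuda_architectures": cuda_archs,
        "shape": {"M": m, "K": k, "N": n},
        "dtype": dtype,
        "baseline": baseline,
        "error_metric": error_metric,
        "tolerance": tolerance,
    }

# Example values only; substitute the real environment and baseline.
rec = benchmark_record("RTX 4090", "550.54.14", "12.4", "Release", "80;90",
                       4096, 4096, 4096, "fp16", "cuBLAS 12.x",
                       "relative Frobenius error vs FP32", 1e-4)
print(json.dumps(rec, indent=2))
```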
- API reference: docs/api/
- Performance notes: docs/performance/
- Numerical and algorithmic theory: docs/theory/
- Integration examples: docs/examples/integration_examples.md
- Formal project report: PROJECT_REPORT.md
- License: LICENSE
- Changelog: CHANGELOG.md
- Contribution guide: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md
- Security policy: SECURITY.md
- Citation metadata: CITATION.cff
Version 0.1.0 is the first publication-ready baseline.
The repository is structured for iterative kernel optimization and
research-grade benchmarking.
MIT License. See LICENSE.