
Olajide-Badejo/CUDA-Matrix-Library

CUDA Matrix Library

A CUDA matrix operations library focused on high-performance GEMM, mixed-precision tensor core execution, and reproducible numerical evaluation.

Scope

This project implements a practical BLAS-style GPU library with:

  • Dense linear algebra kernels (GEMM, GEMV, TRSM)
  • Multiple GEMM strategies (naive, tiled, register-blocked, tensor core)
  • Mixed-precision paths (FP16 and BF16 with FP32 accumulation)
  • Sparse operations (CSR SpMV, merge-path SpMV, cuSPARSE-backed SpGEMM)
  • LAPACK wrappers via cuSOLVER (LU, QR, SVD, eigenvalue routines)
  • Python bindings via pybind11
  • Benchmark and test suites for correctness and performance studies
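As a concrete reference for what the sparse kernels compute, here is a minimal host-side CSR SpMV in pure Python. This is an illustrative sketch of the operation, not the library's API; the actual device kernels live under kernels/.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Reference CSR sparse matrix-vector product: y = A @ x.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx and vals hold their column indices and values.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form:
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The GPU variants (plain CSR and merge-path) parallelize this same row-wise reduction; merge-path additionally balances work across threads when row lengths are skewed.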

Why This Repository Exists

The goal is to bridge implementation-level GPU kernel engineering and scientific-grade reproducibility. The code is intended both as:

  • A production-style baseline for custom CUDA linear algebra work
  • A teaching/research artifact for understanding modern GEMM design

Project Layout

matrix-ops-lib/
  include/matlib/            Public headers
  src/                       CUDA/C++ implementation
  kernels/                   Device-level kernel helpers
  tests/                     Unit, numerical, and integration tests
  benchmarks/                Throughput and comparison benchmarks
  python/                    Python package and examples
  docs/                      API, theory, and performance notes
  cmake/                     Build helper modules

Build Requirements

  • NVIDIA GPU (SM 7.0+ for FP16 tensor cores, SM 8.0+ for BF16 tensor cores)
  • CUDA Toolkit 12.x
  • CMake >= 3.18
  • C++17-compatible host compiler
  • Ninja (recommended)

Build

cmake -S . -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES="80;90" \
  -DMATLIB_BUILD_TESTS=ON \
  -DMATLIB_BUILD_BENCHMARKS=ON

cmake --build build --parallel 8

Test

ctest --test-dir build --output-on-failure -j4

Benchmark

./build/gemm_bench
./build/compare_cublas
./build/gemm_sweep gemm_results.csv
./build/kernel_search 4096 4096 4096
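The sweep writes per-run rows to a CSV, which can be reduced to a best-kernel-per-shape summary with a few lines of Python. This is a sketch only: the column names (m, n, k, kernel, gflops) are assumptions here, so adjust them to whatever header gemm_sweep actually emits.

```python
import csv
from collections import defaultdict

def best_gflops_per_shape(path):
    """Group sweep rows by (m, n, k) and keep the best GFLOP/s seen.

    Assumes columns named m, n, k, gflops -- adjust to match the
    header that gemm_sweep actually writes.
    """
    best = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            shape = (int(row["m"]), int(row["n"]), int(row["k"]))
            best[shape] = max(best[shape], float(row["gflops"]))
    return dict(best)
```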

Tensor Core Profiling

Validated FP16 tensor core utilization with Nsight Compute:

ncu --metrics sm__inst_executed_pipe_tensor.avg.pct_of_peak_sustained_active \
  ./test_tensor_gemm

Measured result: 84% tensor pipe utilization and 6.8x throughput versus FP32 CUDA cores on an RTX 4090, with mixed-precision error below 0.01% relative to an FP32 reference.
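The error figure depends on the metric used. One common choice (shown here as an assumption, not necessarily this suite's exact definition) is the relative Frobenius error of the mixed-precision product against an FP32 reference. A NumPy model of FP16 storage with FP32 accumulation:

```python
import numpy as np

def rel_frobenius_error(m=256, n=256, k=256, seed=0):
    """Relative Frobenius error of an FP16-input / FP32-accumulate GEMM
    against a pure-FP32 reference."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, k), dtype=np.float32)
    b = rng.standard_normal((k, n), dtype=np.float32)
    # Model tensor-core mixed precision: round inputs to FP16,
    # then perform the product (and accumulation) in FP32.
    a16 = a.astype(np.float16).astype(np.float32)
    b16 = b.astype(np.float16).astype(np.float32)
    c_mixed = a16 @ b16
    c_ref = a @ b
    return np.linalg.norm(c_mixed - c_ref) / np.linalg.norm(c_ref)

print(f"relative error: {rel_frobenius_error():.2e}")
```

This models only the input-rounding error of the tensor-core path; hardware accumulation order can shift the figure slightly, so treat it as a sanity check rather than a bitwise prediction.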

Python Bindings (Optional)

pip install pybind11 numpy
cmake -S . -B build -DMATLIB_BUILD_PYTHON=ON
cmake --build build --target _matlib_ext
python python/examples/basic_gemm.py

Reproducibility Checklist

When reporting performance or accuracy numbers, include:

  • GPU model, driver, and CUDA version
  • Build type and CMAKE_CUDA_ARCHITECTURES
  • Matrix sizes and dtype
  • Baseline library and version (cuBLAS/cuSPARSE)
  • Error metric definition and tolerance
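A small helper along these lines can capture most of the checklist automatically. This is a sketch only: the field names and the nvidia-smi/nvcc probing are assumptions, not part of the library.

```python
import json
import platform
import subprocess

def run_metadata(sizes, dtype, baseline, error_metric, tolerance):
    """Assemble a reproducibility record for a benchmark run."""
    def probe(cmd):
        # Best effort: nvidia-smi/nvcc may be absent on non-GPU hosts.
        try:
            return subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=10).stdout.strip()
        except (OSError, subprocess.TimeoutExpired):
            return "unavailable"

    return {
        "host": platform.platform(),
        "gpu": probe(["nvidia-smi", "--query-gpu=name,driver_version",
                      "--format=csv,noheader"]),
        "cuda": probe(["nvcc", "--version"]),
        "sizes": sizes,
        "dtype": dtype,
        "baseline": baseline,
        "error_metric": error_metric,
        "tolerance": tolerance,
    }

print(json.dumps(run_metadata([(4096, 4096, 4096)], "fp16", "cuBLAS 12.x",
                              "relative Frobenius", 1e-4), indent=2))
```

Emitting this record alongside each results CSV makes runs comparable across machines; the build type and CMAKE_CUDA_ARCHITECTURES still need to be recorded from the configure step.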

Documentation Map

  • API reference: docs/api/
  • Performance notes: docs/performance/
  • Numerical and algorithmic theory: docs/theory/
  • Integration examples: docs/examples/integration_examples.md
  • Formal project report: PROJECT_REPORT.md

Release and Governance Files

  • License: LICENSE
  • Changelog: CHANGELOG.md
  • Contribution guide: CONTRIBUTING.md
  • Code of conduct: CODE_OF_CONDUCT.md
  • Security policy: SECURITY.md
  • Citation metadata: CITATION.cff

Current Status

Version 0.1.0 is the first publication-ready baseline. The repository is structured for iterative kernel optimization and research-grade benchmarking.

License

MIT License. See LICENSE.
