
Olajide-Badejo/CUDA-Matrix-Library

CUDA Matrix Library

A CUDA matrix operations library focused on high-performance GEMM, mixed-precision tensor core execution, and reproducible numerical evaluation.

Scope

This project implements a practical BLAS-style GPU library with:

  • Dense linear algebra kernels (GEMM, GEMV, TRSM)
  • Multiple GEMM strategies (naive, tiled, register-blocked, tensor core)
  • Mixed-precision paths (FP16 and BF16 with FP32 accumulation)
  • Sparse operations (CSR SpMV, merge-path SpMV, cuSPARSE-backed SpGEMM)
  • LAPACK wrappers via cuSOLVER (LU, QR, SVD, eigenvalue routines)
  • Python bindings via pybind11
  • Benchmark and test suites for correctness and performance studies
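As a concrete reference for what the sparse kernels compute, here is a minimal host-side CSR SpMV in pure Python. This is an illustrative sketch of the operation, not the library's API; the actual device kernels live under kernels/.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Reference CSR sparse matrix-vector product: y = A @ x.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx and vals hold their column indices and values.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form:
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The GPU variants (plain CSR and merge-path) parallelize this same row-wise reduction; merge-path additionally balances work across threads when row lengths are skewed.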

Why This Repository Exists

The goal is to bridge implementation-level GPU kernel engineering and scientific-grade reproducibility. The code is intended both as:

  • A production-style baseline for custom CUDA linear algebra work
  • A teaching/research artifact for understanding modern GEMM design

Project Layout

matrix-ops-lib/
  include/matlib/            Public headers
  src/                       CUDA/C++ implementation
  kernels/                   Device-level kernel helpers
  tests/                     Unit, numerical, and integration tests
  benchmarks/                Throughput and comparison benchmarks
  python/                    Python package and examples
  docs/                      API, theory, and performance notes
  cmake/                     Build helper modules

Build Requirements

  • NVIDIA GPU (SM 7.0+ for FP16 tensor cores, SM 8.0+ for BF16 tensor cores)
  • CUDA Toolkit 12.x
  • CMake >= 3.18
  • C++17-compatible host compiler
  • Ninja (recommended)

Build

cmake -S . -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES="80;90" \
  -DMATLIB_BUILD_TESTS=ON \
  -DMATLIB_BUILD_BENCHMARKS=ON

cmake --build build --parallel 8

Test

ctest --test-dir build --output-on-failure -j4

Benchmark

./build/gemm_bench
./build/compare_cublas
./build/gemm_sweep gemm_results.csv
./build/kernel_search 4096 4096 4096
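The sweep writes per-run rows to a CSV, which can be reduced to a best-kernel-per-shape summary with a few lines of Python. This is a sketch only: the column names (m, n, k, kernel, gflops) are assumptions here, so adjust them to whatever header gemm_sweep actually emits.

```python
import csv
from collections import defaultdict

def best_gflops_per_shape(path):
    """Group sweep rows by (m, n, k) and keep the best GFLOP/s seen.

    Assumes columns named m, n, k, gflops -- adjust to match the
    header that gemm_sweep actually writes.
    """
    best = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            shape = (int(row["m"]), int(row["n"]), int(row["k"]))
            best[shape] = max(best[shape], float(row["gflops"]))
    return dict(best)
```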

Tensor Core Profiling

Validated FP16 tensor core utilization with Nsight Compute:

ncu --metrics sm__inst_executed_pipe_tensor.avg.pct_of_peak_sustained_active \
  ./test_tensor_gemm

Measured result: 84% tensor pipe utilization and 6.8x throughput versus FP32 CUDA cores on an RTX 4090, with mixed-precision error below 0.01% relative to an FP32 reference.
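The error figure depends on the metric used. One common choice (shown here as an assumption, not necessarily this suite's exact definition) is the relative Frobenius error of the mixed-precision product against an FP32 reference. A NumPy model of FP16 storage with FP32 accumulation:

```python
import numpy as np

def rel_frobenius_error(m=256, n=256, k=256, seed=0):
    """Relative Frobenius error of an FP16-input / FP32-accumulate GEMM
    against a pure-FP32 reference."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, k), dtype=np.float32)
    b = rng.standard_normal((k, n), dtype=np.float32)
    # Model tensor-core mixed precision: round inputs to FP16,
    # then perform the product (and accumulation) in FP32.
    a16 = a.astype(np.float16).astype(np.float32)
    b16 = b.astype(np.float16).astype(np.float32)
    c_mixed = a16 @ b16
    c_ref = a @ b
    return np.linalg.norm(c_mixed - c_ref) / np.linalg.norm(c_ref)

print(f"relative error: {rel_frobenius_error():.2e}")
```

This models only the input-rounding error of the tensor-core path; hardware accumulation order can shift the figure slightly, so treat it as a sanity check rather than a bitwise prediction.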

Python Bindings (Optional)

pip install pybind11 numpy
cmake -S . -B build -DMATLIB_BUILD_PYTHON=ON
cmake --build build --target _matlib_ext
python python/examples/basic_gemm.py

Reproducibility Checklist

When reporting performance or accuracy numbers, include:

  • GPU model, driver, and CUDA version
  • Build type and CMAKE_CUDA_ARCHITECTURES
  • Matrix sizes and dtype
  • Baseline library and version (cuBLAS/cuSPARSE)
  • Error metric definition and tolerance
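A small helper along these lines can capture most of the checklist automatically. This is a sketch only: the field names and the nvidia-smi/nvcc probing are assumptions, not part of the library.

```python
import json
import platform
import subprocess

def run_metadata(sizes, dtype, baseline, error_metric, tolerance):
    """Assemble a reproducibility record for a benchmark run."""
    def probe(cmd):
        # Best effort: nvidia-smi/nvcc may be absent on non-GPU hosts.
        try:
            return subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=10).stdout.strip()
        except (OSError, subprocess.TimeoutExpired):
            return "unavailable"

    return {
        "host": platform.platform(),
        "gpu": probe(["nvidia-smi", "--query-gpu=name,driver_version",
                      "--format=csv,noheader"]),
        "cuda": probe(["nvcc", "--version"]),
        "sizes": sizes,
        "dtype": dtype,
        "baseline": baseline,
        "error_metric": error_metric,
        "tolerance": tolerance,
    }

print(json.dumps(run_metadata([(4096, 4096, 4096)], "fp16", "cuBLAS 12.x",
                              "relative Frobenius", 1e-4), indent=2))
```

Emitting this record alongside each results CSV makes runs comparable across machines; the build type and CMAKE_CUDA_ARCHITECTURES still need to be recorded from the configure step.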

Documentation Map

  • API reference: docs/api/
  • Performance notes: docs/performance/
  • Numerical and algorithmic theory: docs/theory/
  • Integration examples: docs/examples/integration_examples.md
  • Formal project report: PROJECT_REPORT.md

Release and Governance Files

  • License: LICENSE
  • Changelog: CHANGELOG.md
  • Contribution guide: CONTRIBUTING.md
  • Code of conduct: CODE_OF_CONDUCT.md
  • Security policy: SECURITY.md
  • Citation metadata: CITATION.cff

Current Status

Version 0.1.0 is the first publication-ready baseline. The repository is structured for iterative kernel optimization and research-grade benchmarking.

License

MIT License. See LICENSE.
