This project develops a CUDA-based matrix operations library that combines:
- high-performance dense kernel execution,
- mixed-precision tensor core acceleration,
- numerical verification against trusted baselines,
- and publication-quality engineering practices.
The target is not only high throughput, but also transparent, auditable design choices suitable for advanced research and industrial integration.
Implemented components include:
- BLAS-like dense operations: GEMM, GEMV, TRSM
- GEMM kernel variants:
  - naive baseline
  - shared-memory tiled
  - register-blocked
  - tensor-core (FP16 WMMA)
  - tensor-core (BF16 WMMA)
- Runtime dispatch logic selecting a kernel by matrix size, precision, and compute capability
- Sparse kernels: CSR SpMV and merge-path SpMV
- LAPACK wrappers (cuSOLVER): LU, QR, SVD, eigen decomposition
- Python bindings via pybind11
- Tests and benchmarks for correctness, numerical behavior, and performance
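The sparse path above can be illustrated with a minimal CPU reference for CSR SpMV. This is a sketch with hypothetical type and function names (the library's actual API may differ); the GPU kernels parallelize the same per-row reduction across threads or warps:

```cpp
#include <vector>
#include <cstddef>

// Minimal CSR container (hypothetical names, for illustration only).
struct CsrMatrix {
    int rows;
    std::vector<int> row_ptr;   // size rows + 1
    std::vector<int> col_idx;   // size nnz
    std::vector<float> vals;    // size nnz
};

// Reference CSR SpMV: y = A * x, one row at a time. Each CUDA thread/warp
// in the GPU version computes one of these per-row dot products.
std::vector<float> spmv_csr_ref(const CsrMatrix& A, const std::vector<float>& x) {
    std::vector<float> y(A.rows, 0.0f);
    for (int r = 0; r < A.rows; ++r) {
        float acc = 0.0f;
        for (int j = A.row_ptr[r]; j < A.row_ptr[r + 1]; ++j)
            acc += A.vals[j] * x[A.col_idx[j]];
        y[r] = acc;
    }
    return y;
}
```

A reference like this also serves as the trusted baseline for differential checks against optimized kernels.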
Kernel design follows the GPU memory hierarchy:
- Global memory for bulk matrix storage
- Shared memory for tile staging
- Registers for per-thread accumulation
This reduces global traffic and increases arithmetic intensity.
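The effect of tiling on arithmetic intensity can be made concrete. For a T x T shared-memory tile in FP32 GEMM, each K-panel of width T loads one A tile and one B tile (2*T*T elements) and performs 2*T*T*T multiply-adds, so intensity grows linearly with the tile size. A small sketch (illustrative, not library code):

```cpp
#include <cstddef>

// Arithmetic intensity (FLOPs per byte of global traffic) of a T x T
// shared-memory tile in FP32 GEMM. Per K-panel of width T, a block loads
// 2*T*T floats and performs 2*T*T*T FLOPs, which simplifies to T / 4.
double tile_intensity_fp32(std::size_t T) {
    double flops = 2.0 * T * T * T;             // multiply-adds on the tile
    double bytes = 2.0 * T * T * sizeof(float); // global loads staged via shared memory
    return flops / bytes;
}
```

For a 32 x 32 tile this gives 8 FLOPs per byte, versus 0.25 for the naive kernel that reloads every operand from global memory.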
Distinct kernel families are preserved rather than collapsed into one complex meta-kernel. This makes behavior easier to reason about and benchmark.
The dispatcher prefers:
- BF16 tensor core path on SM >= 8.0 for FP32 workloads where safe range and throughput both matter,
- FP16 tensor core on older tensor-core devices,
- register-blocked or tiled kernels for unsupported/smaller cases,
- cuBLAS fallback when appropriate.
BF16 is prioritized on Ampere+ hardware due to FP32-like exponent range, minimizing overflow risk for larger-magnitude inputs while maintaining tensor core throughput.
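The selection logic above can be sketched as a small host-side function. Kernel names, the size cutoff, and the SM encoding (major*10 + minor) are illustrative assumptions, not the library's actual API:

```cpp
#include <string>

// Hypothetical dispatch sketch: choose a GEMM kernel family from compute
// capability, input precision, and problem size. Thresholds are illustrative.
std::string pick_gemm_kernel(int sm, bool fp32_input,
                             long long m, long long n, long long k) {
    bool large = (m * n * k) >= (256LL * 256 * 256);  // illustrative cutoff
    if (fp32_input && sm >= 80 && large)
        return "wmma_bf16";        // Ampere+: BF16 keeps FP32-like exponent range
    if (sm >= 70 && large)
        return "wmma_fp16";        // Volta/Turing tensor cores
    if (large)
        return "register_blocked"; // no tensor cores, but enough work to tile deeply
    return "tiled";                // small problems: launch/copy overhead dominates
}
```

In the real dispatcher a cuBLAS fallback would sit behind these cases; it is omitted here to keep the sketch short.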
Numerical verification relies on trusted baselines:
- cuBLAS serves as the dense reference,
- cuSPARSE and cuSOLVER provide the sparse and factorization references,
- relative error metrics are chosen per operation type.
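A typical per-operation metric is the relative error in the Frobenius norm, comparing a kernel's output against the reference result. A minimal sketch (hypothetical function name):

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Relative Frobenius-norm error: ||C - C_ref||_F / ||C_ref||_F.
// Used to compare a candidate kernel's output against a trusted baseline.
double rel_error_fro(const std::vector<double>& c, const std::vector<double>& ref) {
    double diff = 0.0, norm = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        double d = c[i] - ref[i];
        diff += d * d;          // squared deviation from the reference
        norm += ref[i] * ref[i]; // squared magnitude of the reference
    }
    return std::sqrt(diff) / std::sqrt(norm);
}
```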
Tests are partitioned into:
- unit tests for kernel-level behavior,
- numerical tests for error/residual behavior,
- integration tests for compatibility and API assumptions.
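For the numerical tests, a fixed tolerance is a poor gate because rounding error in a length-k reduction grows roughly like sqrt(k) * eps for random data. A size-aware acceptance check might look like this (the safety factor of 10 is an illustrative assumption):

```cpp
#include <cmath>

// Size-aware pass/fail gate for a GEMM accuracy test: the tolerance scales
// with the reduction depth k, since accumulated rounding error grows roughly
// like sqrt(k) * eps. The factor 10 is an illustrative safety margin.
bool gemm_error_acceptable(double rel_err, long long k, double eps) {
    double tol = 10.0 * std::sqrt(static_cast<double>(k)) * eps;
    return rel_err <= tol;
}
```

Separate eps values would be passed for FP32, FP16, and BF16 accumulation paths, since each has a different unit roundoff.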
The benchmark layer supports:
- controlled matrix-size sweeps,
- direct comparisons against cuBLAS,
- export to CSV for offline analysis,
- and autotuning search over candidate tile configurations.
This allows apples-to-apples evaluation and progressive optimization.
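The sweep-and-export step can be sketched as follows. The callable stands in for a kernel launch plus device synchronize, and the function name and CSV schema are illustrative assumptions:

```cpp
#include <chrono>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Sketch of a matrix-size sweep with CSV export: time a callable per size,
// derive GFLOP/s from the 2*n^3 FLOP count of square GEMM, and emit one CSV
// row per run. In the real harness, run(n) would launch the kernel and
// synchronize the device before the stop timestamp.
std::string sweep_to_csv(const std::vector<int>& sizes,
                         const std::function<void(int)>& run) {
    std::ostringstream csv;
    csv << "n,seconds,gflops\n";
    for (int n : sizes) {
        auto t0 = std::chrono::steady_clock::now();
        run(n);
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        double gflops = 2.0 * n * n * n / dt.count() / 1e9;
        csv << n << "," << dt.count() << "," << gflops << "\n";
    }
    return csv.str();
}
```

Writing the same schema for both the library's kernels and the cuBLAS baseline is what makes the offline comparison apples-to-apples.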
To make the repository publication and collaboration ready, the following non-code artifacts are now included:
- LICENSE (MIT)
- CHANGELOG.md
- CONTRIBUTING.md
- CODE_OF_CONDUCT.md
- SECURITY.md
- CITATION.cff
- .editorconfig
- GitHub templates and workflow for repository hygiene
These are required for responsible open-source release in academic and industrial contexts.
Known limitations:
- Full GPU test execution requires compatible NVIDIA hardware and a CUDA runtime.
- Hosted CI runners usually lack GPUs; GPU validation is expected on self-hosted runners.
- Some performance claims are hardware-dependent and must be reported with exact system metadata.
Primary technical risks:
- silent numerical drift during aggressive kernel optimization,
- architecture-specific regressions across SM generations,
- host-device copy overhead dominating small-matrix workloads.
Mitigation:
- strict benchmark + accuracy gating,
- architecture-aware test matrix,
- preserving baseline kernels for differential checks.
Planned next steps:
- Add a deterministic benchmark harness with fixed RNG seeds and JSON manifests.
- Add Nsight Compute metric capture scripts per kernel family.
- Expand CI to include reproducibility artifacts on self-hosted GPU runners.
- Add packaging/publishing workflow for Python extension distribution.
The repository now presents a credible v0.1.0 baseline that combines
algorithmic depth, implementation detail, and release governance expected in
serious industrial or PhD-level engineering portfolios.