
Project Report: CUDA Matrix Library

1. Objective

This project develops a CUDA-based matrix operations library that combines:

  • performant dense kernel execution,
  • mixed-precision tensor core acceleration,
  • numerical verification against trusted baselines,
  • and publication-quality engineering practices.

The target is not only high throughput but also transparent, auditable design choices suitable for research use and industrial integration.

2. Technical Scope

Implemented components include:

  • BLAS-like dense operations: GEMM, GEMV, TRSM
  • GEMM kernel variants:
    • naive baseline
    • shared-memory tiled
    • register-blocked
    • tensor-core (FP16 WMMA)
    • tensor-core (BF16 WMMA)
  • Runtime dispatch logic selecting a kernel by matrix size, precision, and compute capability
  • Sparse kernels: CSR SpMV and merge-path SpMV
  • LAPACK wrappers (cuSOLVER): LU, QR, SVD, eigen decomposition
  • Python bindings via pybind11
  • Tests and benchmarks for correctness, numerical behavior, and performance

3. Architecture and Design Principles

3.1 Hierarchical Memory Use

Kernel design follows the GPU memory hierarchy:

  1. Global memory for bulk matrix storage
  2. Shared memory for tile staging
  3. Registers for per-thread accumulation

This reduces global traffic and increases arithmetic intensity.
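The three-level staging above can be illustrated with a host-side analogue: a cache-blocked matrix multiply in which the tile loops stand in for shared-memory staging and a scalar accumulator stands in for a per-thread register. This is an illustrative CPU sketch, not the library's GPU kernel; the tile size is an assumed value.

```cpp
#include <cstddef>
#include <vector>

// CPU analogue of the GPU tiling scheme: the i0/j0/k0 loops play the role
// of shared-memory tile staging, `acc` the role of a register accumulator.
// TILE = 4 is an illustrative choice, not a tuned value.
constexpr std::size_t TILE = 4;

void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t n) {
    for (std::size_t i0 = 0; i0 < n; i0 += TILE)
        for (std::size_t j0 = 0; j0 < n; j0 += TILE)
            for (std::size_t k0 = 0; k0 < n; k0 += TILE)
                // Within a tile, each loaded operand is reused up to TILE
                // times, raising arithmetic intensity over the naive loop.
                for (std::size_t i = i0; i < i0 + TILE && i < n; ++i)
                    for (std::size_t j = j0; j < j0 + TILE && j < n; ++j) {
                        float acc = 0.0f;  // "register" accumulator
                        for (std::size_t k = k0; k < k0 + TILE && k < n; ++k)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += acc;
                    }
}
```

On the GPU, the same structure maps each tile to a thread block and the inner accumulation to per-thread registers.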

3.2 Kernel Specialization

Distinct kernel families are preserved rather than collapsed into one complex meta-kernel. This makes behavior easier to reason about and benchmark.

3.3 Runtime Dispatch

The dispatcher prefers:

  • BF16 tensor core path on SM >= 8.0 for FP32 workloads where safe range and throughput both matter,
  • FP16 tensor core on older tensor-core devices,
  • register-blocked or tiled kernels for unsupported/smaller cases,
  • cuBLAS fallback when appropriate.
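The preference order above can be sketched as a small host-side selector. The SM thresholds follow the text (tensor cores from SM 7.0, BF16 from SM 8.0); the size cutoff and kernel names are illustrative assumptions, not the library's actual dispatcher.

```cpp
#include <cstddef>
#include <string>

// Illustrative dispatch sketch mirroring the stated preference order.
// The size cutoff (256) is an assumed tuning value.
enum class Precision { FP32, FP16 };

std::string select_gemm_kernel(int sm_major, int sm_minor,
                               std::size_t n, Precision p) {
    const int sm = sm_major * 10 + sm_minor;
    const bool tensor_cores = sm >= 70;        // Volta and newer
    if (n < 256 || !tensor_cores)              // small or unsupported cases
        return n < 256 ? "register_blocked" : "tiled";
    if (p == Precision::FP32 && sm >= 80)      // Ampere+: BF16 path for FP32
        return "wmma_bf16";
    if (p == Precision::FP16)                  // older tensor-core devices
        return "wmma_fp16";
    return "cublas_fallback";                  // anything else
}
```

Keeping the decision in one pure function makes the dispatch policy itself unit-testable, independent of any GPU.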

4. Numerical Strategy

4.1 Mixed Precision

BF16 is prioritized on Ampere and newer hardware because its 8-bit exponent matches FP32's, minimizing overflow risk for large-magnitude inputs while maintaining tensor core throughput; the trade-off is a shorter mantissa (7 bits versus FP16's 10).

4.2 Validation Baselines

  • cuBLAS serves as dense reference
  • cuSPARSE/cuSOLVER provide sparse and factorization references
  • relative error metrics are used per operation type
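A per-operation relative error check of the kind described can be sketched as follows; the absolute floor guarding near-zero reference entries is an assumed value.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Maximum elementwise relative error against a reference result, with an
// absolute floor so near-zero reference entries do not inflate the ratio.
// The 1e-6 floor is an illustrative assumption.
float max_rel_error(const std::vector<float>& test,
                    const std::vector<float>& ref,
                    float abs_floor = 1e-6f) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        float denom = std::max(std::fabs(ref[i]), abs_floor);
        worst = std::max(worst, std::fabs(test[i] - ref[i]) / denom);
    }
    return worst;
}
```

Each operation type would then gate on its own tolerance, e.g. tighter for FP32 tiled kernels than for mixed-precision tensor-core paths.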

4.3 Test Philosophy

Tests are partitioned into:

  • unit tests for kernel-level behavior,
  • numerical tests for error/residual behavior,
  • integration tests for compatibility and API assumptions.

5. Performance Methodology

The benchmark layer supports:

  • controlled matrix-size sweeps,
  • direct comparisons against cuBLAS,
  • export to CSV for offline analysis,
  • and autotuning search over candidate tile configurations.

This allows apples-to-apples evaluation and progressive optimization.
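The sweep-and-export flow can be sketched as a host-side loop around any timed GEMM callable. The GFLOP/s formula is the standard dense-GEMM flop count (2·n³); the callable interface and CSV columns are assumptions, not the library's benchmark API.

```cpp
#include <chrono>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Sweep matrix sizes, time an arbitrary GEMM callable, and build CSV rows.
// The column layout is an illustrative assumption.
std::string benchmark_sweep(const std::vector<int>& sizes,
                            const std::function<void(int)>& run_gemm) {
    std::ostringstream csv;
    csv << "n,seconds,gflops\n";
    for (int n : sizes) {
        auto t0 = std::chrono::steady_clock::now();
        run_gemm(n);
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        double gflops = 2.0 * double(n) * n * n / s / 1e9;  // dense GEMM flops
        csv << n << ',' << s << ',' << gflops << '\n';
    }
    return csv.str();
}
```

In a real harness the callable would first warm up, synchronize the device, and average several repetitions before timing is recorded.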

6. Engineering Readiness Enhancements

To make the repository publication- and collaboration-ready, the following non-code artifacts are now included:

  • LICENSE (MIT)
  • CHANGELOG.md
  • CONTRIBUTING.md
  • CODE_OF_CONDUCT.md
  • SECURITY.md
  • CITATION.cff
  • .editorconfig
  • GitHub templates and workflow for repository hygiene

These are standard expectations for a responsible open-source release in academic and industrial contexts.

7. Known Constraints

  • Full GPU test execution requires compatible NVIDIA hardware and CUDA runtime.
  • Hosted CI runners usually lack GPUs; GPU validation is expected on self-hosted runners.
  • Some performance claims are hardware-dependent and must be reported with exact system metadata.

8. Risk Register

Primary technical risks:

  • silent numerical drift during aggressive kernel optimization,
  • architecture-specific regressions across SM generations,
  • host-device copy overhead dominating small-matrix workloads.

Mitigation:

  • strict benchmark + accuracy gating,
  • architecture-aware test matrix,
  • preserving baseline kernels for differential checks.
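The last mitigation amounts to running the optimized and the preserved baseline kernel on identical input and gating on elementwise relative error. A host-side sketch (the tolerance and callable shapes are assumptions):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Differential check: compare an optimized path against the preserved
// baseline on the same input. The 1e-3 tolerance is an assumed bound for
// mixed-precision paths, not the project's actual gate.
bool differential_check(
    const std::function<std::vector<float>(const std::vector<float>&)>& optimized,
    const std::function<std::vector<float>(const std::vector<float>&)>& baseline,
    const std::vector<float>& input, float tol = 1e-3f) {
    const auto a = optimized(input);
    const auto b = baseline(input);
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float denom = std::max(std::fabs(b[i]), 1e-6f);
        if (std::fabs(a[i] - b[i]) / denom > tol) return false;  // drift caught
    }
    return true;
}
```

Run as part of the accuracy gate, this turns silent numerical drift into a hard test failure.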

9. Next Milestones

  1. Add deterministic benchmark harness with fixed RNG seeds and JSON manifests.
  2. Add Nsight Compute metric capture scripts per kernel family.
  3. Expand CI to include reproducibility artifacts on self-hosted GPU runners.
  4. Add packaging/publishing workflow for Python extension distribution.

10. Conclusion

The repository now presents a credible v0.1.0 baseline that combines algorithmic depth, implementation detail, and release governance expected in serious industrial or PhD-level engineering portfolios.