SGEMM Optimization


English | 简体中文

This repository is a CUDA SGEMM case study presented as a technical whitepaper and kernel academy. It starts from readable FP32 baselines, climbs through tiled, bank-conflict-aware, double-buffered, and guarded Tensor Core WMMA kernels, and ties every performance claim to an explicit validation boundary.
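To make the first rung of that ladder concrete, here is a minimal host-side sketch of the tiling idea in plain C++. It is not one of the repository's kernels: it is a CPU analogue in which a small block of A and B is reused while it is hot in cache, much as a CUDA kernel reuses a tile staged in shared memory. The names `sgemm_naive`, `sgemm_tiled`, and `kTile` are hypothetical, and row-major storage is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge, chosen small for illustration only.
constexpr std::size_t kTile = 4;

// Readable baseline: C = A * B with row-major A (M x K), B (K x N), C (M x N).
std::vector<float> sgemm_naive(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    return C;
}

// Tiled variant: the same arithmetic, visited one kTile x kTile block at a time
// so each block of A and B is reused before moving on.
std::vector<float> sgemm_tiled(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += kTile) {
        const std::size_t i1 = std::min(i0 + kTile, M);  // clamp at the matrix edge
        for (std::size_t k0 = 0; k0 < K; k0 += kTile) {
            const std::size_t k1 = std::min(k0 + kTile, K);
            for (std::size_t j0 = 0; j0 < N; j0 += kTile) {
                const std::size_t j1 = std::min(j0 + kTile, N);
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j1; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
        }
    }
    return C;
}
```

The edge clamps let the sketch handle dimensions that are not multiples of the tile size. A GPU kernel has to make the same decision, typically with bounds guards or padding.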

Why it stands out

  • Readable optimization ladder: every kernel stage exists to expose one bottleneck shift.
  • Evidence-first public story: correctness policy, benchmark scope, and local-versus-CI trust boundaries stay attached to every claim.
  • Interview-grade positioning: the Pages site is written so the project can be explained, defended, and audited under technical pressure.
  • Bilingual mirrored docs: English and Chinese routes stay structurally aligned across the full public site.

Quick start

git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build

Runtime tests and benchmarks require a local CUDA-capable machine. Hosted CI validates formatting, CUDA compilation, docs-site checks, route integrity, and Pages buildability.

GitHub Pages entry points

The README is the executive summary. The long-form technical narrative lives on Pages.

Goal                                               Entry point
Open English home                                  English Home
Open Chinese home                                  中文首页
Get oriented quickly                               Project Guide
Inspect system structure                           Architecture
Study the kernel ladder                            Academy
Check what the evidence proves                     Validation
Trace papers and related repos                     Research Desk
Read contributor workflow and validation commands  CONTRIBUTING.md

Validation boundary

Environment      What it can prove
Hosted CI        Formatting, CUDA compilation, docs structure, route integrity, Pages buildability
Local CUDA GPU   Runtime correctness, fallback behavior, benchmark performance

This split is deliberate. CI catches build and repository-surface issues early, but only local GPU execution can validate runtime behavior and speed claims.
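As a rough illustration of what a runtime correctness check involves, the sketch below compares a result matrix against a reference with a mixed absolute and relative tolerance. It rests on several assumptions: the repository's tests compare against cuBLAS, whereas this sketch substitutes a CPU reference so it can run without a GPU; the names `sgemm_reference` and `all_close` are hypothetical; and the tolerances in the usage note are placeholders, not the project's correctness policy.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for row-major C = A * B, accumulating in double to limit
// rounding error in the reference itself. (Stand-in for a cuBLAS result.)
std::vector<float> sgemm_reference(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < K; ++k)
                acc += static_cast<double>(A[i * K + k]) * B[k * N + j];
            C[i * N + j] = static_cast<float>(acc);
        }
    return C;
}

// True if every element satisfies |got - ref| <= atol + rtol * |ref|.
// The negated comparison also rejects NaN in either input.
bool all_close(const std::vector<float>& got, const std::vector<float>& ref,
               float rtol, float atol) {
    assert(got.size() == ref.size());
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float diff = std::fabs(got[i] - ref[i]);
        if (!(diff <= atol + rtol * std::fabs(ref[i]))) return false;
    }
    return true;
}
```

A test would copy the kernel's output back to the host and call something like `all_close(gpu_result, reference, 1e-4f, 1e-5f)`. A reduced-precision path such as a Tensor Core kernel would generally need a looser tolerance than a pure FP32 kernel.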

Source map

src/kernels/   CUDA SGEMM implementations
src/utils/     CUDA RAII, verification, benchmark helpers
src/main.cu    benchmark CLI
tests/         Google Test coverage against cuBLAS
docs/          VitePress whitepaper and academy, mirrored under /en and /zh

License

MIT. See LICENSE.md.