SGEMM Optimization

English | 简体中文

This repository is a CUDA SGEMM case study presented as a technical whitepaper and kernel academy. It starts from readable FP32 baselines, climbs through tiled, bank-conflict-aware, double-buffered, and guarded Tensor Core WMMA paths, and frames every performance claim with explicit validation boundaries.

Why it stands out

Readable optimization ladder: every kernel stage exists to expose one bottleneck shift.
Evidence-first public story: correctness policy, benchmark scope, and local-versus-CI trust boundaries stay attached to every claim.
Interview-grade positioning: the Pages site is written so the project can be explained, defended, and audited under technical pressure.
Bilingual mirrored docs: English and Chinese routes stay structurally aligned across the full public site.

Quick start

git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build

Runtime tests and benchmarks require a local CUDA-capable machine. Hosted CI validates formatting, CUDA compilation, docs-site checks, route integrity, and Pages buildability.

GitHub Pages entry points

The README is the executive summary. The long-form technical narrative lives on Pages.

Goal	Entry point
Open English home	English Home
Open Chinese home	中文首页
Get oriented quickly	Project Guide
Inspect system structure	Architecture
Study the kernel ladder	Academy
Check what the evidence proves	Validation
Trace papers and related repos	Research Desk
Read contributor workflow and validation commands	CONTRIBUTING.md

Validation boundary

Environment	What it can prove
Hosted CI	Formatting, CUDA compilation, docs structure, route integrity, Pages buildability
Local CUDA GPU	Runtime correctness, fallback behavior, benchmark performance

This split is deliberate. CI catches build and repository-surface issues early, but only local GPU execution can validate runtime behavior and speed claims.

Source map

src/kernels/   CUDA SGEMM implementations
src/utils/     CUDA RAII, verification, benchmark helpers
src/main.cu    benchmark CLI
tests/         Google Test coverage against cuBLAS
docs/          VitePress whitepaper and academy, mirrored under /en and /zh

License

MIT. See LICENSE.md.
