# GPU-Based Matrix Operations

High-performance CUDA/C++ implementation of matrix-vector and matrix-matrix operations with a focus on **memory access patterns**, **shared-memory optimization**, and **GPU vs CPU benchmarking**.

---

| File | Purpose |
|------|---------|
| `src/matvec_kernels.cu` | Three MatVec kernels: naive, shared-memory, warp-coalesced |
| `src/matmul_kernels.cu` | Three MatMul kernels: naive, tiled (TILE=16), variable-tile sweep |
| `src/cpu_ops.cpp` | Scalar CPU reference (ikj loop order) |
| `src/main.cu` | Benchmark driver with GFLOPS, speedup, and correctness checks |
| `include/matrix_ops.cuh` | Kernel declarations, GPU timer, and CUDA error macro |
| `include/cpu_ops.h` | CPU operation declarations |
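
The ikj loop order used by the CPU reference keeps the innermost loop streaming over contiguous rows of `B` and `C`, which is cache-friendly for row-major storage. A minimal sketch of the idea (hypothetical function name, not the exact code in `src/cpu_ops.cpp`):

```cpp
#include <vector>

// Sketch of an ikj-ordered matmul. Hoisting k above j means the inner j loop
// walks B and C row-wise (unit stride), and A[i*K + k] is loaded once and
// reused across an entire row of B.
void matmul_ikj(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int M, int K, int N) {
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k) {
            float a = A[i * K + k];  // reused across the whole row of B
            for (int j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
}
```

Compared with the textbook ijk order, no loop here strides down a column, which is why this variant is the usual scalar baseline.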

---

## Key Techniques Demonstrated

### Memory Access Patterns
- **Naive global-memory access**: baseline, every read hits DRAM
- **Coalesced access**: 2-D warp layout so threads read consecutive columns in 128-byte transactions
- **Shared-memory tiling**: sub-matrices loaded cooperatively, cutting global reads by a factor of `TILE_SIZE`
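
The tiling idea is easiest to see host-side as cache blocking; the real kernels stage each tile in `__shared__` memory, but the reuse argument is identical. An illustrative sketch, assuming square matrices with `n` divisible by `TILE`:

```cpp
#include <vector>

// CPU analogue of shared-memory tiling: C is built block by block, so each
// TILE x TILE sub-matrix of A and B is reused TILE times while it is hot in
// cache (on the GPU, while it sits in __shared__ memory).
constexpr int TILE = 16;

void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n) {  // n divisible by TILE
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE; ++i)
                    for (int k = kk; k < kk + TILE; ++k)
                        for (int j = jj; j < jj + TILE; ++j)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Each element of `A` and `B` is fetched from main memory roughly `n / TILE` times instead of `n` times, which is the same factor-of-`TILE_SIZE` saving claimed above.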

### Thread-Block Optimizations
- 1-D blocks (256 threads) for MatVec
- 2-D blocks (TILE × TILE) for MatMul
- Warp-shuffle (`__shfl_down_sync`) for reduction without shared memory
- `#pragma unroll` on inner tile loops
- Tile-size sweep (8×8, 16×16, 32×32) to find the occupancy sweet spot
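
A CPU model of the shuffle reduction may clarify the pattern: each step folds the upper half of the active lanes onto the lower half, so the warp-wide sum lands in lane 0 after log2(32) = 5 steps. This models only the data movement; on the device the exchange happens in registers via `__shfl_down_sync`:

```cpp
#include <array>

// Model of a warp-shuffle sum reduction over one 32-lane warp. At each step,
// lane i (for i < offset) adds the value held by lane i + offset, halving the
// active width until lane 0 holds the total. No shared memory is touched.
float warp_reduce_model(std::array<float, 32> lanes) {
    for (int offset = 16; offset > 0; offset /= 2)
        for (int lane = 0; lane < offset; ++lane)
            lanes[lane] += lanes[lane + offset];
    return lanes[0];
}
```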

### Benchmarking
- CUDA events for microsecond-accurate GPU timing
- Multiple warm-up runs followed by averaged timed runs
- GFLOPS = `2·M·K·N / (time_s × 10⁹)` (with N = 1 for MatVec)
- `max_abs_diff` correctness check against the CPU reference
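
The formula and the check can be written as small standalone helpers (hypothetical names, not the driver's actual API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// GFLOPS for an M x K by K x N multiply: one multiply and one add per
// inner-product term gives 2*M*K*N floating-point operations.
double gflops(double M, double K, double N, double time_s) {
    return 2.0 * M * K * N / (time_s * 1e9);
}

// Correctness metric: largest absolute element-wise difference between a
// GPU result and the CPU reference.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        m = std::max(m, std::fabs(a[i] - b[i]));
    return m;
}
```

A small nonzero `max_abs_diff` is expected in float32 because GPU and CPU sum in different orders; only a large discrepancy signals a broken kernel.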

---

## Requirements

| Requirement | Minimum version |
|-------------|-----------------|
| CMake *(optional)* | 3.18 |

Find your GPU's SM version:

```bash
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```

If `nvidia-smi` does not expose `compute_cap` on your system, look up your GPU model's compute capability in NVIDIA's documentation.

### Quick Toolchain Verification

```bash
nvcc --version
cmake --version
gh --version
```

If any command is not recognized, add the corresponding `bin` directory to your `PATH` and restart your terminal.

---

## Build and Run

### Option A: Makefile (simplest)

```bash
# Replace 86 with your GPU's SM version (e.g. 75 for Turing, 89 for Ada)
make SM=86
make run SM=86

# Debug build
make DEBUG=1 SM=86

# Profiling build (use with Nsight Compute)
make profile SM=86
ncu --set full ./bin/gpu_matrix_ops
```

### Option B: CMake

```bash
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCUDA_ARCH="86"
cmake --build . --config Release -j
./bin/gpu_matrix_ops
```

---

## Reproducibility Checklist

When reporting performance results, include:
- GPU model and VRAM
- CUDA toolkit and driver versions
- target SM architecture (`SM=...` or `-DCUDA_ARCH=...`)
- matrix dimensions and iteration counts
- whether fast-math was enabled

Save logs under `results/` (see `results/README.md`) so measurements can be audited later.

---

## Expected Output (RTX 3080, SM86)

```text
GPU-Based Matrix Operations - Benchmark Suite

GPU  : NVIDIA GeForce RTX 3080
SMs  : 68 | Clock : 1.71 GHz | Mem BW: ~760 GB/s
VRAM : 10240 MB | Shared mem/block: 48 KB

MATRIX-VECTOR (y = A · x)
Matrix-Vector (8192 × 8192) × (8192 × 1)
  CPU (scalar)        47.30 ms     1.16 GFLOPS
  GPU naive            0.09 ms   611.00 GFLOPS   speedup 525.6x   err 0.00e+00
  GPU shared-mem       0.08 ms   702.10 GFLOPS   speedup 591.2x   err 0.00e+00
  GPU coalesced        0.06 ms   890.30 GFLOPS   speedup 788.5x   err 0.00e+00

MATRIX-MATRIX (C = A · B)
Matrix-Matrix (2048 × 2048) · (2048 × 2048)
  CPU (scalar ikj)  4821.00 ms     3.56 GFLOPS
  GPU naive           22.40 ms   765.10 GFLOPS   speedup 215.2x
  GPU tiled (16×16)    4.10 ms  4183.20 GFLOPS   speedup 1175.9x
  GPU tiled (8×8)      6.90 ms  2488.50 GFLOPS
  GPU tiled (16×16)    4.10 ms  4183.20 GFLOPS   ← sweet spot
  GPU tiled (32×32)    5.60 ms  3064.40 GFLOPS   (register pressure rises)
```

Actual numbers vary by GPU model, driver version, and power limits.

---

## Project Structure

```text
gpu-matrix-ops/
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── CITATION.cff
├── .gitignore
├── .gitattributes
├── CMakeLists.txt
├── Makefile
├── include/
│   ├── matrix_ops.cuh      ← CUDA headers, GpuTimer, CUDA_CHECK macro
│   └── cpu_ops.h           ← CPU reference declarations
├── src/
│   ├── main.cu             ← benchmark driver
│   ├── matvec_kernels.cu   ← naive / shared / coalesced MatVec kernels
│   ├── matmul_kernels.cu   ← naive / tiled / variable-tile MatMul kernels
│   └── cpu_ops.cpp         ← scalar CPU implementations
├── docs/
│   └── OPTIMIZATION_NOTES.md
└── results/
    └── README.md
```

---

## Profiling

```bash
ncu --set full --target-processes all ./bin/gpu_matrix_ops

# Legacy nvprof
nvprof --print-gpu-trace ./bin/gpu_matrix_ops

# Example metrics to watch
ncu --metrics \
    l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second,\
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum \
    ./bin/gpu_matrix_ops
```

---

## License

MIT. Free to use, modify, and redistribute. See [`LICENSE`](LICENSE).