Commit 266e8da

docs: clarify prerequisites and normalize benchmark docs
1 parent fa4a9e2

2 files changed

Lines changed: 104 additions & 103 deletions


README.md

Lines changed: 85 additions & 81 deletions
@@ -1,6 +1,6 @@
-# GPU-Based Matrix Operations
+# GPU-Based Matrix Operations
 
-High-performance CUDA/C++ implementation of matrix–vector and matrix–matrix operations with a focus on **memory access patterns**, **shared-memory optimisation**, and **GPU vs CPU benchmarking**.
+High-performance CUDA/C++ implementation of matrix-vector and matrix-matrix operations with a focus on memory access patterns, shared-memory optimization, and GPU vs CPU benchmarking.
 
 ---
 
@@ -11,31 +11,31 @@ High-performance CUDA/C++ implementation of matrix–vector and matrix–matrix
 | `src/matvec_kernels.cu` | Three MatVec kernels: naive, shared-memory, warp-coalesced |
 | `src/matmul_kernels.cu` | Three MatMul kernels: naive, tiled (TILE=16), variable-tile sweep |
 | `src/cpu_ops.cpp` | Scalar CPU reference (ikj loop order) |
-| `src/main.cu` | Benchmark driver — GFLOPS, speedup, correctness check |
+| `src/main.cu` | Benchmark driver with GFLOPS, speedup, and correctness checks |
 | `include/matrix_ops.cuh` | Kernel declarations + GPU timer + CUDA error macro |
-| `include/cpu_ops.h` | CPU op declarations |
+| `include/cpu_ops.h` | CPU operation declarations |
 
 ---
 
 ## Key Techniques Demonstrated
 
 ### Memory Access Patterns
-- **Naive global-memory access** — baseline, every read hits DRAM
-- **Coalesced access** — 2-D warp layout so threads read consecutive columns → 128-byte transactions
-- **Shared memory tiling** — sub-matrices loaded cooperatively; reduces global reads by `TILE_SIZE`×
+- Naive global-memory access (baseline)
+- Coalesced access with 2-D warp layout for contiguous reads
+- Shared-memory tiling to reduce redundant global reads
 
-### Thread-Block Optimisations
+### Thread-Block Optimizations
 - 1-D blocks (256 threads) for MatVec
-- 2-D blocks (TILE×TILE) for MatMul
+- 2-D blocks (TILE x TILE) for MatMul
 - Warp-shuffle (`__shfl_down_sync`) for reduction without shared memory
-- `#pragma unroll` on inner tile loops
-- Tile-size sweep (8×8, 16×16, 32×32) to find the occupancy sweet spot
+- `#pragma unroll` in tile inner loops
+- Tile-size sweep (8x8, 16x16, 32x32)
 
 ### Benchmarking
-- CUDA events for microsecond-accurate GPU timing
-- Multiple warm-up runs then averaged timed runs
-- GFLOPS = `2·M·K·N / (time_s × 10⁹)`
-- `max_abs_diff` correctness check vs CPU reference
+- CUDA events for precise GPU timing
+- Warm-up runs + averaged timed runs
+- GFLOPS formulas for MatVec/MatMul
+- `max_abs_diff` correctness checks versus CPU reference
 
 ---
 
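The coalesced access pattern and warp-shuffle reduction listed in the hunk above can be sketched roughly as follows. This is an illustrative kernel written for this note, not the repository's actual `matvec_kernels.cu`; the name, launch configuration (one warp per output row, `blockDim.x` a multiple of 32), and row-major layout are assumptions:

```cuda
// Illustrative sketch only -- the repository's real kernel may differ.
// One warp per output row: lane k reads columns k, k+32, k+64, ..., so
// consecutive lanes touch consecutive addresses (coalesced 128-byte loads).
// Partial sums are then combined with __shfl_down_sync, no shared memory.
__global__ void matvec_warp_coalesced(const float* A, const float* x,
                                      float* y, int M, int N) {
    int warps_per_block = blockDim.x / 32;
    int row  = blockIdx.x * warps_per_block + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= M) return;  // whole warp shares `row`, so the full mask below is safe

    float sum = 0.0f;
    for (int col = lane; col < N; col += 32)
        sum += A[row * N + col] * x[col];

    // Warp-level tree reduction: no shared memory, no __syncthreads()
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0) y[row] = sum;
}
```

A naive variant would instead have each thread walk an entire row alone, so neighboring threads stride `N` floats apart and every load becomes its own memory transaction.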
@@ -49,116 +49,120 @@ High-performance CUDA/C++ implementation of matrix–vector and matrix–matrix
 | CMake *(optional)* | 3.18 |
 
 Find your GPU's SM version:
+
 ```bash
 nvidia-smi --query-gpu=compute_cap --format=csv,noheader
 ```
 
+If `nvidia-smi` does not expose `compute_cap` on your system, check your GPU model and map it to compute capability from NVIDIA documentation.
+
+### Quick Toolchain Verification
+
+```bash
+nvcc --version
+cmake --version
+gh --version
+```
+
+If any command is not recognized, add the corresponding `bin` directory to your `PATH` and restart your terminal.
+
 ---
 
-## Build & Run
+## Build and Run
 
-### Option A — Makefile (simplest)
+### Option A: Makefile
 
 ```bash
-# Replace 86 with your GPU's SM version (e.g. 75 for Turing, 89 for Ada)
+# Replace 86 with your GPU's SM version (for example 75, 86, or 89)
 make SM=86
 make run SM=86
 
 # Debug build
 make DEBUG=1 SM=86
 
-# Profiling build (use with Nsight Compute)
+# Profiling build
 make profile SM=86
 ncu --set full ./bin/gpu_matrix_ops
 ```
 
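For reference, the `SM=86` variable presumably just feeds `nvcc`'s `-arch` flag; an equivalent one-shot manual compile might look like the sketch below. The flags and include path are assumptions for illustration, not taken from the repository's actual Makefile:

```bash
# Hypothetical manual compile -- the Makefile/CMake normally handle this.
nvcc -O3 -arch=sm_86 \
     -Iinclude \
     src/main.cu src/matvec_kernels.cu src/matmul_kernels.cu src/cpu_ops.cpp \
     -o bin/gpu_matrix_ops
```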
-### Option B — CMake
+### Option B: CMake
 
 ```bash
 mkdir build && cd build
 cmake .. -DCMAKE_BUILD_TYPE=Release -DCUDA_ARCH="86"
-make -j$(nproc)
+cmake --build . --config Release -j
 ./bin/gpu_matrix_ops
 ```
 
 ---
 
 ## Reproducibility Checklist
 
-When sharing performance numbers, include:
+When reporting performance results, include:
 - GPU model and VRAM
-- CUDA toolkit + driver version
-- target SM architecture (`SM=..` or `-DCUDA_ARCH=..`)
+- CUDA toolkit and driver versions
+- target SM architecture (`SM=...` or `-DCUDA_ARCH=...`)
 - matrix dimensions and iteration counts
 - whether fast-math was enabled
 
-Save logs under `results/` (see `results/README.md`) so measurements can be audited later.
+Save logs under `results/` (see `results/README.md`) for auditability.
 
 ---
 
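One hedged way to capture that checklist automatically is a small snapshot script like the following. The file name and layout are illustrative choices made here; the `nvidia-smi --query-gpu` fields are standard, and the fallback messages are this sketch's own:

```shell
#!/bin/sh
# Illustrative helper: snapshot the reproducibility checklist into results/.
mkdir -p results
log="results/env_$(date +%Y%m%d_%H%M%S).txt"
{
  echo "== Host =="
  uname -srm
  echo "== CUDA toolkit =="
  nvcc --version 2>/dev/null || echo "nvcc not found"
  echo "== GPU / driver =="
  nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap \
             --format=csv 2>/dev/null || echo "nvidia-smi not found"
} > "$log"
echo "wrote $log"
```

On machines without a CUDA toolchain the script still produces a log, just with "not found" placeholders, so it is safe to run unconditionally before each benchmark.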
 ## Expected Output (RTX 3080, SM86)
 
-```
-════════════════════════════════════════════════════════════════════════
-  GPU-Based Matrix Operations — Benchmark Suite
-════════════════════════════════════════════════════════════════════════
-
-GPU  : NVIDIA GeForce RTX 3080
-SMs  : 68 | Clock : 1.71 GHz | Mem BW: ~760 GB/s
-VRAM : 10240 MB | Shared mem/block: 48 KB
-
-════════════════════════════════════════════════════════════════════════
-  MATRIX-VECTOR (y = A · x)
-════════════════════════════════════════════════════════════════════════
-
-Matrix-Vector (8192 × 8192) × (8192 × 1)
-────────────────────────────────────────────────────────────────────────
-CPU (scalar)       47.30 ms      1.16 GFLOPS
-GPU naive           0.09 ms    611.00 GFLOPS   speedup 525.6x   err 0.00e+00
-GPU shared-mem      0.08 ms    702.10 GFLOPS   speedup 591.2x   err 0.00e+00
-GPU coalesced       0.06 ms    890.30 GFLOPS   speedup 788.5x   err 0.00e+00
-
-════════════════════════════════════════════════════════════════════════
-  MATRIX-MATRIX (C = A · B)
-════════════════════════════════════════════════════════════════════════
-
-Matrix-Matrix (2048 × 2048) · (2048 × 2048)
-────────────────────────────────────────────────────────────────────────
-CPU (scalar ikj)  4821.00 ms      3.56 GFLOPS
-GPU naive           22.40 ms    765.10 GFLOPS   speedup 215.2x
-GPU tiled (16×16)    4.10 ms   4183.20 GFLOPS   speedup 1175.9x
-GPU tiled (8×8)      6.90 ms   2488.50 GFLOPS
-GPU tiled (16×16)    4.10 ms   4183.20 GFLOPS   ← sweet spot
-GPU tiled (32×32)    5.60 ms   3064.40 GFLOPS   (register pressure ↑)
+```text
+GPU-Based Matrix Operations - Benchmark Suite
+
+GPU  : NVIDIA GeForce RTX 3080
+SMs  : 68 | Clock : 1.71 GHz | Mem BW: ~760 GB/s
+VRAM : 10240 MB | Shared mem/block: 48 KB
+
+MATRIX-VECTOR (y = A * x)
+Matrix-Vector (8192 x 8192) x (8192 x 1)
+CPU (scalar)       47.30 ms      1.16 GFLOPS
+GPU naive           0.09 ms    611.00 GFLOPS   speedup 525.6x   err 0.00e+00
+GPU shared-mem      0.08 ms    702.10 GFLOPS   speedup 591.2x   err 0.00e+00
+GPU coalesced       0.06 ms    890.30 GFLOPS   speedup 788.5x   err 0.00e+00
+
+MATRIX-MATRIX (C = A * B)
+Matrix-Matrix (2048 x 2048) * (2048 x 2048)
+CPU (scalar ikj)  4821.00 ms      3.56 GFLOPS
+GPU naive          22.40 ms    765.10 GFLOPS   speedup 215.2x
+GPU tiled (16x16)   4.10 ms   4183.20 GFLOPS   speedup 1175.9x
+GPU tiled (8x8)     6.90 ms   2488.50 GFLOPS
+GPU tiled (16x16)   4.10 ms   4183.20 GFLOPS   <- sweet spot
+GPU tiled (32x32)   5.60 ms   3064.40 GFLOPS   (register pressure up)
 ```
 
-*(Actual numbers vary by GPU model and driver version.)*
+Actual numbers vary by GPU model, driver version, and power limits.
 
 ---
 
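The 16x16 "sweet spot" rows in the sample output above come from a shared-memory tiled kernel; a rough sketch of that technique follows. This is illustrative code written for this note (it assumes square row-major matrices with `N` divisible by `TILE` and a `(N/TILE, N/TILE)` grid of `TILE x TILE` blocks), not the repository's actual `matmul_kernels.cu`:

```cuda
#define TILE 16

// Illustrative shared-memory tiled matmul (C = A * B, N x N, row-major).
// Each block computes a TILE x TILE patch of C. Tiles of A and B are staged
// in shared memory so each global element is read once per tile instead of
// once per output element, raising arithmetic intensity by ~TILE.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: one element of each tile per thread
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Raising `TILE` to 32 quadruples shared-memory use per block and increases register pressure, which is consistent with the 32x32 row losing to 16x16 in the sweep above.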
 ## Project Structure
 
-```
+```text
 gpu-matrix-ops/
-├── README.md
-├── LICENSE
-├── CONTRIBUTING.md
-├── CITATION.cff
-├── .gitignore
-├── CMakeLists.txt
-├── Makefile
-├── include/
-│   ├── matrix_ops.cuh     ← CUDA headers, GpuTimer, CUDA_CHECK macro
-│   └── cpu_ops.h          ← CPU reference declarations
-├── src/
-│   ├── main.cu            ← benchmark driver
-│   ├── matvec_kernels.cu  ← naive / shared / coalesced MatVec kernels
-│   ├── matmul_kernels.cu  ← naive / tiled / variable-tile MatMul kernels
-│   └── cpu_ops.cpp        ← scalar CPU implementations
-├── docs/
-│   └── OPTIMIZATION_NOTES.md
-└── results/
-    └── README.md
+|-- README.md
+|-- LICENSE
+|-- CONTRIBUTING.md
+|-- CITATION.cff
+|-- .gitignore
+|-- .gitattributes
+|-- CMakeLists.txt
+|-- Makefile
+|-- include/
+|   |-- matrix_ops.cuh
+|   `-- cpu_ops.h
+|-- src/
+|   |-- main.cu
+|   |-- matvec_kernels.cu
+|   |-- matmul_kernels.cu
+|   `-- cpu_ops.cpp
+|-- docs/
+|   `-- OPTIMIZATION_NOTES.md
+`-- results/
+    `-- README.md
 ```
 
 ---
@@ -172,7 +176,7 @@ ncu --set full --target-processes all ./bin/gpu_matrix_ops
 # Legacy nvprof
 nvprof --print-gpu-trace ./bin/gpu_matrix_ops
 
-# Metrics to watch
+# Example metrics
 ncu --metrics \
 l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second,\
 smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,\
@@ -182,6 +186,6 @@ ncu --metrics \
 
 ---
 
-## Licence
+## License
 
-MIT — free to use, modify, and redistribute. See [`LICENSE`](LICENSE).
+MIT - free to use, modify, and redistribute. See `LICENSE`.

results/README.md

Lines changed: 19 additions & 22 deletions
@@ -1,46 +1,43 @@
-# Results
+# Results
 
 This directory stores benchmark output logs.
 
-## How to capture results
+## How to Capture Results
 
 ```bash
 # Pipe stdout to a timestamped log
 ./bin/gpu_matrix_ops | tee results/$(date +%Y%m%d_%H%M%S)_benchmark.txt
 
-# Or with nvidia-smi alongside
+# Or capture GPU telemetry in parallel
 nvidia-smi dmon -s pucvmet -d 1 > results/gpu_power.log &
 ./bin/gpu_matrix_ops > results/bench.txt
 kill %1
 ```
 
-## Columns explained
+## Columns Explained
 
 | Column | Meaning |
 |--------|---------|
 | `ms` | Average kernel time over N iterations (CUDA events) |
-| `GFLOPS` | `2·ops / (time_s · 1e9)` effective throughput |
-| `speedup` | `cpu_ms / gpu_ms` — how much faster than scalar CPU |
-| `err` | `max(|gpu[i] - cpu[i]|)` numerical correctness check |
+| `GFLOPS` | `2*ops / (time_s * 1e9)` effective throughput |
+| `speedup` | `cpu_ms / gpu_ms` relative to scalar CPU |
+| `err` | `max(|gpu[i] - cpu[i]|)` numerical correctness check |
 
-## What to look for
+## What to Look For
 
-- **MatVec coalesced** should beat **MatVec naive** by 1.3–2× on modern GPUs
-  (coalescing saves ~50 % of bandwidth for the x-vector reads).
-- **MatMul tiled 16×16** should beat **MatMul naive** by 5–20× (arithmetic
-  intensity rises from 1 to 16 FMA/float).
-- **Tile-size sweep:** 16×16 typically wins; 32×32 usually loses due to
-  register pressure reducing occupancy.
-- **Speedup vs CPU** of 100–1000× for large matrices is expected.
+- MatVec coalesced should beat MatVec naive by about 1.3x to 2x on modern GPUs.
+- MatMul tiled 16x16 should beat MatMul naive by about 5x to 20x.
+- Tile-size sweep usually favors 16x16; 32x32 may lose due to register pressure.
+- Speedups versus scalar CPU can reach 100x to 1000x for large matrices.
 
-## Sample log (RTX 3080, 2048×2048 matmul)
+## Sample Log (RTX 3080, 2048x2048 matmul)
 
-```
+```text
 CPU (scalar ikj)  4821.00 ms      3.56 GFLOPS
 GPU naive           22.40 ms    765.10 GFLOPS   speedup 215.2x
-GPU tiled (16×16)    4.10 ms   4183.20 GFLOPS   speedup 1175.9x
-── Tile-size sweep ──
-GPU tiled ( 8)       6.90 ms   2488.50 GFLOPS
-GPU tiled (16×16)    4.10 ms   4183.20 GFLOPS   sweet spot
-GPU tiled (32×32)    5.60 ms   3064.40 GFLOPS
+GPU tiled (16x16)    4.10 ms   4183.20 GFLOPS   speedup 1175.9x
+-- Tile-size sweep --
+GPU tiled ( 8x 8)    6.90 ms   2488.50 GFLOPS
+GPU tiled (16x16)    4.10 ms   4183.20 GFLOPS   <- sweet spot
+GPU tiled (32x32)    5.60 ms   3064.40 GFLOPS
 ```
