Commit 266e8da

docs: clarify prerequisites and normalize benchmark docs
1 parent fa4a9e2

2 files changed

Lines changed: 104 additions & 103 deletions


README.md

Lines changed: 85 additions & 81 deletions
@@ -1,6 +1,6 @@
-# GPU-Based Matrix Operations
+# GPU-Based Matrix Operations
 
-High-performance CUDA/C++ implementation of matrix–vector and matrix–matrix operations with a focus on **memory access patterns**, **shared-memory optimisation**, and **GPU vs CPU benchmarking**.
+High-performance CUDA/C++ implementation of matrix-vector and matrix-matrix operations with a focus on memory access patterns, shared-memory optimization, and GPU vs CPU benchmarking.
 
 ---
 
@@ -11,31 +11,31 @@ High-performance CUDA/C++ implementation of matrix–vector and matrix–matrix
 | `src/matvec_kernels.cu` | Three MatVec kernels: naive, shared-memory, warp-coalesced |
 | `src/matmul_kernels.cu` | Three MatMul kernels: naive, tiled (TILE=16), variable-tile sweep |
 | `src/cpu_ops.cpp` | Scalar CPU reference (ikj loop order) |
-| `src/main.cu` | Benchmark driver — GFLOPS, speedup, correctness check |
+| `src/main.cu` | Benchmark driver with GFLOPS, speedup, and correctness checks |
 | `include/matrix_ops.cuh` | Kernel declarations + GPU timer + CUDA error macro |
-| `include/cpu_ops.h` | CPU op declarations |
+| `include/cpu_ops.h` | CPU operation declarations |
 
 ---
 
 ## Key Techniques Demonstrated
 
 ### Memory Access Patterns
-- **Naive global-memory access** — baseline, every read hits DRAM
-- **Coalesced access** — 2-D warp layout so threads read consecutive columns → 128-byte transactions
-- **Shared memory tiling** — sub-matrices loaded cooperatively; reduces global reads by `TILE_SIZE`×
+- Naive global-memory access (baseline)
+- Coalesced access with 2-D warp layout for contiguous reads
+- Shared-memory tiling to reduce redundant global reads
 
-### Thread-Block Optimisations
+### Thread-Block Optimizations
 - 1-D blocks (256 threads) for MatVec
-- 2-D blocks (TILE×TILE) for MatMul
+- 2-D blocks (TILE x TILE) for MatMul
 - Warp-shuffle (`__shfl_down_sync`) for reduction without shared memory
-- `#pragma unroll` on inner tile loops
-- Tile-size sweep (8×8, 16×16, 32×32) to find the occupancy sweet spot
+- `#pragma unroll` in tile inner loops
+- Tile-size sweep (8x8, 16x16, 32x32)
 
 ### Benchmarking
-- CUDA events for microsecond-accurate GPU timing
-- Multiple warm-up runs then averaged timed runs
-- GFLOPS = `2·M·K·N / (time_s × 10⁹)`
-- `max_abs_diff` correctness check vs CPU reference
+- CUDA events for precise GPU timing
+- Warm-up runs + averaged timed runs
+- GFLOPS formulas for MatVec/MatMul
+- `max_abs_diff` correctness checks versus CPU reference
 
 ---
 
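The coalesced access pattern and warp-shuffle reduction listed in the hunk above can be sketched roughly as follows. This is an illustrative kernel written for this note, not the repository's actual `matvec_kernels.cu`; the name, launch configuration (one warp per output row, `blockDim.x` a multiple of 32), and row-major layout are assumptions:

```cuda
// Illustrative sketch only -- the repository's real kernel may differ.
// One warp per output row: lane k reads columns k, k+32, k+64, ..., so
// consecutive lanes touch consecutive addresses (coalesced 128-byte loads).
// Partial sums are then combined with __shfl_down_sync, no shared memory.
__global__ void matvec_warp_coalesced(const float* A, const float* x,
                                      float* y, int M, int N) {
    int warps_per_block = blockDim.x / 32;
    int row  = blockIdx.x * warps_per_block + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= M) return;  // whole warp shares `row`, so the full mask below is safe

    float sum = 0.0f;
    for (int col = lane; col < N; col += 32)
        sum += A[row * N + col] * x[col];

    // Warp-level tree reduction: no shared memory, no __syncthreads()
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0) y[row] = sum;
}
```

A naive variant would instead have each thread walk an entire row alone, so neighboring threads stride `N` floats apart and every load becomes its own memory transaction.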
@@ -49,116 +49,120 @@ High-performance CUDA/C++ implementation of matrix–vector and matrix–matrix
 | CMake *(optional)* | 3.18 |
 
 Find your GPU's SM version:
+
 ```bash
 nvidia-smi --query-gpu=compute_cap --format=csv,noheader
 ```
 
+If `nvidia-smi` does not expose `compute_cap` on your system, check your GPU model and map it to compute capability from NVIDIA documentation.
+
+### Quick Toolchain Verification
+
+```bash
+nvcc --version
+cmake --version
+gh --version
+```
+
+If any command is not recognized, add the corresponding `bin` directory to your `PATH` and restart your terminal.
+
 ---
 
-## Build & Run
+## Build and Run
 
-### Option A — Makefile (simplest)
+### Option A: Makefile
 
 ```bash
-# Replace 86 with your GPU's SM version (e.g. 75 for Turing, 89 for Ada)
+# Replace 86 with your GPU's SM version (for example 75, 86, or 89)
 make SM=86
 make run SM=86
 
 # Debug build
 make DEBUG=1 SM=86
 
-# Profiling build (use with Nsight Compute)
+# Profiling build
 make profile SM=86
 ncu --set full ./bin/gpu_matrix_ops
 ```
 
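For reference, the `SM=86` variable presumably just feeds `nvcc`'s `-arch` flag; an equivalent one-shot manual compile might look like the sketch below. The flags and include path are assumptions for illustration, not taken from the repository's actual Makefile:

```bash
# Hypothetical manual compile -- the Makefile/CMake normally handle this.
nvcc -O3 -arch=sm_86 \
     -Iinclude \
     src/main.cu src/matvec_kernels.cu src/matmul_kernels.cu src/cpu_ops.cpp \
     -o bin/gpu_matrix_ops
```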
-### Option B — CMake
+### Option B: CMake
 
 ```bash
 mkdir build && cd build
 cmake .. -DCMAKE_BUILD_TYPE=Release -DCUDA_ARCH="86"
-make -j$(nproc)
+cmake --build . --config Release -j
 ./bin/gpu_matrix_ops
 ```
 
 ---
 
 ## Reproducibility Checklist
 
-When sharing performance numbers, include:
+When reporting performance results, include:
 - GPU model and VRAM
-- CUDA toolkit + driver version
-- target SM architecture (`SM=..` or `-DCUDA_ARCH=..`)
+- CUDA toolkit and driver versions
+- target SM architecture (`SM=...` or `-DCUDA_ARCH=...`)
 - matrix dimensions and iteration counts
 - whether fast-math was enabled
 
-Save logs under `results/` (see `results/README.md`) so measurements can be audited later.
+Save logs under `results/` (see `results/README.md`) for auditability.
 
 ---
 
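One hedged way to capture that checklist automatically is a small snapshot script like the following. The file name and layout are illustrative choices made here; the `nvidia-smi --query-gpu` fields are standard, and the fallback messages are this sketch's own:

```shell
#!/bin/sh
# Illustrative helper: snapshot the reproducibility checklist into results/.
mkdir -p results
log="results/env_$(date +%Y%m%d_%H%M%S).txt"
{
  echo "== Host =="
  uname -srm
  echo "== CUDA toolkit =="
  nvcc --version 2>/dev/null || echo "nvcc not found"
  echo "== GPU / driver =="
  nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap \
             --format=csv 2>/dev/null || echo "nvidia-smi not found"
} > "$log"
echo "wrote $log"
```

On machines without a CUDA toolchain the script still produces a log, just with "not found" placeholders, so it is safe to run unconditionally before each benchmark.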
 ## Expected Output (RTX 3080, SM86)
 
-```
-════════════════════════════════════════════════════════════════════════
-  GPU-Based Matrix Operations — Benchmark Suite
-════════════════════════════════════════════════════════════════════════
-
-GPU  : NVIDIA GeForce RTX 3080
-SMs  : 68 | Clock : 1.71 GHz | Mem BW: ~760 GB/s
-VRAM : 10240 MB | Shared mem/block: 48 KB
-
-════════════════════════════════════════════════════════════════════════
-  MATRIX-VECTOR (y = A · x)
-════════════════════════════════════════════════════════════════════════
-
-Matrix-Vector (8192 × 8192) × (8192 × 1)
-────────────────────────────────────────────────────────────────────────
-CPU (scalar)       47.30 ms      1.16 GFLOPS
-GPU naive           0.09 ms    611.00 GFLOPS   speedup 525.6x   err 0.00e+00
-GPU shared-mem      0.08 ms    702.10 GFLOPS   speedup 591.2x   err 0.00e+00
-GPU coalesced       0.06 ms    890.30 GFLOPS   speedup 788.5x   err 0.00e+00
-
-════════════════════════════════════════════════════════════════════════
-  MATRIX-MATRIX (C = A · B)
-════════════════════════════════════════════════════════════════════════
-
-Matrix-Matrix (2048 × 2048) · (2048 × 2048)
-────────────────────────────────────────────────────────────────────────
-CPU (scalar ikj)  4821.00 ms      3.56 GFLOPS
-GPU naive           22.40 ms    765.10 GFLOPS   speedup 215.2x
-GPU tiled (16×16)    4.10 ms   4183.20 GFLOPS   speedup 1175.9x
-GPU tiled (8×8)      6.90 ms   2488.50 GFLOPS
-GPU tiled (16×16)    4.10 ms   4183.20 GFLOPS   ← sweet spot
-GPU tiled (32×32)    5.60 ms   3064.40 GFLOPS   (register pressure ↑)
+```text
+GPU-Based Matrix Operations - Benchmark Suite
+
+GPU  : NVIDIA GeForce RTX 3080
+SMs  : 68 | Clock : 1.71 GHz | Mem BW: ~760 GB/s
+VRAM : 10240 MB | Shared mem/block: 48 KB
+
+MATRIX-VECTOR (y = A * x)
+Matrix-Vector (8192 x 8192) x (8192 x 1)
+CPU (scalar)       47.30 ms      1.16 GFLOPS
+GPU naive           0.09 ms    611.00 GFLOPS   speedup 525.6x   err 0.00e+00
+GPU shared-mem      0.08 ms    702.10 GFLOPS   speedup 591.2x   err 0.00e+00
+GPU coalesced       0.06 ms    890.30 GFLOPS   speedup 788.5x   err 0.00e+00
+
+MATRIX-MATRIX (C = A * B)
+Matrix-Matrix (2048 x 2048) * (2048 x 2048)
+CPU (scalar ikj)  4821.00 ms      3.56 GFLOPS
+GPU naive          22.40 ms    765.10 GFLOPS   speedup 215.2x
+GPU tiled (16x16)   4.10 ms   4183.20 GFLOPS   speedup 1175.9x
+GPU tiled (8x8)     6.90 ms   2488.50 GFLOPS
+GPU tiled (16x16)   4.10 ms   4183.20 GFLOPS   <- sweet spot
+GPU tiled (32x32)   5.60 ms   3064.40 GFLOPS   (register pressure up)
 ```
 
-*(Actual numbers vary by GPU model and driver version.)*
+Actual numbers vary by GPU model, driver version, and power limits.
 
 ---
 
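The 16x16 "sweet spot" rows in the sample output above come from a shared-memory tiled kernel; a rough sketch of that technique follows. This is illustrative code written for this note (it assumes square row-major matrices with `N` divisible by `TILE` and a `(N/TILE, N/TILE)` grid of `TILE x TILE` blocks), not the repository's actual `matmul_kernels.cu`:

```cuda
#define TILE 16

// Illustrative shared-memory tiled matmul (C = A * B, N x N, row-major).
// Each block computes a TILE x TILE patch of C. Tiles of A and B are staged
// in shared memory so each global element is read once per tile instead of
// once per output element, raising arithmetic intensity by ~TILE.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: one element of each tile per thread
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Raising `TILE` to 32 quadruples shared-memory use per block and increases register pressure, which is consistent with the 32x32 row losing to 16x16 in the sweep above.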
 ## Project Structure
 
-```
+```text
 gpu-matrix-ops/
-├── README.md
-├── LICENSE
-├── CONTRIBUTING.md
-├── CITATION.cff
-├── .gitignore
-├── CMakeLists.txt
-├── Makefile
-├── include/
-│   ├── matrix_ops.cuh     ← CUDA headers, GpuTimer, CUDA_CHECK macro
-│   └── cpu_ops.h          ← CPU reference declarations
-├── src/
-│   ├── main.cu            ← benchmark driver
-│   ├── matvec_kernels.cu  ← naive / shared / coalesced MatVec kernels
-│   ├── matmul_kernels.cu  ← naive / tiled / variable-tile MatMul kernels
-│   └── cpu_ops.cpp        ← scalar CPU implementations
-├── docs/
-│   └── OPTIMIZATION_NOTES.md
-└── results/
-    └── README.md
+|-- README.md
+|-- LICENSE
+|-- CONTRIBUTING.md
+|-- CITATION.cff
+|-- .gitignore
+|-- .gitattributes
+|-- CMakeLists.txt
+|-- Makefile
+|-- include/
+|   |-- matrix_ops.cuh
+|   `-- cpu_ops.h
+|-- src/
+|   |-- main.cu
+|   |-- matvec_kernels.cu
+|   |-- matmul_kernels.cu
+|   `-- cpu_ops.cpp
+|-- docs/
+|   `-- OPTIMIZATION_NOTES.md
+`-- results/
+    `-- README.md
 ```
 
 ---
@@ -172,7 +176,7 @@ ncu --set full --target-processes all ./bin/gpu_matrix_ops
 # Legacy nvprof
 nvprof --print-gpu-trace ./bin/gpu_matrix_ops
 
-# Metrics to watch
+# Example metrics
 ncu --metrics \
 l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second,\
 smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,\
@@ -182,6 +186,6 @@ ncu --metrics \
 
 ---
 
-## Licence
+## License
 
-MIT — free to use, modify, and redistribute. See [`LICENSE`](LICENSE).
+MIT - free to use, modify, and redistribute. See `LICENSE`.

results/README.md

Lines changed: 19 additions & 22 deletions
@@ -1,46 +1,43 @@
-# Results
+# Results
 
 This directory stores benchmark output logs.
 
-## How to capture results
+## How to Capture Results
 
 ```bash
 # Pipe stdout to a timestamped log
 ./bin/gpu_matrix_ops | tee results/$(date +%Y%m%d_%H%M%S)_benchmark.txt
 
-# Or with nvidia-smi alongside
+# Or capture GPU telemetry in parallel
 nvidia-smi dmon -s pucvmet -d 1 > results/gpu_power.log &
 ./bin/gpu_matrix_ops > results/bench.txt
 kill %1
 ```
 
-## Columns explained
+## Columns Explained
 
 | Column | Meaning |
 |--------|---------|
 | `ms` | Average kernel time over N iterations (CUDA events) |
-| `GFLOPS` | `2·ops / (time_s · 1e9)` effective throughput |
-| `speedup` | `cpu_ms / gpu_ms` — how much faster than scalar CPU |
-| `err` | `max(|gpu[i] - cpu[i]|)` numerical correctness check |
+| `GFLOPS` | `2*ops / (time_s * 1e9)` effective throughput |
+| `speedup` | `cpu_ms / gpu_ms` relative to scalar CPU |
+| `err` | `max(|gpu[i] - cpu[i]|)` numerical correctness check |
 
-## What to look for
+## What to Look For
 
-- **MatVec coalesced** should beat **MatVec naive** by 1.3–2× on modern GPUs
-  (coalescing saves ~50 % of bandwidth for the x-vector reads).
-- **MatMul tiled 16×16** should beat **MatMul naive** by 5–20× (arithmetic
-  intensity rises from 1 to 16 FMA/float).
-- **Tile-size sweep:** 16×16 typically wins; 32×32 usually loses due to
-  register pressure reducing occupancy.
-- **Speedup vs CPU** of 100–1000× for large matrices is expected.
+- MatVec coalesced should beat MatVec naive by about 1.3x to 2x on modern GPUs.
+- MatMul tiled 16x16 should beat MatMul naive by about 5x to 20x.
+- Tile-size sweep usually favors 16x16; 32x32 may lose due to register pressure.
+- Speedups versus scalar CPU can reach 100x to 1000x for large matrices.
 
-## Sample log (RTX 3080, 2048×2048 matmul)
+## Sample Log (RTX 3080, 2048x2048 matmul)
 
-```
+```text
 CPU (scalar ikj)  4821.00 ms      3.56 GFLOPS
 GPU naive           22.40 ms    765.10 GFLOPS   speedup 215.2x
-GPU tiled (16×16)    4.10 ms   4183.20 GFLOPS   speedup 1175.9x
-── Tile-size sweep ──
-GPU tiled ( 8)       6.90 ms   2488.50 GFLOPS
-GPU tiled (16×16)    4.10 ms   4183.20 GFLOPS   sweet spot
-GPU tiled (32×32)    5.60 ms   3064.40 GFLOPS
+GPU tiled (16x16)    4.10 ms   4183.20 GFLOPS   speedup 1175.9x
+-- Tile-size sweep --
+GPU tiled ( 8x 8)    6.90 ms   2488.50 GFLOPS
+GPU tiled (16x16)    4.10 ms   4183.20 GFLOPS   <- sweet spot
+GPU tiled (32x32)    5.60 ms   3064.40 GFLOPS
 ```
