Course: GPU-Computing-2026
Author: Michael Bernasconi (michael.bernasconi@studenti.unitn.it) - Student ID: 267681
This project implements and analyzes the performance of algorithms for Sparse Matrix-Vector Multiplication (SpMV) in a heterogeneous environment (CPU and GPU). The main goal is to study how different sparse matrix storage formats (such as CSR and COO) affect performance on each architecture, evaluating the efficiency of custom-developed CUDA kernels against standard libraries.
The project contains several versions of the SpMV operation:
- CPU SpMV CSR: Parallel implementation on the host using OpenMP.
- GPU SpMV COO: CUDA kernel utilizing the "Coordinate Format".
- GPU SpMV CSR: Native CUDA kernel based on "Compressed Sparse Row".
- GPU SpMV CSR-Vector: CUDA kernel optimized to assign one warp per row, maximizing memory coalescing (sketched after this list).
- GPU cuSPARSE Baseline: Reference implementation using the high-performance NVIDIA cuSPARSE library.
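To make the warp-per-row strategy concrete, below is a minimal CUDA sketch of a CSR-Vector kernel. It is an illustration under assumptions (double-precision values, 32-bit indices, the name `spmv_csr_vector`), not the exact kernel shipped in `src/`:

```cuda
// Minimal warp-per-row CSR SpMV sketch. Illustrative only: kernel name, types,
// and launch configuration are assumptions, not the exact code in src/.
__global__ void spmv_csr_vector(int n_rows,
                                const int    *row_ptr,  // CSR row offsets, length n_rows + 1
                                const int    *col_idx,  // CSR column indices, length nnz
                                const double *values,   // CSR non-zero values, length nnz
                                const double *x,        // dense input vector
                                double       *y)        // dense output vector
{
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
    const int lane    = threadIdx.x & 31;

    if (warp_id >= n_rows) return;

    // Consecutive lanes read consecutive entries of col_idx/values,
    // so the accesses over a row are coalesced.
    double sum = 0.0;
    for (int j = row_ptr[warp_id] + lane; j < row_ptr[warp_id + 1]; j += 32)
        sum += values[j] * x[col_idx[j]];

    // Reduce the 32 partial sums within the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[warp_id] = sum;
}
```

A launch such as `spmv_csr_vector<<<(n_rows * 32 + 127) / 128, 128>>>(...)` gives each row its own warp; rows with few non-zeros waste lanes, which is the usual trade-off against the scalar CSR kernel.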
The following performance metrics are recorded during execution using the provided scripts:
- GFLOPS (Giga Floating-point Operations Per Second): measure of computational throughput.
- Effective Bandwidth (BW, in GB/s): memory bandwidth actually utilized, crucial given the memory-bound nature of the SpMV problem (see the sketch after this list for how these two metrics are typically derived).
- Kernel Time: time actually spent executing the SpMV operation itself.
- Cache Metrics: Detailed statistics of cache hits and misses (D1, LL) simulated via Cachegrind.
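For reference, the two headline metrics are commonly derived as in the sketch below, assuming double-precision values and 32-bit indices for CSR; the exact byte accounting used by the project's scripts may differ:

```c
// Sketch of GFLOPS and effective-bandwidth accounting for CSR SpMV
// (double values, 32-bit indices); the project's scripts may count bytes differently.
#include <stddef.h>

static void spmv_metrics(size_t n_rows, size_t n_cols, size_t nnz,
                         double kernel_time_s,
                         double *gflops, double *bw_gbs)
{
    // One multiply and one add per stored non-zero.
    *gflops = 2.0 * (double)nnz / (kernel_time_s * 1e9);

    // Minimum data moved: matrix values + column indices + row pointers,
    // plus reading the input vector x and writing the output vector y once.
    double bytes = (double)nnz * (sizeof(double) + sizeof(int))
                 + (double)(n_rows + 1) * sizeof(int)
                 + (double)(n_cols + n_rows) * sizeof(double);
    *bw_gbs = bytes / (kernel_time_s * 1e9);
}
```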
The execution and analysis of the benchmarks were designed for the University Cluster (edu01 node | edu-short partition), with the following specifications:
| Feature | Host (CPU) | Device (GPU) |
|---|---|---|
| Model | Intel(R) Xeon(R) Silver 4309Y | NVIDIA A30 |
| Architecture | Ice Lake (x86_64) | Ampere (Compute Capability 8.0) |
| Cores / SMs | 16 Cores / 32 Threads (2 Sockets) | 56 SMs (3584 CUDA Cores) |
| Clock Frequency | 2.80 GHz (Base) / 3.60 GHz (Max) | 1.44 GHz (Boost) |
| FP32 Performance | 1.84 TFLOPS | 10.3 TFLOPS |
| Memory Bandwidth | 102.4 GB/s | 933.1 GB/s (HBM2) |
| Total Global Memory | N/A | 24 GB |
| L1 Cache | 768 KiB Data / 512 KiB Inst. (Tot) | 192 KiB per SM (Unified L1/Shared) |
| L2 Cache | 20 MiB (16 instances, 1.25 MiB/core) | 24 MiB (Shared LLC) |
| L3 Cache | 24 MiB (Shared LLC) | N/A (L2 serves as LLC) |
| Shared Memory | N/A | 48 KiB (up to 100 KiB per SM) |
| Thread Limits | 2 Threads/Core (Hyper-threading) | 1024 th/block (2048 th/SM) |
| Warp Size | N/A | 32 |
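These peak figures already explain why SpMV is treated as memory-bound: with roughly 12 bytes of matrix traffic (an 8-byte value plus a 4-byte column index) per 2 floating-point operations, throughput is capped by bandwidth long before the FP32 peaks are reached. The snippet below is a back-of-the-envelope estimate using the peak bandwidths from the table above, not a measured result; the real ceilings are lower once vector and row-pointer traffic and non-peak bandwidth are accounted for.

```c
// Rough bandwidth-bound ceiling for double-precision CSR SpMV, using the peak
// bandwidths from the table above. Illustrative estimate, not a measured result.
#include <stdio.h>

int main(void)
{
    const double flops_per_byte = 2.0 / 12.0;  // 2 FLOPs per (8 B value + 4 B column index)
    const double cpu_bw_gbs = 102.4;           // Xeon Silver 4309Y (two sockets)
    const double gpu_bw_gbs = 933.1;           // NVIDIA A30 (HBM2)

    printf("CPU ceiling: <= %.0f GFLOPS (vs. 1840 GFLOPS FP32 peak)\n",
           cpu_bw_gbs * flops_per_byte);
    printf("GPU ceiling: <= %.0f GFLOPS (vs. 10300 GFLOPS FP32 peak)\n",
           gpu_bw_gbs * flops_per_byte);
    return 0;
}
```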
- Cluster Modules: `CUDA/11.8.0`
- Compilers: `gcc` (with `-fopenmp` support for the CPU code) and `nvcc` (CUDA compiler, target architecture `sm_80`)
- Cache Profiling: Valgrind (specifically the `Cachegrind` tool), used to simulate CPU cache behavior.
The directory is organized to logically separate sources, compilation logic, benchmarks, and final results:
```text
.
├── bin/          # Final compiled executables (CPU, CUDA, and tools)
├── data/         # Matrix datasets in Matrix Market (.mtx) format downloaded from SuiteSparse
├── deviceQuery/  # Text files and logs extracted on the cluster (CPU, GPU info, NVCC environment)
├── doc/          # Notes, technical memos, and LaTeX support files for the report
├── include/      # Header files (data structures, format calculation, and timer functions)
├── makefile      # Compilation instructions with standard and profile targets (flags -O3, sm_80)
├── obj/          # Object files (.o) generated by the compilation
├── results/      # Python scripts for data analysis/plotting, generated logs, and folders for the Runs (1..5)
├── src/          # Main source code (C/C++ and CUDA) of the various implementations
└── *.sh          # Bash scripts for compilation automation, test jobs (SLURM), and Cache profiling
```
- VPN Connection: Make sure you are connected to the University VPN via Global Protect (vpn-mfa.icts.unitn.it) if you are not inside the university network.
- Access the cluster:
ssh username@baldo.disi.unitn.it
- Clone the repository (directly on the cluster):
git clone https://github.com/Michael-Bernasconi/Sparse-Matrix-Vector-Multiplication.git
- Create and move to the `data` folder to run the downloads:

```bash
cd ~/Sparse-Matrix-Vector-Multiplication
mkdir -p data
cd data
wget https://suitesparse-collection-website.herokuapp.com/MM/Freescale/FullChip.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/PARSEC/Ga41As41H72.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Oberwolfach/bone010.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/PARSEC/Si41Ge41H72.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/GHS_psdef/ldoor.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Rajat/rajat31.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Sandia/ASIC_680ks.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Rucci/Rucci1.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/GHS_indef/boyd2.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Williams/webbase-1M.tar.gz
```
- Extract and clean up the folders:
```bash
for f in *.tar.gz; do tar -xzf "$f"; done
mv */*.mtx .
rm *.tar.gz
rm -rf */
```
Move to the project root (~/Sparse-Matrix-Vector-Multiplication), load the required modules, and execute the runs via SLURM.
```bash
cd ~/Sparse-Matrix-Vector-Multiplication

# 1. Load the correct CUDA module
module load CUDA/11.8.0

# 2. Assign execution permissions to the Bash scripts
chmod +x run_performance.sh run_cache.sh submit_all.sh

# 3. Compile the project
make

# 4. Run the benchmark (submission via SLURM partitions)
# A) To execute ALL matrix files (.mtx) for the 5 scheduled runs:
./submit_all.sh

# B) Alternatively, to test only 1 specific file:
./submit_all.sh data/ASIC_680ks.mtx
```

Once the jobs on the cluster are finished, move to `results` to analyze the text logs and generate the aggregated graphs:
```bash
cd results

# Extract text results and generate CSV files
python3 analyze-result.py

# Starting from the generated CSVs, create the plots for the metrics (GFLOPS, BW, TTS, Kernel Time, Cache)
python3 gflops-bw-tts-kerneltime-cache.py
```

The generated images will be saved inside `results/plots/` and the tables in `results/tables/`.