Course: GPU-Computing-2026
Author: Michael Bernasconi (michael.bernasconi@studenti.unitn.it) - Student ID: 267681
This project implements and analyzes the performance of algorithms for Sparse Matrix-Vector Multiplication (SpMV) in a heterogeneous environment (CPU and GPU). The main goal is to study how different sparse matrix storage formats (such as CSR and COO) affect performance on each architecture, evaluating the efficiency of custom-developed CUDA kernels against standard libraries.
The project contains several versions of the SpMV operation:
- CPU SpMV CSR: Parallel implementation on the host using OpenMP.
- GPU SpMV COO: CUDA kernel utilizing the "Coordinate Format".
- GPU SpMV CSR: Native CUDA kernel based on "Compressed Sparse Row".
- GPU SpMV CSR-Vector: CUDA kernel optimized to assign one warp per row, maximizing memory coalescing (sketched after this list).
- GPU cuSPARSE Baseline: Reference implementation using the high-performance NVIDIA cuSPARSE library.
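To make the warp-per-row strategy concrete, below is a minimal CUDA sketch of a CSR-Vector kernel. It is an illustration under assumptions (double-precision values, 32-bit indices, the name `spmv_csr_vector`), not the exact kernel shipped in `src/`:

```cuda
// Minimal warp-per-row CSR SpMV sketch. Illustrative only: kernel name, types,
// and launch configuration are assumptions, not the exact code in src/.
__global__ void spmv_csr_vector(int n_rows,
                                const int    *row_ptr,  // CSR row offsets, length n_rows + 1
                                const int    *col_idx,  // CSR column indices, length nnz
                                const double *values,   // CSR non-zero values, length nnz
                                const double *x,        // dense input vector
                                double       *y)        // dense output vector
{
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
    const int lane    = threadIdx.x & 31;

    if (warp_id >= n_rows) return;

    // Consecutive lanes read consecutive entries of col_idx/values,
    // so the accesses over a row are coalesced.
    double sum = 0.0;
    for (int j = row_ptr[warp_id] + lane; j < row_ptr[warp_id + 1]; j += 32)
        sum += values[j] * x[col_idx[j]];

    // Reduce the 32 partial sums within the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[warp_id] = sum;
}
```

A launch such as `spmv_csr_vector<<<(n_rows * 32 + 127) / 128, 128>>>(...)` gives each row its own warp; rows with few non-zeros waste lanes, which is the usual trade-off against the scalar CSR kernel.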
The following performance metrics are recorded during execution using the provided scripts:
- GFLOPS (Giga Floating-point Operations Per Second): measure of computational throughput.
- Effective Bandwidth (BW, in GB/s): memory bandwidth actually utilized, crucial given the memory-bound nature of the SpMV problem (see the sketch after this list for how these two metrics are typically derived).
- Kernel Time: time actually spent executing the SpMV operation itself.
- Cache Metrics: Detailed statistics of cache hits and misses (D1, LL) simulated via Cachegrind.
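For reference, the two headline metrics are commonly derived as in the sketch below, assuming double-precision values and 32-bit indices for CSR; the exact byte accounting used by the project's scripts may differ:

```c
// Sketch of GFLOPS and effective-bandwidth accounting for CSR SpMV
// (double values, 32-bit indices); the project's scripts may count bytes differently.
#include <stddef.h>

static void spmv_metrics(size_t n_rows, size_t n_cols, size_t nnz,
                         double kernel_time_s,
                         double *gflops, double *bw_gbs)
{
    // One multiply and one add per stored non-zero.
    *gflops = 2.0 * (double)nnz / (kernel_time_s * 1e9);

    // Minimum data moved: matrix values + column indices + row pointers,
    // plus reading the input vector x and writing the output vector y once.
    double bytes = (double)nnz * (sizeof(double) + sizeof(int))
                 + (double)(n_rows + 1) * sizeof(int)
                 + (double)(n_cols + n_rows) * sizeof(double);
    *bw_gbs = bytes / (kernel_time_s * 1e9);
}
```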
The execution and analysis of the benchmarks were designed for the University Cluster (edu01 node | edu-short partition), with the following specifications:
| Feature | Host (CPU) | Device (GPU) |
|---|---|---|
| Model | Intel(R) Xeon(R) Silver 4309Y | NVIDIA A30 |
| Architecture | Ice Lake (x86_64) | Ampere (Compute Capability 8.0) |
| Cores / SMs | 16 Cores / 32 Threads (2 Sockets) | 56 SMs (3584 CUDA Cores) |
| Clock Frequency | 2.80 GHz (Base) / 3.60 GHz (Max) | 1.44 GHz (Boost) |
| FP32 Performance | 1.84 TFLOPS | 10.3 TFLOPS |
| Memory Bandwidth | 102.4 GB/s | 933.1 GB/s (HBM2) |
| Total Global Memory | N/A | 24 GB |
| L1 Cache | 768 KiB Data / 512 KiB Inst. (Tot) | 192 KiB per SM (Unified L1/Shared) |
| L2 Cache | 20 MiB (16 instances, 1.25 MiB/core) | 24 MiB (Shared LLC) |
| L3 Cache | 24 MiB (Shared LLC) | N/A (L2 serves as LLC) |
| Shared Memory | N/A | 48 KiB (up to 100 KiB per SM) |
| Thread Limits | 2 Threads/Core (Hyper-threading) | 1024 th/block (2048 th/SM) |
| Warp Size | N/A | 32 |
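These peak figures already explain why SpMV is treated as memory-bound: with roughly 12 bytes of matrix traffic (an 8-byte value plus a 4-byte column index) per 2 floating-point operations, throughput is capped by bandwidth long before the FP32 peaks are reached. The snippet below is a back-of-the-envelope estimate using the peak bandwidths from the table above, not a measured result; the real ceilings are lower once vector and row-pointer traffic and non-peak bandwidth are accounted for.

```c
// Rough bandwidth-bound ceiling for double-precision CSR SpMV, using the peak
// bandwidths from the table above. Illustrative estimate, not a measured result.
#include <stdio.h>

int main(void)
{
    const double flops_per_byte = 2.0 / 12.0;  // 2 FLOPs per (8 B value + 4 B column index)
    const double cpu_bw_gbs = 102.4;           // Xeon Silver 4309Y (two sockets)
    const double gpu_bw_gbs = 933.1;           // NVIDIA A30 (HBM2)

    printf("CPU ceiling: <= %.0f GFLOPS (vs. 1840 GFLOPS FP32 peak)\n",
           cpu_bw_gbs * flops_per_byte);
    printf("GPU ceiling: <= %.0f GFLOPS (vs. 10300 GFLOPS FP32 peak)\n",
           gpu_bw_gbs * flops_per_byte);
    return 0;
}
```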
- Cluster Modules: `CUDA/11.8.0`
- Compilers: `gcc` (with `-fopenmp` support for the CPU code) and `nvcc` (CUDA compiler, target architecture `sm_80`)
- Cache Profiling: Valgrind (specifically the `Cachegrind` tool), used to simulate CPU cache behavior.
The directory is organized to logically separate sources, compilation logic, benchmarks, and final results:
```text
.
├── bin/          # Final compiled executables (CPU, CUDA, and tools)
├── data/         # Matrix datasets in Matrix Market (.mtx) format downloaded from SuiteSparse
├── deviceQuery/  # Text files and logs extracted on the cluster (CPU, GPU info, NVCC environment)
├── doc/          # Notes, technical memos, and LaTeX support files for the report
├── include/      # Header files (data structures, format calculation, and timer functions)
├── makefile      # Compilation instructions with standard and profile targets (flags -O3, sm_80)
├── obj/          # Object files (.o) generated by the compilation
├── results/      # Python scripts for data analysis/plotting, generated logs, and folders for the Runs (1..5)
├── src/          # Main source code (C/C++ and CUDA) of the various implementations
└── *.sh          # Bash scripts for compilation automation, test jobs (SLURM), and Cache profiling
```
- VPN Connection: Make sure you are connected to the University VPN via Global Protect (vpn-mfa.icts.unitn.it) if you are not inside the university network.
- Access the cluster:
ssh username@baldo.disi.unitn.it
- Clone the repository (directly on the cluster):
git clone https://github.com/Michael-Bernasconi/Sparse-Matrix-Vector-Multiplication.git
- Create and move to the `data` folder to run the downloads:

```bash
cd ~/Sparse-Matrix-Vector-Multiplication
mkdir -p data
cd data
wget https://suitesparse-collection-website.herokuapp.com/MM/Freescale/FullChip.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/PARSEC/Ga41As41H72.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Oberwolfach/bone010.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/PARSEC/Si41Ge41H72.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/GHS_psdef/ldoor.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Rajat/rajat31.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Sandia/ASIC_680ks.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Rucci/Rucci1.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/GHS_indef/boyd2.tar.gz
wget https://suitesparse-collection-website.herokuapp.com/MM/Williams/webbase-1M.tar.gz
```
- Extract and clean up the folders:
```bash
for f in *.tar.gz; do tar -xzf "$f"; done
mv */*.mtx .
rm *.tar.gz
rm -rf */
```
Move to the project root (~/Sparse-Matrix-Vector-Multiplication), load the required modules, and execute the runs via SLURM.
```bash
cd ~/Sparse-Matrix-Vector-Multiplication

# 1. Load the correct CUDA module
module load CUDA/11.8.0

# 2. Assign execution permissions to the Bash scripts
chmod +x run_performance.sh run_cache.sh submit_all.sh

# 3. Compile the project
make

# 4. Run the benchmark (submission via SLURM partitions)
# A) To execute ALL matrix files (.mtx) for the 5 scheduled runs:
./submit_all.sh

# B) Alternatively, to test only 1 specific file:
./submit_all.sh data/ASIC_680ks.mtx
```

Once the jobs on the cluster are finished, move to `results` to analyze the text logs and generate the aggregated graphs:
```bash
cd results

# Extract text results and generate CSV files
python3 analyze-result.py

# Starting from the generated CSVs, create the plots for the metrics (GFLOPS, BW, TTS, Kernel Time, Cache)
python3 gflops-bw-tts-kerneltime-cache.py
```

The generated images will be saved inside `results/plots/` and the tables in `results/tables/`.