Sparse Matrix-Vector Multiplication (SpMV)

Course: GPU-Computing-2026
Author: Michael Bernasconi (michael.bernasconi@studenti.unitn.it) - Student ID: 267681

Description

This project implements and analyzes the performance of Sparse Matrix-Vector Multiplication (SpMV) algorithms in a heterogeneous environment (CPU and GPU). The main goal is to study how different sparse matrix storage formats (such as CSR and COO) affect performance on each architecture, evaluating the efficiency of custom CUDA kernels against a standard library baseline (NVIDIA cuSPARSE).
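As a point of reference, the sketch below shows the CSR layout and a row-parallel host SpMV loop. It is illustrative only (0-based indices, double-precision values, OpenMP over rows) and is not the repository's exact code:

// CSR stores a sparse matrix as three arrays:
//   row_ptr[n_rows + 1] : offset of the first non-zero of each row
//   col_idx[nnz]        : column index of each non-zero
//   values[nnz]         : value of each non-zero
void spmv_csr_cpu(int n_rows, const int *row_ptr, const int *col_idx,
                  const double *values, const double *x, double *y)
{
    #pragma omp parallel for            // row-parallel, as in the OpenMP version
    for (int row = 0; row < n_rows; ++row) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += values[j] * x[col_idx[j]];
        y[row] = sum;
    }
}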

Provided Implementations

The project contains several versions of the SpMV operation:

  • CPU SpMV CSR: Parallel implementation on the host using OpenMP.
  • GPU SpMV COO: CUDA kernel using the Coordinate (COO) format.
  • GPU SpMV CSR: Native CUDA kernel based on the Compressed Sparse Row (CSR) format.
  • GPU SpMV CSR-Vector: CUDA kernel optimized to assign one warp per row, maximizing memory coalescing (see the sketch after this list).
  • GPU cuSPARSE Baseline: Reference implementation using the high-performance NVIDIA cuSPARSE library.
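To illustrate the warp-per-row idea behind the CSR-Vector kernel, a minimal sketch follows. It is illustrative only (it assumes block sizes that are multiples of 32 and double-precision values) and is not the repository's exact kernel:

// Each warp handles one row: its 32 lanes stride over the row's non-zeros,
// so reads of col_idx/values are coalesced; a warp shuffle then reduces
// the 32 partial sums.
__global__ void spmv_csr_vector(int n_rows, const int *row_ptr, const int *col_idx,
                                const double *values, const double *x, double *y)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;   // one warp per row
    int lane    = threadIdx.x & 31;
    if (warp_id >= n_rows) return;

    double sum = 0.0;
    for (int j = row_ptr[warp_id] + lane; j < row_ptr[warp_id + 1]; j += 32)
        sum += values[j] * x[col_idx[j]];

    for (int offset = 16; offset > 0; offset >>= 1)               // warp-level reduction
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[warp_id] = sum;
}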

Measured Metrics

The provided scripts record the following metrics during execution (a sketch of how the first two are typically derived follows the list):

  1. GFLOPS (Giga Floating-point Operations Per Second): Measure of computational throughput.
  2. Effective Bandwidth (BW, in GB/s): Measure of the memory bandwidth actually utilized, crucial given the memory-bound nature of SpMV.
  3. Kernel Time: Time actually spent executing the multiplication kernel itself.
  4. Cache Metrics: Detailed statistics of cache hits and misses (D1, LL) simulated via Cachegrind.
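For reference, GFLOPS and effective bandwidth for CSR SpMV are typically derived as sketched below (2 floating-point operations per non-zero; the byte count is a rough lower bound, and the exact accounting used by the repository's scripts may differ):

// Illustrative derivation, assuming double-precision values and 32-bit indices.
double gflops(long long nnz, double kernel_time_s)
{
    return (2.0 * nnz / kernel_time_s) * 1e-9;   // one multiply + one add per non-zero
}

double effective_bw_gbs(long long nnz, long long n_rows, double kernel_time_s)
{
    // Bytes moved at least once: values (8 B), column indices (4 B), and x reads (8 B)
    // per non-zero, plus the row pointer array and one write of y per row.
    double bytes = nnz * (8.0 + 4.0 + 8.0) + (n_rows + 1) * 4.0 + n_rows * 8.0;
    return (bytes / kernel_time_s) * 1e-9;
}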

Target Hardware (UniTN Cluster)

The execution and analysis of the benchmarks were designed for the University Cluster (edu01 node | edu-short partition), with the following specifications:

Feature             | Host (CPU)                            | Device (GPU)
Model               | Intel(R) Xeon(R) Silver 4309Y         | NVIDIA A30
Architecture        | Ice Lake (x86_64)                     | Ampere (Compute Capability 8.0)
Cores / SMs         | 16 Cores / 32 Threads (2 Sockets)     | 56 SMs (3584 CUDA Cores)
Clock Frequency     | 2.80 GHz (Base) / 3.60 GHz (Max)      | 1.44 GHz (Boost)
FP32 Performance    | 1.84 TFLOPS                           | 10.3 TFLOPS
Memory Bandwidth    | 102.4 GB/s                            | 933.1 GB/s (HBM2)
Total Global Memory | N/A                                   | 24 GB
L1 Cache            | 768 KiB Data / 512 KiB Inst. (Total)  | 192 KiB per SM (Unified L1/Shared)
L2 Cache            | 20 MiB (16 instances, 1.25 MiB/core)  | 24 MiB (Shared LLC)
L3 Cache            | 24 MiB (Shared LLC)                   | N/A (L2 serves as LLC)
Shared Memory       | N/A                                   | 48 KiB (up to 100 KiB per SM)
Thread Limits       | 2 Threads/Core (Hyper-threading)      | 1024 th/block (2048 th/SM)
Warp Size           | N/A                                   | 32

Software Environment and Dependencies

  • Cluster Modules: CUDA/11.8.0
  • Compilers: gcc (-fopenmp support for CPU) and nvcc (CUDA compiler, target architecture sm_80).
  • Cache Profiling: Valgrind (specifically the Cachegrind tool) to simulate CPU cache behavior (see the example invocation after this list).
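For reference, a Cachegrind run typically looks like the following (the executable name is a placeholder; the actual binaries and options used by the profiling scripts may differ):

valgrind --tool=cachegrind --cachegrind-out-file=cachegrind.out ./bin/<cpu_executable> data/ASIC_680ks.mtx
cg_annotate cachegrind.out     # summarize D1/LL hit and miss statistics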

Repository Structure

The directory is organized to logically separate sources, compilation logic, benchmarks, and final results:

.
├── bin/          # Final compiled executables (CPU, CUDA, and tools)
├── data/         # Matrix datasets in Matrix Market (.mtx) format downloaded from SuiteSparse
├── deviceQuery/  # Text files and logs extracted on the cluster (CPU, GPU info, NVCC environment)
├── doc/          # Notes, technical memos, and LaTeX support files for the report
├── include/      # Header Files (data structures, format calculation, and timer functions)
├── makefile      # Compilation instructions with standard and profile targets (flags -O3, sm_80)
├── obj/          # Object files (.o) generated by the compilation
├── results/      # Python scripts for data analysis/plotting, generated logs, and folders for the Runs (1..5)
├── src/          # Main source code (C/C++ and CUDA) of the various implementations
└── *.sh          # Bash scripts for compilation automation, test jobs (SLURM), and Cache profiling

Reproducing Benchmarks on the Cluster

Phase 1: Access and Preparation

  1. VPN Connection: Make sure you are connected to the university VPN via GlobalProtect (vpn-mfa.icts.unitn.it) if you are not on the university network.
  2. Access the cluster:
    ssh username@baldo.disi.unitn.it
  3. Clone the repository (directly on the cluster):
    git clone https://github.com/Michael-Bernasconi/Sparse-Matrix-Vector-Multiplication.git

Phase 2: Dataset Download and Setup

  1. Create the data folder, move into it, and download the matrices:
    cd ~/Sparse-Matrix-Vector-Multiplication
    mkdir -p data
    cd data
    
    wget https://suitesparse-collection-website.herokuapp.com/MM/Freescale/FullChip.tar.gz
    wget https://suitesparse-collection-website.herokuapp.com/MM/PARSEC/Ga41As41H72.tar.gz
    wget https://suitesparse-collection-website.herokuapp.com/MM/Oberwolfach/bone010.tar.gz
    wget https://suitesparse-collection-website.herokuapp.com/MM/PARSEC/Si41Ge41H72.tar.gz
    wget https://suitesparse-collection-website.herokuapp.com/MM/GHS_psdef/ldoor.tar.gz
    wget https://suitesparse-collection-website.herokuapp.com/MM/Rajat/rajat31.tar.gz
    wget https://suitesparse-collection-website.herokuapp.com/MM/Sandia/ASIC_680ks.tar.gz
    wget https://suitesparse-collection-website.herokuapp.com/MM/Rucci/Rucci1.tar.gz
    wget https://suitesparse-collection-website.herokuapp.com/MM/GHS_indef/boyd2.tar.gz
    wget https://suitesparse-collection-website.herokuapp.com/MM/Williams/webbase-1M.tar.gz
  2. Extract and clean up the folders:
    for f in *.tar.gz; do tar -xzf "$f"; done
    mv */*.mtx .
    rm *.tar.gz
    rm -rf */

Phase 3: Execution and Analysis

Move to the project root (~/Sparse-Matrix-Vector-Multiplication), load the required modules, and execute the runs via SLURM.

cd ~/Sparse-Matrix-Vector-Multiplication

# 1. Load the correct CUDA module
module load CUDA/11.8.0

# 2. Assign execution permissions to the Bash scripts
chmod +x run_performance.sh run_cache.sh submit_all.sh

# 3. Compile the project
make

# 4. Run the benchmark (submission via SLURM partitions)
# A) To execute ALL matrix files (.mtx) for the 5 scheduled runs:
./submit_all.sh 

# B) Alternatively, to test only 1 specific file:
./submit_all.sh data/ASIC_680ks.mtx

Graphs and Tables Generation

Once the jobs on the cluster have finished, move to the results/ directory to analyze the text logs and generate the aggregated graphs:

cd results

# Extract text results and generate CSV files
python3 analyze-result.py

# Starting from the generated CSVs, create the plots for the metrics (GFLOPS, BW, TTS, Kernel Time, Cache)
python3 gflops-bw-tts-kerneltime-cache.py

The generated images will be saved inside results/plots/ and the tables in results/tables/.
