
GPU-Accelerated Physics Simulation

A high-performance physics simulation framework built on CUDA, targeting real-time performance for large particle systems on consumer GPUs (RTX 4080 / 4090).

Three solvers are included:

  • N-body -- all-pairs gravitational simulation with shared-memory tiling and an optional fused force+integration kernel
  • SPH Fluid -- Smoothed Particle Hydrodynamics with spatial hash grid neighbor search
  • Rigid Body -- sphere-sphere collision detection and impulse-based resolution

All solvers use Structure-of-Arrays memory layout for coalesced GPU access and are validated against analytical solutions.


Performance (RTX 4080, Ada Lovelace)

Simulation             Particles  Steps/s  Notes
N-body (fused kernel)  10,000     ~200     ~5 ms/step
N-body (fused kernel)  32,768     ~25      ~40 ms/step
SPH fluid              30,000     ~60      dt=0.0005, dam break
Rigid body             2,000      1,000+   O(N^2) collision detection

Energy drift over 1,000 velocity-Verlet steps: less than 0.1% for N-body with dt=0.005.
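The drift figure can be sanity-checked with any symplectic-integrator test. A minimal pure-Python sketch, using a harmonic oscillator as a stand-in for the N-body system (the names here are illustrative, not the repository's API):

```python
def velocity_verlet(x, v, accel, dt, steps):
    """Advance (x, v) with velocity Verlet; accel(x) returns acceleration."""
    a = accel(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * a   # half-kick
        x = x + dt * v_half         # drift
        a = accel(x)                # recompute force at the new position
        v = v_half + 0.5 * dt * a   # second half-kick
    return x, v

# Toy stand-in: unit-mass harmonic oscillator, E = v^2/2 + x^2/2.
accel = lambda x: -x
x0, v0 = 1.0, 0.0
e0 = 0.5 * v0 ** 2 + 0.5 * x0 ** 2
x, v = velocity_verlet(x0, v0, accel, dt=0.005, steps=1000)
drift = abs(0.5 * v ** 2 + 0.5 * x ** 2 - e0) / e0
print(f"relative energy drift: {drift:.2e}")
```

Because velocity Verlet is symplectic, the energy error oscillates at O(dt^2) instead of growing, which is why drift stays well below 0.1% over this many steps.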


Profiling & Optimization

All kernels were profiled and optimized with NVIDIA Nsight tools on an RTX 4080.

Nsight Systems (System-wide timeline)

nsys profile --stats=true ./build/release/bin/physics_sim --mode nbody

Findings:

  • Identified CPU-GPU synchronization overhead in naive kernel (12% idle)
  • Verified async memory transfers overlap with compute in fused kernel
  • Timeline shows 3.2x reduction in kernel launch count after fusion

Nsight Compute (Kernel-level analysis)

ncu --set full --kernel-name compute_forces_tiled ./build/release/bin/physics_sim --mode nbody

Key Metrics (compute_forces_tiled, 10K particles):

Metric                     Achieved  Target    Status
Achieved Occupancy         62.3%     >60%      PASS
SM Throughput              78.4%     >70%      PASS
L1/TEX Hit Rate            54.1%     >50%      PASS
Memory Throughput          612 GB/s  72% peak  PASS
Warp Execution Efficiency  94.2%     >90%      PASS

Optimization applied: shared-memory tiling (TILE_SIZE=256) reduced global memory reads by a factor of 256 and raised the L1 hit rate from 18% to 54%.


Full workflow: see docs/performance_tuning.md


Project Layout

physics-sim/
|- include/
|  |- core/            # particle_system.hpp, simulation_engine.hpp, integrator.hpp
|  |- kernels/         # forces.cuh, integrators.cuh, spatial_hashing.cuh
|  |- solvers/         # nbody_solver.hpp, sph_solver.hpp, rigid_body_solver.hpp
|  |- utils/           # cuda_helpers.hpp, profiler.hpp, config_parser.hpp
|- src/
|  |- core/            # particle_system.cu
|  |- kernels/         # forces.cu, integrators.cu, spatial_hashing.cu
|  |- solvers/         # nbody_solver.cu, sph_solver.cu, rigid_body_solver.cu
|  |- main.cu          # CLI entry point
|- tests/
|  |- unit/            # GoogleTest unit tests for each solver
|  |- validation/      # Plummer sphere profile, energy conservation
|- benchmarks/         # Google Benchmark suites (nbody, SPH)
|- examples/
|  |- galaxy_collision/    # Two-galaxy flyby (20K particles)
|  |- fluid_dam_break/     # SPH dam break (30K particles)
|  |- bouncing_spheres/    # 2,000 rigid spheres with gravity
|- configs/            # JSON parameter files for each simulation mode
|- cmake/              # CMake helper modules
|- docs/
|  |- images/              # Screenshots and diagrams
|  |- performance_tuning.md
|  |- user_guide.md
|  |- technical_report.md
|- .github/workflows/  # CI build and nightly benchmark pipelines
|- .vscode/            # Editor settings, tasks, launch configs

Documentation Index

  • docs/user_guide.md -- simulation usage, timestep tuning, and profiler usage
  • docs/performance_tuning.md -- Nsight-based optimization workflow
  • docs/technical_report.md -- architecture, methods, validation, and limitations
  • REPRODUCIBILITY.md -- benchmark and validation reproducibility protocol
  • FIXES.md -- audit trail of correctness and portability fixes

Requirements

Hardware

  • NVIDIA GPU with compute capability 8.6 or higher (RTX 3000 / 4000 series)
  • Tested on RTX 4080 (Ada Lovelace, compute capability 8.9)
  • At least 8 GB VRAM recommended for large particle counts

Software

Tool          Version              Purpose
CUDA Toolkit  12.x                 Compiler (nvcc), runtime, Thrust
CMake         3.18+                Build system
Ninja         any                  Fast parallel build backend
GCC or Clang  GCC 11+ / Clang 14+  Host C++ compiler (C++17)
Python 3      3.8+                 Needed to run Conan
Conan         2.x                  C++ dependency manager

C++ Dependencies (managed by Conan)

Library           Version  Purpose
nlohmann_json     3.11.3   JSON config file parsing
CLI11             2.3.2    Command-line argument parsing
GoogleTest        1.14.0   Unit and validation tests
Google Benchmark  1.8.3    Microbenchmark suite
spdlog            1.13.0   Structured logging

VS Code Setup (Recommended)

Extensions to install

Open the Extensions panel (Ctrl+Shift+X) and install:

Extension                          Publisher  Purpose
C/C++                              Microsoft  IntelliSense and CUDA-GDB debugging
CMake Tools                        Microsoft  One-click configure / build
CMake                              twxs       CMakeLists.txt syntax
Nsight Visual Studio Code Edition  NVIDIA     CUDA kernel profiling from VS Code
GitLens                            GitKraken  Inline blame, history
Error Lens                         Alexander  Inline error highlights
Hex Editor                         Microsoft  Browse binary snapshot files

These are also listed in .vscode/extensions.json so VS Code will prompt you to install them when you open the folder.

Keyboard shortcuts (via tasks.json)

Task             How to run
Build (debug)    Ctrl+Shift+B (default build task)
Build (release)  Terminal -> Run Task -> "cmake: build release"
Run tests        Terminal -> Run Task -> "run: tests"
Run N-body       Terminal -> Run Task -> "run: nbody simulation"
Run SPH          Terminal -> Run Task -> "run: sph simulation"
Run benchmarks   Terminal -> Run Task -> "run: benchmarks"

Building from Scratch

Step 1 -- Install CUDA Toolkit

Download from https://developer.nvidia.com/cuda-downloads

Select your OS and the runfile or .deb installer. After installation, add to your shell:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Verify:

nvcc --version
nvidia-smi

Step 2 -- Install CMake and Ninja

Ubuntu / Debian:

sudo apt update
sudo apt install cmake ninja-build python3-pip

Windows (via winget):

winget install Kitware.CMake
winget install Ninja-build.Ninja

Step 3 -- Install Conan

pip3 install "conan>=2,<3"
conan profile detect --force

Step 4 -- Clone and build

git clone https://github.com/Olajide-Badejo/GPU-Physics-Simulation.git
cd GPU-Physics-Simulation

# Install C++ dependencies
conan install . --build=missing \
    -s build_type=Release \
    -s compiler.cppstd=17 \
    --output-folder=build/conan

# Configure
cmake -B build/release \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_TOOLCHAIN_FILE=build/conan/conan_toolchain.cmake

# Build everything (executables, tests, benchmarks, examples)
cmake --build build/release --parallel

For a debug build with the CUDA device debugger enabled:

conan install . --build=missing \
    -s build_type=Debug \
    -s compiler.cppstd=17 \
    --output-folder=build/conan
cmake -B build/debug \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_TOOLCHAIN_FILE=build/conan/conan_toolchain.cmake \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON   # needed for IntelliSense
cmake --build build/debug --parallel

Running Simulations

Create the output directory first:

mkdir -p output

N-body gravitational simulation

# Default parameters (10K particles, 1000 steps)
./build/release/bin/physics_sim --mode nbody

# Custom config
./build/release/bin/physics_sim --mode nbody --config configs/nbody_default.json

SPH fluid dam break

./build/release/bin/physics_sim --mode sph --config configs/sph_default.json

Rigid body bouncing spheres

./build/release/bin/physics_sim --mode rigid --num-spheres 2000

Standalone examples

./build/release/bin/galaxy_collision     # two-galaxy flyby
./build/release/bin/fluid_dam_break      # SPH dam break
./build/release/bin/bouncing_spheres     # rigid body demo

Running Tests

ctest --test-dir build/debug --output-on-failure

Or run individual test executables:

./build/debug/bin/test_nbody
./build/debug/bin/test_sph
./build/debug/bin/test_validation

The validation tests check:

  • Plummer sphere radial density profile (3-sigma bin tolerance)
  • Energy drift below 0.5% over 1,000 velocity-Verlet steps
  • Fused kernel output matches two-pass reference within 1e-3

Running Benchmarks

# N-body benchmark (1K to 32K particles)
./build/release/bin/nbody_bench

# SPH benchmark (5K to 50K particles)
./build/release/bin/sph_bench

# Export JSON for plotting
./build/release/bin/nbody_bench --benchmark_format=json --benchmark_out=results.json

# Run only the fused kernel benchmarks
./build/release/bin/nbody_bench --benchmark_filter=NBody_Fused

GPU Profiling with Nsight

Nsight Systems (timeline view)

nsys profile --stats=true \
    ./build/release/bin/physics_sim --mode nbody

This generates a .nsys-rep file you can open in the Nsight Systems GUI to see kernel timelines, memory transfers, and CPU/GPU overlap.

Nsight Compute (kernel-level metrics)

ncu --set full \
    --target-processes all \
    --kernel-name compute_forces_tiled \
    ./build/release/bin/physics_sim --mode nbody

Key metrics to look at:

  • SM Active Cycles / SM Throughput -- are all SMs busy?
  • L1/L2 Hit Rate -- is shared memory tiling helping?
  • Memory Throughput -- are you bandwidth-limited?
  • Warp Stall Reasons -- what is blocking warps from issuing instructions?

Target values for the force kernels on Ada:

Metric                     Target
Achieved occupancy         >60%
Warp execution efficiency  >90%
L1 hit rate                >50% (shared-memory tiling)
Memory throughput          >70% of peak

Configuration Reference

JSON files in configs/ are loaded with --config. All fields are optional and fall back to defaults if omitted.

nbody_default.json

Key               Default  Description
num_particles     10000    Total particle count
softening         0.01     Gravitational softening (avoids singularity at r=0)
dt                0.005    Timestep in simulation units
num_steps         1000     Total steps to run
snapshot_every    100      Write a binary snapshot every N steps
use_fused_kernel  true     Use fused force+integrate kernel (faster for large N)
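Assembled from the defaults in the table above, a config with these keys might look like this (the shipped configs/nbody_default.json may contain additional fields):

```json
{
  "num_particles": 10000,
  "softening": 0.01,
  "dt": 0.005,
  "num_steps": 1000,
  "snapshot_every": 100,
  "use_fused_kernel": true
}
```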

sph_default.json

Key               Default  Description
num_particles     50000    Particle count
particle_mass     0.001    Mass per particle
smoothing_length  0.05     SPH kernel radius h
rest_density      1000.0   Reference density (kg/m^3 in real units)
gas_constant      2.0      Stiffness of the equation of state
viscosity         0.15     Artificial viscosity coefficient
dt                0.0005   Timestep
num_steps         500      Steps to simulate

Snapshot Format

Binary snapshots written by write_snapshot() have the following layout:

[uint64_t N]                     -- number of particles (8 bytes)
[float x0][float y0][float z0]   -- position of particle 0 (12 bytes)
[float x1][float y1][float z1]   -- position of particle 1
...

Total size: 8 + N * 12 bytes.

Reading in Python:

import numpy as np

with open("output/nbody_step00100.bin", "rb") as f:
    n = np.frombuffer(f.read(8), dtype=np.uint64)[0]
    pos = np.frombuffer(f.read(), dtype=np.float32).reshape(n, 3)

print(f"{n} particles, center of mass: {pos.mean(axis=0)}")
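The inverse direction -- writing a snapshot from Python, e.g. to fabricate test fixtures -- follows the same layout. A hypothetical helper, not part of the repository:

```python
import os
import tempfile

import numpy as np

def write_snapshot_py(path, positions):
    """Write an (N, 3) float32 array in the documented snapshot layout:
    [uint64 N][x0 y0 z0][x1 y1 z1]..."""
    pos = np.ascontiguousarray(positions, dtype=np.float32)
    with open(path, "wb") as f:
        f.write(np.uint64(pos.shape[0]).tobytes())  # 8-byte particle count
        f.write(pos.tobytes())                       # tightly packed positions

# Round-trip against the reader shown above.
path = os.path.join(tempfile.gettempdir(), "snapshot_test.bin")
pts = np.random.rand(100, 3).astype(np.float32)
write_snapshot_py(path, pts)
with open(path, "rb") as f:
    n = int(np.frombuffer(f.read(8), dtype=np.uint64)[0])
    pos = np.frombuffer(f.read(), dtype=np.float32).reshape(n, 3)
assert n == 100 and np.array_equal(pos, pts)
```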

Key CUDA Techniques

Shared-memory tiling (N-body forces): Source particles are loaded cooperatively into __shared__ memory in blocks of TILE_SIZE=256. Every target particle interacts with all particles in the tile before the tile is replaced. This reduces global memory reads by a factor of TILE_SIZE compared to naive all-pairs.
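The tile traversal can be mirrored on the CPU with NumPy, which is also a convenient reference when validating kernel output. A sketch only (the real kernel stages each tile in __shared__ memory and loops over it per thread; names here are illustrative):

```python
import numpy as np

TILE_SIZE = 256  # mirrors the CUDA tile size

def forces_tiled(pos, mass, softening=0.01, G=1.0):
    """All-pairs softened gravity, consuming sources one tile at a time --
    the same traversal order the shared-memory CUDA kernel uses."""
    n = pos.shape[0]
    acc = np.zeros_like(pos)
    for start in range(0, n, TILE_SIZE):
        tile_pos = pos[start:start + TILE_SIZE]   # slice a block would stage in shared memory
        tile_m = mass[start:start + TILE_SIZE]
        d = tile_pos[None, :, :] - pos[:, None, :]       # (n, tile, 3) separations
        r2 = (d ** 2).sum(-1) + softening ** 2           # softened squared distance
        acc += G * (tile_m[None, :, None] * d / r2[..., None] ** 1.5).sum(axis=1)
    return acc

rng = np.random.default_rng(0)
pos = rng.standard_normal((512, 3))
mass = np.ones(512)
a_tiled = forces_tiled(pos, mass)

# Naive all-pairs reference: identical math, no tiling.
d = pos[None, :, :] - pos[:, None, :]
r2 = (d ** 2).sum(-1) + 0.01 ** 2
a_ref = (mass[None, :, None] * d / r2[..., None] ** 1.5).sum(axis=1)
assert np.allclose(a_tiled, a_ref)
```

Note the self-interaction term contributes zero (separation is zero while the softened distance is not), so no special-casing of i == j is needed.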

Kernel fusion: The compute_and_integrate_fused kernel computes forces and applies the velocity-Verlet half-kick in a single pass. This eliminates the intermediate force array write and read, cutting memory traffic by roughly 3x and improving performance 15-25% on Ada Lovelace.

Spatial hash grid (SPH): Particles are bucketed into a hash table by their 3D grid cell. Neighbor search then only visits 27 adjacent cells instead of all N particles, giving O(N) complexity per step. Thrust sort_by_key is used for the GPU-side radix sort.
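A CPU sketch of the same idea, using a Python dict in place of the GPU's sorted cell-key array (illustrative only; the repository's version relies on Thrust sort_by_key):

```python
from collections import defaultdict

import numpy as np

def build_hash_grid(pos, cell_size):
    """Bucket particle indices by their integer 3D cell coordinate."""
    grid = defaultdict(list)
    cells = np.floor(pos / cell_size).astype(np.int64)
    for i, c in enumerate(map(tuple, cells)):
        grid[c].append(i)
    return grid, cells

def neighbors_within(pos, i, h, grid, cells):
    """Indices within radius h of particle i, visiting only the 27 adjacent cells."""
    cx, cy, cz = cells[i]
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                    if np.linalg.norm(pos[j] - pos[i]) <= h:
                        out.append(j)
    return out

rng = np.random.default_rng(1)
pos = rng.random((500, 3))
h = 0.05                               # smoothing length; cell size equals h
grid, cells = build_hash_grid(pos, h)
nb = neighbors_within(pos, 0, h, grid, cells)

# Brute-force check over all N particles.
ref = [j for j in range(500) if np.linalg.norm(pos[j] - pos[0]) <= h]
assert sorted(nb) == ref
```

Choosing the cell size equal to the smoothing length h guarantees every neighbor within h falls inside the 27-cell stencil.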

SoA layout: All particle data is stored as separate arrays per channel (x[], y[], z[], ...) rather than interleaved structs. This gives fully coalesced 128-byte cache-line reads when sequential threads access the same channel.
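The layout difference is easy to see with NumPy structured arrays (an illustrative comparison, not repository code):

```python
import numpy as np

n = 1024
# AoS: interleaved struct -- reading all x values strides over 24 bytes each.
aos = np.zeros(n, dtype=[("x", "f4"), ("y", "f4"), ("z", "f4"),
                         ("vx", "f4"), ("vy", "f4"), ("vz", "f4")])
# SoA: one contiguous array per channel -- reading all x touches 4-byte strides,
# so 32 sequential GPU threads fall into a single 128-byte transaction.
soa = {k: np.zeros(n, dtype=np.float32) for k in ("x", "y", "z", "vx", "vy", "vz")}

assert aos["x"].strides == (24,)   # strided: one float every 24 bytes
assert soa["x"].strides == (4,)    # contiguous: coalesced access
```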

__launch_bounds__: Force kernels are annotated with __launch_bounds__(256, 4) to give the compiler a register budget hint. This prevents the compiler from allocating more registers than the hardware can schedule at the target occupancy.


Troubleshooting

cudaErrorNoKernelImageForDevice at launch -- your GPU's compute capability is not in CMAKE_CUDA_ARCHITECTURES. Add it to CMakeLists.txt (e.g., set(CMAKE_CUDA_ARCHITECTURES 86 89)) and rebuild.

Conan: ERROR: Package ... not found -- run conan install . --build=missing; the --build=missing flag builds any package that does not have a pre-built binary for your platform.

nvcc not found -- add CUDA to PATH:

export PATH=/usr/local/cuda/bin:$PATH

Extremely slow first build -- Conan downloads and compiles all dependencies from source on the first run, which can take 10-20 minutes. Subsequent builds use the local cache and are fast.

CUDA error: out of memory -- reduce num_particles in the config file or free other GPU workloads before running.


Validation Criteria

Metric                             Target              How checked
Energy drift (N-body, 1000 steps)  <0.1%               test_validation
Plummer sphere density profile     p-value > 0.05      test_validation
Fused vs two-pass agreement        <1e-3 per particle  test_nbody
All unit tests                     100% pass           ctest

Project Standards

  • License: LICENSE (MIT)
  • Contribution guide: CONTRIBUTING.md
  • Code of conduct: CODE_OF_CONDUCT.md
  • Security policy: SECURITY.md
  • Citation metadata: CITATION.cff
  • Release history: CHANGELOG.md

If this repository is used in academic work, include a citation generated from CITATION.cff.


License

MIT License. See LICENSE for details.
