A high-performance physics simulation framework built on CUDA, targeting real-time performance for large particle systems on consumer GPUs (RTX 4080 / 4090).
Three solvers are included:
- N-body -- all-pairs gravitational simulation with shared-memory tiling and an optional fused force+integration kernel
- SPH Fluid -- Smoothed Particle Hydrodynamics with spatial hash grid neighbor search
- Rigid Body -- sphere-sphere collision detection and impulse-based resolution
All solvers use Structure-of-Arrays memory layout for coalesced GPU access and are validated against analytical solutions.
| Simulation | Particles | Steps/s | Notes |
|---|---|---|---|
| N-body (fused kernel) | 10,000 | ~200 | ~5 ms/step |
| N-body (fused kernel) | 32,768 | ~25 | ~40 ms/step |
| SPH fluid | 30,000 | ~60 | dt=0.0005, dam break |
| Rigid body | 2,000 | 1,000+ | O(N^2) collision detection |
Energy drift over 1,000 velocity-Verlet steps: less than 0.1% for N-body with dt=0.005.
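The drift check can be reproduced in miniature on the CPU. The sketch below is hypothetical standalone code (not the project's test suite): it integrates an equal-mass two-body circular orbit with the same kick-drift-kick velocity-Verlet scheme and dt=0.005, then reports the relative energy error.

```cpp
#include <cmath>

// Relative energy drift of an equal-mass circular two-body orbit (G = 1)
// after `steps` kick-drift-kick velocity-Verlet steps of size `dt`.
double energy_drift(int steps, double dt) {
    // Bodies of mass 1 at x = -0.5 and +0.5; circular speed sqrt(0.5) each.
    double px[2] = {-0.5, 0.5}, py[2] = {0.0, 0.0};
    double vx[2] = {0.0, 0.0};
    double vy[2] = {-std::sqrt(0.5), std::sqrt(0.5)};

    auto energy = [&] {
        double dx = px[1] - px[0], dy = py[1] - py[0];
        double r = std::sqrt(dx * dx + dy * dy);
        double ke = 0.5 * (vx[0] * vx[0] + vy[0] * vy[0]
                         + vx[1] * vx[1] + vy[1] * vy[1]);
        return ke - 1.0 / r;  // m0 = m1 = G = 1
    };
    auto accel = [&](double ax[2], double ay[2]) {
        double dx = px[1] - px[0], dy = py[1] - py[0];
        double r2 = dx * dx + dy * dy;
        double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
        ax[0] =  dx * inv_r3; ay[0] =  dy * inv_r3;  // pull of body 1 on 0
        ax[1] = -dx * inv_r3; ay[1] = -dy * inv_r3;
    };

    double e0 = energy();
    double ax[2], ay[2];
    accel(ax, ay);
    for (int s = 0; s < steps; ++s) {
        for (int i = 0; i < 2; ++i) {  // half-kick
            vx[i] += 0.5 * dt * ax[i]; vy[i] += 0.5 * dt * ay[i];
        }
        for (int i = 0; i < 2; ++i) {  // drift
            px[i] += dt * vx[i]; py[i] += dt * vy[i];
        }
        accel(ax, ay);
        for (int i = 0; i < 2; ++i) {  // half-kick
            vx[i] += 0.5 * dt * ax[i]; vy[i] += 0.5 * dt * ay[i];
        }
    }
    return std::fabs(energy() - e0) / std::fabs(e0);
}
```

For this orbit, energy_drift(1000, 0.005) comes out well below the 0.1% figure, as expected for a symplectic integrator at this timestep.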
All kernels profiled and optimized using NVIDIA Nsight tools on RTX 4080.
```bash
nsys profile --stats=true ./build/release/bin/physics_sim --mode nbody
```

Findings:
- Identified CPU-GPU synchronization overhead in naive kernel (12% idle)
- Verified async memory transfers overlap with compute in fused kernel
- Timeline shows 3.2x reduction in kernel launch count after fusion
```bash
ncu --set full --kernel-name compute_forces_tiled ./build/release/bin/physics_sim --mode nbody
```

Key Metrics (compute_forces_tiled, 10K particles):
| Metric | Achieved | Target | Status |
|---|---|---|---|
| Achieved Occupancy | 62.3% | >60% | PASS |
| SM Throughput | 78.4% | >70% | PASS |
| L1/TEX Hit Rate | 54.1% | >50% | PASS |
| Memory Throughput | 612 GB/s (72% of peak) | >70% of peak | PASS |
| Warp Execution Efficiency | 94.2% | >90% | PASS |
Optimization applied: Shared-memory tiling (TILE_SIZE=256) reduced global memory reads by 256x, increased L1 hit rate from 18% to 54%.
Full workflow: see docs/performance_tuning.md
physics-sim/
|- include/
| |- core/ # particle_system.hpp, simulation_engine.hpp, integrator.hpp
| |- kernels/ # forces.cuh, integrators.cuh, spatial_hashing.cuh
| |- solvers/ # nbody_solver.hpp, sph_solver.hpp, rigid_body_solver.hpp
| |- utils/ # cuda_helpers.hpp, profiler.hpp, config_parser.hpp
|- src/
| |- core/ # particle_system.cu
| |- kernels/ # forces.cu, integrators.cu, spatial_hashing.cu
| |- solvers/ # nbody_solver.cu, sph_solver.cu, rigid_body_solver.cu
| |- main.cu # CLI entry point
|- tests/
| |- unit/ # GoogleTest unit tests for each solver
| |- validation/ # Plummer sphere profile, energy conservation
|- benchmarks/ # Google Benchmark suites (nbody, SPH)
|- examples/
| |- galaxy_collision/ # Two-galaxy flyby (20K particles)
| |- fluid_dam_break/ # SPH dam break (30K particles)
| |- bouncing_spheres/ # 2,000 rigid spheres with gravity
|- configs/ # JSON parameter files for each simulation mode
|- cmake/ # CMake helper modules
|- docs/
| |- images/ # Screenshots and diagrams
| |- performance_tuning.md
| |- user_guide.md
| |- technical_report.md
|- .github/workflows/ # CI build and nightly benchmark pipelines
|- .vscode/ # Editor settings, tasks, launch configs
- docs/user_guide.md -- simulation usage, timestep tuning, and profiler usage
- docs/performance_tuning.md -- Nsight-based optimization workflow
- docs/technical_report.md -- architecture, methods, validation, and limitations
- REPRODUCIBILITY.md -- benchmark and validation reproducibility protocol
- FIXES.md -- audit trail of correctness and portability fixes
- NVIDIA GPU with compute capability 8.6 or higher (RTX 3000 / 4000 series)
- Tested on RTX 4080 (Ada Lovelace, compute capability 8.9)
- At least 8 GB VRAM recommended for large particle counts
| Tool | Version | Purpose |
|---|---|---|
| CUDA Toolkit | 12.x | Compiler (nvcc), runtime, Thrust |
| CMake | 3.18+ | Build system |
| Ninja | any | Fast parallel build backend |
| GCC or Clang | GCC 11+ / Clang 14+ | Host C++ compiler (C++17) |
| Python 3 | 3.8+ | Needed to run Conan |
| Conan | 2.x | C++ dependency manager |
| Library | Version | Purpose |
|---|---|---|
| nlohmann_json | 3.11.3 | JSON config file parsing |
| CLI11 | 2.3.2 | Command-line argument parsing |
| GoogleTest | 1.14.0 | Unit and validation tests |
| Google Benchmark | 1.8.3 | Microbenchmark suite |
| spdlog | 1.13.0 | Structured logging |
Open the Extensions panel (Ctrl+Shift+X) and install:
| Extension | Publisher | Purpose |
|---|---|---|
| C/C++ | Microsoft | IntelliSense and CUDA-GDB debugging |
| CMake Tools | Microsoft | One-click configure / build |
| CMake | twxs | CMakeLists.txt syntax |
| Nsight Visual Studio Code Edition | NVIDIA | CUDA kernel profiling from VS Code |
| GitLens | GitKraken | Inline blame, history |
| Error Lens | Alexander | Inline error highlights |
| Hex Editor | Microsoft | Browse binary snapshot files |
These are also listed in .vscode/extensions.json so VS Code will prompt you to install them when you open the folder.
| Task | How to run |
|---|---|
| Build (debug) | Ctrl+Shift+B (default build task) |
| Build (release) | Terminal -> Run Task -> "cmake: build release" |
| Run tests | Terminal -> Run Task -> "run: tests" |
| Run N-body | Terminal -> Run Task -> "run: nbody simulation" |
| Run SPH | Terminal -> Run Task -> "run: sph simulation" |
| Run benchmarks | Terminal -> Run Task -> "run: benchmarks" |
Download from https://developer.nvidia.com/cuda-downloads.
Select your OS and the runfile or .deb installer. After installation, add to your shell:

```bash
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Verify:

```bash
nvcc --version
nvidia-smi
```

Ubuntu / Debian:

```bash
sudo apt update
sudo apt install cmake ninja-build python3-pip
```

Windows (via winget):

```bash
winget install Kitware.CMake
winget install Ninja-build.Ninja
```

Install Conan and create a default profile:

```bash
pip3 install "conan>=2,<3"
conan profile detect --force
```

Clone the repository:

```bash
git clone https://github.com/Olajide-Badejo/GPU-Physics-Simulation.git
cd GPU-Physics-Simulation
```
```bash
# Install C++ dependencies
conan install . --build=missing \
    -s build_type=Release \
    -s compiler.cppstd=17 \
    --output-folder=build/conan

# Configure
cmake -B build/release \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_TOOLCHAIN_FILE=build/conan/conan_toolchain.cmake

# Build everything (executables, tests, benchmarks, examples)
cmake --build build/release --parallel
```

For a debug build with the CUDA device debugger enabled:
```bash
conan install . --build=missing \
    -s build_type=Debug \
    -s compiler.cppstd=17 \
    --output-folder=build/conan

cmake -B build/debug \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_TOOLCHAIN_FILE=build/conan/conan_toolchain.cmake \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON  # needed for IntelliSense

cmake --build build/debug --parallel
```

Create the output directory first:
```bash
mkdir -p output

# Default parameters (10K particles, 1000 steps)
./build/release/bin/physics_sim --mode nbody

# Custom config
./build/release/bin/physics_sim --mode nbody --config configs/nbody_default.json

# SPH fluid
./build/release/bin/physics_sim --mode sph --config configs/sph_default.json

# Rigid body
./build/release/bin/physics_sim --mode rigid --num-spheres 2000
```

Example binaries:

```bash
./build/release/bin/galaxy_collision    # two-galaxy flyby
./build/release/bin/fluid_dam_break     # SPH dam break
./build/release/bin/bouncing_spheres    # rigid body demo
```

Run the tests:

```bash
ctest --test-dir build/debug --output-on-failure
```

Or run individual test executables:

```bash
./build/debug/bin/test_nbody
./build/debug/bin/test_sph
./build/debug/bin/test_validation
```

The validation tests check:
- Plummer sphere radial density profile (3-sigma bin tolerance)
- Energy drift below 0.5% over 1,000 velocity-Verlet steps
- Fused kernel output matches two-pass reference within 1e-3
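The fused-vs-two-pass check can be illustrated on the CPU: compute softened all-pairs accelerations into a temporary array and then half-kick (two-pass), or do both per particle in one loop (fused). This is a sketch under SoA assumptions, not the project's test code; since positions are read-only during the pass, the two schemes must agree.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

struct SoA {  // Structure-of-Arrays, mirroring the GPU layout
    std::vector<float> x, y, z, vx, vy, vz;
};

// Softened all-pairs acceleration on particle i (G = 1, unit masses).
// The i == j term contributes zero because dx = dy = dz = 0.
static void accel_at(const SoA& p, std::size_t i, float soft,
                     float& ax, float& ay, float& az) {
    ax = ay = az = 0.0f;
    for (std::size_t j = 0; j < p.x.size(); ++j) {
        float dx = p.x[j] - p.x[i], dy = p.y[j] - p.y[i], dz = p.z[j] - p.z[i];
        float r2 = dx * dx + dy * dy + dz * dz + soft * soft;
        float inv_r3 = 1.0f / (r2 * std::sqrt(r2));
        ax += dx * inv_r3; ay += dy * inv_r3; az += dz * inv_r3;
    }
}

// Two-pass: store all accelerations, then apply the half-kick.
static void half_kick_two_pass(SoA& p, float dt, float soft) {
    std::size_t n = p.x.size();
    std::vector<float> ax(n), ay(n), az(n);
    for (std::size_t i = 0; i < n; ++i)
        accel_at(p, i, soft, ax[i], ay[i], az[i]);
    for (std::size_t i = 0; i < n; ++i) {
        p.vx[i] += 0.5f * dt * ax[i];
        p.vy[i] += 0.5f * dt * ay[i];
        p.vz[i] += 0.5f * dt * az[i];
    }
}

// Fused: acceleration and half-kick in one loop, no intermediate array.
static void half_kick_fused(SoA& p, float dt, float soft) {
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        float ax, ay, az;
        accel_at(p, i, soft, ax, ay, az);
        p.vx[i] += 0.5f * dt * ax;
        p.vy[i] += 0.5f * dt * ay;
        p.vz[i] += 0.5f * dt * az;
    }
}

// Max per-component velocity difference between the two schemes on a
// random cloud of n particles.
float max_vel_diff(std::size_t n) {
    std::srand(42);
    auto rnd = [] { return std::rand() / (float)RAND_MAX - 0.5f; };
    SoA a;
    for (std::size_t i = 0; i < n; ++i) {
        a.x.push_back(rnd()); a.y.push_back(rnd()); a.z.push_back(rnd());
        a.vx.push_back(0);    a.vy.push_back(0);    a.vz.push_back(0);
    }
    SoA b = a;
    half_kick_two_pass(a, 0.005f, 0.01f);
    half_kick_fused(b, 0.005f, 0.01f);
    float m = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        m = std::max(m, std::fabs(a.vx[i] - b.vx[i]));
        m = std::max(m, std::fabs(a.vy[i] - b.vy[i]));
        m = std::max(m, std::fabs(a.vz[i] - b.vz[i]));
    }
    return m;
}
```

On the sequential CPU the two paths perform identical floating-point operations, so max_vel_diff returns zero; on the GPU, reduction-order differences motivate the 1e-3 tolerance.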
```bash
# N-body benchmark (1K to 32K particles)
./build/release/bin/nbody_bench

# SPH benchmark (5K to 50K particles)
./build/release/bin/sph_bench

# Export JSON for plotting
./build/release/bin/nbody_bench --benchmark_format=json --benchmark_out=results.json

# Run only the fused kernel benchmarks
./build/release/bin/nbody_bench --benchmark_filter=NBody_Fused
```

Profile with Nsight Systems:

```bash
nsys profile --stats=true \
    ./build/release/bin/physics_sim --mode nbody
```

This generates a .nsys-rep file you can open in the Nsight Systems GUI to see kernel timelines, memory transfers, and CPU/GPU overlap.
```bash
ncu --set full \
    --target-processes all \
    --kernel-name compute_forces_tiled \
    ./build/release/bin/physics_sim --mode nbody
```

Key metrics to look at:
- SM Active Cycles / SM Throughput -- are all SMs busy?
- L1/L2 Hit Rate -- is shared memory tiling helping?
- Memory Throughput -- are you bandwidth-limited?
- Warp Stall Reasons -- what is blocking warps from issuing instructions?
Target values for the force kernels on Ada:
| Metric | Target |
|---|---|
| Achieved occupancy | more than 60% |
| Warp execution efficiency | more than 90% |
| L1 hit rate | more than 50% (shared mem tiling) |
| Memory throughput | more than 70% of peak |
JSON files in configs/ are loaded with --config. All fields are optional and fall back to defaults if omitted.
| Key | Default | Description |
|---|---|---|
| num_particles | 10000 | Total particle count |
| softening | 0.01 | Gravitational softening (avoids singularity at r=0) |
| dt | 0.005 | Timestep in simulation units |
| num_steps | 1000 | Total steps to run |
| snapshot_every | 100 | Write a binary snapshot every N steps |
| use_fused_kernel | true | Use fused force+integrate kernel (faster for large N) |
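For example, a config overriding only a few fields might look like the following (hypothetical file contents; every omitted key falls back to its default):

```json
{
  "num_particles": 20000,
  "dt": 0.0025,
  "num_steps": 2000,
  "use_fused_kernel": true
}
```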
| Key | Default | Description |
|---|---|---|
| num_particles | 50000 | Particle count |
| particle_mass | 0.001 | Mass per particle |
| smoothing_length | 0.05 | SPH kernel radius h |
| rest_density | 1000.0 | Reference density (kg/m^3 in real units) |
| gas_constant | 2.0 | Stiffness of the equation of state |
| viscosity | 0.15 | Artificial viscosity coefficient |
| dt | 0.0005 | Timestep |
| num_steps | 500 | Steps to simulate |
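To see how these parameters interact, here is a generic SPH sketch using the standard poly6 density kernel and the linear equation of state p = gas_constant * (rho - rest_density). The poly6 choice is an assumption for illustration; the project's actual kernel lives in the CUDA sources.

```cpp
#include <cmath>
#include <vector>

const double kPi = 3.141592653589793;

// Standard poly6 smoothing kernel with support radius h (a common SPH
// choice; assumed here, not necessarily the project's kernel).
double poly6(double r, double h) {
    if (r >= h) return 0.0;
    double c = 315.0 / (64.0 * kPi * std::pow(h, 9));
    double d = h * h - r * r;
    return c * d * d * d;
}

// Density at a particle from its neighbour distances, then the linear
// equation of state driven by the config parameters above.
double pressure_at(const std::vector<double>& neighbour_dist,
                   double particle_mass, double smoothing_length,
                   double gas_constant, double rest_density) {
    double rho = 0.0;
    for (double r : neighbour_dist)
        rho += particle_mass * poly6(r, smoothing_length);
    return gas_constant * (rho - rest_density);
}
```

A larger gas_constant stiffens the fluid (stronger pressure response to compression) but forces a smaller dt for stability, which is why the SPH default timestep is 10x smaller than the N-body one.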
Binary snapshots written by write_snapshot() have the following layout:
```
[uint64_t N]                    -- number of particles (8 bytes)
[float x0][float y0][float z0]  -- position of particle 0 (12 bytes)
[float x1][float y1][float z1]  -- position of particle 1
...
```

Total size: 8 + N * 12 bytes.
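A matching writer is a few lines of C++. This is a sketch of the layout only, not the project's actual write_snapshot() implementation:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Write interleaved positions (N*3 floats) in the snapshot layout:
// [uint64_t N][x0 y0 z0][x1 y1 z1]...
// Returns total bytes written (8 + N*12), or -1 on error.
long write_snapshot(const char* path, const std::vector<float>& xyz) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return -1;
    std::uint64_t n = xyz.size() / 3;
    std::size_t header = std::fwrite(&n, sizeof n, 1, f);        // 8 bytes
    std::size_t body = std::fwrite(xyz.data(), sizeof(float),
                                   xyz.size(), f);               // 12 B/particle
    std::fclose(f);
    if (header != 1 || body != xyz.size()) return -1;
    return (long)(sizeof n + body * sizeof(float));
}
```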
Reading in Python:

```python
import numpy as np

with open("output/nbody_step00100.bin", "rb") as f:
    n = np.frombuffer(f.read(8), dtype=np.uint64)[0]
    pos = np.frombuffer(f.read(), dtype=np.float32).reshape(n, 3)
print(f"{n} particles, center of mass: {pos.mean(axis=0)}")
```

Shared-memory tiling (N-body forces): Source particles are loaded cooperatively into __shared__ memory in blocks of TILE_SIZE=256. Every target particle interacts with all particles in the tile before the tile is replaced. This reduces global memory reads by a factor of TILE_SIZE compared to naive all-pairs.
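The tile loop can be mimicked on the CPU. The sketch below is an analogue only (no shared memory here, and the identifiers are hypothetical); it shows the same blocking of source particles that the kernel stages in __shared__ memory:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

constexpr int TILE_SIZE = 256;

// CPU analogue of the tiled all-pairs force loop: source particles are
// consumed in TILE_SIZE blocks. Positions are SoA; writes the softened
// acceleration of particle i (G = 1, unit masses) to (ax, ay, az).
void tiled_accel(const std::vector<float>& x, const std::vector<float>& y,
                 const std::vector<float>& z, std::size_t i, float soft,
                 float& ax, float& ay, float& az) {
    ax = ay = az = 0.0f;
    std::size_t n = x.size();
    for (std::size_t tile = 0; tile < n; tile += TILE_SIZE) {
        std::size_t end = std::min(tile + (std::size_t)TILE_SIZE, n);
        // On the GPU, x/y/z[tile..end) are loaded cooperatively into
        // __shared__ memory here; every thread then reads them on-chip.
        for (std::size_t j = tile; j < end; ++j) {
            float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            float r2 = dx * dx + dy * dy + dz * dz + soft * soft;
            float inv_r3 = 1.0f / (r2 * std::sqrt(r2));
            ax += dx * inv_r3; ay += dy * inv_r3; az += dz * inv_r3;
        }
    }
}
```

Each global load services TILE_SIZE interactions instead of one, which is where the 256x reduction in global memory reads comes from.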
Kernel fusion: The compute_and_integrate_fused kernel computes forces and applies the velocity-Verlet half-kick in a single pass. This eliminates the intermediate force array write and read, cutting memory traffic by roughly 3x and improving performance 15-25% on Ada Lovelace.
Spatial hash grid (SPH): Particles are bucketed into a hash table by their 3D grid cell. Neighbor search then only visits 27 adjacent cells instead of all N particles, giving O(N) complexity per step. Thrust sort_by_key is used for the GPU-side radix sort.
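A minimal CPU sketch of the idea follows (the GPU path builds a sorted cell list with Thrust rather than a hash map, and the names here are hypothetical):

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy spatial hash grid: particles are bucketed by integer cell index,
// and a neighbour query visits only the 27 surrounding cells.
struct HashGrid {
    float cell;  // cell edge length, typically ~2h for SPH
    std::unordered_map<std::int64_t, std::vector<int>> cells;

    // Pack a 3D cell index into one 64-bit key (21 bits per axis).
    static std::int64_t key(int ix, int iy, int iz) {
        return ((std::int64_t)(ix & 0x1FFFFF) << 42)
             | ((std::int64_t)(iy & 0x1FFFFF) << 21)
             |  (std::int64_t)(iz & 0x1FFFFF);
    }

    void insert(int id, float x, float y, float z) {
        cells[key((int)std::floor(x / cell), (int)std::floor(y / cell),
                  (int)std::floor(z / cell))].push_back(id);
    }

    // Candidate neighbours of (x, y, z): contents of the 27 adjacent cells.
    std::vector<int> query(float x, float y, float z) const {
        int cx = (int)std::floor(x / cell), cy = (int)std::floor(y / cell),
            cz = (int)std::floor(z / cell);
        std::vector<int> out;
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    auto it = cells.find(key(cx + dx, cy + dy, cz + dz));
                    if (it != cells.end())
                        out.insert(out.end(), it->second.begin(),
                                   it->second.end());
                }
        return out;
    }
};
```

With roughly constant particle density, each query returns O(1) candidates, giving the O(N) per-step complexity mentioned above.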
SoA layout: All particle data is stored as separate arrays per channel (x[], y[], z[], ...) rather than interleaved structs. This gives fully coalesced 128-byte cache-line reads when sequential threads access the same channel.
__launch_bounds__: Force kernels are annotated with __launch_bounds__(256, 4) to give the compiler a register budget hint. This prevents the compiler from allocating more registers than the hardware can schedule at the target occupancy.
cudaErrorNoKernelImageForDevice at launch
Your GPU's compute capability is not in CMAKE_CUDA_ARCHITECTURES. Add it to CMakeLists.txt (e.g., set(CMAKE_CUDA_ARCHITECTURES 86 89)) and rebuild.
Conan: ERROR: Package ... not found
Run conan install . --build=missing -- the --build=missing flag builds any package that does not have a pre-built binary for your platform.
nvcc not found
Add CUDA to PATH:

```bash
export PATH=/usr/local/cuda/bin:$PATH
```

Extremely slow first build
Conan downloads and compiles all dependencies from source on the first run. This can take 10-20 minutes. Subsequent builds use the local cache and are fast.
CUDA error: out of memory
Reduce num_particles in the config file or free other GPU workloads before running.
| Metric | Target | How checked |
|---|---|---|
| Energy drift (N-body, 1000 steps) | less than 0.1% | test_validation |
| Plummer sphere density profile | p-value greater than 0.05 | test_validation |
| Fused vs two-pass agreement | less than 1e-3 per particle | test_nbody |
| All unit tests pass | 100% | ctest |
- License: LICENSE (MIT)
- Contribution guide: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md
- Security policy: SECURITY.md
- Citation metadata: CITATION.cff
- Release history: CHANGELOG.md
If this repository is used in academic work, include a citation generated from
CITATION.cff.
MIT License. See LICENSE for details.
