A high-performance physics simulation framework built on CUDA, targeting real-time performance for large particle systems on consumer GPUs (RTX 4080 / 4090).
Three solvers are included:
- N-body -- all-pairs gravitational simulation with shared-memory tiling and an optional fused force+integration kernel
- SPH Fluid -- Smoothed Particle Hydrodynamics with spatial hash grid neighbor search
- Rigid Body -- sphere-sphere collision detection and impulse-based resolution
All solvers use Structure-of-Arrays memory layout for coalesced GPU access and are validated against analytical solutions.
| Simulation | Particles | Steps/s | Notes |
|---|---|---|---|
| N-body (fused kernel) | 10,000 | ~200 | ~5 ms/step |
| N-body (fused kernel) | 32,768 | ~25 | ~40 ms/step |
| SPH fluid | 30,000 | ~60 | dt=0.0005, dam break |
| Rigid body | 2,000 | 1,000+ | O(N^2) collision detection |
Energy drift over 1,000 velocity-Verlet steps: less than 0.1% for N-body with dt=0.005.
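The drift check can be reproduced in miniature on the CPU. The sketch below is hypothetical standalone code (not the project's test suite): it integrates an equal-mass two-body circular orbit with the same kick-drift-kick velocity-Verlet scheme and dt=0.005, then reports the relative energy error.

```cpp
#include <cmath>

// Relative energy drift of an equal-mass circular two-body orbit (G = 1)
// after `steps` kick-drift-kick velocity-Verlet steps of size `dt`.
double energy_drift(int steps, double dt) {
    // Bodies of mass 1 at x = -0.5 and +0.5; circular speed sqrt(0.5) each.
    double px[2] = {-0.5, 0.5}, py[2] = {0.0, 0.0};
    double vx[2] = {0.0, 0.0};
    double vy[2] = {-std::sqrt(0.5), std::sqrt(0.5)};

    auto energy = [&] {
        double dx = px[1] - px[0], dy = py[1] - py[0];
        double r = std::sqrt(dx * dx + dy * dy);
        double ke = 0.5 * (vx[0] * vx[0] + vy[0] * vy[0]
                         + vx[1] * vx[1] + vy[1] * vy[1]);
        return ke - 1.0 / r;  // m0 = m1 = G = 1
    };
    auto accel = [&](double ax[2], double ay[2]) {
        double dx = px[1] - px[0], dy = py[1] - py[0];
        double r2 = dx * dx + dy * dy;
        double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
        ax[0] =  dx * inv_r3; ay[0] =  dy * inv_r3;  // pull of body 1 on 0
        ax[1] = -dx * inv_r3; ay[1] = -dy * inv_r3;
    };

    double e0 = energy();
    double ax[2], ay[2];
    accel(ax, ay);
    for (int s = 0; s < steps; ++s) {
        for (int i = 0; i < 2; ++i) {  // half-kick
            vx[i] += 0.5 * dt * ax[i]; vy[i] += 0.5 * dt * ay[i];
        }
        for (int i = 0; i < 2; ++i) {  // drift
            px[i] += dt * vx[i]; py[i] += dt * vy[i];
        }
        accel(ax, ay);
        for (int i = 0; i < 2; ++i) {  // half-kick
            vx[i] += 0.5 * dt * ax[i]; vy[i] += 0.5 * dt * ay[i];
        }
    }
    return std::fabs(energy() - e0) / std::fabs(e0);
}
```

For this orbit, energy_drift(1000, 0.005) comes out well below the 0.1% figure, as expected for a symplectic integrator at this timestep.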
All kernels profiled and optimized using NVIDIA Nsight tools on RTX 4080.
```bash
nsys profile --stats=true ./build/release/bin/physics_sim --mode nbody
```

Findings:
- Identified CPU-GPU synchronization overhead in naive kernel (12% idle)
- Verified async memory transfers overlap with compute in fused kernel
- Timeline shows 3.2x reduction in kernel launch count after fusion
```bash
ncu --set full --kernel-name compute_forces_tiled ./build/release/bin/physics_sim --mode nbody
```

Key Metrics (compute_forces_tiled, 10K particles):
| Metric | Achieved | Target | Status |
|---|---|---|---|
| Achieved Occupancy | 62.3% | >60% | PASS |
| SM Throughput | 78.4% | >70% | PASS |
| L1/TEX Hit Rate | 54.1% | >50% | PASS |
| Memory Throughput | 612 GB/s (72% of peak) | >70% of peak | PASS |
| Warp Execution Efficiency | 94.2% | >90% | PASS |
Optimization applied: Shared-memory tiling (TILE_SIZE=256) reduced global memory reads by 256x, increased L1 hit rate from 18% to 54%.
Full workflow: see docs/performance_tuning.md
physics-sim/
|- include/
| |- core/ # particle_system.hpp, simulation_engine.hpp, integrator.hpp
| |- kernels/ # forces.cuh, integrators.cuh, spatial_hashing.cuh
| |- solvers/ # nbody_solver.hpp, sph_solver.hpp, rigid_body_solver.hpp
| |- utils/ # cuda_helpers.hpp, profiler.hpp, config_parser.hpp
|- src/
| |- core/ # particle_system.cu
| |- kernels/ # forces.cu, integrators.cu, spatial_hashing.cu
| |- solvers/ # nbody_solver.cu, sph_solver.cu, rigid_body_solver.cu
| |- main.cu # CLI entry point
|- tests/
| |- unit/ # GoogleTest unit tests for each solver
| |- validation/ # Plummer sphere profile, energy conservation
|- benchmarks/ # Google Benchmark suites (nbody, SPH)
|- examples/
| |- galaxy_collision/ # Two-galaxy flyby (20K particles)
| |- fluid_dam_break/ # SPH dam break (30K particles)
| |- bouncing_spheres/ # 2,000 rigid spheres with gravity
|- configs/ # JSON parameter files for each simulation mode
|- cmake/ # CMake helper modules
|- docs/
| |- images/ # Screenshots and diagrams
| |- performance_tuning.md
| |- user_guide.md
| |- technical_report.md
|- .github/workflows/ # CI build and nightly benchmark pipelines
|- .vscode/ # Editor settings, tasks, launch configs
- docs/user_guide.md -- simulation usage, timestep tuning, and profiler usage
- docs/performance_tuning.md -- Nsight-based optimization workflow
- docs/technical_report.md -- architecture, methods, validation, and limitations
- REPRODUCIBILITY.md -- benchmark and validation reproducibility protocol
- FIXES.md -- audit trail of correctness and portability fixes
- NVIDIA GPU with compute capability 8.6 or higher (RTX 3000 / 4000 series)
- Tested on RTX 4080 (Ada Lovelace, compute capability 8.9)
- At least 8 GB VRAM recommended for large particle counts
| Tool | Version | Purpose |
|---|---|---|
| CUDA Toolkit | 12.x | Compiler (nvcc), runtime, Thrust |
| CMake | 3.18+ | Build system |
| Ninja | any | Fast parallel build backend |
| GCC or Clang | GCC 11+ / Clang 14+ | Host C++ compiler (C++17) |
| Python 3 | 3.8+ | Needed to run Conan |
| Conan | 2.x | C++ dependency manager |
| Library | Version | Purpose |
|---|---|---|
| nlohmann_json | 3.11.3 | JSON config file parsing |
| CLI11 | 2.3.2 | Command-line argument parsing |
| GoogleTest | 1.14.0 | Unit and validation tests |
| Google Benchmark | 1.8.3 | Microbenchmark suite |
| spdlog | 1.13.0 | Structured logging |
Open the Extensions panel (Ctrl+Shift+X) and install:
| Extension | Publisher | Purpose |
|---|---|---|
| C/C++ | Microsoft | IntelliSense and CUDA-GDB debugging |
| CMake Tools | Microsoft | One-click configure / build |
| CMake | twxs | CMakeLists.txt syntax |
| Nsight Visual Studio Code Edition | NVIDIA | CUDA kernel profiling from VS Code |
| GitLens | GitKraken | Inline blame, history |
| Error Lens | Alexander | Inline error highlights |
| Hex Editor | Microsoft | Browse binary snapshot files |
These are also listed in .vscode/extensions.json so VS Code will prompt you to install them when you open the folder.
| Task | How to run |
|---|---|
| Build (debug) | Ctrl+Shift+B (default build task) |
| Build (release) | Terminal -> Run Task -> "cmake: build release" |
| Run tests | Terminal -> Run Task -> "run: tests" |
| Run N-body | Terminal -> Run Task -> "run: nbody simulation" |
| Run SPH | Terminal -> Run Task -> "run: sph simulation" |
| Run benchmarks | Terminal -> Run Task -> "run: benchmarks" |
Download from https://developer.nvidia.com/cuda-downloads.
Select your OS and the runfile or .deb installer. After installation, add to your shell:

```bash
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Verify:

```bash
nvcc --version
nvidia-smi
```

Ubuntu / Debian:

```bash
sudo apt update
sudo apt install cmake ninja-build python3-pip
```

Windows (via winget):

```bash
winget install Kitware.CMake
winget install Ninja-build.Ninja
```

Install Conan and create a default profile:

```bash
pip3 install "conan>=2,<3"
conan profile detect --force
```

Clone the repository:

```bash
git clone https://github.com/Olajide-Badejo/GPU-Physics-Simulation.git
cd GPU-Physics-Simulation
```
```bash
# Install C++ dependencies
conan install . --build=missing \
    -s build_type=Release \
    -s compiler.cppstd=17 \
    --output-folder=build/conan

# Configure
cmake -B build/release \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_TOOLCHAIN_FILE=build/conan/conan_toolchain.cmake

# Build everything (executables, tests, benchmarks, examples)
cmake --build build/release --parallel
```

For a debug build with the CUDA device debugger enabled:
```bash
conan install . --build=missing \
    -s build_type=Debug \
    -s compiler.cppstd=17 \
    --output-folder=build/conan

cmake -B build/debug \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_TOOLCHAIN_FILE=build/conan/conan_toolchain.cmake \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON  # needed for IntelliSense

cmake --build build/debug --parallel
```

Create the output directory first:
```bash
mkdir -p output

# Default parameters (10K particles, 1000 steps)
./build/release/bin/physics_sim --mode nbody

# Custom config
./build/release/bin/physics_sim --mode nbody --config configs/nbody_default.json

# SPH fluid
./build/release/bin/physics_sim --mode sph --config configs/sph_default.json

# Rigid body
./build/release/bin/physics_sim --mode rigid --num-spheres 2000
```

Example binaries:

```bash
./build/release/bin/galaxy_collision    # two-galaxy flyby
./build/release/bin/fluid_dam_break     # SPH dam break
./build/release/bin/bouncing_spheres    # rigid body demo
```

Run the tests:

```bash
ctest --test-dir build/debug --output-on-failure
```

Or run individual test executables:

```bash
./build/debug/bin/test_nbody
./build/debug/bin/test_sph
./build/debug/bin/test_validation
```

The validation tests check:
- Plummer sphere radial density profile (3-sigma bin tolerance)
- Energy drift below 0.5% over 1,000 velocity-Verlet steps
- Fused kernel output matches two-pass reference within 1e-3
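The fused-vs-two-pass check can be illustrated on the CPU: compute softened all-pairs accelerations into a temporary array and then half-kick (two-pass), or do both per particle in one loop (fused). This is a sketch under SoA assumptions, not the project's test code; since positions are read-only during the pass, the two schemes must agree.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

struct SoA {  // Structure-of-Arrays, mirroring the GPU layout
    std::vector<float> x, y, z, vx, vy, vz;
};

// Softened all-pairs acceleration on particle i (G = 1, unit masses).
// The i == j term contributes zero because dx = dy = dz = 0.
static void accel_at(const SoA& p, std::size_t i, float soft,
                     float& ax, float& ay, float& az) {
    ax = ay = az = 0.0f;
    for (std::size_t j = 0; j < p.x.size(); ++j) {
        float dx = p.x[j] - p.x[i], dy = p.y[j] - p.y[i], dz = p.z[j] - p.z[i];
        float r2 = dx * dx + dy * dy + dz * dz + soft * soft;
        float inv_r3 = 1.0f / (r2 * std::sqrt(r2));
        ax += dx * inv_r3; ay += dy * inv_r3; az += dz * inv_r3;
    }
}

// Two-pass: store all accelerations, then apply the half-kick.
static void half_kick_two_pass(SoA& p, float dt, float soft) {
    std::size_t n = p.x.size();
    std::vector<float> ax(n), ay(n), az(n);
    for (std::size_t i = 0; i < n; ++i)
        accel_at(p, i, soft, ax[i], ay[i], az[i]);
    for (std::size_t i = 0; i < n; ++i) {
        p.vx[i] += 0.5f * dt * ax[i];
        p.vy[i] += 0.5f * dt * ay[i];
        p.vz[i] += 0.5f * dt * az[i];
    }
}

// Fused: acceleration and half-kick in one loop, no intermediate array.
static void half_kick_fused(SoA& p, float dt, float soft) {
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        float ax, ay, az;
        accel_at(p, i, soft, ax, ay, az);
        p.vx[i] += 0.5f * dt * ax;
        p.vy[i] += 0.5f * dt * ay;
        p.vz[i] += 0.5f * dt * az;
    }
}

// Max per-component velocity difference between the two schemes on a
// random cloud of n particles.
float max_vel_diff(std::size_t n) {
    std::srand(42);
    auto rnd = [] { return std::rand() / (float)RAND_MAX - 0.5f; };
    SoA a;
    for (std::size_t i = 0; i < n; ++i) {
        a.x.push_back(rnd()); a.y.push_back(rnd()); a.z.push_back(rnd());
        a.vx.push_back(0);    a.vy.push_back(0);    a.vz.push_back(0);
    }
    SoA b = a;
    half_kick_two_pass(a, 0.005f, 0.01f);
    half_kick_fused(b, 0.005f, 0.01f);
    float m = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        m = std::max(m, std::fabs(a.vx[i] - b.vx[i]));
        m = std::max(m, std::fabs(a.vy[i] - b.vy[i]));
        m = std::max(m, std::fabs(a.vz[i] - b.vz[i]));
    }
    return m;
}
```

On the sequential CPU the two paths perform identical floating-point operations, so max_vel_diff returns zero; on the GPU, reduction-order differences motivate the 1e-3 tolerance.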
```bash
# N-body benchmark (1K to 32K particles)
./build/release/bin/nbody_bench

# SPH benchmark (5K to 50K particles)
./build/release/bin/sph_bench

# Export JSON for plotting
./build/release/bin/nbody_bench --benchmark_format=json --benchmark_out=results.json

# Run only the fused kernel benchmarks
./build/release/bin/nbody_bench --benchmark_filter=NBody_Fused
```

Profile with Nsight Systems:

```bash
nsys profile --stats=true \
    ./build/release/bin/physics_sim --mode nbody
```

This generates a .nsys-rep file you can open in the Nsight Systems GUI to see kernel timelines, memory transfers, and CPU/GPU overlap.
```bash
ncu --set full \
    --target-processes all \
    --kernel-name compute_forces_tiled \
    ./build/release/bin/physics_sim --mode nbody
```

Key metrics to look at:
- SM Active Cycles / SM Throughput -- are all SMs busy?
- L1/L2 Hit Rate -- is shared memory tiling helping?
- Memory Throughput -- are you bandwidth-limited?
- Warp Stall Reasons -- what is blocking warps from issuing instructions?
Target values for the force kernels on Ada:
| Metric | Target |
|---|---|
| Achieved occupancy | more than 60% |
| Warp execution efficiency | more than 90% |
| L1 hit rate | more than 50% (shared mem tiling) |
| Memory throughput | more than 70% of peak |
JSON files in configs/ are loaded with --config. All fields are optional and fall back to defaults if omitted.
| Key | Default | Description |
|---|---|---|
| num_particles | 10000 | Total particle count |
| softening | 0.01 | Gravitational softening (avoids singularity at r=0) |
| dt | 0.005 | Timestep in simulation units |
| num_steps | 1000 | Total steps to run |
| snapshot_every | 100 | Write a binary snapshot every N steps |
| use_fused_kernel | true | Use fused force+integrate kernel (faster for large N) |
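For example, a config overriding only a few fields might look like the following (hypothetical file contents; every omitted key falls back to its default):

```json
{
  "num_particles": 20000,
  "dt": 0.0025,
  "num_steps": 2000,
  "use_fused_kernel": true
}
```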
| Key | Default | Description |
|---|---|---|
| num_particles | 50000 | Particle count |
| particle_mass | 0.001 | Mass per particle |
| smoothing_length | 0.05 | SPH kernel radius h |
| rest_density | 1000.0 | Reference density (kg/m^3 in real units) |
| gas_constant | 2.0 | Stiffness of the equation of state |
| viscosity | 0.15 | Artificial viscosity coefficient |
| dt | 0.0005 | Timestep |
| num_steps | 500 | Steps to simulate |
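To see how these parameters interact, here is a generic SPH sketch using the standard poly6 density kernel and the linear equation of state p = gas_constant * (rho - rest_density). The poly6 choice is an assumption for illustration; the project's actual kernel lives in the CUDA sources.

```cpp
#include <cmath>
#include <vector>

const double kPi = 3.141592653589793;

// Standard poly6 smoothing kernel with support radius h (a common SPH
// choice; assumed here, not necessarily the project's kernel).
double poly6(double r, double h) {
    if (r >= h) return 0.0;
    double c = 315.0 / (64.0 * kPi * std::pow(h, 9));
    double d = h * h - r * r;
    return c * d * d * d;
}

// Density at a particle from its neighbour distances, then the linear
// equation of state driven by the config parameters above.
double pressure_at(const std::vector<double>& neighbour_dist,
                   double particle_mass, double smoothing_length,
                   double gas_constant, double rest_density) {
    double rho = 0.0;
    for (double r : neighbour_dist)
        rho += particle_mass * poly6(r, smoothing_length);
    return gas_constant * (rho - rest_density);
}
```

A larger gas_constant stiffens the fluid (stronger pressure response to compression) but forces a smaller dt for stability, which is why the SPH default timestep is 10x smaller than the N-body one.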
Binary snapshots written by write_snapshot() have the following layout:
```
[uint64_t N]                    -- number of particles (8 bytes)
[float x0][float y0][float z0]  -- position of particle 0 (12 bytes)
[float x1][float y1][float z1]  -- position of particle 1
...
```

Total size: 8 + N * 12 bytes.
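A matching writer is a few lines of C++. This is a sketch of the layout only, not the project's actual write_snapshot() implementation:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Write interleaved positions (N*3 floats) in the snapshot layout:
// [uint64_t N][x0 y0 z0][x1 y1 z1]...
// Returns total bytes written (8 + N*12), or -1 on error.
long write_snapshot(const char* path, const std::vector<float>& xyz) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return -1;
    std::uint64_t n = xyz.size() / 3;
    std::size_t header = std::fwrite(&n, sizeof n, 1, f);        // 8 bytes
    std::size_t body = std::fwrite(xyz.data(), sizeof(float),
                                   xyz.size(), f);               // 12 B/particle
    std::fclose(f);
    if (header != 1 || body != xyz.size()) return -1;
    return (long)(sizeof n + body * sizeof(float));
}
```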
Reading in Python:

```python
import numpy as np

with open("output/nbody_step00100.bin", "rb") as f:
    n = np.frombuffer(f.read(8), dtype=np.uint64)[0]
    pos = np.frombuffer(f.read(), dtype=np.float32).reshape(n, 3)
print(f"{n} particles, center of mass: {pos.mean(axis=0)}")
```

Shared-memory tiling (N-body forces): Source particles are loaded cooperatively into __shared__ memory in blocks of TILE_SIZE=256. Every target particle interacts with all particles in the tile before the tile is replaced. This reduces global memory reads by a factor of TILE_SIZE compared to naive all-pairs.
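The tile loop can be mimicked on the CPU. The sketch below is an analogue only (no shared memory here, and the identifiers are hypothetical); it shows the same blocking of source particles that the kernel stages in __shared__ memory:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

constexpr int TILE_SIZE = 256;

// CPU analogue of the tiled all-pairs force loop: source particles are
// consumed in TILE_SIZE blocks. Positions are SoA; writes the softened
// acceleration of particle i (G = 1, unit masses) to (ax, ay, az).
void tiled_accel(const std::vector<float>& x, const std::vector<float>& y,
                 const std::vector<float>& z, std::size_t i, float soft,
                 float& ax, float& ay, float& az) {
    ax = ay = az = 0.0f;
    std::size_t n = x.size();
    for (std::size_t tile = 0; tile < n; tile += TILE_SIZE) {
        std::size_t end = std::min(tile + (std::size_t)TILE_SIZE, n);
        // On the GPU, x/y/z[tile..end) are loaded cooperatively into
        // __shared__ memory here; every thread then reads them on-chip.
        for (std::size_t j = tile; j < end; ++j) {
            float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            float r2 = dx * dx + dy * dy + dz * dz + soft * soft;
            float inv_r3 = 1.0f / (r2 * std::sqrt(r2));
            ax += dx * inv_r3; ay += dy * inv_r3; az += dz * inv_r3;
        }
    }
}
```

Each global load services TILE_SIZE interactions instead of one, which is where the 256x reduction in global memory reads comes from.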
Kernel fusion: The compute_and_integrate_fused kernel computes forces and applies the velocity-Verlet half-kick in a single pass. This eliminates the intermediate force array write and read, cutting memory traffic by roughly 3x and improving performance 15-25% on Ada Lovelace.
Spatial hash grid (SPH): Particles are bucketed into a hash table by their 3D grid cell. Neighbor search then only visits 27 adjacent cells instead of all N particles, giving O(N) complexity per step. Thrust sort_by_key is used for the GPU-side radix sort.
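A minimal CPU sketch of the idea follows (the GPU path builds a sorted cell list with Thrust rather than a hash map, and the names here are hypothetical):

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy spatial hash grid: particles are bucketed by integer cell index,
// and a neighbour query visits only the 27 surrounding cells.
struct HashGrid {
    float cell;  // cell edge length, typically ~2h for SPH
    std::unordered_map<std::int64_t, std::vector<int>> cells;

    // Pack a 3D cell index into one 64-bit key (21 bits per axis).
    static std::int64_t key(int ix, int iy, int iz) {
        return ((std::int64_t)(ix & 0x1FFFFF) << 42)
             | ((std::int64_t)(iy & 0x1FFFFF) << 21)
             |  (std::int64_t)(iz & 0x1FFFFF);
    }

    void insert(int id, float x, float y, float z) {
        cells[key((int)std::floor(x / cell), (int)std::floor(y / cell),
                  (int)std::floor(z / cell))].push_back(id);
    }

    // Candidate neighbours of (x, y, z): contents of the 27 adjacent cells.
    std::vector<int> query(float x, float y, float z) const {
        int cx = (int)std::floor(x / cell), cy = (int)std::floor(y / cell),
            cz = (int)std::floor(z / cell);
        std::vector<int> out;
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    auto it = cells.find(key(cx + dx, cy + dy, cz + dz));
                    if (it != cells.end())
                        out.insert(out.end(), it->second.begin(),
                                   it->second.end());
                }
        return out;
    }
};
```

With roughly constant particle density, each query returns O(1) candidates, giving the O(N) per-step complexity mentioned above.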
SoA layout: All particle data is stored as separate arrays per channel (x[], y[], z[], ...) rather than interleaved structs. This gives fully coalesced 128-byte cache-line reads when sequential threads access the same channel.
__launch_bounds__: Force kernels are annotated with __launch_bounds__(256, 4) to give the compiler a register budget hint. This prevents the compiler from allocating more registers than the hardware can schedule at the target occupancy.
cudaErrorNoKernelImageForDevice at launch
Your GPU's compute capability is not in CMAKE_CUDA_ARCHITECTURES. Add it to CMakeLists.txt (e.g., set(CMAKE_CUDA_ARCHITECTURES 86 89)) and rebuild.
Conan: ERROR: Package ... not found
Run conan install . --build=missing -- the --build=missing flag builds any package that does not have a pre-built binary for your platform.
nvcc not found
Add CUDA to PATH:

```bash
export PATH=/usr/local/cuda/bin:$PATH
```

Extremely slow first build
Conan downloads and compiles all dependencies from source on the first run. This can take 10-20 minutes. Subsequent builds use the local cache and are fast.
CUDA error: out of memory
Reduce num_particles in the config file or free other GPU workloads before running.
| Metric | Target | How checked |
|---|---|---|
| Energy drift (N-body, 1000 steps) | less than 0.1% | test_validation |
| Plummer sphere density profile | p-value greater than 0.05 | test_validation |
| Fused vs two-pass agreement | less than 1e-3 per particle | test_nbody |
| All unit tests pass | 100% | ctest |
- License: LICENSE (MIT)
- Contribution guide: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md
- Security policy: SECURITY.md
- Citation metadata: CITATION.cff
- Release history: CHANGELOG.md
If this repository is used in academic work, include a citation generated from
CITATION.cff.
MIT License. See LICENSE for details.
