- Germany
- in/olajide-badejo
Pinned
-
GPU-Based-Matrix-Operations (Public)
CUDA/C++ matrix-vector and matrix-matrix kernels with naive, shared-memory tiled, and warp-coalesced variants, plus GFLOPS benchmarking vs CPU.
Cuda 1
-
GPU-Physics-Simulation (Public)
Real-time CUDA physics engine for N-body gravity, SPH fluids, and rigid-body collisions. Uses shared-memory tiling, kernel fusion, and spatial hashing on RTX 4080/4090.
Cuda
-
ARM-Neon-Conv3x3 (Public)
ARMv8-A NEON 3×3 convolution in C++17. Scalar, NEON naive, and cache/register-blocked variants with runtime dispatch and perf/PMU analysis.
C++
-
GPU-Training-Bench (Public)
PyTorch training throughput benchmark for NVIDIA GPUs. Sweeps batch size, precision, and DataLoader workers, profiles step phases, collects NVML telemetry, and generates JSON + HTML reports.
Python
-
CUDA-Matrix-Library (Public)
CUDA matrix library for GEMM, GEMV, and TRSM with naive, tiled, register-blocked, and tensor-core kernels. Includes FP16/BF16 mixed precision, sparse ops, cuSOLVER wrappers, and Python bindings.
C++
-
Metal-msl-microbenchmark (Public)
CUDA bandwidth tests ported to Metal MSL. Measures M1 GPU DRAM, threadgroup SRAM, and texture-cache bandwidth with Apple Instruments traces.
Objective-C++