Upgrade CUDA codegen with research findings from the CUDA wishlist:
- Two-phase warp-shuffle reductions (O(log N) barriers → 1 per block)
- __nanosleep() for idle spin-wait and barrier power efficiency
- Opt-in libcu++ cuda::atomic_ref with explicit memory ordering (sketch below)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
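As a minimal sketch of the opt-in libcu++ pattern named above, assuming a device-scope `float` accumulator; the function name and the relaxed ordering choice are illustrative, not the generated code:

```cuda
#include <cuda/atomic>

// Wrap an existing memory location with cuda::atomic_ref so the
// memory ordering is stated explicitly, rather than falling back on
// the default sequentially consistent order of plain atomics.
__device__ void add_partial(float* sum, float partial) {
    cuda::atomic_ref<float, cuda::thread_scope_device> ref(*sum);
    // A pure accumulation carries no ordering requirement, so relaxed
    // is sufficient here (an assumption for this sketch).
    ref.fetch_add(partial, cuda::memory_order_relaxed);
}
```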
- Phase 2: Cross-warp reduction via shared memory with a single `__syncthreads()` call (pattern sketched after this list)
- Applies to: `block_reduce_energy` (persistent FDTD), `generate_block_reduce_fn`, `generate_grid_reduce_fn`, `generate_reduce_and_broadcast_fn`, and all inline reduction generators
- Reduces barrier count from O(log N) to 1 per block reduction (e.g., 9 → 1 for 512-thread blocks)
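To make the two phases concrete, here is a minimal sketch of the pattern, assuming `float` values and a block size that is a multiple of the 32-thread warp size; `block_reduce_sum` and its layout are illustrative, not the code emitted by `generate_block_reduce_fn`:

```cuda
__device__ float block_reduce_sum(float val) {
    // Phase 1: intra-warp tree reduction in registers; warp shuffles
    // need no barrier.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);

    // Phase 2: lane 0 of each warp publishes its partial to shared
    // memory; one __syncthreads() makes every partial visible to warp 0.
    __shared__ float warp_sums[32];      // enough for 1024 threads / 32 lanes
    const int lane = threadIdx.x & 31;
    const int warp = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp] = val;
    __syncthreads();                     // the single barrier per reduction

    // Warp 0 reduces the per-warp partials with the same shuffle loop.
    if (warp == 0) {
        const int num_warps = (blockDim.x + 31) >> 5;
        val = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;                          // block sum is valid in thread 0
}
```

For a 512-thread block, the shuffle-only phase replaces the first eight barrier-separated steps of a shared-memory tree reduction, leaving only the one `__syncthreads()` between the phases, which is where the 9 → 1 figure above comes from.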
#### CUDA Codegen: `__nanosleep()` Power Efficiency
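A minimal sketch of the idle spin-wait this refers to, assuming a device-visible flag; the helper name and back-off constants are assumptions, and `__nanosleep()` itself requires compute capability 7.0 or newer:

```cuda
// Back off with __nanosleep() instead of busy-polling the flag at
// full rate, trading a little wake-up latency for lower power draw.
__device__ void wait_for_flag(volatile unsigned int* flag, unsigned int expected) {
    unsigned int ns = 8;                 // start with a short sleep
    while (*flag != expected) {
        __nanosleep(ns);                 // park the warp instead of hot-spinning
        if (ns < 256) ns *= 2;           // exponential back-off, capped at 256 ns
    }
}
```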