
Commit a7f42ab

mivertowski and claude committed
Update CUDA wishlist with current-state research for all 15 feature requests
Comprehensive review against CUDA 12.0–13.1 / Hopper / Blackwell before sending to NVIDIA peers. Key changes:

- Feature 8 (Memory Model): rewrite — ordered atomics already exist via libcu++/CCCL; reframe as mapped memory coherence control request
- Feature 9 (Multi-GPU): correct that cudaLaunchCooperativeKernelMultiDevice was removed in CUDA 13.0; reference NVSHMEM as current alternative
- Add "Current State" sections to Features 1–6, 11, 12, 15 acknowledging partial solutions (DSMEM/clusters, Green Contexts, Graph conditional nodes, cuCheckpointProcess, hardware preemption, warp partitions)
- Update Related Work with Thread Block Clusters, CUDA Graphs conditional nodes, Green Contexts, libcu++/CCCL, NVSHMEM

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d91794c · commit a7f42ab

2 files changed: 72 additions & 24 deletions

Binary file (10.1 KB) not shown.

docs/cuda-wishlist-persistent-actors.tex

Lines changed: 72 additions & 24 deletions
@@ -271,7 +271,7 @@ \subsection{The Problem}
 }
 \end{lstlisting}
 
-This wastes SM cycles and increases power consumption during idle periods.
+This wastes SM cycles and increases power consumption during idle periods. CUDA provides \texttt{\_\_nanosleep()} (since CUDA 10) for power-efficient polling, which reduces SM power consumption during spin-waits but does not eliminate polling entirely---there is still no interrupt-driven host$\to$kernel notification mechanism.
 
 \subsection{Proposed Solution: \texttt{cudaKernelNotify()} / \texttt{\_\_kernel\_wait()}}
 
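For concreteness, the polling pattern the revised paragraph refers to looks roughly like this. A minimal sketch, assuming sm_70+ for \texttt{\_\_nanosleep()}; the function name, flag, and backoff constants are illustrative, not CUDA API:

\begin{lstlisting}[style=cstyle,title={Sketch: backoff polling with \_\_nanosleep (sm\_70+)}]
// Power-friendlier spin-wait on a host-written mapped flag.
__device__ void wait_for_host(volatile int* flag)
{
    unsigned ns = 64;
    while (*flag == 0) {
        __nanosleep(ns);          // yield the SM briefly instead of hot-spinning
        if (ns < 4096) ns *= 2;   // exponential backoff, capped
    }
    __threadfence_system();       // order payload reads after seeing the flag
}
\end{lstlisting}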
@@ -357,6 +357,15 @@ \subsection{The Problem}
 
 This is error-prone and requires explicit synchronization.
 
+\subsection{Current State: Hopper DSMEM and Thread Block Clusters}
+
+Hopper (SM 9.0+) introduces \textbf{Distributed Shared Memory (DSMEM)} and \textbf{Thread Block Clusters}, which provide a partial solution: blocks within a cluster (up to 8--16 blocks) can directly access each other's shared memory via \texttt{cluster.map\_shared\_rank()}. This enables mailbox-like semantics within a cluster. However, clusters are confined to a single GPC (GPU Processing Cluster) and do not support grid-wide or cross-kernel messaging. The gap remains for:
+\begin{itemize}[leftmargin=*,topsep=2pt,itemsep=1pt]
+\item Grid-wide K2K messaging beyond cluster boundaries
+\item Addressable actor mailboxes with routing
+\item Variable-topology messaging (graph adjacency, not just spatial neighbors)
+\end{itemize}
+
 \subsection{Proposed Solution: Native Block Mailboxes}
 
 \begin{lstlisting}[style=cstyle,title={Kernel setup - declare mailbox topology}]
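The cluster-local mailbox pattern described in the added subsection can be sketched as follows, assuming Hopper (SM 9.0+) and the cooperative-groups cluster API; the ring-deposit logic and kernel name are illustrative:

\begin{lstlisting}[style=cstyle,title={Sketch: cluster-local mailboxes via DSMEM (Hopper)}]
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(8, 1, 1) cluster_mailbox()
{
    __shared__ int mailbox;                 // one slot per block
    cg::cluster_group cluster = cg::this_cluster();
    unsigned rank = cluster.block_rank();

    if (threadIdx.x == 0) mailbox = -1;
    cluster.sync();                         // shared memory now visible cluster-wide

    if (threadIdx.x == 0) {
        unsigned dst = (rank + 1) % cluster.num_blocks();
        int* remote = cluster.map_shared_rank(&mailbox, dst);
        *remote = (int)rank;                // deposit a message into the neighbor block
    }
    cluster.sync();                         // all deposits have landed
    // mailbox now holds the rank of this block's ring predecessor
}
\end{lstlisting}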
@@ -472,6 +481,10 @@ \subsection{The Problem}
 All blocks must wait for slowest block to finish
 \end{Verbatim}
 
+\subsection{Current State: Device Graph Launch and Green Contexts}
+
+CUDA 12.0+ provides \textbf{Device Graph Launch}, allowing kernels to launch CUDA graphs from the device side, enabling some dynamic work generation. \textbf{Green Contexts} (available in the driver API since CUDA 12.4) partition GPU resources---primarily SM sets---into isolated contexts. These provide building blocks for scheduling but do not offer work-stealing queues, dynamic load balancing within a persistent kernel, or priority-based scheduling across blocks.
+
 \subsection{Proposed Solution: Work Stealing Queues}
 
 \begin{lstlisting}[style=cstyle,title={Device-wide work queue}]
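A sketch of the device graph launch building block mentioned in the added subsection (CUDA 12.0+; requires relocatable device code). Graph construction is elided; \texttt{pending} and the kernel name are illustrative:

\begin{lstlisting}[style=cstyle,title={Sketch: device-side graph launch (CUDA 12.0+)}]
// Host: instantiate for device launch, then upload to the device
cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphUpload(exec, stream);

// Device: one thread decides whether to launch follow-up work
__global__ void scheduler(cudaGraphExec_t exec, const int* pending)
{
    if (blockIdx.x == 0 && threadIdx.x == 0 && *pending > 0) {
        // Fire-and-forget: the follow-up graph runs independently
        cudaGraphLaunch(exec, cudaStreamGraphFireAndForget);
    }
}
\end{lstlisting}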
@@ -557,6 +570,8 @@ \subsection{The Problem}
 \item Accept high latency for other GPU work
 \end{enumerate}
 
+CUDA does support hardware-level compute preemption (since Pascal/SM 6.0), which the driver uses for context switching and watchdog timers. However, this preemption is \textbf{not user-controllable}---there are no APIs to request preemption, set priorities, or define safe preemption points. Hardware preemption also cannot save and restore application-level state.
+
 \subsection{Proposed Solution: Cooperative Preemption}
 
 \begin{lstlisting}[style=cstyle,title={Mark preemption-safe points in kernel}]
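Until such APIs exist, persistent kernels approximate cooperative preemption in software: poll a host-raised flag at self-chosen safe points and drain voluntarily. A minimal sketch; \texttt{g\_preempt} and the tick body are illustrative:

\begin{lstlisting}[style=cstyle,title={Sketch: today's software stand-in for preemption}]
__device__ int g_preempt;    // illustrative flag; host raises it to request drain

__global__ void persistent_actor(int* state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (;;) {
        if (i < n) state[i] += 1;               // one tick of work
        __threadfence();                        // publish this tick's results
        if (atomicAdd(&g_preempt, 0) != 0)      // application-defined safe point
            return;                             // exit; resume later from state[]
    }
}
\end{lstlisting}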
@@ -608,6 +623,8 @@ \subsection{The Problem}
 \item Hours of simulation lost on hardware fault
 \end{itemize}
 
+CUDA provides \texttt{cuCheckpointProcessCheckpoint}/\texttt{cuCheckpointProcessRestore} (driver API, CUDA 12.8+) for process-level GPU memory checkpointing, and CUPTI provides checkpoint/replay for profiler instrumentation. However, these save \textbf{GPU memory allocations only}---they do not capture kernel execution state (registers, program counter, shared memory) and cannot checkpoint a running persistent kernel at an application-defined safe point.
+
 \subsection{Proposed Solution: Native Checkpointing}
 
 \begin{lstlisting}[style=cstyle,title={Device side - declare checkpointable state}]
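The workaround today is application-level: drain the kernel at a safe point, then serialize explicitly registered device state from the host. A minimal sketch with illustrative names:

\begin{lstlisting}[style=cstyle,title={Sketch: application-level checkpoint after a drain}]
#include <cuda_runtime.h>
#include <fstream>
#include <vector>

// Called after the persistent kernel has exited at a safe point.
void checkpoint_to_disk(const void* d_state, size_t bytes, const char* path)
{
    std::vector<char> host(bytes);
    cudaMemcpy(host.data(), d_state, bytes, cudaMemcpyDeviceToHost);
    std::ofstream out(path, std::ios::binary);
    out.write(host.data(), (std::streamsize)bytes);
}
\end{lstlisting}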
@@ -709,6 +726,10 @@ \subsection{Current Limitations}
 \item No partial sync (e.g., sync only my neighbors)
 \end{itemize}
 
+\subsection{Current State: Hopper Thread Block Clusters}
+
+Hopper (SM 9.0+) introduces \texttt{cluster\_group} with \texttt{cluster.sync()}, providing synchronization across a cluster of blocks (up to 16). This is a significant step toward hierarchical sync, but clusters are fixed at launch time, limited in size, and do not support the dynamic, topology-aware, or named barrier patterns proposed below. Note: \texttt{cudaLaunchCooperativeKernelMultiDevice} was removed in CUDA 13.0, so multi-GPU grid sync is no longer available through cooperative groups.
+
 \subsection{Proposed Extensions}
 
 \subsubsection{Hierarchical Sync Groups}
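To illustrate why cluster scope is fixed at launch time: the cluster shape is a launch attribute set before the kernel starts (CUDA 12+). A sketch; \texttt{my\_kernel} is an illustrative \texttt{\_\_global\_\_} function:

\begin{lstlisting}[style=cstyle,title={Sketch: cluster shape as a launch-time attribute (CUDA 12+)}]
cudaLaunchConfig_t cfg = {};
cfg.gridDim  = dim3(128);
cfg.blockDim = dim3(256);

cudaLaunchAttribute attr = {};
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim = {8, 1, 1};       // 8 blocks per cluster, fixed here
cfg.attrs    = &attr;
cfg.numAttrs = 1;

cudaLaunchKernelEx(&cfg, my_kernel);   // cluster.sync() scope is now immutable
\end{lstlisting}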
@@ -798,38 +819,40 @@ \subsection{Proposed Solution: Non-Intrusive Inspection}
 \newpage
 
 % Feature Request 8
-\section{Feature Request 8: Memory Model Enhancements}
+\section{Feature Request 8: Mapped Memory Coherence Control}
 
-\subsection{Sequentially Consistent Atomics Option}
+\subsection{Current State: libcu++/CCCL Already Provides Ordered Atomics}
 
-Current CUDA atomics are relaxed by default. For actor mailboxes, we need SC:
+NVIDIA's libcu++ (part of CCCL) already provides C++ standard memory ordering for atomics and fences. The original motivation for this feature---sequentially consistent atomics and ordered fences---is \textbf{already available}:
 
-\begin{lstlisting}[style=cstyle]
-// Current: Relaxed by default (can reorder)
-atomicAdd(&counter, 1);
+\begin{lstlisting}[style=cstyle,title={Existing: libcu++ ordered atomics (CCCL)}]
+#include <cuda/atomic>
 
-// Proposed: Explicit memory order
-atomicAdd_sc(&counter, 1);      // Sequentially consistent
-atomicAdd_acq_rel(&counter, 1); // Acquire-release
-atomicAdd_relaxed(&counter, 1); // Explicit relaxed
-\end{lstlisting}
+// Sequentially consistent atomic (already works)
+cuda::atomic<int, cuda::thread_scope_device> counter;
+counter.fetch_add(1, cuda::memory_order_seq_cst);
 
-\subsection{System-Scope Fences with Ordering}
+// Acquire-release semantics (already works)
+counter.fetch_add(1, cuda::memory_order_acq_rel);
 
-\begin{lstlisting}[style=cstyle]
-// Current: Single fence type
-__threadfence_system();
+// Scoped fences (already works)
+cuda::atomic_thread_fence(cuda::memory_order_acquire,
+                          cuda::thread_scope_system);
+cuda::atomic_thread_fence(cuda::memory_order_release,
+                          cuda::thread_scope_system);
 
-// Proposed: Explicit ordering
-__threadfence_system_acquire(); // Acquire semantics
-__threadfence_system_release(); // Release semantics
-__threadfence_system_seq_cst(); // Full sequential consistency
+// Memory Synchronization Domains (Hopper+, already available)
+// Fine-grained coherence control via domain-scoped fences
 \end{lstlisting}
 
-\subsection{Mapped Memory Coherence Control}
+RingKernel currently uses legacy CUDA intrinsics (\texttt{atomicAdd}, \texttt{\_\_threadfence\_system}) which lack explicit ordering. Migrating to libcu++ would address the atomics and fence requirements. \textbf{The remaining gap is mapped memory coherence control}, which libcu++ does not address.
+
+\subsection{Remaining Request: Explicit Mapped Memory Coherence Domains}
+
+Persistent actors continuously poll mapped memory for host commands. Current mapped memory provides implicit coherence that is either always-on (expensive) or relies on volatile semantics with no formal guarantees. Explicit coherence domains would allow actors to request visibility only when needed:
 
 \begin{lstlisting}[style=cstyle]
-// Current: Implicit coherence (expensive)
+// Current: Implicit coherence (expensive, always-on)
 float* mapped = cudaHostAlloc(..., cudaHostAllocMapped);
 
 // Proposed: Explicit coherence domains
@@ -846,6 +869,8 @@ \subsection{Mapped Memory Coherence Control}
 }
 \end{lstlisting}
 
+This is distinct from Memory Synchronization Domains (Hopper+), which reduce fence interference by partitioning memory traffic into domains but do not provide explicit acquire/release semantics for mapped CPU$\leftrightarrow$GPU memory regions.
+
 \newpage
 
 % Feature Request 9
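To ground the current-state claim for Feature 8: a host-command poll over mapped memory can already be written with formally specified ordering via libcu++, though the coherence traffic itself remains always-on. A minimal sketch, assuming sm_70+ and CCCL; the kernel name is illustrative:

\begin{lstlisting}[style=cstyle,title={Sketch: ordered host-command polling with libcu++}]
#include <cuda/atomic>

__global__ void poll_commands(int* mapped_cmd)    // host-visible mapped memory
{
    cuda::atomic_ref<int, cuda::thread_scope_system> cmd(*mapped_cmd);
    while (cmd.load(cuda::memory_order_acquire) == 0) {
        __nanosleep(256);    // back off between probes (sm_70+)
    }
    // Acquire load: reads after this see data the host wrote before the command.
}
\end{lstlisting}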
@@ -860,6 +885,12 @@ \subsection{The Problem}
 \item Complex synchronization
 \end{itemize}
 
+\subsection{Current State: Multi-GPU Cooperative Launch Was Removed}
+
+\texttt{cudaLaunchCooperativeKernelMultiDevice} was deprecated in CUDA 11.3 and \textbf{removed entirely in CUDA 13.0}. NVIDIA's current recommendation is \textbf{NVSHMEM} for multi-GPU communication, which provides one-sided put/get operations and collective synchronization across GPUs. However, NVSHMEM is designed for bulk-synchronous SPMD programs, not persistent actor systems with irregular messaging patterns.
+
+The gap: there is no CUDA-native way to launch a single persistent kernel that spans multiple GPUs with unified block addressing and cross-GPU mailbox routing.
+
 \subsection{Proposed Solution: Unified Multi-GPU Kernel}
 
 \begin{lstlisting}[style=cstyle,title={Launch spans multiple GPUs}]
@@ -869,7 +900,7 @@ \subsection{Proposed Solution: Unified Multi-GPU Kernel}
 config.blocksPerDevice = 512;
 config.topology = CUDA_TOPOLOGY_RING; // or MESH, TREE, CUSTOM
 
-cudaLaunchCooperativeKernelMultiDevice(&kernel, config);
+cudaLaunchMultiGpuPersistentKernel(&kernel, config);
 \end{lstlisting}
 
 \begin{lstlisting}[style=cstyle,title={In kernel - seamless multi-GPU}]
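A minimal sketch of the NVSHMEM alternative named in the added subsection: a device-initiated one-sided put into a neighbor GPU's symmetric buffer. Assumes an initialized NVSHMEM job with \texttt{buf} allocated via \texttt{nvshmem\_malloc}; the kernel name is illustrative:

\begin{lstlisting}[style=cstyle,title={Sketch: device-initiated put with NVSHMEM}]
#include <nvshmem.h>

__global__ void ring_notify(int* buf)     // symmetric allocation on every PE
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int me    = nvshmem_my_pe();
        int right = (me + 1) % nvshmem_n_pes();
        nvshmem_int_p(buf, me, right);    // deposit into the neighbor GPU
        nvshmem_quiet();                  // wait for the put to complete
    }
}
\end{lstlisting}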
@@ -976,6 +1007,10 @@ \subsection{The Problem}
 
 RustGraph uses \texttt{grid.sync()} plus a shared atomic counter to detect quiescence. For PageRank on 100K nodes, convergence typically takes 15--30 iterations, meaning 15--30 full grid synchronizations just for the convergence check.
 
+\subsection{Current State: Graph Conditional Nodes and Block-Level Reduction}
+
+CUDA 12.4+ introduces \textbf{Graph conditional WHILE nodes}, which enable GPU-resident loops that repeat a subgraph until a device-side condition is met---without host involvement. Combined with block-level \texttt{cg::reduce()} for cooperative groups, this provides a partial solution for convergence loops in graph-launched workloads. However, this requires structuring work as CUDA Graphs rather than persistent kernels, and the reduction is block-scoped, not grid-wide. For persistent actor systems that run a continuous BSP tick loop, there is no hardware-accelerated grid-wide convergence primitive.
+
 \subsection{Proposed Solution: Hardware Convergence Counter}
 
 \begin{lstlisting}[style=cstyle,title={Hardware convergence detection}]
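A sketch of the conditional WHILE mechanism the added subsection refers to (CUDA 12.4+): the host wires a WHILE node into a graph, and a kernel inside the body graph clears the condition once converged. Body-graph construction is elided; \texttt{residual}, \texttt{tol}, and the kernel name are illustrative:

\begin{lstlisting}[style=cstyle,title={Sketch: GPU-resident convergence loop (CUDA 12.4+)}]
// Host: create a condition handle and a WHILE node around the body graph
cudaGraphConditionalHandle h;
cudaGraphConditionalHandleCreate(&h, graph, 1, cudaGraphCondAssignDefault);

cudaGraphNodeParams p = {};
p.type = cudaGraphNodeTypeConditional;
p.conditional.handle = h;
p.conditional.type   = cudaGraphCondTypeWhile;
p.conditional.size   = 1;
cudaGraphNode_t node;
cudaGraphAddNode(&node, graph, nullptr, 0, &p);
// Populate p.conditional.phGraph_out[0] with the loop body.

// Device (last kernel in the body): keep looping until converged
__global__ void check(cudaGraphConditionalHandle h,
                      const float* residual, float tol)
{
    cudaGraphSetConditional(h, *residual > tol ? 1u : 0u);
}
\end{lstlisting}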
@@ -1039,6 +1074,10 @@ \subsection{The Problem}
 grid.sync() waits for ALL 100 blocks, even though only 2 are doing work.
 \end{Verbatim}
 
+\subsection{Current State: Cluster-Scoped Partial Sync on Hopper}
+
+Hopper's thread block clusters support partial synchronization within a cluster via \texttt{mbarrier} objects placed in distributed shared memory: a block can wait on arrivals from a chosen subset of cluster members (up to 16 blocks). This enables partial sync at the cluster level but does not extend to grid-wide predicate-based synchronization. For graphs with 100+ blocks where only a subset is active, cluster-level partial sync is insufficient.
+
 \subsection{Proposed Solution: Predicate-Based Partial Sync}
 
 \begin{lstlisting}[style=cstyle,title={Sync only blocks that match a predicate}]
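Until grid-wide partial sync exists, the usual workaround is a software barrier that only predicate-matching blocks join. A minimal sketch; \texttt{count} and \texttt{gen} are illustrative device globals zero-initialized before launch, and every participating block must pass the same \texttt{n\_active}:

\begin{lstlisting}[style=cstyle,title={Sketch: software barrier over a subset of blocks}]
__device__ unsigned count;    // arrivals in the current round
__device__ unsigned gen;      // barrier generation number

__device__ void partial_sync(bool participates, unsigned n_active)
{
    if (!participates) return;            // non-matching blocks skip entirely
    __syncthreads();
    if (threadIdx.x == 0) {
        unsigned my_gen = atomicAdd(&gen, 0);
        if (atomicAdd(&count, 1) == n_active - 1) {
            count = 0;                    // reset for the next round
            __threadfence();
            atomicAdd(&gen, 1);           // release the waiting blocks
        } else {
            while (atomicAdd(&gen, 0) == my_gen)
                __nanosleep(128);         // back off while waiting (sm_70+)
        }
    }
    __syncthreads();
}
\end{lstlisting}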
@@ -1226,6 +1265,10 @@ \subsection{The Problem}
 
 This is error-prone and requires manual warp-level programming for every algorithm.
 
+\subsection{Current State: Cooperative Groups Warp Partitions}
+
+CUDA cooperative groups provide \texttt{tiled\_partition<N>} for static warp subdivision and \texttt{labeled\_partition} for dynamic grouping based on a label value. \texttt{coalesced\_threads()} captures the active thread mask. These provide the building blocks for sub-warp computation but require manual orchestration---there is no concept of assigning actors to variable-sized thread groups based on workload, no \texttt{\_\_actor\_context()}, and no automatic load-balanced mapping of actors to hardware threads.
+
 \subsection{Proposed Solution: Sub-Block Actor Assignment}
 
 \begin{lstlisting}[style=cstyle,title={Hardware sub-block actor assignment}]
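A sketch of the existing partition primitives named in the added subsection: a warp is statically tiled, then dynamically regrouped by an actor label, and each group cooperates independently. \texttt{actor\_of} is an illustrative per-thread label array; requires sm_70+:

\begin{lstlisting}[style=cstyle,title={Sketch: dynamic sub-warp groups via labeled\_partition}]
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__global__ void actor_groups(const int* actor_of, int* out)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    // Threads sharing a label form one group (one actor's worker set)
    int label = actor_of[block.thread_rank()];
    cg::coalesced_group actors = cg::labeled_partition(warp, label);

    // Cooperate within the group, e.g. count the actor's threads
    int members = cg::reduce(actors, 1, cg::plus<int>());
    if (actors.thread_rank() == 0) out[label] = members;
}
\end{lstlisting}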
@@ -1297,7 +1340,7 @@ \section{Implementation Priority Matrix}
 Checkpointing + Live Snapshots & High & Very High & \textbf{P2} \\
 Extended Cooperative Groups & Medium & Medium & \textbf{P2} \\
 Debugging Tools & Medium & Low & \textbf{P2} \\
-Memory Model Enhancements & Medium & Medium & \textbf{P3} \\
+Mapped Memory Coherence Control & Medium & Medium & \textbf{P3} \\
 Multi-GPU Kernels & Very High & Very High & \textbf{P3} \\
 Dynamic Actor Registry & Medium & High & \textbf{P3} \\
 \bottomrule
@@ -1393,6 +1436,11 @@ \section{Related Work}
 
 \begin{itemize}[leftmargin=*]
 \item \textbf{NVIDIA Cooperative Groups} (CUDA 9+): Foundation for grid-wide sync
+\item \textbf{Thread Block Clusters / DSMEM} (Hopper, CUDA 12+): Cluster-scoped shared memory and partial sync (up to 16 blocks)
+\item \textbf{CUDA Graphs + Conditional Nodes} (CUDA 12.4+): Device-side control flow, GPU-resident loops
+\item \textbf{Green Contexts} (CUDA 12.4+ driver API): GPU resource partitioning across workloads
+\item \textbf{libcu++/CCCL}: C++ standard atomics and memory ordering for CUDA
+\item \textbf{NVSHMEM}: Multi-GPU communication (replacement for removed multi-device cooperative launch)
 \item \textbf{AMD HIP Persistent Kernels}: Similar exploration in ROCm ecosystem
 \item \textbf{Vulkan/SPIR-V}: Different approach via command buffers
 \item \textbf{SYCL}: Exploring persistent execution model
