Update CUDA wishlist with current-state research for all 15 feature requests
Comprehensive review against CUDA 12.0–13.1 / Hopper / Blackwell before
sending to NVIDIA peers. Key changes:
- Feature 8 (Memory Model): rewrite — ordered atomics already exist via
libcu++/CCCL; reframe as mapped memory coherence control request
- Feature 9 (Multi-GPU): correct that cudaLaunchCooperativeKernelMultiDevice
was removed in CUDA 13.0; reference NVSHMEM as current alternative
- Add "Current State" sections to Features 1–6, 11, 12, 15 acknowledging
partial solutions (DSMEM/clusters, Green Contexts, Graph conditional nodes,
cuCheckpointProcess, hardware preemption, warp partitions)
- Update Related Work with Thread Block Clusters, CUDA Graphs conditional
nodes, Green Contexts, libcu++/CCCL, NVSHMEM
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs/cuda-wishlist-persistent-actors.tex (72 additions, 24 deletions)
@@ -271,7 +271,7 @@ \subsection{The Problem}
}
\end{lstlisting}
This wastes SM cycles and increases power consumption during idle periods. CUDA provides \texttt{\_\_nanosleep()} (since CUDA 10) for power-efficient polling, which reduces SM power consumption during spin-waits but does not eliminate polling entirely---there is still no interrupt-driven host$\to$kernel notification mechanism.
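For reference, a minimal sketch (ours, not from the original document) of the backoff pattern \texttt{\_\_nanosleep()} enables; the flag name and constants are illustrative:

\begin{lstlisting}[style=cstyle,title={Power-aware polling with \_\_nanosleep() (illustrative sketch)}]
// Assumes sm_70+ for __nanosleep(); `flag` is a host-visible word
// (mapped or managed memory) that the host sets to deliver a command.
__device__ void wait_for_command(volatile unsigned int *flag)
{
    unsigned int ns = 64;                  // initial backoff in ns
    while (*flag == 0) {                   // still polling: no interrupt path
        __nanosleep(ns);                   // sleep hint, saves SM power
        if (ns < 8192) ns *= 2;            // exponential backoff, capped
    }
    __threadfence_system();                // order later reads after the flag
}
\end{lstlisting}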
This is error-prone and requires explicit synchronization.
\subsection{Current State: Hopper DSMEM and Thread Block Clusters}
Hopper (SM 9.0+) introduces \textbf{Distributed Shared Memory (DSMEM)} and \textbf{Thread Block Clusters}, which provide a partial solution: blocks within a cluster can directly access each other's shared memory via \texttt{cluster.map\_shared\_rank()}, enabling mailbox-like semantics within a cluster (see the sketch below). However, a cluster is limited to blocks co-resident on a single GPC (8 blocks portably, up to 16 with a non-portable opt-in) and does not support grid-wide or cross-kernel messaging. The gap remains for:
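A minimal sketch of the intra-cluster mailbox pattern (assumes sm\_90; the kernel body and cluster size are illustrative, not RingKernel code):

\begin{lstlisting}[style=cstyle,title={DSMEM mailbox sketch (sm\_90)}]
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block exposes an inbox in shared memory; a peer block in the
// same cluster deposits into it through DSMEM.
__global__ void __cluster_dims__(4, 1, 1) cluster_mailbox_demo()
{
    __shared__ unsigned int inbox;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) inbox = 0;
    cluster.sync();                              // all inboxes initialized

    if (threadIdx.x == 0) {
        unsigned int peer =
            (cluster.block_rank() + 1) % cluster.num_blocks();
        // Map our inbox address into the peer block's shared memory.
        unsigned int *remote = cluster.map_shared_rank(&inbox, peer);
        atomicAdd(remote, 1);                    // deposit a message token
    }
    cluster.sync();                              // all deposits visible
}
\end{lstlisting}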
\subsection{Current State: Device Graph Launch and Green Contexts}
CUDA 12.0+ provides \textbf{Device Graph Launch}, allowing kernels to launch CUDA graphs from the device side, enabling some dynamic work generation. CUDA 13.1 introduces \textbf{Green Contexts}, which partition GPU resources (SMs, memory bandwidth) into isolated contexts. These provide building blocks for scheduling but do not offer work-stealing queues, dynamic load balancing within a persistent kernel, or priority-based scheduling across blocks.
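As a point of reference, a hedged sketch of device-side graph launch (assumes a graph \texttt{work} instantiated with \texttt{cudaGraphInstantiateFlagDeviceLaunch} and uploaded via \texttt{cudaGraphUpload}; per the device launch rules, the launching kernel itself runs inside a device-launchable graph):

\begin{lstlisting}[style=cstyle,title={Device graph launch (CUDA 12.0+, sketch)}]
// Relaunch a pre-instantiated graph from the device when work is
// pending; `pending` is an illustrative device-side counter.
__global__ void on_tick(cudaGraphExec_t work, const int *pending)
{
    if (threadIdx.x == 0 && blockIdx.x == 0 && *pending > 0) {
        // Fire-and-forget: `work` runs independently of this graph.
        cudaGraphLaunch(work, cudaStreamGraphFireAndForget);
    }
}
\end{lstlisting}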
\subsection{Proposed Solution: Work Stealing Queues}
\begin{lstlisting}[style=cstyle,title={Device-wide work queue}]
@@ -557,6 +570,8 @@ \subsection{The Problem}
\item Accept high latency for other GPU work
\end{enumerate}
CUDA does support hardware-level compute preemption (since Pascal/SM 6.0), which the driver uses for context switching and watchdog timers. However, this preemption is \textbf{not user-controllable}: beyond coarse stream priorities (\texttt{cudaStreamCreateWithPriority}), there are no APIs to request preemption or to define safe preemption points, and the hardware preemption does not save or restore application-level state.
\begin{lstlisting}[style=cstyle,title={Mark preemption-safe points in kernel}]
@@ -608,6 +623,8 @@ \subsection{The Problem}
\item Hours of simulation lost on hardware fault
\end{itemize}
CUDA provides the \texttt{cuCheckpointProcess*} checkpoint/restore driver APIs for process-level GPU state capture, and CUPTI provides checkpoint/replay for profiler instrumentation. However, these operate on \textbf{GPU memory allocations only}---they do not capture kernel execution state (registers, program counter, shared memory) and cannot checkpoint a running persistent kernel at an application-defined safe point.
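Because of that gap, applications checkpoint at self-defined safe points today; a hedged sketch of this workaround (all names ours):

\begin{lstlisting}[style=cstyle,title={Manual safe-point checkpoint (workaround sketch)}]
// Host sets g_pause_requested (cudaMemcpyToSymbol); each block spills
// its application-level state at a tick boundary, then the host reads
// g_paused_blocks, snapshots the state buffer, and clears the flag.
__device__ volatile int g_pause_requested;
__device__ int g_paused_blocks;

__global__ void persistent_actor(ActorState *state)    // hypothetical type
{
    for (;;) {
        do_one_tick(state);                     // hypothetical app work
        if (g_pause_requested) {                // safe point: tick boundary
            serialize_to(state);                // hypothetical spill of
                                                // registers/shared memory
            __threadfence();                    // publish the spill
            if (threadIdx.x == 0)
                atomicAdd(&g_paused_blocks, 1); // host polls this counter
            while (g_pause_requested)
                __nanosleep(1000);              // wait for host to resume
        }
    }
}
\end{lstlisting}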
Hopper (SM 9.0+) introduces \texttt{cluster\_group} with \texttt{cluster.sync()}, providing synchronization across a cluster of blocks (up to 16). This is a significant step toward hierarchical sync, but clusters are fixed at launch time, limited in size, and do not support the dynamic, topology-aware, or named barrier patterns proposed below. Note: \texttt{cudaLaunchCooperativeKernelMultiDevice} was removed in CUDA 13.0, so multi-GPU grid sync is no longer available through cooperative groups.
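For reference, a minimal sketch showing that the cluster shape is fixed at launch time (CUDA 12+ runtime APIs; grid, block, and cluster sizes illustrative):

\begin{lstlisting}[style=cstyle,title={Cluster launch and cluster.sync() (sm\_90, sketch)}]
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tick()
{
    cg::cluster_group cluster = cg::this_cluster();
    // ... phase 1: produce into this block's shared memory ...
    cluster.sync();          // barrier across all blocks in the cluster
    // ... phase 2: consume peers' results via DSMEM ...
}

void launch()
{
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(64);
    cfg.blockDim = dim3(128);
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 8;   // cluster shape fixed here, at launch
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;
    cudaLaunchKernelEx(&cfg, tick);
}
\end{lstlisting}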
Current CUDA atomics are relaxed by default. For actor mailboxes, we need SC:
NVIDIA's libcu++ (part of CCCL) already provides C++ standard memory ordering for atomics and fences. The original motivation for this feature---sequentially consistent atomics and ordered fences---is \textbf{already available}:
// Fine-grained coherence control via domain-scoped fences
\end{lstlisting}
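For concreteness, a short libcu++ sketch of the ordering that is already available (mailbox names ours):

\begin{lstlisting}[style=cstyle,title={Ordered atomics via libcu++ (sketch)}]
#include <cuda/atomic>

__device__ cuda::atomic<int, cuda::thread_scope_device> flag{0};
__device__ int payload;

__device__ void send(int msg)
{
    payload = msg;
    // Release store: payload is visible device-wide before the flag.
    flag.store(1, cuda::std::memory_order_release);
}

__device__ int receive()
{
    // Acquire load pairs with the release store in send().
    while (flag.load(cuda::std::memory_order_acquire) == 0) { }
    return payload;
}
\end{lstlisting}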
\subsection{Mapped Memory Coherence Control}
RingKernel currently uses legacy CUDA intrinsics (\texttt{atomicAdd}, \texttt{\_\_threadfence\_system}) which lack explicit ordering. Migrating to libcu++ would address the atomics and fence requirements. \textbf{The remaining gap is mapped memory coherence control}, which libcu++ does not address.
Persistent actors continuously poll mapped memory for host commands. Current mapped memory provides implicit coherence that is either always-on (expensive) or relies on volatile semantics with no formal guarantees. Explicit coherence domains would allow actors to request visibility only when needed:
This is distinct from Memory Synchronization Domains (Hopper+), which control L2 cache partitioning between domains but do not provide explicit acquire/release semantics for mapped CPU$\leftrightarrow$GPU memory regions.
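For contrast, the status-quo polling pattern the proposal would replace (a hedged sketch; the command layout and dispatch are illustrative):

\begin{lstlisting}[style=cstyle,title={Mapped-memory polling today (status quo sketch)}]
// Host writes commands into pinned, mapped memory; the device polls a
// volatile pointer and issues a system-scope fence. Coherence is
// implicit and always-on; it cannot be scoped or batched.
struct Command { unsigned int seq; int opcode; };

__global__ void actor_loop(volatile Command *cmd)   // host-mapped
{
    unsigned int last_seq = 0;
    for (;;) {
        if (cmd->seq != last_seq) {
            __threadfence_system();     // order payload reads after seq
            last_seq = cmd->seq;
            dispatch(cmd->opcode);      // hypothetical handler
        }
        __nanosleep(2000);              // throttle the poll
    }
}
\end{lstlisting}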
\newpage
% Feature Request 9
@@ -860,6 +885,12 @@ \subsection{The Problem}
\item Complex synchronization
\end{itemize}
\subsection{Current State: Multi-GPU Cooperative Launch Was Removed}
\texttt{cudaLaunchCooperativeKernelMultiDevice} was deprecated in CUDA 11.3 and \textbf{removed entirely in CUDA 13.0}. NVIDIA's current recommendation is \textbf{NVSHMEM} for multi-GPU communication, which provides one-sided put/get operations and collective synchronization across GPUs. However, NVSHMEM is designed for bulk-synchronous SPMD programs, not persistent actor systems with irregular messaging patterns.
The gap: there is no CUDA-native way to launch a single persistent kernel that spans multiple GPUs with unified block addressing and cross-GPU mailbox routing.
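For reference, a minimal NVSHMEM sketch in the bulk-synchronous style the text contrasts with actor messaging (one PE per GPU; buffer name ours):

\begin{lstlisting}[style=cstyle,title={NVSHMEM one-sided put (sketch)}]
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE deposits its rank into its right neighbor's symmetric buffer.
__global__ void ring_put(int *sym_buf, int my_pe, int n_pes)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int peer = (my_pe + 1) % n_pes;
        nvshmem_int_p(sym_buf, my_pe, peer);   // one-sided remote write
    }
}

int main()
{
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int *sym_buf = (int *)nvshmem_malloc(sizeof(int));  // symmetric heap
    ring_put<<<1, 32>>>(sym_buf, mype, npes);
    cudaDeviceSynchronize();   // kernel done before the host barrier
    nvshmem_barrier_all();     // complete puts and sync all PEs
    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}
\end{lstlisting}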
RustGraph uses \texttt{grid.sync()} plus a shared atomic counter to detect quiescence. For PageRank on 100K nodes, convergence typically takes 15--30 iterations, meaning 15--30 full grid synchronizations just for the convergence check.
\subsection{Current State: Graph Conditional Nodes and Block-Level Reduction}
CUDA 12.4+ introduces \textbf{Graph conditional WHILE nodes}, which enable GPU-resident loops that repeat a subgraph until a device-side condition is met---without host involvement. Combined with block-level \texttt{cg::reduce()} from cooperative groups, this provides a partial solution for convergence loops in graph-launched workloads (see the sketch below). However, it requires structuring work as CUDA Graphs rather than persistent kernels, and the reduction is block-scoped, not grid-wide. For persistent actor systems that run a continuous BSP tick loop, there is no hardware-accelerated grid-wide convergence primitive.
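A hedged sketch of the WHILE-node pattern (the sweep body and threshold are illustrative; host-side setup summarized in comments):

\begin{lstlisting}[style=cstyle,title={Graph conditional WHILE node (CUDA 12.4+, sketch)}]
// Body kernel of a WHILE-node subgraph: one relaxation sweep, then the
// condition is updated on-device so the loop exits without the host.
__global__ void sweep(cudaGraphConditionalHandle handle, float *residual)
{
    do_sweep(residual);                       // hypothetical iteration
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Nonzero keeps the WHILE body re-executing.
        cudaGraphSetConditional(handle, *residual > 1e-6f ? 1u : 0u);
    }
}
// Host outline: cudaGraphConditionalHandleCreate(&handle, graph, 1,
//   cudaGraphCondAssignDefault); add a node with type
//   cudaGraphNodeTypeConditional and condType cudaGraphCondTypeWhile;
//   build `sweep` into the node's body graph; instantiate and launch.
\end{lstlisting}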
grid.sync() waits for ALL 100 blocks, even though only 2 are doing work.
\end{Verbatim}
\subsection{Current State: Cluster-Scoped Partial Sync on Hopper}
Hopper's thread block clusters support partial synchronization within a cluster using bitmasked \texttt{mbarrier} operations---a block can wait on a subset of cluster members (up to 16 blocks). This enables partial sync at the cluster level but does not extend to grid-wide predicate-based synchronization. For graphs with 100+ blocks where only a subset is active, cluster-level partial sync is insufficient.
\begin{lstlisting}[style=cstyle,title={Sync only blocks that match a predicate}]
@@ -1226,6 +1265,10 @@ \subsection{The Problem}
This is error-prone and requires manual warp-level programming for every algorithm.
\subsection{Current State: Cooperative Groups Warp Partitions}
CUDA cooperative groups provide \texttt{tiled\_partition<N>} for static warp subdivision and \texttt{labeled\_partition} for dynamic grouping based on a label value. \texttt{coalesced\_threads()} captures the active thread mask. These provide the building blocks for sub-warp computation but require manual orchestration---there is no concept of assigning actors to variable-sized thread groups based on workload, no \texttt{\_\_actor\_context()}, and no automatic load-balanced mapping of actors to hardware threads.
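For reference, a minimal sketch of these building blocks (the actor labels and output layout are illustrative):

\begin{lstlisting}[style=cstyle,title={Warp partitions via cooperative groups (sketch)}]
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__global__ void subwarp_groups(const int *actor_of_thread, int *out)
{
    cg::thread_block block = cg::this_thread_block();

    // Static: carve each warp into fixed 8-thread tiles.
    cg::thread_block_tile<8> tile = cg::tiled_partition<8>(block);
    int tile_sum = cg::reduce(tile, (int)threadIdx.x, cg::plus<int>());

    // Dynamic: regroup the active threads by the actor they serve.
    // This mapping is manual: exactly the orchestration burden the
    // text describes; nothing assigns actors to groups automatically.
    cg::coalesced_group active = cg::coalesced_threads();
    auto actor_grp = cg::labeled_partition(
        active, actor_of_thread[block.thread_rank()]);
    if (actor_grp.thread_rank() == 0)
        out[actor_of_thread[block.thread_rank()]] = tile_sum;
}
\end{lstlisting}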
\subsection{Proposed Solution: Sub-Block Actor Assignment}
\begin{lstlisting}[style=cstyle,title={Hardware sub-block actor assignment}]