feat(cuda): add NVLink topology detection (v1.1)

mivertowski · claude · mivertowski · commit 38e5cbd4ab39 · 2026-04-17T22:48:01.000+02:00
Per spec section 4.3, foundation for multi-GPU runtime.

NvlinkTopology:
- adjacency matrix with per-link bandwidth (GB/s)
- shortest-hop distance matrix (Floyd-Warshall + BFS)
- probe() via cuDeviceCanAccessPeer for P2P detection
- Optional nvml-wrapper for NVLink bandwidth enrichment
  (NvLinkState + NvLinkVersion per link)
- Graceful fallback to single_gpu() when CUDA/NVML unavailable

API:
- probe() -&gt; Result&lt;Self&gt;
- bandwidth(from, to) -&gt; u32
- direct_link_exists(from, to) -&gt; bool
- co_located_candidates(gpu, min_bw) -&gt; Vec&lt;u32&gt;
- shortest_path(from, to) -&gt; Option&lt;Vec&lt;u32&gt;&gt;
- hops(from, to) -&gt; Option&lt;u32&gt;
- Test constructors: single_gpu(), disconnected(n), from_adjacency()

Bandwidth table: NVLink 1.0=20, 2.0/3.0/4.0=25, 5.0=50 GB/s.
P2P fallback without NVML info: 25 GB/s (NVLink 2.0 baseline).

Feature gate: new optional `multi-gpu` feature (adds nvml-wrapper).
Basic P2P detection works with just `cuda`; bandwidth info needs `multi-gpu`.

14 unit tests: single-GPU, 2-GPU linear, 4-GPU ring, 8-GPU hypercube,
disconnected pairs, bandwidth queries, matrix symmetry, diagonals,
co-location ordering, multi-hop paths.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/crates/ringkernel-cuda/Cargo.toml b/crates/ringkernel-cuda/Cargo.toml
@@ -44,6 +44,10 @@ sha2 = { version = "0.10", optional = true }
 # Temp directory (for cache tests)
 tempfile = { workspace = true, optional = true }
 
+# NVIDIA Management Library bindings (for NVLink topology detection).
+# Optional — when disabled, topology probe falls back to cudarc's P2P detection.
+nvml-wrapper = { version = "0.12", optional = true }
+
 [dev-dependencies]
 tokio = { workspace = true, features = ["test-util", "macros", "rt-multi-thread"] }
 
@@ -60,3 +64,6 @@ cooperative = ["cuda"]
 profiling = ["cuda", "serde", "serde_json"]
 # PTX compilation cache - eliminates first-tick kernel compilation overhead
 ptx-cache = ["cuda", "sha2", "tempfile"]
+# Multi-GPU topology detection via NVML (NvlinkTopology::probe uses nvml-wrapper
+# when enabled, otherwise falls back to cudarc P2P-only detection).
+multi-gpu = ["cuda", "nvml-wrapper"]
diff --git a/crates/ringkernel-cuda/src/lib.rs b/crates/ringkernel-cuda/src/lib.rs
@@ -54,6 +54,8 @@ mod memory;
 #[cfg(feature = "cuda")]
 pub mod memory_pool;
 #[cfg(feature = "cuda")]
+pub mod multi_gpu;
+#[cfg(feature = "cuda")]
 pub mod persistent;
 #[cfg(feature = "cuda")]
 pub mod phases;
diff --git a/crates/ringkernel-cuda/src/multi_gpu/mod.rs b/crates/ringkernel-cuda/src/multi_gpu/mod.rs
@@ -0,0 +1,20 @@
+//! Multi-GPU runtime infrastructure.
+//!
+//! This module provides the primitives needed to coordinate persistent
+//! actors across multiple CUDA devices:
+//!
+//! - [`topology`]: NVLink / P2P topology detection (adjacency matrix,
+//!   bandwidth, shortest-hop path) used for placement decisions.
+//! - `runtime`: multi-device `CudaRuntime` facade that dispatches launches
+//!   across devices (coming in v1.1 — see spec section 4.1).
+//! - `migration`: 3-phase actor migration protocol (quiesce, transfer, swap)
+//!   for rebalancing actors across GPUs (coming in v1.1 — see spec section 3.1).
+//!
+//! Today only `topology` is wired up; the other submodules will be added as
+//! the v1.1 multi-GPU work lands.
+
+pub mod topology;
+
+// Coming in v1.1 — see docs/superpowers/specs/2026-04-17-v1.1-vyngraph-gaps.md
+// pub mod runtime;
+// pub mod migration;
diff --git a/crates/ringkernel-cuda/src/multi_gpu/topology.rs b/crates/ringkernel-cuda/src/multi_gpu/topology.rs