Skip to content

Commit 38e5cbd

Browse files
mivertowskiclaude
andcommitted
feat(cuda): add NVLink topology detection (v1.1)
Per spec section 4.3, foundation for multi-GPU runtime. NvlinkTopology: - adjacency matrix with per-link bandwidth (GB/s) - shortest-hop distance matrix (Floyd-Warshall + BFS) - probe() via cuDeviceCanAccessPeer for P2P detection - Optional nvml-wrapper for NVLink bandwidth enrichment (NvLinkState + NvLinkVersion per link) - Graceful fallback to single_gpu() when CUDA/NVML unavailable API: - probe() -> Result<Self> - bandwidth(from, to) -> u32 - direct_link_exists(from, to) -> bool - co_located_candidates(gpu, min_bw) -> Vec<u32> - shortest_path(from, to) -> Option<Vec<u32>> - hops(from, to) -> Option<u32> - Test constructors: single_gpu(), disconnected(n), from_adjacency() Bandwidth table: NVLink 1.0=20, 2.0/3.0/4.0=25, 5.0=50 GB/s. P2P fallback without NVML info: 25 GB/s (NVLink 2.0 baseline). Feature gate: new optional `multi-gpu` feature (adds nvml-wrapper). Basic P2P detection works with just `cuda`; bandwidth info needs `multi-gpu`. 14 unit tests: single-GPU, 2-GPU linear, 4-GPU ring, 8-GPU hypercube, disconnected pairs, bandwidth queries, matrix symmetry, diagonals, co-location ordering, multi-hop paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f6f1d6b commit 38e5cbd

4 files changed

Lines changed: 821 additions & 0 deletions

File tree

crates/ringkernel-cuda/Cargo.toml

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,10 @@ sha2 = { version = "0.10", optional = true }
4444
# Temp directory (for cache tests)
4545
tempfile = { workspace = true, optional = true }
4646

47+
# NVIDIA Management Library bindings (for NVLink topology detection).
48+
# Optional — when disabled, topology probe falls back to cudarc's P2P detection.
49+
nvml-wrapper = { version = "0.12", optional = true }
50+
4751
[dev-dependencies]
4852
tokio = { workspace = true, features = ["test-util", "macros", "rt-multi-thread"] }
4953

@@ -60,3 +64,6 @@ cooperative = ["cuda"]
6064
profiling = ["cuda", "serde", "serde_json"]
6165
# PTX compilation cache - eliminates first-tick kernel compilation overhead
6266
ptx-cache = ["cuda", "sha2", "tempfile"]
67+
# Multi-GPU topology detection via NVML (NvlinkTopology::probe uses nvml-wrapper
68+
# when enabled, otherwise falls back to cudarc P2P-only detection).
69+
multi-gpu = ["cuda", "nvml-wrapper"]

crates/ringkernel-cuda/src/lib.rs

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -54,6 +54,8 @@ mod memory;
5454
#[cfg(feature = "cuda")]
5555
pub mod memory_pool;
5656
#[cfg(feature = "cuda")]
57+
pub mod multi_gpu;
58+
#[cfg(feature = "cuda")]
5759
pub mod persistent;
5860
#[cfg(feature = "cuda")]
5961
pub mod phases;
Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
//! Multi-GPU runtime infrastructure.
2+
//!
3+
//! This module provides the primitives needed to coordinate persistent
4+
//! actors across multiple CUDA devices:
5+
//!
6+
//! - [`topology`]: NVLink / P2P topology detection (adjacency matrix,
7+
//! bandwidth, shortest-hop path) used for placement decisions.
8+
//! - `runtime`: multi-device `CudaRuntime` facade that dispatches launches
9+
//! across devices (coming in v1.1 — see spec section 4.1).
10+
//! - `migration`: 3-phase actor migration protocol (quiesce, transfer, swap)
11+
//! for rebalancing actors across GPUs (coming in v1.1 — see spec section 3.1).
12+
//!
13+
//! Today only `topology` is wired up; the other submodules will be added as
14+
//! the v1.1 multi-GPU work lands.
15+
16+
pub mod topology;
17+
18+
// Coming in v1.1 — see docs/superpowers/specs/2026-04-17-v1.1-vyngraph-gaps.md
19+
// pub mod runtime;
20+
// pub mod migration;

0 commit comments

Comments
 (0)