| 1 | +# RingKernel v1.1 — NC80adis_H100_v5 Session Guide |
| 2 | + |
| 3 | +> **For the Claude Code instance running on the Azure NC80adis_H100_v5 VM (2× H100 NVL).** |
| 4 | +> Read this before starting any work. |
| 5 | + |
| 6 | +## Context |
| 7 | + |
| 8 | +RingKernel v1.0.0 shipped (CUDA-focused, H100-verified, 8,698x vs traditional launch, 11 CI jobs green). Tag: `v1.0.0`, commit history on `main`. |
| 9 | + |
| 10 | +**v1.1 adds multi-GPU + VynGraph NSAI integration.** All pre-hardware code is landed on `main`. Your job: validate it on 2× H100 NVL hardware and finalize features that need NVLink. |
| 11 | + |
| 12 | +**Budget-conscious mandate:** The user approved NC80adis_H100_v5 (2× H100, ~$8/hr) over ND96 (at $80k/month, out of budget). Work efficiently: keep sessions short to minimize GPU-hours, and deallocate the VM when idle. |
| 13 | + |
| 14 | +## Primary Goal |
| 15 | + |
| 16 | +**Formally prove the GPU-native persistent actor paradigm works across 2× H100 with NVLink**, producing: |
| 17 | +1. Passing test matrix (8 items, spec §5.3) |
| 18 | +2. TLC model checking reports for all 6 TLA+ specs |
| 19 | +3. Benchmark numbers with statistical rigor (see `docs/benchmarks/METHODOLOGY.md`) |
| 20 | +4. Honest gap documentation if anything doesn't hold |
| 21 | + |
| 22 | +## Key Documents — Read First |
| 23 | + |
| 24 | +1. **`CLAUDE.md`** — project architecture, build commands, gotchas |
| 25 | +2. **`docs/superpowers/specs/2026-04-17-v1.1-vyngraph-gaps.md`** — v1.1 master spec (5 gaps, 3-phase migration, formal verification plan, 8-test hardware matrix in §5.3) |
| 26 | +3. **`docs/benchmarks/METHODOLOGY.md`** — statistical protocol (CI, Cohen's d, MAD outlier detection) |
| 27 | +4. **`docs/benchmarks/h100-b200-baseline.md`** — v1.0 single-GPU baseline (reference only) |
| 28 | +5. **`docs/verification/README.md`** — TLA+ model checking instructions |
| 29 | +6. **`docs/superpowers/GPU_VM_SESSION_GUIDE.md`** — prior H100 session guide (for context on what worked last time) |
| 30 | + |
| 31 | +## What's Already Done on `main` |
| 32 | + |
| 33 | +| Component | Status | Key commit | |
| 34 | +|-----------|--------|------------| |
| 35 | +| Core runtime (H2K/K2H mapped memory, persistent kernels) | v1.0 proven | earlier | |
| 36 | +| PROV-O provenance (8 relations, envelope opt-in) | Done | `f6f1d6b` | |
| 37 | +| NVLink topology detection (probe, bandwidth, paths) | Done | `38e5cbd` | |
| 38 | +| Multi-tenant K2K isolation (per-tenant sub-brokers) | Done | `e33ffd3` | |
| 39 | +| Live introspection streaming (IntrospectionStream, EWMA) | Done | `fe1535e` | |
| 40 | +| Hot rule reload (CompiledRule artifact API) | Done | `a1226a5` | |
| 41 | +| Multi-GPU runtime facade + migration orchestrator | Done (simulated) | `84f0b61` | |
| 42 | +| GPU-side tenant enforcement + migration kernels | Done | `febb724` | |
| 43 | +| TLA+ specs (6 models: hlc, k2k, migration, multi_gpu, tenants, lifecycle) | Done | `fd95dd8` | |
| 44 | +| Tests | **1,612 passing, 0 failures, 97 hardware-gated** | | |
| 45 | + |
| 46 | +## Hardware-Phase Work (Your Job) |
| 47 | + |
| 48 | +The pre-hardware code uses **bookkeeping and simulation** where it needs GPU operations. Your task: replace simulated paths with real CUDA calls, then validate. |
| 49 | + |
| 50 | +### Task 1 — VM Setup & Sanity Check |
| 51 | + |
| 52 | +```bash |
| 53 | +# After SSH login: |
| 54 | +cd ~ |
| 55 | +git clone https://github.com/mivertowski/RustCompute.git RingKernel |
| 56 | +cd RingKernel |
| 57 | + |
| 58 | +# Setup (Rust, CUDA, etc.) |
| 59 | +./scripts/setup-gpu-vm.sh |
| 60 | + |
| 61 | +# Verify NVLink is exposed |
| 62 | +nvidia-smi topo -m # Expect "NV18" (or similar NVX) between GPU 0 and GPU 1 |
| 63 | +nvidia-smi nvlink --status # Expect every listed link to report a bandwidth (none inactive) |
| 64 | + |
| 65 | +# Lock clocks for consistent benchmarks |
| 66 | +sudo nvidia-smi -pm 1 |
| 67 | +sudo nvidia-smi -lgc $(nvidia-smi --query-gpu=clocks.max.graphics --format=csv,noheader,nounits | head -1) |
| 68 | +sudo nvidia-smi -c EXCLUSIVE_PROCESS |
| 69 | + |
| 70 | +# Build with multi-GPU feature |
| 71 | +export RINGKERNEL_CUDA_ARCH=sm_90 |
| 72 | +cargo build --workspace --features "cuda,cooperative,multi-gpu" --release |
| 73 | + |
| 74 | +# Run all non-hardware tests (should match baseline: 1,612 pass) |
| 75 | +cargo test --workspace --release |
| 76 | +``` |
| 77 | + |
| 78 | +**If NVLink is NOT exposed on NC80adis:** That's fine for correctness testing — fallback is PCIe Gen5. Note it in the test report; migration speed will be lower but all correctness properties still hold. |
| 79 | + |
| 80 | +### Task 2 — Enable Peer Access (Real CUDA P2P) |
| 81 | + |
| 82 | +File: `crates/ringkernel-cuda/src/multi_gpu/runtime.rs` |
| 83 | + |
| 84 | +Currently `enable_peer_access` is bookkeeping-only: |
| 85 | +```rust |
| 86 | +pub fn enable_peer_access(&self, from: u32, to: u32) -> Result<()> { |
| 87 | + // Currently: validate topology + insert into HashSet |
| 88 | + // TODO (hardware phase): call cuCtxEnablePeerAccess |
| 89 | +} |
| 90 | +``` |
| 91 | + |
| 92 | +**Wire the real call:** |
| 93 | +```rust |
| 94 | +use cudarc::driver::sys; |
| 95 | + |
| 96 | +pub fn enable_peer_access(&self, from: u32, to: u32) -> Result<()> { |
| 97 | + self.validate_pair(from, to)?; |
| 98 | + |
| 99 | + // SAFETY: both contexts must be live; cuCtxEnablePeerAccess acts on the current context, so `from`'s context must be current on this thread |
| 100 | + unsafe { |
| 101 | + let target_ctx = self.devices[to as usize].context(); |
| 102 | + let result = sys::cuCtxEnablePeerAccess(target_ctx, 0); |
| 103 | + // TODO: return an error unless `result` is CUDA_SUCCESS or CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED |
| 104 | + } |
| 105 | + |
| 106 | + self.peer_access.write().insert((from, to)); |
| 107 | + Ok(()) |
| 108 | +} |
| 109 | +``` |
| 110 | + |
| 111 | +Add `disable_peer_access` mirror using `cuCtxDisablePeerAccess`. Run the 3 `#[ignore]` multi_gpu runtime tests. |
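The bookkeeping half of the enable/disable pair can be sketched without any CUDA calls. The following is a minimal illustration, assuming a directional `(from, to)` set like the one `enable_peer_access` inserts into; `PeerAccessTable` and its methods are hypothetical names, not the crate's actual API.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the runtime's peer-access bookkeeping.
/// Each entry is directional: (from, to) means `from` may access `to`'s memory.
#[derive(Default)]
pub struct PeerAccessTable {
    enabled: HashSet<(u32, u32)>,
}

impl PeerAccessTable {
    /// Records `from -> to`. Returns false if it was already recorded, which a
    /// caller can treat as a no-op, like CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED.
    pub fn enable(&mut self, from: u32, to: u32) -> bool {
        self.enabled.insert((from, to))
    }

    /// Removes `from -> to`. Returns false if it was not recorded.
    pub fn disable(&mut self, from: u32, to: u32) -> bool {
        self.enabled.remove(&(from, to))
    }

    /// True only when both directions have been recorded.
    pub fn is_bidirectional(&self, a: u32, b: u32) -> bool {
        self.enabled.contains(&(a, b)) && self.enabled.contains(&(b, a))
    }
}
```

Keeping the set directional makes it easy for a test to confirm that a migration path needing access in both directions has had both `(0, 1)` and `(1, 0)` enabled.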
| 112 | + |
| 113 | +### Task 3 — Real NVLink P2P Transfers for Migration |
| 114 | + |
| 115 | +File: `crates/ringkernel-cuda/src/multi_gpu/migration.rs` (Phase 2 "Transfer") |
| 116 | + |
| 117 | +Currently simulated on host. Replace with `cuMemcpyPeer` (or `cuMemcpyPeerAsync` on a stream) once peer access is enabled. |
| 118 | + |
| 119 | +The staging buffer's CRC32 should be identical before and after the transfer. The existing test asserts this on the simulated path; verify that it also holds with real P2P. |
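As a reference for what the checksum comparison amounts to, here is a small host-side sketch. It assumes the standard CRC-32 (IEEE 802.3) variant; the guide does not say which variant the staging buffer uses, and `transfer_intact` is a hypothetical helper, not an existing function in the repo.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
/// Assumption: this is the variant the staging buffer uses.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical check: true when the destination has the same length and
/// checksum as the source buffer.
pub fn transfer_intact(src: &[u8], dst: &[u8]) -> bool {
    src.len() == dst.len() && crc32(src) == crc32(dst)
}
```

On real hardware the destination bytes would be read back from GPU 1 after the copy has been synchronized; the comparison itself stays the same.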
| 120 | + |
| 121 | +### Task 4 — Run the 8-Test Hardware Matrix |
| 122 | + |
| 123 | +From spec §5.3: |
| 124 | + |
| 125 | +```bash |
| 126 | +# Run each test binary 3 times (3 trials) for statistical rigor |
| 127 | +cargo test -p ringkernel-cuda --features "cuda,cooperative,multi-gpu" --release \ |
| 128 | + --test multi_gpu_migration_proof -- --ignored --test-threads=1 |
| 129 | + |
| 130 | +cargo test -p ringkernel-cuda --features "cuda,cooperative,multi-gpu" --release \ |
| 131 | + --test multi_gpu_nvlink_k2k_proof -- --ignored --test-threads=1 |
| 132 | +# ... etc |
| 133 | +``` |
| 134 | + |
| 135 | +**You may need to write some of these integration tests** — check `crates/ringkernel-cuda/tests/` for existing proof tests (actor_lifecycle_proof, streaming_pipeline_proof) as templates. Required tests per spec §5.3: |
| 136 | + |
| 137 | +1. Migration 1M msgs — move actor with 1M in-flight — <100ms, zero loss |
| 138 | +2. Migration loop — 100 back-and-forth migrations — no leak, checksum stable |
| 139 | +3. NVLink K2K latency — cross-GPU latency — p99 < 5us |
| 140 | +4. NVLink K2K throughput — >10M/s sustained 60s |
| 141 | +5. Multi-tenant isolation — 4 tenants, 1000 cross-tenant attempts — 0 leaks, all audited |
| 142 | +6. Provenance chain — 10-step NSAI chain — PROV-O attribution verified |
| 143 | +7. Rule reload under load — swap at 100K msg/s — <1s quiescence, no loss |
| 144 | +8. Full stress — all features + 60s sustained — all invariants hold |
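Several of the thresholds above are percentile-based (for example, p99 < 5us for K2K latency). A minimal sketch of a nearest-rank percentile over recorded latencies follows; the function name and the per-mille parameter are assumptions for illustration, and `docs/benchmarks/METHODOLOGY.md` remains the authority on how percentiles should actually be computed.

```rust
/// Nearest-rank percentile over latency samples (e.g. nanoseconds).
/// `per_mille` is the percentile in tenths of a percent: 990 = p99, 999 = p99.9.
/// Returns None for an empty slice or a per_mille outside 1..=1000.
pub fn percentile_nearest_rank(samples: &[u64], per_mille: usize) -> Option<u64> {
    if samples.is_empty() || per_mille == 0 || per_mille > 1000 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // 1-based rank = ceil(per_mille * n / 1000), computed in integers.
    let rank = (per_mille * sorted.len() + 999) / 1000;
    Some(sorted[rank - 1])
}
```

A test for item 3 could then assert something like `percentile_nearest_rank(&latencies_ns, 990) < Some(5_000)` once `latencies_ns` has been filled from the cross-GPU measurement loop.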
| 145 | + |
| 146 | +Document results in `docs/benchmarks/v1.1-2x-h100-results.md` with 95% CI. |
| 147 | + |
| 148 | +### Task 5 — TLC Model Checking |
| 149 | + |
| 150 | +```bash |
| 151 | +cd docs/verification/ |
| 152 | + |
| 153 | +# Option A: Native TLC (install Java 17 + tla2tools.jar) |
| 154 | +./tlc.sh |
| 155 | + |
| 156 | +# Option B: Docker |
| 157 | +docker run --rm -v $(pwd):/workspace pmer/tla \ |
| 158 | + tlc /workspace/migration.tla -config /workspace/migration.cfg |
| 159 | +``` |
| 160 | + |
| 161 | +Run each spec. Bounded state spaces are sized to complete in seconds/minutes. Document any invariant violations — those are real bugs to file. |
| 162 | + |
| 163 | +Write a report: `docs/verification/v1.1-tlc-report.md` with: |
| 164 | +- Spec name |
| 165 | +- State space explored (distinct states, queue size) |
| 166 | +- Invariants checked (all should be OK) |
| 167 | +- Runtime |
| 168 | +- Any counterexamples found |
| 169 | + |
| 170 | +### Task 6 — Benchmarks (Paper-Quality) |
| 171 | + |
| 172 | +Use the academic harness from v1.0: |
| 173 | +```bash |
| 174 | +./scripts/run-academic-benchmarks.sh |
| 175 | +``` |
| 176 | + |
| 177 | +Then run multi-GPU-specific benchmarks (you may need to add them — model after existing criterion files in `crates/ringkernel/benches/`): |
| 178 | +- Cross-GPU K2K latency vs single-GPU K2K |
| 179 | +- Migration latency vs actor size |
| 180 | +- Tenant isolation overhead (single vs 4 tenants) |
| 181 | +- Provenance overhead (with/without) |
| 182 | + |
| 183 | +Each: 100 samples × 10 trials, compute 95% CI, Cohen's d where comparing. Fill in `docs/benchmarks/v1.1-2x-h100-results.md`. |
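For orientation, the two statistics named above can be sketched as follows. This is an illustration only: it assumes a normal-approximation interval (z = 1.96) and the pooled-standard-deviation form of Cohen's d, and the function names are hypothetical. The protocol in `docs/benchmarks/METHODOLOGY.md` takes precedence wherever it differs.

```rust
/// Sample mean; None for an empty slice.
pub fn mean(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    Some(xs.iter().sum::<f64>() / xs.len() as f64)
}

/// Unbiased sample variance (n - 1 denominator); needs at least two samples.
pub fn sample_variance(xs: &[f64]) -> Option<f64> {
    if xs.len() < 2 {
        return None;
    }
    let m = mean(xs)?;
    let sum_sq: f64 = xs.iter().map(|x| (x - m).powi(2)).sum();
    Some(sum_sq / (xs.len() - 1) as f64)
}

/// 95% confidence interval for the mean, normal approximation (z = 1.96).
/// Assumption: n is large enough; small n would call for a t critical value.
pub fn ci95(xs: &[f64]) -> Option<(f64, f64)> {
    let m = mean(xs)?;
    let half_width = 1.96 * (sample_variance(xs)? / xs.len() as f64).sqrt();
    Some((m - half_width, m + half_width))
}

/// Cohen's d using the pooled standard deviation of the two groups.
/// None when either group has fewer than two samples or the pooled spread is zero.
pub fn cohens_d(a: &[f64], b: &[f64]) -> Option<f64> {
    let (va, vb) = (sample_variance(a)?, sample_variance(b)?);
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let pooled = (((na - 1.0) * va + (nb - 1.0) * vb) / (na + nb - 2.0)).sqrt();
    if pooled == 0.0 {
        return None;
    }
    Some((mean(a)? - mean(b)?) / pooled)
}
```

With this sign convention, a negative d means the first group has the lower mean, which for latency comparisons means it is faster.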
| 184 | + |
| 185 | +### Task 7 — Update CHANGELOG and ROADMAP |
| 186 | + |
| 187 | +```markdown |
| 188 | +## [1.1.0] - 2026-04-?? |
| 189 | + |
| 190 | +### Headline Results (2× H100 NVL via NVLink) |
| 191 | + |
| 192 | +- Multi-GPU actor migration: <X>ms for <Y>M messages (zero loss) |
| 193 | +- NVLink K2K: <X>us p99 latency (<Y>x vs host-mediated) |
| 194 | +- Tenant isolation: 0 cross-tenant leaks in <N> attempts |
| 195 | +- Formal properties proven: <list> |
| 196 | + |
| 197 | +### Added / Changed / Removed |
| 198 | +... |
| 199 | +``` |
| 200 | + |
| 201 | +## Decision Points (Escalate to User If Unclear) |
| 202 | + |
| 203 | +1. **NVLink absent on NC80adis** — proceed with PCIe fallback, note degradation in report |
| 204 | +2. **TLC finds a counterexample** — STOP; file as bug, attach trace |
| 205 | +3. **Migration kernels fail on sm_90 but work on sm_75** — investigate, likely cooperative groups edge case |
| 206 | +4. **Benchmark shows regression vs v1.0** — don't ship; bisect the commit |
| 207 | +5. **Cross-tenant leak detected** — CRITICAL — STOP the release |
| 208 | + |
| 209 | +## Memory & Safety Notes |
| 210 | + |
| 211 | +- `CudaContext` is not `Send` by default; multi-GPU runtime uses `Arc<CudaRuntime>` per device |
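| 212 | +- Peer access is unidirectional: `cuCtxEnablePeerAccess` lets the current context access the peer context's memory. For access in both directions, enable it from each context. |
| 213 | +- `cuMemcpyPeer` can return before the copy has finished (device-to-device copies perform no host-side synchronization). Synchronize the context, or use `cuMemcpyPeerAsync` and synchronize its stream, before marking the transfer complete. |
| 214 | +- `K2KRouteEntry` is now 72 bytes (was 64); any code that asserts or compares against `size_of::<K2KRouteEntry>()` needs updating. |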
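One way to make the `K2KRouteEntry` size change fail loudly is a compile-time assertion. The sketch below uses a hypothetical stand-in struct with the same 72-byte footprint, since the real type's fields are not shown in this guide; `route_table_bytes` is likewise an illustrative helper.

```rust
use std::mem::size_of;

/// Hypothetical stand-in with a 72-byte footprint. The real `K2KRouteEntry`
/// lives in the RingKernel crates and its fields are not reproduced here.
#[repr(C)]
pub struct RouteEntryStandIn {
    words: [u64; 9], // 9 * 8 = 72 bytes
}

/// Fails the build, rather than a test run, if the layout drifts from 72 bytes.
const _: () = assert!(size_of::<RouteEntryStandIn>() == 72);

/// Byte length of a buffer holding `n` entries, derived from the type
/// instead of a hard-coded 64 or 72.
pub fn route_table_bytes(n: usize) -> usize {
    n * size_of::<RouteEntryStandIn>()
}
```

Deriving buffer sizes from `size_of` rather than a literal means the host side picks up a layout change automatically. Any size hard-coded in the CUDA kernels still has to be updated by hand.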
| 215 | + |
| 216 | +## Gotchas |
| 217 | + |
| 218 | +- `rust-toolchain.toml` is set to `stable` (was `nightly` pre-v1.0) — if a crate needs nightly, add opt-in feature |
| 219 | +- `clippy::unwrap_used` is `-D warnings` on lib/bins; tests allow it |
| 220 | +- The repo has `.cargo/audit.toml` with documented ignores; keep them, and don't silently widen the list |
| 221 | +- `cargo-audit` CI runs; security regressions will block merge |
| 222 | +- `rustsec/audit-check@v2` was replaced with direct `cargo audit` due to permissions issue |
| 223 | +- `SIMD` feature is opt-in now (needs nightly) — don't re-enable by default |
| 224 | + |
| 225 | +## Commit Convention |
| 226 | + |
| 227 | +Same as v1.0: |
| 228 | +``` |
| 229 | +feat(area): description |
| 230 | +fix(area): description |
| 231 | +docs(area): description |
| 232 | +test(area): description |
| 233 | + |
| 234 | +Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 235 | +``` |
| 236 | + |
| 237 | +Breaking changes: add `!` before the colon, e.g. `feat!:` or `feat(area)!:`. |
| 238 | + |
| 239 | +## Cost-Efficient Workflow |
| 240 | + |
| 241 | +```bash |
| 242 | +# Start session |
| 243 | +az vm start --resource-group ringkernel-gpu --name ringkernel-h100v2 |
| 244 | + |
| 245 | +# Work, commit, push frequently (you can iterate offline between sessions) |
| 246 | +# ... |
| 247 | + |
| 248 | +# End session (stops compute billing, keeps disk for next time) |
| 249 | +az vm deallocate --resource-group ringkernel-gpu --name ringkernel-h100v2 |
| 250 | + |
| 251 | +# Final cleanup (delete VM, disk, everything) |
| 252 | +az group delete --name ringkernel-gpu --yes |
| 253 | +``` |
| 254 | + |
| 255 | +**Per-hour cost discipline:** Build and unit tests are <2 min. Reserve GPU time for actual hardware validation runs. Between runs, commit + deallocate. |
| 256 | + |
| 257 | +## Success Criteria |
| 258 | + |
| 259 | +The user will consider v1.1 shippable when: |
| 260 | + |
| 261 | +1. ✅ 8/8 hardware matrix tests pass with 95% CI documented |
| 262 | +2. ✅ 6/6 TLC models pass with no counterexamples |
| 263 | +3. ✅ No regression vs v1.0 single-GPU benchmarks |
| 264 | +4. ✅ Cross-tenant leak count = 0 across all tests |
| 265 | +5. ✅ CHANGELOG.md has v1.1.0 entry with concrete numbers |
| 266 | +6. ✅ `cargo test --workspace` green on stable Rust 1.95+ |
| 267 | +7. ✅ `cargo clippy --workspace --lib --bins -- -D warnings` green |
| 268 | +8. ✅ `cargo audit` green (or any new advisories justified in `.cargo/audit.toml`) |
| 269 | +9. ✅ Tag `v1.1.0`, push, publish via `./scripts/publish.sh` |
| 270 | + |
| 271 | +## Starting Prompt for the Other Claude |
| 272 | + |
| 273 | +Paste this as the first message on the VM: |
| 274 | + |
| 275 | +> Read `docs/superpowers/GPU_VM_SESSION_GUIDE_V1_1.md` and follow the task priority order. Start with Task 1 (VM setup), then Task 2 (enable real peer access), then Task 3 (wire real NVLink P2P transfers for migration). After that, run Task 4 (8-test matrix) and Task 5 (TLC models) in parallel if possible. All results must follow the statistical methodology in `docs/benchmarks/METHODOLOGY.md`. Be cost-conscious — this is NC80adis_H100_v5 at ~$8/hr. Deallocate the VM when idle. |
| 276 | + |
| 277 | +Good luck. The v1.0 handover worked — aim for the same outcome. |