Commit 8ebf2a4

mivertowski and claude committed
docs: handover guide for v1.1 NC80adis_H100_v5 session
Onboarding document for the Claude Code instance running on 2× H100 NVL. Covers:

- Context: what's on main, test count (1,612 passing)
- 7 tasks in priority order (VM setup → peer access → P2P transfers → 8-test matrix → TLC model checking → benchmarks → CHANGELOG/tag)
- Decision points (NVLink absent, TLC counterexample, tenant leak → STOP)
- Memory/safety notes (CudaContext Send, cuMemcpyPeer semantics)
- Gotchas inherited from v1.0 (rust-toolchain stable, clippy config, audit.toml, cargo-audit replacement)
- Cost-efficient workflow (deallocate when idle, az commands)
- Success criteria (8/8 matrix, 6/6 TLC, no regression, 0 leaks)
- Starting prompt for the other Claude instance

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent febb724 commit 8ebf2a4

1 file changed

Lines changed: 277 additions & 0 deletions

# RingKernel v1.1 — NC80adis_H100_v5 Session Guide

> **For the Claude Code instance running on the Azure NC80adis_H100_v5 VM (2× H100 NVL).**
> Read this before starting any work.

## Context

RingKernel v1.0.0 shipped (CUDA-focused, H100-verified, 8,698x vs traditional launch, 11 CI jobs green). Tag: `v1.0.0`, commit history on `main`.

**v1.1 adds multi-GPU + VynGraph NSAI integration.** All pre-hardware code has landed on `main`. Your job: validate it on 2× H100 NVL hardware and finalize the features that need NVLink.

**Budget-conscious mandate:** The user approved NC80adis_H100_v5 (2× H100, ~$8/hr) over ND96 (~$80k/month, out of budget). Work efficiently — deallocate the VM when idle. Target session length: minimize GPU-hours.

## Primary Goal

**Formally prove the GPU-native persistent actor paradigm works across 2× H100 with NVLink**, producing:

1. Passing test matrix (8 items, spec §5.3)
2. TLC model checking reports for all 6 TLA+ specs
3. Benchmark numbers with statistical rigor (see `docs/benchmarks/METHODOLOGY.md`)
4. Honest gap documentation if anything doesn't hold

## Key Documents — Read First

1. **`CLAUDE.md`** — project architecture, build commands, gotchas
2. **`docs/superpowers/specs/2026-04-17-v1.1-vyngraph-gaps.md`** — v1.1 master spec (5 gaps, 3-phase migration, formal verification plan, 8-test hardware matrix in §5.3)
3. **`docs/benchmarks/METHODOLOGY.md`** — statistical protocol (CI, Cohen's d, MAD outlier detection)
4. **`docs/benchmarks/h100-b200-baseline.md`** — v1.0 single-GPU baseline (reference only)
5. **`docs/verification/README.md`** — TLA+ model checking instructions
6. **`docs/superpowers/GPU_VM_SESSION_GUIDE.md`** — prior H100 session guide (for context on what worked last time)

## What's Already Done on `main`

| Component | Status | Key commit |
|-----------|--------|------------|
| Core runtime (H2K/K2H mapped memory, persistent kernels) | v1.0 proven | earlier |
| PROV-O provenance (8 relations, envelope opt-in) | Done | `f6f1d6b` |
| NVLink topology detection (probe, bandwidth, paths) | Done | `38e5cbd` |
| Multi-tenant K2K isolation (per-tenant sub-brokers) | Done | `e33ffd3` |
| Live introspection streaming (IntrospectionStream, EWMA) | Done | `fe1535e` |
| Hot rule reload (CompiledRule artifact API) | Done | `a1226a5` |
| Multi-GPU runtime facade + migration orchestrator | Done (simulated) | `84f0b61` |
| GPU-side tenant enforcement + migration kernels | Done | `febb724` |
| TLA+ specs (6 models: hlc, k2k, migration, multi_gpu, tenants, lifecycle) | Done | `fd95dd8` |
| Tests | **1,612 passing, 0 failures, 97 hardware-gated** | |

## Hardware-Phase Work (Your Job)

The pre-hardware code uses **bookkeeping and simulation** where it needs GPU operations. Your task: replace simulated paths with real CUDA calls, then validate.

### Task 1 — VM Setup & Sanity Check

```bash
# After SSH login:
cd ~
git clone https://github.com/mivertowski/RustCompute.git RingKernel
cd RingKernel

# Setup (Rust, CUDA, etc.)
./scripts/setup-gpu-vm.sh

# Verify NVLink is exposed
nvidia-smi topo -m          # Expect "NV18" (or similar NVx) between GPU 0 and GPU 1
nvidia-smi nvlink --status  # Expect both links "Active"

# Lock clocks for consistent benchmarks
sudo nvidia-smi -pm 1
sudo nvidia-smi -lgc $(nvidia-smi --query-gpu=clocks.max.graphics --format=csv,noheader,nounits | head -1)
sudo nvidia-smi -c EXCLUSIVE_PROCESS

# Build with multi-GPU feature
export RINGKERNEL_CUDA_ARCH=sm_90
cargo build --workspace --features "cuda,cooperative,multi-gpu" --release

# Run all non-hardware tests (should match baseline: 1,612 pass)
cargo test --workspace --release
```

**If NVLink is NOT exposed on NC80adis:** That's fine for correctness testing — the fallback is PCIe Gen5. Note it in the test report; migration speed will be lower but all correctness properties still hold.

### Task 2 — Enable Peer Access (Real CUDA P2P)

File: `crates/ringkernel-cuda/src/multi_gpu/runtime.rs`

Currently `enable_peer_access` is bookkeeping-only:
```rust
pub fn enable_peer_access(&self, from: u32, to: u32) -> Result<()> {
    // Currently: validate topology + insert into HashSet
    // TODO (hardware phase): call cuCtxEnablePeerAccess
}
```

**Wire the real call:**
```rust
use cudarc::driver::sys;

pub fn enable_peer_access(&self, from: u32, to: u32) -> Result<()> {
    self.validate_pair(from, to)?;

    // SAFETY: cuCtxEnablePeerAccess grants the *current* context access to
    // the peer context, so `from`'s context must be current for this call.
    unsafe {
        self.devices[from as usize].bind_to_thread()?; // make `from` current (method name illustrative)
        let target_ctx = self.devices[to as usize].context();
        match sys::cuCtxEnablePeerAccess(target_ctx, 0) {
            // Already-enabled is benign: treat enable as idempotent.
            sys::CUresult::CUDA_SUCCESS
            | sys::CUresult::CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED => {}
            // Assumes a From<CUresult> impl on the crate error (illustrative).
            err => return Err(err.into()),
        }
    }

    self.peer_access.write().insert((from, to));
    Ok(())
}
```

Add a `disable_peer_access` mirror using `cuCtxDisablePeerAccess` (see the sketch below). Run the 3 `#[ignore]` multi_gpu runtime tests.
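
A minimal sketch of the mirror, under the same assumptions as the snippet above (cudarc-style raw bindings; `bind_to_thread`, `context`, and the `From<CUresult>` error mapping are illustrative, not confirmed APIs of this crate):

```rust
pub fn disable_peer_access(&self, from: u32, to: u32) -> Result<()> {
    self.validate_pair(from, to)?;

    // SAFETY: tear down access from `from`'s (current) context to `to`'s.
    unsafe {
        self.devices[from as usize].bind_to_thread()?;
        let target_ctx = self.devices[to as usize].context();
        match sys::cuCtxDisablePeerAccess(target_ctx) {
            // Tolerate double-disable, matching enable's idempotence.
            sys::CUresult::CUDA_SUCCESS
            | sys::CUresult::CUDA_ERROR_PEER_ACCESS_NOT_ENABLED => {}
            err => return Err(err.into()),
        }
    }

    self.peer_access.write().remove(&(from, to));
    Ok(())
}
```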
### Task 3 — Real NVLink P2P Transfers for Migration

File: `crates/ringkernel-cuda/src/multi_gpu/migration.rs` (Phase 2 "Transfer")

Currently simulated on host. Replace with `cuMemcpyPeer` (or `cudaMemcpyPeerAsync` via stream) once peer access is enabled.

The staging buffer CRC32 should match pre- and post-transfer — the existing test asserts this on the simulated path; verify it holds with real P2P. A transfer sketch follows.
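
As a starting point, a hedged sketch of the transfer step. The `transfer_actor_state` helper, its `src`/`dst` device pointers, and the accessors are hypothetical; only the `cuMemcpyPeer(dst, dstCtx, src, srcCtx, bytes)` argument order is fixed by the driver API:

```rust
use cudarc::driver::sys;

/// Copy `len` bytes of staged actor state from GPU `from` to GPU `to`.
/// Assumes `enable_peer_access(from, to)` already succeeded.
unsafe fn transfer_actor_state(
    &self,
    dst: sys::CUdeviceptr,
    src: sys::CUdeviceptr,
    len: usize,
    from: u32,
    to: u32,
) -> Result<()> {
    let src_ctx = self.devices[from as usize].context();
    let dst_ctx = self.devices[to as usize].context();

    match sys::cuMemcpyPeer(dst, dst_ctx, src, src_ctx, len) {
        sys::CUresult::CUDA_SUCCESS => {}
        err => return Err(err.into()),
    }

    // Sync before recomputing the destination-side CRC32; per the safety
    // notes below, the copy may complete asynchronously w.r.t. the host.
    self.devices[to as usize].synchronize()?;
    Ok(())
}
```

If the migration pipeline overlaps transfers, `cuMemcpyPeerAsync` on a stream plus an event wait is the stream-ordered variant.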
### Task 4 — Run the 8-Test Hardware Matrix

From spec §5.3:

```bash
# Each test: 3 trials for statistical rigor
cargo test -p ringkernel-cuda --features "cuda,cooperative,multi-gpu" --release \
  --test multi_gpu_migration_proof -- --ignored --test-threads=1

cargo test -p ringkernel-cuda --features "cuda,cooperative,multi-gpu" --release \
  --test multi_gpu_nvlink_k2k_proof -- --ignored --test-threads=1
# ... etc
```

**You may need to write some of these integration tests** — check `crates/ringkernel-cuda/tests/` for existing proof tests (actor_lifecycle_proof, streaming_pipeline_proof) as templates. Required tests per spec §5.3:

1. Migration 1M msgs — move actor with 1M in-flight — <100ms, zero loss
2. Migration loop — 100 back-and-forth migrations — no leak, checksum stable
3. NVLink K2K latency — cross-GPU latency — p99 < 5us
4. NVLink K2K throughput — >10M/s sustained 60s
5. Multi-tenant isolation — 4 tenants, 1000 cross-tenant attempts — 0 leaks, all audited
6. Provenance chain — 10-step NSAI chain — PROV-O attribution verified
7. Rule reload under load — swap at 100K msg/s — <1s quiescence, no loss
8. Full stress — all features + 60s sustained — all invariants hold

Document results in `docs/benchmarks/v1.1-2x-h100-results.md` with 95% CI (a quick sketch of the interval math follows).
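
If a quick interval computation is needed outside the harness, a minimal normal-approximation sketch (1.96 is the two-sided 95% z-value; defer to whatever `METHODOLOGY.md` actually prescribes, e.g. a t-interval for small trial counts):

```rust
/// Mean and 95% confidence-interval half-width (normal approximation).
fn mean_ci95(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Sample variance with Bessel's correction.
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let sem = (var / n).sqrt(); // standard error of the mean
    (mean, 1.96 * sem)
}
```

Report each matrix entry as mean ± half-width over its trials.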
### Task 5 — TLC Model Checking

```bash
cd docs/verification/

# Option A: Native TLC (install Java 17 + tla2tools.jar)
./tlc.sh

# Option B: Docker
docker run --rm -v $(pwd):/workspace pmer/tla \
  tlc /workspace/migration.tla -config /workspace/migration.cfg
```

Run each spec. Bounded state spaces are sized to complete in seconds/minutes. Document any invariant violations — those are real bugs to file.

Write a report, `docs/verification/v1.1-tlc-report.md`, with:
- Spec name
- State space explored (distinct states, queue size)
- Invariants checked (all should be OK)
- Runtime
- Any counterexamples found

A report skeleton is sketched below.
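
One possible layout (illustrative only; placeholders follow this doc's `<N>` convention, spec names from the table above):

```markdown
# v1.1 TLC Report — 2× H100 session

| Spec | Distinct states | Invariants | Runtime | Counterexamples |
|-----------|-----------------|------------|---------|-----------------|
| hlc | <N> | OK | <X>s | none |
| k2k | <N> | OK | <X>s | none |
| migration | <N> | OK | <X>s | none |
| multi_gpu | <N> | OK | <X>s | none |
| tenants | <N> | OK | <X>s | none |
| lifecycle | <N> | OK | <X>s | none |
```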
### Task 6 — Benchmarks (Paper-Quality)

Use the academic harness from v1.0:
```bash
./scripts/run-academic-benchmarks.sh
```

Then run multi-GPU-specific benchmarks (you may need to add them — model after existing criterion files in `crates/ringkernel/benches/`):
- Cross-GPU K2K latency vs single-GPU K2K
- Migration latency vs actor size
- Tenant isolation overhead (single vs 4 tenants)
- Provenance overhead (with/without)

Each: 100 samples × 10 trials; compute 95% CIs, and Cohen's d where comparing variants. Fill in `docs/benchmarks/v1.1-2x-h100-results.md`. A bench skeleton follows.
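
A minimal criterion skeleton for the first bullet; `setup_two_gpu_runtime` and `k2k_roundtrip` are hypothetical stand-ins, so copy the structure, not the names, from the existing bench files:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn cross_gpu_k2k(c: &mut Criterion) {
    // Hypothetical setup: a runtime spanning GPU 0 and GPU 1.
    let rt = setup_two_gpu_runtime().expect("needs 2 GPUs with peer access");
    c.bench_function("k2k_cross_gpu_roundtrip", |b| {
        // One iteration = one GPU0 -> GPU1 -> GPU0 message round trip.
        b.iter(|| rt.k2k_roundtrip());
    });
}

criterion_group! {
    name = benches;
    // Matches the protocol above: 100 samples per trial.
    config = Criterion::default().sample_size(100);
    targets = cross_gpu_k2k
}
criterion_main!(benches);
```

Run the whole bench 10 times for the trial dimension, then aggregate per the methodology doc.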
### Task 7 — Update CHANGELOG and ROADMAP

```markdown
## [1.1.0] - 2026-04-??

### Headline Results (2× H100 NVL via NVLink)

- Multi-GPU actor migration: <X>ms for <Y>M messages (zero loss)
- NVLink K2K: <X>us p99 latency (<Y>x vs host-mediated)
- Tenant isolation: 0 cross-tenant leaks in <N> attempts
- Formal properties proven: <list>

### Added / Changed / Removed
...
```

## Decision Points (Escalate to User If Unclear)

1. **NVLink absent on NC80adis** — proceed with PCIe fallback, note degradation in report
2. **TLC finds a counterexample** — STOP; file as bug, attach trace
3. **Migration kernels fail on sm_90 but work on sm_75** — investigate, likely cooperative groups edge case
4. **Benchmark shows regression vs v1.0** — don't ship; bisect the commit
5. **Cross-tenant leak detected** — CRITICAL — STOP the release

## Memory & Safety Notes

- `CudaContext` is not `Send` by default; the multi-GPU runtime uses `Arc<CudaRuntime>` per device
- P2P direct memory access requires both contexts to have enabled peer access (symmetric)
- `cuMemcpyPeer` may complete asynchronously with respect to the host; await stream sync before marking a transfer complete
- `K2KRouteEntry` is now 72 bytes (was 64) — any code that compares `size_of::<K2KRouteEntry>()` against a constant needs updating (see the sketch below)
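
A cheap guard for the last bullet, sketched as a compile-time size check (placement next to the struct definition assumed):

```rust
// Fails the build if K2KRouteEntry ever drifts from the documented 72 bytes.
const _: () = assert!(core::mem::size_of::<K2KRouteEntry>() == 72);
```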
## Gotchas

- `rust-toolchain.toml` is set to `stable` (was `nightly` pre-v1.0) — if a crate needs nightly, add an opt-in feature
- `clippy::unwrap_used` is `-D warnings` on lib/bins; tests allow it
- `.cargo/audit.toml` carries documented advisory ignores — keep them, don't silently widen
- `cargo-audit` runs in CI; security regressions will block merge
- `rustsec/audit-check@v2` was replaced with a direct `cargo audit` run due to a permissions issue
- The `SIMD` feature is opt-in now (needs nightly) — don't re-enable it by default

## Commit Convention

Same as v1.0:
```
feat(area): description
fix(area): description
docs(area): description
test(area): description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```

Breaking changes: `feat!:` prefix.

## Cost-Efficient Workflow

```bash
# Start session
az vm start --resource-group ringkernel-gpu --name ringkernel-h100v2

# Work, commit, push frequently (you can iterate offline between sessions)
# ...

# End session (stops compute billing, keeps disk for next time)
az vm deallocate --resource-group ringkernel-gpu --name ringkernel-h100v2

# Final cleanup (delete VM, disk, everything)
az group delete --name ringkernel-gpu --yes
```

**Per-hour cost discipline:** Build and unit tests take <2 min. Reserve GPU time for actual hardware validation runs. Between runs, commit + deallocate.

## Success Criteria

The user will consider v1.1 shippable when:

1. ✅ 8/8 hardware matrix tests pass with 95% CI documented
2. ✅ 6/6 TLC models pass with no counterexamples
3. ✅ No regression vs v1.0 single-GPU benchmarks
4. ✅ Cross-tenant leak count = 0 across all tests
5. ✅ CHANGELOG.md has v1.1.0 entry with concrete numbers
6. ✅ `cargo test --workspace` green on stable Rust 1.95+
7. ✅ `cargo clippy --workspace --lib --bins -- -D warnings` green
8. ✅ `cargo audit` green (or any new advisories justified in `.cargo/audit.toml`)
9. ✅ Tag `v1.1.0`, push, publish via `./scripts/publish.sh`

## Starting Prompt for the Other Claude

Paste this as the first message on the VM:

> Read `docs/superpowers/GPU_VM_SESSION_GUIDE_V1_1.md` and follow the task priority order. Start with Task 1 (VM setup), then Task 2 (enable real peer access), then Task 3 (wire real NVLink P2P transfers for migration). After that, run Task 4 (8-test matrix) and Task 5 (TLC models) in parallel if possible. All results must follow the statistical methodology in `docs/benchmarks/METHODOLOGY.md`. Be cost-conscious — this is NC80adis_H100_v5 at ~$8/hr. Deallocate the VM when idle.

Good luck. The v1.0 handover worked — aim for the same outcome.
