Commit eebb4f1

mivertowski and claude committed
docs: add v1.0.0 entry to CHANGELOG
Comprehensive release notes covering:
- Headline H100 results (8,698x vs traditional)
- Added: Hopper features, actor framework, error handling
- Changed: cudarc upgrade, TLS fix, tracing migration
- Removed (BREAKING): wgpu/metal/wavesim3d crates, 4,739 lines
- Fixed: CudaRuntime bridge, CLI validation
- Migration guide from 0.4.x

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d8d5b70 commit eebb4f1

1 file changed

Lines changed: 95 additions & 0 deletions

CHANGELOG.md
@@ -7,6 +7,101 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [1.0.0] - 2026-04-16

First production-grade release. Focuses exclusively on NVIDIA CUDA. H100-verified with paper-quality benchmarks.

### Headline Results (NVIDIA H100 NVL)

- **8,698x faster** than traditional `cuLaunchKernel`
- **3,005x faster** than CUDA Graph replay
- **5.54M ops/s** sustained throughput (CV 0.05%, 60 seconds)
- **0.628 us** `cluster.sync()` (2.98x vs `grid.sync()`)
- **116.9x** faster async memory alloc vs `cuMemAlloc`
- All benchmarks with 95% CI, Cohen's d, Welch's t-test

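The statistics named in the headline list (CV, 95% CI, Cohen's d, Welch's t-test) can be illustrated with a minimal std-only sketch. This is not the code in `benches/academic_harness.rs`, which is not shown in this commit. All function names are hypothetical, and the confidence interval uses a normal approximation (z = 1.96), an assumption that only holds for reasonably large samples.

```rust
/// Sample mean.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Unbiased sample variance (n - 1 denominator).
fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

/// Coefficient of variation as a percentage: 100 * sd / mean.
fn cv_percent(xs: &[f64]) -> f64 {
    100.0 * variance(xs).sqrt() / mean(xs)
}

/// Welch's t statistic for two samples with possibly unequal variances.
fn welch_t(a: &[f64], b: &[f64]) -> f64 {
    let se2 = variance(a) / a.len() as f64 + variance(b) / b.len() as f64;
    (mean(a) - mean(b)) / se2.sqrt()
}

/// Cohen's d effect size using the pooled standard deviation.
fn cohens_d(a: &[f64], b: &[f64]) -> f64 {
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let pooled = ((na - 1.0) * variance(a) + (nb - 1.0) * variance(b)) / (na + nb - 2.0);
    (mean(a) - mean(b)) / pooled.sqrt()
}

/// Approximate 95% confidence interval for the mean (normal approximation).
fn ci95(xs: &[f64]) -> (f64, f64) {
    let half = 1.96 * (variance(xs) / xs.len() as f64).sqrt();
    let m = mean(xs);
    (m - half, m + half)
}
```

Each function expects at least two samples. Feeding per-run latencies for two launch strategies into `welch_t` and `cohens_d` gives a test statistic and an effect size, the kind of figures a speedup claim like the ones above would be reported with.
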
### Added

#### Hopper (H100) Architecture Support
- Thread Block Clusters via `cuLaunchKernelEx` with `CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION`
- Distributed Shared Memory (DSMEM) for intra-cluster K2K messaging
- TMA (Tensor Memory Accelerator) async copy configuration
- Green Contexts for SM partitioning via `cuGreenCtxCreate`
- Async memory pool (`cuMemAllocAsync`)
- `GpuArchitecture::blackwell()` preset for B200 (sm_100)

#### Runtime & API Improvements
- `CudaRuntime::launch()` now bridges to `PersistentSimulation` for real GPU execution when `mode=Persistent && cooperative=true`
- Architecture auto-detection via `RINGKERNEL_CUDA_ARCH` env var
- Multi-arch PTX compilation fallback (sm_75/sm_80/sm_89/sm_90)
- libcu++ ordered atomics enabled by default for persistent kernels
- `cargo-audit` security scanning in CI
- Feature matrix CI jobs (no features / cpu / enterprise)

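The interplay between the `RINGKERNEL_CUDA_ARCH` override and the multi-arch fallback list might look roughly like the sketch below. The changelog only names the variable and the four targets. The accepted value formats (`sm_90` or `90`), the "newest target not exceeding the device" rule, and every function name here are assumptions for illustration, not the crate's documented behavior.

```rust
use std::env;

/// PTX targets named in the changelog's fallback list, newest first.
const FALLBACK_ARCHS: [u32; 4] = [90, 89, 80, 75];

/// Parse a value such as "sm_90" or "90" into a compute capability number.
/// The accepted formats are an assumption.
fn parse_arch(raw: &str) -> Option<u32> {
    let trimmed = raw.trim();
    let digits = trimmed.strip_prefix("sm_").unwrap_or(trimmed);
    digits.parse().ok()
}

/// Pick a target: a parseable override wins, otherwise choose the newest
/// fallback target that does not exceed the device capability `device_cc`.
fn pick_target(override_value: Option<&str>, device_cc: u32) -> Option<u32> {
    if let Some(arch) = override_value.and_then(parse_arch) {
        return Some(arch);
    }
    FALLBACK_ARCHS.iter().copied().find(|&arch| arch <= device_cc)
}

/// Read the override from the environment, then fall back to the device capability.
fn detect_target(device_cc: u32) -> Option<u32> {
    let from_env = env::var("RINGKERNEL_CUDA_ARCH").ok();
    pick_target(from_env.as_deref(), device_cc)
}
```

Under these assumptions, a device reporting capability 86 with no override resolves to `sm_80`, and a device older than `sm_75` yields `None` so the caller can report an unsupported GPU.
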
#### Actor Framework
- GPU actor lifecycle (create/destroy/restart/supervise) in single persistent kernel
- Supervision trees with cascading kill, escalation, tree_view
- Named actor registry with wildcard service discovery
- Credit-based backpressure with watermarks and flow metrics
- Dead letter queue with replay, filter, TTL expiry
- Memory pressure handling (budgets, levels, mitigation strategies)
- Idempotency dedup cache with TTL
- `GracefulShutdown` with SIGTERM/SIGINT handling
- `CheckpointManager` for periodic actor state snapshots
- Dynamic scheduling framework (scheduler warp pattern + work stealing codegen)
- Hot config reload with versioning and audit trail

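To make "credit-based backpressure with watermarks" concrete, here is a small host-side sketch of the general technique. The `CreditGate` type and its methods are hypothetical. The project's actual implementation runs inside a persistent GPU kernel and is not shown in this commit.

```rust
/// Pressure level derived from how many credits remain.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Pressure {
    Normal,
    High,
}

/// Hypothetical credit gate: a sender spends one credit per message and
/// the receiver returns credits as it drains its queue.
struct CreditGate {
    capacity: u32,
    available: u32,
    low_watermark: u32,
}

impl CreditGate {
    fn new(capacity: u32, low_watermark: u32) -> Self {
        Self { capacity, available: capacity, low_watermark }
    }

    /// Try to spend one credit; `false` tells the sender to back off.
    fn try_acquire(&mut self) -> bool {
        if self.available == 0 {
            return false;
        }
        self.available -= 1;
        true
    }

    /// Return credits after the receiver has processed `n` messages,
    /// never exceeding the configured capacity.
    fn release(&mut self, n: u32) {
        self.available = self.available.saturating_add(n).min(self.capacity);
    }

    /// Report high pressure once available credits fall to the watermark.
    fn pressure(&self) -> Pressure {
        if self.available <= self.low_watermark {
            Pressure::High
        } else {
            Pressure::Normal
        }
    }
}
```

The point of the design is that the sender never blocks inside the queue: it learns up front whether it may send, and the watermark gives it an early warning before credits run out entirely.
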
#### Error Handling & Safety
- Typed error enums across all application crates (AccNet, WaveSim, TxMon, ProcInt)
- Zero bare `.unwrap()` in production code
- `clippy::unwrap_used` warning lint on 12 crates
- Graceful shutdown handler
- 24 unsafe blocks documented with `// SAFETY:` comments in CUDA code

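As an illustration of the two conventions listed above, the sketch below enables the `clippy::unwrap_used` lint on a single function and documents an `unsafe` block with a `// SAFETY:` comment. The functions are hypothetical examples, not code from the repository. The changelog applies the lint per crate, whereas this sketch scopes it to one item to stay self-contained.

```rust
use std::num::ParseIntError;

/// Parse a port number by propagating the error with `?`.
/// Writing `raw.trim().parse::<u16>().unwrap()` here would trigger the lint
/// when the code is checked with `cargo clippy`.
#[warn(clippy::unwrap_used)]
fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    let port = raw.trim().parse::<u16>()?;
    Ok(port)
}

/// Read the first element through a raw pointer, stating the invariant
/// that makes the dereference sound.
fn first_element(values: &[u32]) -> Option<u32> {
    if values.is_empty() {
        return None;
    }
    // SAFETY: the slice is non-empty, so `as_ptr()` points to a valid,
    // initialized `u32` that lives as long as `values`.
    Some(unsafe { *values.as_ptr() })
}
```
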
#### Documentation & Benchmarks
- `docs/benchmarks/ACADEMIC_PROOF.md`: 15-section paper with 95% CI
- `docs/benchmarks/METHODOLOGY.md`: statistical protocol (8 experiments)
- `docs/benchmarks/h100-b200-baseline.md`: H100 results populated
- `benches/academic_harness.rs`: statistical framework (percentiles, Cohen's d, Welch's t-test)
- `scripts/run-academic-benchmarks.sh`: automated benchmark suite

### Changed

- Upgraded `cudarc` from 0.18.2 to 0.19.3
- TLS PEM certificate parsing implemented (was placeholder returning empty vectors)
- CloudWatch audit sink implemented with AWS SDK (feature-gated)
- OTLP export via dedicated `otel` feature flag
- `println!/eprintln!` migrated to structured `tracing` (64 instances across 10 crates)
- XOR crypto fallback emits `#[deprecated]` warning
- Bumped all 19 crates from 0.4.2 to 1.0.0

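For readers unfamiliar with how a `#[deprecated]` warning surfaces, the sketch below marks a hypothetical XOR helper as deprecated. The function name, signature, and note text are illustrative assumptions. The changelog does not show the project's actual fallback.

```rust
/// Hypothetical stand-in for a weak XOR fallback. Any call site compiles
/// but produces a deprecation warning that carries the note below.
#[deprecated(note = "XOR is not encryption; enable a real cipher feature instead")]
fn xor_fallback(data: &[u8], key: u8) -> Vec<u8> {
    // XOR with a single-byte key is its own inverse, which is why the same
    // function both "encrypts" and "decrypts".
    data.iter().map(|b| b ^ key).collect()
}
```

Callers that must keep using the fallback for now can silence the warning locally with `#[allow(deprecated)]`, which keeps the remaining uses easy to find.
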
### Removed (BREAKING)

- **`ringkernel-wgpu`**: WebGPU backend (no persistent kernel support)
- **`ringkernel-wgpu-codegen`**: WGSL transpiler (17 unimplemented intrinsics due to spec limits)
- **`ringkernel-metal`**: Apple Metal backend (no persistent kernel support)
- **`ringkernel-wavesim3d`**: 3D showcase (hard dependency on wgpu for rendering)
- `wgpu`, `metal`, `all-backends` features from all remaining crates
- `persistent-wgpu` feature from `ringkernel-ecosystem`
- `Backend::WebGpu` and `Backend::Metal` re-exports (enum variants kept as `#[doc(hidden)]` for future use)
- 4,739 lines of dead backend code
- `docs/14-wgpu-codegen.md` and `docs/PRODUCTION_READINESS_ROADMAP.md` (superseded)

### Fixed

- `CudaRuntime::launch()` no longer loads a trivial template kernel; launches real cooperative persistent kernels when requested
- `ringkernel-accnet` and `ringkernel-procint` migrated from cudarc 0.11 API to 0.19.3
- CLI project name validation (unsafe unwrap removed)
- All WGSL transpiler marker `unimplemented!()` calls now have descriptive error messages

### Migration from 0.4.x

- Remove `wgpu`, `metal`, `all-backends` features from `Cargo.toml`
- Replace `ringkernel-wavesim3d` usage with `ringkernel-wavesim` (2D) or custom CUDA code
- Update `ringkernel = "0.4"` to `ringkernel = "1.0"`
- `Result<_, String>` in application crates replaced with typed error enums

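The last migration item changes how callers handle failures. A rough before-and-after sketch follows. `SimError`, its variants, and both functions are hypothetical, since the real enums in the application crates are not shown in this commit.

```rust
use std::fmt;

/// Hypothetical typed error; the real enums in the application crates may differ.
#[derive(Debug, PartialEq, Eq)]
enum SimError {
    InvalidGrid { width: usize, height: usize },
    DeviceUnavailable,
}

impl fmt::Display for SimError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SimError::InvalidGrid { width, height } => {
                write!(f, "invalid grid size {width}x{height}")
            }
            SimError::DeviceUnavailable => write!(f, "no CUDA device available"),
        }
    }
}

impl std::error::Error for SimError {}

/// 0.4.x style: callers could only inspect the message text.
fn cell_count_old(width: usize, height: usize) -> Result<usize, String> {
    if width == 0 || height == 0 {
        return Err(format!("invalid grid size {width}x{height}"));
    }
    Ok(width * height)
}

/// 1.0 style: callers match on a variant instead of parsing a string.
fn cell_count(width: usize, height: usize) -> Result<usize, SimError> {
    if width == 0 || height == 0 {
        return Err(SimError::InvalidGrid { width, height });
    }
    Ok(width * height)
}
```

Code that previously compared or logged the `String` error would, under this pattern, switch to a `match` on the enum, or call `.to_string()` where only the message is needed.
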
## [0.4.2] - 2026-02-06

### Added
