@@ -7,6 +7,101 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [1.0.0] - 2026-04-16
+
+First production-grade release. Focuses exclusively on NVIDIA CUDA. H100-verified with paper-quality benchmarks.
+
+### Headline Results (NVIDIA H100 NVL)
+
+- **8,698x faster** than traditional `cuLaunchKernel`
+- **3,005x faster** than CUDA Graph replay
+- **5.54M ops/s** sustained throughput (CV 0.05%, 60 seconds)
+- **0.628 us** `cluster.sync()` (2.98x vs `grid.sync()`)
+- **116.9x** faster async memory alloc vs `cuMemAlloc`
+- All benchmarks with 95% CI, Cohen's d, Welch's t-test
+
+### Added
+
+#### Hopper (H100) Architecture Support
+- Thread Block Clusters via `cuLaunchKernelEx` with `CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION`
+- Distributed Shared Memory (DSMEM) for intra-cluster K2K messaging
+- TMA (Tensor Memory Accelerator) async copy configuration
+- Green Contexts for SM partitioning via `cuGreenCtxCreate`
+- Async memory pool (`cuMemAllocAsync`)
+- `GpuArchitecture::blackwell()` preset for B200 (sm_100)
+
+#### Runtime & API Improvements
+- `CudaRuntime::launch()` now bridges to `PersistentSimulation` for real GPU execution
+  when `mode=Persistent && cooperative=true`
+- Architecture auto-detection via `RINGKERNEL_CUDA_ARCH` env var
+- Multi-arch PTX compilation fallback (sm_75/sm_80/sm_89/sm_90)
+- libcu++ ordered atomics enabled by default for persistent kernels
+- `cargo-audit` security scanning in CI
+- Feature matrix CI jobs (no features / cpu / enterprise)
+
+#### Actor Framework
+- GPU actor lifecycle (create/destroy/restart/supervise) in single persistent kernel
+- Supervision trees with cascading kill, escalation, tree_view
+- Named actor registry with wildcard service discovery
+- Credit-based backpressure with watermarks and flow metrics
+- Dead letter queue with replay, filter, TTL expiry
+- Memory pressure handling (budgets, levels, mitigation strategies)
+- Idempotency dedup cache with TTL
+- `GracefulShutdown` with SIGTERM/SIGINT handling
+- `CheckpointManager` for periodic actor state snapshots
+- Dynamic scheduling framework (scheduler warp pattern + work stealing codegen)
+- Hot config reload with versioning and audit trail
+
+#### Error Handling & Safety
+- Typed error enums across all application crates (AccNet, WaveSim, TxMon, ProcInt)
+- Zero bare `.unwrap()` in production code
+- `clippy::unwrap_used` warning lint on 12 crates
+- Graceful shutdown handler
+- 24 unsafe blocks documented with `// SAFETY:` comments in CUDA code
+
+#### Documentation & Benchmarks
+- `docs/benchmarks/ACADEMIC_PROOF.md` — 15-section paper with 95% CI
+- `docs/benchmarks/METHODOLOGY.md` — statistical protocol (8 experiments)
+- `docs/benchmarks/h100-b200-baseline.md` — H100 results populated
+- `benches/academic_harness.rs` — statistical framework (percentiles, Cohen's d, Welch's t-test)
+- `scripts/run-academic-benchmarks.sh` — automated benchmark suite
+
+### Changed
+
+- Upgraded `cudarc` from 0.18.2 to 0.19.3
+- TLS PEM certificate parsing implemented (was placeholder returning empty vectors)
+- CloudWatch audit sink implemented with AWS SDK (feature-gated)
+- OTLP export via dedicated `otel` feature flag
+- `println!`/`eprintln!` migrated to structured `tracing` (64 instances across 10 crates)
+- XOR crypto fallback emits `#[deprecated]` warning
+- Bumped all 19 crates from 0.4.2 to 1.0.0
+
+### Removed (BREAKING)
+
+- **`ringkernel-wgpu`** — WebGPU backend (no persistent kernel support)
+- **`ringkernel-wgpu-codegen`** — WGSL transpiler (17 unimplemented intrinsics due to spec limits)
+- **`ringkernel-metal`** — Apple Metal backend (no persistent kernel support)
+- **`ringkernel-wavesim3d`** — 3D showcase (hard dependency on wgpu for rendering)
+- `wgpu`, `metal`, `all-backends` features from all remaining crates
+- `persistent-wgpu` feature from `ringkernel-ecosystem`
+- `Backend::WebGpu` and `Backend::Metal` re-exports (enum variants kept as `#[doc(hidden)]` for future use)
+- 4,739 lines of dead backend code
+- `docs/14-wgpu-codegen.md` and `docs/PRODUCTION_READINESS_ROADMAP.md` (superseded)
+
+### Fixed
+
+- `CudaRuntime::launch()` no longer loads a trivial template kernel; launches real cooperative persistent kernels when requested
+- `ringkernel-accnet` and `ringkernel-procint` migrated from cudarc 0.11 API to 0.19.3
+- CLI project name validation (unsafe unwrap removed)
+- All WGSL transpiler marker `unimplemented!()` calls now have descriptive error messages
+
+### Migration from 0.4.x
+
+- Remove `wgpu`, `metal`, `all-backends` features from `Cargo.toml`
+- Replace `ringkernel-wavesim3d` usage with `ringkernel-wavesim` (2D) or custom CUDA code
+- Update `ringkernel = "0.4"` to `ringkernel = "1.0"`
+- `Result<_, String>` in application crates replaced with typed error enums
+
 ## [0.4.2] - 2026-02-06
 
 ### Added