All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Work that landed while v1.1.0 is still being validated; these items fulfill deferred v1.2 roadmap goals that are implementable on the current 2× H100 NVL hardware. They do not require a new minor version on their own — they are additive and sit behind opt-in feature flags wherever they carry a link dependency.
- New `cluster_dsmem_work_steal` CUDA kernel — blocks within a cluster share a DSMEM-hosted task counter via `cluster.map_shared_rank`; every block steals tasks atomically with no host round-trip and no global memory traffic for the coordinator.
- New `grid_hbm_work_steal` CUDA kernel — every block in the grid steals from a single HBM counter; completes the intra-block → intra-cluster → grid hierarchy.
- `tests/hierarchical_work_steal.rs` — two integration tests audit work conservation (sum of per-block tallies equals the task count) and load distribution on 1,009 / 8,191 prime-sized workloads.
- New `nvshmem` Cargo feature on `ringkernel-cuda` (opt-in, off by default; requires `libnvshmem3-dev-cuda-12` or a manual NVSHMEM install).
- `multi_gpu::nvshmem::NvshmemHeap` RAII wrapper exposing `attach`/`malloc`/`free`/`put`/`get`/`barrier_all`/`fence`/`my_pe`/`n_pes` on top of the stable NVSHMEM host ABI (`libnvshmem_host.so`).
- `build.rs` adds `-L/usr/lib/x86_64-linux-gnu/nvshmem/12` + `-Wl,-rpath,...` automatically; overridable with `NVSHMEM_LIB_DIR`.
- Bootstrap (MPI / `nvshmrun` / unique-ID) is left to the caller; the wrapper refuses to attach when `nvshmem_n_pes() <= 0`, so misconfigured runs fail fast rather than crash the process.
- `GpuArchitecture::blackwell()` expanded with feature queries: `supports_cluster_launch_control`, `supports_fp8`, `supports_fp6`, `supports_fp4`, `supports_nvlink5`, `supports_tee`.
- `GpuArchitecture::rubin()` preset added for the post-Blackwell tier (compute cap 12.x placeholder; updated when silicon is available).
- `GpuArchitecture::from_compute_capability` routes 10.x / 11.x to Blackwell and 12.x to Rubin.
- `ringkernel-ir::ScalarType` gains `BF16`, `FP8E4M3`, `FP8E5M2`, `FP6E3M2`, `FP6E2M3`, `FP4E2M1`. Each carries a `min_compute_capability()` so the codegen backend can reject kernels that ask for types the target GPU can't execute.
- Lowering updated in `lower_cuda.rs` (emits `__nv_bfloat16`, `__nv_fp8_*`, `__nv_fp6_*`, `__nv_fp4_*`), `lower_msl.rs` (uses `bfloat`/`float` fallback), and `lower_wgsl.rs` (`f32` fallback). `build.rs` already compiles a multi-arch fallback including sm_100; runtime validation still waits for B200 hardware.
- Cache-line padding: `head`, `tail`, producer-side stats, and consumer-side stats each live on their own 128-byte line. Before this, every `try_enqueue` invalidated the consumer's cached view of `tail` (and vice versa) — every op paid a cross-core cache-coherence round-trip. The 128-byte line matches the AMD Zen 4 / Intel spatial-prefetching pair width and aligns to Hopper L2 lines.
- Stats split: producer counters (`enqueued`, `dropped`, `max_depth`) and consumer counters (`dequeued`) no longer share a cache line. Before: the consumer's `fetch_add(dequeued)` invalidated the producer's cached line that also held `enqueued`.
- `update_max_depth` CAS loop → single `fetch_max`: the old compare-and-swap loop is replaced by `AtomicU64::fetch_max` (Rust 1.45+) — one atomic RMW instead of a potentially spinning CAS loop under contention.
- New `tests/spsc_two_thread_throughput.rs` benchmark (dedicated producer/consumer threads) measures actual concurrent throughput. The single-threaded `sustained_throughput.rs` (used in paper Exp 4) only measures round-trip latency and does not observe false sharing.
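The CAS-loop-to-`fetch_max` change can be sketched in isolation. This is a minimal standalone illustration of the two patterns, not the crate's actual queue code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Old pattern: a compare-and-swap loop that can spin under contention.
fn update_max_depth_cas(max_depth: &AtomicU64, depth: u64) {
    let mut cur = max_depth.load(Ordering::Relaxed);
    while depth > cur {
        match max_depth.compare_exchange_weak(cur, depth, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => break,
            Err(observed) => cur = observed, // retry against the newer value
        }
    }
}

/// New pattern: one atomic read-modify-write, no retry loop (Rust 1.45+).
fn update_max_depth(max_depth: &AtomicU64, depth: u64) {
    max_depth.fetch_max(depth, Ordering::Relaxed);
}

fn main() {
    let m = AtomicU64::new(0);
    for d in [3, 7, 5, 7, 2] {
        update_max_depth(&m, d);
    }
    assert_eq!(m.load(Ordering::Relaxed), 7);

    let m2 = AtomicU64::new(0);
    for d in [3, 7, 5] {
        update_max_depth_cas(&m2, d);
    }
    assert_eq!(m2.load(Ordering::Relaxed), 7);
    println!("max depth tracked: {}", m.load(Ordering::Relaxed));
}
```

Both track the same maximum; `fetch_max` simply lets the hardware arbitrate contention in a single RMW.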
- Workspace deps: 8 crates (`ringkernel-accnet`, `ringkernel-cli`, `ringkernel-graph`, `ringkernel-montecarlo`, `ringkernel-procint`, `ringkernel-txmon`, `ringkernel-wavesim`, `ringkernel-wavesim3d`) migrated from hardcoded `version = "1.1.0", path = "../..."` to `{ workspace = true }`. The root `[workspace.dependencies]` table gains a `ringkernel` entry so the facade crate is also usable that way. Future version bumps touch one line, not twenty.
- `MockBackend::launches`/`deliveries` methods annotated `#[allow(dead_code)]` with "kept for manual test debugging" — the dead-code warning was signal noise.
- `cargo fmt --all` across all code edited this session. `cargo clippy --workspace --lib --bins -- -D warnings` (matching the CI invocation) is clean. Pre-existing clippy warnings in cuda-gated hopper modules (async_mem, cluster, green_ctx) are not triggered — they are outside CI's default scope.
- `cargo test --workspace --release --exclude ringkernel-txmon`: 1,617 tests pass, 0 failures (up from 1,590 in v1.1.0, thanks to new delta-checkpoint, Blackwell-capability, hierarchical work-stealing, and HBM-tier direct-measurement unit tests).
Second release. Adds multi-GPU runtime, VynGraph NSAI integration points, and paper-quality experimental validation on 2× H100 NVL (Azure NC80adis_H100_v5).
- NVLink P2P migration: 8.7× faster than host staging at a 16 MiB payload (69 µs P2P vs 597 µs host-staged; 200 trials, 95% CI).
- `cuCtxEnablePeerAccess` / `cuMemcpyPeerAsync` wired on real 2-GPU hardware — the runtime's multi-GPU facade now performs real CUDA P2P rather than host-only simulation. CRC32 byte-for-byte integrity verified on every migration.
- Formal verification: 6/6 TLA+ specs pass under TLC with no counterexamples: `hlc`, `k2k_delivery`, `migration`, `multi_gpu_k2k`, `tenant_isolation`, `actor_lifecycle`. One model-level bug (`migration.ChecksumMatch` mis-stated) caught and fixed during the run; the real implementation was already correct.
- Cross-tenant leak count: 0 across 13 multi-tenant isolation tests.
- Lifecycle rule overhead: Spawn/Activate/Quiesce/Terminate/Restart all within 23 ± 5 ns mean, p99 = 30 ns (sub-100 ns as claimed).
- Sustained throughput: 5.10 M ops/s over 4 × 60 s trials, CV 0.66%, degradation first→last 3 windows = -0.3%, p99 = 110 ns (flat).
- No regression vs v1.0 baseline on single-GPU paths.
- `ringkernel-cuda::multi_gpu::MultiGpuRuntime` — per-device `CudaRuntime` facade with `PlacementHint::{Auto, Pinned, WithActor, NvlinkPreferred}`.
- Real `cuCtxEnablePeerAccess`/`cuCtxDisablePeerAccess` — previously bookkeeping-only; now invokes the driver when both backends are live CUDA contexts, with `CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED` handled gracefully.
- Real `cuMemcpyPeerAsync` in the 3-phase migration protocol's transfer phase. Falls back to host-only simulation when no peer access is available (mock backends in tests, or single-GPU hardware).
- NVLink topology probe (`NvlinkTopology::probe`) via NVML — produces the adjacency matrix and bandwidth used by `PlacementHint::NvlinkPreferred`.
- `MigrationController` with global buffer budget, rate limiting, and concurrency cap.
- Migration kernel PTX (capture / restore / drain) compiled at build time when `nvcc` is present, with graceful unavailability otherwise.
- PROV-O provenance header (8 relation kinds) attachable to every K2K envelope. Opt-in per send, chain walk with depth bound and cycle detection, ECDSA/P-256 signature verification hook.
- Multi-tenant K2K isolation via per-tenant sub-brokers with `AuditTag`, per-tenant quotas, an audit sink for cross-tenant attempts, and a `LegacyTenant::Unspecified` fast path.
- Live introspection streaming (`IntrospectionStream`) with EWMA decay and a drop-tolerant ring buffer for high-frequency telemetry.
- Hot rule reload with a `CompiledRule` artifact API — higher version numbers activate immediately, lower/equal are rejected, and quiescence of in-flight evaluators is guaranteed under load.
- GPU-side tenant enforcement kernels and migration kernels compiled from `src/cuda/*.cu` when `nvcc` is available.
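The EWMA decay used by the introspection stream follows the standard exponentially weighted moving average. A minimal sketch (the `Ewma` type and `alpha` seeding policy here are illustrative, not the crate's API):

```rust
/// EWMA with decay factor alpha in (0, 1]; larger alpha weights recent samples more.
struct Ewma {
    alpha: f64,
    value: Option<f64>,
}

impl Ewma {
    fn new(alpha: f64) -> Self {
        Ewma { alpha, value: None }
    }

    fn update(&mut self, sample: f64) -> f64 {
        let v = match self.value {
            None => sample, // first sample seeds the average
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
        };
        self.value = Some(v);
        v
    }
}

fn main() {
    let mut e = Ewma::new(0.5);
    e.update(100.0);
    e.update(0.0); // old signal decays toward the new sample
    assert_eq!(e.value.unwrap(), 50.0);
    e.update(0.0);
    assert_eq!(e.value.unwrap(), 25.0);
    println!("ewma = {}", e.value.unwrap());
}
```

The decay makes stale telemetry fade geometrically, which is why a drop-tolerant ring buffer is acceptable: lost samples only delay convergence rather than corrupt it.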
- Six TLA+ specifications in `docs/verification/`: `hlc.tla`, `k2k_delivery.tla`, `migration.tla`, `multi_gpu_k2k.tla`, `tenant_isolation.tla`, `actor_lifecycle.tla`.
- `docs/verification/tlc.sh` wrapper; a `docs/paper/experiments/05-tlc-stats/` pipeline runs every spec and produces a CSV summary.
- `DefaultParent`/`DefaultActorGpu` operators in the .tla files so the .cfg files stay TLC-parser-clean.
- Academic paper Persistent GPU Actors (`docs/paper/`, 13 sections + appendix), built with `make` → 48-page `main.pdf`.
- Six-experiment pipeline in `docs/paper/experiments/` (tier latency, snapshot/restart, lifecycle, sustained, TLC, NVLink migration) with per-experiment `run.sh` + `extract.py`, a top-level `run_all.sh`, and a reproducibility manifest (`manifest.json`) capturing commit, driver, CUDA, Rust, and GPU.
- Paper-aligned integration tests: `paper_tier_latency`, `paper_snapshot_restart`, `paper_lifecycle_overhead`, `paper_nvlink_migration`, `sustained_throughput`.
- `ringkernel-cuda::multi_gpu::runtime::GpuBackend` trait gains a `cu_context(&self) -> Option<usize>` method (default `None`) so the runtime can drive CUDA P2P when backends are real `CudaRuntime` instances. Mock backends used in unit tests keep returning `None`.
- `migration.tla` — introduced an explicit `captured_state` variable so the `ChecksumMatch` invariant holds under late-arriving messages during transfer (this mirrored the real impl; the spec was lagging).
- TLC `.cfg` files — added `CHECK_DEADLOCK FALSE` to every spec, since all six bounded models reach a legitimate terminal state when their `MaxMsgs`/`MaxEvents`/`MaxSteps` bound saturates.
- New `cluster_hbm_k2k` CUDA kernel (cross-cluster K2K via global memory with `grid.sync()`) wired into `paper_tier_latency` as the `hbm` tier. Previously only SMEM and DSMEM were measured directly; HBM is now a first-class tier with 1,000 trials per payload, giving a clean monotonic SMEM < DSMEM < HBM latency hierarchy across all payload sizes.
- New `paper_multi_gpu_k2k_bw` micro-benchmark (256 back-to-back `cuMemcpyPeerAsync` per size, 32-round warmup) measured on 2× H100 NVL: 2.3 GB/s @ 4 KiB, 32 GB/s @ 64 KiB, 179 GB/s @ 1 MiB, 258 GB/s @ 16 MiB sustained — ~81% of the 318 GB/s theoretical peak of a 12-link NVLink bundle. Complements Experiment 6's one-shot latency data. Output CSV at `docs/paper/experiments/results/<ts>/exp6b_mgpu_bw/mgpu_bw.csv`.
- `Checkpoint::delta_from(base, new)` returns a checkpoint containing only the chunks whose `(type, id)` identity's data differs from `base`, plus any chunks new in `new`.
- `Checkpoint::applied_with_delta(base, delta)` re-materializes the full checkpoint and verifies the recorded parent digest matches the supplied base (catches wrong-base application).
- `Checkpoint::content_digest()` is the stable CRC32 over ordered `(identity, bytes)` used for delta parent tracking.
- `DELTA_PARENT_DIGEST_KEY` is the well-known metadata custom key.
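The delta-checkpoint protocol above can be sketched with a map diff plus a parent-digest check. This is an illustrative stand-in, assuming simplified types; it uses an FNV-1a hash where the crate uses CRC32 — the property that matters is a stable digest over the ordered chunks:

```rust
use std::collections::BTreeMap;

type Identity = (u8, u64); // (type, id) chunk identity
type Chunks = BTreeMap<Identity, Vec<u8>>;

/// Stable digest over ordered (identity, bytes) pairs (FNV-1a as a stand-in for CRC32).
fn content_digest(chunks: &Chunks) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for ((ty, id), bytes) in chunks {
        for b in std::iter::once(*ty).chain(id.to_le_bytes()).chain(bytes.iter().copied()) {
            h ^= b as u64;
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    h
}

/// Keep only chunks that changed relative to `base`, plus chunks new in `new`;
/// record the parent digest so application onto the wrong base is caught.
fn delta_from(base: &Chunks, new: &Chunks) -> (u64, Chunks) {
    let delta = new
        .iter()
        .filter(|&(k, v)| base.get(k) != Some(v))
        .map(|(k, v)| (*k, v.clone()))
        .collect();
    (content_digest(base), delta)
}

fn applied_with_delta(base: &Chunks, parent: u64, delta: &Chunks) -> Result<Chunks, String> {
    if content_digest(base) != parent {
        return Err("parent digest mismatch: wrong base".into());
    }
    let mut full = base.clone();
    full.extend(delta.iter().map(|(k, v)| (*k, v.clone())));
    Ok(full)
}

fn main() {
    let base: Chunks = [((0, 1), vec![1]), ((0, 2), vec![2])].into_iter().collect();
    let mut new = base.clone();
    new.insert((0, 2), vec![9]); // changed chunk
    new.insert((0, 3), vec![3]); // new chunk

    let (parent, delta) = delta_from(&base, &new);
    assert_eq!(delta.len(), 2); // only the changed chunk and the new chunk
    assert_eq!(applied_with_delta(&base, parent, &delta).unwrap(), new);
    assert!(applied_with_delta(&new, parent, &delta).is_err()); // wrong base caught
    println!("delta chunks: {}", delta.len());
}
```

Unchanged chunks never travel with the delta, and the parent digest makes wrong-base application a hard error rather than silent corruption.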
- New `warp_work_steal` CUDA kernel — warps within a block atomically decrement a shared task counter and process stolen stripes; lane 0 reports each warp's tally via `stats[]` so the host can audit that work is conserved (the sum of per-warp tallies equals `total_tasks`).
- `tests/warp_work_steal.rs` — two integration tests verify that (1) every task is processed exactly once and (2) uneven task counts don't starve individual warps. Intra-cluster and cross-cluster stealing (DSMEM-backed and HBM-backed) remain future work for v1.2.
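The counter-decrement protocol generalizes beyond the GPU. A host-thread sketch of the same idea (this is not the CUDA kernel — threads stand in for warps, and `run_steal` is an illustrative name):

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Workers atomically claim fixed-size stripes from a shared counter and
/// report per-worker tallies, mirroring the kernel's shared-counter steal.
fn run_steal(total_tasks: i64, stripe: i64, workers: usize) -> u64 {
    let remaining = Arc::new(AtomicI64::new(total_tasks));
    let tallies: Arc<Vec<AtomicU64>> =
        Arc::new((0..workers).map(|_| AtomicU64::new(0)).collect());

    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let remaining = Arc::clone(&remaining);
            let tallies = Arc::clone(&tallies);
            thread::spawn(move || loop {
                // Atomic decrement plays the role of the kernel's counter steal.
                let end = remaining.fetch_sub(stripe, Ordering::Relaxed);
                if end <= 0 {
                    break; // counter exhausted by another worker
                }
                // The last stripe may be partial.
                tallies[w].fetch_add(end.min(stripe) as u64, Ordering::Relaxed);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    tallies.iter().map(|t| t.load(Ordering::Relaxed)).sum()
}

fn main() {
    // Prime-sized workloads, as in the integration tests.
    assert_eq!(run_steal(1_009, 8, 4), 1_009); // work is conserved
    assert_eq!(run_steal(8_191, 8, 4), 8_191);
    println!("work conserved");
}
```

Because `fetch_sub` returns a unique pre-decrement value to each stealer, every task index is claimed exactly once regardless of interleaving — the same argument the GPU test audits via per-warp tallies.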
- NVSHMEM symmetric heap — still deferred to v1.2. The `cuMemcpyPeer` path is feature-complete for migration; NVSHMEM would add symmetric-heap semantics for in-kernel all-reduce etc., which is a larger integration.
- Multi-GPU linear scaling beyond 2 GPUs — genuinely hardware-bound on NC80adis_H100_v5 (2 GPUs). Deferred to a 4× H100 / 8× H100 SKU in v1.2.
- Intra-cluster and cross-cluster work stealing — the v1.1 primitive covers intra-block. The DSMEM / HBM tiers of the hierarchy defined in the v1.2 roadmap are next.
First production-grade release. Focuses exclusively on NVIDIA CUDA. H100-verified with paper-quality benchmarks.
- 8,698× faster than traditional `cuLaunchKernel`
- 3,005× faster than CUDA Graph replay
- 5.54 M ops/s sustained throughput (CV 0.05%, 60 seconds)
- 0.628 µs `cluster.sync()` (2.98× vs `grid.sync()`)
- 116.9× faster async memory alloc vs `cuMemAlloc`
- All benchmarks with 95% CI, Cohen's d, Welch's t-test
- Thread Block Clusters via `cuLaunchKernelEx` with `CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION`
- Distributed Shared Memory (DSMEM) for intra-cluster K2K messaging
- TMA (Tensor Memory Accelerator) async copy configuration
- Green Contexts for SM partitioning via `cuGreenCtxCreate`
- Async memory pool (`cuMemAllocAsync`)
- `GpuArchitecture::blackwell()` preset for B200 (sm_100)
- `CudaRuntime::launch()` now bridges to `PersistentSimulation` for real GPU execution when `mode=Persistent && cooperative=true`
- Architecture auto-detection via the `RINGKERNEL_CUDA_ARCH` env var
- Multi-arch PTX compilation fallback (sm_75/sm_80/sm_89/sm_90)
- libcu++ ordered atomics enabled by default for persistent kernels
- `cargo-audit` security scanning in CI
- Feature matrix CI jobs (no features / cpu / enterprise)
- GPU actor lifecycle (create/destroy/restart/supervise) in single persistent kernel
- Supervision trees with cascading kill, escalation, tree_view
- Named actor registry with wildcard service discovery
- Credit-based backpressure with watermarks and flow metrics
- Dead letter queue with replay, filter, TTL expiry
- Memory pressure handling (budgets, levels, mitigation strategies)
- Idempotency dedup cache with TTL
- `GracefulShutdown` with SIGTERM/SIGINT handling
- `CheckpointManager` for periodic actor state snapshots
- Dynamic scheduling framework (scheduler warp pattern + work-stealing codegen)
- Hot config reload with versioning and audit trail
- Typed error enums across all application crates (AccNet, WaveSim, TxMon, ProcInt)
- Zero bare `.unwrap()` in production code
- `clippy::unwrap_used` warning lint on 12 crates
- Graceful shutdown handler
- 24 unsafe blocks documented with `// SAFETY:` comments in CUDA code
- `docs/benchmarks/ACADEMIC_PROOF.md` — 15-section paper with 95% CIs
- `docs/benchmarks/METHODOLOGY.md` — statistical protocol (8 experiments)
- `docs/benchmarks/h100-b200-baseline.md` — H100 results populated
- `benches/academic_harness.rs` — statistical framework (percentiles, Cohen's d, Welch's t-test)
- `scripts/run-academic-benchmarks.sh` — automated benchmark suite
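The two statistics named above are small enough to show in full. A self-contained sketch of the textbook formulas (not the harness's actual code):

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample variance (n - 1 denominator).
fn var(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

/// Welch's t statistic: compares means without assuming equal variances.
fn welch_t(a: &[f64], b: &[f64]) -> f64 {
    let (na, nb) = (a.len() as f64, b.len() as f64);
    (mean(a) - mean(b)) / (var(a) / na + var(b) / nb).sqrt()
}

/// Cohen's d effect size with pooled standard deviation.
fn cohens_d(a: &[f64], b: &[f64]) -> f64 {
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let pooled = (((na - 1.0) * var(a) + (nb - 1.0) * var(b)) / (na + nb - 2.0)).sqrt();
    (mean(a) - mean(b)) / pooled
}

fn main() {
    let a = [10.0, 11.0, 9.0, 10.5, 9.5]; // e.g. baseline latencies
    let b = [12.0, 13.0, 11.0, 12.5, 11.5]; // e.g. candidate latencies
    let t = welch_t(&a, &b);
    let d = cohens_d(&a, &b);
    assert!(t < 0.0); // a's mean is lower
    assert!(d.abs() > 1.0); // conventionally a "large" effect
    println!("t = {:.3}, d = {:.3}", t, d);
}
```

Welch's test is the right default for benchmark comparisons because the two configurations rarely share a variance; Cohen's d then reports whether a statistically significant difference is practically large.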
- Upgraded `cudarc` from 0.18.2 to 0.19.3
- TLS PEM certificate parsing implemented (was a placeholder returning empty vectors)
- CloudWatch audit sink implemented with the AWS SDK (feature-gated)
- OTLP export via a dedicated `otel` feature flag
- `println!`/`eprintln!` migrated to structured `tracing` (64 instances across 10 crates)
- XOR crypto fallback emits a `#[deprecated]` warning
- Bumped all 19 crates from 0.4.2 to 1.0.0
- `ringkernel-wgpu` — WebGPU backend (no persistent kernel support)
- `ringkernel-wgpu-codegen` — WGSL transpiler (17 unimplemented intrinsics due to spec limits)
- `ringkernel-metal` — Apple Metal backend (no persistent kernel support)
- `ringkernel-wavesim3d` — 3D showcase (hard dependency on wgpu for rendering)
- `wgpu`, `metal`, `all-backends` features from all remaining crates
- `persistent-wgpu` feature from `ringkernel-ecosystem`
- `Backend::WebGpu` and `Backend::Metal` re-exports (enum variants kept as `#[doc(hidden)]` for future use)
- 4,739 lines of dead backend code
- `docs/14-wgpu-codegen.md` and `docs/PRODUCTION_READINESS_ROADMAP.md` (superseded)
- `CudaRuntime::launch()` no longer loads a trivial template kernel; it launches real cooperative persistent kernels when requested
- `ringkernel-accnet` and `ringkernel-procint` migrated from the cudarc 0.11 API to 0.19.3
- CLI project name validation (unsafe unwrap removed)
- All WGSL transpiler marker `unimplemented!()` calls now have descriptive error messages
- Remove `wgpu`, `metal`, `all-backends` features from `Cargo.toml`
- Replace `ringkernel-wavesim3d` usage with `ringkernel-wavesim` (2D) or custom CUDA code
- Update `ringkernel = "0.4"` to `ringkernel = "1.0"`
- `Result<_, String>` in application crates replaced with typed error enums
- Two-phase warp-shuffle reduction replaces tree reduction in all generated CUDA code
  - Phase 1: intra-warp `__shfl_down_sync(0xFFFFFFFF, val, offset)` — zero `__syncthreads()` calls
  - Phase 2: cross-warp reduction via shared memory — one `__syncthreads()` call
- Applies to: `block_reduce_energy` (persistent FDTD), `generate_block_reduce_fn`, `generate_grid_reduce_fn`, `generate_reduce_and_broadcast_fn`, and all inline reduction generators
- Reduces barrier count from O(log N) to 1 per block reduction (e.g., 9 → 1 for 512-thread blocks)
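The two-phase shape can be modeled on the host. This sketch simulates the `__shfl_down_sync` butterfly over a 32-lane array and then sums the per-warp leaders (the GPU version does the second phase through shared memory with its single `__syncthreads()`); it illustrates the algorithm, not the generated CUDA:

```rust
const WARP: usize = 32;

/// Simulate the intra-warp shuffle butterfly: at each step every lane adds the
/// value `offset` lanes above it, halving the offset each time.
fn warp_reduce_sum(lanes: &mut [f64; WARP]) -> f64 {
    let mut offset = WARP / 2;
    while offset > 0 {
        for lane in 0..WARP - offset {
            lanes[lane] += lanes[lane + offset]; // __shfl_down_sync analogue
        }
        offset /= 2;
    }
    lanes[0] // lane 0 holds the warp total
}

/// Two-phase block reduction: per-warp shuffle, then one pass over warp leaders.
fn block_reduce_sum(block: &[f64]) -> f64 {
    block
        .chunks(WARP)
        .map(|chunk| {
            let mut lanes = [0.0; WARP];
            lanes[..chunk.len()].copy_from_slice(chunk);
            warp_reduce_sum(&mut lanes)
        })
        .sum()
}

fn main() {
    let block: Vec<f64> = (1..=512).map(|i| i as f64).collect();
    let total = block_reduce_sum(&block);
    assert_eq!(total, 512.0 * 513.0 / 2.0); // Gauss sum check
    println!("block sum = {}", total);
}
```

The barrier saving follows directly: phase 1 needs no synchronization because shuffles stay inside a warp, so only the cross-warp hand-off requires the one remaining barrier.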
- `PersistentFdtdConfig::idle_sleep_ns` (default 1000 ns): configurable idle sleep duration
- Persistent FDTD idle spin-wait now uses `__nanosleep()` instead of a volatile counter loop
- Software grid barrier spin-loop uses `__nanosleep(100)` to reduce power consumption
- Builder: `with_idle_sleep(ns)` to customize the sleep duration
- `PersistentFdtdConfig::use_libcupp_atomics` (default false): opt-in `cuda::atomic_ref` support
- When enabled, H2K/K2H queue operations use `memory_order_acquire`/`memory_order_release` instead of `__threadfence_system()` pairs
- Software barrier uses `cuda::thread_scope_device` (narrower than system scope) with `memory_order_acq_rel`
- Compile-time guard: `#if __CUDACC_VER_MAJOR__ < 11` error for the CUDA toolkit version check
- Builder: `with_libcupp_atomics(true)` to enable
- `block_reduce_energy` in persistent FDTD now uses warp-shuffle instead of shared-memory tree reduction
- All standalone reduction helpers in `reduction_intrinsics.rs` upgraded to the warp-shuffle pattern
0.4.1 - 2026-02-06
- proptest integration in `ringkernel-core` for queue and HLC invariants
  - Queue: capacity power-of-2 invariant, length bounds, FIFO ordering, stats consistency, enqueue/dequeue round-trip, partitioned routing determinism (6 tests)
  - HLC: total ordering (reflexive, antisymmetric, transitive), zero-is-minimum, pack/unpack round-trip, tick strictly increasing, update causality preservation (7 tests)
- 13 new property-based tests, increasing the total test count from 1,403 to 1,416
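The HLC invariants being property-tested can be seen in a minimal Hybrid Logical Clock sketch (a (physical, logical) pair; this is the textbook algorithm, not the crate's `HlcClock`):

```rust
/// Minimal HLC timestamp: ordered lexicographically by (physical, logical).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    physical: u64,
    logical: u32,
}

struct HlcClock {
    now: Hlc,
}

impl HlcClock {
    /// Local event: the result is strictly greater than the previous timestamp.
    fn tick(&mut self, wall: u64) -> Hlc {
        if wall > self.now.physical {
            self.now = Hlc { physical: wall, logical: 0 };
        } else {
            self.now.logical += 1; // wall clock stalled: logical part breaks the tie
        }
        self.now
    }

    /// Receive: the result dominates both the local clock and the remote timestamp.
    fn update(&mut self, wall: u64, remote: Hlc) -> Hlc {
        let phys = wall.max(self.now.physical).max(remote.physical);
        let logical = if phys == self.now.physical && phys == remote.physical {
            self.now.logical.max(remote.logical) + 1
        } else if phys == self.now.physical {
            self.now.logical + 1
        } else if phys == remote.physical {
            remote.logical + 1
        } else {
            0
        };
        self.now = Hlc { physical: phys, logical };
        self.now
    }
}

fn main() {
    let mut clock = HlcClock { now: Hlc { physical: 0, logical: 0 } };
    let a = clock.tick(10);
    let b = clock.tick(10); // same wall time: logical component advances
    assert!(b > a); // tick strictly increasing
    let remote = Hlc { physical: 50, logical: 3 };
    let c = clock.update(10, remote);
    assert!(c > remote && c > b); // update preserves causality
    println!("{:?}", c);
}
```

These two assertions are exactly the "tick strictly increasing" and "update causality preservation" properties listed above, which proptest then checks over arbitrary interleavings.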
- `web` convenience feature combining `axum`, `tower`, and `grpc`
- `data` convenience feature combining `arrow` and `polars`
- `monitoring` convenience feature combining `tracing-integration` and `prometheus`
- Shared DSL marker functions extracted to `ringkernel-codegen/src/dsl_common.rs`
  - 27 identical functions deduplicated: thread/block indices, synchronization primitives, math functions
  - Both `ringkernel-cuda-codegen` and `ringkernel-wgpu-codegen` now re-export from the shared module
  - ~300 lines of duplicate code eliminated
- `unavailable_backend!` macro in `ringkernel-core/src/backend_stub.rs`
  - A single macro generates the full `RingKernelRuntime` stub for disabled backends
  - Applied to `ringkernel-cuda`, `ringkernel-wgpu`, and `ringkernel-metal`
  - ~100 lines of triplicated stub code eliminated
- Replaced `eprintln!` with structured `tracing` macros in library code across 6 crates:
  - `ringkernel-core/src/observability.rs` — OTLP stub logging
  - `ringkernel-ir/src/optimize.rs` — optimization iteration warning
  - `ringkernel-accnet/src/cuda/runtime.rs` — GPU init, fallback, and error logging
  - `ringkernel-accnet/src/gui/app.rs` — backend status logging
  - `ringkernel-wavesim3d/src/simulation/block_actor_backend.rs` — cooperative kernel fallback
  - `ringkernel-wavesim3d/src/simulation/persistent_backend.rs` — grid size info
- Added `// SAFETY:` comments to all `unsafe` blocks in GPU backend code (~80+ blocks):
  - `ringkernel-accnet/src/cuda/executor.rs` (5 blocks)
  - `ringkernel-graph/src/gpu/cuda.rs` (18 blocks)
  - `ringkernel-montecarlo/src/gpu/cuda.rs` (11 blocks)
  - `ringkernel-wavesim/src/simulation/cuda_compute.rs` (4 blocks)
  - `ringkernel-wavesim/src/simulation/cuda_packed.rs` (6 blocks)
  - `ringkernel-wavesim3d/src/simulation/gpu_backend.rs` (8 blocks)
  - `ringkernel-wavesim3d/src/simulation/block_actor_backend.rs` (21 blocks)
  - `ringkernel-wavesim3d/src/simulation/actor_backend.rs` (16 blocks)
  - `ringkernel-wavesim3d/src/visualization/renderer.rs` (1 block)
- Added `#[inline]` annotations to queue hot-path methods (`try_enqueue`, `try_dequeue`, `len`, `is_empty`, `is_full`, `capacity`)
- Added `#[inline]` to HLC timestamp operations (`tick`, `update`, `cmp`, `partial_cmp`)
- Added `#[inline]` to control block state accessors
- Eliminated an unnecessary `clone()` in the queue retry loop
- Tenant suspension now correctly sets the `active` flag (was a no-op)
- Handler registration returns `Result` instead of panicking on a duplicate ID
- TLS session resumption stores actual session ticket data
- CloudWatch audit sink returns an explicit `Err` instead of silently dropping events
0.4.0 - 2026-01-25
This release extracts ~7,000 lines of proven GPU infrastructure from RustGraph into RingKernel, making these capabilities available to all RingKernel users.
- PyO3-based Python wrapper providing Pythonic access to RingKernel
  - Full async/await support with `pyo3-async-runtimes` and tokio integration
  - Sync fallbacks for all async operations (`create_sync`, `launch_sync`, etc.)
  - Type stubs (`.pyi` files) for IDE support and static type checking
  - Python 3.8+ compatibility via `abi3-py38`
- Core Runtime API:
  - `RingKernel.create()` / `create_sync()` — create a runtime with backend selection
  - `KernelHandle` — launch, activate, deactivate, terminate kernels
  - `LaunchOptions` — configure queue capacity, block size, priority
  - `MessageId`, `MessageEnvelope` — message handling primitives
  - `HlcTimestamp`, `HlcClock` — Hybrid Logical Clock support
  - `K2KBroker`, `K2KEndpoint` — kernel-to-kernel messaging
  - `QueueStats` — queue monitoring and statistics
- CUDA Support (feature-gated via `cuda`):
  - `CudaDevice` — device enumeration and properties
  - `GpuMemoryPool` — stratified GPU memory pool management
  - `StreamManager` — multi-stream execution management
  - `ProfilingSession` — GPU profiling and metrics collection
- Benchmark Framework (feature-gated via `benchmark`):
  - `BenchmarkSuite`, `BenchmarkConfig` — comprehensive benchmarking
  - `BenchmarkResult` — results with throughput and timing
  - Regression detection with baseline comparison
  - Multiple report formats (Markdown, JSON, LaTeX)
- Hybrid Dispatcher:
  - `HybridDispatcher` — automatic CPU/GPU workload routing
  - `HybridConfig`, `ProcessingMode` — configuration with adaptive thresholds
  - `HybridStats` — execution statistics and threshold learning
- Resource Management:
  - `ResourceGuard` — memory limit enforcement with safety margins
  - `ReservationGuard` — RAII wrapper for guaranteed allocations
  - `MemoryEstimate` — workload memory estimation
```python
import ringkernel
import asyncio

async def main():
    runtime = await ringkernel.RingKernel.create(backend="cpu")
    kernel = await runtime.launch("processor", ringkernel.LaunchOptions())
    await kernel.terminate()
    await runtime.shutdown()

asyncio.run(main())
```

- `PtxCache` — disk-based PTX compilation cache for faster kernel loading
  - SHA-256 content-based hashing for cache keys
  - Compute-capability-aware caching (separate cache per GPU architecture)
  - Thread-safe with atomic file operations
  - Environment variable support: `RINGKERNEL_PTX_CACHE_DIR`
  - `PtxCacheStats` for hit/miss tracking
  - `PtxCacheError` with descriptive error types
  - Default cache location: `~/.cache/ringkernel/ptx/`
```rust
use ringkernel_cuda::compile::{PtxCache, PtxCacheStats};

let cache = PtxCache::new()?; // Uses default directory
let hash = PtxCache::hash_source(cuda_source);

// Check cache first
if let Some(ptx) = cache.get(&hash, "sm_89")? {
    // Use cached PTX
} else {
    let ptx = compile_ptx(cuda_source)?;
    cache.put(&hash, "sm_89", &ptx)?;
}
println!("Cache stats: {:?}", cache.stats());
```

- `GpuStratifiedPool` — size-stratified memory pool for GPU VRAM
  - 6 size classes: 256 B, 1 KB, 4 KB, 16 KB, 64 KB, 256 KB
  - O(1) allocation from free lists per bucket
  - Large-allocation fallback for oversized requests
  - Thread-safe with atomic counters
  - `GpuPoolConfig` with presets: `for_graph_analytics()`, `for_simulation()`
  - `GpuPoolDiagnostics` for monitoring utilization
  - `warm_bucket()` for pre-allocation
  - `compact()` for memory defragmentation
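The bucket-selection step behind "O(1) allocation" can be sketched as a round-up over the size classes. An illustrative stand-in (`bucket_for` is not the crate's API):

```rust
/// The pool's size classes, in bytes.
const CLASSES: [usize; 6] = [256, 1024, 4096, 16_384, 65_536, 262_144];

/// Pick the smallest class that fits, or None for the large-allocation fallback.
fn bucket_for(size: usize) -> Option<usize> {
    CLASSES.iter().copied().find(|&c| c >= size)
}

fn main() {
    assert_eq!(bucket_for(512), Some(1024)); // 512 B rounds up to the 1 KB bucket
    assert_eq!(bucket_for(256), Some(256)); // exact fit
    assert_eq!(bucket_for(300_000), None); // oversized: direct allocation fallback
    println!("512 -> {:?}", bucket_for(512));
}
```

Once the class is known, allocation is a pop from that bucket's free list, which is what makes the pooled path constant-time.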
```rust
use ringkernel_cuda::memory_pool::{GpuStratifiedPool, GpuPoolConfig, GpuSizeClass};

let config = GpuPoolConfig::for_graph_analytics(); // 256B-heavy
let mut pool = GpuStratifiedPool::new(&device, config)?;

// Warm the small buffer bucket
pool.warm_bucket(GpuSizeClass::Size1KB, 100)?;

// Allocate (O(1) for pooled sizes)
let ptr = pool.allocate(512)?; // Uses 1KB bucket
pool.deallocate(ptr, 512)?;
println!("Diagnostics: {:?}", pool.diagnostics());
```

- `StreamManager` — multi-stream CUDA execution for compute/transfer overlap
  - Configurable compute streams (1-8) with priority support
  - Dedicated transfer stream for async DMA
  - Event-based inter-stream synchronization
  - `StreamConfig` with presets: `minimal()`, `performance()`
  - `StreamId` enum: `Compute(usize)`, `Transfer`, `Default`
  - `record_event()` / `stream_wait_event()` for dependencies
  - `event_elapsed_ms()` for timing measurements
- `StreamPool` — load-balanced stream assignment
  - `assign_workload()` for explicit assignment
  - `least_utilized()` for automatic load balancing
  - Utilization tracking with atomic counters
  - `StreamPoolStats` for monitoring
- `OverlapMetrics` — compute/transfer overlap measurement
  - Overlap ratio calculation
  - Transfer/compute time tracking
```rust
use ringkernel_cuda::stream::{StreamManager, StreamConfig, StreamId};

let config = StreamConfig::performance(); // 4 compute + transfer
let mut manager = StreamManager::new(&device, config)?;

// Launch kernel on compute stream
let compute_stream = manager.cuda_stream(StreamId::Compute(0))?;
// ... launch kernel ...

// Record event for synchronization
manager.record_event("kernel_done", StreamId::Compute(0))?;

// Transfer stream waits for kernel
manager.stream_wait_event(StreamId::Transfer, "kernel_done")?;

// Timing
let elapsed = manager.event_elapsed_ms("start", "kernel_done")?;
```

- `Benchmarkable` trait — generic interface for benchmarkable workloads
  - `name()` / `code()` for identification
  - `execute()` for workload execution
  - Supports custom workload sizes
- `BenchmarkSuite` — comprehensive benchmark orchestration
  - `run()` / `run_all_sizes()` for execution
  - Baseline comparison with `set_baseline()` / `compare_to_baseline()`
  - Multiple report formats: Markdown, LaTeX, JSON
- `BenchmarkConfig` — benchmark configuration
  - Warmup/measurement iterations
  - Convergence thresholds
  - Configurable workload sizes
  - Presets: `quick()`, `comprehensive()`, `ci()`
- `BenchmarkResult` — detailed benchmark results
  - Throughput (ops/s), total time, iterations
  - Per-measurement timing data
  - Custom metrics support
  - Convergence tracking
- `RegressionReport` — performance regression detection
  - Per-workload comparison to baseline
  - Status: Regression, Improvement, Unchanged
  - Configurable threshold (default: 5%)
- `Statistics` — statistical analysis utilities
  - `ConfidenceInterval` with configurable confidence level
  - `DetailedStatistics`: mean, std_dev, min, max, percentiles (p5, p25, median, p75, p95, p99)
  - `ScalingMetrics` for analyzing algorithmic scaling (exponent, R²)
```rust
use ringkernel_core::benchmark::{BenchmarkSuite, BenchmarkConfig, Benchmarkable};

struct MyWorkload;

impl Benchmarkable for MyWorkload {
    fn name(&self) -> &str { "MyWorkload" }
    fn code(&self) -> &str { "MW" }
    fn execute(&self, config: &WorkloadConfig) -> BenchmarkResult {
        // ... run workload ...
    }
}

let config = BenchmarkConfig::comprehensive()
    .with_sizes(vec![1000, 10_000, 100_000]);
let mut suite = BenchmarkSuite::new(config);
suite.run_all_sizes(&MyWorkload);

// Generate reports
println!("{}", suite.generate_markdown_report());
println!("{}", suite.generate_latex_table());

// Regression detection
let baseline = suite.create_baseline("v1.0");
suite.set_baseline(baseline);
if let Some(report) = suite.compare_to_baseline() {
    println!("Regressions: {}", report.regression_count);
}
```

- `HybridDispatcher` — intelligent CPU/GPU workload routing
  - Automatic threshold-based routing
  - Adaptive threshold learning from execution times
  - Configurable learning rate
  - Fallback to CPU when GPU unavailable
- `HybridWorkload` trait — workload interface for hybrid execution
  - `execute_cpu()` / `execute_gpu()` implementations
  - `workload_size()` for routing decisions
  - `supports_gpu()` for capability detection
  - `memory_estimate()` for resource planning
- `ProcessingMode` — routing mode configuration
  - `GpuOnly` — always use GPU
  - `CpuOnly` — always use CPU
  - `Hybrid { gpu_threshold }` — size-based routing
  - `Adaptive` — learn optimal threshold
- `HybridConfig` — dispatcher configuration
  - Learning rate, initial threshold, min/max thresholds
  - GPU availability flag
  - Presets: `cpu_only()`, `gpu_only()`, `adaptive()`, `for_small_workloads()`, `for_large_workloads()`
- `HybridStats` — execution statistics
  - CPU/GPU execution counts and times
  - Adaptive threshold history
  - `cpu_gpu_ratio()` for balance analysis
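The adaptive threshold idea reduces to "route by size; nudge the threshold toward whichever side was faster". A toy sketch of one plausible update rule (the `AdaptiveDispatcher` type and its `learn` heuristic are illustrative assumptions, not the crate's implementation):

```rust
/// Route by workload size; learn the crossover point from observed timings.
struct AdaptiveDispatcher {
    threshold: f64,
    learning_rate: f64,
}

impl AdaptiveDispatcher {
    fn use_gpu(&self, workload_size: usize) -> bool {
        workload_size as f64 >= self.threshold
    }

    /// After a run, move the threshold toward a target implied by the outcome:
    /// if the GPU won at this size, the crossover must be below it, so lower it.
    fn learn(&mut self, workload_size: usize, cpu_time: f64, gpu_time: f64) {
        let target = if gpu_time < cpu_time {
            workload_size as f64 * 0.5
        } else {
            workload_size as f64 * 2.0
        };
        self.threshold += self.learning_rate * (target - self.threshold);
    }
}

fn main() {
    let mut d = AdaptiveDispatcher { threshold: 10_000.0, learning_rate: 0.5 };
    assert!(!d.use_gpu(5_000)); // below the initial threshold: CPU

    // GPU turned out faster at size 5_000, so the threshold drops...
    d.learn(5_000, 2.0, 1.0);
    assert!(d.threshold < 10_000.0);

    // ...and with more evidence, size 5_000 routes to the GPU.
    d.learn(5_000, 2.0, 1.0);
    d.learn(5_000, 2.0, 1.0);
    assert!(d.use_gpu(5_000));
    println!("threshold = {:.0}", d.threshold);
}
```

The learning rate trades responsiveness against stability, which is why the real config also clamps the threshold between configured min/max bounds.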
```rust
use ringkernel_core::hybrid::{HybridDispatcher, HybridConfig, HybridWorkload, ProcessingMode};

struct MatrixMultiply { size: usize, /* ... */ }

impl HybridWorkload for MatrixMultiply {
    type Result = Matrix;
    fn workload_size(&self) -> usize { self.size * self.size }
    fn execute_cpu(&self) -> Matrix { /* CPU impl */ }
    fn execute_gpu(&self) -> HybridResult<Matrix> { /* GPU impl */ }
}

let config = HybridConfig::adaptive()
    .with_initial_threshold(10_000)
    .with_learning_rate(0.1);
let dispatcher = HybridDispatcher::new(config);
let workload = MatrixMultiply { size: 1000 };

// Automatic routing based on size and learned threshold
let result = dispatcher.execute(&workload);

// Check stats
let stats = dispatcher.stats().snapshot();
println!("GPU executions: {}, CPU executions: {}", stats.gpu_executions, stats.cpu_executions);
```

- `ResourceGuard` — memory limit enforcement with reservations
  - Configurable maximum memory
  - Safety margin (default: 30%)
  - Reservation system for guaranteed allocations
  - `can_allocate()` for pre-flight checks
  - `reserve()` returns a `ReservationGuard` RAII wrapper
  - `max_safe_elements()` for capacity planning
  - `unguarded()` for unlimited allocation mode
  - `global_guard()` singleton for process-wide limits
- `MemoryEstimator` trait — workload memory estimation
  - `estimate()` returns `MemoryEstimate`
  - `name()` for identification
- `MemoryEstimate` — detailed memory requirements
  - Primary, auxiliary, and peak bytes
  - Confidence level (0.0-1.0)
  - `total_bytes()` / `peak_bytes()` helpers
  - Builder pattern with `with_primary()`, `with_auxiliary()`, etc.
- `LinearEstimator` — simple linear memory estimator
  - Bytes per element + fixed overhead
- System utilities:
  - `get_total_memory()` — system RAM
  - `get_available_memory()` — free RAM
  - `get_memory_utilization()` — current usage percentage
```rust
use ringkernel_core::resource::{ResourceGuard, MemoryEstimate, MemoryEstimator};

let guard = ResourceGuard::with_max_memory(4 * 1024 * 1024 * 1024); // 4 GB

// Check before allocating
if guard.can_allocate(1024 * 1024 * 1024) {
    // Safe to allocate 1 GB
}

// Reserve memory with RAII guard
let reservation = guard.reserve(512 * 1024 * 1024)?;
// ... use reserved memory ...
// Automatically released when reservation drops

// Calculate safe element count
let max_elements = guard.max_safe_elements(64); // 64 bytes per element
println!("Can safely process {} elements", max_elements);
```

- `KernelMode` — execution mode selection
  - `ElementCentric` — one thread per element (default)
  - `SoA` — Structure-of-Arrays for coalesced access
  - `WorkItemCentric` — load-balanced work distribution
  - `Tiled { tile_size }` — tiled execution with configurable tile dimensions
  - `WarpCooperative` — warp-level parallelism
  - `Auto` — automatic selection based on workload
- `AccessPattern` — memory access pattern hints
  - `Coalesced` — sequential access
  - `Stencil { radius }` — stencil patterns with halo
  - `Irregular` — random access
  - `Reduction` — reduction operations
  - `Scatter` / `Gather` — indirect access
- `WorkloadProfile` — workload characteristics
  - Element count, bytes per element
  - Access pattern, compute intensity
  - Builder pattern for configuration
- `GpuArchitecture` — GPU capability profiles
  - L2 cache size, SM count, max threads/SM
  - Shared memory per SM
  - Compute capability
  - Presets: `volta()`, `ampere()`, `ada()`, `hopper()`
- `KernelModeSelector` — intelligent mode selection
  - `select()` chooses the optimal mode for a workload
  - `recommended_block_size()` per mode
  - `recommended_grid_size()` for element count
  - `launch_config()` returns a complete `LaunchConfig`
- `LaunchConfig` — complete kernel launch configuration
  - Grid dimensions, block dimensions
  - Shared memory bytes
  - `simple_1d()` / `simple_2d()` helpers
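The grid sizing behind a 1D helper like `simple_1d()` is ceiling division over the block size. A sketch of one plausible implementation (an assumption for illustration, not the crate's code):

```rust
/// Ceil-division grid sizing: enough blocks to cover every element at least once.
fn simple_1d(element_count: usize, block_size: usize) -> (usize, usize) {
    let grid = (element_count + block_size - 1) / block_size;
    (grid, block_size)
}

fn main() {
    let (grid, block) = simple_1d(1_000_000, 256);
    assert_eq!(grid, 3907); // 1_000_000 / 256 = 3906.25, rounded up
    assert_eq!(block, 256);
    assert!(grid * block >= 1_000_000); // full coverage
    println!("grid = {}, block = {}", grid, block);
}
```

The kernel then guards the tail with an `if idx < element_count` check, since the last block is only partially occupied.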
```rust
use ringkernel_cuda::launch_config::{
    KernelModeSelector, WorkloadProfile, AccessPattern, GpuArchitecture,
};

let arch = GpuArchitecture::ada(); // RTX 40xx
let selector = KernelModeSelector::new(arch);
let profile = WorkloadProfile::new(1_000_000, 64)
    .with_access_pattern(AccessPattern::Stencil { radius: 1 })
    .with_compute_intensity(0.8);

let mode = selector.select(&profile); // Returns Tiled for stencil
let config = selector.launch_config(mode, profile.element_count);
println!("Grid: {:?}, Block: {:?}", config.grid_dim, config.block_dim);
```

- `PartitionedQueue` — multi-partition queue for reduced contention
  - Hash-based message routing by source kernel ID
  - Configurable partition count (rounded to a power of 2)
  - `try_enqueue()` routes to the appropriate partition
  - `try_dequeue_any()` round-robins across partitions
  - `try_dequeue_partition()` for targeted dequeue
  - `partition_for()` returns the partition index for a source
- `PartitionedQueueStats` — partition-level statistics
  - Per-partition message counts
  - `load_imbalance()` metric (max/avg ratio)
  - Total message count across all partitions
```rust
use ringkernel_core::queue::PartitionedQueue;

let queue = PartitionedQueue::new(4, 1024); // 4 partitions, 1024 capacity each

// Enqueue routes based on source kernel ID
queue.try_enqueue(envelope)?; // Uses envelope.header.source_kernel for routing

// Dequeue from any partition (round-robin)
if let Some(msg) = queue.try_dequeue_any() {
    // Process message
}

// Check load balance
let stats = queue.stats();
println!("Load imbalance: {:.2}x", stats.load_imbalance());
```
- Test Coverage - Increased from 900+ to 950+ tests
  - 12 PTX cache tests
  - 15 GPU memory pool tests
  - 18 stream manager tests
  - 28 benchmark framework tests
  - 27 hybrid dispatcher tests
  - 23 resource guard tests
  - 12 kernel mode selection tests
  - 7 partitioned queue tests
- Dependencies - Added `sha2 = "0.10"` for PTX cache hashing
- Fixed `source_id` → `source_kernel` field name in queue tests
- Fixed floating point precision in the `max_safe_elements` test
- Fixed `RingKernelError::InvalidState` struct variant usage in the memory pool
- Removed unused `GpuBuffer` import in the memory pool
## 0.3.2 - 2026-01-20
- CUDA Profiling Module (`ringkernel-cuda/src/profiling/`) - NEW MODULE
  - Feature-gated via the `profiling` feature flag
  - Comprehensive GPU profiling capabilities for performance analysis
- CUDA Event Wrappers (`profiling/events.rs`)
  - `CudaEvent` - RAII wrapper for CUDA events with timing support
  - `CudaEventFlags` - Event configuration (blocking sync, disable timing, interprocess)
  - `GpuTimer` - Start/stop timer using CUDA events with microsecond precision
  - `GpuTimerPool` - Pool of reusable timers with interior mutability for concurrent access
- NVTX Integration (`profiling/nvtx.rs`)
  - `CudaNvtxProfiler` - Real NVTX profiler using cudarc's nvtx module
  - Timeline visualization in Nsight Systems and Nsight Compute
  - `NvtxCategory` - Predefined categories (Kernel, Transfer, Memory, Sync, Queue, User)
  - `NvtxRange` - RAII wrapper for automatic range end on drop
  - `NvtxPayload` - Typed payloads for markers (I32, I64, U32, U64, F32, F64)
  - Implements the `GpuProfiler` trait for integration with ringkernel-core
- Kernel Metrics (`profiling/metrics.rs`)
  - `KernelMetrics` - Execution metadata (grid/block dims, GPU time, occupancy, registers)
  - `TransferMetrics` - Memory transfer stats with bandwidth calculation
  - `TransferDirection` - HostToDevice, DeviceToHost, DeviceToDevice
  - `ProfilingSession` - Collects kernel and transfer events with timestamps
  - `KernelAttributes` - Query kernel attributes via `cuFuncGetAttribute`
- Memory Tracking (`profiling/memory_tracker.rs`)
  - `CudaMemoryTracker` - Track GPU memory allocations with timing
  - `TrackedAllocation` - Allocation metadata (ptr, size, kind, label, timestamp)
  - `CudaMemoryKind` - Device, Pinned, Mapped, Managed memory types
  - Peak usage tracking and allocation statistics
  - Integration with `GpuMemoryDashboard` from ringkernel-core
- Chrome Trace Export (`profiling/chrome_trace.rs`)
  - `GpuTraceEvent` - Chrome trace format event structure
  - `GpuEventArgs` - Rich event metadata (grid/block dims, occupancy, bandwidth)
  - `GpuChromeTraceBuilder` - Build Chrome trace JSON from profiling sessions
  - Support for kernel events, transfer events, NVTX ranges, memory allocations
  - Process/thread naming for multi-GPU and multi-stream visualization
  - Compatible with chrome://tracing, Perfetto UI, and Nsight Systems
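The Chrome trace format itself is just a JSON object with a `traceEvents` array. The sketch below (std-only, illustrative names — not the crate's `GpuChromeTraceBuilder` API) shows the shape of output that loads in chrome://tracing or Perfetto:

```rust
// Minimal sketch of Chrome trace-event JSON emission.
// "X"-phase events are complete durations; timestamps are in microseconds.
struct TraceEvent {
    name: String,
    ts_us: u64,  // start timestamp
    dur_us: u64, // duration
    pid: u32,    // e.g. GPU index
    tid: u32,    // e.g. stream index
}

fn to_chrome_trace(events: &[TraceEvent]) -> String {
    let body: Vec<String> = events
        .iter()
        .map(|e| {
            format!(
                r#"{{"name":"{}","ph":"X","ts":{},"dur":{},"pid":{},"tid":{}}}"#,
                e.name, e.ts_us, e.dur_us, e.pid, e.tid
            )
        })
        .collect();
    format!("{{\"traceEvents\":[{}]}}", body.join(","))
}

fn main() {
    let events = vec![TraceEvent {
        name: "stencil_kernel".into(),
        ts_us: 10,
        dur_us: 250,
        pid: 0,
        tid: 1,
    }];
    // Load the resulting JSON in chrome://tracing or the Perfetto UI.
    println!("{}", to_chrome_trace(&events));
}
```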
- Dependencies
  - Added the `nvtx` feature to the cudarc dependency (ringkernel-cuda/Cargo.toml)
  - Added optional `serde` and `serde_json` for Chrome trace export
- Added `ProfilerRange::stub()` public constructor in ringkernel-core for external profiler implementations
## 0.3.1 - 2026-01-19
- Real Cryptography (`ringkernel-core/src/security.rs`)
  - AES-256-GCM and ChaCha20-Poly1305 encryption algorithms
  - Proper nonce generation with `rand::thread_rng()`
  - Key derivation using Argon2id and HKDF-SHA256
  - Secure memory wiping with the `zeroize` crate
  - Feature-gated via the `crypto` feature flag
- Secrets Management (`ringkernel-core/src/secrets.rs`) - NEW FILE
  - `SecretStore` trait for pluggable secret backends
  - `InMemorySecretStore` for development/testing
  - `EnvVarSecretStore` for environment variable secrets
  - `CachedSecretStore` with TTL-based caching
  - `ChainedSecretStore` for fallback chains
  - `KeyRotationManager` for automatic key rotation
  - `SecretKey` and `SecretValue` types with secure memory handling
- Authentication Framework (`ringkernel-core/src/auth.rs`) - NEW FILE
  - `AuthProvider` trait for pluggable authentication
  - `ApiKeyAuth` for simple API key validation
  - `JwtAuth` for JWT token validation (RS256/HS256) - requires the `auth` feature
  - `ChainedAuthProvider` for fallback authentication chains
  - `AuthContext` with identity and credential management
  - `Credentials` enum: ApiKey, Bearer, Basic, Certificate
- Role-Based Access Control (`ringkernel-core/src/rbac.rs`) - NEW FILE
  - `Role` enum: Admin, Operator, Developer, Viewer, Custom
  - `Permission` enum: Read, Write, Execute, Admin, Custom
  - `RbacPolicy` with subject-role-permission bindings
  - `PolicyEvaluator` with deny-by-default evaluation
  - `ResourceRule` for fine-grained resource access control
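Deny-by-default means a request is only allowed when a grant explicitly matches; anything missing is refused. A minimal CPU sketch of that evaluation rule (illustrative types, not the crate's `PolicyEvaluator` API):

```rust
// Illustrative deny-by-default policy check.
use std::collections::{HashMap, HashSet};

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Permission { Read, Write, Execute }

/// subject -> set of granted permissions; anything absent is denied.
struct Policy {
    grants: HashMap<String, HashSet<Permission>>,
}

impl Policy {
    fn is_allowed(&self, subject: &str, perm: Permission) -> bool {
        // Deny-by-default: unknown subject or missing grant both deny.
        self.grants
            .get(subject)
            .map_or(false, |perms| perms.contains(&perm))
    }
}

fn main() {
    let mut grants = HashMap::new();
    grants.insert("viewer".to_string(), HashSet::from([Permission::Read]));
    let policy = Policy { grants };

    assert!(policy.is_allowed("viewer", Permission::Read));
    assert!(!policy.is_allowed("viewer", Permission::Write)); // not granted
    assert!(!policy.is_allowed("unknown", Permission::Read)); // unknown subject
}
```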
- Multi-Tenancy Support (`ringkernel-core/src/tenancy.rs`) - NEW FILE
  - `TenantContext` for request scoping with tenant ID
  - `TenantRegistry` for managing tenant configurations
  - `ResourceQuota` with limits for memory, kernels, message rate
  - `ResourceUsage` tracking with quota enforcement
  - `QuotaUtilization` for monitoring tenant resource usage
- OpenTelemetry OTLP Export (`ringkernel-core/src/observability.rs`)
  - `OtlpExporter` for sending spans to OTLP endpoints
  - `OtlpConfig` with endpoint, headers, and transport configuration
  - Batch export with configurable interval and queue size
  - HTTP and gRPC transport options via the `OtlpTransport` enum
  - Automatic retry with exponential backoff
  - `OtlpExporterStats` for monitoring export success/failure
- Structured Logging (`ringkernel-core/src/logging.rs`) - NEW FILE
  - `StructuredLogger` with multi-sink support
  - `LogLevel`: Trace, Debug, Info, Warn, Error, Fatal
  - `LogOutput`: Text, Json, Compact, Pretty
  - `TraceContext` for automatic trace_id/span_id injection
  - `LogConfig` with builder pattern and presets (development, production)
  - Built-in sinks: `ConsoleSink`, `MemoryLogSink`, `FileLogSink`
  - JSON structured output for log aggregation
  - Global logger functions: `init()`, `info()`, `error()`, etc.
- Alert Routing System (`ringkernel-core/src/alerting.rs`) - NEW FILE
  - `AlertSink` trait for pluggable alert destinations
  - `AlertRouter` for routing alerts based on severity
  - `WebhookSink` for Slack, Teams, PagerDuty (requires the `alerting` feature)
  - `LogSink` and `InMemorySink` for testing/debugging
  - `DeduplicationConfig` for alert deduplication with time windows
  - `AlertSeverity`: Info, Warning, Error, Critical
  - `AlertRouterStats` for monitoring alert delivery
- Remote Audit Sinks (`ringkernel-core/src/audit.rs`)
  - `SyslogSink` for RFC 5424 syslog with configurable facility/severity
  - `CloudWatchSink` for AWS CloudWatch Logs integration
  - `ElasticsearchSink` for direct Elasticsearch indexing (requires the `alerting` feature)
  - Async batch sending with configurable flush intervals
- Rate Limiting (`ringkernel-core/src/rate_limiting.rs`) - NEW FILE
  - `RateLimiter` with pluggable algorithms
  - `RateLimitAlgorithm`: TokenBucket, SlidingWindow, LeakyBucket
  - `RateLimitConfig` with burst, window size, and rate configuration
  - `RateLimiterBuilder` with fluent configuration API
  - `RateLimitGuard` RAII wrapper for rate-limited operations
  - `SharedRateLimiter` for distributed rate limiting
  - `RateLimiterExt` trait for easy integration
  - `RateLimiterStatsSnapshot` for monitoring
  - Feature-gated via the `rate-limiting` feature flag
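Of the algorithms listed, the token bucket is the simplest to picture: tokens refill at a fixed rate up to a burst capacity, and each operation spends one. A minimal sketch (illustrative, not the crate's `RateLimiter` API; time is passed in explicitly to keep it deterministic):

```rust
// Sketch of the token-bucket algorithm named above.
struct TokenBucket {
    capacity: f64, // burst size
    tokens: f64,   // current tokens
    rate: f64,     // refill rate (tokens/sec)
    last_ns: u128, // timestamp of last refill
}

impl TokenBucket {
    fn new(capacity: f64, rate: f64) -> Self {
        Self { capacity, tokens: capacity, rate, last_ns: 0 }
    }

    /// Try to take one token at time `now_ns`; false means rate-limited.
    fn try_acquire(&mut self, now_ns: u128) -> bool {
        let elapsed_s = (now_ns - self.last_ns) as f64 / 1e9;
        self.last_ns = now_ns;
        // Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = (self.tokens + elapsed_s * self.rate).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(2.0, 1.0); // burst 2, 1 token/sec
    assert!(bucket.try_acquire(0));              // uses burst
    assert!(bucket.try_acquire(0));              // uses burst
    assert!(!bucket.try_acquire(0));             // empty: denied
    assert!(bucket.try_acquire(1_500_000_000));  // 1.5 s later: refilled
}
```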
- TLS Support (`ringkernel-core/src/tls.rs`) - NEW FILE
  - `TlsConfig` with builder pattern for server/client configuration
  - `TlsAcceptor` for server-side TLS with rustls
  - `TlsConnector` for client-side TLS connections
  - `CertificateStore` with automatic rotation and hot reload
  - `SniResolver` for multi-domain certificate selection
  - mTLS (mutual TLS) with client certificate validation
  - `TlsVersion` enum: Tls12, Tls13
  - `TlsSessionInfo` for connection metadata
  - Feature-gated via the `tls` feature flag
- K2K Message Encryption (`ringkernel-core/src/k2k.rs`)
  - `K2KEncryptor` for kernel-to-kernel message encryption
  - `K2KEncryptionConfig` with algorithm and key configuration
  - `K2KEncryptionAlgorithm`: Aes256Gcm, ChaCha20Poly1305
  - `EncryptedK2KMessage` with nonce and authentication tag
  - `EncryptedK2KEndpoint` wrapper for transparent encryption
  - `EncryptedK2KBuilder` for fluent endpoint creation
  - `K2KKeyMaterial` with secure key handling
  - Forward secrecy support with ephemeral keys
  - Feature-gated via the `crypto` feature flag
- Operation Timeouts (`ringkernel-core/src/timeout.rs`) - NEW FILE
  - `Timeout` wrapper for async operations with deadlines
  - `Deadline` for absolute timeout tracking
  - `CancellationToken` for cooperative cancellation
  - `OperationContext` with deadline propagation
  - `timeout()` and `timeout_named()` helper functions
  - `with_timeout()` and `with_timeout_named()` for futures
  - `TimeoutStats` and `TimeoutStatsSnapshot` for monitoring
- Automatic Recovery (`ringkernel-core/src/health.rs`)
  - `RecoveryPolicy` enum: Restart, Migrate, Checkpoint, Notify, Escalate, Circuit
  - `FailureType` enum: Timeout, Crash, DeviceError, ResourceExhausted, QueueOverflow, StateCorruption
  - `RecoveryConfig` with builder pattern and per-failure-type policies
  - `RecoveryManager` for coordinating recovery actions
  - `RecoveryAction` with retry tracking and timestamps
  - `RecoveryResult` with success/failure details
  - `RecoveryStatsSnapshot` for monitoring recovery attempts
  - Automatic escalation after max retries exceeded
  - Configurable cooldown periods between recovery attempts
- Feature Flags - New enterprise feature flags in `ringkernel-core/Cargo.toml`:
  - `crypto` - Real cryptography (AES-GCM, ChaCha20, Argon2)
  - `auth` - JWT authentication support
  - `rate-limiting` - Governor-based rate limiting
  - `alerting` - Webhook alerts via reqwest
  - `tls` - TLS support via rustls
  - `enterprise` - Combined feature enabling all enterprise features
- Test Coverage - Increased from 825+ to 900+ tests
  - 14 crypto tests for K2K encryption
  - 14 logging tests for structured logging
  - 15 recovery tests for automatic recovery
  - 13 TLS tests for certificate management
  - Plus tests for secrets, auth, RBAC, tenancy, rate limiting, alerting
- Fixed `SpanStatus` pattern matching for OTLP export
- Fixed `AttributeValue` JSON serialization in observability
- Fixed `TraceId`/`SpanId` Display formatting with hex output
- Fixed reqwest `blocking` feature for webhook alerts
## 0.3.0 - 2026-01-17
- `#[derive(PersistentMessage)]` macro (ringkernel-derive)
  - Automatic `handler_id` generation for GPU kernel dispatch
  - Inline payload serialization with response tracking
  - Compile-time handler registration
- `KernelDispatcher` (`ringkernel-core/src/dispatcher.rs`) - NEW FILE
  - Type-based message routing via the K2K broker
  - `DispatcherBuilder` with fluent configuration API
  - `DispatcherConfig` for routing behavior customization
  - `DispatcherMetrics` for observability (messages dispatched, errors, latency)
- CUDA Handler Dispatch Code Generator (`ringkernel-cuda-codegen/src/ring_kernel.rs`)
  - `CudaDispatchTable` for handler registration
  - Switch-based dispatch code generation
  - `ExtendedH2KMessage` struct generation for typed payloads
- Queue Tiering System (`ringkernel-core/src/queue.rs`)
  - `QueueTier` enum: Small (256), Medium (1024), Large (4096), ExtraLarge (16384)
  - `QueueFactory` for creating appropriately-sized message queues
  - `QueueMonitor` for queue health checking with configurable thresholds
  - `QueueMetrics` for observability (enqueue/dequeue counts, peak depth)
  - `for_throughput()` method for automatic tier selection based on message rate
- Persistent Message Infrastructure (`ringkernel-core/src/persistent_message.rs`) - NEW FILE
  - `PersistentMessage` trait for GPU-dispatchable messages
  - `DispatchTable` for runtime handler registration
  - `HandlerId` type for type-safe handler identification
- `compile_ptx()` function (`ringkernel-cuda/src/lib.rs`)
  - Wraps `cudarc::nvrtc::compile_ptx` for downstream crates
  - Compiles CUDA source to PTX without a direct cudarc dependency
  - Returns a PTX string or a compilation error
- Size-Stratified Memory Pool (`ringkernel-core/src/memory.rs`)
  - `SizeBucket` enum: Tiny (256B), Small (1KB), Medium (4KB), Large (16KB), Huge (64KB)
  - `StratifiedMemoryPool` - Multi-bucket pool with automatic size selection
  - `StratifiedBuffer` - RAII wrapper that returns buffers to the correct bucket on drop
  - `StratifiedPoolStats` - Per-bucket allocation statistics with hit rate tracking
  - `create_stratified_pool()` and `create_stratified_pool_with_capacity()` helpers
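The core of a stratified pool is routing each request to the smallest bucket that fits. A sketch of that selection using the bucket sizes listed above (illustrative function name; oversized requests fall through to a direct allocation):

```rust
// Sketch of size-bucket selection for a stratified pool.
#[derive(Debug, PartialEq, Clone, Copy)]
enum SizeBucket { Tiny, Small, Medium, Large, Huge }

/// Pick the smallest bucket that fits `bytes`; None means the request is
/// larger than the biggest bucket and bypasses the pool.
fn bucket_for(bytes: usize) -> Option<SizeBucket> {
    match bytes {
        0..=256 => Some(SizeBucket::Tiny),
        257..=1024 => Some(SizeBucket::Small),
        1025..=4096 => Some(SizeBucket::Medium),
        4097..=16384 => Some(SizeBucket::Large),
        16385..=65536 => Some(SizeBucket::Huge),
        _ => None,
    }
}

fn main() {
    assert_eq!(bucket_for(100), Some(SizeBucket::Tiny));
    assert_eq!(bucket_for(4096), Some(SizeBucket::Medium));
    assert_eq!(bucket_for(1 << 20), None); // 1 MiB: bypass the pool
}
```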
- WebGPU Staging Buffer Pool (`ringkernel-wgpu/src/memory.rs`)
  - `StagingBufferPool` - Reusable staging buffer cache for GPU-to-host transfers
  - `StagingBufferGuard` - RAII wrapper for automatic buffer return
  - `StagingPoolStats` - Cache hit/miss tracking for staging buffers
  - `WgpuBuffer` extended with optional staging pool integration
- CUDA Reduction Buffer Cache (`ringkernel-cuda/src/reduction.rs`)
  - `ReductionBufferCache` - Cache keyed by (num_slots, ReductionOp) for buffer reuse
  - `CachedReductionBuffer<T>` - RAII wrapper with `Deref`/`DerefMut` for transparent access
  - `CacheStats` - Hit/miss counters with hit rate calculation
  - `CacheKey` - Hashable key type for cache lookup
- Analytics Context Manager (`ringkernel-core/src/analytics_context.rs`) - NEW FILE
  - `AnalyticsContext` - Grouped buffer lifecycle for analytics operations (DFG, BFS, pattern detection)
  - `AllocationHandle` - Type-safe opaque handle to allocations
  - `ContextStats` - Peak/current bytes, allocation counts, typed allocation tracking
  - `AnalyticsContextBuilder` - Fluent builder with preallocation support
  - `allocate_typed<T>()` for type-safe buffer allocation with automatic sizing
- Memory Pressure Reactions (`ringkernel-core/src/memory.rs`)
  - `PressureReaction` enum: None, Shrink (with target utilization), Callback
  - `PressureHandler` - Monitors pressure levels and triggers configured reactions
  - `PressureAwarePool` trait - Extension for pressure-aware memory pools
  - Severity-based shrink calculation (Normal → Elevated → Warning → Critical → OutOfMemory)
- `ringkernel-core/src/reduction.rs` - Core reduction traits
  - `ReductionOp` enum: Sum, Min, Max, And, Or, Xor, Product
  - `ReductionScalar` trait for type-safe reduction with identity values
  - `ReductionConfig` for configuring reduction behavior
  - `ReductionHandle` trait for streaming operations
  - `GlobalReduction` trait for a backend-agnostic reduction interface
- `ringkernel-cuda/src/reduction.rs` - CUDA reduction implementation
  - `ReductionBuffer<T>` using mapped memory (CPU+GPU visible)
  - Zero-copy host read of reduction results
  - Multi-slot support for reduced contention
  - Block-then-atomic pattern for efficient grid reductions
  - Helper code generation: `generate_block_reduce_code()`, `generate_grid_reduce_code()`, `generate_reduce_and_broadcast_code()`
- `ringkernel-cuda/src/phases.rs` - Multi-phase kernel execution
  - `SyncMode` enum: Cooperative, SoftwareBarrier, MultiLaunch
  - `KernelPhase` struct for phase metadata
  - `InterPhaseReduction<T>` for reduction between phases
  - `MultiPhaseConfig` for phase sequencing
  - `MultiPhaseExecutor` for orchestrating phase execution
  - `PhaseExecutionStats` for performance tracking
- `ringkernel-cuda-codegen/src/reduction_intrinsics.rs` - Codegen for reductions
  - `generate_reduction_helpers()` for cooperative groups support
  - `generate_inline_reduce_and_broadcast()` for inline reduction code
  - `ReductionCodegenConfig` for configuring code generation
- New codegen intrinsics in the `GpuIntrinsic` enum:
  - Block-level: `BlockReduceSum`, `BlockReduceMin`, `BlockReduceMax`, `BlockReduceAnd`, `BlockReduceOr`
  - Grid-level: `GridReduceSum`, `GridReduceMin`, `GridReduceMax`
  - Combined: `ReduceAndBroadcast`
- Ring kernel reduction support via `KernelReductionConfig`:
  - `with_reduction()` builder method on `RingKernelConfig`
  - `with_sum_reduction()` convenience method
  - Automatic reduction boilerplate generation
- `pagerank_reduction` example demonstrating PageRank with dangling node handling
  - Triangle graph (no dangling), star graph (75% dangling), chain with sink examples
  - Generated CUDA kernel code visualization
- `alloc_mapped<T>()` method for mapped memory allocation
- `supports_cooperative_groups()` method for capability detection
- `ringkernel-metal` - Apple Metal backend implementation (scaffold)
  - `MetalRuntime` with compute command queue management
  - `MetalBuffer` for GPU buffer allocation and mapping
  - `MetalPipeline` for compute pipeline state
  - Fence-based synchronization (Metal lacks cooperative groups)
  - MSL kernel compilation via metal-rs 0.31
  - Note: true persistent kernels not yet implemented (requires host-driven dispatch)
- Correlation Tracking - Request/response message matching via `CorrelationId`
  - `receive_with_correlation()` with timeout support
  - `HashMap<CorrelationId, oneshot::Sender>` for pending correlations
- Kernel Slot Management - `SlotAllocator` for K2K route management
  - BitSet-based slot allocation with `allocate()`/`release()`
  - Prevents slot collisions in multi-kernel topologies
- Cooperative Kernel Fallback - Software synchronization when the grid exceeds limits
  - Automatic fallback to barrier-based sync using atomics
  - `cuLaunchCooperativeKernel` integration via cudarc 0.18.2
- CUDA Backend - Full messaging and HLC node implementation
  - `K2HEnqueue`, `H2KDequeue`, `H2KIsEmpty` - Host↔Kernel queues
  - `K2KSend`, `K2KRecv`, `K2KTryRecv` - Kernel-to-kernel messaging
  - `HlcNow`, `HlcTick`, `HlcUpdate` - Hybrid logical clock operations
- MSL Backend - Metal Shading Language equivalents
  - Same 9 node types with Metal-specific implementations
- Energy Calculation - Parallel reduction for total field energy
  - `block_reduce_energy()` device function with shared memory
  - E = Σ(p²) computed at progress intervals
  - `atomicAdd` for cross-block accumulation
- Message Checksum - CRC32 integrity verification
  - Checksum computation in ring kernel response messages
  - Optional bypass for performance-critical paths
- Higher-Dimensional Shared Memory - 2D, 3D, and 4D+ support
  - `SharedTile::new_3d()` for 3D nested arrays
  - 3D generates: `array<array<array<T, X>, Y>, Z>`
  - 4D+ uses linearized indexing with formula generation
  - `SharedVolume<T, X, Y, Z>` marker type for type safety
- Parallel Union-Find - Shiloach-Vishkin algorithm implementation
  - GPU-accelerated connected components
  - Parallel pointer jumping for path compression
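The pointer-jumping step is easiest to see against the sequential baseline. The CPU sketch below (illustrative names; the crate's Shiloach-Vishkin variant performs the jumping in parallel on the GPU) uses the same two ingredients, path compression and union by rank:

```rust
// Sequential union-find baseline with path compression and union by rank.
struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<u8>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect(), rank: vec![0; n] }
    }

    /// Find the component root, compressing the path as we go.
    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            self.parent[x] = self.parent[self.parent[x]]; // pointer jumping
            x = self.parent[x];
        }
        x
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb { return; }
        // Attach the shallower tree under the deeper one.
        match self.rank[ra].cmp(&self.rank[rb]) {
            std::cmp::Ordering::Less => self.parent[ra] = rb,
            std::cmp::Ordering::Greater => self.parent[rb] = ra,
            std::cmp::Ordering::Equal => {
                self.parent[rb] = ra;
                self.rank[ra] += 1;
            }
        }
    }
}

fn main() {
    let mut uf = UnionFind::new(5);
    uf.union(0, 1);
    uf.union(1, 2);
    assert_eq!(uf.find(0), uf.find(2)); // 0-1-2 connected
    assert_ne!(uf.find(0), uf.find(4)); // 4 isolated
}
```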
- Proper Resampling - Linear interpolation + windowed sinc
  - `LinearResampler` for low-overhead conversion
  - `SincResampler` for high-quality audio
  - Sample rate conversion 44.1 kHz ↔ 48 kHz
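For the linear path, resampling reduces to interpolating between the two nearest source samples at each fractional output position. A minimal sketch under assumed names (not the crate's `LinearResampler` API; the windowed-sinc path is omitted):

```rust
// Linear-interpolation resampler sketch.
fn resample_linear(input: &[f32], src_rate: f32, dst_rate: f32) -> Vec<f32> {
    let ratio = src_rate / dst_rate; // source samples per output sample
    let out_len = ((input.len() as f32) / ratio).floor() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f32 * ratio; // fractional source index
            let i0 = pos.floor() as usize;
            let i1 = (i0 + 1).min(input.len() - 1); // clamp at the end
            let frac = pos - i0 as f32;
            input[i0] * (1.0 - frac) + input[i1] * frac
        })
        .collect()
}

fn main() {
    // Upsample a ramp from 4 samples to 8 (a 2x rate increase).
    let out = resample_linear(&[0.0, 1.0, 2.0, 3.0], 44_100.0, 88_200.0);
    assert_eq!(out.len(), 8);
    assert!((out[2] - 1.0).abs() < 1e-6); // output index 2 lands exactly on input[1]
}
```

Linear interpolation is cheap but attenuates high frequencies; that is why the changelog pairs it with a windowed-sinc resampler for quality-sensitive audio.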
- GPU Boundary Reflection - CUDA kernel for boundary conditions
  - Support for absorbing, reflecting, and periodic boundaries
  - Integrated with the tile-based GPU actor system
- True Cooperative Launch - `step_cooperative()` with `grid.sync()`
  - Uses `CooperativeLaunchConfig` and `PersistentParams`
  - Grid-wide synchronization without fallback
- Industry Chart of Accounts Templates - Realistic account structures
  - `manufacturing_standard()` - Raw Materials, WIP, Finished Goods, Direct Labor/Materials/Overhead
  - `professional_services_standard()` - Unbilled Receivables, WIP-Billable, Client Retainers
  - `financial_services_standard()` - Trading Securities, Loans Receivable, Customer Deposits, Custody Assets
- `ringkernel-montecarlo` - GPU-accelerated Monte Carlo primitives for variance reduction
  - Philox RNG - Counter-based PRNG with a `GpuRng` trait (stateless, GPU-friendly)
  - Antithetic Variates - Variance reduction using negatively correlated samples
  - Control Variates - Variance reduction using correlated variables with known expectations
  - Importance Sampling - Self-normalized estimator with exponential tilting for rare events
  - 16 tests covering all algorithms
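Antithetic variates pair each uniform draw u with its mirror 1 - u, so the two estimates are negatively correlated and their average has lower variance. A CPU sketch of the idea (illustrative; the crate runs this on GPU with the Philox RNG):

```rust
// Antithetic-variates estimator sketch: average f over each pair (u, 1-u).
fn antithetic_estimate(f: impl Fn(f64) -> f64, uniforms: &[f64]) -> f64 {
    let n = uniforms.len() as f64;
    uniforms
        .iter()
        .map(|&u| 0.5 * (f(u) + f(1.0 - u))) // average each antithetic pair
        .sum::<f64>()
        / n
}

fn main() {
    // Estimating E[u] over U(0,1): antithetic pairing gives exactly 0.5
    // for any sample set, since u and 1-u always average to 0.5.
    let est = antithetic_estimate(|u| u, &[0.1, 0.7, 0.3]);
    assert!((est - 0.5).abs() < 1e-12);
}
```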
- `ringkernel-graph` - GPU-accelerated graph algorithm primitives
  - CSR Matrix - Compressed Sparse Row format with builder pattern
  - BFS - Sequential and parallel breadth-first search with multi-source support
  - SCC - Strongly connected components via Tarjan and Kosaraju algorithms
  - Union-Find - Parallel disjoint set with path compression and union by rank
  - SpMV - Sparse matrix-vector multiplication with power iteration
  - Node types: `NodeId`, `Distance`, `ComponentId` with Pod traits
  - 51 tests covering all algorithms
- `Domain` enum - 20 business domain classifications with type ID ranges
  - GraphAnalytics (100-199), StatisticalML (200-299), Compliance (300-399)
  - RiskManagement (400-499), OrderMatching (500-599), MarketData (600-699)
  - Settlement (700-799), Accounting (800-899), NetworkAnalysis (900-999)
  - FraudDetection (1000-1099), TimeSeries (1100-1199), Simulation (1200-1299)
  - Banking (1300-1399), BehavioralAnalytics (1400-1499), ProcessIntelligence (1500-1599)
  - Clearing (1600-1699), TreasuryManagement (1700-1799), PaymentProcessing (1800-1899)
  - FinancialAudit (1900-1999), Custom (10000+)
- `DomainMessage` trait - Domain-aware messages with automatic type ID calculation
  - `#[derive(RingMessage)]` extended with a `domain` attribute
- Metrics Types - `MetricType`, `MetricsEntry`, `ContextMetricsBuffer`
- Alert Types - `AlertSeverity`, `KernelAlertType`, `AlertRouting`, `KernelAlert`
- RingContext methods:
  - `domain()`, `set_domain()` - Domain association
  - `record_latency()`, `record_throughput()`, `record_counter()`, `record_gauge()` - Metrics collection
  - `flush_metrics()` - Retrieve and clear the metrics buffer
  - `emit_alert()`, `alert_if_slow()` - Alert emission
- `K2KMessageRegistration` - Compile-time message type registration
  - `K2KTypeRegistry` - Runtime registry with `discover()`, `is_routable()`, `get_category()`
  - `#[derive(RingMessage)]` extended with `k2k_routable` and `category` attributes
  - Integration with the `inventory` crate for automatic registration
- `EmbeddedState` trait - For 24-byte states that fit in `ControlBlock._reserved`
  - `StateDescriptor` - 24-byte header for external state references
  - `ControlBlockStateHelper` - Read/write embedded state from the ControlBlock
  - `GpuState` trait - For larger states with serialization support
  - `#[derive(ControlBlockState)]` - Derive macro for embedded state types
## 0.2.0 - 2026-01-08
- `ringkernel-ir` - Unified Intermediate Representation for multi-backend code generation
  - SSA-based IR capturing GPU-specific operations
  - Architecture: Rust DSL → IR → CUDA/WGSL/MSL backends
  - `IrBuilder` fluent API for constructing kernel IR
  - Optimization passes: constant folding, dead code elimination, algebraic simplification
  - `BackendCapabilities` trait for querying backend support
  - `Validator` with configurable validation levels
  - Pretty-printing and IR visualization
- `ringkernel-cli` - Command-line tool for project scaffolding and kernel code generation
  - `ringkernel new <name>` - Create new projects with templates (basic, persistent-actor, wavesim, enterprise)
  - `ringkernel init` - Initialize RingKernel in existing projects
  - `ringkernel codegen <file>` - Generate CUDA/WGSL/MSL from the Rust DSL
  - `ringkernel check` - Validate kernel compatibility across backends
  - `ringkernel completions` - Generate shell completions (bash, zsh, fish, PowerShell)
  - Colored terminal output with progress indicators
- `RuntimeBuilder` - Fluent builder for enterprise runtime configuration
  - Presets: `development()`, `production()`, `high_performance()`
  - Automatic component initialization based on configuration
- `RingKernelContext` - Unified runtime managing all enterprise features
  - Centralized access to health, metrics, multi-GPU, and migration components
  - Lifecycle management with state machine
- `ConfigBuilder` - Nested configuration system with builder pattern
  - Environment variable overrides
  - TOML/YAML configuration file support
- `LifecycleState` - Runtime state machine
  - States: `Initializing` → `Running` → `Draining` → `ShuttingDown` → `Stopped`
  - Graceful shutdown with drain timeout
- Health & Resilience
  - `HealthChecker` - Liveness/readiness probes with async health checks
  - `CircuitBreaker` - Fault tolerance with automatic recovery (Closed/Open/HalfOpen states)
  - `DegradationManager` - Graceful degradation with 5 levels (Normal → Critical)
  - `KernelWatchdog` - Stale kernel detection with configurable heartbeat monitoring
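The Closed/Open/HalfOpen cycle mentioned for the circuit breaker can be sketched as a small state machine (illustrative, not the crate's `CircuitBreaker` API): consecutive failures trip it open, a cooldown moves it to half-open for a probe, and a success closes it again.

```rust
// Circuit-breaker state machine sketch.
#[derive(Debug, PartialEq, Clone, Copy)]
enum State { Closed, Open, HalfOpen }

struct CircuitBreaker {
    state: State,
    failures: u32,
    threshold: u32, // consecutive failures before opening
}

impl CircuitBreaker {
    fn new(threshold: u32) -> Self {
        Self { state: State::Closed, failures: 0, threshold }
    }

    fn on_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.threshold {
            self.state = State::Open; // stop sending traffic
        }
    }

    fn on_success(&mut self) {
        self.failures = 0;
        self.state = State::Closed;
    }

    /// After a cooldown elapses, allow one probe request through.
    fn half_open(&mut self) {
        if self.state == State::Open {
            self.state = State::HalfOpen;
        }
    }

    fn allows_request(&self) -> bool {
        self.state != State::Open
    }
}

fn main() {
    let mut cb = CircuitBreaker::new(3);
    for _ in 0..3 { cb.on_failure(); }
    assert!(!cb.allows_request()); // open after 3 failures
    cb.half_open();
    assert!(cb.allows_request()); // probe allowed
    cb.on_success();
    assert_eq!(cb.state, State::Closed);
}
```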
- Observability
  - `PrometheusExporter` - Export metrics in Prometheus format
  - `ObservabilityContext` - Distributed tracing with span management
  - GPU memory dashboard with pressure alerts
- Multi-GPU
  - `MultiGpuCoordinator` - Device selection with load balancing strategies (RoundRobin, LeastLoaded, Random)
  - `KernelMigrator` - Live kernel migration between GPUs using checkpoints
  - `GpuTopology` - NVLink/PCIe topology discovery
- `ShutdownReport` - Final statistics on graceful shutdown
- `MemoryEncryption` - GPU memory encryption
  - Algorithms: AES-256-GCM, AES-128-GCM, ChaCha20-Poly1305, XChaCha20-Poly1305
  - Key derivation: HKDF-SHA256, HKDF-SHA384, Argon2id, PBKDF2-SHA256
  - Automatic key rotation with configurable interval
  - Encrypts control blocks, message queues, and kernel state
- `KernelSandbox` - Kernel isolation and resource control
  - `ResourceLimits` - Memory, execution time, message rate, K2K connections
  - `SandboxPolicy` - K2K ACLs (allow/deny lists), memory access levels
  - Presets: `restrictive()` for untrusted kernels, `permissive()` for trusted
  - Violation detection and recording
- `ComplianceReporter` - Audit-ready compliance documentation
  - Standards: SOC2, GDPR, HIPAA, PCI-DSS, ISO 27001, FedRAMP, NIST CSF
  - Export formats: JSON, HTML, Markdown, PDF, CSV
  - Automatic compliance check generation with evidence and recommendations
- `PyTorchBridge` - Bidirectional tensor interop with PyTorch
  - Data types: Float16/32/64, BFloat16, Int8/32/64, UInt8, Bool
  - Device management (CPU, CUDA)
  - Pinned memory support
- `OnnxExecutor` - Load and execute ONNX models on GPU ring kernels
  - Model loading from file or memory
  - Input/output tensor management
  - Execution provider configuration
- `HuggingFacePipeline` - Integration with Hugging Face Transformers
  - Text classification, generation, and embedding pipelines
  - Model caching and configuration
- Hot Reload - Kernel hot reload with state preservation
  - File system watcher for kernel source changes
  - State checkpointing during reload
- GPU Memory Dashboard - Real-time memory monitoring
  - Pressure alerts with configurable thresholds
  - Per-kernel memory breakdown
- Mock GPU Testing (`ringkernel-cpu/src/mock.rs`)
  - `MockGpuDevice` for testing GPU code without hardware
  - Deterministic execution for reproducible tests
  - Memory allocation tracking
- Fuzzing Infrastructure (5 fuzz targets)
  - Message serialization fuzzing
  - Queue operations fuzzing
  - HLC timestamp fuzzing
  - IR validation fuzzing
  - Codegen fuzzing
- CI GPU Testing Workflow
  - GitHub Actions with GPU runner support
  - Automated CUDA and WebGPU test execution
- Interactive Tutorials (4 tutorials)
  - `01-hello-kernel` - Basic kernel lifecycle
  - `02-message-passing` - Request/response patterns
  - `03-k2k-messaging` - Kernel-to-kernel communication
  - `04-persistent-actors` - Persistent GPU actors
- VSCode Extension Scaffolding
  - Syntax highlighting for the RingKernel DSL
  - Code completion support
- SIMD Optimizations (`ringkernel-cpu/src/simd.rs`)
  - Vectorized stencil operations
  - SIMD-accelerated reductions
- Subgroup Operations (WGSL backend)
  - `subgroupAdd`, `subgroupMul`, `subgroupMin`, `subgroupMax`
  - Broadcast and shuffle operations
- Metal K2K Halo Exchange - Kernel-to-kernel communication on the Metal backend
- Optimization Passes (ringkernel-ir)
  - `ConstantFolding` - Compile-time constant evaluation
  - `DeadCodeElimination` - Remove unused values
  - `DeadBlockElimination` - Remove unreachable blocks
  - `AlgebraicSimplification` - Simplify arithmetic expressions
- API Changes
  - Renamed `RuntimeMetrics` → `ContextMetrics`
- Test Coverage
  - Increased from 580+ to 700+ tests across the workspace
- Fixed various clippy warnings across all crates
- Fixed HLC test to use `tick()` instead of the read-only `now()`
- Fixed tutorial code formatting for educational clarity
## 0.1.3 - 2025-12-14
- Grid-wide GPU synchronization via CUDA cooperative groups (`grid.sync()`)
  - `cuLaunchCooperativeKernel` driver API interop - direct FFI calls to the CUDA driver for true cooperative launch
  - Build-time PTX compilation - `build.rs` with nvcc detection and automatic kernel compilation
  - `cooperative` feature flag for `ringkernel-cuda` and `ringkernel-wavesim3d`
  - `cooperative` field in `LaunchOptions` for cooperative launch mode
- 8×8×8 block-based actor model - Hybrid approach combining stencil and actor patterns
  - Intra-block: fast stencil computation with shared memory
  - Inter-block: double-buffered message passing (no atomics)
  - `BlockActorGpuBackend` with `step_fused()` for single-kernel-launch execution
  - Performance: 8,165 Mcells/s (59.6× faster than per-cell actors)
  - Grid size validation with `max_cooperative_blocks` (144 on RTX 4090)
- `ComputationMethod::BlockActor` - Third GPU computation method for wavesim3d
  - Combines actor model benefits with stencil performance
  - 10-50× faster than the per-cell Actor method
- Added `CooperativeKernel` wrapper in the `ringkernel-cuda::cooperative` module
- Added cooperative kernel infrastructure to the wavesim3d benchmark
- tokio: 1.35 → 1.48 (improved task scheduling, better cancellation handling)
- thiserror: 1.0 → 2.0 (updated derive macros)
- wgpu: 0.19 → 27.0 (Arc-based resource tracking, 40%+ performance improvement)
  - Migrated to the new Instance/Adapter/Device creation API
  - Updated pipeline descriptors with `entry_point: Option<&str>`, `compilation_options`, `cache`
  - Renamed `ImageCopyTexture` → `TexelCopyTextureInfo`, `ImageDataLayout` → `TexelCopyBufferLayout`
  - Updated `device.poll()` to use `PollType::wait_indefinitely()`
- winit: 0.29 → 0.30 (new window creation API)
- egui/egui-wgpu/egui-winit: 0.27 → 0.31 (updated for wgpu 27 compatibility)
- glam: 0.27 → 0.29 (linear algebra updates)
- metal: 0.27 → 0.31 (Apple GPU backend updates)
- axum: 0.7 → 0.8 (improved routing, better error handling)
- tower: 0.4 → 0.5 (service abstraction updates)
- tonic: 0.11 → 0.14 (better gRPC streaming, improved health checking)
- prost: 0.12 → 0.14 (protobuf updates to match tonic)
- actix-rt: 2.9 → 2.10
- rayon: 1.10 → 1.11 (requires MSRV 1.80)
- arrow: 52 → 54 (columnar data updates)
- polars: 0.39 → 0.46 (DataFrame updates)
- iced: kept at 0.13 (0.14 requires a major application API rewrite)
- rkyv: kept at 0.7 (0.8 has an incompatible data format and requires significant migration)
## 0.1.2 - 2025-12-11
- WaveSim3D (`ringkernel-wavesim3d`) - 3D acoustic wave simulation with realistic physics
  - Full 3D FDTD (Finite-Difference Time-Domain) wave propagation solver
  - Binaural audio rendering with HRTF (Head-Related Transfer Function) support
  - Volumetric ray marching visualization for real-time 3D pressure field rendering
  - GPU-native actor system for distributed 3D wave simulation
  - Support for multiple sound sources with frequency-dependent propagation
  - Material absorption modeling with frequency-dependent coefficients
  - Interactive 3D camera controls and visualization modes
- Expanded GPU intrinsics from ~45 to 120+ operations across 13 categories
  - Atomic Operations (11 ops): `atomic_add`, `atomic_sub`, `atomic_min`, `atomic_max`, `atomic_exchange`, `atomic_cas`, `atomic_and`, `atomic_or`, `atomic_xor`, `atomic_inc`, `atomic_dec`
  - Synchronization (7 ops): `sync_threads`, `sync_threads_count`, `sync_threads_and`, `sync_threads_or`, `thread_fence`, `thread_fence_block`, `thread_fence_system`
  - Trigonometric (11 ops): `sin`, `cos`, `tan`, `asin`, `acos`, `atan`, `atan2`, `sincos`, `sinpi`, `cospi`
  - Hyperbolic (6 ops): `sinh`, `cosh`, `tanh`, `asinh`, `acosh`, `atanh`
  - Exponential/Logarithmic (18 ops): `exp`, `exp2`, `exp10`, `expm1`, `log`, `ln`, `log2`, `log10`, `log1p`, `pow`, `ldexp`, `scalbn`, `ilogb`, `erf`, `erfc`, `erfinv`, `erfcinv`, `lgamma`, `tgamma`
  - Classification (8 ops): `is_nan`, `is_infinite`, `is_finite`, `is_normal`, `signbit`, `nextafter`, `fdim`
  - Warp Operations (16 ops): `warp_active_mask`, `warp_shfl`, `warp_shfl_up`, `warp_shfl_down`, `warp_shfl_xor`, `warp_ballot`, `warp_all`, `warp_any`, `warp_match_any`, `warp_match_all`, `warp_reduce_add/min/max/and/or/xor`
  - Bit Manipulation (8 ops): `popc`, `clz`, `ctz`, `ffs`, `brev`, `byte_perm`, `funnel_shift_left`, `funnel_shift_right`
  - Memory Operations (3 ops): `ldg`, `prefetch_l1`, `prefetch_l2`
  - Special Functions (13 ops): `rcp`, `fast_div`, `saturate`, `j0`, `j1`, `jn`, `y0`, `y1`, `yn`, `normcdf`, `normcdfinv`, `cyl_bessel_i0`, `cyl_bessel_i1`
  - Timing (3 ops): `clock`, `clock64`, `nanosleep`
  - 3D Stencil Intrinsics: `pos.up(buf)`, `pos.down(buf)`, `pos.at(buf, dx, dy, dz)` for volumetric kernels
- Added `required-features` to CUDA-only wavesim binaries to fix builds without CUDA
- Updated the GitHub Actions release workflow with proper feature flags and Ubuntu version
- Updated ringkernel-cuda-codegen tests from 143 to 171 tests
- Fixed release workflow feature flags for showcase applications
- Fixed Ubuntu version compatibility in the CI/CD pipeline
## 0.1.1 - 2025-12-04
- AccNet (`ringkernel-accnet`) - GPU-accelerated accounting network analytics
  - Network visualization with force-directed graph layout
  - Fraud detection: circular flows, threshold clustering, Benford's Law violations
  - GAAP compliance checking for accounting rule violations
  - Temporal analysis for seasonality, trends, and behavioral anomalies
  - GPU kernels: suspense detection, GAAP violation, Benford analysis, PageRank
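Benford's Law predicts that in many naturally occurring ledgers the leading digit d appears with probability log10(1 + 1/d), so digit 1 leads about 30% of the time. A CPU sketch of the kind of first-digit deviation check AccNet applies (illustrative names; the crate runs this as a GPU kernel):

```rust
// Benford's-Law first-digit deviation sketch.
fn first_digit(mut v: f64) -> usize {
    v = v.abs();
    while v >= 10.0 { v /= 10.0; }
    while v < 1.0 && v > 0.0 { v *= 10.0; }
    v as usize
}

/// Total absolute deviation of observed first-digit frequencies from
/// Benford's expected P(d) = log10(1 + 1/d); large values are suspicious.
fn benford_deviation(amounts: &[f64]) -> f64 {
    let mut counts = [0usize; 10];
    for &a in amounts {
        counts[first_digit(a)] += 1;
    }
    let n = amounts.len() as f64;
    (1..=9)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10();
            let observed = counts[d] as f64 / n;
            (observed - expected).abs()
        })
        .sum()
}

fn main() {
    // Amounts that all start with digit 9 deviate strongly from Benford.
    let suspicious = [9_100.0, 95.0, 9.7, 990.0];
    assert!(benford_deviation(&suspicious) > 0.5);
}
```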
- ProcInt (`ringkernel-procint`) - GPU-accelerated process intelligence
  - DFG (Directly-Follows Graph) mining from event streams
  - Pattern detection: bottlenecks, loops, rework, long-running activities
  - Conformance checking with fitness and precision metrics
  - Timeline view with partial-order traces and concurrent activity visualization
  - Multi-sector templates: Healthcare, Manufacturing, Finance, IT
  - GPU kernels: DFG construction, pattern detection, partial-order derivation, conformance checking
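DFG mining reduces to counting adjacent activity pairs within each trace. A minimal single-threaded sketch of the idea (the GPU kernel performs the same aggregation in parallel over event streams):

```rust
use std::collections::HashMap;

// Sketch of Directly-Follows Graph mining: for every trace, count each
// adjacent (a, b) activity pair. Edge weights become the DFG's arc labels.
fn mine_dfg(traces: &[Vec<&str>]) -> HashMap<(String, String), usize> {
    let mut dfg: HashMap<(String, String), usize> = HashMap::new();
    for trace in traces {
        for pair in trace.windows(2) {
            *dfg.entry((pair[0].to_string(), pair[1].to_string()))
                .or_insert(0) += 1;
        }
    }
    dfg
}
```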
- Updated showcase documentation with AccNet and ProcInt sections
- Updated CI workflow to exclude CUDA tests on runners without GPU hardware
- Fixed 14 clippy warnings in ringkernel-accnet (needless_range_loop, manual_range_contains, clamp patterns, etc.)
- Fixed benchmark API compatibility in ringkernel-accnet
- Fixed code formatting issues across showcase applications
## [0.1.0] - 2025-12-03
- GPU-native persistent actor model with `RingKernelRuntime` trait
- Lock-free `MessageQueue` (SPSC ring buffer) for host-GPU message passing
- `ControlBlock` - 128-byte GPU-resident structure for kernel lifecycle management
- `RingContext` - GPU intrinsics facade for kernel handlers
- Hybrid Logical Clocks (`HlcTimestamp`, `HlcClock`) for causal ordering across distributed kernels
- `KernelHandle` for managing kernel lifecycle (launch, activate, terminate)
- `RingMessage` trait with zero-copy serialization via rkyv
- Kernel-to-Kernel (K2K) direct messaging with `K2KBroker` and `K2KEndpoint`
- Topic-based publish/subscribe with wildcard support via `PubSubBroker`
- Message correlation tracking and priority support
- `#[derive(RingMessage)]` - automatic message serialization with field annotations
- `#[ring_kernel]` - kernel handler definition with configuration
- `#[derive(GpuType)]` - GPU-compatible type generation
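The causal ordering that the HLC types provide comes from two update rules: advance on local events/sends, and merge on receives. A simplified sketch of those rules, using a stand-in struct rather than the crate's `HlcTimestamp`/`HlcClock` API:

```rust
// Minimal Hybrid Logical Clock sketch. `Hlc` is a simplified stand-in for the
// crate's types; the derived Ord gives lexicographic (physical, logical) order.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    physical: u64, // max wall-clock time observed so far
    logical: u32,  // tie-breaker when physical time stalls
}

impl Hlc {
    fn new() -> Self {
        Hlc { physical: 0, logical: 0 }
    }

    /// Advance on a local event or before sending, given the wall clock `now`.
    fn tick(&mut self, now: u64) -> Hlc {
        if now > self.physical {
            self.physical = now;
            self.logical = 0;
        } else {
            self.logical += 1;
        }
        *self
    }

    /// Merge a remote timestamp on receive, preserving causality: the result
    /// is strictly greater than both our previous clock and `remote`.
    fn observe(&mut self, remote: Hlc, now: u64) -> Hlc {
        let max_phys = now.max(self.physical).max(remote.physical);
        self.logical = if max_phys == self.physical && max_phys == remote.physical {
            self.logical.max(remote.logical) + 1
        } else if max_phys == self.physical {
            self.logical + 1
        } else if max_phys == remote.physical {
            remote.logical + 1
        } else {
            0
        };
        self.physical = max_phys;
        *self
    }
}
```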
- CPU Backend (`ringkernel-cpu`) - always available for testing and fallback
- CUDA Backend (`ringkernel-cuda`) - NVIDIA GPU support via cudarc
- WebGPU Backend (`ringkernel-wgpu`) - cross-platform GPU support (Vulkan, Metal, DX12)
- Metal Backend (`ringkernel-metal`) - Apple GPU support (scaffolded)
- Auto-detection with `Backend::Auto` (tries CUDA → Metal → WebGPU → CPU)
- CUDA Codegen (`ringkernel-cuda-codegen`) - Rust DSL to CUDA C transpiler
  - Global kernels with block/grid indices
  - Stencil kernels with `GridPos` abstraction and tiled shared memory
  - Ring kernels for the persistent actor model with HLC and K2K support
  - 45+ GPU intrinsics (atomics, warp ops, sync, math)
- WGSL Codegen (`ringkernel-wgpu-codegen`) - Rust DSL to WGSL transpiler
  - Full parity with CUDA codegen for portable shaders
  - 64-bit emulation via lo/hi u32 pairs
  - Subgroup operations support
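The lo/hi u32 emulation works by splitting each 64-bit value into two 32-bit halves and propagating carries manually, since WGSL has no native 64-bit integers. An illustrative sketch of the addition case (not the generated shader code itself):

```rust
// Illustration of 64-bit addition emulated with lo/hi u32 pairs, the scheme
// used where the shading language lacks u64. Sketch only, not generated code.
#[derive(Clone, Copy, PartialEq, Debug)]
struct U64Emu {
    lo: u32, // low 32 bits
    hi: u32, // high 32 bits
}

impl U64Emu {
    fn from_u64(v: u64) -> Self {
        U64Emu { lo: v as u32, hi: (v >> 32) as u32 }
    }
    fn to_u64(self) -> u64 {
        (self.hi as u64) << 32 | self.lo as u64
    }
    /// Wrapping 64-bit add: add the low halves, then fold the carry into the
    /// high halves — exactly what a shader does with two u32 registers.
    fn add(self, rhs: U64Emu) -> U64Emu {
        let (lo, carry) = self.lo.overflowing_add(rhs.lo);
        let hi = self.hi.wrapping_add(rhs.hi).wrapping_add(carry as u32);
        U64Emu { lo, hi }
    }
}
```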
- Actor framework integrations (Actix, Tower)
- Web framework integrations (Axum)
- Data processing (Arrow, Polars)
- gRPC support (Tonic)
- Machine learning (Candle)
- Configuration management
- Metrics and observability (Prometheus, tracing)
- WaveSim (`ringkernel-wavesim`) - interactive 2D acoustic wave simulation
  - FDTD solver with GPU acceleration
  - Educational modes demonstrating the evolution of parallel computing
  - Multiple backends (CPU, CUDA, WebGPU)
- TxMon (`ringkernel-txmon`) - real-time transaction monitoring
  - GPU-accelerated fraud detection patterns
  - Structuring detection, velocity checks, PEP monitoring
  - Interactive GUI with real-time visualization
- Audio FFT (`ringkernel-audio-fft`) - GPU-accelerated audio processing
  - Direct/ambience source separation
  - Real-time FFT processing with actor model
- CUDA codegen achieves ~93B elem/sec on RTX Ada (12,378x vs CPU)
- Lock-free message queue with sub-microsecond latency
- Zero-copy serialization for GPU transfer
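The sub-microsecond latency of the message queue comes from the SPSC design: one atomic head index owned by the consumer, one atomic tail owned by the producer, and power-of-two masking instead of modulo. A hedged single-threaded sketch of the index arithmetic (a real lock-free implementation additionally needs interior mutability for concurrent access):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of the SPSC ring-buffer index discipline behind the lock-free queue.
// Indices grow monotonically; the power-of-two mask maps them to slots.
// Illustrative only: true cross-thread use requires UnsafeCell-backed storage.
struct Spsc<T> {
    buf: Vec<Option<T>>,
    head: AtomicUsize, // next slot to pop (consumer side)
    tail: AtomicUsize, // next slot to push (producer side)
}

impl<T> Spsc<T> {
    fn with_capacity(n: usize) -> Self {
        assert!(n.is_power_of_two(), "capacity must be a power of two");
        Spsc {
            buf: (0..n).map(|_| None).collect(),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Returns false if the queue is full (never blocks, never locks).
    fn push(&mut self, v: T) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail - head == self.buf.len() {
            return false; // full
        }
        let mask = self.buf.len() - 1;
        self.buf[tail & mask] = Some(v);
        self.tail.store(tail + 1, Ordering::Release); // publish the slot
        true
    }

    fn pop(&mut self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let mask = self.buf.len() - 1;
        let v = self.buf[head & mask].take();
        self.head.store(head + 1, Ordering::Release); // release the slot
        v
    }
}
```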
- Comprehensive README files for all crates
- CLAUDE.md with build commands and architecture overview
- Code examples for all major features