
Commit 2f0f0ff

mivertowski and claude committed
Release v0.3.2: GPU profiling infrastructure
Add comprehensive GPU profiling capabilities for CUDA backend:
- CUDA event wrappers (CudaEvent, GpuTimer, GpuTimerPool)
- NVTX integration for Nsight Systems/Compute timeline visualization
- Kernel metrics collection (grid/block dims, occupancy, registers)
- Memory tracking with allocation profiling and leak detection
- Chrome trace format export for GPU timeline visualization
- Feature-gated via 'profiling' feature flag

New files:
- crates/ringkernel-cuda/src/profiling/mod.rs
- crates/ringkernel-cuda/src/profiling/events.rs
- crates/ringkernel-cuda/src/profiling/nvtx.rs
- crates/ringkernel-cuda/src/profiling/metrics.rs
- crates/ringkernel-cuda/src/profiling/memory_tracker.rs
- crates/ringkernel-cuda/src/profiling/chrome_trace.rs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 8bf27f8 commit 2f0f0ff

14 files changed

Lines changed: 3501 additions & 26 deletions


CHANGELOG.md

Lines changed: 57 additions & 1 deletion
@@ -7,6 +7,61 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [0.3.2] - 2026-01-20
+
+### Added
+
+#### GPU Profiling Infrastructure
+
+- **CUDA Profiling Module** (`ringkernel-cuda/src/profiling/`) - **NEW MODULE**
+  - Feature-gated via `profiling` feature flag
+  - Comprehensive GPU profiling capabilities for performance analysis
+
+- **CUDA Event Wrappers** (`profiling/events.rs`)
+  - `CudaEvent` - RAII wrapper for CUDA events with timing support
+  - `CudaEventFlags` - Event configuration (blocking sync, disable timing, interprocess)
+  - `GpuTimer` - Start/stop timer using CUDA events with microsecond precision
+  - `GpuTimerPool` - Pool of reusable timers with interior mutability for concurrent access
+
+- **NVTX Integration** (`profiling/nvtx.rs`)
+  - `CudaNvtxProfiler` - Real NVTX profiler using cudarc's nvtx module
+  - Timeline visualization in Nsight Systems and Nsight Compute
+  - `NvtxCategory` - Predefined categories (Kernel, Transfer, Memory, Sync, Queue, User)
+  - `NvtxRange` - RAII wrapper for automatic range end on drop
+  - `NvtxPayload` - Typed payloads for markers (I32, I64, U32, U64, F32, F64)
+  - Implements `GpuProfiler` trait for integration with ringkernel-core
+
+- **Kernel Metrics** (`profiling/metrics.rs`)
+  - `KernelMetrics` - Execution metadata (grid/block dims, GPU time, occupancy, registers)
+  - `TransferMetrics` - Memory transfer stats with bandwidth calculation
+  - `TransferDirection` - HostToDevice, DeviceToHost, DeviceToDevice
+  - `ProfilingSession` - Collects kernel and transfer events with timestamps
+  - `KernelAttributes` - Query kernel attributes via cuFuncGetAttribute
+
+- **Memory Tracking** (`profiling/memory_tracker.rs`)
+  - `CudaMemoryTracker` - Track GPU memory allocations with timing
+  - `TrackedAllocation` - Allocation metadata (ptr, size, kind, label, timestamp)
+  - `CudaMemoryKind` - Device, Pinned, Mapped, Managed memory types
+  - Peak usage tracking and allocation statistics
+  - Integration with `GpuMemoryDashboard` from ringkernel-core
+
+- **Chrome Trace Export** (`profiling/chrome_trace.rs`)
+  - `GpuTraceEvent` - Chrome trace format event structure
+  - `GpuEventArgs` - Rich event metadata (grid/block dims, occupancy, bandwidth)
+  - `GpuChromeTraceBuilder` - Build Chrome trace JSON from profiling sessions
+  - Support for kernel events, transfer events, NVTX ranges, memory allocations
+  - Process/thread naming for multi-GPU and multi-stream visualization
+  - Compatible with chrome://tracing, Perfetto UI, and Nsight Systems
+
+### Changed
+
+- **Dependencies** - Added `nvtx` feature to cudarc dependency
+- **ringkernel-cuda/Cargo.toml** - Added optional `serde` and `serde_json` for Chrome trace export
+
+### Fixed
+
+- Added `ProfilerRange::stub()` public constructor in ringkernel-core for external profiler implementations
+
 ## [0.3.1] - 2026-01-19

 ### Added
@@ -771,7 +826,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - CLAUDE.md with build commands and architecture overview
 - Code examples for all major features

-[Unreleased]: https://github.com/mivertowski/RustCompute/compare/v0.3.1...HEAD
+[Unreleased]: https://github.com/mivertowski/RustCompute/compare/v0.3.2...HEAD
+[0.3.2]: https://github.com/mivertowski/RustCompute/compare/v0.3.1...v0.3.2
 [0.3.1]: https://github.com/mivertowski/RustCompute/compare/v0.3.0...v0.3.1
 [0.3.0]: https://github.com/mivertowski/RustCompute/compare/v0.2.0...v0.3.0
 [0.2.0]: https://github.com/mivertowski/RustCompute/compare/v0.1.3...v0.2.0
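
The changelog entry above lists peak usage tracking and leak detection for `CudaMemoryTracker` without showing how such bookkeeping works. The following is a minimal sketch of the general idea, using only the standard library. The names `MemoryTracker`, `Allocation`, `record_alloc`, `record_free`, and `leaks` are hypothetical and are not taken from the crate; the real tracker may differ in structure and API.

```rust
use std::collections::HashMap;

/// Hypothetical allocation record (not the crate's `TrackedAllocation`).
struct Allocation {
    size: usize,
    label: String,
}

/// Hypothetical tracker: live allocations keyed by device pointer value.
#[derive(Default)]
struct MemoryTracker {
    live: HashMap<u64, Allocation>,
    current_bytes: usize,
    peak_bytes: usize,
}

impl MemoryTracker {
    /// Record an allocation and update the running peak.
    fn record_alloc(&mut self, ptr: u64, size: usize, label: &str) {
        let entry = Allocation { size, label: label.to_string() };
        if let Some(stale) = self.live.insert(ptr, entry) {
            // Pointer reused without a recorded free: discard the stale bytes.
            self.current_bytes -= stale.size;
        }
        self.current_bytes += size;
        self.peak_bytes = self.peak_bytes.max(self.current_bytes);
    }

    /// Record a free; returns false for unknown or already-freed pointers.
    fn record_free(&mut self, ptr: u64) -> bool {
        match self.live.remove(&ptr) {
            Some(a) => {
                self.current_bytes -= a.size;
                true
            }
            None => false,
        }
    }

    /// Allocations still live, sorted by pointer; at shutdown these are leaks.
    fn leaks(&self) -> Vec<(u64, &Allocation)> {
        let mut v: Vec<_> = self.live.iter().map(|(p, a)| (*p, a)).collect();
        v.sort_by_key(|(p, _)| *p);
        v
    }
}

fn main() {
    let mut tracker = MemoryTracker::default();
    tracker.record_alloc(0x1000, 4096, "input");
    tracker.record_alloc(0x2000, 8192, "output");
    tracker.record_free(0x1000);
    println!("current = {} B, peak = {} B", tracker.current_bytes, tracker.peak_bytes);
    for (ptr, a) in tracker.leaks() {
        println!("still live: {:#x} ({} B, {})", ptr, a.size, a.label);
    }
}
```

Keeping the peak as a running maximum makes it cheap to maintain, and whatever remains in the live map at teardown is the leak report.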

CLAUDE.md

Lines changed: 4 additions & 3 deletions
@@ -94,7 +94,7 @@ The project is a Cargo workspace with these crates:
 - **`ringkernel-core`** - Core traits and types (RingMessage, MessageQueue, HlcTimestamp, ControlBlock, RingContext, RingKernelRuntime, K2K messaging, PubSub)
 - **`ringkernel-derive`** - Proc macros: `#[derive(RingMessage)]`, `#[ring_kernel]`, `#[derive(GpuType)]`
 - **`ringkernel-cpu`** - CPU backend implementation (always available, used for testing/fallback)
-- **`ringkernel-cuda`** - NVIDIA CUDA backend with cooperative groups support (feature-gated)
+- **`ringkernel-cuda`** - NVIDIA CUDA backend with cooperative groups and GPU profiling support (feature-gated)
 - **`ringkernel-wgpu`** - WebGPU cross-platform backend (feature-gated)
 - **`ringkernel-metal`** - Apple Metal backend (feature-gated, macOS only, scaffold implementation with runtime/buffer/pipeline)
 - **`ringkernel-codegen`** - GPU kernel code generation
@@ -539,8 +539,9 @@ Main crate (`ringkernel`) features:
 - `metal` - Apple Metal backend
 - `all-backends` - All GPU backends

-CUDA-specific features:
+CUDA-specific features (`ringkernel-cuda`):
 - `cooperative` - Enable CUDA cooperative groups for grid-wide synchronization (`grid.sync()`). Requires nvcc at build time for PTX compilation.
+- `profiling` - GPU profiling infrastructure (CUDA events, NVTX, memory tracking, Chrome trace export). Requires nvToolsExt library.

 Core crate (`ringkernel-core`) enterprise features:
 - `crypto` - Real cryptography (AES-256-GCM, ChaCha20-Poly1305, Argon2)
@@ -805,7 +806,7 @@ let _ = device.poll(wgpu::PollType::wait_indefinitely());

 ## Dependency Versions

-Key workspace dependencies (as of v0.3.1):
+Key workspace dependencies (as of v0.3.2):

 | Category | Package | Version | Notes |
 |----------|---------|---------|-------|

Cargo.lock

Lines changed: 20 additions & 18 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ members = [
 ]

 [workspace.package]
-version = "0.3.1"
+version = "0.3.2"
 edition = "2021"
 authors = ["Michael Ivertowski <mivertowski@outlook.com>"]
 license = "Apache-2.0"

README.md

Lines changed: 44 additions & 2 deletions
@@ -643,13 +643,55 @@ cargo run -p ringkernel-txmon --bin txmon-benchmark --release --features cuda-co

 Performance varies significantly by hardware and workload.

+## GPU Profiling
+
+RingKernel v0.3.2 includes comprehensive GPU profiling infrastructure. Enable with the `profiling` feature:
+
+```toml
+[dependencies]
+ringkernel-cuda = { version = "0.3", features = ["profiling"] }
+```
+
+### GPU Timer and Events
+
+```rust
+use ringkernel_cuda::profiling::{GpuTimer, CudaNvtxProfiler, ProfilingSession};
+
+// GPU-side timing with CUDA events
+let mut timer = GpuTimer::new()?;
+timer.start(stream)?;
+// ... kernel execution ...
+timer.stop(stream)?;
+println!("Kernel time: {:.3} ms", timer.elapsed_ms()?);
+
+// NVTX profiling for Nsight Systems/Compute
+let profiler = CudaNvtxProfiler::new();
+{
+    let _range = profiler.push_range("compute_phase", ProfilerColor::CYAN);
+    // ... kernel execution ...
+} // Range automatically ends
+
+// Export to Chrome trace format
+let session = ProfilingSession::new();
+// ... record kernel/transfer events ...
+let builder = GpuChromeTraceBuilder::from_session(&session);
+std::fs::write("gpu_trace.json", builder.build())?;
+```
+
+Features include:
+- **CUDA Events**: GPU-side timing without CPU overhead
+- **NVTX Integration**: Timeline visualization in Nsight Systems/Compute
+- **Kernel Metrics**: Grid/block dims, occupancy, registers per thread
+- **Memory Tracking**: Allocation profiling with leak detection
+- **Chrome Trace Export**: GPU timeline visualization in chrome://tracing
+
 ## Enterprise Security

-RingKernel v0.3.1 includes comprehensive enterprise security features. Enable with the `enterprise` feature:
+RingKernel v0.3.2 includes comprehensive enterprise security features. Enable with the `enterprise` feature:

 ```toml
 [dependencies]
-ringkernel-core = { version = "0.3.1", features = ["enterprise"] }
+ringkernel-core = { version = "0.3", features = ["enterprise"] }
 ```

 ### Authentication & Authorization
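
The GPU Profiling section added to the README in this diff writes a Chrome trace file but does not show what that file contains. As a rough illustration, the sketch below emits the JSON object form of the Chrome trace event format, where each complete event uses `"ph":"X"` with `ts` and `dur` in microseconds. It uses only the standard library, whereas the commit pulls in `serde` and `serde_json`. `TraceEvent`, `escape`, and `to_chrome_json` are hypothetical names, and mapping device to `pid` and stream to `tid` is an assumption made for this example rather than confirmed crate behavior.

```rust
/// Hypothetical event type (not the crate's `GpuTraceEvent`).
struct TraceEvent {
    name: String,
    category: &'static str,
    start_us: u64,
    duration_us: u64,
    device: u32, // assumed mapping: device -> "pid"
    stream: u32, // assumed mapping: stream -> "tid"
}

/// Minimal JSON string escaping for quotes, backslashes, and control chars.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize events as {"traceEvents":[...]} using complete ("X") events.
fn to_chrome_json(events: &[TraceEvent]) -> String {
    let body: Vec<String> = events
        .iter()
        .map(|e| {
            format!(
                "{{\"name\":\"{}\",\"cat\":\"{}\",\"ph\":\"X\",\"ts\":{},\"dur\":{},\"pid\":{},\"tid\":{}}}",
                escape(&e.name),
                e.category,
                e.start_us,
                e.duration_us,
                e.device,
                e.stream
            )
        })
        .collect();
    format!("{{\"traceEvents\":[{}]}}", body.join(","))
}

fn main() {
    let events = vec![
        TraceEvent {
            name: "vector_add".to_string(),
            category: "kernel",
            start_us: 100,
            duration_us: 42,
            device: 0,
            stream: 1,
        },
        TraceEvent {
            name: "h2d_copy".to_string(),
            category: "transfer",
            start_us: 20,
            duration_us: 75,
            device: 0,
            stream: 2,
        },
    ];
    println!("{}", to_chrome_json(&events));
}
```

Saving the printed output to a `.json` file should give something that chrome://tracing or the Perfetto UI can load, with one row per `pid`/`tid` pair.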

crates/ringkernel-core/src/observability.rs

Lines changed: 9 additions & 0 deletions
@@ -1162,6 +1162,15 @@ impl ProfilerRange {
         }
     }

+    /// Create a stub profiler range for external profiler implementations.
+    ///
+    /// This is used by custom profiler implementations (like CUDA NVTX) that
+    /// manage their own range lifecycle but need to return a ProfilerRange
+    /// for API compatibility.
+    pub fn stub(name: impl Into<String>, backend: GpuProfilerBackend) -> Self {
+        Self::new(name, backend)
+    }
+
     /// Get elapsed duration.
     pub fn elapsed(&self) -> Duration {
         self.start.elapsed()
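
The doc comment on `stub` refers to profilers that manage their own range lifecycle. A common way to do that in Rust is an RAII guard that opens a range on construction and closes it in `Drop`. The sketch below illustrates that pattern with a recording log standing in for the NVTX calls. `RangeGuard`, `Log`, and the log format are hypothetical and are not the crate's `NvtxRange` or `ProfilerRange`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log standing in for a real profiler backend (hypothetical).
type Log = Rc<RefCell<Vec<String>>>;

/// Hypothetical RAII guard: pushes a range when created, pops it when dropped.
struct RangeGuard {
    name: String,
    log: Log,
}

impl RangeGuard {
    fn push(name: &str, log: &Log) -> Self {
        log.borrow_mut().push(format!("push {}", name));
        RangeGuard { name: name.to_string(), log: Rc::clone(log) }
    }
}

impl Drop for RangeGuard {
    fn drop(&mut self) {
        // Runs on scope exit, including early returns and panics that unwind.
        self.log.borrow_mut().push(format!("pop {}", self.name));
    }
}

fn main() {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let _outer = RangeGuard::push("compute_phase", &log);
        let _inner = RangeGuard::push("kernel_launch", &log);
        // ... work being profiled ...
    } // locals drop in reverse declaration order: inner first, then outer

    for line in log.borrow().iter() {
        println!("{}", line);
    }
}
```

Because locals drop in reverse declaration order, nested guards close innermost-first, which matches the push/pop nesting that timeline tools expect.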

crates/ringkernel-cuda/Cargo.toml

Lines changed: 8 additions & 1 deletion
@@ -17,7 +17,8 @@ build = "build.rs"
 ringkernel-core = { workspace = true }

 # CUDA bindings - auto-detect CUDA version from build system
-cudarc = { version = "0.18.2", optional = true, features = ["cuda-version-from-build-system"] }
+# The nvtx feature enables NVTX profiling integration
+cudarc = { version = "0.18.2", optional = true, features = ["cuda-version-from-build-system", "nvtx"] }

 # Async runtime
 tokio = { workspace = true }
@@ -33,6 +34,10 @@ tracing = { workspace = true }
 # Synchronization
 parking_lot = { workspace = true }

+# Serialization (for profiling chrome trace export)
+serde = { workspace = true, optional = true }
+serde_json = { workspace = true, optional = true }
+
 [dev-dependencies]
 tokio = { workspace = true, features = ["test-util", "macros", "rt-multi-thread"] }

@@ -45,3 +50,5 @@ cuda = ["cudarc"]
 # Cooperative groups support - requires nvcc at build time for PTX compilation
 # Enables grid-wide synchronization via cuLaunchCooperativeKernel
 cooperative = ["cuda"]
+# GPU profiling support - NVTX integration, CUDA events, Chrome trace export
+profiling = ["cuda", "serde", "serde_json"]
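
The `profiling` feature declared above gates code at compile time. The sketch below shows one way a crate can use `#[cfg(feature = "...")]` so that profiling code is compiled only when the feature is enabled. The module body, the `enabled` function, and the no-op fallback module are hypothetical; the diff does not show how `ringkernel-cuda` wires the feature into its source, and it may simply omit the module when the feature is off.

```rust
// Compiled only when the crate is built with `--features profiling`.
#[cfg(feature = "profiling")]
mod profiling {
    pub fn enabled() -> bool {
        true
    }
}

// Hypothetical fallback so call sites compile either way (an assumption,
// not something the diff confirms the crate provides).
#[cfg(not(feature = "profiling"))]
mod profiling {
    pub fn enabled() -> bool {
        false
    }
}

fn main() {
    println!("profiling compiled in: {}", profiling::enabled());
}
```

With a fallback like this, callers need no `cfg` attributes of their own, at the cost of keeping two module definitions in sync.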
