[PGO][HIP] Decouple device profile drain via HSA introspection by lfmeadow · Pull Request #2743 · ROCm/llvm-project

lfmeadow · 2026-05-31T18:14:59Z

Status

Draft — for examination, not for merge yet. Supersedes draft PR #2714
(same logical patch content, rebased onto current amd-staging; the
intermediate multi-device fix in commit b1b20686afe becomes redundant
under this drain and its tests under compiler-rt/test/profile/GPU/
continue to work via the retained ABI forwarder).

This revision also addresses the static-review feedback gathered against
#2714 — see "Review-driven changes" below.

Summary

Replaces the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain. At process exit the
drain walks every loaded HSA code object on every GPU agent, finds the
canonical __llvm_profile_sections bounds table emitted by
compiler-rt/lib/profile/InstrProfilingPlatformGPU.c, D2H-copies its
counters/data/names back to the host, and writes an arch-prefixed
.profraw via __llvm_write_custom_profile.

What this changes vs. the existing approach

* Host and device drains are fully decoupled. The device drain runs
  from an atexit handler installed by a library constructor in
  libclang_rt.profile. Device counters are collected whether or not
  the host TUs were instrumented and without any host-side per-TU
  shadow, CUID matching, or module-load interception.
* Cases the old 1-1 host↔device model could not handle now work:
  separate device-only modules loaded at runtime
  (hipModuleLoad / hsa_executable_*), an uninstrumented host, and
  multi-GPU. Multi-GPU no longer relies on hipGetSymbolAddress,
  removing the comgr-at-atexit null deref that the predecessor
  band-aid worked around by restricting collection to a single device.
* Clang stops emitting any PGO-specific machinery for HIP.
  CGCUDANV.cpp loses the offload-profiling shadow/registration code
  entirely. The driver gains a one-line -u __llvm_profile_runtime
  force-link so the device profile runtime is pulled in for HIP+PGO
  links.
* ABI compatibility: legacy
  __llvm_profile_offload_register_{shadow,section_shadow,dynamic_module}
  and __llvm_profile_hip_collect_device_data are retained as
  no-ops / forwarders so binaries compiled against the previous runtime
  still link and produce output.
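
The constructor-plus-atexit arrangement described above can be sketched
roughly as follows. This is an illustration only, not the code in
InstrProfilingPlatformROCm.cpp: the names installDeviceDrain,
drainDevicesSketch, HandlerInstalled and DrainCalls are hypothetical, and
the sketch assumes a GCC- or Clang-compatible toolchain for
__attribute__((constructor)).

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

namespace sketch {

// Both flags are constant-initialized, so they are valid before any
// constructor function runs.
bool HandlerInstalled = false; // set once atexit() accepts the handler
int DrainCalls = 0;            // how many times the drain has run

// Hypothetical stand-in for the real device drain. The real drain walks
// HSA code objects; this one only records that it was called.
void drainDevicesSketch() {
  ++DrainCalls;
  std::fprintf(stderr, "device drain ran (call %d)\n", DrainCalls);
}

// Runs when the library (or executable) is loaded, before main().
// Nothing in host code has to reference the drain for it to be installed.
__attribute__((constructor)) void installDeviceDrain() {
  // atexit() returns 0 on success.
  HandlerInstalled = (std::atexit(drainDevicesSketch) == 0);
}

} // namespace sketch
```

By the time main() starts, sketch::HandlerInstalled is already true and the
drain has not yet run; it fires once during normal process exit. Because
registration needs no reference from host code, the drain object still has to
be pulled into the link explicitly, which is what the -u force-link is for.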

Review-driven changes (vs the #2714 snapshot)

* drainDevices() idempotency: split into DrainInProgress /
  DrainCompleted; only latches "done" after a successful walk that
  actually drained data. Transient no-ops ("HSA/HIP not yet
  resolvable", "no GPU agents", "no loaded segments", "no instrumented
  sections") stay retryable so a late atexit call still picks up code
  objects that loaded after an early host-write trigger. Covered by
  the new compiler-rt/test/profile/AMDGPU/device-early-collect.hip.
* HIP runtime resolution: tries RTLD_DEFAULT first (catches the
  common case of HIP already being in the process namespace, including
  runtime-only ROCm installs without an unversioned dev symlink), then
  falls back to dlopen of libamdhip64.so.{7,6,5,4} and finally the
  unversioned .so.
* Section bounds validation: processDeviceSections now does
  uintptr_t-based size math, rejects End < Begin and per-section
  spans above 256 MiB, requires
  DataSize % sizeof(__llvm_profile_data) == 0, and per-record
  validates that each CounterPtr resolves inside the copied counters
  region (out-of-range entries are zeroed and warned about instead of
  producing a .profraw that points at unrelated memory).
* HSA symbol-iter error path: a non-SUCCESS/non-INFO_BREAK
  return from hsa_executable_iterate_agent_symbols is now warned and
  reflected in the drain's exit status.
* lit gate tightened: compiler-rt/test/profile/lit.cfg.py now
  requires /dev/kfd plus a usable HIP install (probed via
  $ROCM_PATH or /opt/rocm) plus the amdgcn device profile runtime
  in the resource directory before enabling the hip / amdgpu
  features. Exports %hip_lib_path and %amdgpu_arch so the new
  AMDGPU/*.hip tests stay portable (and consistent with the existing
  GPU/instrprof-hip-* tests).
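
The section bounds checks listed above could look roughly like the sketch
below. It is written under stated assumptions rather than taken from
processDeviceSections: ProfileDataRecord is a hypothetical stand-in for
__llvm_profile_data (the real layout differs), CounterPtr is treated as an
absolute address into the counters region, counters are assumed to be 64-bit,
and all function names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

// 256 MiB per-section cap, as stated in the PR description.
constexpr std::uintptr_t MaxSectionSpan = 256u * 1024u * 1024u;

// Hypothetical stand-in for __llvm_profile_data.
struct ProfileDataRecord {
  std::uint64_t NameRef;
  std::uint64_t FuncHash;
  std::uintptr_t CounterPtr; // assumed: absolute address of first counter
  std::uint32_t NumCounters;
};

// Returns true and stores the span in Size when [Begin, End) is plausible:
// End must not precede Begin and the span must not exceed the cap.
bool sectionSpan(std::uintptr_t Begin, std::uintptr_t End,
                 std::uintptr_t &Size) {
  if (End < Begin)
    return false;
  const std::uintptr_t Span = End - Begin;
  if (Span > MaxSectionSpan)
    return false;
  Size = Span;
  return true;
}

// The data section must hold a whole number of records.
bool dataSizeIsWholeRecords(std::uintptr_t DataSize) {
  return DataSize % sizeof(ProfileDataRecord) == 0;
}

// True when all of the record's counters lie inside the copied counters
// region [CntBegin, CntBegin + CntSize). Written to avoid overflow.
bool counterPtrInRange(const ProfileDataRecord &R, std::uintptr_t CntBegin,
                       std::uintptr_t CntSize) {
  if (R.CounterPtr < CntBegin)
    return false;
  const std::uintptr_t Offset = R.CounterPtr - CntBegin;
  if (Offset > CntSize)
    return false;
  const std::uint64_t Bytes =
      std::uint64_t(R.NumCounters) * sizeof(std::uint64_t);
  return Bytes <= CntSize - Offset;
}

} // namespace sketch
```

A caller would reject a section outright when sectionSpan or
dataSizeIsWholeRecords fails; per the description above, a record that fails
the counter range check is zeroed and warned about rather than aborting the
whole drain.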

Validation

Built clean against current amd-staging on Linux x86_64 + AMDGPU:
host clang/lld/compiler-rt, and libclang_rt.profile-amdgcn.a
(device profile runtime) for the amdgcn target.

End-to-end exercised with RCCL under its --enable-device-coverage
fast build path (debug + -gline-tables-only -O1, default device
linker, gfx90a, ~10 min incremental relink against the rebased runtime):

* librccl.so carries the expected
  __llvm_prf_{names,cnts,data,vnds} + __llvm_covfun +
  __llvm_covmap sections and the
  __llvm_profile_hip_collect_device_data forwarder.
* all_reduce_perf -g 2 produces 1 host .profraw + 26 device
  gfx90a.*.profraw per rank (one per loaded HSA executable), all
  LLVM raw profile version 10.
* Verbose drain log:
  "HIP resolved via existing process namespace (RTLD_DEFAULT)" and
  "walk complete: agents=4 pairs=28 found=26 drained=26 iter-failures=0",
  with zero out-of-range counter pointer warnings.
* llvm-profdata merge + llvm-cov report produce realistic per-file
  coverage for both the host librccl.so and the per-arch device ELF.

Replace the host-shadow / per-CUID / hipModuleLoad-interceptor device profile drain with an HSA-introspection drain that walks every loaded device code object at process exit, finds the canonical __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU.c, D2H-copies its counters/data/names, and writes an arch-prefixed .profraw via __llvm_write_custom_profile.

Host and device drains are now fully independent: the drain runs from an atexit handler registered in a library constructor, so device counters are collected whether or not the host TUs were instrumented and without any host-side per-TU shadow, CUID matching, or module-load interception. This fixes the cases the old 1-1 host<->device model could not handle (separate device-only modules, uninstrumented host, multi-GPU).

* compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp: full rewrite to the HSA-introspection drain (dlopen'd HSA/HIP, agent enumeration, hsa_ven_amd_loader segment descriptors, executable symbol walk, bounds dedup, idempotent drainDevices + atexit, collision-free target names). Legacy __llvm_profile_offload_register_* symbols kept as no-ops for ABI compatibility. Guarded host-only for GPU builds.
* clang/lib/CodeGen/CGCUDANV.cpp: delete the entire offload-profiling machinery (OffloadProfShadow, emitOffloadProfilingSections, both shadow-registration sites). Clang emits nothing PGO-specific for HIP.
* clang/lib/Driver/ToolChains/Gnu.cpp: for HIP host links built with PGO, force-link the drain object via -u__llvm_profile_hip_collect_device_data (it is otherwise unreferenced now that the host emits no shadow).
* compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake: add amdgcn / nvptx64 to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime (the sole source of __llvm_profile_sections after this change) is built for GPU targets.

Tests:

* clang/test/CodeGenHIP/offload-pgo-sections.hip: now asserts the per-CUID struct / shadow / registration are NOT emitted.
* clang/test/Driver/hip-profile-device-drain.hip: the -u force-link is added for HIP+PGO and only then.
* compiler-rt/test/profile/instrprof-offload-abi-compat.c: legacy no-op symbols still link and run.
* compiler-rt/test/profile/AMDGPU/{device-basic,device-no-kernel,device-symbols}.hip (REQUIRES: amdgpu) + lit .hip suffix and amdgpu feature gate.

Co-authored-by: Cursor <cursoragent@cursor.com>

ronlieb · 2026-05-31T19:21:58Z

Does this need to go upstream at some point?

ronlieb self-requested a review May 31, 2026 18:58

lfmeadow mentioned this pull request Jun 1, 2026

[RCCL] In-tree device coverage via HSA introspection drain ROCm/rocm-systems#6630

Draft

lfmeadow wants to merge 1 commit into amd-staging from device-pgo-introspection-drain-v2
