[PGO][HIP] Decouple device profile drain via HSA introspection #2743

Draft
lfmeadow wants to merge 1 commit into amd-staging from device-pgo-introspection-drain-v2

@lfmeadow

Status

Draft — for examination, not for merge yet. Supersedes draft PR #2714
(same logical patch content, rebased onto current amd-staging; the
intermediate multi-device fix in commit b1b20686afe becomes redundant
under this drain and its tests under compiler-rt/test/profile/GPU/
continue to work via the retained ABI forwarder).

This revision also addresses the static-review feedback gathered against
#2714 — see "Review-driven changes" below.

Summary

Replaces the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain. At process exit the
drain walks every loaded HSA code object on every GPU agent, finds the
canonical __llvm_profile_sections bounds table emitted by
compiler-rt/lib/profile/InstrProfilingPlatformGPU.c, D2H-copies its
counters/data/names back to the host, and writes an arch-prefixed
.profraw via __llvm_write_custom_profile.

What this changes vs. the existing approach

  • Host and device drains are fully decoupled. The device drain runs
    from an atexit handler installed by a library constructor in
    libclang_rt.profile. Device counters are collected whether or not
    the host TUs were instrumented and without any host-side per-TU
    shadow, CUID matching, or module-load interception.

  • Cases the old 1-1 host↔device model could not handle now work:
    separate device-only modules loaded at runtime
    (hipModuleLoad / hsa_executable_*), an uninstrumented host, and
    multi-GPU. Multi-GPU no longer relies on hipGetSymbolAddress,
    removing the comgr-at-atexit null deref that the predecessor
    band-aid worked around by restricting collection to a single device.

  • Clang stops emitting any PGO-specific machinery for HIP.
    CGCUDANV.cpp loses the offload-profiling shadow/registration code
    entirely. The driver gains a one-line -u __llvm_profile_runtime
    force-link so the device profile runtime is pulled in for HIP+PGO
    links.

  • ABI compatibility: legacy
    __llvm_profile_offload_register_{shadow,section_shadow,dynamic_module}
    and __llvm_profile_hip_collect_device_data are retained as
    no-ops / forwarders so binaries compiled against the previous runtime
    still link and produce output.
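
As a rough illustration of the constructor-installed atexit drain and the retained forwarders described above, here is a minimal C++ sketch. The names installDeviceDrain, drainDevicesAtExit, and legacyCollectForwarder are hypothetical and are not taken from the patch; the real runtime performs the HSA walk where this sketch only counts calls. It assumes a GCC or Clang toolchain for __attribute__((constructor)).

```cpp
#include <cassert>
#include <cstdlib>

namespace {
// Illustrative state only; the real runtime tracks far more than this.
int DrainCalls = 0;
bool HandlerInstalled = false;

// Stand-in for the device drain: the real one would walk HSA code objects.
void drainDevicesAtExit() { ++DrainCalls; }

// Runs at load time, before main(), so the drain is registered even when
// no host TU was instrumented.
__attribute__((constructor)) void installDeviceDrain() {
  HandlerInstalled = (std::atexit(drainDevicesAtExit) == 0);
}
} // namespace

// Hypothetical legacy entry point kept only so old callers still link;
// it forwards to the same drain instead of doing its own collection.
extern "C" int legacyCollectForwarder() {
  drainDevicesAtExit();
  return 0;
}
```

Because registration happens in a constructor, nothing in the host program has to call into the runtime for the drain to run at exit; the forwarder exists purely for link compatibility.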

Review-driven changes (vs the #2714 snapshot)

  • drainDevices() idempotency — split into DrainInProgress /
    DrainCompleted; only latches "done" after a successful walk that
    actually drained data. Transient no-ops ("HSA/HIP not yet
    resolvable", "no GPU agents", "no loaded segments", "no instrumented
    sections") stay retryable so a late atexit call still picks up code
    objects that loaded after an early host-write trigger. Covered by
    the new compiler-rt/test/profile/AMDGPU/device-early-collect.hip.

  • HIP runtime resolution — tries RTLD_DEFAULT first (catches the
    common case of HIP already being in the process namespace, including
    runtime-only ROCm installs without an unversioned dev symlink), then
    falls back to dlopen of libamdhip64.so.{7,6,5,4} and finally the
    unversioned .so.

  • Section bounds validation: processDeviceSections now does
    uintptr_t-based size math, rejects End < Begin and per-section
    spans above 256 MiB, requires
    DataSize % sizeof(__llvm_profile_data) == 0, and per-record
    validates that each CounterPtr resolves inside the copied counters
    region (out-of-range entries are zeroed and warned about instead of
    producing a .profraw that points at unrelated memory).

  • HSA symbol-iter error path — non-SUCCESS/non-INFO_BREAK
    return from hsa_executable_iterate_agent_symbols is now warned and
    reflected in the drain's exit status.

  • lit gate tightened: compiler-rt/test/profile/lit.cfg.py now
    requires /dev/kfd plus a usable HIP install (probed via
    $ROCM_PATH / /opt/rocm) plus the amdgcn device profile runtime
    in the resource directory before enabling the hip / amdgpu
    features. Exports %hip_lib_path and %amdgpu_arch so the new
    AMDGPU/*.hip tests stay portable (and consistent with the existing
    GPU/instrprof-hip-* tests).
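
The section bounds checks listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in processDeviceSections: SectionBounds, sectionSize, dataSizeIsWholeRecords, and counterInRange are hypothetical names, the record size is passed in rather than taken from sizeof(__llvm_profile_data), and CounterPtr is modeled as an already-resolved device address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for one Begin/End pair from the bounds table.
struct SectionBounds {
  std::uintptr_t Begin;
  std::uintptr_t End;
};

// Per-section cap from the description: spans above 256 MiB are rejected.
constexpr std::uintptr_t kMaxSectionSpan = std::uintptr_t{256} << 20;

// Computes End - Begin in unsigned pointer-width math, rejecting inverted
// bounds first so the subtraction cannot wrap around.
bool sectionSize(const SectionBounds &B, std::uintptr_t &Size) {
  if (B.End < B.Begin)
    return false;
  std::uintptr_t Span = B.End - B.Begin;
  if (Span > kMaxSectionSpan)
    return false;
  Size = Span;
  return true;
}

// The data section must hold a whole number of fixed-size records.
bool dataSizeIsWholeRecords(std::uintptr_t DataSize, std::size_t RecordSize) {
  return RecordSize != 0 && DataSize % RecordSize == 0;
}

// A record's counter address must fall inside the copied counters region;
// the half-open range [Begin, End) excludes the one-past-the-end address.
bool counterInRange(std::uintptr_t CounterAddr, const SectionBounds &Counters) {
  return CounterAddr >= Counters.Begin && CounterAddr < Counters.End;
}
```

Checking End < Begin before subtracting matters because the subtraction is unsigned: an inverted pair would otherwise produce a huge span instead of a negative one.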

Validation

Built clean against current amd-staging on Linux x86_64 + AMDGPU:
host clang/lld/compiler-rt, and libclang_rt.profile-amdgcn.a
(device profile runtime) for the amdgcn target.

End-to-end exercised with RCCL under its --enable-device-coverage
fast build path (debug + -gline-tables-only -O1, default device
linker, gfx90a, ~10 min incremental relink against the rebased runtime):

  • librccl.so carries the expected
    __llvm_prf_{names,cnts,data,vnds} + __llvm_covfun +
    __llvm_covmap sections and the
    __llvm_profile_hip_collect_device_data forwarder.
  • all_reduce_perf -g 2 produces 1 host .profraw + 26 device
    gfx90a.*.profraw per rank (one per loaded HSA executable), all
    LLVM raw profile version 10.
  • Verbose drain log:
    HIP resolved via existing process namespace (RTLD_DEFAULT) and
    walk complete: agents=4 pairs=28 found=26 drained=26 iter-failures=0
    — zero out-of-range counter pointer warnings.
  • llvm-profdata merge + llvm-cov report produce realistic per-file
    coverage for both the host librccl.so and the per-arch device ELF.

Replace the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain that walks every loaded
device code object at process exit, finds the canonical
__llvm_profile_sections bounds table emitted by
InstrProfilingPlatformGPU.c, D2H-copies its counters/data/names, and
writes an arch-prefixed .profraw via __llvm_write_custom_profile.

Host and device drains are now fully independent: the drain runs from an
atexit handler registered in a library constructor, so device counters
are collected whether or not the host TUs were instrumented and without
any host-side per-TU shadow, CUID matching, or module-load interception.
This fixes the cases the old 1-1 host<->device model could not handle
(separate device-only modules, uninstrumented host, multi-GPU).

  * compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp: full rewrite
    to the HSA-introspection drain (dlopen'd HSA/HIP, agent enumeration,
    hsa_ven_amd_loader segment descriptors, executable symbol walk,
    bounds dedup, idempotent drainDevices + atexit, collision-free
    target names). Legacy __llvm_profile_offload_register_* symbols kept
    as no-ops for ABI compatibility. Guarded host-only for GPU builds.

  * clang/lib/CodeGen/CGCUDANV.cpp: delete the entire offload-profiling
    machinery (OffloadProfShadow, emitOffloadProfilingSections, both
    shadow-registration sites). Clang emits nothing PGO-specific for HIP.

  * clang/lib/Driver/ToolChains/Gnu.cpp: for HIP host links built with
    PGO, force-link the drain object via
    -u__llvm_profile_hip_collect_device_data (it is otherwise
    unreferenced now that the host emits no shadow).

  * compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake: add amdgcn /
    nvptx64 to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime
    (the sole source of __llvm_profile_sections after this change) is
    built for GPU targets.
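
The "idempotent drainDevices" behavior (the DrainInProgress / DrainCompleted split) can be sketched as a two-flag latch. This is a simplified single-threaded illustration: DrainState, WalkResult, and drainOnce are hypothetical names, the walk is a caller-supplied callback rather than the real HSA walk, and a real runtime would presumably need atomics or a lock around the flags.

```cpp
#include <cassert>

// Hypothetical outcome of one walk over the loaded code objects.
enum class WalkResult { NothingYet, Drained };

struct DrainState {
  bool InProgress = false; // guards against re-entrant drains
  bool Completed = false;  // latched only after data was actually drained
};

// Runs Walk at most once at a time. Returns true only when this call
// drained data; a NothingYet result leaves the state retryable so a later
// call (for example from atexit) can still pick up late-loaded code objects.
template <typename WalkFn>
bool drainOnce(DrainState &S, WalkFn Walk) {
  if (S.Completed || S.InProgress)
    return false;
  S.InProgress = true;
  WalkResult R = Walk();
  S.InProgress = false;
  if (R == WalkResult::Drained)
    S.Completed = true;
  return R == WalkResult::Drained;
}
```

With this shape, an early trigger that finds nothing to drain does not block the atexit call, while a successful drain makes every later call a no-op.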

Tests:
  * clang/test/CodeGenHIP/offload-pgo-sections.hip: now asserts the
    per-CUID struct / shadow / registration are NOT emitted.
  * clang/test/Driver/hip-profile-device-drain.hip: the -u force-link is
    added for HIP+PGO and only then.
  * compiler-rt/test/profile/instrprof-offload-abi-compat.c: legacy
    no-op symbols still link and run.
  * compiler-rt/test/profile/AMDGPU/{device-basic,device-no-kernel,
    device-symbols}.hip (REQUIRES: amdgpu) + lit .hip suffix and amdgpu
    feature gate.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ronlieb ronlieb self-requested a review May 31, 2026 18:58
@ronlieb
Collaborator

ronlieb commented May 31, 2026

Does this need to go upstream at some point?
