[PGO][HIP] Decouple device profile drain via HSA introspection #2714
Draft
lfmeadow wants to merge 1 commit into
Replace the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain that walks every loaded
device code object at process exit, finds the canonical
__llvm_profile_sections bounds table emitted by
InstrProfilingPlatformGPU.c, D2H-copies its counters/data/names, and
writes an arch-prefixed .profraw via __llvm_write_custom_profile.
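As a rough illustration of the copy step, here is a minimal C++ sketch. It assumes the bounds table is a set of begin/end address pairs, one per section. ProfileBounds, HostCopy, copyRange and copySections are hypothetical names rather than the layout InstrProfilingPlatformGPU.c actually emits, and std::memcpy stands in for the device-to-host copy so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical bounds table: one [begin, end) pair per profile section.
// The real __llvm_profile_sections layout may differ.
struct ProfileBounds {
  const std::uint8_t *CountersBegin, *CountersEnd;
  const std::uint8_t *DataBegin, *DataEnd;
  const std::uint8_t *NamesBegin, *NamesEnd;
};

// Host-side buffers that receive the three sections.
struct HostCopy {
  std::vector<std::uint8_t> Counters, Data, Names;
};

// Signature of the copy primitive; a real drain would pass a D2H copy here.
using CopyFn = void (*)(void *Dst, const void *Src, std::size_t Size);

// Stand-in for the device-to-host copy (assumption: host memory only).
inline void hostMemcpy(void *Dst, const void *Src, std::size_t Size) {
  std::memcpy(Dst, Src, Size);
}

// Copies one [Begin, End) range into a freshly sized host buffer.
inline std::vector<std::uint8_t> copyRange(const std::uint8_t *Begin,
                                           const std::uint8_t *End,
                                           CopyFn Copy) {
  assert(Begin <= End && "malformed section bounds");
  std::vector<std::uint8_t> Out(static_cast<std::size_t>(End - Begin));
  if (!Out.empty())
    Copy(Out.data(), Begin, Out.size());
  return Out;
}

// Copies counters, data and names described by one bounds table.
inline HostCopy copySections(const ProfileBounds &B, CopyFn Copy) {
  HostCopy H;
  H.Counters = copyRange(B.CountersBegin, B.CountersEnd, Copy);
  H.Data = copyRange(B.DataBegin, B.DataEnd, Copy);
  H.Names = copyRange(B.NamesBegin, B.NamesEnd, Copy);
  return H;
}
```

Passing the copy primitive as a function pointer keeps the section walk separate from the transport, so the same loop could be exercised on the host in a test.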
Host and device drains are now fully independent: the drain runs from an
atexit handler registered in a library constructor, so device counters
are collected whether or not the host TUs were instrumented and without
any host-side per-TU shadow, CUID matching, or module-load interception.
This fixes the cases the old 1-1 host<->device model could not handle
(separate device-only modules, uninstrumented host, multi-GPU).
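The constructor-registered atexit handler and the idempotent drain can be sketched roughly as follows. This is a simplified illustration, not the code in InstrProfilingPlatformROCm.cpp: the body of drainDevices is a placeholder, and Drained, DrainPasses and registerDrainAtExit are hypothetical names. It relies on the GCC/Clang constructor attribute.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>

// Set by the first caller that starts a drain; later callers return early.
static std::atomic<bool> Drained{false};
// Counts real drain passes; exists only to make the sketch observable.
static int DrainPasses = 0;

// Placeholder for the device walk. It may be reached from more than one
// place (for example a host-write hook and the atexit handler), so it must
// be safe to call repeatedly.
void drainDevices() {
  if (Drained.exchange(true))
    return; // a previous caller already drained
  ++DrainPasses;
}

// GCC/Clang constructor attribute: runs when the library is loaded, before
// main(), so the handler is registered even if no host TU is instrumented.
__attribute__((constructor)) static void registerDrainAtExit() {
  int Failed = std::atexit(drainDevices);
  assert(Failed == 0 && "atexit registration failed");
  (void)Failed;
}
```

Because the guard is an atomic exchange, an explicit call followed by the atexit callback performs the work only once. A crash before exit handlers run skips the drain entirely.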
* compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp: full rewrite
to the HSA-introspection drain (dlopen'd HSA/HIP, agent enumeration,
hsa_ven_amd_loader segment descriptors, executable symbol walk,
bounds dedup, idempotent drainDevices + atexit, collision-free
target names). Legacy __llvm_profile_offload_register_* symbols kept
as no-ops for ABI compatibility. Guarded host-only for GPU builds.
* clang/lib/CodeGen/CGCUDANV.cpp: delete the entire offload-profiling
machinery (OffloadProfShadow, emitOffloadProfilingSections, both
shadow-registration sites). Clang emits nothing PGO-specific for HIP.
* clang/lib/Driver/ToolChains/Gnu.cpp: for HIP host links built with
PGO, force-link the drain object via
-u__llvm_profile_hip_collect_device_data (it is otherwise
unreferenced now that the host emits no shadow).
* compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake: add amdgcn /
nvptx64 to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime
(the sole source of __llvm_profile_sections after this change) is
built for GPU targets.
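One way to picture the bounds dedup and the collision-free target names is the sketch below. The patch does not spell out its keying or naming scheme, so everything here is an assumption: DrainLedger and claim are hypothetical, a table is identified by (agent index, table address), and names take the form arch.N with a per-arch counter.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical bookkeeping for one drain pass; none of these names come
// from the patch.
class DrainLedger {
  // (agent index, bounds-table address) pairs that were already drained.
  std::set<std::pair<std::uint32_t, std::uint64_t>> Seen;
  // Next free module index for each arch prefix.
  std::map<std::string, unsigned> NextIndex;

public:
  // Returns true and sets Name the first time a bounds table is seen on an
  // agent. Returns false, leaving Name untouched, for a repeat.
  bool claim(const std::string &Arch, std::uint32_t Agent,
             std::uint64_t TableAddr, std::string &Name) {
    assert(!Arch.empty() && "target names need an arch prefix");
    if (!Seen.insert(std::make_pair(Agent, TableAddr)).second)
      return false;
    unsigned Index = NextIndex[Arch]++;
    Name = Arch + "." + std::to_string(Index);
    return true;
  }
};
```

With this scheme, two modules on the same agent get distinct names, and the same table reached through two symbol walks is written only once.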
Tests:
* clang/test/CodeGenHIP/offload-pgo-sections.hip: now asserts the
per-CUID struct / shadow / registration are NOT emitted.
* clang/test/Driver/hip-profile-device-drain.hip: the -u force-link is
added for HIP+PGO and only then.
* compiler-rt/test/profile/instrprof-offload-abi-compat.c: legacy
no-op symbols still link and run.
* compiler-rt/test/profile/AMDGPU/{device-basic,device-no-kernel,
device-symbols}.hip (REQUIRES: amdgpu), plus the lit .hip suffix and
amdgpu feature gate they need.
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Replaces the host-shadow / per-CUID / hipModuleLoad-interceptor device profile
drain (added in llvm#177665) with an HSA-introspection drain that, at process
exit, walks every loaded device code object on every GPU agent, finds the
canonical __llvm_profile_sections bounds table emitted by
compiler-rt/lib/profile/InstrProfilingPlatformGPU.c, copies its
counters/data/names back to the host, and writes an arch-prefixed .profraw via
__llvm_write_custom_profile.

Host and device profile drains become fully independent. The drain runs from an
atexit handler registered in a library constructor, so device counters are
collected whether or not the host translation units were instrumented, and
without any host-side per-TU shadow, CUID matching, or module-load
interception. This fixes the cases the old 1‑1 host↔device model could not
handle: separate device-only modules (e.g. runtime hipModuleLoad), an
uninstrumented host, and multi-GPU.

Everything lives in-tree in libclang_rt.profile; there is no out-of-tree
library or LD_PRELOAD shim.

Changes
* compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp: full rewrite to the
HSA-introspection drain: dlopen'd HSA/HIP via the interception helpers,
GPU-agent enumeration, hsa_ven_amd_loader_query_segment_descriptors to get
(agent, executable) pairs, executable-symbol walk for the canonical
__llvm_profile_sections, bounds dedup, idempotent drainDevices() reached by
both the existing weak host-write hook and a constructor-registered atexit, and
collision-free target names for multi-module/non-RDC. Legacy
__llvm_profile_offload_register_* symbols kept as no-ops for ABI
compatibility. The file is guarded host-only so the GPU profile-runtime build
compiles it to an empty TU.
atexit-only; no fatal-signal handler (a crash before atexit loses device
counters; documented limitation).
* clang/lib/CodeGen/CGCUDANV.cpp: delete the entire offload-profiling
machinery (OffloadProfShadow, emitOffloadProfilingSections, the non-RDC
__hipRegisterVar + register-shadow block, the RDC offloading-entry + per-CUID
ctor block). Clang now emits nothing PGO-specific for HIP.
* clang/lib/Driver/ToolChains/Gnu.cpp: for HIP host links built with PGO,
force-link the drain object via -u__llvm_profile_hip_collect_device_data (it
is otherwise unreferenced now that the host emits no shadow).
* compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake: add amdgcn / nvptx64
to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime (now the sole
source of __llvm_profile_sections) is actually built for GPU targets.
filter_available_targets keeps host builds unaffected.

Tests
Tier A (hardware-free, run in CI):
* clang/test/CodeGenHIP/offload-pgo-sections.hip: asserts the per-CUID
struct / host shadow / register-shadow ctor are no longer emitted, while
device counter instrumentation still is.
* clang/test/Driver/hip-profile-device-drain.hip:
-u__llvm_profile_hip_collect_device_data is added for HIP+PGO links and only
then.
* compiler-rt/test/profile/instrprof-offload-abi-compat.c: objects
referencing the legacy __llvm_profile_offload_register_* symbols still link
and run against the new runtime.

Tier B (REQUIRES: amdgpu) + lit .hip suffix and amdgpu feature gate:
* compiler-rt/test/profile/AMDGPU/device-basic.hip: RDC + non-RDC; host and
device .profraw produced, merged profile contains the device kernel, and
llvm-cov reports device coverage.
* device-no-kernel.hip: instrumented HIP program that launches no kernel;
host .profraw produced, device drain is a clean no-op (no crash, no spurious
file).
* device-symbols.hip: __llvm_profile_sections present (PROTECTED, in .dynsym)
in the device ELF for RDC + non-RDC.

Test plan
* libclang_rt.profile + amdgcn device profile runtime.
* __llvm_profile_sections (PROTECTED/dynsym) for RDC and non-RDC.
* .profraw, merge, llvm-cov device coverage.
* llvm-lit.
* all_reduce_perf smoke (validated previously out-of-tree with the same
algorithm; not re-run here).

Made with Cursor