[PGO][HIP] Decouple device profile drain via HSA introspection #2743
Draft
lfmeadow wants to merge 1 commit into
Replace the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain that walks every loaded
device code object at process exit, finds the canonical
__llvm_profile_sections bounds table emitted by
InstrProfilingPlatformGPU.c, D2H-copies its counters/data/names, and
writes an arch-prefixed .profraw via __llvm_write_custom_profile.
Host and device drains are now fully independent: the drain runs from an
atexit handler registered in a library constructor, so device counters
are collected whether or not the host TUs were instrumented and without
any host-side per-TU shadow, CUID matching, or module-load interception.
This fixes the cases the old 1-1 host<->device model could not handle
(separate device-only modules, uninstrumented host, multi-GPU).
* compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp: full rewrite
to the HSA-introspection drain (dlopen'd HSA/HIP, agent enumeration,
hsa_ven_amd_loader segment descriptors, executable symbol walk,
bounds dedup, idempotent drainDevices + atexit, collision-free
target names). Legacy __llvm_profile_offload_register_* symbols kept
as no-ops for ABI compatibility. Guarded host-only for GPU builds.
* clang/lib/CodeGen/CGCUDANV.cpp: delete the entire offload-profiling
machinery (OffloadProfShadow, emitOffloadProfilingSections, both
shadow-registration sites). Clang emits nothing PGO-specific for HIP.
* clang/lib/Driver/ToolChains/Gnu.cpp: for HIP host links built with
PGO, force-link the drain object via
-u__llvm_profile_hip_collect_device_data (it is otherwise
unreferenced now that the host emits no shadow).
* compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake: add amdgcn /
nvptx64 to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime
(the sole source of __llvm_profile_sections after this change) is
built for GPU targets.
Tests:
* clang/test/CodeGenHIP/offload-pgo-sections.hip: now asserts the
per-CUID struct / shadow / registration are NOT emitted.
* clang/test/Driver/hip-profile-device-drain.hip: the -u force-link is
added for HIP+PGO and only then.
* compiler-rt/test/profile/instrprof-offload-abi-compat.c: legacy
no-op symbols still link and run.
* compiler-rt/test/profile/AMDGPU/{device-basic,device-no-kernel,
device-symbols}.hip (REQUIRES: amdgpu) + lit .hip suffix and amdgpu
feature gate.
Co-authored-by: Cursor <cursoragent@cursor.com>
Collaborator
Does this need to go upstream at some point?
Status
Draft: for examination, not for merge yet. Supersedes draft PR #2714
(same logical patch content, rebased onto current amd-staging; the
intermediate multi-device fix in commit b1b20686afe becomes redundant
under this drain, and its tests under compiler-rt/test/profile/GPU/
continue to work via the retained ABI forwarder).
This revision also addresses the static-review feedback gathered against
#2714 — see "Review-driven changes" below.
Summary
Replaces the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain. At process exit the
drain walks every loaded HSA code object on every GPU agent, finds the
canonical __llvm_profile_sections bounds table emitted by
compiler-rt/lib/profile/InstrProfilingPlatformGPU.c, D2H-copies its
counters/data/names back to the host, and writes an arch-prefixed
.profraw via __llvm_write_custom_profile.
What this changes vs. the existing approach
* Host and device drains are fully decoupled. The device drain runs
from an atexit handler installed by a library constructor in
libclang_rt.profile. Device counters are collected whether or not
the host TUs were instrumented and without any host-side per-TU
shadow, CUID matching, or module-load interception.
* Cases the old 1-1 host↔device model could not handle now work:
separate device-only modules loaded at runtime
(hipModuleLoad / hsa_executable_*), an uninstrumented host, and
multi-GPU. Multi-GPU no longer relies on hipGetSymbolAddress,
removing the comgr-at-atexit null deref that the predecessor
band-aid worked around by restricting collection to a single device.
* Clang stops emitting any PGO-specific machinery for HIP.
CGCUDANV.cpp loses the offload-profiling shadow/registration code
entirely. The driver gains a one-line -u __llvm_profile_runtime
force-link so the device profile runtime is pulled in for HIP+PGO
links.
* ABI compatibility: legacy
__llvm_profile_offload_register_{shadow,section_shadow,dynamic_module}
and __llvm_profile_hip_collect_device_data are retained as
no-ops / forwarders so binaries compiled against the previous runtime
still link and produce output.
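The first and last points above interact: the drain can be reached both from the constructor-installed atexit handler and from a retained legacy entry point, so it has to tolerate repeated calls. A minimal sketch of that shape follows, assuming a GCC/Clang toolchain for `__attribute__((constructor))`. `DrainInProgress` and `DrainCompleted` are the latch names this PR uses; `walkLoadedCodeObjects`, `PendingCodeObjects`, `WalkCount`, `HandlerInstalled`, and `sketch_collect_device_data` are hypothetical stand-ins, and the HSA walk is replaced by a counter. It is not the code in InstrProfilingPlatformROCm.cpp.

```cpp
#include <cassert>
#include <cstdlib>

namespace drain_sketch {

// DrainInProgress / DrainCompleted mirror the latch names used in this PR;
// every other name here is hypothetical.
static bool DrainInProgress = false;
static bool DrainCompleted = false;
static bool HandlerInstalled = false;
static int WalkCount = 0;          // number of times the walk really ran
static int PendingCodeObjects = 0; // stand-in for loaded, instrumented code objects

// Stand-in for the HSA walk: reports how many code objects it drained.
static int walkLoadedCodeObjects() {
  ++WalkCount;
  int Drained = PendingCodeObjects;
  PendingCodeObjects = 0;
  return Drained;
}

// Safe to call more than once; latches "done" only after data was drained.
static int drainDevices() {
  if (DrainCompleted || DrainInProgress)
    return 0;
  DrainInProgress = true;
  int Drained = walkLoadedCodeObjects();
  assert(Drained >= 0 && "walk must not report a negative count");
  DrainInProgress = false;
  if (Drained > 0)
    DrainCompleted = true; // an empty walk stays retryable
  return 0;
}

static void drainAtExit() { (void)drainDevices(); }

// Runs when the library is loaded (GCC/Clang extension), before main().
__attribute__((constructor)) static void installDrainHandler() {
  HandlerInstalled = (std::atexit(drainAtExit) == 0);
}

} // namespace drain_sketch

// Hypothetical legacy-style entry point that simply forwards to the drain.
extern "C" int sketch_collect_device_data() {
  return drain_sketch::drainDevices();
}
```

In this sketch an early call that finds nothing leaves the latch clear, so the later atexit call still performs the walk; once a walk has drained data, further calls return without walking again.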
Review-driven changes (vs the #2714 snapshot)
* drainDevices() idempotency: split into DrainInProgress /
DrainCompleted; only latches "done" after a successful walk that
actually drained data. Transient no-ops ("HSA/HIP not yet
resolvable", "no GPU agents", "no loaded segments", "no instrumented
sections") stay retryable so a late atexit call still picks up code
objects that loaded after an early host-write trigger. Covered by
the new compiler-rt/test/profile/AMDGPU/device-early-collect.hip.
* HIP runtime resolution: tries RTLD_DEFAULT first (catches the
common case of HIP already being in the process namespace, including
runtime-only ROCm installs without an unversioned dev symlink), then
falls back to dlopen of libamdhip64.so.{7,6,5,4} and finally the
unversioned .so.
* Section bounds validation: processDeviceSections now does
uintptr_t-based size math, rejects End < Begin and per-section
spans above 256 MiB, requires
DataSize % sizeof(__llvm_profile_data) == 0, and per-record
validates that each CounterPtr resolves inside the copied counters
region (out-of-range entries are zeroed and warned about instead of
producing a .profraw that points at unrelated memory).
* HSA symbol-iter error path: a non-SUCCESS / non-INFO_BREAK
return from hsa_executable_iterate_agent_symbols is now reported with
a warning and reflected in the drain's exit status.
* lit gate tightened: compiler-rt/test/profile/lit.cfg.py now
requires /dev/kfd plus a usable HIP install (probed via
$ROCM_PATH / /opt/rocm) plus the amdgcn device profile runtime
in the resource directory before enabling the hip / amdgpu
features. Exports %hip_lib_path and %amdgpu_arch so the new
AMDGPU/*.hip tests stay portable (and consistent with the existing
GPU/instrprof-hip-* tests).
Validation
* Built clean against current amd-staging on Linux x86_64 + AMDGPU:
host clang/lld/compiler-rt, and libclang_rt.profile-amdgcn.a
(device profile runtime) for the amdgcn target.
* End-to-end exercised with RCCL under its --enable-device-coverage
fast build path (debug + -gline-tables-only -O1, default device
linker, gfx90a, ~10 min incremental relink against the rebased runtime):
  * librccl.so carries the expected __llvm_prf_{names,cnts,data,vnds}
  + __llvm_covfun + __llvm_covmap sections and the
  __llvm_profile_hip_collect_device_data forwarder.
  * all_reduce_perf -g 2 produces 1 host .profraw + 26 device
  gfx90a.*.profraw per rank (one per loaded HSA executable), all
  LLVM raw profile version 10.
  * "HIP resolved via existing process namespace (RTLD_DEFAULT)" and
  "walk complete: agents=4 pairs=28 found=26 drained=26 iter-failures=0";
  zero "out-of-range counter pointer" warnings.
  * llvm-profdata merge + llvm-cov report produce realistic per-file
  coverage for both the host librccl.so and the per-arch device ELF.
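For readers checking the "Section bounds validation" item under "Review-driven changes", the checks listed there can be sketched roughly as below. This is an illustration under stated assumptions, not the code in processDeviceSections: `DataRecord` is a hypothetical two-field stand-in for `__llvm_profile_data`, `CounterPtr` is treated as an already-resolved absolute address, "zeroed" is interpreted as clearing that pointer field, and the helper names (`spanSize`, `validDataSize`, `scrubCounterPtrs`) are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

namespace bounds_sketch {

constexpr std::uintptr_t MaxSectionSpan = 256u * 1024u * 1024u; // 256 MiB cap

// Hypothetical stand-in for __llvm_profile_data (the real record has more
// fields); CounterPtr is assumed to be an already-resolved address here.
struct DataRecord {
  std::uintptr_t CounterPtr;
  std::uint32_t NumCounters;
};

// uintptr_t-based size math: reject End < Begin and spans above the cap.
static bool spanSize(std::uintptr_t Begin, std::uintptr_t End,
                     std::uintptr_t &Size) {
  if (End < Begin)
    return false;
  Size = End - Begin;
  return Size <= MaxSectionSpan;
}

// The data section must hold a whole number of records.
static bool validDataSize(std::uintptr_t DataSize) {
  return DataSize % sizeof(DataRecord) == 0;
}

// Clear counter pointers that fall outside [CntBegin, CntEnd) and warn;
// returns how many records were cleared.
static std::size_t scrubCounterPtrs(DataRecord *Records, std::size_t N,
                                    std::uintptr_t CntBegin,
                                    std::uintptr_t CntEnd) {
  assert(CntBegin <= CntEnd && "counters region must be validated first");
  std::size_t Cleared = 0;
  for (std::size_t I = 0; I < N; ++I) {
    std::uintptr_t P = Records[I].CounterPtr;
    if (P >= CntBegin && P < CntEnd)
      continue;
    std::fprintf(stderr, "warning: record %zu: out-of-range counter pointer 0x%llx\n",
                 I, static_cast<unsigned long long>(P));
    Records[I].CounterPtr = 0;
    ++Cleared;
  }
  return Cleared;
}

} // namespace bounds_sketch
```

A caller would run `spanSize` on each of the counters, data, and names ranges before copying anything from the device, then `validDataSize`, and only then scrub the copied records.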