Describe the issue
The rocm-6.4.2 branch does not build against Fedora 43's packaged rocMLIR, and the existing graceful-disable mechanism in src/targets/gpu/mlir.cpp does not fully work. Filing this to document the gap and ask whether there is a supported path for distros that ship a non-matched rocMLIR.
To unblock a local build I had to replace src/targets/gpu/fuse_mlir.cpp and src/targets/gpu/mlir.cpp with no-op stubs (the entire MLIR fusion pass and all MLIR introspection helpers). That works functionally on a diarization workload, but it is a sledgehammer and clearly not the right upstream fix.
Where the graceful-disable intent is today
src/targets/gpu/mlir.cpp already has an explicit ABI-version check:
#if !defined(MLIR_MIGRAPHX_DIALECT_API_VERSION) || MLIR_MIGRAPHX_DIALECT_API_VERSION != 4
#warning "Incompatible version of rocMLIR library used, disabling"
#ifndef CPPCHECK
#undef MIGRAPHX_MLIR
#endif
#else
#include <mlir-c/RegisterRocMLIR.h>
#endif
The intent is clear: if the packaged rocMLIR does not match the expected dialect API version, disable MLIR gracefully via the #else branch at the bottom of the file. Two concrete things prevent that from working on Fedora 43:
1. Unconditional rocMLIR header include before the version check
At the top of mlir.cpp, before the #ifdef MIGRAPHX_MLIR block that guards the rest of the rocMLIR includes, there is an unconditional line:
#include <mlir-c/Dialect/RockEnums.h>
If the distro packages rocMLIR without that header (or in a layout where this path does not resolve), the translation unit fails to compile before the in-source version check has any chance to run. Moving this include inside the #ifdef MIGRAPHX_MLIR guard, or at least inside the "version matches" branch, would restore the graceful path.
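A minimal sketch of the guarded include described above, not the actual contents of mlir.cpp. It assumes a compiler that supports __has_include (standard since C++17); SKETCH_HAVE_ROCK_ENUMS and rock_enums_available() are hypothetical names used only for illustration.

```cpp
// Sketch only, not the actual top of mlir.cpp.
// Include the rocMLIR header only when MLIR is enabled for this build and the
// header can actually be found, so a missing header is not a hard error.
#if defined(MIGRAPHX_MLIR) && defined(__has_include)
#if __has_include(<mlir-c/Dialect/RockEnums.h>)
#include <mlir-c/Dialect/RockEnums.h>
#define SKETCH_HAVE_ROCK_ENUMS 1 // hypothetical macro, illustration only
#endif
#endif

#ifndef SKETCH_HAVE_ROCK_ENUMS
#define SKETCH_HAVE_ROCK_ENUMS 0
#endif

// Hypothetical helper: true only if the guarded include above succeeded.
constexpr bool rock_enums_available() { return SKETCH_HAVE_ROCK_ENUMS == 1; }
```

With this shape, a tree built without MIGRAPHX_MLIR, or against a rocMLIR package that lacks the header, still compiles and simply reports the header as unavailable.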
2. CMake links rocMLIR libraries even after the compile-time #undef
Even if the version check does fire and #undef MIGRAPHX_MLIR takes effect, so that mlir.cpp compiles only the #else stubs that return empty objects, CMake has already decided (from its earlier find_package / find_library step) that rocMLIR is present and has appended the MLIR libraries to target_link_libraries on libmigraphx_gpu. The symbols the stubs would have emitted are gone, and the linker then fails on unresolved references from other translation units that included <migraphx/gpu/mlir.hpp> and still expect MLIR-bound symbols.
In other words the in-source version check is too late: by the time the preprocessor disables MIGRAPHX_MLIR, CMake has already committed to linking against rocMLIR.
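As a rough sketch of what an earlier check could look like, the fragment below evaluates the same version condition in a tiny standalone source that a configure step could compile before deciding what to link. It is an assumption that MLIR_MIGRAPHX_DIALECT_API_VERSION is defined by <mlir-c/Dialect/MIGraphX.h>; SKETCH_ROCMLIR_MATCHES, rocmlir_dialect_matches() and probe_exit_code() are hypothetical names, not part of MIGraphX or rocMLIR.

```cpp
// Sketch of a probe a configure step could compile before linking rocMLIR.
// Assumption: the version macro comes from <mlir-c/Dialect/MIGraphX.h>;
// adjust the path if the packaged header differs.
#if defined(__has_include)
#if __has_include(<mlir-c/Dialect/MIGraphX.h>)
#include <mlir-c/Dialect/MIGraphX.h>
#endif
#endif

// Same condition as the in-source check, inverted to mean "usable".
#if defined(MLIR_MIGRAPHX_DIALECT_API_VERSION) && MLIR_MIGRAPHX_DIALECT_API_VERSION == 4
#define SKETCH_ROCMLIR_MATCHES 1
#else
#define SKETCH_ROCMLIR_MATCHES 0
#endif

// Hypothetical helper: true only when the header exists and reports version 4.
constexpr bool rocmlir_dialect_matches() { return SKETCH_ROCMLIR_MATCHES == 1; }

// Exit-code convention for the probe: 0 means "safe to enable and link MLIR".
constexpr int probe_exit_code() { return rocmlir_dialect_matches() ? 0 : 1; }
```

A build could wrap this in a main that returns probe_exit_code() and add the rocMLIR libraries to the link line only when the probe returns 0, so the compile-time and link-time decisions cannot disagree.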
What I am shipping locally as a stop-gap
My local tree replaces both fuse_mlir.cpp and mlir.cpp with minimal stubs (mlir_enabled() -> false, fuse_mlir::apply as a no-op, all dump_mlir / compile_mlir / insert_mlir / get_tuning_config_mlir return empty). That lets the gpu target build and link with zero references to any rocMLIR symbol. It is not a real fix because it also disables MLIR for people who have a matched rocMLIR. Patches are under patches/03-migraphx-mlir-fuse-stub.patch and patches/04-migraphx-mlir-introspection-stub.patch in https://github.com/maherr/onnxruntime-migraphx-rdna4 if it helps as a reference point.
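For readers who do not want to open the patches, the shape of the stubs is roughly as follows. This is an illustrative sketch only: the types in the sketch namespace are stand-ins invented for this example, and the real MIGraphX signatures and the actual patches differ.

```cpp
#include <string>
#include <vector>

// Illustrative stand-ins; the real MIGraphX types and signatures differ.
namespace sketch {

struct module {};                                  // placeholder for a graph module
struct code_object { std::vector<char> binary; };  // placeholder compile result
struct tuning_config { std::vector<std::string> solutions; };

// MLIR is reported as unavailable, so callers can skip the fusion path.
inline bool mlir_enabled() { return false; }

// Introspection and compile helpers return empty values instead of calling rocMLIR.
inline std::string dump_mlir(const module&) { return {}; }
inline code_object compile_mlir(const module&) { return {}; }
inline tuning_config get_tuning_config_mlir(const module&) { return {}; }

// The fusion pass becomes a no-op that leaves the module untouched.
struct fuse_mlir
{
    void apply(module&) const {}
};

} // namespace sketch
```

Because nothing here references a rocMLIR symbol, the link line needs no MLIR libraries at all, which is also why this approach disables MLIR for everyone.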
Question for maintainers
What is the supported path for distros that ship a rocMLIR that does not match MLIR_MIGRAPHX_DIALECT_API_VERSION == 4?
1. Move the in-source version check up to CMake configure time, so the decision to link rocMLIR is made after the dialect API version is verified.
2. Keep the in-source check but gate the #include <mlir-c/Dialect/RockEnums.h> (and any other unconditional rocMLIR header) behind it, and teach CMake to honor a MIGRAPHX_MLIR_DISABLED_BY_VERSION signal.
3. Add a -DMIGRAPHX_USE_MLIR=OFF CMake option that fully disables MLIR everywhere (sources and link line), so users on non-matched rocMLIR distros have an explicit opt-out.
4. Something else the team is already considering.
If the answer is (3), I am happy to send a PR adding that option and gating both files on it. If the answer is (1) or (2), the design is yours to choose.
Reproduction environment
| Layer | Version |
| --- | --- |
| OS | Fedora 43, kernel 6.19.11 |
| MIGraphX | rocm-6.4.2 branch |
| ROCm | 6.4.4 (Fedora packages) |
| rocMLIR | Fedora 43 packaged (I do not have the exact version string on hand; happy to regather if useful) |
| GPU target | gfx1201 |
I do not have a clean saved build log from the failing configuration; the patches landed about a month ago and the log was not kept. If a repro log would be useful I can rebuild without the patches and capture it.
Upfront caveat on platform
gfx1201 is not officially supported in ROCm 6.4, and this might resolve naturally when ROCm 7.x ships. Filing anyway because the rocMLIR ABI drift is a real architectural gap that will likely recur on other distros as the rocMLIR ABI evolves, independent of GPU arch support.
Related
- no_device.cpp #error under hipcc device-compile pass: different file and different mechanism, filing separately so they can be triaged independently.
- corrupted double-linked list SIGABRT at process exit on RDNA 4 (gfx1201) with MIGraphX 6.4.2 + ORT 1.24.2 #4792 for an unrelated process-exit heap race.

CC @causten @pfultz2