[Build] rocMLIR ABI version drift: graceful-disable path in mlir.cpp is undermined by unconditional header include and CMake link behavior #4800

@maherr

Describe the issue

The rocm-6.4.2 branch does not build against Fedora 43's packaged rocMLIR, and the existing graceful-disable mechanism in src/targets/gpu/mlir.cpp does not fully work. Filing this to document the gap and ask whether there is a supported path for distros that ship a non-matched rocMLIR.

To unblock a local build I had to replace src/targets/gpu/fuse_mlir.cpp and src/targets/gpu/mlir.cpp with no-op stubs (the entire MLIR fusion pass and all MLIR introspection helpers). That works functionally on a diarization workload, but it is a sledgehammer and clearly not the right upstream fix.

Where the graceful-disable intent is today

src/targets/gpu/mlir.cpp already has an explicit ABI-version check:

#if !defined(MLIR_MIGRAPHX_DIALECT_API_VERSION) || MLIR_MIGRAPHX_DIALECT_API_VERSION != 4
#warning "Incompatible version of rocMLIR library used, disabling"
#ifndef CPPCHECK
#undef MIGRAPHX_MLIR
#endif
#else
#include <mlir-c/RegisterRocMLIR.h>
#endif

The intent is clear: if the packaged rocMLIR does not match the expected dialect API version, MIGRAPHX_MLIR is undefined and the stub implementations in the #else branch at the bottom of the file take over, disabling MLIR gracefully. Two concrete things prevent that from working on Fedora 43:

1. Unconditional rocMLIR header include before the version check

At the top of mlir.cpp, before the #ifdef MIGRAPHX_MLIR block that guards the rest of the rocMLIR includes, there is an unconditional line:

#include <mlir-c/Dialect/RockEnums.h>

If the distro packages rocMLIR but without that header (or at a layout where this path does not resolve), the translation unit fails to compile before the in-source version check has any chance to run. Moving this include inside the #ifdef MIGRAPHX_MLIR guard, or at least inside the "version matches" branch, would restore the graceful path.
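A minimal sketch of what the gated include could look like, written as a standalone translation unit rather than as a patch to mlir.cpp. SKETCH_MLIR_ACTIVE and sketch::mlir_active() are hypothetical names used only for illustration, and the sketch assumes the header that defines MLIR_MIGRAPHX_DIALECT_API_VERSION has already been included by the time the check runs, as in the existing code.

```cpp
// Sketch only, not the actual mlir.cpp.
// SKETCH_MLIR_ACTIVE and sketch::mlir_active() are hypothetical names.

#ifdef MIGRAPHX_MLIR
// ASSUMPTION: whichever header defines MLIR_MIGRAPHX_DIALECT_API_VERSION
// has been included above this point, as in the existing version check.
#if defined(MLIR_MIGRAPHX_DIALECT_API_VERSION) && MLIR_MIGRAPHX_DIALECT_API_VERSION == 4
// RockEnums.h now sits next to RegisterRocMLIR.h inside the
// "version matches" branch, so it is only pulled in for a matched rocMLIR.
#include <mlir-c/Dialect/RockEnums.h>
#include <mlir-c/RegisterRocMLIR.h>
#define SKETCH_MLIR_ACTIVE 1
#else
// Version mismatch (or macro missing): fall through to the stub path.
#undef MIGRAPHX_MLIR
#endif
#endif

#ifndef SKETCH_MLIR_ACTIVE
#define SKETCH_MLIR_ACTIVE 0
#endif

namespace sketch {

// True only when MLIR was requested and the dialect API version matched.
constexpr bool mlir_active() { return SKETCH_MLIR_ACTIVE == 1; }

} // namespace sketch
```

With this ordering, a build where MIGRAPHX_MLIR is undefined, or where the version check fails, never asks the preprocessor to resolve any rocMLIR-specific header.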

2. CMake links rocMLIR libraries even after the compile-time #undef

Even when the version check fires and #undef MIGRAPHX_MLIR takes effect, the #undef applies only to mlir.cpp, whose #else stubs at the bottom of the file return empty objects. CMake has already decided, from its earlier find_package / find_library step, that rocMLIR is present, and it appends the MLIR libraries to target_link_libraries on libmigraphx_gpu. The MLIR-bound symbols are gone from mlir.cpp, and the linker then fails on unresolved references from other translation units that included <migraphx/gpu/mlir.hpp> and still expect them.

In other words the in-source version check is too late: by the time the preprocessor disables MIGRAPHX_MLIR, CMake has already committed to linking against rocMLIR.

What I am shipping locally as a stop-gap

My local tree replaces both fuse_mlir.cpp and mlir.cpp with minimal stubs: mlir_enabled() returns false, fuse_mlir::apply is a no-op, and dump_mlir / compile_mlir / insert_mlir / get_tuning_config_mlir all return empty values. That lets the gpu target build and link with zero references to any rocMLIR symbol. It is not a real fix because it also disables MLIR for people who have a matched rocMLIR. The patches are patches/03-migraphx-mlir-fuse-stub.patch and patches/04-migraphx-mlir-introspection-stub.patch in https://github.com/maherr/onnxruntime-migraphx-rdna4, in case they help as a reference point.
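For readers who do not want to open the patches, the shape of the stubs is roughly the following. This is an illustrative sketch, not the patch contents: the module, code_object and tuning_config types are hypothetical stand-ins, the real MIGraphX signatures take more parameters, and insert_mlir is left out because its return type depends on MIGraphX instruction references.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical stand-ins; the real MIGraphX types and signatures differ.
struct module {};
struct code_object { std::vector<char> binary; };
struct tuning_config { std::vector<std::string> solutions; };

// Reports that MLIR support is compiled out.
constexpr bool mlir_enabled() { return false; }

// Introspection and compilation helpers return empty values.
inline std::string dump_mlir(const module&) { return {}; }
inline code_object compile_mlir(const module&) { return {}; }
inline tuning_config get_tuning_config_mlir(const module&) { return {}; }

// The fusion pass leaves the module untouched.
struct fuse_mlir
{
    void apply(module&) const {}
};

// Usage example: every stub is inert and references no rocMLIR symbol.
inline void smoke_check()
{
    module m;
    fuse_mlir{}.apply(m);
    assert(!mlir_enabled());
    assert(dump_mlir(m).empty());
    assert(compile_mlir(m).binary.empty());
    assert(get_tuning_config_mlir(m).solutions.empty());
}

} // namespace sketch
```

Because nothing here includes a rocMLIR header or calls into it, the resulting object files can link without any MLIR library on the link line.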

Question for maintainers

What is the supported path for distros that ship a rocMLIR that does not match MLIR_MIGRAPHX_DIALECT_API_VERSION == 4?

  1. Move the in-source version check up to CMake configure time, so the decision to link rocMLIR is made after the dialect API version is verified.
  2. Keep the in-source check but gate the #include <mlir-c/Dialect/RockEnums.h> (and any other unconditional rocMLIR header) behind it, and teach CMake to honor a MIGRAPHX_MLIR_DISABLED_BY_VERSION signal.
  3. Add a -DMIGRAPHX_USE_MLIR=OFF CMake option that fully disables MLIR everywhere (sources and link line), so users on non-matched rocMLIR distros have an explicit opt-out.
  4. Something else the team is already considering.

If the answer is (3), I am happy to send a PR adding that option and gating both files on it. If the answer is (1) or (2), the design is yours to choose.
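To make option (1) concrete, here is a rough sketch of a probe that a configure step could compile before deciding whether to link rocMLIR. It is only an illustration under stated assumptions: the issue does not say which header defines MLIR_MIGRAPHX_DIALECT_API_VERSION, so the use of <mlir-c/Dialect/MIGraphX.h> below is a guess, and every sketch:: name and SKETCH_ macro is hypothetical.

```cpp
// Probe sketch: classifies the installed rocMLIR at compile time.
// ASSUMPTION: <mlir-c/Dialect/MIGraphX.h> is the header that defines
// MLIR_MIGRAPHX_DIALECT_API_VERSION; adjust if it lives elsewhere.
#ifdef __has_include
#if __has_include(<mlir-c/Dialect/MIGraphX.h>)
#include <mlir-c/Dialect/MIGraphX.h>
#define SKETCH_HEADER_FOUND 1
#endif
#endif

#ifndef SKETCH_HEADER_FOUND
#define SKETCH_HEADER_FOUND 0
#endif

namespace sketch {

enum class rocmlir_status
{
    header_missing,        // the probed header does not resolve
    version_macro_missing, // header found, but no version macro
    version_mismatch,      // macro present, value is not the expected one
    compatible             // macro present and equal to the expected value
};

// Dialect API version that the in-source check in the issue expects.
constexpr int expected_api_version = 4;

constexpr rocmlir_status probe_rocmlir()
{
#if !SKETCH_HEADER_FOUND
    return rocmlir_status::header_missing;
#elif !defined(MLIR_MIGRAPHX_DIALECT_API_VERSION)
    return rocmlir_status::version_macro_missing;
#else
    return MLIR_MIGRAPHX_DIALECT_API_VERSION == expected_api_version
               ? rocmlir_status::compatible
               : rocmlir_status::version_mismatch;
#endif
}

// 0 means "safe to enable and link MLIR"; 1 means "build without it".
constexpr int probe_exit_code()
{
    return probe_rocmlir() == rocmlir_status::compatible ? 0 : 1;
}

} // namespace sketch
```

A one-line main that returns sketch::probe_exit_code() could be built and run from CMake (for example with try_run), and both the MIGRAPHX_MLIR compile definition and the MLIR entries in target_link_libraries could then be gated on the same result.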

Reproduction environment

| Layer | Version |
| --- | --- |
| OS | Fedora 43, kernel 6.19.11 |
| MIGraphX | rocm-6.4.2 branch |
| ROCm | 6.4.4 (Fedora packages) |
| rocMLIR | Fedora 43 packaged (I do not have the exact version string on hand; happy to regather if useful) |
| GPU target | gfx1201 |

I do not have a clean saved build log from the failing configuration; the patches landed about a month ago and the log was not kept. If a repro log would be useful I can rebuild without the patches and capture it.

Upfront caveat on platform

gfx1201 is not officially supported in ROCm 6.4, and this might resolve naturally when ROCm 7.x ships. Filing anyway because the rocMLIR ABI drift is a real architectural gap that will likely recur on other distros as the rocMLIR ABI evolves, independent of GPU arch support.

Related

CC @causten @pfultz2
