feat(compiler): Add custom LLVM pass pipeline and plugin support to JIT by fsx950223 · Pull Request #613 · ROCm/FlyDSL

fsx950223 · 2026-06-02T09:29:07Z

Summary

Adds an opt-in path to control LLVM IR optimization of the device kernel in the JIT,
including loading external LLVM pass plugins.

@flyc.jit(llvm_pass_pipeline="default<O3>,my-pass",
          llvm_pass_plugins=["/path/libMyPass.so"])

Also configurable via env vars FLYDSL_COMPILE_LLVM_PASS_PIPELINE /
FLYDSL_COMPILE_LLVM_PASS_PLUGINS (overridden by the decorator args).

A second, deeper path adds custom MIR / codegen passes (instruction
selection-time and machine-level transforms that the opt IR pipeline cannot
reach) via a small in-tree tool fly-llc:

@flyc.jit(llvm_codegen_passes=["my-mir-pass"],
          llvm_codegen_plugins=["/path/libMyMir.so"])

How it works

LLVM optimization is normally the monolithic upstream gpu-module-to-binary
(translate → optimize@O → codegen), with no hook for a custom -passes= pipeline or
--load-pass-plugin. When llvm_pass_pipeline is set, compilation takes a new sub-path
(in MlirCompiler._compile_with_llvm_opt):

Run the pre-binary fragments in-process (gpu.module → LLVM dialect).
Extract the device kernel's (pre-link) LLVM IR.
Run opt --passes="<pipeline>" [--load-pass-plugin=...] on it.
mlir-translate --import-llvm → re-wrap into a gpu.module → re-codegen via
gpu-module-to-binary at O=0 (so the user pipeline isn't re-optimized), reusing
ROCm device-lib linking + HSACO production.
Splice the produced gpu.binary back; runtime loads it unchanged.

When llvm_codegen_passes is set, the tail instead runs fly-llc (a minimal
addPassesToEmitFile driver that injects the named MIR passes pre-emit) →
ld.lld → HSACO → splice. This is what enables genuine machine-level passes
(e.g. scheduling/reordering), which opt cannot express.

Both paths require FLYDSL_COMPILE_LLVM_DIR, are entirely gated behind their
options being set (zero impact on the default path), and fold the effective
pipeline/pass names + plugin file-content hashes into the JIT cache key.

Changes

backends/rocm.py: factor out _rocdl_opts/_bin_cli_opts; add
llvm_recodegen_fragments(opt_level=0).
external_llvm.py: run_llvm_opt_then_binary(...), run_fly_llc_codegen(...),
and llvm_opt_fingerprint(...) / fly_llc_codegen_fingerprint(...).
jit_function.py: jit(llvm_pass_pipeline=, llvm_pass_plugins=, llvm_codegen_passes=, llvm_codegen_plugins=), PipelineConfig fields, _effective_* resolvers,
_compile_with_llvm_opt / _compile_with_fly_llc, cache-key folds, and
FLYDSL_COMPILE_LLVM_* added to the cache-invalidating env vars.
utils/env.py: new compile env vars.
tools/fly-llc/: new IR→object tool with injectable pre-emit MIR passes.

Test Plan

New tests (16):

File	Tier	Count	Covers
`tests/unit/test_llvm_pass_pipeline.py`	unit (no GPU)	10	recodegen fragments, decorator/env precedence, fingerprint invalidation (opt + codegen plumbing)
`tests/kernels/test_llvm_pass_plugin_e2e.py`	l2_device	2	real `opt` pass plugin injecting a device `printf` (hostcall `__ockl_printf_*`); positive runs+correct, negative fails without the plugin
`tests/kernels/test_llvm_codegen_pass_e2e.py`	l2_device	4	`fly-llc` MIR pass: runs, is required, modifies ASM (s_nop sled at entry), and reorders instructions (safe scheduler — same instruction multiset, different order, results unchanged)

Run the GPU-free plumbing tests:

python -m pytest tests/unit/test_llvm_pass_pipeline.py -v

Run the device e2e tests (need a ROCm GPU, FLYDSL_COMPILE_LLVM_DIR, a C++
compiler; the codegen tests also need the fly-llc tool and ld.lld):

FLYDSL_COMPILE_LLVM_DIR=<llvm-install> \
FLYDSL_COMPILE_FLY_LLC=<.../bin/fly-llc> \
FLYDSL_COMPILE_LLD=<.../bin/ld.lld> \
python -m pytest tests/kernels/test_llvm_pass_plugin_e2e.py \
                 tests/kernels/test_llvm_codegen_pass_e2e.py -v

All e2e tests skip cleanly (with a reason) when the GPU/toolchain is unavailable,
so they are CI-safe. Use pytest -s to see the injected device printf /
scheduling effects.

Test Result

Verified on MI308X (gfx942), LLVM/MLIR install used for FLYDSL_COMPILE_LLVM_DIR
and fly-llc, ld.lld from ROCm:

24 passed (16 new + 8 existing backend/external-codegen regression tests);
e2e tests skip cleanly when the toolchain is absent.
Custom opt plugin: device printf("threadIdx.x=%d") observed for lanes 0–63.
Custom MIR pass: confirmed it both modifies the emitted ASM (entry s_nop sled)
and reorders instructions while keeping results correct.
black (line length 120) and ruff clean on all changed files.

Notes / follow-ups

fly-llc is built in-tree (tools/fly-llc/); ld.lld is located via
FLYDSL_COMPILE_LLD or <FLYDSL_COMPILE_LLVM_DIR>/bin.
The fly-llc codegen path does not yet link ROCm device libs, so kernels that
pull in ockl/ocml from a codegen pass are out of scope for now (compute
kernels work); chaining the opt IR pipeline before fly-llc is also a follow-up.

🤖 Generated with Claude Code

Signed-off-by: fsx950223 <fsx950223@outlook.com>

Copilot

Pull request overview

Adds an opt-in JIT compilation path that lets users run a custom LLVM new-PM opt --passes=... pipeline (optionally with --load-pass-plugin=...) on the device kernel LLVM IR, then re-codegen and cache the resulting GPU binary.

Changes:

Extend @flyc.jit + compile env to accept llvm_pass_pipeline / llvm_pass_plugins, and route compilation through a new “extract LLVM IR → run opt → re-import → external re-codegen” path.
Add external-LLVM helpers for opt execution, plugin/pipeline fingerprinting for cache invalidation, and ROCm backend helpers to re-codegen at O=0.
Add unit plumbing tests plus an end-to-end pass-plugin test that builds a real LLVM plugin and exercises the full chain.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`tests/unit/test_llvm_pass_pipeline.py`	Unit tests for decorator/env precedence, ROCm recodegen fragments, and cache fingerprint behavior.
`tests/kernels/test_llvm_pass_plugin_e2e.py`	E2E test that builds and loads an LLVM pass plugin and validates the custom-pipeline JIT path (plus negative case).
`python/flydsl/utils/env.py`	Adds `FLYDSL_COMPILE_LLVM_PASS_PIPELINE` / `FLYDSL_COMPILE_LLVM_PASS_PLUGINS` compile env knobs.
`python/flydsl/compiler/jit_function.py`	Wires new config into pipeline selection, compilation path, and cache key; extends `jit()` decorator API.
`python/flydsl/compiler/external_llvm.py`	Implements `opt`-then-recodegen flow plus plugin/pipeline fingerprinting.
`python/flydsl/compiler/backends/rocm.py`	Factors ROCm pipeline option formatting and adds `llvm_recodegen_fragments(opt_level=0)` helper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
+    B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()


Signed-off-by: fsx950223 <fsx950223@outlook.com>

jli-melchior · 2026-06-03T09:13:04Z

+    errs() << "fly-llc: addISelPasses failed\n";
+    return 1;
+  }
+  PC->addMachinePasses();


Currently fly-llc only supports injecting custom passes at a single point — after addMachinePasses() and before addAsmPrinter():

PC->addISelPasses();
PC->addMachinePasses(); // ISel, RegAlloc, Scheduling all run here
// ---- only injection point ----
for (auto &name : PreEmitPass)
PM.add(PI->getNormalCtor()());
PC->setInitialized();
CG.addAsmPrinter(...);

By the time user passes run, RegAlloc and instruction scheduling have already completed. This limits the scope of custom passes to post-RA transformations (NOP insertion, peephole, instruction reordering,
etc.). Passes that need to operate at earlier pipeline stages — e.g. custom spill strategies (pre-RA), target-specific legalization (post-ISel), or custom scheduling heuristics (pre-sched2) — cannot be
supported.

LLVM's TargetPassConfig already provides virtual hooks at multiple codegen stages:

addPostRegAlloc() — after register allocation

addPreRegAlloc() — before register allocation

addPreSched2() — before the second scheduling pass

addPreEmitPass() / addPreEmitPass2() — before emission (current fly-llc insertion point)

If we can allow more insert point should be better and flexible?

Signed-off-by: fsx950223 <fsx950223@outlook.com>

feat(compiler): Add custom LLVM pass pipeline and plugin support to JIT

aa8c12a

Signed-off-by: fsx950223 <fsx950223@outlook.com>

Copilot AI review requested due to automatic review settings June 2, 2026 09:29

Copilot started reviewing on behalf of fsx950223 June 2, 2026 09:29 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

Comment thread tests/kernels/test_llvm_pass_plugin_e2e.py

Comment on lines +181 to +182

A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()

B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()

fsx950223 marked this pull request as draft June 2, 2026 09:52

fsx950223 added 2 commits June 3, 2026 05:29

feat(compiler): Add fly-llc codegen path for custom MIR passes

2204698

Signed-off-by: fsx950223 <fsx950223@outlook.com>

test(compiler): Add e2e tests for ASM-modifying codegen passes

e8275dd

Signed-off-by: fsx950223 <fsx950223@outlook.com>

jli-melchior reviewed Jun 3, 2026

View reviewed changes

fsx950223 added 2 commits June 4, 2026 08:32

feat(compiler): Add fly-llc multi-stage MIR pass insertion points

da39572

Signed-off-by: fsx950223 <fsx950223@outlook.com>

test(compiler): Extract custom pass plugins into standalone .cpp files

f5a6a63

Signed-off-by: fsx950223 <fsx950223@outlook.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(compiler): Add custom LLVM pass pipeline and plugin support to JIT#613

feat(compiler): Add custom LLVM pass pipeline and plugin support to JIT#613
fsx950223 wants to merge 5 commits into
mainfrom
worktree-llvm-pass-pipeline

fsx950223 commented Jun 2, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

jli-melchior Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
		B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()

Conversation

fsx950223 commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it works

Changes

Test Plan

Test Result

Notes / follow-ups

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

jli-melchior Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fsx950223 commented Jun 2, 2026 •

edited

Loading