Skip to content

feat(compiler): Add custom LLVM pass pipeline and plugin support to JIT#613

Draft
fsx950223 wants to merge 5 commits into
mainfrom
worktree-llvm-pass-pipeline
Draft

feat(compiler): Add custom LLVM pass pipeline and plugin support to JIT#613
fsx950223 wants to merge 5 commits into
mainfrom
worktree-llvm-pass-pipeline

Conversation

@fsx950223
Copy link
Copy Markdown
Contributor

@fsx950223 fsx950223 commented Jun 2, 2026

Summary

Adds an opt-in path to control LLVM IR optimization of the device kernel in the JIT,
including loading external LLVM pass plugins.

@flyc.jit(llvm_pass_pipeline="default<O3>,my-pass",
          llvm_pass_plugins=["/path/libMyPass.so"])

Also configurable via env vars FLYDSL_COMPILE_LLVM_PASS_PIPELINE /
FLYDSL_COMPILE_LLVM_PASS_PLUGINS (overridden by the decorator args).

A second, deeper path adds custom MIR / codegen passes (instruction
selection-time and machine-level transforms that the opt IR pipeline cannot
reach) via a small in-tree tool fly-llc:

@flyc.jit(llvm_codegen_passes=["my-mir-pass"],
          llvm_codegen_plugins=["/path/libMyMir.so"])

How it works

LLVM optimization is normally the monolithic upstream gpu-module-to-binary
(translate → optimize@O → codegen), with no hook for a custom -passes= pipeline or
--load-pass-plugin. When llvm_pass_pipeline is set, compilation takes a new sub-path
(in MlirCompiler._compile_with_llvm_opt):

  1. Run the pre-binary fragments in-process (gpu.module → LLVM dialect).
  2. Extract the device kernel's (pre-link) LLVM IR.
  3. Run opt --passes="<pipeline>" [--load-pass-plugin=...] on it.
  4. mlir-translate --import-llvm → re-wrap into a gpu.module → re-codegen via
    gpu-module-to-binary at O=0 (so the user pipeline isn't re-optimized), reusing
    ROCm device-lib linking + HSACO production.
  5. Splice the produced gpu.binary back; runtime loads it unchanged.

When llvm_codegen_passes is set, the tail instead runs fly-llc (a minimal
addPassesToEmitFile driver that injects the named MIR passes pre-emit) →
ld.lld → HSACO → splice. This is what enables genuine machine-level passes
(e.g. scheduling/reordering), which opt cannot express.

Both paths require FLYDSL_COMPILE_LLVM_DIR, are entirely gated behind their
options being set (zero impact on the default path), and fold the effective
pipeline/pass names + plugin file-content hashes into the JIT cache key.

Changes

  • backends/rocm.py: factor out _rocdl_opts/_bin_cli_opts; add
    llvm_recodegen_fragments(opt_level=0).
  • external_llvm.py: run_llvm_opt_then_binary(...), run_fly_llc_codegen(...),
    and llvm_opt_fingerprint(...) / fly_llc_codegen_fingerprint(...).
  • jit_function.py: jit(llvm_pass_pipeline=, llvm_pass_plugins=, llvm_codegen_passes=, llvm_codegen_plugins=), PipelineConfig fields, _effective_* resolvers,
    _compile_with_llvm_opt / _compile_with_fly_llc, cache-key folds, and
    FLYDSL_COMPILE_LLVM_* added to the cache-invalidating env vars.
  • utils/env.py: new compile env vars.
  • tools/fly-llc/: new IR→object tool with injectable pre-emit MIR passes.

Test Plan

New tests (16):

File Tier Count Covers
tests/unit/test_llvm_pass_pipeline.py unit (no GPU) 10 recodegen fragments, decorator/env precedence, fingerprint invalidation (opt + codegen plumbing)
tests/kernels/test_llvm_pass_plugin_e2e.py l2_device 2 real opt pass plugin injecting a device printf (hostcall __ockl_printf_*); positive runs+correct, negative fails without the plugin
tests/kernels/test_llvm_codegen_pass_e2e.py l2_device 4 fly-llc MIR pass: runs, is required, modifies ASM (s_nop sled at entry), and reorders instructions (safe scheduler — same instruction multiset, different order, results unchanged)

Run the GPU-free plumbing tests:

python -m pytest tests/unit/test_llvm_pass_pipeline.py -v

Run the device e2e tests (need a ROCm GPU, FLYDSL_COMPILE_LLVM_DIR, a C++
compiler; the codegen tests also need the fly-llc tool and ld.lld):

FLYDSL_COMPILE_LLVM_DIR=<llvm-install> \
FLYDSL_COMPILE_FLY_LLC=<.../bin/fly-llc> \
FLYDSL_COMPILE_LLD=<.../bin/ld.lld> \
python -m pytest tests/kernels/test_llvm_pass_plugin_e2e.py \
                 tests/kernels/test_llvm_codegen_pass_e2e.py -v

All e2e tests skip cleanly (with a reason) when the GPU/toolchain is unavailable,
so they are CI-safe. Use pytest -s to see the injected device printf /
scheduling effects.

Test Result

Verified on MI308X (gfx942), LLVM/MLIR install used for FLYDSL_COMPILE_LLVM_DIR
and fly-llc, ld.lld from ROCm:

  • 24 passed (16 new + 8 existing backend/external-codegen regression tests);
    e2e tests skip cleanly when the toolchain is absent.
  • Custom opt plugin: device printf("threadIdx.x=%d") observed for lanes 0–63.
  • Custom MIR pass: confirmed it both modifies the emitted ASM (entry s_nop sled)
    and reorders instructions while keeping results correct.
  • black (line length 120) and ruff clean on all changed files.

Notes / follow-ups

  • fly-llc is built in-tree (tools/fly-llc/); ld.lld is located via
    FLYDSL_COMPILE_LLD or <FLYDSL_COMPILE_LLVM_DIR>/bin.
  • The fly-llc codegen path does not yet link ROCm device libs, so kernels that
    pull in ockl/ocml from a codegen pass are out of scope for now (compute
    kernels work); chaining the opt IR pipeline before fly-llc is also a follow-up.

🤖 Generated with Claude Code

Signed-off-by: fsx950223 <fsx950223@outlook.com>
Copilot AI review requested due to automatic review settings June 2, 2026 09:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in JIT compilation path that lets users run a custom LLVM new-PM opt --passes=... pipeline (optionally with --load-pass-plugin=...) on the device kernel LLVM IR, then re-codegen and cache the resulting GPU binary.

Changes:

  • Extend @flyc.jit + compile env to accept llvm_pass_pipeline / llvm_pass_plugins, and route compilation through a new “extract LLVM IR → run opt → re-import → external re-codegen” path.
  • Add external-LLVM helpers for opt execution, plugin/pipeline fingerprinting for cache invalidation, and ROCm backend helpers to re-codegen at O=0.
  • Add unit plumbing tests plus an end-to-end pass-plugin test that builds a real LLVM plugin and exercises the full chain.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/unit/test_llvm_pass_pipeline.py Unit tests for decorator/env precedence, ROCm recodegen fragments, and cache fingerprint behavior.
tests/kernels/test_llvm_pass_plugin_e2e.py E2E test that builds and loads an LLVM pass plugin and validates the custom-pipeline JIT path (plus negative case).
python/flydsl/utils/env.py Adds FLYDSL_COMPILE_LLVM_PASS_PIPELINE / FLYDSL_COMPILE_LLVM_PASS_PLUGINS compile env knobs.
python/flydsl/compiler/jit_function.py Wires new config into pipeline selection, compilation path, and cache key; extends jit() decorator API.
python/flydsl/compiler/external_llvm.py Implements opt-then-recodegen flow plus plugin/pipeline fingerprinting.
python/flydsl/compiler/backends/rocm.py Factors ROCm pipeline option formatting and adds llvm_recodegen_fragments(opt_level=0) helper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +181 to +182
A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
@fsx950223 fsx950223 marked this pull request as draft June 2, 2026 09:52
fsx950223 added 2 commits June 3, 2026 05:29
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Comment thread tools/fly-llc/fly-llc.cpp
errs() << "fly-llc: addISelPasses failed\n";
return 1;
}
PC->addMachinePasses();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently fly-llc only supports injecting custom passes at a single point — after addMachinePasses() and before addAsmPrinter():

PC->addISelPasses();
PC->addMachinePasses(); // ISel, RegAlloc, Scheduling all run here
// ---- only injection point ----
for (auto &name : PreEmitPass)
PM.add(PI->getNormalCtor()());
PC->setInitialized();
CG.addAsmPrinter(...);

By the time user passes run, RegAlloc and instruction scheduling have already completed. This limits the scope of custom passes to post-RA transformations (NOP insertion, peephole, instruction reordering,
etc.). Passes that need to operate at earlier pipeline stages — e.g. custom spill strategies (pre-RA), target-specific legalization (post-ISel), or custom scheduling heuristics (pre-sched2) — cannot be
supported.

LLVM's TargetPassConfig already provides virtual hooks at multiple codegen stages:

  • addPostRegAlloc() — after register allocation
  • addPreRegAlloc() — before register allocation
  • addPreSched2() — before the second scheduling pass
  • addPreEmitPass() / addPreEmitPass2() — before emission (current fly-llc insertion point)

If we can allow more insert point should be better and flexible?

fsx950223 added 2 commits June 4, 2026 08:32
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants