feat(compiler): Add custom LLVM pass pipeline and plugin support to JIT #613
fsx950223 wants to merge 5 commits into
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Pull request overview
Adds an opt-in JIT compilation path that lets users run a custom LLVM new-PM opt --passes=... pipeline (optionally with --load-pass-plugin=...) on the device kernel LLVM IR, then re-codegen and cache the resulting GPU binary.
Changes:
- Extend @flyc.jit + compile env to accept llvm_pass_pipeline / llvm_pass_plugins, and route compilation through a new "extract LLVM IR → run opt → re-import → external re-codegen" path.
- Add external-LLVM helpers for opt execution, plugin/pipeline fingerprinting for cache invalidation, and ROCm backend helpers to re-codegen at O=0.
- Add unit plumbing tests plus an end-to-end pass-plugin test that builds a real LLVM plugin and exercises the full chain.
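The "run opt" step in that flow can be pictured as a plain subprocess call. The sketch below is only an illustration: `build_opt_command` and `run_opt` are hypothetical names, not the helpers added in external_llvm.py, and the only `opt` flags used are the standard `--passes`, `--load-pass-plugin`, `-S`, and `-o`.

```python
from __future__ import annotations

import subprocess
from typing import Sequence


def build_opt_command(
    opt_path: str,
    pipeline: str,
    plugins: Sequence[str],
    input_ll: str,
    output_ll: str,
) -> list[str]:
    """Build the argv for one new-pass-manager opt run (hypothetical helper)."""
    cmd = [opt_path]
    # Load plugins first so pass names they register can appear in --passes.
    cmd += [f"--load-pass-plugin={p}" for p in plugins]
    # -S keeps the output as textual LLVM IR so it can be re-imported later.
    cmd += [f"--passes={pipeline}", "-S", input_ll, "-o", output_ll]
    return cmd


def run_opt(cmd: Sequence[str]) -> None:
    """Run opt and surface its stderr if the pipeline fails."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"opt failed ({proc.returncode}):\n{proc.stderr}")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with pipelines such as `default<O2>`.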
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/test_llvm_pass_pipeline.py | Unit tests for decorator/env precedence, ROCm recodegen fragments, and cache fingerprint behavior. |
| tests/kernels/test_llvm_pass_plugin_e2e.py | E2E test that builds and loads an LLVM pass plugin and validates the custom-pipeline JIT path (plus negative case). |
| python/flydsl/utils/env.py | Adds FLYDSL_COMPILE_LLVM_PASS_PIPELINE / FLYDSL_COMPILE_LLVM_PASS_PLUGINS compile env knobs. |
| python/flydsl/compiler/jit_function.py | Wires new config into pipeline selection, compilation path, and cache key; extends jit() decorator API. |
| python/flydsl/compiler/external_llvm.py | Implements opt-then-recodegen flow plus plugin/pipeline fingerprinting. |
| python/flydsl/compiler/backends/rocm.py | Factors ROCm pipeline option formatting and adds llvm_recodegen_fragments(opt_level=0) helper. |
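The decorator/env precedence that the unit tests cover can be sketched as follows. This is a minimal illustration, not the PR's `_effective_*` resolvers: the function names are hypothetical, and splitting the plugin variable on `os.pathsep` is an assumption, since the separator is not stated in this PR.

```python
import os
from typing import List, Mapping, Optional, Sequence


def effective_pass_pipeline(
    decorator_value: Optional[str],
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Decorator argument wins; otherwise fall back to the env var."""
    if decorator_value is not None:
        return decorator_value
    # An unset or empty variable means "no custom pipeline".
    return environ.get("FLYDSL_COMPILE_LLVM_PASS_PIPELINE") or None


def effective_pass_plugins(
    decorator_value: Optional[Sequence[str]],
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Same precedence for the plugin list."""
    if decorator_value is not None:
        return list(decorator_value)
    raw = environ.get("FLYDSL_COMPILE_LLVM_PASS_PLUGINS", "")
    # ASSUMPTION: entries are separated by os.pathsep; the real format may differ.
    return [p for p in raw.split(os.pathsep) if p]
```

Taking the environment as a parameter keeps the resolver testable without mutating `os.environ`.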
    A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
    B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
        errs() << "fly-llc: addISelPasses failed\n";
        return 1;
    }
    PC->addMachinePasses();
Currently fly-llc only supports injecting custom passes at a single point, after addMachinePasses() and before addAsmPrinter():

    PC->addISelPasses();     // instruction selection runs here
    PC->addMachinePasses();  // RegAlloc and scheduling all run here
    // ---- only injection point ----
    for (auto &name : PreEmitPass)
      PM.add(PI->getNormalCtor()());  // PI: PassInfo for name (lookup elided)
    PC->setInitialized();
    CG.addAsmPrinter(...);
By the time user passes run, RegAlloc and instruction scheduling have already completed. This limits the scope of custom passes to post-RA transformations (NOP insertion, peephole, instruction reordering,
etc.). Passes that need to operate at earlier pipeline stages — e.g. custom spill strategies (pre-RA), target-specific legalization (post-ISel), or custom scheduling heuristics (pre-sched2) — cannot be
supported.
LLVM's TargetPassConfig already provides virtual hooks at multiple codegen stages:
- addPostRegAlloc() — after register allocation
- addPreRegAlloc() — before register allocation
- addPreSched2() — before the second scheduling pass
- addPreEmitPass() / addPreEmitPass2() — before emission (current fly-llc insertion point)
Would it be better, and more flexible, to allow more insertion points?
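To make the ordering argument concrete, here is a toy Python model of the codegen pipeline. It is purely illustrative: the stage list is a coarse simplification of what TargetPassConfig actually schedules, and `plan_pipeline` and the pass names `my-spill` / `my-nop` are hypothetical, not part of fly-llc.

```python
from typing import Dict, List

# Coarse stage order; hook names mirror the TargetPassConfig hooks listed above.
STAGES = [
    "isel",
    "addPreRegAlloc",
    "regalloc",
    "addPostRegAlloc",
    "addPreSched2",
    "post-ra-sched",
    "addPreEmitPass",
    "asm-printer",
]
HOOKS = {"addPreRegAlloc", "addPostRegAlloc", "addPreSched2", "addPreEmitPass"}


def plan_pipeline(user_passes: Dict[str, List[str]]) -> List[str]:
    """Expand each hook into the user passes requested for it, in stage order."""
    unknown = set(user_passes) - HOOKS
    if unknown:
        raise ValueError(f"unknown insertion point(s): {sorted(unknown)}")
    plan: List[str] = []
    for stage in STAGES:
        if stage in HOOKS:
            plan.extend(user_passes.get(stage, []))
        else:
            plan.append(stage)
    return plan


plan = plan_pipeline({"addPreRegAlloc": ["my-spill"], "addPreEmitPass": ["my-nop"]})
# A pre-emit pass lands after register allocation; a pre-RA pass lands before it.
print(plan)
```

With only the pre-emit hook available, every user pass ends up after `regalloc` and `post-ra-sched` in this model, which is the limitation described above.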
Summary
Adds an opt-in path to control LLVM IR optimization of the device kernel in the JIT, including loading external LLVM pass plugins.
Also configurable via env vars FLYDSL_COMPILE_LLVM_PASS_PIPELINE / FLYDSL_COMPILE_LLVM_PASS_PLUGINS (overridden by the decorator args).
A second, deeper path adds custom MIR / codegen passes (instruction selection-time and machine-level transforms that the opt IR pipeline cannot reach) via a small in-tree tool, fly-llc.
How it works
LLVM optimization is normally the monolithic upstream gpu-module-to-binary (translate → optimize@O → codegen), with no hook for a custom -passes= pipeline or --load-pass-plugin. When llvm_pass_pipeline is set, compilation takes a new sub-path (in MlirCompiler._compile_with_llvm_opt):
1. Extract the device kernel's LLVM IR.
2. Run opt --passes="<pipeline>" [--load-pass-plugin=...] on it.
3. mlir-translate --import-llvm → re-wrap into a gpu.module → re-codegen via gpu-module-to-binary at O=0 (so the user pipeline isn't re-optimized), reusing ROCm device-lib linking + HSACO production.
4. Splice the gpu.binary back; runtime loads it unchanged.
When llvm_codegen_passes is set, the tail instead runs fly-llc (a minimal addPassesToEmitFile driver that injects the named MIR passes pre-emit) → ld.lld → HSACO → splice. This is what enables genuine machine-level passes (e.g. scheduling/reordering), which opt cannot express.
Both paths require FLYDSL_COMPILE_LLVM_DIR, are entirely gated behind their options being set (zero impact on the default path), and fold the effective pipeline/pass names + plugin file-content hashes into the JIT cache key.
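The cache-key fold can be illustrated with a short sketch. It assumes SHA-256 and a hypothetical helper name `pipeline_fingerprint`; the actual `llvm_opt_fingerprint(...)` in this PR may be structured differently. The point it shows is that hashing plugin file contents, rather than paths, makes a rebuilt plugin invalidate the cache.

```python
import hashlib
from typing import Iterable


def pipeline_fingerprint(pipeline: str, plugin_paths: Iterable[str]) -> str:
    """Digest of the pipeline string plus the bytes of every plugin file."""
    h = hashlib.sha256()
    h.update(pipeline.encode("utf-8"))
    for path in plugin_paths:
        # Separator byte keeps the pipeline text and plugin digests from running together.
        h.update(b"\0")
        with open(path, "rb") as f:
            # Hash the content, not the path: rebuilding a plugin in place changes the key.
            h.update(hashlib.sha256(f.read()).digest())
    return h.hexdigest()
```

The resulting hex string can then be mixed into whatever tuple or string already forms the JIT cache key.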
Changes
- backends/rocm.py: factor out _rocdl_opts / _bin_cli_opts; add llvm_recodegen_fragments(opt_level=0).
- external_llvm.py: run_llvm_opt_then_binary(...), run_fly_llc_codegen(...), and llvm_opt_fingerprint(...) / fly_llc_codegen_fingerprint(...).
- jit_function.py: jit(llvm_pass_pipeline=, llvm_pass_plugins=, llvm_codegen_passes=, llvm_codegen_plugins=), PipelineConfig fields, _effective_* resolvers, _compile_with_llvm_opt / _compile_with_fly_llc, cache-key folds, and FLYDSL_COMPILE_LLVM_* added to the cache-invalidating env vars.
- utils/env.py: new compile env vars.
- tools/fly-llc/: new IR→object tool with injectable pre-emit MIR passes.
Test Plan
New tests (16):
- tests/unit/test_llvm_pass_pipeline.py
- tests/kernels/test_llvm_pass_plugin_e2e.py: opt pass plugin injecting a device printf (hostcall __ockl_printf_*); positive runs+correct, negative fails without the plugin
- tests/kernels/test_llvm_codegen_pass_e2e.py: fly-llc MIR pass: runs, is required, modifies ASM (s_nop sled at entry), and reorders instructions (safe scheduler: same instruction multiset, different order, results unchanged)
The plumbing tests are GPU-free. The device e2e tests need a ROCm GPU, FLYDSL_COMPILE_LLVM_DIR, and a C++ compiler; the codegen tests also need the fly-llc tool and ld.lld.
All e2e tests skip cleanly (with a reason) when the GPU/toolchain is unavailable, so they are CI-safe. Use pytest -s to see the injected device printf/scheduling effects.
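A "skip with a reason" check of this kind might look like the sketch below. The helper name `toolchain_skip_reason` is hypothetical, and looking for `opt` under `<FLYDSL_COMPILE_LLVM_DIR>/bin` is an assumption about the install layout; the real tests may probe the toolchain and the GPU differently.

```python
import os
import shutil
from typing import Mapping, Optional


def toolchain_skip_reason(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return why the e2e tests cannot run, or None if the toolchain looks usable."""
    llvm_dir = environ.get("FLYDSL_COMPILE_LLVM_DIR")
    if not llvm_dir:
        return "FLYDSL_COMPILE_LLVM_DIR is not set"
    # ASSUMPTION: opt lives in <FLYDSL_COMPILE_LLVM_DIR>/bin.
    opt_path = os.path.join(llvm_dir, "bin", "opt")
    if not os.path.isfile(opt_path):
        return f"opt not found at {opt_path}"
    # Any common C++ driver on PATH is enough to build the test plugin.
    if not any(shutil.which(cxx) for cxx in ("c++", "g++", "clang++")):
        return "no C++ compiler found on PATH"
    return None
```

In a pytest module the result can feed `pytest.mark.skipif(reason is not None, reason=reason or "")`, so a missing toolchain shows up as a skip with an explanation rather than a failure.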
Test Result
Verified on MI308X (gfx942), with an LLVM/MLIR install used for FLYDSL_COMPILE_LLVM_DIR and fly-llc, and ld.lld from ROCm:
- e2e tests skip cleanly when the toolchain is absent.
- opt plugin: device printf("threadIdx.x=%d") observed for lanes 0–63.
- Codegen pass: modifies ASM (s_nop sled) and reorders instructions while keeping results correct.
- black (line length 120) and ruff clean on all changed files.
Notes / follow-ups
- fly-llc is built in-tree (tools/fly-llc/); ld.lld is located via FLYDSL_COMPILE_LLD or <FLYDSL_COMPILE_LLVM_DIR>/bin.
- The fly-llc codegen path does not yet link ROCm device libs, so kernels that pull in ockl/ocml from a codegen pass are out of scope for now (compute kernels work); chaining the opt IR pipeline before fly-llc is also a follow-up.
🤖 Generated with Claude Code