[SOT] add direct kernel fast runtime #79043
Pull request overview
This PR introduces a "direct kernel fast runtime" path for SOT (Symbolic Opcode Translator) in Paddle. When the new SOT_ENABLE_FAST_KERNEL_CODEGEN environment variable is enabled, sot_call lowers the PIR program to the pd_kernel dialect and code-generates a Python function that directly invokes per-op pybind kernels (newly exposed under core.eager.kernel_ops), bypassing the run_program executor. The path intentionally does not fall back to run_program; it raises a RuntimeError if autograd is active or required metadata is missing.
Changes:
- Adds a codegen mode (`--direct_kernel`) to `python_c_gen.py` plus new CMake wiring that produces `kernel_op_function.{h,cc}`, exposing direct kernel APIs and a `get_kernel_ops_args_info` registry under `core.eager.kernel_ops`.
- Adds manual pybind helpers in `kernel_op_function_manual.{h,cc}` (incl. `get_phi_kernel_op_info` and a `run_cinn_jit_kernel` launcher), exposes `apply_pd_op_to_kernel_pass` in `pir.cc`, and binds the new submodule from `eager.cc`.
- Adds `pir_fast_runtime.py` implementing the lowering + Python source codegen, plumbs it into `PirPartialProgram.sot_call` via a cached `FastKernelRuntime`, and adds unit tests in `test/sot/test_fast_kernel_runtime.py`.
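The "lower, then codegen a Python function that calls kernels directly" idea summarized above can be illustrated with a toy sketch. Everything in it is hypothetical: `TOY_KERNELS`, `ToyOp`, and `codegen_direct_call` are stand-ins invented for illustration and are not Paddle APIs. The real path operates on the pd_kernel dialect and calls pybind functions under `core.eager.kernel_ops`; this sketch only shows the general shape of emitting source once and reusing the compiled function.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a per-op kernel registry; NOT core.eager.kernel_ops.
TOY_KERNELS = {
    "add": lambda x, y: [a + b for a, b in zip(x, y)],
    "scale": lambda x, factor: [a * factor for a in x],
}


@dataclass(frozen=True)
class ToyOp:
    """One lowered op: kernel name, input variable names, constant attrs, output name."""

    kernel: str
    inputs: tuple
    attrs: tuple
    output: str


def codegen_direct_call(ops, arg_names, result):
    """Emit Python source that calls each kernel in order, then compile it once."""
    lines = [f"def fast_fn({', '.join(arg_names)}):"]
    for op in ops:
        # Constant attributes are written into the source as literals.
        call_args = list(op.inputs) + [repr(a) for a in op.attrs]
        lines.append(
            f"    {op.output} = kernels[{op.kernel!r}]({', '.join(call_args)})"
        )
    lines.append(f"    return {result}")
    source = "\n".join(lines)
    namespace = {"kernels": TOY_KERNELS}
    exec(source, namespace)  # defines fast_fn inside the namespace
    return namespace["fast_fn"], source


ops = [
    ToyOp("add", ("x", "y"), (), "t0"),
    ToyOp("scale", ("t0",), (2,), "t1"),
]
fast_fn, source = codegen_direct_call(ops, ["x", "y"], "t1")
print(source)
print(fast_fn([1, 2], [3, 4]))
```

After the one-time codegen step, each call runs a flat sequence of function calls with no per-op interpretation loop, which is the kind of dispatch overhead this PR aims to cut.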
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| python/paddle/jit/sot/utils/envs.py | Adds SOT_ENABLE_FAST_KERNEL_CODEGEN boolean env flag. |
| python/paddle/jit/dy2static/pir_partial_program.py | Routes sot_call to the fast kernel path when enabled; adds runtime cache and quick_index_map handling for Value raw inputs. |
| python/paddle/jit/dy2static/pir_fast_runtime.py | New module that lowers PIR to pd_kernel and codegens a Python kernel-dispatch function with constant folding and CINN jit_kernel support. |
| paddle/fluid/pybind/pir.cc | Exposes apply_pd_op_to_kernel_pass to Python. |
| paddle/fluid/pybind/eager.{h,cc} | Declares and creates the core.eager.kernel_ops submodule and binds both generated and manual direct-kernel functions. |
| paddle/fluid/pybind/kernel_op_function_manual.{h,cc} | Implements get_phi_kernel_op_info and a CINN jit_kernel launcher used by the fast runtime. |
| paddle/fluid/pybind/CMakeLists.txt, .gitignore, paddle/fluid/pir/dialect/CMakeLists.txt | Build wiring for generated kernel_op_function.* files. |
| paddle/fluid/eager/auto_code_generator/generator/python_c_gen.py | Extends the Python-C generator with a --direct_kernel mode, custom namespaces, args-info registry, and configurable bind/method/error names. |
| test/sot/test_fast_kernel_runtime.py | Unit tests covering pybind surface, fast kernel codegen for add/reshape/BN, and the no-fallback contract. |
| .gitignore | Ignores generated kernel_op_function.* outputs. |
Co-authored-by: Codex <noreply@openai.com>
Codecov Report
❌ Your patch status has failed because the patch coverage (64.37%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Coverage Diff

| Metric | develop | #79043 |
|---|---|---|
| Coverage | ? | 64.37% |
| Files | ? | 3 |
| Lines | ? | 407 |
| Branches | ? | 0 |
| Hits | ? | 262 |
| Misses | ? | 145 |
| Partials | ? | 0 |
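As a quick check of the report's arithmetic, the sketch below recomputes the patch coverage from the hit and miss counts and estimates how many more covered lines would reach the 90% target. It assumes that the line total stays fixed at 407 and that coverage is simply hits divided by lines, which may not match every detail of how Codecov counts partials.

```python
import math

hits, misses, partials = 262, 145, 0  # values from the report
lines = hits + misses + partials
patch_coverage = 100 * hits / lines
target = 90.0

# Smallest number of additional covered lines needed to reach the target,
# assuming the total number of patch lines does not change.
needed_hits = math.ceil(target / 100 * lines) - hits

print(f"lines={lines} coverage={patch_coverage:.2f}% extra_hits_needed={needed_hits}")
```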
PR Category
Performance Optimization
PR Types
New features, Performance
Description
This PR adds a fast kernel runtime to the SOT dynamic-to-static execution path in order to reduce runtime dispatch overhead. The core goal is to bypass the per-op dispatch path of the original C++ program executor: once PIR has been lowered to the pd_kernel dialect, the runtime codegens a Python-executable function directly and calls the underlying kernels through the auto-generated `core.eager.kernel_ops` pybind API.

Main changes:
- `pir_fast_runtime.py`: generates Python runtime code from the lowered pd_kernel/CINN IR.
- `PartialProgramLayer.sot_call`: adds a switchable fast kernel runtime path.
- No fallback to `run_program`, so that problems are not hidden.

Verification
- `prek --files python/paddle/jit/dy2static/pir_fast_runtime.py test/sot/test_fast_kernel_runtime.py`
- `python -m unittest test_fast_kernel_runtime`
- `ctest -R sot --output-on-failure`
- `core.eager.kernel_ops.add` smoke test and `test_fast_kernel_runtime`.
- `PR-CI-SOT / Build and Test` has passed.

Performance results
All benchmarks below exclude the first-run codegen/CINN compile time.

GPU:
- single_add: sync 22.866us -> 19.605us, enqueue 18.336us -> 14.357us.
- reshape_add: sync 60.321us -> 53.518us, enqueue 55.852us -> 48.256us.
- resnet18_b1: sync 0.930ms -> 0.800ms, enqueue 0.824ms -> 0.739ms.
- resnet18_b10: sync 0.999ms -> 0.855ms, enqueue 0.912ms -> 0.813ms.

CPU:
- single_add: 3.589us -> 1.501us.
- reshape_add: 18.617us -> 5.620us.
- resnet18_b1: 8.627ms -> 8.096ms.
- resnet18_b10: 78.468ms -> 76.781ms.

Does this change numerical precision?
No. This PR only changes the runtime dispatch path of SOT-compiled graphs; kernel inputs and outputs stay consistent with the lowered IR. The new unit tests cover direct kernel, CINN kernel, BatchNorm eval, and the prohibition of run_program fallback.
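To put the benchmark figures from the performance section on a common scale, the sketch below converts each before/after pair into a speedup ratio and a percentage reduction. The timings are copied from the PR description; the helper names `speedup` and `reduction_pct` are illustrative only.

```python
# (before, after) timings copied from the PR description.
GPU_SYNC = {
    "single_add": (22.866, 19.605),   # us
    "reshape_add": (60.321, 53.518),  # us
    "resnet18_b1": (0.930, 0.800),    # ms
    "resnet18_b10": (0.999, 0.855),   # ms
}
CPU = {
    "single_add": (3.589, 1.501),     # us
    "reshape_add": (18.617, 5.620),   # us
    "resnet18_b1": (8.627, 8.096),    # ms
    "resnet18_b10": (78.468, 76.781), # ms
}


def speedup(before, after):
    """Ratio of old latency to new latency; above 1.0 means faster."""
    return before / after


def reduction_pct(before, after):
    """Percentage of the old latency that was removed."""
    return 100 * (before - after) / before


for label, table in (("GPU sync", GPU_SYNC), ("CPU", CPU)):
    for name, (before, after) in table.items():
        print(
            f"{label:8s} {name:13s} "
            f"{speedup(before, after):.2f}x  -{reduction_pct(before, after):.1f}%"
        )
```

On these numbers the CPU micro-benchmarks show the largest ratios (about 2.4x and 3.3x), while the ResNet-18 cases fall roughly between 1.02x and 1.17x, which is consistent with fixed per-op dispatch cost mattering most when the kernels themselves are cheap.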