EventReplay: extend beyond aten ops with auto-import and custom initializers #607
Open
ajassani wants to merge 4 commits into
Op resolution:
- Add _resolve_op_func() with JIT-first resolution (preserves in-place kernel dispatch for aten ops) and torch.ops fallback for custom ops
- Add _search_schemas() to collect schemas from both JIT registry and torch.ops namespace overloads

Schemaless replay:
- Add _get_event_replay_IR_schemaless() that infers argument types directly from profiled data when no schema is available, enabling replay of ops like _C::silu_and_mul and _C::rotary_embedding

Type handling:
- Fix Scalar type: preserve integer values for integral tensor ops instead of always casting to float
- Add str/str?, SymInt?/int?, Generator? support in schema matching
- Add _is_tensor_schema_type() for annotated variants like Tensor(a!)
- Add _should_skip_tensor_init() generalizing in-place/output detection

Dtype support (utils.py):
- Add long int, unsigned char, char, short, and FP8 types
- Use zeros init for non-floating-point tensors

Tested with:
- ResNet regression suite (70 aten:: ops)
- vLLM Qwen1.5-MoE-A2.7B trace: 7/9 aiter ops, plus _rocm_C::wvSplitK, _C::silu_and_mul, _C::rotary_embedding (requires import vllm._C/_rocm_C)

Made-with: Cursor
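The type-handling items above can be illustrated with a small sketch. It is not the PR's code: `is_tensor_schema_type` and `coerce_scalar` are hypothetical stand-ins for `_is_tensor_schema_type()` and the Scalar fix, and the regex covers only a single Tensor with an optional alias annotation and optional `?` (list types are deliberately left out).

```python
import re

# Hypothetical stand-ins for the PR's helpers; the real implementations may differ.
# Matches a bare Tensor type with an optional alias annotation and optional "?",
# e.g. "Tensor", "Tensor?", "Tensor(a!)", "Tensor($0! -> )".
_TENSOR_TYPE_RE = re.compile(r"Tensor(\([^)]*\))?\??")


def is_tensor_schema_type(type_str: str) -> bool:
    """Return True if a schema type string denotes a single (possibly annotated) Tensor."""
    return _TENSOR_TYPE_RE.fullmatch(type_str.strip()) is not None


def coerce_scalar(value, tensor_is_integral: bool):
    """Keep whole-number Scalars as int for integral tensors; otherwise use float."""
    if tensor_is_integral and float(value).is_integer():
        return int(value)
    return float(value)
```

Under these assumptions, `Tensor(a!)` is recognized as a tensor argument while `int` and `str?` are not, and a Scalar of `2` stays an `int` only when the op works on integral tensors.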
String arg defaults (_STR_ARG_DEFAULTS):
- When the profiler drops a str arg value, check a known-defaults table keyed by arg name (e.g. kv_cache_dtype -> "auto")
- Log a WARNING when a default is used so users know the value was inferred
- Recovers _C_cache_ops::reshape_and_cache_flash (1.97% GPU time)

Op name aliases (_OP_NAME_ALIASES):
- Map trace-recorded names to their runtime-registered names (e.g. _rocm_C::wvSplitK -> _rocm_C::wvSpltK)

Python module resolution (3rd strategy):
- After JIT and torch.ops, try importlib.import_module(namespace) for JIT-compiled ops like aiter

Schema parser fix:
- parse_schema_string now correctly handles annotated tensor types with spaces like "Tensor($0! -> )"

Tested with vLLM Qwen1.5-MoE-A2.7B trace on MI300X (tw025).

Made-with: Cursor
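A minimal sketch of the three lookups described in this commit, using only the standard library. The table entries come from the commit message; the function names, the assumption that a dropped value appears as `None` or an empty string, and the `None` return when no default is known are all illustrative rather than taken from the PR.

```python
import importlib
import logging

logger = logging.getLogger("event_replay_sketch")

# Example entries taken from the commit message; the real tables may hold more.
STR_ARG_DEFAULTS = {"kv_cache_dtype": "auto"}
OP_NAME_ALIASES = {"_rocm_C::wvSplitK": "_rocm_C::wvSpltK"}


def canonical_op_name(trace_name: str) -> str:
    """Map a trace-recorded op name to its runtime-registered name, if aliased."""
    return OP_NAME_ALIASES.get(trace_name, trace_name)


def recover_str_arg(arg_name: str, recorded_value):
    """Return the recorded string, or a known default when the profiler dropped it.

    Assumption: a dropped value shows up as None or an empty string.
    """
    if recorded_value:
        return recorded_value
    if arg_name in STR_ARG_DEFAULTS:
        default = STR_ARG_DEFAULTS[arg_name]
        logger.warning("Using inferred default %r for dropped str arg %r",
                       default, arg_name)
        return default
    return None  # no known default; the caller decides how to handle it


def import_namespace(op_name: str):
    """Third-strategy sketch: import the Python module named after the op's namespace."""
    namespace = op_name.split("::", 1)[0]
    try:
        return importlib.import_module(namespace)
    except ImportError:
        return None
```

Returning `None` from `import_namespace` instead of raising lets a caller treat the import as one more fallback after the JIT and `torch.ops` lookups, which is how the commit message orders the strategies.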
Bug fixes:
- Fix lazy+auto_init crash: replay() now sets self.args in lazy mode so custom initializers can access them (BUG-1)
- Fix get_repro_info() shallow copy corruption: no longer mutates event_replay_IR on repeated calls (BUG-2)
- Fix batched_replay.py: handle benchmark_func dict return type, implement --op-filter and --op-limit flags (BUG-3)
- replay() now returns the op result instead of None (CLAIM-4)
- First-match-wins for custom initializers (CLAIM-1)
- Exact name matching for op_patterns (no more substring matching)

Tests:
- Add CPU-only unit tests (test_event_replay.py, 11 tests)
- Add GPU integration tests (test_event_replay_gpu.py) with kernel name validation

Docs (EventReplay.md):
- Fix benchmark_func example (wrong params and key names)
- Remove broken Shape Metadata Guide links
- Rewrite custom initializer section as step-by-step guide
- Rewrite iteration annotations section with full explanation
- Add batch replay CLI flag examples
- Update all op_patterns to fully-qualified names
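The shallow-copy corruption listed as BUG-2 follows a common pattern, sketched here with a made-up IR layout. The dict structure and function names are hypothetical, and `copy.deepcopy` is only one way to avoid the aliasing; the commit message does not say which approach the PR takes.

```python
import copy


def make_ir():
    # Hypothetical IR layout, for illustration only.
    return {"name": "aten::add", "args": [{"arg_name": "self", "shape": [2, 3]}]}


def repro_info_shallow(ir):
    """Buggy pattern: dict(ir) copies the top level only, so nested dicts are shared."""
    info = dict(ir)
    for arg in info["args"]:
        arg["shape"] = str(arg["shape"])  # also rewrites ir["args"][i]["shape"]
    return info


def repro_info_deep(ir):
    """Safe pattern: deep-copy first, then rewrite only the copy."""
    info = copy.deepcopy(ir)
    for arg in info["args"]:
        arg["shape"] = str(arg["shape"])
    return info
```

With the shallow version, a single call leaves the original IR holding a string where a list used to be, so any later replay that reads the shape sees corrupted data; the deep-copy version leaves the IR untouched across repeated calls.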
94dd686 to 95406f3
Summary
EventReplay previously only worked with `aten::` ops. This PR extends it to support custom ops from any namespace (vLLM, aiter, etc.) and adds custom initializers so data-dependent ops produce realistic replay behavior.

Extending beyond aten
- For ops outside `aten::` (e.g. `_rocm_C::paged_attention`, `aiter::ck_moe_stage1`), it automatically imports the library that registers the op's schema. Supports `aiter`, `_rocm_C`, `_C`, `vllm` out of the box, and users can register additional namespaces.

Custom initializers for data-dependent ops
- `PagedAttentionInit`: fills `block_tables`, `seq_lens`, and `query_start_loc` with realistic values so the attention kernel does real work instead of short-circuiting on zeros.
- `MoeRoutingInit`: constructs a complete token-to-expert routing table (`sorted_token_ids`, `sorted_expert_ids`, `num_valid_ids`) with a configurable distribution (uniform or Zipf).
- `CustomInit`: set `op_patterns` to the exact op name, implement `initialize()`, and register. First-match-wins, exact name matching.

Iteration annotations (vLLM)
- `extract_batch_context` parses vLLM's per-iteration `user_annotation` events to get the exact prefill/decode split, so `PagedAttentionInit` builds an accurate `query_start_loc` for mixed batches.

Bug fixes
- `replay()` with `lazy=True` + `auto_init=True` crashed (`AttributeError`): fixed
- `get_repro_info()` corrupted `event_replay_IR` via shallow copy: fixed
- `batched_replay.py`: `benchmark_func` dict return type crash + dead `--op-filter`/`--op-limit` flags: fixed
- `replay()` now returns the op result

Tests and docs
Test plan