EventReplay: extend beyond aten ops with auto-import and custom initializers#607

Open
ajassani wants to merge 4 commits into main from feature/event-replay-custom-ops

@ajassani (Collaborator) commented Apr 28, 2026

Summary

EventReplay previously only worked with aten:: ops. This PR extends it to support custom ops from any namespace (vLLM, aiter, etc.) and adds custom initializers so data-dependent ops produce realistic replay behavior.

Extending beyond aten

  • Auto-import for custom op namespaces: when EventReplay encounters a non-aten op (e.g., _rocm_C::paged_attention, aiter::ck_moe_stage1), it automatically imports the library that registers the op's schema. Supports aiter, _rocm_C, _C, vllm out of the box, and users can register additional namespaces.
  • Schemaless fallback: ops that lack a registered schema can still be replayed with heuristic type inference from the profiler data.
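
As a rough illustration of the auto-import idea described above, the sketch below maps an op's namespace to a module and imports it on demand. The table contents and the helper names (`NAMESPACE_MODULES`, `register_namespace`, `ensure_namespace_imported`) are assumptions made for this example and are not taken from the PR's code.

```python
import importlib
import logging

logger = logging.getLogger("event_replay_sketch")

# Hypothetical namespace -> module table (assumed mapping, not the PR's actual one).
NAMESPACE_MODULES = {
    "_C": "vllm._C",
    "_rocm_C": "vllm._rocm_C",
    "aiter": "aiter",
    "vllm": "vllm",
}


def register_namespace(namespace, module_name):
    """Let users add a namespace that the table does not know about."""
    NAMESPACE_MODULES[namespace] = module_name


def ensure_namespace_imported(op_name):
    """Import the module assumed to register op_name's schema.

    Returns True if an import was attempted and succeeded, False otherwise.
    """
    namespace, sep, _ = op_name.partition("::")
    if not sep or namespace == "aten":
        return False  # aten ops (and unqualified names) need no extra import
    module_name = NAMESPACE_MODULES.get(namespace)
    if module_name is None:
        return False  # unknown namespace: nothing to import
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        logger.warning("could not import %s for %s: %s", module_name, op_name, exc)
        return False
    return True
```

A caller would run `ensure_namespace_imported(op_name)` before looking up the op, so that the schema is registered by the time it is needed.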

Custom initializers for data-dependent ops

  • PagedAttentionInit — fills block_tables, seq_lens, and query_start_loc with realistic values so the attention kernel does real work instead of short-circuiting on zeros.
  • MoeRoutingInit — constructs a complete token-to-expert routing table (sorted_token_ids, sorted_expert_ids, num_valid_ids) with configurable distribution (uniform or Zipf).
  • User-extensible: subclass CustomInit, set op_patterns to the exact op name, implement initialize(), and register. Matching is by exact op name, and the first matching initializer wins.
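
To make the matching rules concrete, here is a minimal sketch of exact-name, first-match-wins lookup. The `CustomInit` class below is a stand-in written for this example; the real base class, its `initialize()` signature, and the registration function in the PR may differ, and `ConstantSeqLensInit` is purely hypothetical.

```python
class CustomInit:
    """Stand-in base class for illustration; not the PR's actual class."""

    op_patterns = ()

    def matches(self, op_name):
        # Exact name matching: "paged_attention" does NOT match
        # "_rocm_C::paged_attention".
        return op_name in self.op_patterns

    def initialize(self, args):
        raise NotImplementedError


_REGISTRY = []


def register(initializer):
    _REGISTRY.append(initializer)


def find_initializer(op_name):
    # First-match-wins: return the earliest registered initializer that matches.
    for initializer in _REGISTRY:
        if initializer.matches(op_name):
            return initializer
    return None


class ConstantSeqLensInit(CustomInit):
    """Hypothetical initializer that fills seq_lens with a fixed length."""

    op_patterns = ("_rocm_C::paged_attention",)

    def __init__(self, seq_len):
        self.seq_len = seq_len

    def initialize(self, args):
        filled = dict(args)
        filled["seq_lens"] = [self.seq_len] * len(args["seq_lens"])
        return filled


first = ConstantSeqLensInit(128)
second = ConstantSeqLensInit(4096)
register(first)
register(second)
```

With both initializers registered for the same op, `find_initializer("_rocm_C::paged_attention")` returns `first`, and a bare `"paged_attention"` matches nothing.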

Iteration annotations (vLLM)

  • extract_batch_context parses vLLM's per-iteration user_annotation events to get the exact prefill/decode split, so PagedAttentionInit builds an accurate query_start_loc for mixed batches.
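
The reason the prefill/decode split matters can be shown with a small sketch: query_start_loc is a running offset over per-request query lengths, so it can only be accurate if those lengths are known. The function below is an assumption-laden illustration, not the PR's code. It assumes each decode request contributes one query token and that prefill requests are laid out before decode requests; the real ordering may differ.

```python
from itertools import accumulate


def build_query_start_loc(prefill_lens, num_decodes):
    """Cumulative query offsets for a mixed batch (illustrative helper).

    prefill_lens: number of query tokens for each prefill request.
    num_decodes:  number of decode requests, assumed to be 1 token each.
    """
    query_lens = list(prefill_lens) + [1] * num_decodes
    # Leading 0, then the running total after each request.
    return [0] + list(accumulate(query_lens))


offsets = build_query_start_loc([5, 3], 2)
```

For two prefill requests of 5 and 3 tokens plus two decodes, the offsets are `[0, 5, 8, 9, 10]`; treating all four requests as decodes would instead give `[0, 1, 2, 3, 4]`, which describes a very different workload.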

Bug fixes

  • replay() with lazy=True + auto_init=True crashed (AttributeError) — fixed
  • get_repro_info() corrupted event_replay_IR via shallow copy — fixed
  • batched_replay.py crashed when benchmark_func returned a dict, and the --op-filter/--op-limit flags had no effect; both fixed
  • replay() now returns the op result
  • Custom init matching changed from substring to exact name
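
The shallow-copy bug class behind the get_repro_info() fix can be reproduced in a few lines. The IR layout below (`name`, `args`, `value`) is invented for this example and is not the actual event_replay_IR structure.

```python
import copy


def repro_info_shallow(ir):
    # copy.copy duplicates only the outer dict; the nested arg dicts are shared,
    # so rewriting them also rewrites the caller's IR.
    info = copy.copy(ir)
    for arg in info["args"]:
        arg["value"] = repr(arg["value"])
    return info


def repro_info_deep(ir):
    # copy.deepcopy duplicates the nested structures, leaving the input intact.
    info = copy.deepcopy(ir)
    for arg in info["args"]:
        arg["value"] = repr(arg["value"])
    return info


ir = {"name": "aten::add", "args": [{"name": "alpha", "value": 1}]}
repro_info_deep(ir)
unchanged = ir["args"][0]["value"]   # still the integer 1
repro_info_shallow(ir)
corrupted = ir["args"][0]["value"]   # now the string "1"
```

Each further call to the shallow variant wraps the value in another layer of `repr`, which is the kind of corruption on repeated calls that the fix targets.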

Tests and docs

  • 11 CPU-only unit tests + GPU integration test with kernel name validation
  • Docs rewritten: IR interpretability examples, step-by-step custom init guide, iteration annotations explained

Test plan

  • CPU unit tests pass (11/11, ~3s)
  • GPU integration tests pass (5/5 ops, MI300X)
    • Kernel name match: all MATCH
    • Lazy mode, get_repro_info idempotency, return value, first-match-wins: all PASS

@ajassani changed the title from "EventReplay: custom initializers, bug fixes, tests, and docs" to "EventReplay: extend beyond aten ops with auto-import, custom initializers, and iteration annotations" Apr 28, 2026
@ajassani changed the title from "EventReplay: extend beyond aten ops with auto-import, custom initializers, and iteration annotations" to "EventReplay: extend beyond aten ops with auto-import and custom initializers" Apr 28, 2026
ajassani and others added 4 commits April 28, 2026 15:48
Op resolution:
- Add _resolve_op_func() with JIT-first resolution (preserves in-place
  kernel dispatch for aten ops) and torch.ops fallback for custom ops
- Add _search_schemas() to collect schemas from both JIT registry and
  torch.ops namespace overloads

Schemaless replay:
- Add _get_event_replay_IR_schemaless() that infers argument types
  directly from profiled data when no schema is available, enabling
  replay of ops like _C::silu_and_mul and _C::rotary_embedding
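
A minimal sketch of what heuristic type inference from profiled data could look like, assuming each argument comes with a recorded shape, a type name, and a concrete value string (similar in spirit to the profiler's input dims, input type, and concrete inputs fields). The function name and return format are hypothetical; the actual _get_event_replay_IR_schemaless() logic is not shown in this PR description.

```python
def infer_arg(dims, type_name, concrete):
    """Guess how to rebuild one argument from profiled data (illustrative)."""
    if dims:
        # A non-empty shape is taken to mean a tensor argument.
        return {"kind": "tensor", "dtype": type_name, "shape": tuple(dims)}
    if concrete in ("True", "False"):
        return {"kind": "bool", "value": concrete == "True"}
    try:
        return {"kind": "int", "value": int(concrete)}
    except ValueError:
        pass
    try:
        return {"kind": "float", "value": float(concrete)}
    except ValueError:
        # Nothing usable was recorded (e.g. a dropped string value).
        return {"kind": "unknown", "value": concrete}
```

One limitation of this simple rule: a 0-dim tensor has an empty shape and would be misclassified, so a real implementation needs additional signals.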

Type handling:
- Fix Scalar type: preserve integer values for integral tensor ops
  instead of always casting to float
- Add str/str?, SymInt?/int?, Generator? support in schema matching
- Add _is_tensor_schema_type() for annotated variants like Tensor(a!)
- Add _should_skip_tensor_init() generalizing in-place/output detection
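
Two of the type-handling points above can be sketched in plain Python. Both helpers are hypothetical stand-ins written for this example (the PR's _is_tensor_schema_type() may behave differently): one recognizes annotated tensor types, the other keeps integer scalars integral.

```python
def is_tensor_schema_type(type_str):
    """True for Tensor, Tensor?, Tensor(a!), Tensor(a!)? (not for Tensor[])."""
    base = type_str.strip().rstrip("?")   # drop the optional marker
    base = base.split("(", 1)[0]          # drop an alias annotation like (a!)
    return base == "Tensor"


def coerce_scalar(raw, tensor_is_integral):
    """Parse a recorded Scalar, keeping whole numbers as int for integral ops."""
    value = float(raw)
    if tensor_is_integral and value.is_integer():
        return int(value)
    return value
```

With this rule, `coerce_scalar("2", True)` yields the int `2`, while the same input for a floating-point op yields `2.0`.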

Dtype support (utils.py):
- Add long int, unsigned char, char, short, and FP8 types
- Use zeros init for non-floating-point tensors

Tested with:
- ResNet regression suite (70 aten:: ops)
- vLLM Qwen1.5-MoE-A2.7B trace: 7/9 aiter ops, plus _rocm_C::wvSplitK,
  _C::silu_and_mul, _C::rotary_embedding (requires import vllm._C/_rocm_C)

Made-with: Cursor

String arg defaults (_STR_ARG_DEFAULTS):
- When the profiler drops a str arg value, check a known-defaults table
  keyed by arg name (e.g. kv_cache_dtype -> "auto")
- Log a WARNING when a default is used so users know the value was inferred
- Recovers _C_cache_ops::reshape_and_cache_flash (1.97% GPU time)
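
A sketch of the defaults-table idea, assuming a dropped value shows up as an empty string. Only the kv_cache_dtype -> "auto" entry comes from the commit message; the function name, logger name, and error behavior are assumptions for this example.

```python
import logging

logger = logging.getLogger("event_replay_sketch")

# Known defaults keyed by argument name (single entry taken from the commit message).
STR_ARG_DEFAULTS = {"kv_cache_dtype": "auto"}


def resolve_str_arg(arg_name, recorded):
    """Return the recorded string, or a known default if the value was dropped."""
    if recorded:
        return recorded
    if arg_name in STR_ARG_DEFAULTS:
        default = STR_ARG_DEFAULTS[arg_name]
        # Warn so users know the value was inferred rather than recorded.
        logger.warning("str arg %r missing from trace; using default %r",
                       arg_name, default)
        return default
    raise ValueError(f"no recorded value or default for str arg {arg_name!r}")
```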

Op name aliases (_OP_NAME_ALIASES):
- Map trace-recorded names to their runtime-registered names
  (e.g. _rocm_C::wvSplitK -> _rocm_C::wvSpltK)

Python module resolution (3rd strategy):
- After JIT and torch.ops, try importlib.import_module(namespace) for
  JIT-compiled ops like aiter
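
The alias table and the ordered resolution strategies described above could be combined roughly as follows. This is a sketch under assumptions: the real strategies (JIT, torch.ops, importlib) are replaced by caller-supplied lookup callables so the example runs without torch, and `resolve_op` is a hypothetical name.

```python
# Trace-recorded name -> runtime-registered name (entry from the commit message).
OP_NAME_ALIASES = {"_rocm_C::wvSplitK": "_rocm_C::wvSpltK"}


def resolve_op(op_name, strategies):
    """Try each (label, lookup) strategy in order; the first hit wins.

    Each lookup takes the runtime op name and returns a callable or None.
    """
    runtime_name = OP_NAME_ALIASES.get(op_name, op_name)
    for label, lookup in strategies:
        try:
            func = lookup(runtime_name)
        except (AttributeError, ImportError, KeyError, RuntimeError):
            func = None  # this strategy cannot resolve the op; try the next one
        if func is not None:
            return label, func
    raise LookupError(f"could not resolve op {op_name!r} (tried as {runtime_name!r})")


# Stand-in registries for illustration only.
fake_jit = {"aten::add": sum}
fake_torch_ops = {"_rocm_C::wvSpltK": max}
strategies = [("jit", fake_jit.get), ("torch.ops", fake_torch_ops.get)]
label, func = resolve_op("_rocm_C::wvSplitK", strategies)
```

Here the trace name is first rewritten to its runtime name, the JIT stand-in misses, and the second strategy resolves it.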

Schema parser fix:
- parse_schema_string handles annotated tensor types with spaces like
  "Tensor($0! -> )" correctly now

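The difficulty with annotated types such as "Tensor($0! -> )" is that they contain spaces and parentheses, so naive splitting on spaces or commas breaks. The sketch below shows one way to handle this; the helper names are hypothetical and this is not the PR's parse_schema_string implementation.

```python
def split_schema_args(arg_str):
    """Split an argument list on commas that are not inside () or []."""
    args, current, depth = [], [], 0
    for ch in arg_str:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        args.append(tail)
    return args


def split_type_and_name(arg):
    """Separate 'TYPE NAME[=DEFAULT]' using the LAST space before the name."""
    decl = arg.split("=", 1)[0].strip()
    type_str, _, name = decl.rpartition(" ")
    return type_str.strip(), name
```

Splitting on the last space keeps the spaces inside the annotation attached to the type, so `"Tensor($0! -> ) out"` yields the type `"Tensor($0! -> )"` and the name `"out"`.
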
Tested with vLLM Qwen1.5-MoE-A2.7B trace on MI300X (tw025).

Made-with: Cursor

Bug fixes:
- Fix lazy+auto_init crash: replay() now sets self.args in lazy mode
  so custom initializers can access them (BUG-1)
- Fix get_repro_info() shallow copy corruption: no longer mutates
  event_replay_IR on repeated calls (BUG-2)
- Fix batched_replay.py: handle benchmark_func dict return type,
  implement --op-filter and --op-limit flags (BUG-3)
- replay() now returns the op result instead of None (CLAIM-4)
- First-match-wins for custom initializers (CLAIM-1)
- Exact name matching for op_patterns (no more substring matching)

Tests:
- Add CPU-only unit tests (test_event_replay.py, 11 tests)
- Add GPU integration tests (test_event_replay_gpu.py) with kernel
  name validation

Docs (EventReplay.md):
- Fix benchmark_func example (wrong params and key names)
- Remove broken Shape Metadata Guide links
- Rewrite custom initializer section as step-by-step guide
- Rewrite iteration annotations section with full explanation
- Add batch replay CLI flag examples
- Update all op_patterns to fully-qualified names