EventReplay: extend beyond aten ops with auto-import and custom initializers by ajassani · Pull Request #607 · AMD-AGI/TraceLens

ajassani · 2026-04-28T19:41:28Z

Summary

EventReplay previously only worked with aten:: ops. This PR extends it to support custom ops from any namespace (vLLM, aiter, etc.) and adds custom initializers so data-dependent ops produce realistic replay behavior.

Extending beyond aten

Auto-import for custom op namespaces: when EventReplay encounters a non-aten op (e.g., _rocm_C::paged_attention, aiter::ck_moe_stage1), it automatically imports the library that registers the op's schema. Supports aiter, _rocm_C, _C, vllm out of the box, and users can register additional namespaces.
Schemaless fallback: ops that lack a registered schema can still be replayed with heuristic type inference from the profiler data.
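
As a rough illustration of the auto-import idea, the sketch below maps an op namespace to the Python module that registers its schema and imports that module on demand. The table name `NAMESPACE_MODULES` and the helpers `register_namespace` and `auto_import` are hypothetical and are not the PR's actual API; the `vllm._C` / `vllm._rocm_C` module paths follow the import requirement mentioned in the commit notes.

```python
import importlib
import logging

logger = logging.getLogger("event_replay_sketch")

# Hypothetical namespace -> module table; the real mapping in the PR may differ.
NAMESPACE_MODULES = {
    "aiter": "aiter",
    "_C": "vllm._C",
    "_rocm_C": "vllm._rocm_C",
    "vllm": "vllm",
}


def register_namespace(namespace: str, module_name: str) -> None:
    """Let users add a namespace that is not covered out of the box."""
    NAMESPACE_MODULES[namespace] = module_name


def auto_import(op_name: str) -> bool:
    """Import the module that registers `op_name`; return True on success."""
    namespace, sep, _ = op_name.partition("::")
    if not sep or namespace == "aten":
        return True  # nothing to import: aten ops are registered by torch itself
    module_name = NAMESPACE_MODULES.get(namespace)
    if module_name is None:
        logger.warning("no module registered for namespace %r", namespace)
        return False
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        logger.warning("could not import %s for %s: %s", module_name, op_name, exc)
        return False
    return True
```

A caller would invoke `auto_import("_rocm_C::paged_attention")` before looking up the op's schema, and fall back to schemaless replay if the import fails or no schema is found.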

Custom initializers for data-dependent ops

PagedAttentionInit — fills block_tables, seq_lens, and query_start_loc with realistic values so the attention kernel does real work instead of short-circuiting on zeros.
MoeRoutingInit — constructs a complete token-to-expert routing table (sorted_token_ids, sorted_expert_ids, num_valid_ids) with configurable distribution (uniform or Zipf).
User-extensible: subclass CustomInit, set op_patterns to the exact op name, implement initialize(), and register. First-match-wins, exact name matching.
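
The matching rules above (exact name, first match wins) can be sketched with a small stand-in registry. Only the names `CustomInit`, `op_patterns`, and `initialize()` come from this PR; the `initialize()` signature, the registry helpers, and the two example initializers are assumptions made for illustration, and the real TraceLens classes may look different.

```python
from typing import Dict, List, Optional


class CustomInit:
    """Stand-in base class; the real CustomInit lives in TraceLens."""

    op_patterns: List[str] = []

    def initialize(self, op_name: str, args: Dict[str, list]) -> None:
        raise NotImplementedError


_REGISTRY: List[CustomInit] = []  # hypothetical registry, in registration order


def register_custom_init(init: CustomInit) -> None:
    _REGISTRY.append(init)


def find_custom_init(op_name: str) -> Optional[CustomInit]:
    """Return the first registered initializer whose pattern equals op_name."""
    for init in _REGISTRY:
        # op_patterns is a list, so `in` is an exact-name test, not a substring test.
        if op_name in init.op_patterns:
            return init
    return None


class SeqLensInit(CustomInit):
    op_patterns = ["_rocm_C::paged_attention"]

    def initialize(self, op_name, args):
        # Replace all-zero sequence lengths with a plausible constant.
        args["seq_lens"] = [128] * len(args["seq_lens"])


class LaterInit(CustomInit):
    op_patterns = ["_rocm_C::paged_attention"]

    def initialize(self, op_name, args):
        args["seq_lens"] = [1] * len(args["seq_lens"])


register_custom_init(SeqLensInit())
register_custom_init(LaterInit())  # never selected: SeqLensInit was registered first
```

Because matching is exact, a lookup for `paged_attention` without its namespace finds nothing, which is why op_patterns should use fully-qualified names.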

Iteration annotations (vLLM)

extract_batch_context parses vLLM's per-iteration user_annotation events to get the exact prefill/decode split, so PagedAttentionInit builds an accurate query_start_loc for mixed batches.
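
To show why the prefill/decode split matters, here is a minimal sketch of deriving query_start_loc from it. The helper `build_query_start_loc` is hypothetical, and the sketch assumes that each decode request contributes exactly one query token and that decodes are laid out before prefills; the actual layout used by PagedAttentionInit may differ.

```python
from itertools import accumulate
from typing import List


def build_query_start_loc(prefill_lens: List[int], num_decodes: int) -> List[int]:
    """Cumulative query-token offsets for a mixed prefill/decode batch.

    Assumptions of this sketch: one query token per decode request,
    decode requests placed before prefill requests.
    """
    query_lens = [1] * num_decodes + list(prefill_lens)
    # Offsets start at 0 and have one more entry than there are requests.
    return [0] + list(accumulate(query_lens))


# Two decode requests followed by prefills of 5 and 3 tokens.
print(build_query_start_loc([5, 3], num_decodes=2))  # [0, 1, 2, 7, 10]
```

Without the split, an initializer would have to guess per-request query lengths from the total token count alone, which is ambiguous for mixed batches.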

Bug fixes

replay() with lazy=True + auto_init=True crashed (AttributeError) — fixed
get_repro_info() corrupted event_replay_IR via shallow copy — fixed
batched_replay.py: benchmark_func dict return type crash + dead --op-filter/--op-limit flags — fixed
replay() now returns the op result
Custom init matching changed from substring to exact name
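
The get_repro_info() fix concerns a common Python pitfall, illustrated below. The dictionary layout and the helpers `make_ir`, `repro_info_shallow`, and `repro_info_deep` are invented for this sketch and do not mirror the real event_replay_IR structure.

```python
import copy


def make_ir():
    # Hypothetical, simplified IR: one op with one positional argument.
    return {"op_name": "aten::add", "pos_args": [{"name": "self", "shape": [2, 3]}]}


def repro_info_shallow(ir):
    info = copy.copy(ir)  # new top-level dict, but nested lists are shared
    info["pos_args"].append({"name": "repro_note"})  # also mutates ir["pos_args"]
    return info


def repro_info_deep(ir):
    info = copy.deepcopy(ir)  # nested containers are copied as well
    info["pos_args"].append({"name": "repro_note"})
    return info


ir = make_ir()
repro_info_shallow(ir)
repro_info_shallow(ir)
print(len(ir["pos_args"]))  # 3: the source IR grew on every call

ir = make_ir()
repro_info_deep(ir)
repro_info_deep(ir)
print(len(ir["pos_args"]))  # 1: the source IR is untouched
```

The idempotency check listed in the test plan corresponds to the second case: repeated calls leave the source IR unchanged.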

Tests and docs

11 CPU-only unit tests + GPU integration test with kernel name validation
Docs rewritten: IR interpretability examples, step-by-step custom init guide, iteration annotations explained

Test plan

CPU unit tests pass (11/11, ~3s)
GPU integration tests pass (5/5 ops, MI300X)
- Kernel name match: all MATCH
- Lazy mode, get_repro_info idempotency, return value, first-match-wins: all PASS

Op resolution:
- Add _resolve_op_func() with JIT-first resolution (preserves in-place kernel dispatch for aten ops) and torch.ops fallback for custom ops
- Add _search_schemas() to collect schemas from both JIT registry and torch.ops namespace overloads

Schemaless replay:
- Add _get_event_replay_IR_schemaless() that infers argument types directly from profiled data when no schema is available, enabling replay of ops like _C::silu_and_mul and _C::rotary_embedding

Type handling:
- Fix Scalar type: preserve integer values for integral tensor ops instead of always casting to float
- Add str/str?, SymInt?/int?, Generator? support in schema matching
- Add _is_tensor_schema_type() for annotated variants like Tensor(a!)
- Add _should_skip_tensor_init() generalizing in-place/output detection

Dtype support (utils.py):
- Add long int, unsigned char, char, short, and FP8 types
- Use zeros init for non-floating-point tensors

Tested with:
- ResNet regression suite (70 aten:: ops)
- vLLM Qwen1.5-MoE-A2.7B trace: 7/9 aiter ops, plus _rocm_C::wvSplitK, _C::silu_and_mul, _C::rotary_embedding (requires import vllm._C/_rocm_C)

Made-with: Cursor

String arg defaults (_STR_ARG_DEFAULTS):
- When the profiler drops a str arg value, check a known-defaults table keyed by arg name (e.g. kv_cache_dtype -> "auto")
- Log a WARNING when a default is used so users know the value was inferred
- Recovers _C_cache_ops::reshape_and_cache_flash (1.97% GPU time)

Op name aliases (_OP_NAME_ALIASES):
- Map trace-recorded names to their runtime-registered names (e.g. _rocm_C::wvSplitK -> _rocm_C::wvSpltK)

Python module resolution (3rd strategy):
- After JIT and torch.ops, try importlib.import_module(namespace) for JIT-compiled ops like aiter

Schema parser fix:
- parse_schema_string now correctly handles annotated tensor types with spaces, like "Tensor($0! -> )"

Tested with vLLM Qwen1.5-MoE-A2.7B trace on MI300X (tw025).

Made-with: Cursor
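
The string-default and alias lookups described in this commit can be sketched as two small table lookups. The table names and their single example entries are taken from the commit message; the helper functions `resolve_op_name` and `recover_str_arg`, and the choice to treat an empty string as a dropped value, are assumptions of this sketch.

```python
import logging
from typing import Optional

logger = logging.getLogger("event_replay_sketch")

# Example entries quoted from the commit message; the real tables may hold more.
_STR_ARG_DEFAULTS = {"kv_cache_dtype": "auto"}
_OP_NAME_ALIASES = {"_rocm_C::wvSplitK": "_rocm_C::wvSpltK"}


def resolve_op_name(trace_name: str) -> str:
    """Map a trace-recorded op name to its runtime-registered name."""
    return _OP_NAME_ALIASES.get(trace_name, trace_name)


def recover_str_arg(arg_name: str, profiled_value: Optional[str]) -> Optional[str]:
    """Return the profiled value, or a known default if the profiler dropped it."""
    if profiled_value:  # None or "" is treated as dropped (sketch assumption)
        return profiled_value
    default = _STR_ARG_DEFAULTS.get(arg_name)
    if default is not None:
        logger.warning(
            "arg %r missing from trace; using inferred default %r", arg_name, default
        )
    return default
```

Logging at WARNING level keeps the inference visible: a replay that silently substituted "auto" could otherwise be mistaken for an exact reproduction of the traced call.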

Made-with: Cursor

Bug fixes:
- Fix lazy+auto_init crash: replay() now sets self.args in lazy mode so custom initializers can access them (BUG-1)
- Fix get_repro_info() shallow copy corruption: no longer mutates event_replay_IR on repeated calls (BUG-2)
- Fix batched_replay.py: handle benchmark_func dict return type, implement --op-filter and --op-limit flags (BUG-3)
- replay() now returns the op result instead of None (CLAIM-4)
- First-match-wins for custom initializers (CLAIM-1)
- Exact name matching for op_patterns (no more substring matching)

Tests:
- Add CPU-only unit tests (test_event_replay.py, 11 tests)
- Add GPU integration tests (test_event_replay_gpu.py) with kernel name validation

Docs (EventReplay.md):
- Fix benchmark_func example (wrong params and key names)
- Remove broken Shape Metadata Guide links
- Rewrite custom initializer section as step-by-step guide
- Rewrite iteration annotations section with full explanation
- Add batch replay CLI flag examples
- Update all op_patterns to fully-qualified names

ajassani changed the title ~~EventReplay: custom initializers, bug fixes, tests, and docs~~ EventReplay: extend beyond aten ops with auto-import, custom initializers, and iteration annotations Apr 28, 2026

ajassani changed the title ~~EventReplay: extend beyond aten ops with auto-import, custom initializers, and iteration annotations~~ EventReplay: extend beyond aten ops with auto-import and custom initializers Apr 28, 2026

ajassani and others added 4 commits April 28, 2026 15:48

Add custom initializers, auto-import, and updated docs for EventReplay

a2bd622

Made-with: Cursor

ajassani force-pushed the feature/event-replay-custom-ops branch from 94dd686 to 95406f3 Compare April 28, 2026 19:49

ajassani mentioned this pull request May 5, 2026

Extend Replay to any general dispatch functions beyond aten:: #312

Open

ajassani wants to merge 4 commits into main from feature/event-replay-custom-ops
