[Frontend] Remove the dead convert_index floor/mod codegen path #271

Open
YWHyuk wants to merge 11 commits into feature/togsim-cpp-trace from cleanup/remove-convert-index
YWHyuk commented Jun 24, 2026

Collaborator

Summary

The load/store affine-index path (convert_index and _convert_sympy_to_mlir_expr in PyTorchSimFrontend/mlir/mlir_codegen_backend.py) used to lower FloorDiv/ModularIndexing sub-expressions (view/reshape/transpose indices) into affine.apply maps with a constant divisor and a single free symbol.

This is superseded by axis-split's affine-only contract: axis_split.py strips FloorDiv/ModularIndexing from index expressions upstream at the Inductor scheduling layer (see docs/axis-split-scheduling.md), so MLIR codegen now only ever receives pure affine, constant-stride indices. The DMA-index path already asserts this.
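
As a rough illustration of why an aligned axis split leaves no floor/mod behind (a sketch, not the code in axis_split.py; the names split_axis and floor_mod_become_axes are hypothetical): once a flat axis of extent N is enumerated as an outer and an inner axis with flat = outer * d + inner, the expressions flat // d and flat % d are just the two axis variables, so an index built from them is plain affine.

```python
def split_axis(extent, divisor):
    """Enumerate a flat axis as (flat, outer, inner) with flat == outer * divisor + inner."""
    if extent % divisor != 0:
        # Assumption of this sketch: only the aligned case (divisor divides extent).
        raise ValueError("this sketch only covers the aligned case")
    for outer in range(extent // divisor):
        for inner in range(divisor):
            yield outer * divisor + inner, outer, inner


def floor_mod_become_axes(extent, divisor):
    """True if flat // divisor and flat % divisor equal the outer and inner axis variables."""
    return all(
        flat // divisor == outer and flat % divisor == inner
        for flat, outer, inner in split_axis(extent, divisor)
    )
```

For example, floor_mod_become_axes(12, 4) holds because every flat index in range(12) decomposes uniquely as 4 * outer + inner with inner in range(4).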

Verification

A temporary tripwire (raise RuntimeError) placed at the entry of both floor/mod handlers (the ModularIndexing and // branches of convert_index, and the ModularIndexing/FloorDiv branches of _convert_sympy_to_mlir_expr) never fired across:

  • View/reshape/transpose suite: test_floormod_axis_split (group_norm c//(C/G), repeat mod, repeat_interleave floor), test_transpose2D, test_transpose3D, test_view3D_2D, test_cat
  • Broad sanity set: test_add, test_matmul, test_reduce, test_softmax, test_layernorm, test_batchnorm, test_conv2d

confirming the floor/mod codegen path is dead.

Change

Remove the floor/mod lowering branches. Mirroring the existing DMA-index assert, replace them with a clear NotImplementedError so any residual floor/mod that escapes axis-split fails loudly instead of being silently mis-lowered. The now-unused re import is dropped.
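
A minimal sketch of what such a fail-loud guard could look like, assuming a toy expression tree instead of the sympy expressions the real backend handles. Node, NON_AFFINE_OPS, and assert_affine are hypothetical names for illustration and do not come from mlir_codegen_backend.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy index expression: op is one of sym, const, add, mul, floordiv, mod."""
    op: str
    args: tuple = ()


# Ops that axis-split is expected to have removed before codegen (assumption of this sketch).
NON_AFFINE_OPS = frozenset({"floordiv", "mod"})


def assert_affine(expr):
    """Walk the whole expression and raise if any floor/mod node survived."""
    stack = [expr]
    while stack:
        node = stack.pop()
        if node.op in NON_AFFINE_OPS:
            raise NotImplementedError(
                f"residual {node.op} in index expression; expected pure affine input"
            )
        # Leaves carry plain names or integers in args; only recurse into sub-nodes.
        stack.extend(arg for arg in node.args if isinstance(arg, Node))
    return expr
```

Checking the whole expression up front, rather than inside each lowering branch, means a residual floor/mod is reported before any partial MLIR is emitted.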

After the removal, the full view/op test suite still passes, with outputs unchanged under the allclose comparison.

🤖 Generated with Claude Code

YWHyuk force-pushed the cleanup/remove-convert-index branch from 95f127e to f4ac374 on June 24, 2026 12:58
YWHyuk force-pushed the feature/togsim-cpp-trace branch from 9b913d4 to 4767e8a on June 24, 2026 13:16
YWHyuk force-pushed the cleanup/remove-convert-index branch from f4ac374 to b16b2d8 on June 24, 2026 13:29
YWHyuk and others added 10 commits June 24, 2026 22:35
… feed

Skeleton + EmitC + cost/dep analysis on the frontend; the trace runtime,
loader, bridge, and Core feed on the simulator; shared MLIR pass helpers and
the pipeline tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Per-record tag key in the bridge plus per-iteration tag alloc in
dma-fine-grained so multi-tile-K and conv loads do not collide; strip the
reduction accum marker from the memory_barrier slot.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
togsim_dispatch with TILE_BEGIN/TILE_END; outline each work-item into
togsim_kernel_tile.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
DMA-capacity throttle and frozen-state guard, per-core VMEM in the configs,
and the SA weight-buffer throttle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
trace_timeline.py with per-work-item grouping and resource-centric DMA lanes;
the trace logs the first DRAM response and the assigned systolic array, and
scopes the compute barrier to its dispatch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Default to the trace path; fix uninitialized Instruction fields, the matmul
accumulator wedge, fused-subtile dedup, nested/fused epilogue dataflow, and
dma_wait fusion; bound concurrent dispatches to the spad, round-robin
work-items within a partition, benchmark autotune and run the multi-tenant
scheduler through the trace path, and emit trace.so for pooling/reduction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Carry simulator headers through the wrapper for cache-safe replay; drop verbose
[P3-trace] logs; fix the key.mlir compile race in load().

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
… runtime model

Replace the trace bridge's accumulated special cases with one dataflow rule and
clean up the runtime that consumes it.

Dependency rule: per SRAM buffer keep a writers SET; a reader depends on all
current writers (occupancy=ISSUE when both are systolic-array ops, else
latency=DONE); a writer REPLACEs the set. The only exception is is_mm_accum (a
matmul that reads and writes the same buffer = a commutative accumulator): skip
its read edge and UNION its write, waiting only the non-matmul init seed and not
ordering co-matmuls. This drops the matmul-accumulator chain that deadlocked the
SA weight-slot pipeline while keeping the init->matmul edge, and lets a vector
epilogue or the store wait every K matmul (fixes the pure-vector store that an
empty COMPUTE_BAR let slip).
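
One possible reading of the dependency rule above, written as a small Python sketch. This is not the bridge's code: Op, build_edges, and the field names are hypothetical, and the interpretation of "skip its read edge" (an accumulating matmul still waits on a non-matmul writer but ignores earlier matmul writers) is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    reads: tuple = ()
    writes: tuple = ()
    is_sa: bool = False       # runs on a systolic array
    is_matmul: bool = False


def build_edges(ops):
    """Return (producer, consumer, event) edges for ops given in program order."""
    writers = defaultdict(list)  # buffer -> current writer set (list keeps order deterministic)
    edges = []
    for op in ops:
        # is_mm_accum: a matmul that both reads and writes the same buffer.
        accum = set(op.reads) & set(op.writes) if op.is_matmul else set()
        for buf in op.reads:
            for writer in writers[buf]:
                if buf in accum and writer.is_matmul:
                    continue  # co-matmuls on an accumulator are not ordered
                event = "ISSUE" if (writer.is_sa and op.is_sa) else "DONE"
                edges.append((writer.name, op.name, event))
        for buf in op.writes:
            if buf in accum:
                writers[buf].append(op)   # UNION: accumulator keeps earlier writers
            else:
                writers[buf] = [op]       # REPLACE: a plain write supersedes them
    return edges
```

With an init op followed by two accumulating matmuls and a store on the same buffer, this yields init to each matmul, no matmul-to-matmul edge, and the store waiting on init and both matmuls.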

Remove COMPUTE_BAR entirely: a matmul is its own DONE-handle (finish == SA
drain), so the store JOINs the matmul writers directly. The whole emit/loader
chain is gone -- build_skeleton, lower_to_emitc, togsim.compute_barrier, the
runtime symbol, the Opcode/case/_fence_finish, and TraceRec::COMPUTE_BAR -- so a
stale producer fails loudly instead of emitting records the bridge would drop.
Only MEMORY_BAR remains (an async load's DONE is its data arrival, not issue).

Model compute-output spad footprint in the SRAM version/capacity machinery so
buffer reuse (WAR) is capacity-modeled, not a hard edge. The output size comes
from the DMA records that touch the same buffer (a buf_bytes pre-pass); an
in-place buffer (accumulator, relu) is version-transparent so footprint is not
double-counted. The occupy gate and version release sit in the MOVIN/MOVOUT/COMP
issue points (release before the COMP skip path so a skipped matmul still frees).

Runtime: collapse child_inst / _pipeline_children into one event-indexed
_deps[ISSUE|DONE] with add_dep(c, on) and fire(e); collapse the weight-slot
release queue and the async-load wakeup into one _due_events timed-effect table
drained by process_due_events. Both are behavior-preserving (byte-identical).
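
The two collapsed structures can be pictured with the following Python sketch. The simulator runtime is not written in Python and its real signatures are not shown in this PR, so Inst, DueEvents, pending, and schedule are assumptions for illustration; only the names add_dep, fire, _deps, _due_events, and process_due_events are taken from the text above.

```python
import heapq


class Inst:
    """An instruction with dependents keyed by the event that releases them."""

    def __init__(self, name):
        self.name = name
        self.pending = 0                          # unmet dependencies of this instruction
        self._deps = {"ISSUE": [], "DONE": []}    # event -> children waiting on it

    def add_dep(self, child, on):
        """Make child wait until this instruction fires the event named by on."""
        self._deps[on].append(child)
        child.pending += 1

    def fire(self, event):
        """Release children waiting on event; return those that became ready."""
        ready = []
        for child in self._deps[event]:
            child.pending -= 1
            if child.pending == 0:
                ready.append(child)
        self._deps[event] = []
        return ready


class DueEvents:
    """One timed-effect table for any effect that should happen at a future cycle."""

    def __init__(self):
        self._due_events = []   # heap of (cycle, sequence, effect)
        self._seq = 0           # tie-breaker so effects at the same cycle keep insertion order

    def schedule(self, cycle, effect):
        heapq.heappush(self._due_events, (cycle, self._seq, effect))
        self._seq += 1

    def process_due_events(self, now):
        """Run every effect whose cycle is at or before now, earliest first."""
        while self._due_events and self._due_events[0][0] <= now:
            _, _, effect = heapq.heappop(self._due_events)
            effect()
```

In this picture a weight-slot release and an async-load wakeup are both just callables scheduled at a cycle, which is why a single table can replace the two separate queues.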

Require the weight-slot model: sa_weight_buffer_depth must be > 0 (errors at
init), and the round-robin disable mode is removed. Degenerate traces (a
consumer-less preload, an unpinned matmul) hit explicit error+exit guards rather
than asserts that vanish under NDEBUG.

Mark the legacy ONNX TOG path deprecated: it is superseded by the trace path, so
TileGraphParser logs a deprecation warning and the TORCHSIM_LEGACY_TOG=1 opt-in
warns at command build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
YWHyuk force-pushed the feature/togsim-cpp-trace branch from b7c1ec4 to 9033945 on June 24, 2026 13:37
axis-split's affine-only contract (docs/axis-split-scheduling.md) linearizes
aligned FloorDiv/ModularIndexing into per-axis affine indices upstream, so the
load/store index reaching codegen is already pure affine. The convert_index
method (which lowered floor/mod view indices to affine.apply with a constant
divisor and a single free symbol) and the helper that guarded it are therefore
dead -- a tripwire at every floor/mod branch never fired across the view/op test
suite.

Remove convert_index entirely and inline its only live behaviour at the call
sites in _convert_sympy_to_mlir_expr and parse_index_list. A residual
FloorDiv/ModularIndexing now fails loudly via the whole-expression guard in each
of those functions (mirroring the DMA-index assert) instead of silently
mis-lowering. Also drop the assumption-stripping Symbol(str(...)) + expr.replace
round-trip in the affine-map builder: it only mattered when convert_index
transformed the term, and is now a verified no-op (indices and the affine string
depend only on symbol names). Drop the now-unused re import.

Verified: the full view/reshape/transpose suite (test_floormod_axis_split incl.
group_norm/repeat/repeat_interleave/mixed-radix/pixel_shuffle, transpose2D/3D,
view3D_2D, cat) plus add/matmul/reduce/softmax/layernorm/batchnorm pass
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG