Dynamic shape on the C++ trace path (elementwise add) #269

Open
YWHyuk wants to merge 24 commits into feature/togsim-cpp-trace from feature/togsim-dynamic-shape

@YWHyuk YWHyuk commented Jun 23, 2026

Collaborator

Summary

Adds torch.compile(dynamic=True) support on the C++ trace path. One compiled
trace producer .so serves any input size: the producer reads its loop bound
from shape_args at runtime, while the per-tile cost table (sampled once, since a
tile's cost is shape-invariant) keys the timing.

Verified end to end: a single dynamic a + b runs at N=1024 (2 tiles, 183
cycles) and N=2048 (4 tiles, 261 cycles) from the same .so, the trace
scaling with shape_args.

What each commit does

  1. Guard MLIR tile sizing against symbolic dims — neutralize the tile-fit
    heuristics for a sympy dim (they only minimize a known dim's tail); all gated
    on isinstance(.., sympy.Expr), so static is unchanged.
  2. Emit symbolic loop bounds and dynamic memref dims — size symbol becomes a
    scalar-int kernel arg; memref<?xf32>; affine.for ... to %<name>_bound with a
    top-of-function memref.load + index_cast prologue.
  3. Make the kernel meta import-safe — stringify sympy in arg_attributes so the
    generated wrapper imports (the extent already arrives as a runtime arg).
  4. Skip compile-time Spike validation for dynamic — its fixed-shape validation
    binary can't be instantiated for a runtime extent.
  5. Sample per-tile cycles on a one-tile copy: pin_loops_to_one_tile (general
    for static + dynamic) runs gem5 sampling on a one-tile copy; the symbolic IR is
    kept for the producer / cost table.
  6. Emit a dynamic-shape trace producer: build_skeleton skips the loop_end
    serialization it never needed; lower_to_emitc re-sources each still-used size
    arg from shape_args[k] via emitc.subscript, so the producer loop reads
    for (iv=0; iv<shape_args[k]; iv+=step).
  7. Pass the runtime shape via the attribute file — the scalar size goes into the
    existing per-kernel attribute YAML (shape_args), run_standalone passes
    --attribute, and main.cc fills shape_args for run_producer.
  8. Test: tests/ops/elementwise/test_dynamic_add.py.

Known limitations / follow-ups

  • Timing only: the Spike functional run (which fills the output tensor) is
    skipped for dynamic, so output values are zeros today; the test checks output
    shape. Functional output for dynamic is a follow-up.
  • Static cost sampling can use the (already general) pin_loops_to_one_tile once it
    is decoupled from run_tog; op coverage beyond contiguous 1D add (matmul /
    >1D-dynamic: symbolic-stride addressing, multi-symbol shape_args) is future work.

  • One change to the loop-padding mlir-opt pass (skip a symbolic bound) lives in the
    LLVM fork and is not part of this PR.

Base: stacked on feature/togsim-cpp-trace (#267).

🤖 Generated with Claude Code

@YWHyuk YWHyuk force-pushed the feature/togsim-dynamic-shape branch from c2a243b to 6b39c94 Compare June 23, 2026 06:00
@YWHyuk YWHyuk force-pushed the feature/togsim-cpp-trace branch from 9b913d4 to 4767e8a Compare June 24, 2026 13:16
YWHyuk and others added 10 commits June 24, 2026 22:35
… feed

Skeleton + EmitC + cost/dep analysis on the frontend; the trace runtime,
loader, bridge, and Core feed on the simulator; shared MLIR pass helpers and
the pipeline tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Per-record tag key in the bridge plus per-iteration tag alloc in
dma-fine-grained so multi-tile-K and conv loads do not collide; strip the
reduction accum marker from the memory_barrier slot.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
togsim_dispatch with TILE_BEGIN/TILE_END; outline each work-item into
togsim_kernel_tile.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
DMA-capacity throttle and frozen-state guard, per-core VMEM in the configs,
and the SA weight-buffer throttle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
trace_timeline.py with per-work-item grouping and resource-centric DMA lanes;
the trace logs the first DRAM response and the assigned systolic array, and
scopes the compute barrier to its dispatch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Default to the trace path; fix uninitialized Instruction fields, the matmul
accumulator wedge, fused-subtile dedup, nested/fused epilogue dataflow, and
dma_wait fusion; bound concurrent dispatches to the spad, round-robin
work-items within a partition, benchmark autotune and run the multi-tenant
scheduler through the trace path, and emit trace.so for pooling/reduction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Carry simulator headers through the wrapper for cache-safe replay; drop verbose
[P3-trace] logs; fix the key.mlir compile race in load().

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
… runtime model

Replace the trace bridge's accumulated special cases with one dataflow rule and
clean up the runtime that consumes it.

Dependency rule: per SRAM buffer keep a writers SET; a reader depends on all
current writers (occupancy=ISSUE when both are systolic-array ops, else
latency=DONE); a writer REPLACEs the set. The only exception is is_mm_accum (a
matmul that reads and writes the same buffer = a commutative accumulator): skip
its read edge and UNION its write, waiting only on the non-matmul init seed and
not ordering co-matmuls. This drops the matmul-accumulator chain that deadlocked
the SA weight-slot pipeline while keeping the init->matmul edge, and lets a
vector epilogue or the store wait on every K matmul (fixes the pure-vector store
that an empty COMPUTE_BAR let slip).
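
The dependency rule above can be sketched in a few lines of Python. This is only an illustration, not the bridge's code: the `Op` record, `build_edges`, and the buffer names are hypothetical, and treating "matmul" as "any op with `on_sa=True`" is an assumption made to keep the sketch small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    reads: tuple = ()
    writes: tuple = ()
    on_sa: bool = False  # assumption: a systolic-array op is a matmul here


def is_mm_accum(op, buf):
    """A matmul that both reads and writes `buf` uses it as an accumulator."""
    return op.on_sa and buf in op.reads and buf in op.writes


def build_edges(ops):
    writers = {}  # buffer name -> list of current writer ops
    edges = []    # (producer, consumer, event the consumer waits for)
    for op in ops:
        for buf in op.reads:
            accum = is_mm_accum(op, buf)
            for w in writers.get(buf, []):
                if accum and w.on_sa:
                    continue  # co-matmuls on one accumulator stay unordered
                kind = "ISSUE" if (w.on_sa and op.on_sa) else "DONE"
                edges.append((w.name, op.name, kind))
        for buf in op.writes:
            if is_mm_accum(op, buf):
                writers.setdefault(buf, []).append(op)  # UNION into the set
            else:
                writers[buf] = [op]                     # REPLACE the set
    return edges


ops = [
    Op("init", writes=("acc",)),
    Op("mm_k0", reads=("a0", "w0", "acc"), writes=("acc",), on_sa=True),
    Op("mm_k1", reads=("a1", "w1", "acc"), writes=("acc",), on_sa=True),
    Op("store", reads=("acc",)),
]
edges = build_edges(ops)
for edge in edges:
    print(edge)
```

In this toy trace both matmuls wait only on `init`, no edge orders `mm_k0` against `mm_k1`, and `store` joins all three writers.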

Remove COMPUTE_BAR entirely: a matmul is its own DONE-handle (finish == SA
drain), so the store JOINs the matmul writers directly. The whole emit/loader
chain is gone -- build_skeleton, lower_to_emitc, togsim.compute_barrier, the
runtime symbol, the Opcode/case/_fence_finish, and TraceRec::COMPUTE_BAR -- so a
stale producer fails loudly instead of emitting records the bridge would drop.
Only MEMORY_BAR remains (an async load's DONE is its data arrival, not issue).

Model compute-output spad footprint in the SRAM version/capacity machinery so
buffer reuse (WAR) is capacity-modeled, not a hard edge. The output size comes
from the DMA records that touch the same buffer (a buf_bytes pre-pass); an
in-place buffer (accumulator, relu) is version-transparent so footprint is not
double-counted. The occupy gate and version release sit in the MOVIN/MOVOUT/COMP
issue points (release before the COMP skip path so a skipped matmul still frees).

Runtime: collapse child_inst / _pipeline_children into one event-indexed
_deps[ISSUE|DONE] with add_dep(c, on) and fire(e); collapse the weight-slot
release queue and the async-load wakeup into one _due_events timed-effect table
drained by process_due_events. Both are behavior-preserving (byte-identical).

Require the weight-slot model: sa_weight_buffer_depth must be > 0 (errors at
init), and the round-robin disable mode is removed. Degenerate traces (a
consumer-less preload, an unpinned matmul) hit explicit error+exit guards rather
than asserts that vanish under NDEBUG.

Mark the legacy ONNX TOG path deprecated: it is superseded by the trace path, so
TileGraphParser logs a deprecation warning and the TORCHSIM_LEGACY_TOG=1 opt-in
warns at command build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
@YWHyuk YWHyuk force-pushed the feature/togsim-cpp-trace branch from b7c1ec4 to 9033945 Compare June 24, 2026 13:37
YWHyuk and others added 13 commits June 24, 2026 23:32
Under torch.compile(dynamic=True) the Inductor loop ranges carry sympy
symbols (e.g. ks0/s52) instead of concrete ints. The tile-size
heuristics did concrete-int arithmetic on those ranges and crashed with
sympy "cannot determine truth value" before any MLIR was emitted.

Neutralize the tile-fit heuristics for symbolic dims: they only shave a
tile to a known dim to minimize the wasted tail, which is meaningless
when the dim is unknown at compile time. Skip them, keep the fixed init
tile, and let the tail become a runtime remainder (masked).

- trim_large_tail: skip a dim whose range is symbolic
- get_padding_ratio: report zero padding for a symbolic dim/tile
- is_dim_dividable: raise a clear NotImplementedError for symbolic dims
  (the recompile-to-divisible path has no symbolic equivalent and would
  loop forever; index_expr/indirect indexing under dynamic shape is a
  later step)
- make_choices: drop a symbolic axis from the tile-grow candidates

All guards are isinstance(sympy.Expr)-gated, so the concrete-shape path
is unchanged.
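
A minimal sketch of the failure and the guard, assuming sympy is installed. The padding formula and the function signature are illustrative guesses, not the repository's get_padding_ratio; only the "report zero padding for a symbolic dim" behavior is taken from the commit message.

```python
import sympy


def get_padding_ratio(dim, tile):
    """Fraction of the padded extent that is tail padding (hypothetical formula)."""
    if isinstance(dim, sympy.Expr) or isinstance(tile, sympy.Expr):
        return 0.0  # unknown extent: no tail to minimize at compile time
    padded = -(-dim // tile) * tile  # round dim up to a whole number of tiles
    return (padded - dim) / padded


s52 = sympy.Symbol("s52", integer=True, positive=True)

try:
    bool(s52 > 512)  # concrete-int style comparison on a symbol
    comparison_failed = False
except TypeError:    # sympy: "cannot determine truth value of Relational"
    comparison_failed = True

print(comparison_failed)
print(get_padding_ratio(1000, 512))
print(get_padding_ratio(s52, 512))
```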

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
Make the MLIR backend emit valid IR for torch.compile(dynamic=True). A
size symbol (e.g. ks0) now becomes a usable kernel argument and the loop
over the dynamic dim carries the symbol as a runtime bound:

- mlir_argdefs: a size-symbol arg had no buffer_types entry (it is not a
  buffer/graph_input/constant), so it KeyError'd. Key it by name (which
  is also the host-side SymInt the wrapper passes) and describe it as a
  scalar int.
- get_mlir_shape: a symbolic numel becomes a dynamic memref dim ("?")
  instead of being stringified into an invalid type.
- LoopLevel: a symbolic upper bound is emitted as an index SSA value
  (%<name>_bound); a non-symbol symbolic expr raises NotImplementedError.
- codegen_loops: a prologue at the function top level loads each size arg
  (memref<1xi64>) and index_casts it to %<name>_bound, a valid affine
  symbol usable as the loop bound.

The emitted IR parses and lowers through the whole standard pipeline
(decompose/vlane -> fine-grained/vcix -> standard lowering) for a dynamic
elementwise add. Static kernels are unchanged (every path gates on
isinstance(.., sympy.Expr)).
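
As a rough picture of the emitted text, the sketch below renders the prologue and loop header as strings. The helper names, the `_val` suffix, and the assumption that a `%c0` index constant is already defined are all hypothetical, and the output has not been run through mlir-opt.

```python
def size_arg_prologue(name):
    """Lines that turn a memref<1xi64> size arg into an index-typed bound.

    Assumes %c0 (an index constant 0) is defined earlier in the function.
    """
    return [
        f"%{name}_val = memref.load %{name}[%c0] : memref<1xi64>",
        f"%{name}_bound = arith.index_cast %{name}_val : i64 to index",
    ]


def loop_header(iv, upper, step):
    """affine.for header; a str upper bound names a size symbol."""
    bound = f"%{upper}_bound" if isinstance(upper, str) else str(upper)
    return f"affine.for %{iv} = 0 to {bound} step {step} {{"


for line in size_arg_prologue("ks0"):
    print(line)
print(loop_header("iv", "ks0", 512))   # dynamic: bound is an SSA value
print(loop_header("iv", 1024, 512))    # static: bound stays a literal
```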

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
torch.compile(dynamic=True) puts sympy size symbols (e.g. s52) in the
arg_attributes shape/stride fields. define_kernel emitted that list as a
module-scope Python literal in the generated wrapper, so a bare s52 was
undefined at import time and raised NameError before call() ran.

Recursively stringify sympy expressions in the meta before emitting it
('s52'). The real extent already reaches the kernel as a runtime arg (the
wrapper's call() computes s52 from the input tensor shape and passes it),
so the compile-time descriptor only needs to be import-safe and
shape-agnostic. No-op for static kernels (their meta has no sympy).
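
The stringification can be sketched as a small recursive walk, assuming sympy is installed. `stringify_sympy` and the `meta` layout are hypothetical stand-ins for whatever define_kernel actually emits.

```python
import ast
import sympy


def stringify_sympy(obj):
    """Replace every sympy object in a nested list/tuple/dict with its str()."""
    if isinstance(obj, sympy.Basic):
        return str(obj)
    if isinstance(obj, (list, tuple)):
        return type(obj)(stringify_sympy(x) for x in obj)
    if isinstance(obj, dict):
        return {k: stringify_sympy(v) for k, v in obj.items()}
    return obj


s52 = sympy.Symbol("s52")
meta = {"shape": [s52], "stride": [sympy.Integer(1)], "dtype": "f32"}

try:
    eval(repr(meta), {})  # what a module-scope literal would do at import
    raw_is_import_safe = True
except NameError:         # bare s52 is undefined
    raw_is_import_safe = False

safe = stringify_sympy(meta)
round_trip = ast.literal_eval(repr(safe))  # plain literal, no names needed
print(raw_is_import_safe, safe)
```

Note that a concrete sympy.Integer also becomes a string ('1' here), so a consumer of the meta has to accept stringified numbers.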

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
The functional (Spike) validation binary is generated in MLIRCodeCache.load
at compile time with the tensor extent baked into the host buffer sizes
(mlir_caller_codegen allocates each buffer from arg_size). Under
torch.compile(dynamic=True) the extent is a runtime value (memref<?>), so
there is no concrete size to instantiate the fixed-shape validation binary
-- generate_args_define would size a buffer from the symbol and fail.

Skip the functional-validation block when the kernel MLIR carries a dynamic
memref dim (same effect as pytorchsim_functional_mode=off). The kernel is
still compiled shape-agnostically and timed via the gem5/TOG + trace path;
correctness of a dynamic kernel is validated at its concrete instantiation,
not at compile time.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
gem5 measures per-tile compute cost, which is shape-invariant. Add
pin_loops_to_one_tile (cycle_table.py): a general MLIR-bindings rewrite
that forces every affine.for which would iterate more than once to run a
single tile (upper bound -> the loop step). It handles both a constant
multi-iteration bound and a symbolic (runtime-extent) bound, so the cpp
TOG cycle sampling can use it for static and dynamic kernels alike.
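
The pinning rule can be illustrated on a toy loop record. The real pass rewrites affine.for ops through the MLIR bindings; the `Loop` dataclass and the use of a string for a symbolic bound are assumptions made for this sketch.

```python
from dataclasses import dataclass, replace
from typing import Union


@dataclass(frozen=True)
class Loop:
    lower: int
    upper: Union[int, str]  # int = constant bound, str = runtime size symbol
    step: int


def pin_to_one_tile(loop):
    """Force a loop that may iterate more than once to run exactly one tile."""
    symbolic = isinstance(loop.upper, str)
    if symbolic or loop.upper - loop.lower > loop.step:
        # With lower == 0 this is "upper bound -> the loop step".
        return replace(loop, upper=loop.lower + loop.step)
    return loop  # already a single iteration


print(pin_to_one_tile(Loop(0, 2048, 512)))
print(pin_to_one_tile(Loop(0, "s52", 512)))
print(pin_to_one_tile(Loop(0, 512, 512)))
```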

Wire it into MLIRCodeCache.load for dynamic shape: run the legacy cycle
machinery (run_tog -> _custom.mlir -> cycle binary -> gem5) on a one-tile
COPY of the post-vcix IR, while the symbolic _postvcix.mlir is kept for
the producer .so / cycle_table. The sampling host buffers are sized to
one tile (_concretize_attrs_for_sampling), and the legacy ONNX TOG output
(generate_tile_graph) is skipped for dynamic (it enumerates tiles
statically and is unused when the trace path is the default sim path).
dump_metadata now also tolerates a scalar size argument.

Static kernels are unchanged (every new branch gates on a dynamic memref
dim). Wiring the static cycle sampling through pin_loops_to_one_tile too
is the intended next step but needs the sampling decoupled from run_tog
(which also builds the legacy full TOG).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
Make the C++ trace producer .so build for a dynamic (runtime-extent)
kernel, so its loop bounds are read at runtime from shape_args.

- build_tog._build gains serialize=False: build_skeleton only needs the
  builder side effects (loop/compute/DMA nodes), not the serialized TOG
  string, whose display() formats a constant loop_end -- None for a
  dynamic loop. The bound stays on the affine.for in the IR.
- lower_to_emitc._rewrite_signature: an original kernel arg still used
  after build_skeleton's DCE is a size symbol (its memref.load feeds a
  loop bound; tensors are referenced by name in togsim.dma attrs and DCE
  to unused). Re-source each such load from shape_args[k] via
  emitc.subscript (k = the size arg's order), then drop the arg. The
  producer's loop then reads the runtime extent: for (iv=0; iv<shape_args[k]; ...).

Verified: a dynamic elementwise add builds one trace.so whose recorded
trace scales with shape_args (1024 -> 14 insts, 2048 -> 28).
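
A Python model of the producer loop reproduces the reported scaling. The step of 512 and the count of 7 records per tile are inferred from the numbers in this PR (1024 -> 2 tiles and 14 insts); the record names themselves are placeholders, not the real trace contents.

```python
TILE = 512  # assumed step: 1024 elements -> 2 tiles in the PR's numbers

# Hypothetical record names; only the per-tile count (7) comes from the PR.
PER_TILE = [
    "TILE_BEGIN", "MOVIN a", "MOVIN b", "MEMORY_BAR",
    "COMP", "MOVOUT out", "TILE_END",
]


def run_producer(shape_args, k=0, step=TILE):
    """Model of: for (iv=0; iv<shape_args[k]; iv+=step) { emit one tile }."""
    trace = []
    iv = 0
    while iv < shape_args[k]:  # bound is read at run time, not baked in
        trace.extend((iv, rec) for rec in PER_TILE)
        iv += step
    return trace


for n in (1024, 2048):
    print(n, len(run_producer([n])))
```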

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
…te file

The dynamic trace producer reads its loop bounds from shape_args; feed
them at simulation time through the existing per-kernel attribute YAML
(the file that already carries address_info), not a bespoke channel.

- write_kernel_attribute_file: a scalar input (a dynamic size arg, e.g.
  s52) is not a tensor address -- collect such scalars into a shape_args
  sequence in the YAML, in arg order (== the producer's shape_args[k]).
- run_standalone: pass --attribute <yaml> alongside --trace_so so the
  trace path receives it, the same file the legacy path passes via the
  models_list command.
- main.cc: add --attribute; in the trace branch load the YAML and fill
  shape_args from its shape_args sequence, passed to run_producer (was
  nullptr,0).
- run_kernel_simulation: skip the Spike functional run for a dynamic
  kernel (its fixed-shape validation binary is intentionally not built).
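
The arg-order contract between the YAML and the producer can be sketched as follows. The `(name, kind, value)` tuples and the shape of `address_info` are hypothetical; the only property taken from the commit message is that scalars are collected into `shape_args` in kernel-arg order.

```python
def build_kernel_attributes(args):
    """args: (name, kind, value) triples in kernel-arg order.

    kind is "tensor" (value = address) or "scalar" (value = runtime extent).
    """
    attrs = {"address_info": {}, "shape_args": []}
    for name, kind, value in args:
        if kind == "scalar":
            # Position in this list is the k in the producer's shape_args[k].
            attrs["shape_args"].append(int(value))
        else:
            attrs["address_info"][name] = value
    return attrs


attrs = build_kernel_attributes([
    ("arg0", "tensor", 0x1000),
    ("arg1", "tensor", 0x2000),
    ("s52", "scalar", 2048),
    ("out", "tensor", 0x3000),
])
print(attrs)
```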

Verified end to end: one compiled add runs at 1024 (183 cycles) and 2048
(261 cycles) from the same trace.so, driven by shape_args.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
…nary)

Produce correct output VALUES for a dynamic kernel: the Spike validation
binary is now shape-agnostic and reads the runtime extent from the
size-arg buffer, the same way the trace producer reads shape_args.

- Simulator.dump_args/write_arg: a size symbol arg (MLIR_ARGS_VAR) is a
  kernel input -- write its runtime value (int64) to a .raw so the kernel
  can load its loop bound. This is Spike's existing per-arg .raw channel
  (used for tensors); the size arg was just being skipped.
- mlir_caller_codegen: the validation binary loads each size arg first
  into N_<sym>, then mallocs the tensor buffers and builds the memref
  descriptors from N at runtime (not the compile-time extent). argv slots
  are assigned in arg order (matching dump_args). A numel that is a size
  SYMBOL becomes N_<sym>; a concrete numel (including a stringified
  sympy.Integer like '128') stays a literal.
- extension_codecache: build + run the validation binary for dynamic too.
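
The size-arg channel amounts to one int64 in a .raw file that the binary reads before sizing its buffers. The sketch below assumes little-endian layout, a hypothetical file name, and f32 elements; it is not the Simulator.write_arg implementation.

```python
import os
import struct
import tempfile


def write_size_arg(path, value):
    """Write one runtime extent as a single little-endian int64 (assumed layout)."""
    with open(path, "wb") as f:
        f.write(struct.pack("<q", value))


def read_size_arg(path):
    """What the validation binary would do first: load N_<sym> from the file."""
    with open(path, "rb") as f:
        (value,) = struct.unpack("<q", f.read(8))
    return value


with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "s52.raw")  # hypothetical file name
    write_size_arg(raw, 1000)
    raw_size = os.path.getsize(raw)
    n_s52 = read_size_arg(raw)

buffer_bytes = n_s52 * 4  # tensor buffers are sized from N at run time (f32)
print(raw_size, n_s52, buffer_bytes)
```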

Verified: one compiled add returns correct values at 1024 / 2048 / 1536
and a 1D tail size 1000 from the same binary. Tail/lane padding for >1D
shapes is a separate follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
One torch.compile(dynamic=True) add, run at 1024 and 2048 from a single
compiled trace producer .so, checking the output values (allclose) at
each size. Sizes are tile multiples so no tail padding is needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
Small robustness cleanups from the PR review (no behavior change):

- Add MLIRKernelArgs.is_mlir_arg_var and use it where the MLIR_ARGS_VAR
  mask was open-coded (mlir_caller_codegen._is_var, Simulator.dump_args).
- Detect a dynamic kernel in MLIRCodeCache.load via that flag
  (any size-symbol arg) instead of sniffing "memref<?" in the IR text.
- Drop a dead shape_args local in run_kernel_simulation: it was left over
  from an earlier run_spike gate; the runtime extents reach the simulator
  via the attribute YAML (write_kernel_attribute_file), not from there.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
The dynamic-shape tile/bound paths each had their own ad hoc guard for a
symbolic dimension (isinstance sympy.Expr / and-not-is_number variants).
Add one predicate, mlir_common.is_symbolic_dim(x) = a sympy.Expr that is
not a compile-time constant, and use it at every site: is_dim_dividable,
trim_large_tail, get_padding_ratio, LoopLevel._bound_str, and make_choices.
No behavior change (verified static 128/512 + dynamic add still pass); it
just gives one place to get the rule right when adding new dim arithmetic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
Full roadmap for extending the C++ trace path to general dynamic shape:
the runtime DMA stack already carries runtime dims/strides, so the work is
codegen (general symbolic index lowering + runtime togsim.dma descriptors);
7-phase build order, cross-cutting contracts, test matrix, risks. Notes
that dynamic floor/mod belongs in axis_split (symbolic-aware), not the
legacy convert_index affine path. Planning artifact -- remove before
merging the feature.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
Generalise axis_split boundary detection and divisibility-chain construction to
accept symbolic size expressions, as a strict superset of the integer case:
concrete-int reshapes produce identical split plans, and a dynamic reshape whose
flattened extent E is a product of dims (divisor a genuine factor, e.g.
FloorDiv(v, N) / ModularIndexing(v, 1, N) with extent M*N) is now detected.

- _divides/_eq/_gt1/_proper/_quotient/_as_size: boundary arithmetic that reduces
  exactly to int ops when operands are concrete and otherwise uses sympy (Mod
  simplifies to 0, cancel gives the quotient) under the symbols' integer/positive
  assumptions.
- _ordered_chain replaces _is_chain + numeric sort: orders boundaries by the
  divisibility partial order (b_i precedes b_j iff b_i | b_j) instead of numeric
  value, so symbolic suffix-product boundaries (N | M*N) chain; returns None on a
  non-total chain (incompatible radices) exactly as before.
- collect_boundaries / find_split_plan keep symbolic divisors and extents.
- build_split_body sizes sub-vars with _quotient/_as_size (symbolic seg extents).
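
Ordering by divisibility rather than numeric value can be sketched like this, assuming sympy is installed. `divides` and `ordered_chain` are simplified stand-ins for _divides and _ordered_chain: here divisibility is decided by cancelling the quotient and checking that sympy can prove it is an integer, which may differ from the Mod-based check the commit describes.

```python
import sympy


def divides(a, b):
    """True only when a provably divides b."""
    if isinstance(a, int) and isinstance(b, int):
        return b % a == 0  # concrete case reduces to plain int arithmetic
    quotient = sympy.cancel(sympy.sympify(b) / sympy.sympify(a))
    return quotient.is_integer is True


def ordered_chain(boundaries):
    """Sort boundaries so each divides the next; None if no total chain exists."""
    def rank(b):
        return sum(divides(a, b) for a in boundaries)  # number of divisors present

    chain = sorted(boundaries, key=rank)
    for lo, hi in zip(chain, chain[1:]):
        if not divides(lo, hi):
            return None  # incomparable pair: incompatible radices
    return chain


M, N = sympy.symbols("M N", integer=True, positive=True)
print(ordered_chain([M * N, N]))
print(ordered_chain([8, 2, 4]))
print(ordered_chain([4, 6]))
```

The integer/positive assumptions on the symbols matter: without them sympy cannot conclude that M*N / N is an integer.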

Detection layer only: the residual-floor/mod folding (_fold_with_ranges) for
symbolic divisors and the runtime dynamic-stride DMA needed for end-to-end
symbolic reshape are follow-ups. Verified by tests/test_axis_split_symbolic.py
(static cases match legacy, symbolic cases detected, misaligned/non-divisor bail)
and confirmed behaviour-neutral on the static view suite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
@YWHyuk YWHyuk force-pushed the feature/togsim-dynamic-shape branch from 8e9db1a to 82e9255 Compare June 24, 2026 14:43
Extract the per-dimension and per-type memref rendering into mlir_dim /
mlir_memref_type so the symbolic-vs-concrete decision lives in one place.
get_mlir_shape now delegates to mlir_memref_type.

This fixes a gating leak: get_mlir_shape decided a dynamic memref dim with
isinstance(numel, sympy.Expr), which is also true for a concrete sympy.Integer,
so a static buffer could be emitted as memref<?xf32> in one place and
memref<128xf32> in another -- an MLIR type mismatch that broke the static view
suite (test_cat, transpose2D/3D). mlir_dim gates on is_symbolic_dim, so a
concrete sympy.Integer renders as its value and only a true symbol becomes "?".
mlir_memref_type also takes a dim list, ready for multi-axis symbolic memrefs.
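
The leak and the fix can be reproduced in a few lines, assuming sympy is installed. The function bodies are a guess at the shape of is_symbolic_dim, mlir_dim, and mlir_memref_type based on the description above, not copies of the repository code.

```python
import sympy


def is_symbolic_dim(x):
    """A sympy expression that is not a compile-time constant."""
    return isinstance(x, sympy.Expr) and not x.is_number


def mlir_dim(x):
    """Render one memref dimension: '?' for a true symbol, else its value."""
    return "?" if is_symbolic_dim(x) else str(int(x))


def mlir_memref_type(dims, dtype="f32"):
    """Render a memref type from a list of dims."""
    return "memref<" + "x".join([mlir_dim(d) for d in dims] + [dtype]) + ">"


s52 = sympy.Symbol("s52", integer=True, positive=True)
static = sympy.Integer(128)

# The old gate: true for a concrete sympy.Integer as well as for a symbol.
leaky_gate = isinstance(static, sympy.Expr)

print(leaky_gate, is_symbolic_dim(static))
print(mlir_memref_type([static]))
print(mlir_memref_type([s52]))
print(mlir_memref_type([s52, 64]))
```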

Verified: static view suite recovered (cat 10, transpose2D 4, transpose3D 6,
view3D_2D 3) and test_dynamic_add still passes at N=1024 and N=2048.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
@YWHyuk YWHyuk force-pushed the feature/togsim-cpp-trace branch from c166abd to ed5c747 Compare June 25, 2026 07:29