
Develop#274

Open
YWHyuk wants to merge 42 commits into master from develop
Conversation

YWHyuk commented Jun 25, 2026

Collaborator

No description provided.

YWHyuk and others added 30 commits May 22, 2026 14:08
Stop actions/checkout from leaving its short-lived ghs_* installation
token in .git/config as http.<host>.extraheader after the workflow.
On self-hosted runners (build-and-test, tag_release build, tutorial
build) the work tree persists between jobs, so an interrupted or
killed job can leave the (later-expired) token behind. None of the
existing checkouts run authenticated git operations afterwards, so
disabling persistence is safe. Public submodules are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop-in reference for Claude Code (and similar AI tools) covering the
three-simulator pipeline (Gem5 -> Spike -> TOGSim), directory map,
test entry points, key env vars and YAML knobs, multi-tenant API
contract, build steps, and known gotchas. Complements README.md
with a denser cheat-sheet aimed at code navigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/build_from_source.sh now reads release_tag from the manifest
that the CI docker image is built against, and clones gem5, llvm-project,
and riscv-isa-sim at those tags. Untagged HEAD clones had caused
mlir-opt option-name drift (Pass-Options-Parser error followed by an
assert(0) in extension_codecache.py).

Documents the pin in README.md and CLAUDE.md, and adds a plain-text /
ASCII rule for commit messages in CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…overage

[Build] Pin third-party deps to thirdparty/github-releases.json tags
scripts/setup_worktree.sh creates a sibling worktree under
$PARENT_DIR/<repo>-<purpose>/, wires per-worktree env vars via a
generated .envrc (TORCHSIM_DIR, TORCHSIM_DUMP_PATH, TORCHSIM_LOG_PATH,
TOGSIM_CONFIG, PYTHONPATH), unsets the upstream so the first
`git push -u origin <branch>` creates the right remote branch, and
symlinks the TOGSim binary from the worktree the script was run from
to skip a ~10-minute TOGSim rebuild per fresh worktree (readlink -f
resolves chains so the link points at the real binary).

scripts/clear_codegen_cache.sh wipes Inductor's compile cache
(.torchinductor/, set via TORCHINDUCTOR_CACHE_DIR in
extension_config.py:139) and the per-source-hash dirs identified by
the 11-char prefix from extension_codecache.hash_prefix. Run it
between iterations on PyTorchSimFrontend/mlir/* or anything else
that affects emitted MLIR -- otherwise the next torch.compile
silently replays the previous compile. togsim_results/ and unrelated
files under outputs/ are preserved.

docs/worktrees.md documents the flow: activate via `source .envrc`,
build the .so once per worktree, TOGSim binary sharing, codegen
iteration loop, and the diagnostic for the common "forgot to source
.envrc" case (tracebacks pointing at the canonical worktree path
while editing elsewhere).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Running tests section now states that .github/workflows/pytorchsim_test.yml
runs an explicit allowlist of tests/*.py (one Docker job per test, ~40 total),
not a glob-discovered set, and calls out that test_gqa.py, test_gqa_decode.py,
and test_eager.py exist in the repo but are not in CI.

The Gotchas section adds a bullet explaining that codegen iteration
requires wiping $TORCHSIM_DUMP_PATH between runs -- Inductor's compile
cache and the per-source-hash MLIR/wrapper dirs both live under it and
will silently replay the previous (possibly broken) compile otherwise.

This prevents two common mistakes: assuming new tests/*.py files are
automatically gated on PR, and assuming a fresh re-run will pick up a
codegen fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…228)

The second clause of the index-cast guard at mlir_ops.py:268 compared
the whole [tile_size, dtype] list against the string "index", which
is never true. With i64 on the lhs and index on the rhs the code fell
through to the same-bit-width branch (MLIR_TO_BIT["i64"] ==
MLIR_TO_BIT["index"] == 64) and emitted arith.cmpi between
vector<Nxi64> and vector<Nxindex>, which mlir-opt rejects. This
blocked any MoE model whose (i64_buf == arange) expert-mask pattern
landed with that operand orientation -- first observed on
deepseek_v3.

Fix replaces the dead clause with op_type2[1] == "index" so the
operand2-side index_cast at lines 285-288 is reachable, normalizing
the rhs to i64 before the cmpi.
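
The dead clause is easy to reproduce outside the codebase. A minimal sketch, assuming operand types are plain [tile_size, dtype] lists as described above (the variable names and the helper are illustrative, not the ones in mlir_ops.py):

```python
# Illustrative stand-ins for the operand type descriptors.
op_type1 = [16, "i64"]
op_type2 = [16, "index"]

# Old guard: compares the whole list against a string, which is never equal.
buggy_clause = op_type2 == "index"

# Fixed guard: compares only the dtype element of the pair.
fixed_clause = op_type2[1] == "index"


def needs_rhs_index_cast(lhs, rhs):
    """True when an integer lhs meets an index-typed rhs (hypothetical helper)."""
    return lhs[1] != "index" and rhs[1] == "index"


print(buggy_clause, fixed_clause, needs_rhs_index_cast(op_type1, op_type2))
# prints: False True True
```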

Add tests/test_expert_mask.py as a focused regression covering the
(expert_idx_i64.unsqueeze(-1) == arange(N)) -> torch.where pattern.

Adds three top-level jobs (test_eager, test_exponent, test_sort) and two
steps inside the test_fusion job (test_attention_fusion, test_matmul_vector).
All five were verified locally before registration.

Files in tests/ that remain intentionally out of CI:
- test_gqa.py, test_gqa_decode.py: WIP GQA SDPA path, tracked by issue #198
- test_sdpa.py: overlaps with in-flight SDPA template work; ambiguous about
  which backend it actually exercises
- test_topk.py: sort-family coverage now provided by test_sort (stable +
  unstable + duplicate-key); revisit if topk-specific shapes need gating
- test_group_conv.py: not run locally yet (stress config); follow-up after
  runtime cost is understood
- test_vectorops.py: imports from other tests (test_add, test_activation,
  test_reduce, test_layernorm, test_softmax) which is fragile; needs an
  independent helper extraction before going into CI

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ovided (issue #236)

decompose_native_multi_head_attention referenced attn_bias before it was
defined, so any masked-attention call raised NameError. The same block
also computed attn_bias but never folded it into scores before softmax,
so the mask would have had no effect even if attn_bias existed.

Initialize attn_bias as zeros_like(scores), build the bias from the
boolean/additive mask, and add it to scores ahead of softmax.
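
The shape of the fix can be sketched in plain Python. This is not the decomposition itself: it uses nested lists instead of tensors, and it assumes a boolean mask where True means "attend" (the convention in the real code may differ).

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, bool_mask=None, additive_mask=None):
    # Start from a zero bias so attn_bias is always defined (zeros_like(scores)).
    attn_bias = [[0.0] * len(row) for row in scores]
    if bool_mask is not None:
        # Assumption: True keeps the position, False masks it out.
        for i, row in enumerate(bool_mask):
            for j, keep in enumerate(row):
                if not keep:
                    attn_bias[i][j] = -math.inf
    if additive_mask is not None:
        for i, row in enumerate(additive_mask):
            for j, value in enumerate(row):
                attn_bias[i][j] += value
    # Fold the bias into the scores BEFORE softmax; otherwise the mask is a no-op.
    biased = [[s + b for s, b in zip(s_row, b_row)]
              for s_row, b_row in zip(scores, attn_bias)]
    # Assumes every row keeps at least one unmasked entry.
    return [softmax(row) for row in biased]


weights = attention_weights([[1.0, 1.0], [1.0, 1.0]],
                            bool_mask=[[True, False], [True, True]])
print(weights)  # prints: [[1.0, 0.0], [0.5, 0.5]]
```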

Resolves issue #236.
…sue #237)

1. mlir_codegen_backend.py:909 - NotImplementedError was constructed but
   not raised, so kernels with multiple reduction axes silently fell
   through to incorrect codegen. Add the missing raise.

2. mlir_codegen_backend.py:171-179 - codegen_sram_plan_prefix called
   buf.get_size() before checking buf is None, so a None entry in
   graph_inputs would AttributeError. Reorder the None check first.

3. mlir_common.py - CSEProxy had two identical static check_bounds
   declarations; only the latter survived class-body evaluation. Drop
   the duplicate so future drift cannot hide which one is authoritative.

4. extension_codecache.py:274 - CustomAsyncCompile.mlir passed
   valdiation_wrapper_name (typo) carrying self.validation_binary_name
   to MLIRCodeCache.load. The misspelled kwarg was absorbed by **kwargs
   and validation_wrapper_name silently fell back to its default. Use
   the correct keyword and value.

Resolves issue #237.
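
Item 4 is a general Python hazard worth seeing in isolation. A minimal sketch with hypothetical function names and defaults (not the real MLIRCodeCache.load signature): a catch-all **kwargs swallows a misspelled keyword, so the intended parameter keeps its default.

```python
def load(source, validation_wrapper_name="validation_wrapper", **kwargs):
    """Hypothetical loader: unknown keywords are silently absorbed by **kwargs."""
    return validation_wrapper_name


def load_strict(source, validation_wrapper_name="validation_wrapper", **kwargs):
    """Same loader, but it rejects keywords it does not understand."""
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")
    return validation_wrapper_name


# The typo'd keyword is accepted and the default wins.
silently_default = load("kernel.mlir", valdiation_wrapper_name="my_validation_binary")
# The correctly spelled keyword takes effect.
correct = load("kernel.mlir", validation_wrapper_name="my_validation_binary")

print(silently_default, correct)
# prints: validation_wrapper my_validation_binary
```

Rejecting leftover keywords, as load_strict does, is one way to make this class of typo fail loudly; whether that is appropriate for the real loader is a separate design question.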
CONFIG_TORCHSIM_TOG_HOST_{CC,CFLAGS,LDFLAGS} were added in 54ccd4c but
no caller ever consumed them; MLIRCodeCache._load_library is still
`pass`. Issue #239 flagged the `if True:` shortcut in
_default_tog_host_cflags, but the helper's return value is never read,
so no .so is actually compiled with `-Og` as the issue assumed —
removing the whole block is the honest fix instead of restoring the
env-var gate around code nothing calls. If a host-side TOG .so compile
path lands later, the gate can be reintroduced at the real call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
)

`extension_config.CONFIG_TORCHSIM_DUMP_PATH` was resolved through the
module-level `__getattr__`, which read `TORCHSIM_DUMP_PATH` and
*mutated* `os.environ["TORCHINDUCTOR_CACHE_DIR"]` as a side effect on
every attribute access. Generic attribute reads should not change
process state, and the four codegen call sites trigger this on every
compile.

The dynamic re-derivation is load-bearing for the Jupyter tutorials
(tutorial/session1/CompilerOptimization.ipynb, Mapping.ipynb) which
flip `TORCHSIM_DUMP_PATH` between cells to compare fusion ON/OFF in
separate output dirs, so the issue's "move to module top-level" fix
would silently break them.

Instead expose the side effect as an explicit `get_dump_path()`
function — same behavior at the codegen call sites, but readers can
no longer trigger an env mutation by accident, and the function name
signals "this writes to os.environ".

Also fixes the unrelated bug noted at the end of #240: `__getattr__`
fell through to an implicit `None` return for unknown names. Now it
raises `AttributeError` per PEP 562, so typos like
`extension_config.CONFIG_NONEXISTENT` surface immediately instead of
becoming downstream `NoneType` errors.

Verified locally: `get_dump_path()` syncs `TORCHINDUCTOR_CACHE_DIR`,
follows dynamic env changes between calls, leaves env untouched on
unrelated attribute reads, and unknown attrs raise `AttributeError`.
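
The two behaviors can be sketched on a throwaway module object. Assumptions in this sketch: the fallback dump path, the `.torchinductor` subdirectory layout, and the `CONFIG_EXAMPLE_FLAG` name are illustrative only; the real extension_config.py may differ.

```python
import os
import types

# Hypothetical stand-in for the extension_config module.
config = types.ModuleType("extension_config_sketch")

_STATIC = {"CONFIG_EXAMPLE_FLAG": 1}  # illustrative static setting


def get_dump_path():
    """Return the dump path and sync TORCHINDUCTOR_CACHE_DIR (writes os.environ)."""
    dump_path = os.environ.get("TORCHSIM_DUMP_PATH", "/tmp/torchsim_dump")
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = os.path.join(dump_path, ".torchinductor")
    return dump_path


def _module_getattr(name):
    """PEP 562 hook: resolve known names, raise AttributeError for the rest."""
    if name in _STATIC:
        return _STATIC[name]
    raise AttributeError(f"module {config.__name__!r} has no attribute {name!r}")


config.get_dump_path = get_dump_path
config.__getattr__ = _module_getattr

os.environ["TORCHSIM_DUMP_PATH"] = "/tmp/run_fusion_on"
first = config.get_dump_path()
os.environ["TORCHSIM_DUMP_PATH"] = "/tmp/run_fusion_off"
second = config.get_dump_path()  # follows the env change between calls
```

Plain attribute reads such as `config.CONFIG_EXAMPLE_FLAG` leave the environment alone, and `config.CONFIG_NONEXISTENT` raises `AttributeError` instead of returning `None`.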

Closes #240.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sion

The guard at mlir_codegen_backend.py:909-910

    if len(reductions.loops) > 1:
        raise NotImplementedError("Not support multiple reduction axis..")

was a leftover from when the codegen was deliberately limited to a
single reduction axis, but multi-axis support was restored months ago
and the guard was never removed.

Timeline:
- 8eacf27 ("Optimize reduce/elementwise code", 2025): replaced
  `for reduction in reductions.loops` with `reductions.loops[0]`,
  added this guard (no `raise` — silently dead).
- c61f67d ("Support multi reduction dim + Add Diffusion model test",
  2025-08-14): restored the `for reduction_loop in reductions.loops`
  iteration explicitly to support multi-axis reduction. test_diffusion
  was added in this same commit and passed. The guard was left in
  place — harmless because still `raise`-less.
- 5045837 ("Fix four codegen correctness issues surfaced by review",
  2026-05-26, issue #237): added the missing `raise`, framing it as a
  "silently fell through to incorrect codegen" fix. In fact the
  codegen below the guard (which iterates all reduction loops) was
  correct — c61f67d had made it so — and adding `raise` broke the
  very test (test_diffusion) that c61f67d added to validate multi-axis
  support.

The right fix is to delete the guard, not add `raise` to it. The other
three items in #237 are unrelated and stand.

Verified: the file compiles with py_compile. CI test_diffusion is the
empirical proof that the post-guard codegen handles multi-axis reductions
correctly within the allclose tolerance the test uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s.py

Replaces 48 near-duplicate ~12-line `test_result(name, out, ...)` defs
(two signature variants, three slightly divergent bodies — one had a
'pass message only' bug) with a single canonical helper that prints
framed pass/fail messages and exits 1 on mismatch. Positional-argument
compatible, so caller sites are unchanged.

Each migrated file replaces its local def with:

  import os, sys
  sys.path.insert(0, os.path.join(
      os.environ.get("TORCHSIM_DIR", default="/workspace/PyTorchSim"), "tests"))
  from _pytorchsim_utils import test_result

The module name is deliberately unique rather than `tests._utils`:
ultralytics ships its own top-level `tests` package in site-packages,
which shadows any generic `tests` import. `insert(0, .../tests)` puts
the repo's tests dir ahead of site-packages so the helper resolves
regardless of installed packages.

Net diff: 50 files changed, 220 insertions(+), 729 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tests/ was a flat mix of op-level files, single-file model tests, and
inconsistently-cased model directories. Adds a hierarchy:

  tests/
    _pytorchsim_utils.py
    ops/
      elementwise/  reduce/  gemm/  conv/  attention/
      view/         sort/    sparsity/  misc/  fusion/
    models/
      DeepSeek/  Diffusion/  Llama/  MLP/  MoE/
      MobileNet/  Mixtral8x7B/  Yolov5/
      test_mlp.py  test_resnet.py  test_single_perceptron.py
      test_transformer.py  test_vit.py
    system/
      test_eager.py  test_hetro.py  test_scheduler.py
      test_stonne.py  test_vectorops.py

Mixtral_8x7B → Mixtral8x7B for consistency with the other PascalCase
model dirs. Existing single-file model dirs are kept as dirs (they
may grow companion files like the Mixtral model.py).

All file moves use `git mv` to preserve history. External path
references rewritten across .github/workflows/pytorchsim_test.yml,
README.md, CLAUDE.md, .github/ISSUE_TEMPLATE/bug_report.md, and
scripts/{sparsity,stonne}_experiment/.

Cross-test imports drop the `tests.` prefix (the prior commit puts
`<repo>/tests` on sys.path[0]). `__init__.py` added to each new subdir
so e.g. `from ops.elementwise.test_add import test_vectoradd` resolves.

Sample-verified locally on tests/ops/elementwise/test_add.py and
tests/ops/fusion/test_matmul_vector.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PyTorchSim never executes CUDA: the npu PrivateUse1 backend
(third_party/openreg) simulates CUDA on the CPU and forces math-only
SDPA, and flash-attn is never imported. The only thing that required
CUDA was find_package(Torch) inheriting it from the CUDA-built torch
wheel; installing the CPU wheels removes it, so the whole CUDA base
image can go.

Dockerfile.base:
- Base image ubuntu:22.04 instead of the cuda-cudnn-devel image, kept on
  22.04 to match the prebuilt gem5/llvm/spike ABIs (libpython3.11,
  libprotobuf.so.23).
- Python 3.11 via deadsnakes and CPU torch/torchvision wheels.
- Remove flash-attn (unused, needed CUDA/nvcc to build).
- Install psutil and the other yolov5 hub runtime deps (pyyaml, requests,
  tqdm, py-cpuinfo) that the old fat base provided implicitly; ultralytics
  is installed with --no-deps.
- Remove duplicate onnx/matplotlib/conan/ninja installs.
- Fix the RISC-V toolchain step to download and extract the elf toolchain
  once (the glibc tarball was downloaded but unused, the elf tarball was
  extracted twice).
- Drop conda/nvidia paths from LD_LIBRARY_PATH.

thirdparty/github-releases.json: pytorch_image -> ubuntu:22.04.

DeepSeek: its remote modeling file (trust_remote_code) imports flash_attn,
which transformers check_imports requires installed even though flash
attention is never executed on the npu. Register an import shim in
tests/models/DeepSeek/test_deepseek_v3_base.py that satisfies the static
import check while leaving is_flash_attn_2_available() False.

CI runners: the CPU-only image (~3.7 GB compressed) is small enough to
pull on GitHub-hosted runners, so the base build, app build, and the
op/model test jobs run on ubuntu-latest. Only test_deepseek (largest
model), test_diffusion (UNet2D simulation OOMs the hosted runner), and
test_accuracy (accuracy + speedup) stay on self-hosted.

The tests/ reorg (tests -> tests/ops, tests/models) left experiments/BERT.py
importing EncoderBlock from the old paths (tests.Fusion.test_transformer_fusion,
tests.test_transformer), so every BERT size raised ModuleNotFoundError. Put tests/
on sys.path and import from the new locations (ops.fusion.test_transformer_fusion,
models.test_transformer), matching how the test files import their siblings.

run_cycle.sh ran each model as 'python3 ... | tee log' under 'set -e' only, so a
model crash was masked by tee's exit code and the job stayed green with empty logs
(this is why BERT failures went unnoticed). Add 'set -o pipefail' so a failing
model aborts the run.
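
The masking effect is reproducible with a two-line pipeline. A minimal sketch, assuming bash is on PATH; `false` stands in for a crashing model and the commands are illustrative, not the ones in run_cycle.sh:

```python
import subprocess


def run_bash(script):
    """Run a bash snippet and return its exit code."""
    return subprocess.run(["bash", "-c", script]).returncode


# 'set -e' alone: the pipeline's status is tee's status, so the failure is hidden.
masked = run_bash("set -e; false | tee /dev/null")

# With pipefail the pipeline fails if any stage fails, and 'set -e' aborts.
caught = run_bash("set -e -o pipefail; false | tee /dev/null")

print(masked, caught)
```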

Verified: BERT --size base now runs end to end (TOGSim, 329573 cycles).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Run the MLIR -> LLVM pipeline in-process via the bindings PassManager.
Add Python out-of-line passes: lower_to_llvm, lower_dma_to_gemmini, lower_vlane_idx.
Auto-resolve the bindings path from TORCHSIM_LLVM_PATH and ship the bindings in the
LLVM artifact. Add op_coverage tooling and the bindings smoke test. Bump the LLVM
pin and rebuild the thirdparty base image.

Emit togsim.transfer for >4D DMA and decompose it to a <=4D customized
memref.dma_start: unit-collapse fast path, unrolled-subview peel for >4 effective
dims. Fix #258 by emitting affine.apply (not arith.addi) for the peeled DRAM offset
so the TOG pass can walk the loop index through it.

Split a loop axis so aligned FloorDiv/ModularIndexing collapse to per-axis affine
indices. Mixed-radix split over a divisibility chain; integer-typed split symbols;
r-prefix innermost reduce dims. Reindex the collapsed LoopBody instead of re-tracing;
fold residual floor/mod via tensor range info. Shared boundary helpers, rank guard,
and an uncovered floor/mod ledger. Enabled by default with the recompile fallback
instrumented.
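
The arithmetic behind the split can be checked in isolation. A minimal sketch with hypothetical helper names (not the axis-split implementation): for a loop axis of extent 24 indexed through dividers 2 and 8, the chain 2 | 8 | 24 yields split axes of sizes (3, 4, 2), and every floor/mod term becomes a single split index.

```python
from itertools import product


def split_sizes(extent, dividers):
    """Split-axis sizes, outermost first, for a divisibility chain of dividers."""
    chain = sorted(set(dividers)) + [extent]
    sizes, prev = [], 1
    for d in chain:
        if d % prev != 0:
            raise ValueError(f"{d} is not a multiple of {prev}: not a divisibility chain")
        sizes.append(d // prev)
        prev = d
    return sizes[::-1]


def linearize(digits, sizes):
    """Recombine mixed-radix digits (outermost first) into the original index."""
    index = 0
    for digit, size in zip(digits, sizes):
        index = index * size + digit
    return index


sizes = split_sizes(24, [2, 8])  # [3, 4, 2]
for outer, mid, inner in product(*(range(s) for s in sizes)):
    i = linearize((outer, mid, inner), sizes)
    assert i // 8 == outer        # FloorDiv(i, 8) collapses to the outer axis
    assert (i // 2) % 4 == mid    # ModularIndexing(i, 2, 4) collapses to the middle axis
    assert i % 2 == inner         # i mod 2 collapses to the inner axis
```

A divider that breaks the chain (for example 3 and 8 together) makes `split_sizes` raise, which corresponds to the incompatible-radix case that axis-split alone cannot linearize.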
… floor/mod

Insert a copy to relayout an operand whose floor/mod cannot be removed by axis-split:
incompatible-radix shared-axis access and cross-axis multi-variable arguments.
Enabled by default alongside axis-split.

Port the analysis and IR-mutation halves of the C++ test-tile-operation-graph pass
to Python, wire build_tog into the gem5 path, and drop the C++ pass. Node-id counter
is thread-local for concurrent compilation.
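
A thread-local counter of this kind can be sketched with threading.local. The names below are hypothetical and the real build_tog counter may be structured differently; the point is only that each compiling thread numbers its nodes independently.

```python
import threading

_state = threading.local()  # one independent counter per thread


def next_node_id():
    """Return the next node id for the calling thread, starting at 0."""
    current = getattr(_state, "next_id", 0)
    _state.next_id = current + 1
    return current


def compile_kernel(results, key, n_nodes):
    """Hypothetical compile job that allocates n_nodes ids."""
    results[key] = [next_node_id() for _ in range(n_nodes)]


results = {}
threads = [threading.Thread(target=compile_kernel, args=(results, key, 3))
           for key in ("kernel_a", "kernel_b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results["kernel_a"], results["kernel_b"])
# prints: [0, 1, 2] [0, 1, 2]
```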
…seek seed

Add tests/ops/view/test_floormod_axis_split.py covering axis-split and graph-copy
patterns. Seed the global RNG in the deepseek base test so config-random weights are
deterministic.
…cal SRAM offset

Rewrite the >4D peel to mirror the C++ -dma-fine-grained subtile loop: wrap the
outer dims in an affine.for nest (marked inner_loop so build_tog/TOG registers the
induction var) and emit one <=4D memref.dma_start per iteration. The slice SRAM
offset is the lane-banked physical offset -- split-outer dims rescaled by the lane
coeff (stride/old_size*new_size, the MVIN block_stride / buildSramAffineMap rule) --
delivered as the last SRAM index operand. The previous unrolled subview carried the
offset in the subview, which extract_aligned_pointer_as_index strips in the gemmini
lowering, so every slice aliased the same spad location (pixel_shuffle MISMATCH). The
DRAM offset folds with the original index into one affine.apply so processDramIndices
can walk the loop index (#258).

Thread vectorlane (systolic-array size) through run_python_passes into the pass for
the rescale's nr_outerloop. Drop the axis-split rank guard now that >4D is peeled
correctly, and register tests/ops/view/test_floormod_axis_split.py in the CI allowlist.

Validated end-to-end (Gem5+Spike+TOGSim): pixel_shuffle (>4D peel) and the full
floor/mod suite pass; elementwise/gemm/conv2d/reduce/softmax/MLP regress clean.

Route every MVIN/MVOUT -- both the MLIRKernel load/store backend path and the
template path (gemm/conv/bmm/maxpool/cat) -- through emit_transfer, so a single
decompose-transfer pass lowers all DMAs to memref.dma_start. This drops the
get_dma_code emitter, the _dma_needs_transfer instance flag, and
format_dma_op_attributes.

togsim.transfer now also carries subtile_size and async, which decompose
propagates onto the lowered dma_start (subtile filtered to the kept axes when
unit dims collapse). For <=4D tiles decompose emits the descriptor directly on
the original SRAM buffer (no collapse_shape) so the C++ -dma-fine-grained
subtile split, which walks the SRAM operand, sees a direct buffer as before.

Validated end-to-end (Spike + TOGSim) on elementwise, gemm (matmul/addmm), bmm,
conv2d, group_conv, pool, cat, reduce, softmax, layernorm, batchnorm.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect a6b7ebb: the find_split_plan rank guard is gone (>4D index now lowers
through the decompose-transfer affine.for peel, pixel_shuffle end-to-end), and
the decompose-transfer peel <-> TOG incompatibility is resolved. Move it from
Known-issues to Done; drop the >4D rank-guard caveat and the high-rank next-step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Port mlir/test/lib/Analysis/TestDmaFineGrained.cpp to a Python out-of-line pass
(passes/dma_fine_grained.py): split the matmul MVIN DMAs (input/weight/bias) into
subtile affine.for nests and fuse the input/weight nests, replacing the C++
-dma-fine-grained pass. The MLIR Python bindings expose no IRMapping, so the fused
nest is built directly (each DMA emitted with the fused induction vars) instead of
cloning bodies -- structurally equivalent, not byte-exact SSA text.

Pipeline: the single mlir-opt invocation is split around the Python pass
(loop-padding -> run_fine_grained in place -> pytorchsim-to-vcix) in both the
functional and gem5 paths (extension_codecache); vectorlane (systolic-array size)
is threaded in for the lane-banked SRAM offset rescale.

Validated against mlir-opt -dma-fine-grained on rank 2/3/4 fixtures (matmul / bmm /
conv: same vcix dma_start and line counts) and end-to-end (Gem5+Spike+TOGSim):
gemm/bmm/conv2d plus the resnet/transformer/vit/mlp models pass.

Docs: dma-transfer-lowering.md -- >4D peel is affine.for + lane-banked physical SRAM
offset via the last index operand; dma_fine_grained / build_tog are now Python
passes; the #258 appendix is marked resolved.
…ings)

Port mlir/test/lib/Conversion/PyTorchSimToVCIX/TestPyTorchSimToVCIXConversion.cpp to
a Python out-of-line pass (passes/lower_to_vcix.py): lower linalg.matmul (gemm and
conv2d) and the transcendental math ops (exp/erf/tanh/sin/cos) to VCIX dialect ops
(RISC-V vector custom instructions), replacing the C++ -test-pytorchsim-to-vcix.

The C++ pass is a dialect conversion (applyPartialConversion); the bindings expose no
conversion framework, so each matchAndRewrite is reimplemented as imperative IR
rewriting. The VCIX dialect is not in the Python bindings, so vcix ops are created as
unregistered generic ops -- mlir-opt / mlir-translate (vcix registered) re-parse the
{}-attr generic form fine, and run_standard_lowering already consumes vcix output via
allow_unregistered_dialects, so this matches the existing pipeline.

Pipeline: the vcix mlir-opt invocation is dropped; run_to_vcix runs in-process after
the Python fine-grained pass and before the standard lowering (both functional and
gem5 paths in extension_codecache). mlir-opt now runs only -test-loop-padding.

Validated structurally against mlir-opt -test-pytorchsim-to-vcix (non-constant ops
byte-identical including the dma_wait tag maps, on gemm and conv2d fixtures) and
numerically end-to-end (Gem5+Spike+TOGSim allclose): gemm/bmm/conv2d (incl. large
N/K), softmax, exp/erf/sin/cos, and the resnet18/vit/transformer/mlp models.

dma-fine-grained and pytorchsim-to-vcix are now Python passes (dma_fine_grained,
lower_to_vcix); update the docstring listing -- only test-loop-padding still runs
in mlir-opt.

axis-split + graph-copy (on by default) linearize aligned floor/mod at the
scheduling layer, so the index reaching get_dma_info is affine and the
FloorDiv/ModularIndexing tile-divisibility branches there are never entered
(measured: 0 entries across elementwise, gemm, bmm, conv, cat, floor/mod,
reduce, attention). Remove those dead branches and their orphans:

  - the FloorDiv and ModularIndexing tile-forcing + RecompileSignal blocks
  - the implicit-ModularIndexing index rewrite and implicit_local_dims
  - the dead ModularIndexing branch in the dram_stride computation
  - is_modular_indexing, the write-only implicit_dim_size, unused import sys

Kept: the non-floor/mod recompile paths (index-divisibility, indirect access,
non-power-of-2 vec size), RecompileSignal, and the retry loop. The upstream
implicit_dim_ops tile-forcing is left untouched (separate change).

Validated end-to-end (Spike + TOGSim): elementwise, gemm, bmm, conv2d,
group_conv, pool, cat, floor/mod suite, reduce, softmax, layernorm, batchnorm,
gqa -- all pass, 0 recompiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
YWHyuk and others added 12 commits June 19, 2026 13:16
…-split)

implicit_dim_ops/extract_dividers/apply_constraints forced the initial tile size
to match a view's floor/mod divider, up front in compute_tile_size. axis-split now
linearizes those views at the scheduling layer, so the forcing is redundant:
disabling it leaves every test allclose-correct and, on the affected kernels,
slightly faster (the forced tile was over-constrained -- batchnorm 1189->1114,
layernorm 4092->3947 cycles; non-floor/mod kernels unchanged).

Remove the machinery and its now-unused imports (ModularIndexing, FloorDiv, Mod,
MemoryDep, StarDep, WeakDep).

Validated end-to-end (Spike + TOGSim): elementwise, gemm, bmm, conv2d, group_conv,
pool, cat, floor/mod suite, reduce, softmax, layernorm, batchnorm, gqa -- all pass,
0 recompiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the TPU layout-assignment & padding investigation (docs/tpu_layout_padding_report.md)
and the loop-padding design doc. Settled model: padding is two layers --
(A) lane/sublane 8x128 alignment is materialized (footprint + DMA traffic), (B) the
compute-block (MXU tile) boundary tail is masked (compute-utilization only, not traffic).
test-loop-padding's post-codegen heuristic is to be replaced by informed emission at the
scheduling/codegen layer (decide early, materialize late); the two costs must be modeled
by separate functions (do not double-count the compute-block tail as traffic).
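
The two-layer model can be written as two separate cost functions. A minimal sketch: the 8x128 alignment comes from the text above, while the 128x128 compute-block size and the function names are assumptions made only for illustration.

```python
def ceil_to(value, multiple):
    """Round value up to the next multiple."""
    return -(-value // multiple) * multiple


def materialized_elems(rows, cols, sublane=8, lane=128):
    """Layer A: elements actually stored and moved after 8x128 alignment."""
    return ceil_to(rows, sublane) * ceil_to(cols, lane)


def compute_utilization(rows, cols, block_rows=128, block_cols=128):
    """Layer B: useful fraction of issued compute-block slots (tail is masked).

    The 128x128 block size is an assumed MXU tile, not taken from the report.
    """
    issued = ceil_to(rows, block_rows) * ceil_to(cols, block_cols)
    return (rows * cols) / issued


rows, cols = 100, 200
traffic = materialized_elems(rows, cols)        # 104 * 256 = 26624 elements
utilization = compute_utilization(rows, cols)   # 20000 / 32768
print(traffic, round(utilization, 4))
# prints: 26624 0.6104
```

Keeping the functions separate means the masked compute-block tail (32768 issued slots here) never leaks into the traffic estimate (26624 elements).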
…ases

Three fixes from the max-effort review of this branch:

- get_dma_info: after retiring the floor/mod recompile branches, a residual
  floor/mod (store-side ModularIndexing, reduction-axis floor/mod, incompatible
  radix) that axis-split/graph-copy did not linearize was silently bucketed by
  its base symbol in the dram_stride loop, emitting a wrong DRAM descriptor.
  Raise NotImplementedError instead of mis-striding silently. No test triggers
  it (0 floor/mod reach get_dma_info in the suite) -- it is a safety net.

- decompose_transfer collapse fast path: keep=[g[-1]] picked the last dim of
  each reassociation group, which is a unit dim when trailing unit dims attach
  after the non-unit one (e.g. [..,4,1,1]); strides/subtile were read from the
  wrong axis. Pick the non-unit dim in each group.

- decompose_transfer >4D peel: new_vlane fell back to 0 whenever the vlane split
  axis was not among the inner 4 dims, conflating peeled-into-the-outer-loop
  (genuinely unrepresentable -> raise) with a unit lane axis (default 0 is fine).
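
The collapse fast-path fix (second bullet) can be illustrated on a small shape. A minimal sketch with hypothetical helper names; the real decompose_transfer code operates on MLIR types rather than Python lists.

```python
def keep_last(shape, groups):
    """Old behavior: always take the last dim of each reassociation group."""
    return [group[-1] for group in groups]


def keep_non_unit(shape, groups):
    """Fixed behavior: take the non-unit dim of each group when there is one."""
    keep = []
    for group in groups:
        non_unit = [d for d in group if shape[d] != 1]
        if len(non_unit) > 1:
            raise ValueError("group has several non-unit dims; not a unit collapse")
        # An all-unit group has no preferred dim, so the last one is fine.
        keep.append(non_unit[0] if non_unit else group[-1])
    return keep


shape = [2, 4, 1, 1]
groups = [[0], [1, 2, 3]]  # trailing unit dims attach after the non-unit dim

print(keep_last(shape, groups), keep_non_unit(shape, groups))
# prints: [0, 3] [0, 1]
```

With `keep_last`, strides and subtile sizes would be read from dim 3 (extent 1) instead of dim 1 (extent 4).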

Validated: elementwise, gemm, conv2d, cat, floor/mod suite (incl. pixel_shuffle
>4D peel), softmax, layernorm, batchnorm -- all pass, no spurious raise.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
More fixes from the max-effort review, after verifying each against the C++
reference and reachability:

- graph_copy _relayout_args: ranges picked the consumer iteration shape by rank
  alone (max key=len), so for two equal-rank operands with different per-dim
  extents the broadcast-from operand's smaller shape could win and the real
  incompatible-radix conflict on the broadcast-to dim was missed (order-dependent:
  a commutative reorder flipped correct relayout into a silent miss). Use per-dim
  max extent over the max-rank operands.

- lower_to_vcix _sew/_legalize_vector_type: mirror the C++ legalizeVectorType --
  F16/BF16 return sew 0 (transcendentals stay unlowered for -convert-math-to-llvm,
  as in the validated path) instead of being lowered to VCIX, and add the missing
  rank != 1 guard.

- lower_to_vcix matmul: port the C++ guards as loud failures -- M/N/K must be a
  multiple of the systolic size when > SS (else the N//SS / K//SS loops drop the
  tail tile), and A vs B must agree on the K subtile (last-writer-wins would pick
  one silently). Latent today (heuristic/autotune only emit SS-multiple tiles).

- Doc-only: graph-copy is default-on (TORCHSIM_GRAPH_COPY=0 to disable); fixed the
  two stale 'no-op unless set' comments.
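
The order dependence in the first bullet is easy to demonstrate. A minimal sketch with hypothetical helper names, using plain tuples for operand shapes:

```python
def shape_by_rank(shapes):
    """Old behavior: the first shape of maximal rank wins (order-dependent)."""
    return max(shapes, key=len)


def shape_by_extent(shapes):
    """Fixed behavior: per-dim max extent over the max-rank operands."""
    rank = max(len(s) for s in shapes)
    widest = [s for s in shapes if len(s) == rank]
    return tuple(max(dims) for dims in zip(*widest))


broadcast_from = (4, 1)  # smaller extent on the broadcast dim
broadcast_to = (4, 8)

print(shape_by_rank([broadcast_from, broadcast_to]),
      shape_by_rank([broadcast_to, broadcast_from]))
# prints: (4, 1) (4, 8)

print(shape_by_extent([broadcast_from, broadcast_to]),
      shape_by_extent([broadcast_to, broadcast_from]))
# prints: (4, 8) (4, 8)
```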

Validated: elementwise, gemm, bmm, conv2d, group_conv, cat, floor/mod suite,
softmax, layernorm -- all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fix exp chunk

Two fixes to the C++->Python vcix port (lower_to_vcix.py) that SDPA exercises but
the gemm/bmm/conv tests do not:

- _lower_matmul bailed with 'if ATag is None or BTag is None: return False',
  gating on an MVIN dma_start tag for both operands. In SDPA's fused scores.V
  matmul, operand B is the softmax output produced in place by affine.vector_store,
  not DMAed, so BTag stayed None and the matmul was left un-lowered -> wrong
  attention output. Mirror the C++ MatmulOpLowering: an operand is initialized by
  either a dma_start OR a preceding affine.vector_store into its root memref; bail
  only when an operand is truly uninitialized. BTag/BAsync stay None/0 and are only
  read under 'if BAsync:', so the B dma_wait is correctly skipped (as in C++).

- _make_sf_vc_v_iv n>1 transcendental chunking called
  vector.ExtractStridedSliceOp(offsets, sizes, strides, vec) -- wrong arg order,
  missing the result type and vector operand, raising TypeError under these MLIR
  bindings. Pass (result=legal_ty, vector=vec, offsets, sizes, strides). Only
  reached by large transcendentals (n>1), e.g. SDPA softmax exp, so CI's small-tile
  (n==1) tests never hit it.

Validated end-to-end (Spike+TOGSim allclose): SDPA 56 cases pass (was crash/wrong);
matmul/bmm/conv2d regress clean. Bisected: C++ vcix passes SDPA, Python vcix did
not; exp chunking and fine-grained ruled out separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…bug env vars

axis-split and graph-copy are the floor/mod handling path and were
default-on but still gated behind TORCHSIM_AXIS_SPLIT / TORCHSIM_GRAPH_COPY.
Remove the gates so they run unconditionally, and delete the env vars that
were only introduced for validation/debug during development:

  TORCHSIM_AXIS_SPLIT, TORCHSIM_GRAPH_COPY        - default-on toggles
  TORCHSIM_AXIS_SPLIT_FORCE                        - force-split validation aid
  TORCHSIM_AXIS_LEDGER + axis_split.ledger()       - coverage measurement
  TORCHSIM_DEBUG_AXIS_SPLIT + _dump_axis()         - debug dump
  TORCHSIM_GRAPH_COPY_DEBUG                         - graph-copy debug prints
  TORCHSIM_RECOMPILE_LOG                            - vestigial recompile log

Also drop the now-dead ledger() function, the _dump_axis() helper, and the
unused os import in graph_copy.py. The floor/mod regression test no longer
sets the removed env vars. Behavior is unchanged (the toggles were already on).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dma_fine_grained and lower_to_vcix already exposed run(module, **opts) like the
registered decompose_transfer/lower_vlane_idx, but were called file-based and
directly from extension_codecache, so the .mlir was parsed+printed twice (once per
pass) between loop-padding and the standard lowering, and the pipeline was
hardcoded+duplicated across the functional and gem5 paths.

Give both passes MARKERS and group the four rewrite passes into PRE_OPT_PASSES /
POST_OPT_PASSES around the one remaining mlir-opt pass (-test-loop-padding). A
single driver run_module_passes(in, out, passes, **opts) parses once, runs each
marker-matched pass on the shared Module in order, prints once (copies through when
no marker matches). run_python_passes is now PRE_OPT via that driver; the
functional/gem5 fine-grained+vcix calls each become one run_module_passes.
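
The driver shape described above can be sketched without the MLIR bindings. Everything here is a stand-in: the Module class just wraps text, the two toy passes and their markers are hypothetical, and matching markers against the input text is an assumption about how selection works.

```python
import os
import shutil
import tempfile
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for a parsed module (holds raw text only)."""
    text: str


class RenameOpPass:
    MARKERS = ("togsim.transfer",)

    @staticmethod
    def run(module, **opts):
        module.text = module.text.replace("togsim.transfer", "memref.dma_start")


class FillLanePass:
    MARKERS = ("LANE_PLACEHOLDER",)

    @staticmethod
    def run(module, **opts):
        module.text = module.text.replace("LANE_PLACEHOLDER", str(opts.get("vectorlane", 0)))


def run_module_passes(in_path, out_path, passes, **opts):
    """Parse once, run each marker-matched pass in order, print once."""
    with open(in_path) as f:
        source = f.read()
    selected = [p for p in passes if any(marker in source for marker in p.MARKERS)]
    if not selected:
        shutil.copyfile(in_path, out_path)  # copy through when nothing matches
        return []
    module = Module(source)  # the single "parse"
    for p in selected:
        p.run(module, **opts)
    with open(out_path, "w") as f:
        f.write(module.text)  # the single "print"
    return [p.__name__ for p in selected]


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.mlir")
dst = os.path.join(workdir, "out.mlir")
with open(src, "w") as f:
    f.write("togsim.transfer %a, %b\n")

ran = run_module_passes(src, dst, [RenameOpPass, FillLanePass], vectorlane=128)
with open(dst) as f:
    lowered = f.read()
```

Here only RenameOpPass matches, so the file is read and written exactly once regardless of how many passes are registered.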

run_fine_grained / run_to_vcix stay re-exported for standalone/CLI use.

Validated (Spike+TOGSim): elementwise, gemm, conv2d, softmax, floor/mod suite,
SDPA -- all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Guards the Spike VI_VV_EXT (vsext/vzext) lane bug where widening int conversions
(int8/int16 -> wider via tensor.to) returned scrambled/zero output. Signed and
uint8<128 only so it is independent of the separate uint8->int8 dtype issue (#238).

Requires the riscv-isa-sim fix (PSAL-POSTECH/riscv-isa-sim#4) in the pinned Spike;
add to the CI allowlist + bump thirdparty/github-releases.json spike tag once that
release is cut.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v1.0.2 includes the riscv-isa-sim VI_VV_EXT fix (PSAL-POSTECH/riscv-isa-sim#4)
so integer widening conversions no longer scramble. Bumping the pin changes the
thirdparty base-image hash, so ensure-base rebuilds the base with the new Spike.
Wire test_widen_dtype.py into the allowlist now that the fixed Spike is pinned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The wrapper2 test job passed vpu_num_lanes and vpu_spad_size_kb_per_lane
as docker env vars, but since the unified-config refactor (9db0f2c) the
frontend reads these values only from the TOGSim YAML, never from the
environment. The env vars were therefore silently ignored and wrapper2
ran with the default 128x128 config, making it an exact duplicate of
wrapper1.

Switch the reusable workflow to select a config YAML via TOGSIM_CONFIG so
that a single file drives both codegen and the TOGSim cycle model, in
line with the unified-config design. vpu_num_lanes also sets the systolic
array dimension, so each config exercises a different array size:

- configs: add systolic_ws_32x32_c1_simple_noc_tpuv3.yml (32x32 array,
  32 KB/lane SPAD) and systolic_ws_8x8_c1_simple_noc_tpuv3.yml (8x8
  array, 32 KB/lane SPAD); both identical to the tpuv3 default otherwise
- pytorchsim_test.yml: replace vector_lane/spad_size inputs with
  togsim_config (string) and run_accuracy (bool); every job now sets
  -e TOGSIM_CONFIG instead of the dead vpu_* env vars
- docker-image.yml: wrapper1 -> 128x128 config + run_accuracy true,
  wrapper2 -> 32x32 config, wrapper3 -> 8x8 config (run_accuracy false)

test_scheduler compiles resnet18 and EncoderBlock and launches each
model twice, so the per-kernel RISC-V ELFs, objdump disassembly dumps
and gem5 m5out directories accumulate within a single run. On the small
github-hosted root volume (~14G) this overflows during the RISC-V final
link step (ld: final link failed: No space left on device).

Free the preinstalled tool caches before the run and redirect the
PyTorchSim outputs/ and /tmp artifacts onto the larger /mnt scratch
disk (~70G) so the accumulated artifacts no longer fill the root volume.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
[CI] Run test_scheduler on github-hosted with extra disk

2 participants