Develop by YWHyuk · Pull Request #274 · PSAL-POSTECH/PyTorchSim

YWHyuk · 2026-06-25T05:23:05Z

No description provided.

Stop actions/checkout from leaving its short-lived ghs_* installation token in .git/config as http.<host>.extraheader after the workflow. On self-hosted runners (build-and-test, tag_release build, tutorial build) the work tree persists between jobs, so an interrupted or killed job can leave the (later-expired) token behind. None of the existing checkouts run authenticated git operations afterwards, so disabling persistence is safe. Public submodules are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop-in reference for Claude Code (and similar AI tools) covering the three-simulator pipeline (Gem5 -> Spike -> TOGSim), directory map, test entry points, key env vars and YAML knobs, multi-tenant API contract, build steps, and known gotchas. Complements README.md with a denser cheat-sheet aimed at code navigation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

scripts/build_from_source.sh now reads release_tag from the manifest that the CI docker image is built against, and clones gem5, llvm-project, and riscv-isa-sim at those tags. Untagged HEAD clones had caused mlir-opt option-name drift (Pass-Options-Parser error followed by an assert(0) in extension_codecache.py). Documents the pin in README.md and CLAUDE.md, and adds a plain-text / ASCII rule for commit messages in CLAUDE.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…overage [Build] Pin third-party deps to thirdparty/github-releases.json tags

scripts/setup_worktree.sh creates a sibling worktree under $PARENT_DIR/<repo>-<purpose>/, wires per-worktree env vars via a generated .envrc (TORCHSIM_DIR, TORCHSIM_DUMP_PATH, TORCHSIM_LOG_PATH, TOGSIM_CONFIG, PYTHONPATH), unsets the upstream so the first `git push -u origin <branch>` creates the right remote branch, and symlinks the TOGSim binary from the worktree the script was run from to skip a ~10-minute TOGSim rebuild per fresh worktree (readlink -f resolves chains so the link points at the real binary).

scripts/clear_codegen_cache.sh wipes Inductor's compile cache (.torchinductor/, set via TORCHINDUCTOR_CACHE_DIR in extension_config.py:139) and the per-source-hash dirs identified by the 11-char prefix from extension_codecache.hash_prefix. Run it between iterations on PyTorchSimFrontend/mlir/* or anything else that affects emitted MLIR -- otherwise the next torch.compile silently replays the previous compile. togsim_results/ and unrelated files under outputs/ are preserved.

docs/worktrees.md documents the flow: activate via `source .envrc`, build the .so once per worktree, TOGSim binary sharing, codegen iteration loop, and the diagnostic for the common "forgot to source .envrc" case (tracebacks pointing at the canonical worktree path while editing elsewhere).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Running tests section now states that .github/workflows/pytorchsim_test.yml runs an explicit allowlist of tests/*.py (one Docker job per test, ~40 total), not a glob-discovered set, and calls out that test_gqa.py, test_gqa_decode.py, and test_eager.py exist in the repo but are not in CI. The Gotchas section adds a bullet explaining that codegen iteration requires wiping $TORCHSIM_DUMP_PATH between runs -- Inductor's compile cache and the per-source-hash MLIR/wrapper dirs both live under it and will silently replay the previous (possibly broken) compile otherwise. This prevents two common mistakes: assuming new tests/*.py files are automatically gated on PR, and assuming a fresh re-run will pick up a codegen fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…228) The second clause of the index-cast guard at mlir_ops.py:268 compared the whole [tile_size, dtype] list against the string "index", which is never true. With i64 on the lhs and index on the rhs the code fell through to the same-bit-width branch (MLIR_TO_BIT["i64"] == MLIR_TO_BIT["index"] == 64) and emitted arith.cmpi between vector<Nxi64> and vector<Nxindex>, which mlir-opt rejects. This blocked any MoE model whose (i64_buf == arange) expert-mask pattern landed with that operand orientation -- first observed on deepseek_v3. Fix replaces the dead clause with op_type2[1] == "index" so the operand2-side index_cast at lines 285-288 is reachable, normalizing the rhs to i64 before the cmpi. Add tests/test_expert_mask.py as a focused regression covering the (expert_idx_i64.unsqueeze(-1) == arange(N)) -> torch.where pattern.
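
A minimal sketch of why the old clause was dead, assuming the operand descriptor is the `[tile_size, dtype]` list named above. The function names and the tile size are hypothetical and only illustrate the comparison; this is not the code in mlir_ops.py.

```python
# Hypothetical operand descriptor in the [tile_size, dtype] form described above.
op_type2 = [16, "index"]

def rhs_is_index_buggy(op_type):
    # Compares the whole list against a string, so it can never be True.
    return op_type == "index"

def rhs_is_index_fixed(op_type):
    # Compares only the dtype slot, which is what the guard intended.
    return op_type[1] == "index"

print(rhs_is_index_buggy(op_type2))  # False
print(rhs_is_index_fixed(op_type2))  # True
```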

Adds three top-level jobs (test_eager, test_exponent, test_sort) and two steps inside the test_fusion job (test_attention_fusion, test_matmul_vector). All five were verified locally before registration.

Files in tests/ that remain intentionally out of CI:
- test_gqa.py, test_gqa_decode.py: WIP GQA SDPA path, tracked by issue #198
- test_sdpa.py: overlaps with in-flight SDPA template work; ambiguous about which backend it actually exercises
- test_topk.py: sort-family coverage now provided by test_sort (stable + unstable + duplicate-key); revisit if topk-specific shapes need gating
- test_group_conv.py: not run locally yet (stress config); follow-up after runtime cost is understood
- test_vectorops.py: imports from other tests (test_add, test_activation, test_reduce, test_layernorm, test_softmax) which is fragile; needs an independent helper extraction before going into CI

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ovided (issue #236) decompose_native_multi_head_attention referenced attn_bias before it was defined, so any masked-attention call raised NameError. The same block also computed attn_bias but never folded it into scores before softmax, so the mask would have had no effect even if attn_bias existed. Initialize attn_bias as zeros_like(scores), build the bias from the boolean/additive mask, and add it to scores ahead of softmax. Resolves issue #236.
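
The order of operations in the fix (zero-initialized bias, bias built from the mask, bias added before softmax) can be sketched in plain Python. This is an illustration only: the function names are hypothetical, nested lists stand in for tensors, and the convention that `True` means "attend" is an assumption, not taken from the decomposition itself.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, bool_mask=None):
    # Start from a zero bias so the unmasked path is always defined.
    attn_bias = [[0.0 for _ in row] for row in scores]
    if bool_mask is not None:
        # Assumed convention: True keeps a position, False masks it out.
        for i, mask_row in enumerate(bool_mask):
            for j, keep in enumerate(mask_row):
                if not keep:
                    attn_bias[i][j] = float("-inf")
    # Fold the bias into the scores BEFORE softmax, otherwise the mask has no effect.
    biased = [[s + b for s, b in zip(srow, brow)]
              for srow, brow in zip(scores, attn_bias)]
    return [softmax(row) for row in biased]

weights = masked_attention_weights([[1.0, 2.0, 3.0]], [[True, True, False]])
print(weights[0][2])  # 0.0
```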

…sue #237)

1. mlir_codegen_backend.py:909 - NotImplementedError was constructed but not raised, so kernels with multiple reduction axes silently fell through to incorrect codegen. Add the missing raise.
2. mlir_codegen_backend.py:171-179 - codegen_sram_plan_prefix called buf.get_size() before checking buf is None, so a None entry in graph_inputs would AttributeError. Reorder the None check first.
3. mlir_common.py - CSEProxy had two identical static check_bounds declarations; only the latter survived class-body evaluation. Drop the duplicate so future drift cannot hide which one is authoritative.
4. extension_codecache.py:274 - CustomAsyncCompile.mlir passed valdiation_wrapper_name (typo) carrying self.validation_binary_name to MLIRCodeCache.load. The misspelled kwarg was absorbed by **kwargs and validation_wrapper_name silently fell back to its default. Use the correct keyword and value.

Resolves issue #237.
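
Item 4 is a general Python hazard, sketched here with hypothetical stand-in functions (`load` and `load_strict` are not the real MLIRCodeCache.load signature): a misspelled keyword is swallowed by `**kwargs`, so the intended parameter keeps its default without any error.

```python
def load(source, validation_wrapper_name="default_wrapper", **kwargs):
    # Any unknown keyword lands in kwargs and is silently ignored.
    return validation_wrapper_name

def load_strict(source, validation_wrapper_name="default_wrapper", **kwargs):
    # One possible hardening: reject keywords the function does not understand.
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")
    return validation_wrapper_name

# The typo (valdiation) is absorbed, so the default wins.
print(load("kernel.mlir", valdiation_wrapper_name="my_wrapper"))  # default_wrapper
# The correctly spelled keyword is honored.
print(load("kernel.mlir", validation_wrapper_name="my_wrapper"))  # my_wrapper
```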

CONFIG_TORCHSIM_TOG_HOST_{CC,CFLAGS,LDFLAGS} were added in 54ccd4c but no caller ever consumed them; MLIRCodeCache._load_library is still `pass`. Issue #239 flagged the `if True:` shortcut in _default_tog_host_cflags, but the helper's return value is never read, so no .so is actually compiled with `-Og` as the issue assumed — removing the whole block is the honest fix instead of restoring the env-var gate around code nothing calls. If a host-side TOG .so compile path lands later, the gate can be reintroduced at the real call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

) `extension_config.CONFIG_TORCHSIM_DUMP_PATH` was resolved through the module-level `__getattr__`, which read `TORCHSIM_DUMP_PATH` and *mutated* `os.environ["TORCHINDUCTOR_CACHE_DIR"]` as a side effect on every attribute access. Generic attribute reads should not change process state, and the four codegen call sites trigger this on every compile. The dynamic re-derivation is load-bearing for the Jupyter tutorials (tutorial/session1/CompilerOptimization.ipynb, Mapping.ipynb) which flip `TORCHSIM_DUMP_PATH` between cells to compare fusion ON/OFF in separate output dirs, so the issue's "move to module top-level" fix would silently break them. Instead expose the side effect as an explicit `get_dump_path()` function — same behavior at the codegen call sites, but readers can no longer trigger an env mutation by accident, and the function name signals "this writes to os.environ". Also fixes the unrelated bug noted at the end of #240: `__getattr__` fell through to an implicit `None` return for unknown names. Now it raises `AttributeError` per PEP 562, so typos like `extension_config.CONFIG_NONEXISTENT` surface immediately instead of becoming downstream `NoneType` errors. Verified locally: `get_dump_path()` syncs `TORCHINDUCTOR_CACHE_DIR`, follows dynamic env changes between calls, leaves env untouched on unrelated attribute reads, and unknown attrs raise `AttributeError`. Closes #240. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
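
A rough sketch of the two behaviors described above, built on a throwaway module object so it runs standalone. The default dump path and the `.torchinductor` subdirectory layout are assumptions for illustration, and `demo_config` is a hypothetical stand-in for extension_config.

```python
import os
import types

def get_dump_path():
    """Read TORCHSIM_DUMP_PATH and sync TORCHINDUCTOR_CACHE_DIR (writes os.environ)."""
    # Hypothetical default; the real default is not stated in this PR.
    dump_path = os.environ.get("TORCHSIM_DUMP_PATH", "/tmp/torchsim_dump")
    # Assumed layout: the Inductor cache lives in .torchinductor/ under the dump path.
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = os.path.join(dump_path, ".torchinductor")
    return dump_path

def _module_getattr(name):
    # PEP 562 hook: unknown names raise AttributeError instead of returning None.
    raise AttributeError(f"module 'demo_config' has no attribute {name!r}")

demo_config = types.ModuleType("demo_config")
demo_config.get_dump_path = get_dump_path
demo_config.__getattr__ = _module_getattr

os.environ["TORCHSIM_DUMP_PATH"] = "/tmp/fusion_on"
print(demo_config.get_dump_path())  # /tmp/fusion_on
os.environ["TORCHSIM_DUMP_PATH"] = "/tmp/fusion_off"
print(demo_config.get_dump_path())  # /tmp/fusion_off
```

Because the path is re-read on every call, flipping the env var between notebook cells still takes effect, while plain attribute reads no longer touch os.environ.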

…sion

The guard at mlir_codegen_backend.py:909-910

    if len(reductions.loops) > 1:
        raise NotImplementedError("Not support multiple reduction axis..")

was a leftover from when the codegen was deliberately limited to a single reduction axis, but multi-axis support was restored months ago and the guard was never removed.

Timeline:
- 8eacf27 ("Optimize reduce/elementwise code", 2025): replaced `for reduction in reductions.loops` with `reductions.loops[0]`, added this guard (no `raise`, so silently dead).
- c61f67d ("Support multi reduction dim + Add Diffusion model test", 2025-08-14): restored the `for reduction_loop in reductions.loops` iteration explicitly to support multi-axis reduction. test_diffusion was added in this same commit and passed. The guard was left in place, harmless because still `raise`-less.
- 5045837 ("Fix four codegen correctness issues surfaced by review", 2026-05-26, issue #237): added the missing `raise`, framing it as a "silently fell through to incorrect codegen" fix. In fact the codegen below the guard (which iterates all reduction loops) was correct (c61f67d had made it so), and adding `raise` broke the very test (test_diffusion) that c61f67d added to validate multi-axis support.

Right fix is to delete the guard, not add `raise` to it. The other three items in #237 are unrelated and stand.

Verified: file py_compiles. CI test_diffusion is the empirical proof that the post-guard codegen handles multi-axis correctly within the allclose tolerance the test uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s.py

Replaces 48 near-duplicate ~12-line `test_result(name, out, ...)` defs (two signature variants, three slightly divergent bodies, one of which had a 'pass message only' bug) with a single canonical helper that prints framed pass/fail messages and exits 1 on mismatch. Positional-argument compatible, so caller sites are unchanged.

Each migrated file replaces its local def with:

    import os, sys
    sys.path.insert(0, os.path.join(
        os.environ.get("TORCHSIM_DIR", default="/workspace/PyTorchSim"),
        "tests"))
    from _pytorchsim_utils import test_result

The module name is deliberately unique rather than `tests._utils`: ultralytics ships its own top-level `tests` package in site-packages, which shadows any generic `tests` import. `insert(0, .../tests)` puts the repo's tests dir ahead of site-packages so the helper resolves regardless of installed packages.

Net diff: 50 files changed, 220 insertions(+), 729 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tests/ was a flat mix of op-level files, single-file model tests, and inconsistently-cased model directories. Adds a hierarchy:

    tests/
      _pytorchsim_utils.py
      ops/     elementwise/ reduce/ gemm/ conv/ attention/ view/ sort/ sparsity/ misc/ fusion/
      models/  DeepSeek/ Diffusion/ Llama/ MLP/ MoE/ MobileNet/ Mixtral8x7B/ Yolov5/
               test_mlp.py test_resnet.py test_single_perceptron.py test_transformer.py test_vit.py
      system/  test_eager.py test_hetro.py test_scheduler.py test_stonne.py test_vectorops.py

Mixtral_8x7B → Mixtral8x7B for consistency with the other PascalCase model dirs. Existing single-file model dirs are kept as dirs (they may grow companion files like the Mixtral model.py).

All file moves use `git mv` to preserve history. External path references rewritten across .github/workflows/pytorchsim_test.yml, README.md, CLAUDE.md, .github/ISSUE_TEMPLATE/bug_report.md, and scripts/{sparsity,stonne}_experiment/.

Cross-test imports drop the `tests.` prefix (the prior commit puts `<repo>/tests` on sys.path[0]). `__init__.py` added to each new subdir so e.g. `from ops.elementwise.test_add import test_vectoradd` resolves.

Sample-verified locally on tests/ops/elementwise/test_add.py and tests/ops/fusion/test_matmul_vector.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PyTorchSim never executes CUDA: the npu PrivateUse1 backend (third_party/openreg) simulates CUDA on the CPU and forces math-only SDPA, and flash-attn is never imported. The only thing that required CUDA was find_package(Torch) inheriting it from the CUDA-built torch wheel; installing the CPU wheels removes it, so the whole CUDA base image can go.

Dockerfile.base:
- Base image ubuntu:22.04 instead of the cuda-cudnn-devel image, kept on 22.04 to match the prebuilt gem5/llvm/spike ABIs (libpython3.11, libprotobuf.so.23).
- Python 3.11 via deadsnakes and CPU torch/torchvision wheels.
- Remove flash-attn (unused, needed CUDA/nvcc to build).
- Install psutil and the other yolov5 hub runtime deps (pyyaml, requests, tqdm, py-cpuinfo) that the old fat base provided implicitly; ultralytics is installed with --no-deps.
- Remove duplicate onnx/matplotlib/conan/ninja installs.
- Fix the RISC-V toolchain step to download and extract the elf toolchain once (the glibc tarball was downloaded but unused, the elf tarball was extracted twice).
- Drop conda/nvidia paths from LD_LIBRARY_PATH.

thirdparty/github-releases.json: pytorch_image -> ubuntu:22.04.

DeepSeek: its remote modeling file (trust_remote_code) imports flash_attn, which transformers check_imports requires installed even though flash attention is never executed on the npu. Register an import shim in tests/models/DeepSeek/test_deepseek_v3_base.py that satisfies the static import check while leaving is_flash_attn_2_available() False.

CI runners: the CPU-only image (~3.7 GB compressed) is small enough to pull on GitHub-hosted runners, so the base build, app build, and the op/model test jobs run on ubuntu-latest. Only test_deepseek (largest model), test_diffusion (UNet2D simulation OOMs the hosted runner), and test_accuracy (accuracy + speedup) stay on self-hosted.
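
One way an import shim like the DeepSeek one can be registered, sketched with the standard library only. `register_import_shim` is a hypothetical helper, not the code in test_deepseek_v3_base.py, and the sketch covers only the import side; how transformers' availability probes react to it is not modeled here.

```python
import importlib
import importlib.machinery
import importlib.util
import sys
import types

def register_import_shim(name):
    """Make `import <name>` succeed with an empty placeholder module."""
    if name in sys.modules:
        return sys.modules[name]
    if importlib.util.find_spec(name) is not None:
        # A real package is installed; prefer it over a placeholder.
        return importlib.import_module(name)
    shim = types.ModuleType(name)
    # Give the placeholder a spec so importlib.util.find_spec() does not fail on it.
    shim.__spec__ = importlib.machinery.ModuleSpec(name, loader=None)
    sys.modules[name] = shim
    return shim

register_import_shim("flash_attn")
import flash_attn  # resolves to the placeholder unless the real package is installed
```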

The tests/ reorg (tests -> tests/ops, tests/models) left experiments/BERT.py importing EncoderBlock from the old paths (tests.Fusion.test_transformer_fusion, tests.test_transformer), so every BERT size raised ModuleNotFoundError. Put tests/ on sys.path and import from the new locations (ops.fusion.test_transformer_fusion, models.test_transformer), matching how the test files import their siblings. run_cycle.sh ran each model as 'python3 ... | tee log' under 'set -e' only, so a model crash was masked by tee's exit code and the job stayed green with empty logs (this is why BERT failures went unnoticed). Add 'set -o pipefail' so a failing model aborts the run. Verified: BERT --size base now runs end to end (TOGSim, 329573 cycles). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Run the MLIR -> LLVM pipeline in-process via the bindings PassManager. Add Python out-of-line passes: lower_to_llvm, lower_dma_to_gemmini, lower_vlane_idx. Auto-resolve the bindings path from TORCHSIM_LLVM_PATH and ship the bindings in the LLVM artifact. Add op_coverage tooling and the bindings smoke test. Bump the LLVM pin and rebuild the thirdparty base image.

Emit togsim.transfer for >4D DMA and decompose it to a <=4D customized memref.dma_start: unit-collapse fast path, unrolled-subview peel for >4 effective dims. Fix #258 by emitting affine.apply (not arith.addi) for the peeled DRAM offset so the TOG pass can walk the loop index through it.
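
The rank decision described above can be sketched as a small planning function. `plan_transfer` is a hypothetical name, and peeling the outermost effective dims is an assumption made for illustration; the real pass also handles strides, offsets, and the emitted IR.

```python
def plan_transfer(shape, max_rank=4):
    """Return (peeled_outer_dims, descriptor_dims) as lists of (axis, extent)."""
    # Unit dims carry no addressing information, so drop them first.
    effective = [(axis, extent) for axis, extent in enumerate(shape) if extent != 1]
    if len(effective) <= max_rank:
        # Fast path: the collapsed tile fits one <=4D DMA descriptor.
        return [], effective
    # Otherwise peel the outermost effective dims and keep the inner max_rank dims.
    split = len(effective) - max_rank
    return effective[:split], effective[split:]

print(plan_transfer((1, 2, 1, 3, 4, 5)))
# ([], [(1, 2), (3, 3), (4, 4), (5, 5)])
print(plan_transfer((2, 3, 4, 5, 6, 7)))
# ([(0, 2), (1, 3)], [(2, 4), (3, 5), (4, 6), (5, 7)])
```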

Split a loop axis so aligned FloorDiv/ModularIndexing collapse to per-axis affine indices. Mixed-radix split over a divisibility chain; integer-typed split symbols; r-prefix innermost reduce dims. Reindex the collapsed LoopBody instead of re-tracing; fold residual floor/mod via tensor range info. Shared boundary helpers, rank guard, and an uncovered floor/mod ledger. Enabled by default with the recompile fallback instrumented.
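
The arithmetic behind the split can be checked by brute force. This sketch takes one axis of extent 24, splits it with the radices (2, 3, 4), and confirms that each floor/mod expression over the flat index equals one of the split indices. It assumes the usual Inductor meaning ModularIndexing(x, d, m) = (x // d) % m; `mixed_radix_digits` is a hypothetical helper.

```python
from itertools import product

def mixed_radix_digits(index, radices):
    """Decompose index into digits, with radices listed outermost first."""
    digits = []
    for radix in reversed(radices):
        digits.append(index % radix)
        index //= radix
    return tuple(reversed(digits))

radices = (2, 3, 4)  # divisibility chain 4 | 12 | 24
for i0, i1, i2 in product(range(2), range(3), range(4)):
    i = (i0 * 3 + i1) * 4 + i2      # flat index before the split
    assert i // 12 == i0            # FloorDiv(i, 12) becomes the outer axis
    assert (i // 4) % 3 == i1       # ModularIndexing(i, 4, 3) becomes the middle axis
    assert i % 4 == i2              # ModularIndexing(i, 1, 4) becomes the inner axis
    assert mixed_radix_digits(i, radices) == (i0, i1, i2)
```

After the split, every access that used a floor/mod of the flat index can use a plain per-axis index instead, which is what makes the resulting index affine.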

… floor/mod Insert a copy to relayout an operand whose floor/mod cannot be removed by axis-split: incompatible-radix shared-axis access and cross-axis multi-variable arguments. Enabled by default alongside axis-split.

Port the analysis and IR-mutation halves of the C++ test-tile-operation-graph pass to Python, wire build_tog into the gem5 path, and drop the C++ pass. Node-id counter is thread-local for concurrent compilation.
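
A minimal sketch of a thread-local id counter, assuming ids only need to be unique within one compilation thread. The names are hypothetical and not taken from build_tog.

```python
import threading

_state = threading.local()

def next_node_id():
    # Each thread sees its own next_id, so concurrent compiles never interleave ids.
    current = getattr(_state, "next_id", 0)
    _state.next_id = current + 1
    return current

def reset_node_ids():
    # Restart numbering for the calling thread only.
    _state.next_id = 0

results = {}

def compile_worker(name):
    results[name] = [next_node_id() for _ in range(3)]

threads = [threading.Thread(target=compile_worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results["a"], results["b"])  # [0, 1, 2] [0, 1, 2]
```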

…seek seed Add tests/ops/view/test_floormod_axis_split.py covering axis-split and graph-copy patterns. Seed the global RNG in the deepseek base test so config-random weights are deterministic.

…cal SRAM offset Rewrite the >4D peel to mirror the C++ -dma-fine-grained subtile loop: wrap the outer dims in an affine.for nest (marked inner_loop so build_tog/TOG registers the induction var) and emit one <=4D memref.dma_start per iteration. The slice SRAM offset is the lane-banked physical offset -- split-outer dims rescaled by the lane coeff (stride/old_size*new_size, the MVIN block_stride / buildSramAffineMap rule) -- delivered as the last SRAM index operand. The previous unrolled subview carried the offset in the subview, which extract_aligned_pointer_as_index strips in the gemmini lowering, so every slice aliased the same spad location (pixel_shuffle MISMATCH). The DRAM offset folds with the original index into one affine.apply so processDramIndices can walk the loop index (#258). Thread vectorlane (systolic-array size) through run_python_passes into the pass for the rescale's nr_outerloop. Drop the axis-split rank guard now that >4D is peeled correctly, and register tests/ops/view/test_floormod_axis_split.py in the CI allowlist. Validated end-to-end (Gem5+Spike+TOGSim): pixel_shuffle (>4D peel) and the full floor/mod suite pass; elementwise/gemm/conv2d/reduce/softmax/MLP regress clean.

Route every MVIN/MVOUT -- both the MLIRKernel load/store backend path and the template path (gemm/conv/bmm/maxpool/cat) -- through emit_transfer, so a single decompose-transfer pass lowers all DMAs to memref.dma_start. This drops the get_dma_code emitter, the _dma_needs_transfer instance flag, and format_dma_op_attributes. togsim.transfer now also carries subtile_size and async, which decompose propagates onto the lowered dma_start (subtile filtered to the kept axes when unit dims collapse). For <=4D tiles decompose emits the descriptor directly on the original SRAM buffer (no collapse_shape) so the C++ -dma-fine-grained subtile split, which walks the SRAM operand, sees a direct buffer as before. Validated end-to-end (Spike + TOGSim) on elementwise, gemm (matmul/addmm), bmm, conv2d, group_conv, pool, cat, reduce, softmax, layernorm, batchnorm. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reflect a6b7ebb: the find_split_plan rank guard is gone (>4D index now lowers through the decompose-transfer affine.for peel, pixel_shuffle end-to-end), and the decompose-transfer peel <-> TOG incompatibility is resolved. Move it from Known-issues to Done; drop the >4D rank-guard caveat and the high-rank next-step. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Port mlir/test/lib/Analysis/TestDmaFineGrained.cpp to a Python out-of-line pass (passes/dma_fine_grained.py): split the matmul MVIN DMAs (input/weight/bias) into subtile affine.for nests and fuse the input/weight nests, replacing the C++ -dma-fine-grained pass. The MLIR Python bindings expose no IRMapping, so the fused nest is built directly (each DMA emitted with the fused induction vars) instead of cloning bodies -- structurally equivalent, not byte-exact SSA text. Pipeline: the single mlir-opt invocation is split around the Python pass (loop-padding -> run_fine_grained in place -> pytorchsim-to-vcix) in both the functional and gem5 paths (extension_codecache); vectorlane (systolic-array size) is threaded in for the lane-banked SRAM offset rescale. Validated against mlir-opt -dma-fine-grained on rank 2/3/4 fixtures (matmul / bmm / conv: same vcix dma_start and line counts) and end-to-end (Gem5+Spike+TOGSim): gemm/bmm/conv2d plus the resnet/transformer/vit/mlp models pass. Docs: dma-transfer-lowering.md -- >4D peel is affine.for + lane-banked physical SRAM offset via the last index operand; dma_fine_grained / build_tog are now Python passes; the #258 appendix is marked resolved.

…ings) Port mlir/test/lib/Conversion/PyTorchSimToVCIX/TestPyTorchSimToVCIXConversion.cpp to a Python out-of-line pass (passes/lower_to_vcix.py): lower linalg.matmul (gemm and conv2d) and the transcendental math ops (exp/erf/tanh/sin/cos) to VCIX dialect ops (RISC-V vector custom instructions), replacing the C++ -test-pytorchsim-to-vcix. The C++ pass is a dialect conversion (applyPartialConversion); the bindings expose no conversion framework, so each matchAndRewrite is reimplemented as imperative IR rewriting. The VCIX dialect is not in the Python bindings, so vcix ops are created as unregistered generic ops -- mlir-opt / mlir-translate (vcix registered) re-parse the {}-attr generic form fine, and run_standard_lowering already consumes vcix output via allow_unregistered_dialects, so this matches the existing pipeline. Pipeline: the vcix mlir-opt invocation is dropped; run_to_vcix runs in-process after the Python fine-grained pass and before the standard lowering (both functional and gem5 paths in extension_codecache). mlir-opt now runs only -test-loop-padding. Validated structurally against mlir-opt -test-pytorchsim-to-vcix (non-constant ops byte-identical including the dma_wait tag maps, on gemm and conv2d fixtures) and numerically end-to-end (Gem5+Spike+TOGSim allclose): gemm/bmm/conv2d (incl. large N/K), softmax, exp/erf/sin/cos, and the resnet18/vit/transformer/mlp models.

dma-fine-grained and pytorchsim-to-vcix are now Python passes (dma_fine_grained, lower_to_vcix); update the docstring listing -- only test-loop-padding still runs in mlir-opt.

axis-split + graph-copy (on by default) linearize aligned floor/mod at the scheduling layer, so the index reaching get_dma_info is affine and the FloorDiv/ModularIndexing tile-divisibility branches there are never entered (measured: 0 entries across elementwise, gemm, bmm, conv, cat, floor/mod, reduce, attention).

Remove those dead branches and their orphans:
- the FloorDiv and ModularIndexing tile-forcing + RecompileSignal blocks
- the implicit-ModularIndexing index rewrite and implicit_local_dims
- the dead ModularIndexing branch in the dram_stride computation
- is_modular_indexing, the write-only implicit_dim_size, unused import sys

Kept: the non-floor/mod recompile paths (index-divisibility, indirect access, non-power-of-2 vec size), RecompileSignal, and the retry loop. The upstream implicit_dim_ops tile-forcing is left untouched (separate change).

Validated end-to-end (Spike + TOGSim): elementwise, gemm, bmm, conv2d, group_conv, pool, cat, floor/mod suite, reduce, softmax, layernorm, batchnorm, gqa -- all pass, 0 recompiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-split) implicit_dim_ops/extract_dividers/apply_constraints forced the initial tile size to match a view's floor/mod divider, up front in compute_tile_size. axis-split now linearizes those views at the scheduling layer, so the forcing is redundant: disabling it leaves every test allclose-correct and, on the affected kernels, slightly faster (the forced tile was over-constrained -- batchnorm 1189->1114, layernorm 4092->3947 cycles; non-floor/mod kernels unchanged). Remove the machinery and its now-unused imports (ModularIndexing, FloorDiv, Mod, MemoryDep, StarDep, WeakDep). Validated end-to-end (Spike + TOGSim): elementwise, gemm, bmm, conv2d, group_conv, pool, cat, floor/mod suite, reduce, softmax, layernorm, batchnorm, gqa -- all pass, 0 recompiles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add the TPU layout-assignment & padding investigation (docs/tpu_layout_padding_report.md) and the loop-padding design doc. Settled model: padding is two layers -- (A) lane/sublane 8x128 alignment is materialized (footprint + DMA traffic), (B) the compute-block (MXU tile) boundary tail is masked (compute-utilization only, not traffic). test-loop-padding's post-codegen heuristic is to be replaced by informed emission at the scheduling/codegen layer (decide early, materialize late); the two costs must be modeled by separate functions (do not double-count the compute-block tail as traffic).
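
The "two separate functions" rule can be illustrated as follows. Only the 8x128 lane/sublane alignment comes from the text above; the 128x128 compute-block size, the mapping of rows to sublanes and columns to lanes, and the function names are assumptions.

```python
def round_up(value, multiple):
    return -(-value // multiple) * multiple

def materialized_elements(rows, cols, sublane=8, lane=128):
    """Layer A: elements that occupy memory and move over DMA after alignment."""
    return round_up(rows, sublane) * round_up(cols, lane)

def compute_utilization(rows, cols, block_rows=128, block_cols=128):
    """Layer B: useful fraction of compute-block work; the tail is masked, not moved."""
    issued = round_up(rows, block_rows) * round_up(cols, block_cols)
    return (rows * cols) / issued

# A 100x200 tile: alignment pads to 104x256 for footprint and traffic,
# while the compute-block tail only lowers utilization.
print(materialized_elements(100, 200))  # 26624
print(compute_utilization(100, 200))    # 0.6103515625
```

Keeping the two costs in separate functions avoids counting the masked compute-block tail as DMA traffic.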

…ases

Three fixes from the max-effort review of this branch:
- get_dma_info: after retiring the floor/mod recompile branches, a residual floor/mod (store-side ModularIndexing, reduction-axis floor/mod, incompatible radix) that axis-split/graph-copy did not linearize was silently bucketed by its base symbol in the dram_stride loop, emitting a wrong DRAM descriptor. Raise NotImplementedError instead of mis-striding silently. No test triggers it (0 floor/mod reach get_dma_info in the suite) -- it is a safety net.
- decompose_transfer collapse fast path: keep=[g[-1]] picked the last dim of each reassociation group, which is a unit dim when trailing unit dims attach after the non-unit one (e.g. [..,4,1,1]); strides/subtile were read from the wrong axis. Pick the non-unit dim in each group.
- decompose_transfer >4D peel: new_vlane fell back to 0 whenever the vlane split axis was not among the inner 4 dims, conflating peeled-into-the-outer-loop (genuinely unrepresentable -> raise) with a unit lane axis (default 0 is fine).

Validated: elementwise, gemm, conv2d, cat, floor/mod suite (incl. pixel_shuffle >4D peel), softmax, layernorm, batchnorm -- all pass, no spurious raise.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
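
The collapse fast-path fix can be sketched on plain lists. The function names are hypothetical, and the sketch assumes each reassociation group holds at most one non-unit dim, as in a unit-dim collapse.

```python
def kept_dims_buggy(shape, groups):
    # Always takes the last dim of each group, even when it is a unit dim.
    return [g[-1] for g in groups]

def kept_dims_fixed(shape, groups):
    kept = []
    for g in groups:
        non_unit = [d for d in g if shape[d] != 1]
        # Fall back to the last dim only when the whole group is unit-sized.
        kept.append(non_unit[0] if non_unit else g[-1])
    return kept

shape = (2, 4, 1, 1)
groups = [[0], [1, 2, 3]]  # trailing unit dims attach to the extent-4 dim
print(kept_dims_buggy(shape, groups))  # [0, 3]
print(kept_dims_fixed(shape, groups))  # [0, 1]
```

With the buggy selection, strides and subtile sizes would be read from axis 3 (extent 1) rather than axis 1 (extent 4).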

More fixes from the max-effort review, after verifying each against the C++ reference and reachability:
- graph_copy _relayout_args: ranges picked the consumer iteration shape by rank alone (max key=len), so for two equal-rank operands with different per-dim extents the broadcast-from operand's smaller shape could win and the real incompatible-radix conflict on the broadcast-to dim was missed (order-dependent: a commutative reorder flipped correct relayout into a silent miss). Use per-dim max extent over the max-rank operands.
- lower_to_vcix _sew/_legalize_vector_type: mirror the C++ legalizeVectorType -- F16/BF16 return sew 0 (transcendentals stay unlowered for -convert-math-to-llvm, as in the validated path) instead of being lowered to VCIX, and add the missing rank != 1 guard.
- lower_to_vcix matmul: port the C++ guards as loud failures -- M/N/K must be a multiple of the systolic size when > SS (else the N//SS / K//SS loops drop the tail tile), and A vs B must agree on the K subtile (last-writer-wins would pick one silently). Latent today (heuristic/autotune only emit SS-multiple tiles).
- Doc-only: graph-copy is default-on (TORCHSIM_GRAPH_COPY=0 to disable); fixed the two stale 'no-op unless set' comments.

Validated: elementwise, gemm, bmm, conv2d, group_conv, cat, floor/mod suite, softmax, layernorm -- all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
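
The order dependence in the first item follows from how Python's `max` breaks ties: with `key=len`, the first operand of maximal rank wins. A small sketch with hypothetical function names and shapes:

```python
def ranges_by_rank_only(shapes):
    # Ties on rank are broken by position, so operand order changes the result.
    return max(shapes, key=len)

def ranges_per_dim_max(shapes):
    # Take the per-dim maximum extent over all operands of maximal rank.
    max_rank = max(len(s) for s in shapes)
    candidates = [s for s in shapes if len(s) == max_rank]
    return tuple(max(extents) for extents in zip(*candidates))

broadcast_from = (1, 8)
broadcast_to = (4, 8)
print(ranges_by_rank_only([broadcast_from, broadcast_to]))  # (1, 8)
print(ranges_by_rank_only([broadcast_to, broadcast_from]))  # (4, 8)
print(ranges_per_dim_max([broadcast_from, broadcast_to]))   # (4, 8)
```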

…fix exp chunk

Two fixes to the C++->Python vcix port (lower_to_vcix.py) that SDPA exercises but the gemm/bmm/conv tests do not:
- _lower_matmul bailed with 'if ATag is None or BTag is None: return False', gating on an MVIN dma_start tag for both operands. In SDPA's fused scores.V matmul, operand B is the softmax output produced in place by affine.vector_store, not DMAed, so BTag stayed None and the matmul was left un-lowered -> wrong attention output. Mirror the C++ MatmulOpLowering: an operand is initialized by either a dma_start OR a preceding affine.vector_store into its root memref; bail only when an operand is truly uninitialized. BTag/BAsync stay None/0 and are only read under 'if BAsync:', so the B dma_wait is correctly skipped (as in C++).
- _make_sf_vc_v_iv n>1 transcendental chunking called vector.ExtractStridedSliceOp(offsets, sizes, strides, vec) -- wrong arg order, missing the result type and vector operand, raising TypeError under these MLIR bindings. Pass (result=legal_ty, vector=vec, offsets, sizes, strides). Only reached by large transcendentals (n>1), e.g. SDPA softmax exp, so CI's small-tile (n==1) tests never hit it.

Validated end-to-end (Spike+TOGSim allclose): SDPA 56 cases pass (was crash/wrong); matmul/bmm/conv2d regress clean. Bisected: C++ vcix passes SDPA, Python vcix did not; exp chunking and fine-grained ruled out separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…bug env vars

axis-split and graph-copy are the floor/mod handling path and were default-on but still gated behind TORCHSIM_AXIS_SPLIT / TORCHSIM_GRAPH_COPY. Remove the gates so they run unconditionally, and delete the env vars that were only introduced for validation/debug during development:

    TORCHSIM_AXIS_SPLIT, TORCHSIM_GRAPH_COPY       - default-on toggles
    TORCHSIM_AXIS_SPLIT_FORCE                      - force-split validation aid
    TORCHSIM_AXIS_LEDGER + axis_split.ledger()     - coverage measurement
    TORCHSIM_DEBUG_AXIS_SPLIT + _dump_axis()       - debug dump
    TORCHSIM_GRAPH_COPY_DEBUG                      - graph-copy debug prints
    TORCHSIM_RECOMPILE_LOG                         - vestigial recompile log

Also drop the now-dead ledger() function, the _dump_axis() helper, and the unused os import in graph_copy.py. The floor/mod regression test no longer sets the removed env vars. Behavior is unchanged (the toggles were already on).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dma_fine_grained and lower_to_vcix already exposed run(module, **opts) like the registered decompose_transfer/lower_vlane_idx, but were called file-based and directly from extension_codecache, so the .mlir was parsed+printed twice (once per pass) between loop-padding and the standard lowering, and the pipeline was hardcoded+duplicated across the functional and gem5 paths. Give both passes MARKERS and group the four rewrite passes into PRE_OPT_PASSES / POST_OPT_PASSES around the one remaining mlir-opt pass (-test-loop-padding). A single driver run_module_passes(in, out, passes, **opts) parses once, runs each marker-matched pass on the shared Module in order, prints once (copies through when no marker matches). run_python_passes is now PRE_OPT via that driver; the functional/gem5 fine-grained+vcix calls each become one run_module_passes. run_fine_grained / run_to_vcix stay re-exported for standalone/CLI use. Validated (Spike+TOGSim): elementwise, gemm, conv2d, softmax, floor/mod suite, SDPA -- all pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
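
The parse-once / run-in-order / print-once shape of the driver can be sketched with a toy module representation. Everything here is illustrative: a list of text lines stands in for the MLIR Module, the driver takes and returns text instead of the input/output paths of the real run_module_passes(in, out, passes, **opts), and matching markers against the input text is an assumption.

```python
class ToyPass:
    """Stand-in for a pass module exposing MARKERS and run(module, **opts)."""

    def __init__(self, name, markers, rewrite):
        self.name = name
        self.MARKERS = markers
        self._rewrite = rewrite

    def run(self, module, **opts):
        self._rewrite(module, **opts)

def run_module_passes(text, passes, **opts):
    matched = [p for p in passes if any(m in text for m in p.MARKERS)]
    if not matched:
        return text                  # no marker matched: copy through unchanged
    module = text.splitlines()       # "parse" once
    for p in matched:
        p.run(module, **opts)        # every pass mutates the same shared module
    return "\n".join(module)         # "print" once

def lower_transfer(module, **opts):
    module[:] = [line.replace("togsim.transfer", "memref.dma_start") for line in module]

passes = [ToyPass("decompose_transfer", ("togsim.transfer",), lower_transfer)]
print(run_module_passes("func @k\ntogsim.transfer %a", passes))
# func @k
# memref.dma_start %a
```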

Guards the Spike VI_VV_EXT (vsext/vzext) lane bug where widening int conversions (int8/int16 -> wider via tensor.to) returned scrambled/zero output. Signed and uint8<128 only so it is independent of the separate uint8->int8 dtype issue (#238). Requires the riscv-isa-sim fix (PSAL-POSTECH/riscv-isa-sim#4) in the pinned Spike; add to the CI allowlist + bump thirdparty/github-releases.json spike tag once that release is cut. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
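
One way to see why restricting uint8 inputs to values below 128 keeps the test independent of signedness: in that range sign extension and zero extension give the same result. The helpers below are hypothetical reference implementations of the two widening semantics, not Spike code.

```python
def sign_extend(value, from_bits):
    # Interpret the low from_bits of value as a two's-complement integer.
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    return (value ^ sign_bit) - sign_bit

def zero_extend(value, from_bits):
    # Interpret the low from_bits of value as an unsigned integer.
    return value & ((1 << from_bits) - 1)

# Below 128 the top bit is clear, so both widenings agree.
assert all(sign_extend(v, 8) == zero_extend(v, 8) == v for v in range(128))

# From 128 upward they diverge, which is where the signedness of the source matters.
print(sign_extend(200, 8), zero_extend(200, 8))  # -56 200
```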

v1.0.2 includes the riscv-isa-sim VI_VV_EXT fix (PSAL-POSTECH/riscv-isa-sim#4) so integer widening conversions no longer scramble. Bumping the pin changes the thirdparty base-image hash, so ensure-base rebuilds the base with the new Spike. Wire test_widen_dtype.py into the allowlist now that the fixed Spike is pinned. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The wrapper2 test job passed vpu_num_lanes and vpu_spad_size_kb_per_lane as docker env vars, but since the unified-config refactor (9db0f2c) the frontend reads these values only from the TOGSim YAML, never from the environment. The env vars were therefore silently ignored and wrapper2 ran with the default 128x128 config, making it an exact duplicate of wrapper1.

Switch the reusable workflow to select a config YAML via TOGSIM_CONFIG so that a single file drives both codegen and the TOGSim cycle model, in line with the unified-config design. vpu_num_lanes also sets the systolic array dimension, so each config exercises a different array size:
- configs: add systolic_ws_32x32_c1_simple_noc_tpuv3.yml (32x32 array, 32 KB/lane SPAD) and systolic_ws_8x8_c1_simple_noc_tpuv3.yml (8x8 array, 32 KB/lane SPAD); both identical to the tpuv3 default otherwise
- pytorchsim_test.yml: replace vector_lane/spad_size inputs with togsim_config (string) and run_accuracy (bool); every job now sets -e TOGSIM_CONFIG instead of the dead vpu_* env vars
- docker-image.yml: wrapper1 -> 128x128 config + run_accuracy true, wrapper2 -> 32x32 config, wrapper3 -> 8x8 config (run_accuracy false)

test_scheduler compiles resnet18 and EncoderBlock and launches each model twice, so the per-kernel RISC-V ELFs, objdump disassembly dumps and gem5 m5out directories accumulate within a single run. On the small github-hosted root volume (~14G) this overflows during the RISC-V final link step (ld: final link failed: No space left on device). Free the preinstalled tool caches before the run and redirect the PyTorchSim outputs/ and /tmp artifacts onto the larger /mnt scratch disk (~70G) so the accumulated artifacts no longer fill the root volume. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

[CI] Run test_scheduler on github-hosted with extra disk

YWHyuk and others added 30 commits May 22, 2026 14:08

Merge pull request #227 from PSAL-POSTECH/feature/build-pins-and-op-c…

a31c756


[Frontend] build_tog: port test-tile-operation-graph to Python

eb960c1


[Test] floor/mod axis-split + graph-copy coverage; deterministic deep…

8fa923c


[Docs] lower_to_llvm: only test-loop-padding remains in mlir-opt

b9996df


YWHyuk and others added 12 commits June 19, 2026 13:16

Merge pull request #279 from PSAL-POSTECH/feature/ci-scheduler-disk

13c93a7



Develop #274

YWHyuk wants to merge 42 commits into master from develop

YWHyuk commented Jun 25, 2026


2 participants
