Add loops for ATen compiler MHA asm gen to reduce instruction count by booth-algo · Pull Request #45 · AICrossSim/PLENA_Compiler

booth-algo · 2026-05-18T22:56:12Z

Summary

Emits the ATen per-head MHA attention helpers as hardware loops instead of Python-unrolled instruction sequences, which reduces instruction memory pressure. The old Python-unrolled path is preserved.

Changed helpers:

_online_softmax_asm
_scale_o_asm
_final_scaling_asm
_pv_multiply_asm
_reset_fpsram_asm
_reset_vram_asm

Default emission now uses hardware loops. The existing public unroll path is preserved: ATEN_UNROLL=1 / unroll_loops=True now also sets unroll_attention=True, so the attention helpers use the Python-unrolled path. Tests and harnesses can still override prog.unroll_attention directly after construction for A/B comparisons.
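A minimal sketch of the two emission paths and the flag mapping described above, for illustration only. `C_LOOP_START` is the only mnemonic taken from this PR (it appears in the counts table below). `C_LOOP_END`, `V_EXP`, `V_ADD`, the operand syntax, the `a0` register, and the `emit_rows` / `resolve_unroll_flags` helpers are hypothetical and do not reflect the compiler's actual code. Treating an unset `ATEN_UNROLL` as `0` is also an assumption, chosen to match the stated default of looped emission.

```python
import os


def resolve_unroll_flags(env=None):
    """Map ATEN_UNROLL to the two compiler flags, as the PR describes."""
    env = os.environ if env is None else env
    unroll = env.get("ATEN_UNROLL", "0") == "1"  # unset -> looped (assumption)
    # Per the PR: unroll_loops=True now also implies unroll_attention=True.
    return {"unroll_loops": unroll, "unroll_attention": unroll}


def emit_rows(body, n_rows, unroll_attention):
    """Emit `body` for n_rows rows, unrolled in Python or as one hardware loop."""
    if unroll_attention:
        # Python-unrolled path: the body is repeated once per row.
        return [line.format(row=r) for r in range(n_rows) for line in body]
    # Looped path: the body appears once; "a0" is a made-up address register
    # standing in for whatever operand the loop would advance per iteration.
    looped = [line.format(row="a0") for line in body]
    return [f"C_LOOP_START {n_rows}"] + looped + ["C_LOOP_END"]


body = ["V_EXP v1, {row}", "V_ADD v2, {row}"]  # hypothetical instructions
unrolled = emit_rows(body, 8, resolve_unroll_flags({"ATEN_UNROLL": "1"})["unroll_attention"])
looped = emit_rows(body, 8, resolve_unroll_flags({"ATEN_UNROLL": "0"})["unroll_attention"])
print(len(unrolled), len(looped))  # prints: 16 4
```

The static line count of the looped form stays constant as the row count grows, while the unrolled form grows linearly. The loop-control lines are the extra cost that shows up in the dynamic instruction count.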

CLM-60M Native Layer 0 Counts

Rerun locally from the simulator workspace with compile_hf_model(model, seq_len=64, hidden_size=None, inter_dim=None, num_layers=1). Native dims: hidden=384, inter=1408, heads=6, kv_heads=2, head_dim=64.

Metric	Previous	Current	Change
Total ASM source lines	35,479	15,367	-20,112 (-56.7%)
Actual static instruction lines	34,041	14,403	-19,638 (-57.7%, 2.36x smaller)
Comment / metadata lines	1,438	964	-474 (-33.0%)
Loop-expanded dynamic instructions	645,334	649,762	+4,428 (+0.69%)
Estimated cycles	8,915,366	8,919,794	+4,428 (+0.05%)
Estimated ms @ 1GHz	8.915366	8.919794	+0.004428 (+0.05%)
C_LOOP_START static lines	248	296	+48
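The Change column can be rechecked from the Previous and Current columns. The snippet below is a small sanity check over the figures in the table; the `change` helper is written for this note and is not part of the repository.

```python
def change(previous, current):
    """Return (absolute delta, percent change relative to previous)."""
    delta = current - previous
    return delta, 100.0 * delta / previous


rows = {
    "Total ASM source lines": (35_479, 15_367),
    "Actual static instruction lines": (34_041, 14_403),
    "Comment / metadata lines": (1_438, 964),
    "Loop-expanded dynamic instructions": (645_334, 649_762),
    "Estimated cycles": (8_915_366, 8_919_794),
}

for name, (prev, cur) in rows.items():
    delta, pct = change(prev, cur)
    print(f"{name}: {delta:+,} ({pct:+.2f}%)")

# Static instruction lines plus comment lines should equal the total source lines.
assert 34_041 + 1_438 == 35_479 and 14_403 + 964 == 15_367

# Size ratio quoted in the table as "2.36x smaller".
print(f"{34_041 / 14_403:.2f}x")
```

The dynamic instruction count and the estimated cycle count rise by the same 4,428, which suggests that each added loop-control instruction is costed at one cycle in the estimate. That is an inference from the table, not something the PR states.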

Verification

Companion simulator branch asm-count-verification adds the harnesses and report. Results from that branch:

ATen MHA seq=64, head_dim=64: static instructions drop from 2,960 to 128; estimated cycles increase by about 3%.
ATen MHA seq=64, head_dim=128: static instructions drop from 4,399 to 169; estimated cycles increase by about 3%.
Transactional emulator golden check for looped ATen MHA passed against PyTorch SDPA with 100% allclose pass rate under repo thresholds.
Additional flag check: ATEN_UNROLL=1 constructs PlenaCompiler with unroll_loops=True and unroll_attention=True; ATEN_UNROLL=0 leaves both false.
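For readers unfamiliar with an "allclose pass rate", here is one plausible reading, sketched in plain Python: the fraction of output elements that satisfy the elementwise criterion used by `torch.allclose`, namely |actual - expected| <= atol + rtol * |expected|. The tolerances and sample values below are placeholders; the repo thresholds are not stated in this PR, and the `allclose_pass_rate` helper is hypothetical rather than the harness's actual code.

```python
def allclose_pass_rate(actual, expected, rtol=1e-2, atol=1e-2):
    """Fraction of elements with |a - e| <= atol + rtol * |e|."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    passed = sum(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))
    return passed / len(expected)


golden = [0.10, -0.25, 0.50, 1.00]        # stand-in for the PyTorch SDPA reference
emulated = [0.101, -0.252, 0.498, 1.004]  # stand-in for the emulator output
rate = allclose_pass_rate(emulated, golden)
print(f"{rate:.0%}")  # prints: 100%
```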

booth-algo force-pushed the feat/codegen-addr-reg-init branch from b0c6806 to 55caddf on May 18, 2026 23:03

Loop ATen MHA attention helpers

6b8ae98

booth-algo force-pushed the feat/codegen-addr-reg-init branch from 55caddf to 6b8ae98 on May 18, 2026 23:11

booth-algo marked this pull request as ready for review May 18, 2026 23:13

booth-algo changed the title ~~NOT READY FOR MERGE: loop ATen MHA attention helpers~~ Add loops for ATen compiler MHA asm gen to reduce instruction count May 18, 2026

booth-algo merged commit d2817df into main May 18, 2026
3 checks passed

booth-algo deleted the feat/codegen-addr-reg-init branch May 18, 2026 23:14

booth-algo merged 1 commit into main from feat/codegen-addr-reg-init
