[WIP] Refactor: migrate A5 examples and tests to SceneTestCase format #577
doraemonmj wants to merge 3 commits into hw-native-sys:main from
Conversation
Code Review
This pull request introduces production-scale paged attention support for the A5 platform, refactoring kernels to use bfloat16 and implementing runtime dispatch for various tile configurations. It also adds a comprehensive suite of SPMD and mixed-core execution tests. Feedback highlights a critical data race in the orchestration logic due to improper scope guard usage, potential compilation failures on x86 simulation environments from ARM-specific assembly, and a regression in Grouped Query Attention (GQA) support. Additionally, improvements were suggested regarding test reproducibility through manual seeding and more accurate profiling by reading system counter frequency at runtime.
CYCLE_COUNT_LAP(prof_submit_task);

for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
    PTO2_SCOPE_GUARD();
Using PTO2_SCOPE_GUARD() inside the bn loop will cause intermediate tensors (like sij, pij_f16, and oi_tmp) to be allocated at the same UB offsets in every iteration. Since the runtime is asynchronous and there are no explicit dependencies between tasks of different blocks (other than the UP chain), QK[bn+1] could overwrite the sij buffer while SF[bn] is still reading from it. This data race will lead to incorrect results. The scope guard should be moved outside the loop, or explicit dependencies must be added to ensure safe UB reuse.
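The lifetime hazard can be illustrated with a toy bump allocator; the `ScopeGuard` below is a hypothetical stand-in for `PTO2_SCOPE_GUARD()`, not the runtime's actual implementation:

```cpp
#include <cstdint>

// Toy stand-in for the UB bump allocator; not the PTO2 runtime's real API.
struct BumpAllocator {
    uint64_t offset = 0;
    uint64_t alloc(uint64_t bytes) {
        uint64_t at = offset;
        offset += bytes;
        return at;
    }
};

// Hypothetical stand-in for PTO2_SCOPE_GUARD(): saves the allocator offset
// on entry and rewinds on exit, so a scope's allocations are recycled as
// soon as the scope closes.
struct ScopeGuard {
    BumpAllocator& ub;
    uint64_t saved;
    explicit ScopeGuard(BumpAllocator& a) : ub(a), saved(a.offset) {}
    ~ScopeGuard() { ub.offset = saved; }
};

// Guard inside the loop: every iteration's "sij" lands at the same offset,
// which is unsafe if iteration bn+1's writes can overlap in time with
// iteration bn's in-flight reads.
uint64_t sij_offset_guard_inside(BumpAllocator& ub, uint64_t bn_count) {
    uint64_t last = 0;
    for (uint64_t bn = 0; bn < bn_count; bn++) {
        ScopeGuard guard(ub);   // rewinds at the end of each iteration
        last = ub.alloc(4096);  // "sij" buffer
    }
    return last;
}

// Guard outside the loop: each iteration gets a distinct offset, so no
// iteration can stomp on a buffer another task is still reading.
uint64_t sij_offset_guard_outside(BumpAllocator& ub, uint64_t bn_count) {
    ScopeGuard guard(ub);       // rewinds once, after all iterations
    uint64_t last = 0;
    for (uint64_t bn = 0; bn < bn_count; bn++) {
        last = ub.alloc(4096);
    }
    return last;
}
```

With the guard inside the loop, the returned "sij" offset is 0 on every iteration; with it outside, each iteration receives a fresh 4 KiB region.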
inline uint64_t get_sys_cnt_aicpu() {
    uint64_t ticks;
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
}
The get_sys_cnt_aicpu function uses ARM-specific inline assembly (cntvct_el0). This will cause compilation failures when building for x86-based simulation environments (a5sim), which are explicitly mentioned as supported platforms in the PR description. This code should be wrapped in architecture-specific macros (e.g., #ifdef __aarch64__) with a portable fallback for simulation.
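One possible shape for that guard, sketched with a std::chrono fallback for non-aarch64 builds (note the fallback's tick unit differs from the ARM generic timer, so it is only suitable for simulation):

```cpp
#include <cstdint>
#if !defined(__aarch64__)
#include <chrono>
#endif

inline uint64_t get_sys_cnt_aicpu() {
#if defined(__aarch64__)
    // Hardware virtual counter, aarch64 only.
    uint64_t ticks;
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
#else
    // Portable fallback for x86 simulation builds (a5sim): monotonic clock
    // ticks. The unit differs from the ARM generic timer, so any cycle math
    // must account for the different frequency.
    return static_cast<uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```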
uint32_t kv_offsets[2] = {static_cast<uint32_t>(cur_block_idx * block_size), 0};
Tensor kj = key_cache.view(kv_shapes, kv_offsets);
Tensor vj = value_cache.view(kv_shapes, kv_offsets);
The logic for viewing key_cache and value_cache ignores the kv_head_idx, effectively regressing GQA support compared to the previous host_build_graph implementation. By flattening the cache to 2D {total_tokens, head_dim} at line 105 and using offsets based only on cur_block_idx, the code incorrectly assumes kv_head_num == 1 or that heads are not interleaved. This will produce incorrect results for models using Grouped Query Attention with multiple KV heads.
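As a sketch only (the actual cache layout depends on the PR's tensor format), an offset computation that folds in `kv_head_idx` could look like the following, assuming blocks are stored as [num_blocks, kv_head_num, block_size, head_dim] and flattened to the 2D {total_tokens, head_dim} view; the function name and layout are illustrative, not the PR's API:

```cpp
#include <cstdint>

// Hypothetical row offset into a 2D {total_tokens, head_dim} cache view,
// assuming a [num_blocks, kv_head_num, block_size, head_dim] layout.
// Reduces to cur_block_idx * block_size when kv_head_num == 1, which is
// exactly what the existing (GQA-unaware) code computes.
inline uint32_t kv_row_offset(uint32_t cur_block_idx, uint32_t kv_head_idx,
                              uint32_t kv_head_num, uint32_t block_size) {
    return (cur_block_idx * kv_head_num + kv_head_idx) * block_size;
}
```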
A = torch.randn(BATCH, GRID_M, GRID_K, TILE_M, TILE_K, dtype=torch.float32) * 0.01
B = torch.randn(BATCH, GRID_K, GRID_N, TILE_K, TILE_N, dtype=torch.float32) * 0.01
The input tensors are generated using torch.randn without setting a manual seed. This can lead to non-deterministic test results and potential flakiness if the comparison with golden results is sensitive to specific random values. It is recommended to set a seed for reproducibility, as seen in other test files in this PR.
Suggested change:
-A = torch.randn(BATCH, GRID_M, GRID_K, TILE_M, TILE_K, dtype=torch.float32) * 0.01
-B = torch.randn(BATCH, GRID_K, GRID_N, TILE_K, TILE_N, dtype=torch.float32) * 0.01
+torch.manual_seed(42)
+A = torch.randn(BATCH, GRID_M, GRID_K, TILE_M, TILE_K, dtype=torch.float32) * 0.01
+B = torch.randn(BATCH, GRID_K, GRID_N, TILE_K, TILE_N, dtype=torch.float32) * 0.01
References
- For trivial, non-tunable scaling factors used in test input generation, using a literal directly is acceptable as it may not warrant a named constant.
#define FUNC_SOFTMAX_PREPARE 1
#define FUNC_PV_MATMUL 2
#define FUNC_ONLINE_UPDATE 3
constexpr uint64_t PLATFORM_PROF_SYS_CNT_FREQ = 50000000; // 50 MHz
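The review summary's suggestion to read the system counter frequency at runtime rather than hardcoding 50 MHz could take this shape on aarch64, where `cntfrq_el0` is the architectural frequency register; the constant remains as a fallback for simulation builds (function name is illustrative):

```cpp
#include <cstdint>

constexpr uint64_t PLATFORM_PROF_SYS_CNT_FREQ = 50000000;  // 50 MHz fallback

// Sketch: query the generic timer frequency at runtime instead of trusting
// a hardcoded 50 MHz, which may not match every SoC configuration.
inline uint64_t get_sys_cnt_freq() {
#if defined(__aarch64__)
    uint64_t freq;
    asm volatile("mrs %0, cntfrq_el0" : "=r"(freq));
    return freq;
#else
    // x86 simulation builds have no cntfrq_el0; keep the documented default.
    return PLATFORM_PROF_SYS_CNT_FREQ;
#endif
}
```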
- Replace golden.py + kernel_config.py with unified test_*.py files
using @scene_test decorator and SceneTestCase base class
- Covers examples/a5/{host_build_graph,tensormap_and_ringbuffer} (14 examples)
and tests/st/a5/{host_build_graph,tensormap_and_ringbuffer} (3 tests)
- Add a5sim to platforms for all cases that support simulation
- Cross-directory kernel references use relative paths (../)
…d attention
- Move spmd_*, mixed_example from examples/tmr/ to tests/st/tmr/
- Remove duplicate HBG paged_attention from examples/ (already in tests/st/)
- Remove old TMR paged_attention from tests/st/ (kept in examples/ as evolving reference)
- Upgrade TMR paged_attention: fp16 -> bfloat16, multi-tile dispatch (16x128, 64x64), production-scale cases (batch=256, head_dim=128/256), tighter tolerances (1e-3)
- Add small-tile (16,16,16) dispatch path to HBG paged_attention kernels with SmallCase1/SmallCase2 sim-compatible test cases
… migration process
- During the previous use-case migration, some kernels were left without function-name definitions.
- This commit fills in the missing names in the aic and aiv modules of test_*.py to keep the code complete and consistent.
Full case table
- tests/st/a5/host_build_graph/dump_tensor/ (1 case)
- tests/st/a5/host_build_graph/paged_attention/ (2 cases)
- examples/a5/host_build_graph/paged_attention/ (2 cases)
- tests/st/a5/tensormap_and_ringbuffer/explicit_fatal/ (1 case)
- tests/st/a5/tensormap_and_ringbuffer/paged_attention/ (3 cases)
- tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/ (3 cases)
- examples/a5/tensormap_and_ringbuffer/bgemm/ (1 case)
- examples/a5/tensormap_and_ringbuffer/mixed_example/ (2 cases)
- examples/a5/tensormap_and_ringbuffer/paged_attention/ (4 cases)
- examples/a5/tensormap_and_ringbuffer/spmd_basic/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_multiblock_aiv/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_multiblock_mix/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_starvation/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_sync_start/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_sync_start_aiv/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_sync_start_edge/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/ (1 case)