[WIP] Refactor: migrate A5 examples and tests to SceneTestCase format #577

Open
doraemonmj wants to merge 3 commits into hw-native-sys:main from doraemonmj:pytest

Conversation

@doraemonmj
Contributor

  • Replace golden.py + kernel_config.py with unified test_*.py files
    using @scene_test decorator and SceneTestCase base class
  • Covers examples/a5/{host_build_graph,tensormap_and_ringbuffer} (14 examples)
    and tests/st/a5/{host_build_graph,tensormap_and_ringbuffer} (3 tests)
  • Add a5sim to platforms for all cases that support simulation
  • Cross-directory kernel references use relative paths (../)

Full Case Table

| # | runtime | case name | Case | location | sim | onboard | dtype | tolerance (R/A) | block_dim | thread_num | migration status |
|---|---------|-----------|------|----------|-----|---------|-------|-----------------|-----------|------------|------------------|
| 1 | host | dump_tensor | default | tests/st/a5/host_build_graph/dump_tensor/ | Y | Y | fp32 | N/A | 3 | 3 | no change needed |
| 2 | host | paged_attention (st) | Case1 | tests/st/a5/host_build_graph/paged_attention/ | Y | | bf16 | 1e-3/1e-3 | 24 | 3 | modified; no migration needed |
| 3 | host | paged_attention (st) | Case2 | tests/st/a5/host_build_graph/paged_attention/ | Y | | bf16 | 1e-3/1e-3 | 24 | 3 | modified; no migration needed |
| 4 | host | paged_attention (example) | Case1 | examples/a5/host_build_graph/paged_attention/ | Y | | fp16 | 1e-2/1e-2 | 3 | 3 | merged |
| 5 | host | paged_attention (example) | Case2 | examples/a5/host_build_graph/paged_attention/ | Y | | fp16 | 1e-2/1e-2 | 3 | 3 | merged |
| 6 | tmrb | explicit_fatal (st) | default | tests/st/a5/tensormap_and_ringbuffer/explicit_fatal/ | Y | | N/A | N/A | 24 | 4 | no change needed |
| 7 | tmrb | paged_attention (st) | Case1 | tests/st/a5/tensormap_and_ringbuffer/paged_attention/ | Y | | bf16 | 1e-3/1e-3 | 24 | 4 | modified and migrated |
| 8 | tmrb | paged_attention (st) | Case2 | tests/st/a5/tensormap_and_ringbuffer/paged_attention/ | Y | | bf16 | 1e-3/1e-3 | 24 | 4 | modified and migrated |
| 9 | tmrb | paged_attention (st) | Case3 | tests/st/a5/tensormap_and_ringbuffer/paged_attention/ | Y | | bf16 | 1e-3/1e-3 | 24 | 4 | modified and migrated |
| 10 | tmrb | paged_attention_unroll (st) | Case1 | tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/ | Y | | bf16 | 1e-3/1e-3 | 36 | 4 | modified; no migration needed |
| 11 | tmrb | paged_attention_unroll (st) | Case2 | tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/ | Y | | bf16 | 1e-3/1e-3 | 36 | 4 | modified; no migration needed |
| 12 | tmrb | paged_attention_unroll (st) | Case3 | tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/ | Y | | bf16 | 1e-3/1e-3 | 36 | 4 | modified; no migration needed |
| 13 | tmrb | bgemm (example) | default | examples/a5/tensormap_and_ringbuffer/bgemm/ | Y | Y | fp32 | 1e-3/1e-3 | 3 | 4 | modified; no migration needed |
| 14 | tmrb | mixed_example (example) | case1 | examples/a5/tensormap_and_ringbuffer/mixed_example/ | Y | Y | fp32 | 1e-3/1e-3 | 3 | 4 | merged; changed to bf16 and 1e-3 |
| 15 | tmrb | mixed_example (example) | case2 | examples/a5/tensormap_and_ringbuffer/mixed_example/ | Y | Y | fp32 | 1e-3/1e-3 | 3 | 4 | merged; changed to bf16 and 1e-3 |
| 16 | tmrb | paged_attention (example) | Case1 | examples/a5/tensormap_and_ringbuffer/paged_attention/ | Y | | fp16 | 1e-2/1e-2 | 24 | 4 | merged; changed to bf16 and 1e-3 |
| 17 | tmrb | paged_attention (example) | Case2 | examples/a5/tensormap_and_ringbuffer/paged_attention/ | Y | | fp16 | 1e-2/1e-2 | 24 | 4 | merged; changed to bf16 and 1e-3 |
| 18 | tmrb | paged_attention (example) | CaseVarSeq2 | examples/a5/tensormap_and_ringbuffer/paged_attention/ | Y | | fp16 | 1e-2/1e-2 | 24 | 4 | merged; changed to bf16 and 1e-3 |
| 19 | tmrb | paged_attention (example) | CaseVarSeq4 | examples/a5/tensormap_and_ringbuffer/paged_attention/ | Y | | fp16 | 1e-2/1e-2 | 24 | 4 | merged; changed to bf16 and 1e-3 |
| 20 | tmrb | spmd_basic (example) | default | examples/a5/tensormap_and_ringbuffer/spmd_basic/ | Y | | N/A | 0/0 | 24 | 4 | modified and migrated |
| 21 | tmrb | spmd_multiblock_aiv (example) | default | examples/a5/tensormap_and_ringbuffer/spmd_multiblock_aiv/ | Y | | N/A | 0/0 | 24 | 4 | modified and migrated |
| 22 | tmrb | spmd_multiblock_mix (example) | default | examples/a5/tensormap_and_ringbuffer/spmd_multiblock_mix/ | Y | | N/A | 0/0 | 24 | 4 | modified and migrated |
| 23 | tmrb | spmd_starvation (example) | default | examples/a5/tensormap_and_ringbuffer/spmd_starvation/ | Y | | N/A | 0/0 | 24 | 4 | modified and migrated |
| 24 | tmrb | spmd_sync_start (example) | default | examples/a5/tensormap_and_ringbuffer/spmd_sync_start/ | Y | | N/A | 0/0 | 24 | 4 | modified and migrated |
| 25 | tmrb | spmd_sync_start_aiv (example) | default | examples/a5/tensormap_and_ringbuffer/spmd_sync_start_aiv/ | Y | | N/A | 0/0 | 24 | 4 | modified and migrated |
| 26 | tmrb | spmd_sync_start_edge (example) | default | examples/a5/tensormap_and_ringbuffer/spmd_sync_start_edge/ | Y | | N/A | 0/0 | 24 | 4 | modified and migrated |
| 27 | tmrb | spmd_sync_start_stress (example) | default | examples/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/ | Y | | N/A | 0/0 | 24 | 4 | modified and migrated |


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces production-scale paged attention support for the A5 platform, refactoring kernels to use bfloat16 and implementing runtime dispatch for various tile configurations. It also adds a comprehensive suite of SPMD and mixed-core execution tests. Feedback highlights a critical data race in the orchestration logic due to improper scope guard usage, potential compilation failures on x86 simulation environments from ARM-specific assembly, and a regression in Grouped Query Attention (GQA) support. Additionally, improvements were suggested regarding test reproducibility through manual seeding and more accurate profiling by reading system counter frequency at runtime.

```cpp
CYCLE_COUNT_LAP(prof_submit_task);

for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
    PTO2_SCOPE_GUARD();
```


critical

Using PTO2_SCOPE_GUARD() inside the bn loop will cause intermediate tensors (like sij, pij_f16, and oi_tmp) to be allocated at the same UB offsets in every iteration. Since the runtime is asynchronous and there are no explicit dependencies between tasks of different blocks (other than the UP chain), QK[bn+1] could overwrite the sij buffer while SF[bn] is still reading from it. This data race will lead to incorrect results. The scope guard should be moved outside the loop, or explicit dependencies must be added to ensure safe UB reuse.
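The hazard is easiest to see with a toy bump allocator. `PTO2_SCOPE_GUARD()` is project-specific, so the guard and allocator below are hypothetical stand-ins that only illustrate why a per-iteration scope guard hands every iteration the same UB offsets:

```cpp
#include <cstddef>
#include <vector>

// Toy bump allocator standing in for UB offset allocation (a sketch, not
// the project's allocator).
struct BumpAllocator {
    std::size_t offset = 0;
    std::size_t alloc(std::size_t bytes) {
        std::size_t o = offset;
        offset += bytes;
        return o;
    }
};

// Hypothetical scope guard mimicking the reviewed pattern: it restores the
// allocation offset when the scope ends, so buffers are reused afterwards.
struct ScopeGuard {
    BumpAllocator& a;
    std::size_t saved;
    explicit ScopeGuard(BumpAllocator& alloc) : a(alloc), saved(alloc.offset) {}
    ~ScopeGuard() { a.offset = saved; }
};

// Guard inside the loop: every iteration's "sij" lands at the SAME offset,
// so an asynchronous task for bn+1 can overwrite data task bn still reads.
inline std::vector<std::size_t> offsets_guard_inside(int iters) {
    BumpAllocator ub;
    std::vector<std::size_t> out;
    for (int bn = 0; bn < iters; ++bn) {
        ScopeGuard g(ub);
        out.push_back(ub.alloc(256));
    }
    return out;
}

// Guard outside the loop: each iteration gets a DISTINCT offset, so
// in-flight tasks never alias (at the cost of more UB usage).
inline std::vector<std::size_t> offsets_guard_outside(int iters) {
    BumpAllocator ub;
    ScopeGuard g(ub);
    std::vector<std::size_t> out;
    for (int bn = 0; bn < iters; ++bn) {
        out.push_back(ub.alloc(256));
    }
    return out;
}
```

With the guard inside the loop all three iterations receive offset 0; with the guard outside they receive 0, 256, and 512, which is why hoisting the guard (or adding explicit dependencies) removes the aliasing.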

Comment on lines +40 to +44
```cpp
inline uint64_t get_sys_cnt_aicpu() {
    uint64_t ticks;
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
}
```


high

The get_sys_cnt_aicpu function uses ARM-specific inline assembly (cntvct_el0). This will cause compilation failures when building for x86-based simulation environments (a5sim), which are explicitly mentioned as supported platforms in the PR description. This code should be wrapped in architecture-specific macros (e.g., #ifdef __aarch64__) with a portable fallback for simulation.
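A minimal sketch of the suggested guard, assuming the profiling code only needs a monotonic tick source on simulation builds; the `std::chrono` fallback is an assumption for illustration, not the project's API:

```cpp
#include <chrono>
#include <cstdint>

// Architecture-guarded counter read: aarch64 targets use cntvct_el0, while
// x86 simulation builds fall back to a portable monotonic clock.
inline uint64_t get_sys_cnt_aicpu() {
#if defined(__aarch64__)
    uint64_t ticks;
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
#else
    // Portable fallback for simulation: nanoseconds on the steady clock.
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count());
#endif
}
```

Note the two branches tick at different frequencies, so any cycles-to-seconds conversion must also be made branch-aware.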

Comment on lines 172 to 174
```cpp
uint32_t kv_offsets[2] = {static_cast<uint32_t>(cur_block_idx * block_size), 0};
Tensor kj = key_cache.view(kv_shapes, kv_offsets);
Tensor vj = value_cache.view(kv_shapes, kv_offsets);
```


high

The logic for viewing key_cache and value_cache ignores the kv_head_idx, effectively regressing GQA support compared to the previous host_build_graph implementation. By flattening the cache to 2D {total_tokens, head_dim} at line 105 and using offsets based only on cur_block_idx, the code incorrectly assumes kv_head_num == 1 or that heads are not interleaved. This will produce incorrect results for models using Grouped Query Attention with multiple KV heads.
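A sketch of the missing stride, assuming the cache is flattened from a [kv_head_num, total_tokens, head_dim] layout so consecutive heads sit total_tokens rows apart; the helper name and the layout are assumptions for illustration, not the kernel's actual code:

```cpp
#include <cstdint>

// Hypothetical row-offset helper: with a [kv_head_num, total_tokens, head_dim]
// cache flattened to 2D {kv_head_num * total_tokens, head_dim}, the row index
// must advance by total_tokens per KV head, not just by block position.
inline uint32_t kv_row_offset(uint32_t kv_head_idx, uint32_t total_tokens,
                              uint32_t cur_block_idx, uint32_t block_size) {
    return kv_head_idx * total_tokens + cur_block_idx * block_size;
}
```

Dropping the `kv_head_idx * total_tokens` term (as the reviewed code does) collapses every KV head onto head 0, which is exactly the GQA regression described above.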

Comment on lines +63 to +64
```python
A = torch.randn(BATCH, GRID_M, GRID_K, TILE_M, TILE_K, dtype=torch.float32) * 0.01
B = torch.randn(BATCH, GRID_K, GRID_N, TILE_K, TILE_N, dtype=torch.float32) * 0.01
```


medium

The input tensors are generated using torch.randn without setting a manual seed. This can lead to non-deterministic test results and potential flakiness if the comparison with golden results is sensitive to specific random values. It is recommended to set a seed for reproducibility, as seen in other test files in this PR.

Suggested change

```diff
-A = torch.randn(BATCH, GRID_M, GRID_K, TILE_M, TILE_K, dtype=torch.float32) * 0.01
-B = torch.randn(BATCH, GRID_K, GRID_N, TILE_K, TILE_N, dtype=torch.float32) * 0.01
+torch.manual_seed(42)
+A = torch.randn(BATCH, GRID_M, GRID_K, TILE_M, TILE_K, dtype=torch.float32) * 0.01
+B = torch.randn(BATCH, GRID_K, GRID_N, TILE_K, TILE_N, dtype=torch.float32) * 0.01
```
References
  1. For trivial, non-tunable scaling factors used in test input generation, using a literal directly is acceptable as it may not warrant a named constant.

```cpp
#define FUNC_SOFTMAX_PREPARE 1
#define FUNC_PV_MATMUL 2
#define FUNC_ONLINE_UPDATE 3
constexpr uint64_t PLATFORM_PROF_SYS_CNT_FREQ = 50000000; // 50 MHz
```


medium

The system counter frequency PLATFORM_PROF_SYS_CNT_FREQ is hardcoded to 50 MHz. This value may vary across different hardware revisions or simulation environments, leading to inaccurate profiling results. It is safer to read the frequency from the system register cntfrq_el0 at runtime.
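A sketch of the runtime read, assuming aarch64 targets expose the frequency via cntfrq_el0; the non-aarch64 branch keeps the old 50 MHz constant purely as a simulation fallback (an assumption, since the real simulator frequency is not stated here):

```cpp
#include <cstdint>

// Read the generic-timer frequency at runtime on aarch64 instead of
// hardcoding it; fall back to the previous 50 MHz constant elsewhere.
inline uint64_t get_sys_cnt_freq() {
#if defined(__aarch64__)
    uint64_t freq;
    asm volatile("mrs %0, cntfrq_el0" : "=r"(freq));
    return freq;
#else
    return 50000000ULL;  // assumed fallback for x86 simulation builds
#endif
}
```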

majin0824 added 3 commits April 16, 2026 16:42
- Replace golden.py + kernel_config.py with unified test_*.py files
  using @scene_test decorator and SceneTestCase base class
- Covers examples/a5/{host_build_graph,tensormap_and_ringbuffer} (14 examples)
  and tests/st/a5/{host_build_graph,tensormap_and_ringbuffer} (3 tests)
- Add a5sim to platforms for all cases that support simulation
- Cross-directory kernel references use relative paths (../)
…d attention

- Move spmd_*, mixed_example from examples/tmr/ to tests/st/tmr/
- Remove duplicate HBG paged_attention from examples/ (already in tests/st/)
- Remove old TMR paged_attention from tests/st/ (kept in examples/ as evolving reference)
- Upgrade TMR paged_attention: fp16 -> bfloat16, multi-tile dispatch (16x128, 64x64),
  production-scale cases (batch=256, head_dim=128/256), tighter tolerances (1e-3)
- Add small-tile (16,16,16) dispatch path to HBG paged_attention kernels
  with SmallCase1/SmallCase2 sim-compatible test cases
… migration process

- During the earlier case migration, some kernels were left without function-name definitions.

- This commit fills in the missing names in the aic and aiv modules of the test_*.py files to keep the code complete and consistent.