[WIP] Refactor: migrate A5 examples and tests to SceneTestCase format #577
doraemonmj wants to merge 3 commits into hw-native-sys:main from
Conversation
Code Review
This pull request introduces production-scale paged attention support for the A5 platform, refactoring kernels to use bfloat16 and implementing runtime dispatch for various tile configurations. It also adds a comprehensive suite of SPMD and mixed-core execution tests. Feedback highlights a critical data race in the orchestration logic due to improper scope guard usage, potential compilation failures on x86 simulation environments from ARM-specific assembly, and a regression in Grouped Query Attention (GQA) support. Additionally, improvements were suggested regarding test reproducibility through manual seeding and more accurate profiling by reading system counter frequency at runtime.
CYCLE_COUNT_LAP(prof_submit_task);

for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
    PTO2_SCOPE_GUARD();
Using PTO2_SCOPE_GUARD() inside the bn loop will cause intermediate tensors (like sij, pij_f16, and oi_tmp) to be allocated at the same UB offsets in every iteration. Since the runtime is asynchronous and there are no explicit dependencies between tasks of different blocks (other than the UP chain), QK[bn+1] could overwrite the sij buffer while SF[bn] is still reading from it. This data race will lead to incorrect results. The scope guard should be moved outside the loop, or explicit dependencies must be added to ensure safe UB reuse.
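The lifetime hazard can be illustrated with a toy bump allocator; the `ScopeGuard` below is a hypothetical stand-in for `PTO2_SCOPE_GUARD()`, not the runtime's actual implementation:

```cpp
#include <cstdint>

// Toy stand-in for the UB bump allocator; not the PTO2 runtime's real API.
struct BumpAllocator {
    uint64_t offset = 0;
    uint64_t alloc(uint64_t bytes) {
        uint64_t at = offset;
        offset += bytes;
        return at;
    }
};

// Hypothetical stand-in for PTO2_SCOPE_GUARD(): saves the allocator offset
// on entry and rewinds on exit, so a scope's allocations are recycled as
// soon as the scope closes.
struct ScopeGuard {
    BumpAllocator& ub;
    uint64_t saved;
    explicit ScopeGuard(BumpAllocator& a) : ub(a), saved(a.offset) {}
    ~ScopeGuard() { ub.offset = saved; }
};

// Guard inside the loop: every iteration's "sij" lands at the same offset,
// which is unsafe if iteration bn+1's writes can overlap in time with
// iteration bn's in-flight reads.
uint64_t sij_offset_guard_inside(BumpAllocator& ub, uint64_t bn_count) {
    uint64_t last = 0;
    for (uint64_t bn = 0; bn < bn_count; bn++) {
        ScopeGuard guard(ub);   // rewinds at the end of each iteration
        last = ub.alloc(4096);  // "sij" buffer
    }
    return last;
}

// Guard outside the loop: each iteration gets a distinct offset, so no
// iteration can stomp on a buffer another task is still reading.
uint64_t sij_offset_guard_outside(BumpAllocator& ub, uint64_t bn_count) {
    ScopeGuard guard(ub);       // rewinds once, after all iterations
    uint64_t last = 0;
    for (uint64_t bn = 0; bn < bn_count; bn++) {
        last = ub.alloc(4096);
    }
    return last;
}
```

With the guard inside the loop, the returned "sij" offset is 0 on every iteration; with it outside, each iteration receives a fresh 4 KiB region.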
inline uint64_t get_sys_cnt_aicpu() {
    uint64_t ticks;
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
}
The get_sys_cnt_aicpu function uses ARM-specific inline assembly (cntvct_el0). This will cause compilation failures when building for x86-based simulation environments (a5sim), which are explicitly mentioned as supported platforms in the PR description. This code should be wrapped in architecture-specific macros (e.g., #ifdef __aarch64__) with a portable fallback for simulation.
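One possible shape for that guard, sketched with a std::chrono fallback for non-aarch64 builds (note the fallback's tick unit differs from the ARM generic timer, so it is only suitable for simulation):

```cpp
#include <cstdint>
#if !defined(__aarch64__)
#include <chrono>
#endif

inline uint64_t get_sys_cnt_aicpu() {
#if defined(__aarch64__)
    // Hardware virtual counter, aarch64 only.
    uint64_t ticks;
    asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
#else
    // Portable fallback for x86 simulation builds (a5sim): monotonic clock
    // ticks. The unit differs from the ARM generic timer, so any cycle math
    // must account for the different frequency.
    return static_cast<uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```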
uint32_t kv_offsets[2] = {static_cast<uint32_t>(cur_block_idx * block_size), 0};
Tensor kj = key_cache.view(kv_shapes, kv_offsets);
Tensor vj = value_cache.view(kv_shapes, kv_offsets);
The logic for viewing key_cache and value_cache ignores the kv_head_idx, effectively regressing GQA support compared to the previous host_build_graph implementation. By flattening the cache to 2D {total_tokens, head_dim} at line 105 and using offsets based only on cur_block_idx, the code incorrectly assumes kv_head_num == 1 or that heads are not interleaved. This will produce incorrect results for models using Grouped Query Attention with multiple KV heads.
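As a sketch only (the actual cache layout depends on the PR's tensor format), an offset computation that folds in `kv_head_idx` could look like the following, assuming blocks are stored as [num_blocks, kv_head_num, block_size, head_dim] and flattened to the 2D {total_tokens, head_dim} view; the function name and layout are illustrative, not the PR's API:

```cpp
#include <cstdint>

// Hypothetical row offset into a 2D {total_tokens, head_dim} cache view,
// assuming a [num_blocks, kv_head_num, block_size, head_dim] layout.
// Reduces to cur_block_idx * block_size when kv_head_num == 1, which is
// exactly what the existing (GQA-unaware) code computes.
inline uint32_t kv_row_offset(uint32_t cur_block_idx, uint32_t kv_head_idx,
                              uint32_t kv_head_num, uint32_t block_size) {
    return (cur_block_idx * kv_head_num + kv_head_idx) * block_size;
}
```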
A = torch.randn(BATCH, GRID_M, GRID_K, TILE_M, TILE_K, dtype=torch.float32) * 0.01
B = torch.randn(BATCH, GRID_K, GRID_N, TILE_K, TILE_N, dtype=torch.float32) * 0.01
The input tensors are generated using torch.randn without setting a manual seed. This can lead to non-deterministic test results and potential flakiness if the comparison with golden results is sensitive to specific random values. It is recommended to set a seed for reproducibility, as seen in other test files in this PR.
Suggested change:
-A = torch.randn(BATCH, GRID_M, GRID_K, TILE_M, TILE_K, dtype=torch.float32) * 0.01
-B = torch.randn(BATCH, GRID_K, GRID_N, TILE_K, TILE_N, dtype=torch.float32) * 0.01
+torch.manual_seed(42)
+A = torch.randn(BATCH, GRID_M, GRID_K, TILE_M, TILE_K, dtype=torch.float32) * 0.01
+B = torch.randn(BATCH, GRID_K, GRID_N, TILE_K, TILE_N, dtype=torch.float32) * 0.01
References
- For trivial, non-tunable scaling factors used in test input generation, using a literal directly is acceptable as it may not warrant a named constant.
#define FUNC_SOFTMAX_PREPARE 1
#define FUNC_PV_MATMUL 2
#define FUNC_ONLINE_UPDATE 3
constexpr uint64_t PLATFORM_PROF_SYS_CNT_FREQ = 50000000; // 50 MHz
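The review summary's suggestion to read the system counter frequency at runtime rather than hardcoding 50 MHz could take this shape on aarch64, where `cntfrq_el0` is the architectural frequency register; the constant remains as a fallback for simulation builds (function name is illustrative):

```cpp
#include <cstdint>

constexpr uint64_t PLATFORM_PROF_SYS_CNT_FREQ = 50000000;  // 50 MHz fallback

// Sketch: query the generic timer frequency at runtime instead of trusting
// a hardcoded 50 MHz, which may not match every SoC configuration.
inline uint64_t get_sys_cnt_freq() {
#if defined(__aarch64__)
    uint64_t freq;
    asm volatile("mrs %0, cntfrq_el0" : "=r"(freq));
    return freq;
#else
    // x86 simulation builds have no cntfrq_el0; keep the documented default.
    return PLATFORM_PROF_SYS_CNT_FREQ;
#endif
}
```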
- Replace golden.py + kernel_config.py with unified test_*.py files
using @scene_test decorator and SceneTestCase base class
- Covers examples/a5/{host_build_graph,tensormap_and_ringbuffer} (14 examples)
and tests/st/a5/{host_build_graph,tensormap_and_ringbuffer} (3 tests)
- Add a5sim to platforms for all cases that support simulation
- Cross-directory kernel references use relative paths (../)
…d attention
- Move spmd_*, mixed_example from examples/tmr/ to tests/st/tmr/
- Remove duplicate HBG paged_attention from examples/ (already in tests/st/)
- Remove old TMR paged_attention from tests/st/ (kept in examples/ as evolving reference)
- Upgrade TMR paged_attention: fp16 -> bfloat16, multi-tile dispatch (16x128, 64x64), production-scale cases (batch=256, head_dim=128/256), tighter tolerances (1e-3)
- Add small-tile (16,16,16) dispatch path to HBG paged_attention kernels with SmallCase1/SmallCase2 sim-compatible test cases
… migration process
- During the previous use-case migration, some kernels were left without function-name definitions.
- This commit fills in the missing names in the aic and aiv modules of test_*.py to keep the code complete and consistent.
Full case table
- tests/st/a5/host_build_graph/dump_tensor/ (1 case)
- tests/st/a5/host_build_graph/paged_attention/ (2 cases)
- examples/a5/host_build_graph/paged_attention/ (2 cases)
- tests/st/a5/tensormap_and_ringbuffer/explicit_fatal/ (1 case)
- tests/st/a5/tensormap_and_ringbuffer/paged_attention/ (3 cases)
- tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/ (3 cases)
- examples/a5/tensormap_and_ringbuffer/bgemm/ (1 case)
- examples/a5/tensormap_and_ringbuffer/mixed_example/ (2 cases)
- examples/a5/tensormap_and_ringbuffer/paged_attention/ (4 cases)
- examples/a5/tensormap_and_ringbuffer/spmd_basic/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_multiblock_aiv/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_multiblock_mix/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_starvation/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_sync_start/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_sync_start_aiv/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_sync_start_edge/ (1 case)
- examples/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/ (1 case)