Background
Develop a PagedAttention SPMD example in the simpler framework, aligned with the AscendC native implementation paged_attention_antiquantkv.h (located at ops-transformer/attention/incre_flash_attention/op_kernel/arch32/).
The original AscendC implementation is approximately 1984 lines, targeting the Ascend V220 (arch32) architecture. It is a Flash Attention operator kernel for the incremental decoding phase, deeply optimized for INT8 quantized KV Cache scenarios.
Key Features to Align With
Based on the analysis in paged_attention_antiquantkv_analysis.md, the following core features need to be implemented:
1. Paged KV Cache Management
- Block Table addressing: logical block index → physical page index mapping
- Non-contiguous physical memory page management for KV Cache
- Dynamic sequence length support
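The addressing scheme above can be sketched in a few lines. This is an illustrative model only (the pool shape, `BLOCK_SIZE`, and the `gather_kv` helper are hypothetical names, not part of the AscendC source): KV pages live in one shared pool, and a per-sequence block table maps each logical block index to a physical page.

```python
import numpy as np

BLOCK_SIZE = 128          # tokens per KV cache page (illustrative value)
NUM_PAGES = 8
HEAD_DIM = 4              # tiny head dim for illustration

# Physical KV pages live in one pool: [num_pages, block_size, head_dim].
# A sequence's pages need not be contiguous or in order within this pool.
kv_pool = np.arange(NUM_PAGES * BLOCK_SIZE * HEAD_DIM, dtype=np.float32)
kv_pool = kv_pool.reshape(NUM_PAGES, BLOCK_SIZE, HEAD_DIM)

# Block table for one sequence: logical block index -> physical page index
block_table = [5, 2, 7]   # this sequence occupies pages 5, 2, 7, in that order
seq_len = 300             # dynamic sequence length (fits in 3 blocks here)

def gather_kv(token_idx: int) -> np.ndarray:
    """Fetch the KV vector for a logical token position via the block table."""
    logical_block = token_idx // BLOCK_SIZE
    offset = token_idx % BLOCK_SIZE
    physical_page = block_table[logical_block]
    return kv_pool[physical_page, offset]

# Token 130 is offset 2 of logical block 1, which maps to physical page 2
print(gather_kv(130).tolist())
```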
2. AIC/AIV Dual-Core Parallel Pipeline
- AIC core (matrix computation): Q×K^T matmul + P×V matmul
- AIV core (vector computation): Softmax computation + output reduction
- Memory hierarchy utilization: GM → L1 → L0A/L0B/L0C
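The compute split above can be illustrated functionally (ignoring the actual memory hierarchy): the two matmuls are the AIC-side work, the softmax between them is the AIV-side work. A minimal NumPy sketch of one decode step, with stages labeled by which core type would own them:

```python
import numpy as np

def attention_decode_step(q, k_cache, v_cache, scale):
    """One decode step, structured as the AIC/AIV stage split:
    the two matmuls map to AIC (cube) cores, the softmax to AIV (vector) cores."""
    # AIC stage 1: score matrix S = Q x K^T (on hardware: L0A/L0B inputs, L0C output)
    scores = scale * (q @ k_cache.T)          # [1, seq_len]
    # AIV stage: numerically stable softmax over the score row
    scores = scores - scores.max()
    p = np.exp(scores)
    p /= p.sum()
    # AIC stage 2: output O = P x V
    return p @ v_cache                        # [1, head_dim]

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out = attention_decode_step(q, k, v, 1 / np.sqrt(8))
```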
3. Cross-Core Synchronization (FFTS)
- QK_READY_FLAG: AIC → AIV, score matrix computation complete
- SOFTMAX_READY_D: AIV → AIC, softmax probability matrix ready
- UPDATE_READY_D: AIV → AIC, output update complete
- VEC_DEQ_K0/K1_READY: AIV → AIC, K ping/pong buffer ready
- VEC_DEQ_V0/V1_READY: AIV → AIC, V ping/pong buffer ready
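The QK_READY_FLAG / SOFTMAX_READY_D handshake can be modeled with ordinary events; this is a host-side analogy only (FFTS flags are hardware semaphores, and `shared` stands in for shared memory between the cores):

```python
import threading
import numpy as np

# Stand-ins for two of the FFTS cross-core flags
QK_READY_FLAG = threading.Event()
SOFTMAX_READY_D = threading.Event()

shared = {"scores": None, "probs": None, "out": None}  # models shared buffers

def aic_core(q, k, v):
    shared["scores"] = q @ k.T            # Q x K^T on the cube core
    QK_READY_FLAG.set()                   # AIC -> AIV: score matrix ready
    SOFTMAX_READY_D.wait()                # block until AIV publishes probabilities
    shared["out"] = shared["probs"] @ v   # P x V on the cube core

def aiv_core():
    QK_READY_FLAG.wait()                  # AIV blocks until scores are ready
    s = shared["scores"] - shared["scores"].max()
    p = np.exp(s)
    shared["probs"] = p / p.sum()
    SOFTMAX_READY_D.set()                 # AIV -> AIC: probabilities ready

rng = np.random.default_rng(1)
q = rng.standard_normal((1, 4))
k = rng.standard_normal((6, 4))
v = rng.standard_normal((6, 4))
t1 = threading.Thread(target=aic_core, args=(q, k, v))
t2 = threading.Thread(target=aiv_core)
t1.start(); t2.start(); t1.join(); t2.join()
```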
4. Online Softmax
- Streaming softmax computation without materializing the full attention matrix
- Temperature scaling, mask application, and numerical stability handling
- Output accumulation and final normalization
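The three points above are the standard online-softmax recurrence: per KV block, keep a running max `m`, a running denominator `l`, and an unnormalized output accumulator, rescaling the previous state whenever the max grows. A minimal sketch (function and variable names are illustrative, not from the AscendC source; temperature scaling is folded into `scale` and masking is omitted):

```python
import numpy as np

def online_softmax_attention(q, k_blocks, v_blocks, scale):
    """Streaming softmax: consume KV one block at a time without ever
    materializing the full attention row."""
    m = -np.inf                               # running max (numerical stability)
    l = 0.0                                   # running sum of exp(s - m)
    acc = np.zeros(v_blocks[0].shape[1])      # unnormalized output accumulator
    for k_blk, v_blk in zip(k_blocks, v_blocks):
        s = scale * (q @ k_blk.T)             # partial scores for this block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)        # rescale state from earlier blocks
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l                            # final normalization

rng = np.random.default_rng(2)
q = rng.standard_normal(8)
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))
out = online_softmax_attention(q, np.split(k, 4), np.split(v, 4), 1 / np.sqrt(8))
```

Because each block only updates `(m, l, acc)`, the peak memory is one block of scores rather than the full `[1, seq_len]` row, which is what lets the kernel stream paged KV blocks through the pipeline.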
5. Memory Management & Optimization
- Ping-pong double buffering
- Fine-grained UB memory layout (score/probability matrices, accumulators, etc.)
- Buffer specifications: L0A/L0B 32KB each, L0C 16KB
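The ping/pong pattern pairs naturally with the VEC_DEQ_K0/K1 flags: the AIV core dequantizes the next INT8 block into one buffer while the AIC core consumes the other. A sequential sketch of the buffer rotation (names and the per-tensor `scale`/`offset` scheme are illustrative; on hardware the two sides run concurrently under the FFTS flags):

```python
import numpy as np

def dequant_pingpong(k_int8_blocks, scale, offset):
    """Dequantize INT8 K blocks through two alternating buffers.
    Buffer i % 2 models the K0/K1 ping/pong pair."""
    bufs = [np.empty(k_int8_blocks[0].shape, dtype=np.float32) for _ in range(2)]
    out = []
    for i, blk in enumerate(k_int8_blocks):
        b = i % 2                                         # alternate ping/pong
        bufs[b][:] = scale * (blk.astype(np.float32) - offset)  # "AIV dequant"
        out.append(bufs[b].copy())                        # "AIC consume"
    return np.concatenate(out)
```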
Task Breakdown
References
- paged_attention_antiquantkv_analysis.md: Detailed analysis of the AscendC original implementation
- AscendC source:
ops-transformer/attention/incre_flash_attention/op_kernel/arch32/paged_attention_antiquantkv.h