[Feature] Develop PagedAttention SPMD Example (Aligned with AscendC paged_attention_antiquantkv) #487

@chenshengxin2026

Description

@chenshengxin2026

Background

Develop a PagedAttention SPMD example in the simpler framework, aligned with the AscendC native implementation paged_attention_antiquantkv.h (located at ops-transformer/attention/incre_flash_attention/op_kernel/arch32/).

The original AscendC implementation is approximately 1984 lines, targeting the Ascend V220 (arch32) architecture. It is a Flash Attention operator kernel for the incremental decoding phase, deeply optimized for INT8 quantized KV Cache scenarios.

Key Features to Align With

Based on the analysis in paged_attention_antiquantkv_analysis.md, the following core features need to be implemented:

1. Paged KV Cache Management

  • Block Table addressing: logical block index → physical page index mapping
  • Non-contiguous physical memory page management for KV Cache
  • Dynamic sequence length support

2. AIC/AIV Dual-Core Parallel Pipeline

  • AIC core (matrix computation): Q×K^T matmul + P×V matmul
  • AIV core (vector computation): Softmax computation + output reduction
  • Memory hierarchy utilization: GM → L1 → L0A/L0B/L0C

3. Cross-Core Synchronization (FFTS)

  • QK_READY_FLAG: AIC → AIV, score matrix computation complete
  • SOFTMAX_READY_D: AIV → AIC, softmax probability matrix ready
  • UPDATE_READY_D: AIV → AIC, output update complete
  • VEC_DEQ_K0/K1_READY: AIV → AIC, K ping/pong buffer ready
  • VEC_DEQ_V0/V1_READY: AIV → AIC, V ping/pong buffer ready

4. Online Softmax

  • Streaming softmax computation without materializing the full attention matrix
  • Temperature scaling, mask application, and numerical stability handling
  • Output accumulation and final normalization

5. Memory Management & Optimization

  • Ping-pong double buffering
  • Fine-grained UB memory layout (score/probability matrices, accumulators, etc.)
  • Buffer specifications: L0A/L0B 32KB each, L0C 16KB

Task Breakdown

  • Analyze the complete computation flow and data flow of the AscendC original implementation
  • Design the SPMD version architecture, determining how to map AIC/AIV dual-core logic to the SPMD programming model
  • Implement Paged KV Cache Block Table addressing logic
  • Implement AIC-side matrix computation (Q×K^T, P×V)
  • Implement AIV-side vector computation (Softmax, reduction)
  • Implement cross-core synchronization mechanism
  • Implement Online Softmax streaming computation
  • End-to-end functional verification and correctness testing
  • Performance benchmarking (compared against AscendC original implementation)

References

  • paged_attention_antiquantkv_analysis.md: Detailed analysis of the AscendC original implementation
  • AscendC source: ops-transformer/attention/incre_flash_attention/op_kernel/arch32/paged_attention_antiquantkv.h

Metadata

Labels

enhancement (New feature or request)

Type

No type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions