Background
Develop a PagedAttention SPMD example in the simpler framework, aligned with the AscendC native implementation paged_attention_antiquantkv.h (located at ops-transformer/attention/incre_flash_attention/op_kernel/arch32/).
The original AscendC implementation is approximately 1984 lines, targeting the Ascend V220 (arch32) architecture. It is a Flash Attention operator kernel for the incremental decoding phase, deeply optimized for INT8 quantized KV Cache scenarios.
Key Features to Align With
Based on the analysis in paged_attention_antiquantkv_analysis.md, the following core features need to be implemented:
1. Paged KV Cache Management
- Block Table addressing: logical block index → physical page index mapping
- Non-contiguous physical memory page management for KV Cache
- Dynamic sequence length support
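The addressing scheme above can be sketched in a few lines. This is an illustrative model only (the pool shape, `BLOCK_SIZE`, and the `gather_kv` helper are hypothetical names, not part of the AscendC source): KV pages live in one shared pool, and a per-sequence block table maps each logical block index to a physical page.

```python
import numpy as np

BLOCK_SIZE = 128          # tokens per KV cache page (illustrative value)
NUM_PAGES = 8
HEAD_DIM = 4              # tiny head dim for illustration

# Physical KV pages live in one pool: [num_pages, block_size, head_dim].
# A sequence's pages need not be contiguous or in order within this pool.
kv_pool = np.arange(NUM_PAGES * BLOCK_SIZE * HEAD_DIM, dtype=np.float32)
kv_pool = kv_pool.reshape(NUM_PAGES, BLOCK_SIZE, HEAD_DIM)

# Block table for one sequence: logical block index -> physical page index
block_table = [5, 2, 7]   # this sequence occupies pages 5, 2, 7, in that order
seq_len = 300             # dynamic sequence length (fits in 3 blocks here)

def gather_kv(token_idx: int) -> np.ndarray:
    """Fetch the KV vector for a logical token position via the block table."""
    logical_block = token_idx // BLOCK_SIZE
    offset = token_idx % BLOCK_SIZE
    physical_page = block_table[logical_block]
    return kv_pool[physical_page, offset]

# Token 130 is offset 2 of logical block 1, which maps to physical page 2
print(gather_kv(130).tolist())
```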
2. AIC/AIV Dual-Core Parallel Pipeline
- AIC core (matrix computation): Q×K^T matmul + P×V matmul
- AIV core (vector computation): Softmax computation + output reduction
- Memory hierarchy utilization: GM → L1 → L0A/L0B/L0C
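The compute split above can be illustrated functionally (ignoring the actual memory hierarchy): the two matmuls are the AIC-side work, the softmax between them is the AIV-side work. A minimal NumPy sketch of one decode step, with stages labeled by which core type would own them:

```python
import numpy as np

def attention_decode_step(q, k_cache, v_cache, scale):
    """One decode step, structured as the AIC/AIV stage split:
    the two matmuls map to AIC (cube) cores, the softmax to AIV (vector) cores."""
    # AIC stage 1: score matrix S = Q x K^T (on hardware: L0A/L0B inputs, L0C output)
    scores = scale * (q @ k_cache.T)          # [1, seq_len]
    # AIV stage: numerically stable softmax over the score row
    scores = scores - scores.max()
    p = np.exp(scores)
    p /= p.sum()
    # AIC stage 2: output O = P x V
    return p @ v_cache                        # [1, head_dim]

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out = attention_decode_step(q, k, v, 1 / np.sqrt(8))
```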
3. Cross-Core Synchronization (FFTS)
- QK_READY_FLAG: AIC → AIV, score matrix computation complete
- SOFTMAX_READY_D: AIV → AIC, softmax probability matrix ready
- UPDATE_READY_D: AIV → AIC, output update complete
- VEC_DEQ_K0/K1_READY: AIV → AIC, K ping/pong buffer ready
- VEC_DEQ_V0/V1_READY: AIV → AIC, V ping/pong buffer ready
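The QK_READY_FLAG / SOFTMAX_READY_D handshake can be modeled with ordinary events; this is a host-side analogy only (FFTS flags are hardware semaphores, and `shared` stands in for shared memory between the cores):

```python
import threading
import numpy as np

# Stand-ins for two of the FFTS cross-core flags
QK_READY_FLAG = threading.Event()
SOFTMAX_READY_D = threading.Event()

shared = {"scores": None, "probs": None, "out": None}  # models shared buffers

def aic_core(q, k, v):
    shared["scores"] = q @ k.T            # Q x K^T on the cube core
    QK_READY_FLAG.set()                   # AIC -> AIV: score matrix ready
    SOFTMAX_READY_D.wait()                # block until AIV publishes probabilities
    shared["out"] = shared["probs"] @ v   # P x V on the cube core

def aiv_core():
    QK_READY_FLAG.wait()                  # AIV blocks until scores are ready
    s = shared["scores"] - shared["scores"].max()
    p = np.exp(s)
    shared["probs"] = p / p.sum()
    SOFTMAX_READY_D.set()                 # AIV -> AIC: probabilities ready

rng = np.random.default_rng(1)
q = rng.standard_normal((1, 4))
k = rng.standard_normal((6, 4))
v = rng.standard_normal((6, 4))
t1 = threading.Thread(target=aic_core, args=(q, k, v))
t2 = threading.Thread(target=aiv_core)
t1.start(); t2.start(); t1.join(); t2.join()
```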
4. Online Softmax
- Streaming softmax computation without materializing the full attention matrix
- Temperature scaling, mask application, and numerical stability handling
- Output accumulation and final normalization
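The three points above are the standard online-softmax recurrence: per KV block, keep a running max `m`, a running denominator `l`, and an unnormalized output accumulator, rescaling the previous state whenever the max grows. A minimal sketch (function and variable names are illustrative, not from the AscendC source; temperature scaling is folded into `scale` and masking is omitted):

```python
import numpy as np

def online_softmax_attention(q, k_blocks, v_blocks, scale):
    """Streaming softmax: consume KV one block at a time without ever
    materializing the full attention row."""
    m = -np.inf                               # running max (numerical stability)
    l = 0.0                                   # running sum of exp(s - m)
    acc = np.zeros(v_blocks[0].shape[1])      # unnormalized output accumulator
    for k_blk, v_blk in zip(k_blocks, v_blocks):
        s = scale * (q @ k_blk.T)             # partial scores for this block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)        # rescale state from earlier blocks
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l                            # final normalization

rng = np.random.default_rng(2)
q = rng.standard_normal(8)
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))
out = online_softmax_attention(q, np.split(k, 4), np.split(v, 4), 1 / np.sqrt(8))
```

Because each block only updates `(m, l, acc)`, the peak memory is one block of scores rather than the full `[1, seq_len]` row, which is what lets the kernel stream paged KV blocks through the pipeline.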
5. Memory Management & Optimization
- Ping-pong double buffering
- Fine-grained UB memory layout (score/probability matrices, accumulators, etc.)
- Buffer specifications: L0A/L0B 32KB each, L0C 16KB
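The ping/pong pattern pairs naturally with the VEC_DEQ_K0/K1 flags: the AIV core dequantizes the next INT8 block into one buffer while the AIC core consumes the other. A sequential sketch of the buffer rotation (names and the per-tensor `scale`/`offset` scheme are illustrative; on hardware the two sides run concurrently under the FFTS flags):

```python
import numpy as np

def dequant_pingpong(k_int8_blocks, scale, offset):
    """Dequantize INT8 K blocks through two alternating buffers.
    Buffer i % 2 models the K0/K1 ping/pong pair."""
    bufs = [np.empty(k_int8_blocks[0].shape, dtype=np.float32) for _ in range(2)]
    out = []
    for i, blk in enumerate(k_int8_blocks):
        b = i % 2                                         # alternate ping/pong
        bufs[b][:] = scale * (blk.astype(np.float32) - offset)  # "AIV dequant"
        out.append(bufs[b].copy())                        # "AIC consume"
    return np.concatenate(out)
```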
Task Breakdown
References
- paged_attention_antiquantkv_analysis.md: Detailed analysis of the AscendC original implementation
- AscendC source:
ops-transformer/attention/incre_flash_attention/op_kernel/arch32/paged_attention_antiquantkv.h