Overview
This issue tracks ongoing runtime performance optimization work for the tensormap_and_ringbuffer runtime on the a2a3 (Ascend 910B/C) platform. Each subtask below represents an independent optimization point.
Platform
All / Unknown
Runtime Variant
tensormap_and_ringbuffer
Git Commit ID
6644bc7
CANN Version
8.5.0.alpha001
Host Platform
Linux (aarch64)
Optimization Tasks
Subtask 1: Parallel for dependence optimization
Optimize parallel_for loops by analyzing data dependences to enable more aggressive parallelism.
Currently, parallel_for constructs may be overly conservative in their dependence assumptions, preventing loop iterations from running in parallel when they could safely do so. By introducing dependence analysis, we can identify loops with no loop-carried dependences and schedule them with full parallelism.
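The safety condition described above can be sketched as follows. This is a hypothetical illustration, not the runtime's actual analysis: it models a loop body's memory accesses as per-iteration read/write sets and flags any location touched by two different iterations (with at least one write) as a loop-carried dependence.

```python
# Minimal sketch of loop-carried dependence detection (hypothetical helper,
# not the runtime's real implementation). A parallel_for iteration space is
# safe to run fully parallel when no iteration writes a location that a
# different iteration reads or writes.

def has_loop_carried_dependence(n, writes, reads):
    """writes/reads: functions mapping an iteration index -> set of locations."""
    written = {}
    for i in range(n):
        for loc in writes(i):
            written.setdefault(loc, set()).add(i)
    for i in range(n):
        for loc in reads(i) | writes(i):
            # a conflict with a *different* iteration is loop-carried
            if any(j != i for j in written.get(loc, ())):
                return True
    return False

# a[i] = a[i] + 1: each iteration touches only its own element -> fully parallel
assert not has_loop_carried_dependence(8, lambda i: {i}, lambda i: {i})
# a[i] = a[i-1]: iteration i reads what iteration i-1 wrote -> must serialize
assert has_loop_carried_dependence(8, lambda i: {i}, lambda i: {i - 1})
```

Loops that pass this check (no conflicting iteration pair) can be scheduled with full parallelism; the conservative fallback applies only when a conflict is found.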
Status: Open
Subtask 2: Dual slot scheduling for mix subgraph tasks
Support dual slot scheduling for mix subgraphs (subgraphs that contain both AIC and AIV tasks).
Currently, mix subgraph tasks are scheduled conservatively with a single slot, serializing AIC and AIV work even when they could be dispatched concurrently into two hardware slots. Enabling dual slot scheduling for mix subgraphs would allow AIC and AIV kernels to overlap in execution, reducing end-to-end latency.
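The latency effect of the change above can be sketched with a toy model. The AIC/AIV names come from this issue; the durations and helper functions are invented for illustration only: single-slot scheduling sums all kernel durations, while dual-slot scheduling lets the AIC stream and the AIV stream overlap, so the makespan is the longer of the two.

```python
# Hypothetical latency model for mix subgraph scheduling (durations invented).
# tasks: list of (kind, duration) pairs, kind in {"AIC", "AIV"}.

def single_slot_makespan(tasks):
    # one slot: AIC and AIV work is serialized, latency is the total sum
    return sum(d for _, d in tasks)

def dual_slot_makespan(tasks):
    # two slots: AIC tasks occupy one slot, AIV the other, and they overlap
    aic = sum(d for kind, d in tasks if kind == "AIC")
    aiv = sum(d for kind, d in tasks if kind == "AIV")
    return max(aic, aiv)

mix = [("AIC", 40), ("AIV", 30), ("AIC", 20), ("AIV", 25)]
assert single_slot_makespan(mix) == 115
assert dual_slot_makespan(mix) == 60   # AIC slot: 60, AIV slot: 55 -> overlap
```

In this model the dual-slot win is largest when AIC and AIV durations are balanced; a subgraph dominated by one kind of kernel gains little from the second slot.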
Status: ✅ Done
Related PRs:
Reproduction
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention/golden.py \
-p a2a3 -d 5 -n 10
Expected Performance
Each subtask is expected to reduce end-to-end latency. Specific numbers TBD after profiling each optimization.
Actual Performance
Current baseline (before optimizations). No regression — these are proactive optimization opportunities.