Overview
This issue tracks ongoing runtime performance optimization work for the tensormap_and_ringbuffer runtime on the a2a3 (Ascend 910B/C) platform. Each subtask below represents an independent optimization point.
Platform
All / Unknown
Runtime Variant
tensormap_and_ringbuffer
Git Commit ID
6644bc7
CANN Version
8.5.0.alpha001
Host Platform
Linux (aarch64)
Optimization Tasks
Subtask 1: Parallel for dependence optimization
Optimize parallel_for loops by analyzing data dependences to enable more aggressive parallelism.
Currently, parallel_for constructs may be overly conservative in their dependence assumptions, preventing loop iterations from running in parallel when they could safely do so. By introducing dependence analysis, we can identify loops with no loop-carried dependences and schedule them with full parallelism.
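The safety condition described above can be sketched as follows. This is a hypothetical illustration, not the runtime's actual analysis: it models a loop body's memory accesses as per-iteration read/write sets and flags any location touched by two different iterations (with at least one write) as a loop-carried dependence.

```python
# Minimal sketch of loop-carried dependence detection (hypothetical helper,
# not the runtime's real implementation). A parallel_for iteration space is
# safe to run fully parallel when no iteration writes a location that a
# different iteration reads or writes.

def has_loop_carried_dependence(n, writes, reads):
    """writes/reads: functions mapping an iteration index -> set of locations."""
    written = {}
    for i in range(n):
        for loc in writes(i):
            written.setdefault(loc, set()).add(i)
    for i in range(n):
        for loc in reads(i) | writes(i):
            # a conflict with a *different* iteration is loop-carried
            if any(j != i for j in written.get(loc, ())):
                return True
    return False

# a[i] = a[i] + 1: each iteration touches only its own element -> fully parallel
assert not has_loop_carried_dependence(8, lambda i: {i}, lambda i: {i})
# a[i] = a[i-1]: iteration i reads what iteration i-1 wrote -> must serialize
assert has_loop_carried_dependence(8, lambda i: {i}, lambda i: {i - 1})
```

Loops that pass this check (no conflicting iteration pair) can be scheduled with full parallelism; the conservative fallback applies only when a conflict is found.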
Status: Open
Subtask 2: Dual slot scheduling for mix subgraph tasks
Support dual slot scheduling for mix subgraphs (subgraphs that contain both AIC and AIV tasks).
Currently, mix subgraph tasks are scheduled conservatively with a single slot, serializing AIC and AIV work even when they could be dispatched concurrently into two hardware slots. Enabling dual slot scheduling for mix subgraphs would allow AIC and AIV kernels to overlap in execution, reducing end-to-end latency.
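The latency effect of the change above can be sketched with a toy model. The AIC/AIV names come from this issue; the durations and helper functions are invented for illustration only: single-slot scheduling sums all kernel durations, while dual-slot scheduling lets the AIC stream and the AIV stream overlap, so the makespan is the longer of the two.

```python
# Hypothetical latency model for mix subgraph scheduling (durations invented).
# tasks: list of (kind, duration) pairs, kind in {"AIC", "AIV"}.

def single_slot_makespan(tasks):
    # one slot: AIC and AIV work is serialized, latency is the total sum
    return sum(d for _, d in tasks)

def dual_slot_makespan(tasks):
    # two slots: AIC tasks occupy one slot, AIV the other, and they overlap
    aic = sum(d for kind, d in tasks if kind == "AIC")
    aiv = sum(d for kind, d in tasks if kind == "AIV")
    return max(aic, aiv)

mix = [("AIC", 40), ("AIV", 30), ("AIC", 20), ("AIV", 25)]
assert single_slot_makespan(mix) == 115
assert dual_slot_makespan(mix) == 60   # AIC slot: 60, AIV slot: 55 -> overlap
```

In this model the dual-slot win is largest when AIC and AIV durations are balanced; a subgraph dominated by one kind of kernel gains little from the second slot.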
Status: ✅ Done
Related PRs:
Reproduction
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention/golden.py \
-p a2a3 -d 5 -n 10
Expected Performance
Each subtask is expected to reduce end-to-end latency. Specific numbers TBD after profiling each optimization.
Actual Performance
Current baseline (before optimizations). No regression — these are proactive optimization opportunities.