Feature: variable seq length, ring buffer flow control, configurable task window, and swimlane tooling #113

Closed

hengliao1972 wants to merge 6 commits into hw-native-sys:main from hengliao1972:feature/variable-seq-length

Conversation


@hengliao1972 hengliao1972 commented Feb 26, 2026

Summary

This PR adds several enhancements to the batched paged attention pipeline:

Per-batch variable sequence length

  • Support individual context_len per batch item via PA_SEQ_LEN environment variable
  • Example: PA_SEQ_LEN=16,32,48,64,... for heterogeneous sequence lengths
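
The PA_SEQ_LEN expansion described above can be sketched as follows. This is a minimal illustration only: `parse_seq_lens` and its pad-with-last-value behavior are assumptions, not the PR's actual helper.

```cpp
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: expand a comma-separated PA_SEQ_LEN string
// (e.g. "16,32,48") into one context length per batch item. Batches
// beyond the end of the list reuse the last listed value.
std::vector<uint64_t> parse_seq_lens(const char* env, uint64_t batch,
                                     uint64_t default_len) {
    std::vector<uint64_t> lens;
    if (env && *env) {
        std::stringstream ss(env);
        std::string tok;
        while (std::getline(ss, tok, ','))
            lens.push_back(std::strtoull(tok.c_str(), nullptr, 10));
    }
    if (lens.empty()) lens.push_back(default_len);  // env var unset
    while (lens.size() < batch) lens.push_back(lens.back());  // pad
    lens.resize(batch);  // truncate any extras
    return lens;
}
```

With PA_SEQ_LEN unset, every batch item falls back to the default length.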

Ring buffer flow control

  • Add last_task_alive advancement in scheduler completion handler using lock-free CAS to reclaim ring buffer slots
  • Enables back-pressure flow control for small task windows (e.g. task_window=16 with batch=256, 208 tasks)
  • Add completed_by_task tracking array to prevent stale completion state left in recycled slots from corrupting the early-return dependency path
  • Reset completed/completed_by_task in orchestrator at slot allocation time (safe after fanout protocol completes) so scanner CAS(0→1) works for root tasks at recycled slots
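
The slot-reclamation idea can be sketched as below. Names such as `Ring`, `advance_alive`, and the task-id-plus-one sentinel are hypothetical; the PR's actual protocol lives in the scheduler completion handler, and this only illustrates the lock-free CAS advancement of `last_task_alive` past contiguous completed slots.

```cpp
#include <atomic>
#include <cstdint>

// Sketch (names hypothetical): completed_by_task records WHICH task id
// completed a slot (stored as id+1, 0 = not completed), so a recycled
// slot's stale flag is not mistaken for completion of the new task
// occupying it.
struct Ring {
    static constexpr uint64_t kWindow = 16;
    std::atomic<uint64_t> last_task_alive{0};        // oldest live slot
    uint64_t next_task_id = 0;                       // next slot to allocate
    std::atomic<uint64_t> completed_by_task[kWindow] = {};
};

// After a task completes, advance last_task_alive past every contiguous
// completed slot so the ring buffer can reclaim them.
void advance_alive(Ring& r) {
    uint64_t alive = r.last_task_alive.load(std::memory_order_acquire);
    while (alive < r.next_task_id &&
           r.completed_by_task[alive % Ring::kWindow]
                   .load(std::memory_order_acquire) == alive + 1) {
        // Lock-free CAS: only one scheduler thread wins each step; a
        // loser reloads the current value and re-checks the condition.
        if (!r.last_task_alive.compare_exchange_weak(
                alive, alive + 1, std::memory_order_acq_rel))
            continue;
        alive = alive + 1;
    }
}
```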

Thread startup synchronization

  • Add orch_pointers_ready_ flag to ensure scheduler threads wait for Thread 3 to finish configuring shared memory pointers before entering the scheduling loop
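
A minimal sketch of this startup handshake, assuming a simple release/acquire flag (the surrounding thread machinery and the real pointer set are omitted; `shared_task_ring` is a stand-in):

```cpp
#include <atomic>

std::atomic<bool> orch_pointers_ready_{false};
void* shared_task_ring = nullptr;  // hypothetical shared-memory pointer

// Thread 3 (orchestrator): configure pointers first, then release the flag.
void thread3_setup(void* ring) {
    shared_task_ring = ring;
    orch_pointers_ready_.store(true, std::memory_order_release);
}

// Scheduler threads: acquire the flag before entering the scheduling loop,
// guaranteeing the pointer writes above are visible.
void scheduler_wait() {
    while (!orch_pointers_ready_.load(std::memory_order_acquire)) {
        // spin (or yield) until the orchestrator has published the pointers
    }
    // safe to dereference shared_task_ring from here on
}
```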

Configurable ring buffer sizes

  • Support PTO2_RING_TASK_WINDOW, PTO2_RING_HEAP, PTO2_RING_DEP_POOL environment variables
  • Allows runtime tuning of ring buffer sizes without recompilation
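
Reading one of these overrides might look roughly like this (the helper name and zero/unset fallback behavior are assumptions, not the PR's actual code):

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch: read a ring-buffer size from an environment variable,
// falling back to a compile-time default when unset or invalid.
uint64_t ring_size_from_env(const char* name, uint64_t def) {
    const char* v = std::getenv(name);
    if (!v || !*v) return def;
    uint64_t n = std::strtoull(v, nullptr, 10);
    return n ? n : def;  // treat 0 / non-numeric as "use default"
}
```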

Perfetto swimlane tooling

  • Add generate_full_swimlane.py for Perfetto visualization with dedicated lanes for:
    • Orchestrator (AICPU Thread 3)
    • 3 Scheduler threads (AICPU 0-2)
    • Each AIC core (18 lanes)
    • Each AIV core (38 lanes)

Test plan

  • Case1 passes with default and small task windows
  • CaseBatch256 with PTO2_RING_TASK_WINDOW=16 completes without deadlock (flow control active)
  • Mismatch count with a small window (~100-170 of 65536) is comparable to baseline floating-point noise (~80-120)
  • Perfetto swimlane JSON loads correctly in https://ui.perfetto.dev/

- Add cache flush (dc cvac) for tensor_copies in orchestrator to ensure
  AICore sees correct tensor metadata via HBM
- Improve AICPU executor with cycle-accurate profiling, scheduler phase
  breakdown (dispatch/complete/scan/yield), and enhanced task statistics
- Extend memory allocator with larger heap support and alignment helpers
- Add platform config tuning for device runner and register access
- Add CaseBatch2/4/8/16 test cases with varying batch sizes
- Clean up kernel code: remove unused printf, fix pipe barriers
- Add TROUBLESHOOTING.md documenting known issues and fixes
Implement a new batch_paged_attention architecture that moves the batch
iteration loop inside each kernel, eliminating task count explosion.

Key changes:
- Orchestrator submits constant 13 tasks regardless of batch size
- QK, Softmax, PV, Online-Update kernels process all batches internally
  via pointer arithmetic on batched tensors
- block_table and context_lens passed as scalar pointers to avoid
  exceeding PTO2 tensor parameter limits
- Kernel memory (L1/L0/UB tiles) reused across batch iterations
- Supports batch sizes from 1 to 256 with Exec/Sched ratio up to 93%

Previously batch>=16 caused AICPU scheduler hang (208+ tasks).
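
The in-kernel batching pattern can be illustrated roughly as below. Function and parameter names are illustrative, not the actual kernel API; a real kernel would run matmul/softmax tiles per batch item, while this sketch only shows the per-batch addressing derived from `context_lens` and `batch_start`.

```cpp
#include <cstdint>

// Sketch: one task iterates a chunk of batch items via pointer
// arithmetic instead of one task per batch item, so the orchestrator's
// task count stays constant as batch grows. Here we just record each
// batch item's KV block count (ceil(context_len / block_size)).
void process_all_batches(const int32_t* context_lens, uint32_t* blocks_used,
                         uint64_t batch, uint64_t block_size,
                         uint64_t batch_start, uint64_t batch_len) {
    for (uint64_t b = batch_start; b < batch_start + batch_len && b < batch; b++) {
        // Per-batch work derived from that item's own context length;
        // L1/L0/UB tiles would be reused across these iterations.
        blocks_used[b] = (uint32_t)((context_lens[b] + block_size - 1) / block_size);
    }
}
```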
- Enhance swimlane_converter with task statistics and profiling output
- Add tail_oh_breakdown.py for scheduler overhead analysis
- Add Case1 Tail OH breakdown documentation
- Add batch paged attention performance summary (batch 1-256)
- Add scheduler overhead analysis notes
Add support for variable sequence lengths across batches in paged attention,
controlled via PA_SEQ_LEN environment variable. Also introduces IN_CORE_BATCH
chunking for improved multi-core parallelism and configurable ready queue shards.

Key changes:
- golden.py: PA_SEQ_LEN env var for per-batch variable sequence lengths
  (e.g. PA_SEQ_LEN=33,64,17,128 for 4 different lengths)
- aiv_softmax_prepare.cpp: fix valid_len=0 bug when block is beyond a
  batch's sequence, output mij=-1e30/lij=0/pij=0 to avoid NaN from
  exp(-inf - (-inf))
- Orchestrator: IN_CORE_BATCH=16 chunking splits large batches into
  parallel chunks across multiple cores
- All kernels: accept batch_start offset for chunked processing
- aicpu_executor: configurable ready queue shards via PTO2_READY_QUEUE_SHARDS
  env var, passed through Runtime struct from host to device
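
The IN_CORE_BATCH chunking described above can be sketched as follows (the helper and its return shape are assumptions; the PR performs this split inside the orchestrator):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch: split a batch into IN_CORE_BATCH-sized chunks, yielding
// (batch_start, chunk_len) pairs so kernels on different cores can
// process chunks in parallel. The final chunk may be short.
std::vector<std::pair<uint64_t, uint64_t>> chunk_batches(uint64_t batch,
                                                         uint64_t in_core_batch) {
    std::vector<std::pair<uint64_t, uint64_t>> chunks;
    for (uint64_t start = 0; start < batch; start += in_core_batch) {
        uint64_t len = std::min(in_core_batch, batch - start);
        chunks.emplace_back(start, len);
    }
    return chunks;
}
```

Each pair maps directly onto the `batch_start` offset that the kernels accept.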
@gemini-code-assist

Summary of Changes

Hello @hengliao1972, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new batched paged attention example with support for variable sequence lengths per batch, significantly improving efficiency when processing large batches. The core work is a comprehensive overhaul of the AICPU scheduler architecture: sharded ready queues and precise cache-synchronization mechanisms resolve task-count explosion and poor scheduling efficiency. It also documents and works around a critical hardware padding defect, and improves runtime resource management and error handling to ensure stability and correctness across scenarios. Together, these changes provide a more robust and efficient foundation for high-performance deep learning inference.

Highlights

  • Variable sequence length per batch: Adds support for per-batch variable sequence lengths, configured via the PA_SEQ_LEN environment variable, allowing sequences of different lengths within the same batch for greater flexibility and resource utilization.
  • New batched paged attention example: Adds a complete Batched Paged Attention example, including AIC and AIV kernels and an orchestration function, aimed at optimizing performance for large batches.
  • Scheduler performance and stability: Major AICPU scheduler improvements, including sharded ready queues, stronger memory-barrier semantics to prevent data races, and explicit cache flushes to guarantee data visibility between the AICPU and AICore, significantly improving scheduling efficiency and eliminating task-count explosion for large batches.
  • Hardware defect analysis and workaround: Analyzes and works around a padding bug in TFILLPAD_INPLACE under specific hardware configurations, combining TFILLPAD_INPLACE with scalar SetValue writes to guarantee correct padding data.
  • Runtime resource management and error handling: Reorders resource release in DeviceRunner::finalize() to avoid the CANN runtime error rtFree failed: 507899, improves error logging on halMemCtl failure, and corrects the platform physical core count, improving overall runtime stability.
  • Detailed performance analysis docs and tools: Adds documentation on batched paged attention performance comparisons and a detailed breakdown of AICPU scheduler tail overhead (Tail OH), and updates swimlane_converter.py to parse scheduler overhead from device logs, providing more comprehensive support for performance analysis.


Changelog
  • docs/batch_paged_attention_perf_summary.md
    • Added a performance comparison summary document for batched paged attention.
  • docs/case1_tail_oh_breakdown.md
    • Added a detailed analysis document with a complete breakdown of Case1 tail overhead (Tail OH).
  • examples/tensormap_and_ringbuffer/batch_paged_attention/TFILLPAD_INPLACE_BUG.md
    • Added documentation of a hardware defect in TFILLPAD_INPLACE at small tile widths, including its workaround.
  • examples/tensormap_and_ringbuffer/batch_paged_attention/golden.py
    • Added the Python golden reference implementation for the batched paged attention example, supporting per-batch variable sequence lengths.
  • examples/tensormap_and_ringbuffer/batch_paged_attention/kernels/aic/aic_hub.cpp
    • Added the AIC Hub kernel implementation.
  • examples/tensormap_and_ringbuffer/batch_paged_attention/kernels/aic/aic_pv_matmul.cpp
    • Added the batched PV matmul AIC kernel implementation.
  • examples/tensormap_and_ringbuffer/batch_paged_attention/kernels/aic/aic_qk_matmul.cpp
    • Added the batched QK matmul AIC kernel implementation.
  • examples/tensormap_and_ringbuffer/batch_paged_attention/kernels/aiv/aiv_hub.cpp
    • Added the AIV Hub kernel implementation.
  • examples/tensormap_and_ringbuffer/batch_paged_attention/kernels/aiv/aiv_online_update.cpp
    • Added the batched online softmax update and normalization AIV kernel implementation.
  • examples/tensormap_and_ringbuffer/batch_paged_attention/kernels/aiv/aiv_softmax_prepare.cpp
    • Added the batched softmax preparation AIV kernel implementation, including the TFILLPAD_INPLACE defect workaround.
  • examples/tensormap_and_ringbuffer/batch_paged_attention/kernels/kernel_config.py
    • Added kernel and orchestration configuration for the batched paged attention example.
  • examples/tensormap_and_ringbuffer/batch_paged_attention/kernels/orchestration/paged_attention_orch.cpp
    • Added the batched paged attention orchestration function, supporting chunked processing and per-batch variable sequence lengths.
  • examples/tensormap_and_ringbuffer/paged_attention/golden.py
    • Updated the golden reference implementation, adding batched test cases such as CaseBatch2 and CaseBatch4.
  • src/platform/a2a3/host/device_runner.cpp
    • Reordered resource release in the finalize function to avoid rtFree errors.
    • Added an rtDeviceSynchronize call to ensure all device operations complete before resources are freed.
    • Improved log output to prompt users to check device logs when diagnosing hangs.
  • src/platform/a2a3/host/host_regs.cpp
    • Updated the value of PLATFORM_MAX_PHYSICAL_CORES and improved error logging on halMemCtl failure, adding a warning message.
  • src/platform/a2a3/host/memory_allocator.cpp
    • Added an untrack method to remove a pointer from the tracking list without freeing its memory.
    • Improved rtFree error logging, treating error 507899 as a known phenomenon and reporting it as a warning.
  • src/platform/a2a3sim/host/memory_allocator.cpp
    • Added an untrack method to the simulator memory allocator.
  • src/platform/include/common/platform_config.h
    • Corrected the definition of PLATFORM_MAX_PHYSICAL_CORES to 24 and added a static assertion to enforce this value.
  • src/platform/include/host/memory_allocator.h
    • Declared the untrack method in the MemoryAllocator class.
  • src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Implemented sharded ready queues to reduce scheduler lock contention and improve concurrency.
    • Updated memory barrier semantics to __ATOMIC_SEQ_CST to resolve data races.
    • Added explicit cache flushes (dc cvac, dsb sy) to ensure data consistency between the AICPU and AICore.
    • Improved diagnostic logging for scheduler stall states with more detailed debug information.
  • src/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
    • Added the ability to configure the number of AICPU scheduler ready queue shards via the PTO2_READY_QUEUE_SHARDS environment variable.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Refactored task_fanout_lock/unlock to use shared helper functions.
    • Added an explicit cache flush for tensor_copies in pto2_submit_task to ensure data visibility.
    • Changed the store semantics of fanin_count to __ATOMIC_SEQ_CST to resolve an IRIW hazard.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
    • Added pto2_fanout_lock and pto2_fanout_unlock helper functions for unified management of the task-fanout spinlock.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_types.h
    • Clarified comments on the make_tensor function and OUTPUT parameter handling, emphasizing that no memory copy occurs.
  • src/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp
    • Initialized the ready_queue_shards member variable to a default of 3.
  • src/runtime/tensormap_and_ringbuffer/runtime/runtime.h
    • Added a ready_queue_shards member variable to the Runtime class to configure the number of ready queue shards.
  • src/runtime/tensormap_and_ringbuffer/runtime/tensor.h
    • Clarified the make_tensor comment, emphasizing that it performs no memory allocation and only records a tensor descriptor.
  • tests/device_tests/tensormap_and_ringbuffer/paged_attention/TROUBLESHOOTING.md
    • Added a troubleshooting guide for the paged attention tests, covering common test failures, performance issues, and log analysis.
  • tests/device_tests/tensormap_and_ringbuffer/paged_attention/kernels/aic/aic_pv_matmul.cpp
    • Removed comment lines that were no longer needed.
  • tests/device_tests/tensormap_and_ringbuffer/paged_attention/kernels/aic/aic_qk_matmul.cpp
    • Removed comment lines that were no longer needed.
  • tests/device_tests/tensormap_and_ringbuffer/paged_attention/kernels/aiv/aiv_online_update.cpp
    • Updated comments to reflect the tile dimensions of the operations and removed comment lines that were no longer needed.
  • tests/device_tests/tensormap_and_ringbuffer/paged_attention/kernels/orchestration/paged_attention_orch.cpp
    • Updated comments explaining how ONLINE_UPDATE is handled and removed comment lines that were no longer needed.
  • tools/swimlane_converter.py
    • Added the ability to parse scheduler overhead from device logs.
    • Improved task statistics output, including a comparison of execution time versus scheduler overhead.
  • tools/tail_oh_breakdown.py
    • Added a Python script for a detailed breakdown of PTO2 scheduler tail overhead (Tail OH).
Activity
  • Author hengliao1972 created this pull request without providing a description.
  • There is currently no comment or review activity.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant feature: support for variable sequence lengths per batch in the paged attention example. This is a substantial architectural improvement that moves from a per-batch task generation model to a more efficient in-kernel batching approach, drastically reducing the number of scheduled tasks and enabling much larger batch sizes. The PR also includes a wealth of performance improvements, critical correctness fixes for concurrency and cache coherence, and enhanced debugging capabilities for the underlying runtime. The documentation and analysis files are very thorough and well-written.

I have found one critical issue related to an out-of-bounds memory access in the new orchestration logic, which I've detailed in a comment. Otherwise, the changes are excellent and represent a major step forward in performance and robustness.

Comment on lines +75 to +104
size_t key_cache_size = (size_t)args[8];

uint64_t batch = (uint64_t)(int)host_config[0];
uint64_t num_heads = (uint64_t)(int)host_config[1];
uint64_t head_dim = (uint64_t)(int)host_config[3];
uint64_t block_size = (uint64_t)(int)host_config[4];
uint64_t block_num = (uint64_t)(int)host_config[5];
union { uint32_t u; float f; } scale_conv;
scale_conv.u = (uint32_t)host_config[6];
float scale_value = scale_conv.f;

uint64_t q_tile = 16;
uint64_t q_loop = (num_heads + q_tile - 1) / q_tile;
DataType data_type = DataType::FLOAT16;
uint64_t elem_size = get_element_size(data_type);

LOG_INFO(rt, "batch_paged_attention: batch=%lu, num_heads=%lu",
(unsigned long)batch, (unsigned long)num_heads);

uint64_t max_bn = 0;
for (uint64_t b = 0; b < batch; b++) {
uint64_t cur_seq = host_context_lens[b];
uint64_t bn_b = (cur_seq + block_size - 1) / block_size;
if (bn_b > max_bn) max_bn = bn_b;
}

uint64_t query_shapes[2] = {batch * num_heads, head_dim};
uint64_t kv_total_rows = key_cache_size / (head_dim * elem_size);
uint64_t key_cache_shapes[2] = {kv_total_rows, head_dim};
uint64_t value_cache_shapes[2] = {kv_total_rows, head_dim};

critical

This block of code contains a critical out-of-bounds memory access.

Specifically, line 75 reads key_cache_size from args[8]. However, the aicpu_orchestration_config function specifies expected_arg_count = 7, so this is an out-of-bounds read, which leads to undefined behavior and can cause crashes or incorrect results.

The correct approach is to calculate the required cache size from the input parameters (batch, host_context_lens, host_config), which makes the orchestration logic self-contained and corrects the memory safety issue. The suggested change implements this calculation, aligning it with the logic in the Python golden implementation.

    uint64_t batch = (uint64_t)(int)host_config[0];
    uint64_t num_heads = (uint64_t)(int)host_config[1];
    uint64_t kv_head_num = (uint64_t)(int)host_config[2];
    uint64_t head_dim = (uint64_t)(int)host_config[3];
    uint64_t block_size = (uint64_t)(int)host_config[4];
    uint64_t block_num = (uint64_t)(int)host_config[5];
    union { uint32_t u; float f; } scale_conv;
    scale_conv.u = (uint32_t)host_config[6];
    float scale_value = scale_conv.f;

    uint64_t q_tile = 16;
    uint64_t q_loop = (num_heads + q_tile - 1) / q_tile;
    DataType data_type = DataType::FLOAT16;
    uint64_t elem_size = get_element_size(data_type);

    LOG_INFO(rt, "batch_paged_attention: batch=%lu, num_heads=%lu",
             (unsigned long)batch, (unsigned long)num_heads);

    uint64_t max_bn = 0;
    for (uint64_t b = 0; b < batch; b++) {
        uint64_t cur_seq = host_context_lens[b];
        uint64_t bn_b = (cur_seq + block_size - 1) / block_size;
        if (bn_b > max_bn) max_bn = bn_b;
    }

    // This logic matches the golden.py implementation for determining the total number of blocks.
    uint64_t total_blocks = batch * max_bn;
    uint64_t kv_total_rows = total_blocks * block_size * kv_head_num;

    uint64_t query_shapes[2] = {batch * num_heads, head_dim};
    uint64_t key_cache_shapes[2] = {kv_total_rows, head_dim};
    uint64_t value_cache_shapes[2] = {kv_total_rows, head_dim};

- Add last_task_alive advancement in scheduler completion handler with
  lock-free CAS to reclaim ring buffer slots and enable back-pressure
  flow control for small task windows
- Add completed_by_task tracking array to prevent stale completion state
  left in recycled slots from corrupting the early-return dependency path
- Reset completed/completed_by_task in orchestrator at slot allocation
  time (safe after fanout protocol completes) so scanner CAS(0->1) works
  for root tasks at recycled slots
- Add orch_pointers_ready_ synchronization flag to ensure scheduler
  threads wait for Thread 3 to finish configuring shared memory pointers
  before entering the scheduling loop
- Support configurable ring buffer sizes via environment variables:
  PTO2_RING_TASK_WINDOW, PTO2_RING_HEAP, PTO2_RING_DEP_POOL
- Add generate_full_swimlane.py tool for Perfetto visualization with
  dedicated lanes for orchestrator, scheduler threads, and per-core
  AIC/AIV execution
@hengliao1972 hengliao1972 changed the title Feature/variable seq length per batch support in batched_paged_attention example. Feature: variable seq length, ring buffer flow control, configurable task window, and swimlane tooling Feb 26, 2026
Collaborator

ChaoWao commented Mar 1, 2026

The code has been split into smaller PRs: #134 #136 #137 #141 #143 #145 #146 #147 #148 #149.

@ChaoWao ChaoWao closed this Mar 1, 2026