Feature: variable seq length, ring buffer flow control, configurable task window, and swimlane tooling #113
Conversation
- Add cache flush (dc cvac) for tensor_copies in orchestrator to ensure AICore sees correct tensor metadata via HBM (sketched below)
- Improve AICPU executor with cycle-accurate profiling, scheduler phase breakdown (dispatch/complete/scan/yield), and enhanced task statistics
- Extend memory allocator with larger heap support and alignment helpers
- Add platform config tuning for device runner and register access
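To make the first bullet concrete: a minimal sketch of the kind of cache-clean loop the dc cvac change implies, assuming an aarch64 AICPU and a 64-byte cache line. The helper name `flush_to_hbm` is ours, not from the PR.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical sketch: clean (write back) a buffer's cache lines to the
// point of coherency so a device reading through HBM sees current data.
// Assumes aarch64; the 64-byte line size is an assumption.
static inline void flush_to_hbm(const void* addr, size_t bytes) {
    constexpr uintptr_t CACHE_LINE = 64;
    uintptr_t p   = reinterpret_cast<uintptr_t>(addr) & ~(CACHE_LINE - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(addr) + bytes;
    for (; p < end; p += CACHE_LINE) {
        asm volatile("dc cvac, %0" :: "r"(p) : "memory");  // clean by VA to PoC
    }
    asm volatile("dsb sy" ::: "memory");  // order the cleans before later stores
}
```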
- Add CaseBatch2/4/8/16 test cases with varying batch sizes
- Clean up kernel code: remove unused printf, fix pipe barriers
- Add TROUBLESHOOTING.md documenting known issues and fixes
Implement a new batch_paged_attention architecture that moves the batch iteration loop inside each kernel, eliminating task-count explosion. Key changes:
- Orchestrator submits a constant 13 tasks regardless of batch size
- QK, Softmax, PV, and Online-Update kernels process all batches internally via pointer arithmetic on batched tensors (sketched below)
- block_table and context_lens passed as scalar pointers to avoid exceeding PTO2 tensor parameter limits
- Kernel memory (L1/L0/UB tiles) reused across batch iterations
- Supports batch sizes from 1 to 256 with an Exec/Sched ratio up to 93%

Previously, batch >= 16 caused an AICPU scheduler hang (208+ tasks).
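A simplified sketch of the in-kernel batching idea described above. The function and parameter names are illustrative (not from the PR), and fp16 values are modeled as uint16_t; the point is the per-batch pointer offsets into one batched tensor with tiles reused across iterations.

```cpp
#include <cstdint>

// Illustrative only: instead of submitting one task per batch item, a kernel
// receives the batched tensor base pointers once and walks all batches with
// pointer arithmetic, reusing the same L1/L0/UB tiles each iteration.
void qk_kernel_all_batches(const uint16_t* q_base,      // [batch * num_heads, head_dim]
                           const int32_t* context_lens, // scalar pointer, one length per batch
                           uint64_t batch, uint64_t num_heads, uint64_t head_dim) {
    for (uint64_t b = 0; b < batch; ++b) {
        const uint16_t* q_b = q_base + b * num_heads * head_dim; // offset into batched tensor
        uint64_t cur_seq = (uint64_t)context_lens[b];            // per-batch sequence length
        // ... load q_b into the shared tiles and run QK for this batch item ...
        (void)q_b; (void)cur_seq;
    }
}
```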
- Enhance swimlane_converter with task statistics and profiling output
- Add tail_oh_breakdown.py for scheduler overhead analysis
- Add Case1 Tail OH breakdown documentation
- Add batch paged attention performance summary (batch 1-256)
- Add scheduler overhead analysis notes
Add support for variable sequence lengths across batches in paged attention, controlled via the PA_SEQ_LEN environment variable. Also introduces IN_CORE_BATCH chunking for improved multi-core parallelism and configurable ready queue shards. Key changes:
- golden.py: PA_SEQ_LEN env var for per-batch variable sequence lengths (e.g. PA_SEQ_LEN=33,64,17,128 for 4 different lengths; see the parsing sketch below)
- aiv_softmax_prepare.cpp: fix valid_len=0 bug when a block is beyond a batch's sequence; output mij=-1e30 / lij=0 / pij=0 to avoid NaN from exp(-inf - (-inf))
- Orchestrator: IN_CORE_BATCH=16 chunking splits large batches into parallel chunks across multiple cores
- All kernels: accept a batch_start offset for chunked processing
- aicpu_executor: configurable ready queue shards via the PTO2_READY_QUEUE_SHARDS env var, passed through the Runtime struct from host to device
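The PR implements the PA_SEQ_LEN convention in golden.py (Python); the sketch below restates the same parsing rule in C++ to make the convention explicit. The function name and the uniform-default fallback are our assumptions.

```cpp
#include <cstdlib>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Sketch of the env-var convention described above: PA_SEQ_LEN=33,64,17,128
// yields one context_len per batch item; a missing or short list falls back
// to a uniform default length.
std::vector<uint64_t> parse_pa_seq_len(uint64_t batch, uint64_t default_len) {
    std::vector<uint64_t> lens(batch, default_len);
    const char* env = std::getenv("PA_SEQ_LEN");
    if (!env) return lens;
    std::stringstream ss(env);
    std::string tok;
    for (uint64_t i = 0; i < batch && std::getline(ss, tok, ','); ++i) {
        lens[i] = std::strtoull(tok.c_str(), nullptr, 10);
    }
    return lens;
}
```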
Summary of Changes

Hello @hengliao1972, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new batched paged attention example that supports variable sequence lengths per batch, significantly improving efficiency when processing large batches. The core work includes a comprehensive overhaul of the AICPU scheduler architecture, resolving task-count explosion and poor scheduling efficiency through sharded ready queues and precise cache synchronization. It also documents and works around a critical hardware padding defect, and improves runtime resource management and error handling to ensure stability and correctness across a range of scenarios. Together, these improvements provide a more robust and efficient foundation for high-performance deep learning inference.
Code Review
This pull request introduces a significant feature: support for variable sequence lengths per batch in the paged attention example. This is a substantial architectural improvement that moves from a per-batch task generation model to a more efficient in-kernel batching approach, drastically reducing the number of scheduled tasks and enabling much larger batch sizes. The PR also includes a wealth of performance improvements, critical correctness fixes for concurrency and cache coherence, and enhanced debugging capabilities for the underlying runtime. The documentation and analysis files are very thorough and well-written.
I have found one critical issue related to an out-of-bounds memory access in the new orchestration logic, which I've detailed in a comment. Otherwise, the changes are excellent and represent a major step forward in performance and robustness.
```cpp
size_t key_cache_size = (size_t)args[8];

uint64_t batch = (uint64_t)(int)host_config[0];
uint64_t num_heads = (uint64_t)(int)host_config[1];
uint64_t head_dim = (uint64_t)(int)host_config[3];
uint64_t block_size = (uint64_t)(int)host_config[4];
uint64_t block_num = (uint64_t)(int)host_config[5];
union { uint32_t u; float f; } scale_conv;
scale_conv.u = (uint32_t)host_config[6];
float scale_value = scale_conv.f;

uint64_t q_tile = 16;
uint64_t q_loop = (num_heads + q_tile - 1) / q_tile;
DataType data_type = DataType::FLOAT16;
uint64_t elem_size = get_element_size(data_type);

LOG_INFO(rt, "batch_paged_attention: batch=%lu, num_heads=%lu",
         (unsigned long)batch, (unsigned long)num_heads);

uint64_t max_bn = 0;
for (uint64_t b = 0; b < batch; b++) {
    uint64_t cur_seq = host_context_lens[b];
    uint64_t bn_b = (cur_seq + block_size - 1) / block_size;
    if (bn_b > max_bn) max_bn = bn_b;
}

uint64_t query_shapes[2] = {batch * num_heads, head_dim};
uint64_t kv_total_rows = key_cache_size / (head_dim * elem_size);
uint64_t key_cache_shapes[2] = {kv_total_rows, head_dim};
uint64_t value_cache_shapes[2] = {kv_total_rows, head_dim};
```
This block of code contains a critical out-of-bounds memory access.
Specifically, line 75 reads `key_cache_size` from `args[8]`. However, the `aicpu_orchestration_config` function specifies `expected_arg_count = 7`, so this is an out-of-bounds read, which leads to undefined behavior and can cause crashes or incorrect results.
The correct approach is to calculate the required cache size from the input parameters (`batch`, `host_context_lens`, `host_config`), which makes the orchestration logic self-contained and corrects the memory safety issue. The suggested change implements this calculation, aligning it with the logic in the Python golden implementation.
```cpp
uint64_t batch = (uint64_t)(int)host_config[0];
uint64_t num_heads = (uint64_t)(int)host_config[1];
uint64_t kv_head_num = (uint64_t)(int)host_config[2];
uint64_t head_dim = (uint64_t)(int)host_config[3];
uint64_t block_size = (uint64_t)(int)host_config[4];
uint64_t block_num = (uint64_t)(int)host_config[5];
union { uint32_t u; float f; } scale_conv;
scale_conv.u = (uint32_t)host_config[6];
float scale_value = scale_conv.f;
uint64_t q_tile = 16;
uint64_t q_loop = (num_heads + q_tile - 1) / q_tile;
DataType data_type = DataType::FLOAT16;
uint64_t elem_size = get_element_size(data_type);
LOG_INFO(rt, "batch_paged_attention: batch=%lu, num_heads=%lu",
         (unsigned long)batch, (unsigned long)num_heads);
uint64_t max_bn = 0;
for (uint64_t b = 0; b < batch; b++) {
    uint64_t cur_seq = host_context_lens[b];
    uint64_t bn_b = (cur_seq + block_size - 1) / block_size;
    if (bn_b > max_bn) max_bn = bn_b;
}
// This logic matches the golden.py implementation for determining the total number of blocks.
uint64_t total_blocks = batch * max_bn;
uint64_t kv_total_rows = total_blocks * block_size * kv_head_num;
uint64_t query_shapes[2] = {batch * num_heads, head_dim};
uint64_t key_cache_shapes[2] = {kv_total_rows, head_dim};
uint64_t value_cache_shapes[2] = {kv_total_rows, head_dim};
```

- Add last_task_alive advancement in scheduler completion handler with lock-free CAS to reclaim ring buffer slots and enable back-pressure flow control for small task windows (see the sketch after this list)
- Add completed_by_task tracking array to prevent stale completion state from recycled slots corrupting the early-return dependency path
- Reset completed/completed_by_task in orchestrator at slot allocation time (safe after fanout protocol completes) so scanner CAS(0->1) works for root tasks at recycled slots
- Add orch_pointers_ready_ synchronization flag to ensure scheduler threads wait for Thread 3 to finish configuring shared memory pointers before entering the scheduling loop
- Support configurable ring buffer sizes via environment variables: PTO2_RING_TASK_WINDOW, PTO2_RING_HEAP, PTO2_RING_DEP_POOL
- Add generate_full_swimlane.py tool for Perfetto visualization with dedicated lanes for orchestrator, scheduler threads, and per-core AIC/AIV execution
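A minimal sketch of the reclamation protocol from the first bullet above, assuming std::atomic semantics. Only last_task_alive and completed_by_task are names from the PR; the Ring struct, WINDOW constant, and handler names are illustrative.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative ring-buffer skeleton; only last_task_alive and
// completed_by_task are names taken from the PR description.
struct Ring {
    static constexpr uint64_t WINDOW = 16;            // e.g. PTO2_RING_TASK_WINDOW=16
    static constexpr uint64_t NONE = ~0ull;           // sentinel: slot never completed
    std::atomic<uint64_t> last_task_alive{0};         // oldest task id whose slot is still live
    std::atomic<uint64_t> completed_by_task[WINDOW];  // id of the task that completed each slot
    Ring() { for (auto& s : completed_by_task) s.store(NONE, std::memory_order_relaxed); }
};

// Try to advance last_task_alive past every contiguously completed slot.
void advance_last_task_alive(Ring& r, uint64_t next_task_id) {
    uint64_t head = r.last_task_alive.load(std::memory_order_acquire);
    while (head < next_task_id &&
           r.completed_by_task[head % Ring::WINDOW].load(std::memory_order_acquire) == head) {
        // Lock-free CAS: each slot is reclaimed exactly once even when several
        // scheduler threads race here; a failed CAS reloads head and retries.
        if (r.last_task_alive.compare_exchange_weak(head, head + 1,
                                                    std::memory_order_acq_rel)) {
            ++head;  // we advanced; inspect the next slot
        }
    }
}

void on_task_complete(Ring& r, uint64_t task_id) {
    // Storing the exact task id (not just a flag) is what keeps a stale
    // completion from a recycled slot from matching the current occupant.
    r.completed_by_task[task_id % Ring::WINDOW].store(task_id, std::memory_order_release);
    advance_last_task_alive(r, task_id + 1);  // conservative bound for this sketch
}
```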
Summary
This PR adds several enhancements to the batched paged attention pipeline:
Per-batch variable sequence length
- Configure `context_len` per batch item via the `PA_SEQ_LEN` environment variable
- e.g. `PA_SEQ_LEN=16,32,48,64,...` for heterogeneous sequence lengths

Ring buffer flow control

- `last_task_alive` advancement in scheduler completion handler using lock-free CAS to reclaim ring buffer slots
- Enables back-pressure flow control for small task windows (e.g. `task_window=16` with `batch=256`, 208 tasks)
- `completed_by_task` tracking array to prevent stale completion state from recycled slots corrupting the early-return dependency path
- Reset `completed`/`completed_by_task` in orchestrator at slot allocation time (safe after fanout protocol completes) so scanner `CAS(0→1)` works for root tasks at recycled slots

Thread startup synchronization

- `orch_pointers_ready_` flag to ensure scheduler threads wait for Thread 3 to finish configuring shared memory pointers before entering the scheduling loop (a minimal flag sketch follows the test plan)

Configurable ring buffer sizes

- `PTO2_RING_TASK_WINDOW`, `PTO2_RING_HEAP`, `PTO2_RING_DEP_POOL` environment variables

Perfetto swimlane tooling

- `generate_full_swimlane.py` for Perfetto visualization with dedicated lanes for orchestrator, scheduler threads, and per-core AIC/AIV execution

Test plan

- `Case1` passes with default and small task windows
- `CaseBatch256` with `PTO2_RING_TASK_WINDOW=16` completes without deadlock (flow control active)
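As referenced in the thread startup synchronization item above, a minimal sketch of the handshake. Only `orch_pointers_ready_` is a name from the PR; the surrounding functions are an illustrative reconstruction.

```cpp
#include <atomic>

// orch_pointers_ready_ comes from the PR; everything around it is a
// reconstruction of the handshake described in the summary.
std::atomic<bool> orch_pointers_ready_{false};

// Thread 3: publish the shared-memory pointers, then release the flag so the
// pointer writes are visible before any scheduler thread proceeds.
void thread3_configure_pointers(/* shared-memory pointer setup elided */) {
    // ... write the shared pointers ...
    orch_pointers_ready_.store(true, std::memory_order_release);
}

// Scheduler threads: wait for the flag before entering the scheduling loop,
// so they never observe half-configured shared state.
void scheduler_thread_main() {
    while (!orch_pointers_ready_.load(std::memory_order_acquire)) {
        // busy-wait; a yield or pause hint would also work here
    }
    // ... enter scheduling loop ...
}
```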