Refactor: migrate executor to scheduler API with CAS-based ring advancement#160
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant refactoring of the AICPU executor's task scheduling and ring buffer management. The core changes involve migrating the executor to a new, centralized PTO2SchedulerState API, which simplifies task state handling and ready queue management. A key improvement is the adoption of a lock-free, CAS-based mechanism for advancing ring pointers, replacing the previous CONSUMED-dependent logic. This change, along with decoupling scope_end from ring advancement, directly addresses and resolves a critical task ring deadlock issue that occurred with small task windows. Additionally, a minor but important fix to the orchestrator's task state checking enhances the overall robustness of the system. These updates collectively aim to improve concurrency, stability, and efficiency in task execution. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the PTO2 task scheduling mechanism by migrating task completion, ready queue management, and ring buffer advancement logic from aicpu_executor.cpp to a new centralized scheduler API (pto_scheduler.cpp). Key changes include removing local sharded ready queues, spinlocks, and associated profiling counters from AicpuExecutor, replacing them with calls to pto2_scheduler_get_ready_task and pto2_scheduler_on_task_complete. The orchestrator's pto2_submit_task function is updated to interact directly with the new scheduler for fanin/fanout management and task state updates, eliminating the need for an orchestrator-specific ready queue and associated synchronization flags. Profiling output is simplified, and AicpuExecutor now uses a global PTO2Runtime pointer for scheduler interaction, with a new runtime_init_ready_ flag for synchronization during initialization. Review comments highlight the removal of redundant code blocks and variables that are no longer necessary due to the adoption of the new scheduler API.
I am having trouble creating individual review comments. Click here to see my feedback.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (657-790)
This entire block of code, which handles task completion, fanout traversal, and TaskRing advancement, can be removed as it's now managed by the scheduler API.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (970-1008)
This entire block of code, which handles early-ready drain, can be removed as it's no longer needed with the scheduler API.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (817-889)
This entire block of code, which handles task dispatch from local ready queues, can be removed as it's now managed by the scheduler API.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1081)
Consider removing tasks_per_loop since it's not being used after this change.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (637)
This line is no longer needed, as the task completion logic is now handled by the scheduler API.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (573)
Consider removing ready_pop_own and ready_pop_steal since they are not being used after this change.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (570-573)
Consider removing sched_complete_ready_wait and related variables since they are not being used after this change.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1079)
Consider removing sched_early_ready_cycle since it's not being used after this change.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (558)
Consider removing sched_early_ready_cycle since it's not being used after this change.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1093-1094)
Consider removing early_ready related logs since it's not being used after this change.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1100-1102)
Consider updating the dispatch log message to reflect the new task dispatch mechanism.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1105-1109)
Consider removing the lock contention logs since the local ready queues are removed.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1110-1127)
Consider removing the lock contention logs since the local ready queues are removed.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1445-1450)
This block of code, which cleans up runtime execution state, can be removed as it's no longer needed with the scheduler API.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1465-1471)
This block of code, which resets orchestrator ready queue pointers, can be removed as it's no longer needed with the scheduler API.
src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1507-1510)
This entire block of code, which calculates ready queue sizes from local state, can be removed as it's now managed by the scheduler API.
…cement Migrate the AICPU executor from local state arrays (sharded ready queues, fanin refcounts, task completion tracking) to the centralized scheduler API (pto2_scheduler_on_task_complete, pto2_scheduler_get_ready_task, init_task). Ring advancement uses CAS-based lock-free writes to header->last_task_alive with ticket-based heap_tail serialization, matching the pre-migration executor's concurrency model and avoiding the scope_end dependency that would deadlock with small task windows. - Replace per-thread state arrays with PTO2SchedulerState member calls - Thread 3 (orchestrator) creates PTO2Runtime; threads 0-2 wait on init flag - Tasks transition directly to CONSUMED in on_task_complete after fanout notifications, enabling immediate slot reuse without scope_end - Fix prod_state == COMPLETED to >= COMPLETED in orchestrator early-return - Simplify orchestrator by removing non-scheduler codepath for AICPU mode
56891b1 to
d2d5fd0
Compare
…cement (hw-native-sys#160) Migrate the AICPU executor from local state arrays (sharded ready queues, fanin refcounts, task completion tracking) to the centralized scheduler API (pto2_scheduler_on_task_complete, pto2_scheduler_get_ready_task, init_task). Ring advancement uses CAS-based lock-free writes to header->last_task_alive with ticket-based heap_tail serialization, matching the pre-migration executor's concurrency model and avoiding the scope_end dependency that would deadlock with small task windows. - Replace per-thread state arrays with PTO2SchedulerState member calls - Thread 3 (orchestrator) creates PTO2Runtime; threads 0-2 wait on init flag - Tasks transition directly to CONSUMED in on_task_complete after fanout notifications, enabling immediate slot reuse without scope_end - Fix prod_state == COMPLETED to >= COMPLETED in orchestrator early-return - Simplify orchestrator by removing non-scheduler codepath for AICPU mode
Summary
PTO2SchedulerStateAPI (on_task_complete,get_ready_task,init_task)advance_ring_pointerswith CAS-based lock-free direct advancement onheader->last_task_alive, matching the pre-migration concurrency modelprod_state == COMPLETEDto>= COMPLETEDin orchestrator early-return pathTesting
paged_attention_taskringwith 16-slot window that was previously deadlocking)Fixes #159