Refactor: migrate scheduling to scheduler API, C++ member API, inline hot paths (#155)

ChaoWao wants to merge 3 commits into
Conversation
Summary of Changes (Gemini Code Assist): This pull request undertakes a significant refactoring of the PTO2 runtime, centralizing scheduling logic into a dedicated API and modernizing memory management components. By converting TensorMap and ring buffer operations to C++ member functions and inlining critical code paths, the changes aim to improve code structure, performance, and maintainability. The update also standardizes logging and expands profiling capabilities to provide deeper insight into runtime behavior.
Code Review
This pull request significantly refactors the scheduling logic by migrating it from aicpu_executor.cpp to a dedicated scheduler API (pto_scheduler.h/cpp). It also converts several free functions into C++ member functions for TensorMap, HeapRing, and TaskRing, improving modularity and enabling inlining for hot paths. The changes enhance thread safety by introducing spinlocks for ready queue operations and refining memory ordering for atomic operations. Overall, these changes contribute to better code organization, maintainability, and correctness in a multi-threaded environment.
```cpp
static PTO2Runtime *rt{nullptr};
```
The use of a global static PTO2Runtime *rt{nullptr}; can introduce challenges for testability and maintainability, especially in a multi-threaded context where its lifecycle and access patterns need careful management. While static limits its scope to the translation unit, its global nature still warrants careful consideration for potential side effects or unexpected interactions if not strictly controlled.
```cpp
#if PTO2_ORCH_PROFILING
    sched_yield_count++;
#endif
    CYCLE_COUNT_LAP(sched_yield_cycle);
```
The comment // RISK: Multiple entries on line 662 is vague. It would be beneficial to elaborate on the specific risk associated with multiple entries, such as potential race conditions, data corruption, or performance implications, and how this risk is mitigated or handled within the current design.
```cpp
if (__atomic_load_n(&orch->aicpu_task_completed[prod_slot], __ATOMIC_ACQUIRE) >= 2 &&
// RELAXED is sufficient: the ACQUIRE on aicpu_task_completed above
// synchronizes with the RELEASE on task_completed in the scheduler,
// and completed_by_task is stored (with RELEASE) sequenced-before
// task_completed — so it is visible after the ACQUIRE load above.
```
The removal of comments explaining the memory ordering for aicpu_completed_by_task and aicpu_fanin_refcount is concerning. These comments provided crucial context for understanding the synchronization guarantees in a multi-threaded environment. If the underlying logic has changed such that these specific explanations are no longer relevant, it should be explicitly stated, or the new synchronization rationale should be documented.
```cpp
int32_t slot = sched->pto2_task_slot(task_id);
// ...
int32_t early_finished = 0;
task->fanin_count = fanin_count + 1; // +1 redundance for not being ready too early
// ...
    pto2_fanout_unlock(producer);
}
if (early_finished > 0) {
    __atomic_fetch_add(&sched->fanin_refcount[slot], early_finished, __ATOMIC_SEQ_CST);
```
The use of __ATOMIC_SEQ_CST for __atomic_fetch_add(&sched->fanin_refcount[slot], early_finished, __ATOMIC_SEQ_CST); might be overly strong. In pto_scheduler.h, the release_fanin_and_check_ready function uses __ATOMIC_ACQ_REL for similar fanin_refcount operations. If ACQ_REL is sufficient there, it might also be sufficient here, potentially offering a performance improvement. Please ensure consistency or provide a justification for the stronger memory order here.
```cpp
// Try to advance ring pointers
if (task_id == sched->last_task_alive) {
    pto2_scheduler_advance_ring_pointers(sched);
    pto2_scheduler_advance_ring_pointers(sched); // RISK: Multiple entries
```
```cpp
if (!is_same_memref(pre_task_output)) {
    return OverlapStatus::NO_OVERLAP;
}
debug_assert(is_same_memref(pre_task_output));
```
Changing the is_overlap function to assert is_same_memref rather than returning NO_OVERLAP implies a stronger precondition for calling this function. This is a correctness improvement, but it relies on all calling sites guaranteeing is_same_memref is true before calling is_overlap. Ensure that all call sites adhere to this new invariant to prevent assertion failures in production.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cb8b4eb to d4752ff
Summary
- Migrated scheduling logic from `aicpu_executor.cpp` into `pto_scheduler.h/cpp`, reducing the executor from ~1600 to ~1100 lines
- Moved `HeapRing`, `TaskRing`, `DepListPool` alloc/try_alloc implementations from `.cpp` to `.h` for inlining
- `__ATOMIC_ACQ_REL` for fanin refcount increments, with a CAS-based PENDING→READY state transition to prevent double-enqueue
- `EXTRACT_TASK_STATE`/`EXTRACT_TASK_ID` macros instead of `AICoreStatus::IDLE`/`BUSY`
- Replaced `fprintf`/`printf` with `LOG_INFO`/`LOG_WARN`/`LOG_ERROR` macros throughout the ring buffer and tensormap modules
- Added `perf_aicpu_init_phase_profiling`, TensorMap lookup stats, and orchestrator summary export to shared memory

Testing
- `./ci.sh -p a2a3sim` (10/10)
- `./ci.sh -p a2a3 -d 4-7 --parallel`