Feature: AICPU scheduler phase profiling and orchestrator summary#150
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the performance profiling capabilities for the AICPU scheduler and orchestrator. It introduces new shared memory data structures and APIs to capture fine-grained timing information for critical scheduler phases and the orchestrator's internal operations. The collected data is then integrated into the existing host-side performance collection and export mechanism, enabling detailed visualization in Perfetto. This allows developers to gain deeper insights into the performance bottlenecks and behavior of the AICPU scheduling process. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces detailed phase profiling for the AICPU scheduler and a cumulative summary for the orchestrator. A critical security vulnerability was identified where the host-side collection logic trusts num_sched_threads from shared memory for loop bounds and array indexing without validation, potentially leading to an out-of-bounds read. It is essential to implement bounds checking for all data retrieved from shared memory. Additionally, there are a few suggestions to enhance code conciseness and idiomatic C++ usage.
8be37ca to
9cb82e0
Compare
…nd dependency arrows - Add scheduler phase profiling: record COMPLETE/DISPATCH/SCAN/EARLY_READY phases per loop iteration with per-thread buffers in shared memory - Add per-task orchestrator phase recording (sync/alloc/params/lookup/heap/ insert/fanin/finalize/scope_end) using AicpuPhaseRecord with dedicated buffer slot, exported as aicpu_orchestrator_phases JSON array - Write cumulative AicpuOrchSummary to shared memory for backward compat - Host-side collection reads both scheduler and orchestrator phase records, exports version 2 JSON with scheduler phases, orchestrator summary, and per-task orchestrator phases - Swimlane converter renders scheduler phases as color-coded bars on pid=3, per-task orchestrator phases on pid=4 with per-phase colors - Add AICPU View (pid=2) fanout dependency arrows mirroring AICore View - Add Scheduler DISPATCH to AICore/AICPU task execution flow arrows - Set process sort order: Orchestrator, Scheduler, AICPU View, AICore View
9cb82e0 to
a7b6126
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces detailed phase profiling for the AICPU scheduler and a summary for the orchestrator. The changes include new data structures for profiling, instrumentation in the AICPU executor and orchestrator, and updates to the host-side collector and visualization script to support the new data.
My review focuses on C++ best practices and code maintainability. I've suggested replacing a memset call with modern C++ value initialization, which is safer and also resolves a compiler warning that was being suppressed. I've also recommended a small refactoring to reduce code duplication when calculating the base timestamp for profiling data.
Overall, the changes are well-structured and add valuable profiling capabilities.
| s_phase_header->buffer_counts[i] = 0; | ||
| } | ||
|
|
||
| memset(&s_phase_header->orch_summary, 0, sizeof(AicpuOrchSummary)); |
There was a problem hiding this comment.
Using memset to zero-out a C++ struct can be problematic if the struct is not a POD (Plain Old Data) type and can trigger compiler warnings like -Wclass-memaccess. A safer and more idiomatic C++ way to zero-initialize the struct is to use value initialization. This change will also allow you to remove the -Wno-error=class-memaccess flag from src/platform/a2a3sim/aicpu/CMakeLists.txt.
| memset(&s_phase_header->orch_summary, 0, sizeof(AicpuOrchSummary)); | |
| s_phase_header->orch_summary = {}; |
| if (has_phase_data_) { | ||
| for (const auto& thread_records : collected_phase_records_) { | ||
| for (const auto& pr : thread_records) { | ||
| if (pr.start_time > 0 && pr.start_time < base_time_cycles) { | ||
| base_time_cycles = pr.start_time; | ||
| } | ||
| } | ||
| } | ||
| for (const auto& pr : collected_orch_phase_records_) { | ||
| if (pr.start_time > 0 && pr.start_time < base_time_cycles) { | ||
| base_time_cycles = pr.start_time; | ||
| } | ||
| } | ||
| if (collected_orch_summary_.magic == AICPU_PHASE_MAGIC && | ||
| collected_orch_summary_.start_time > 0 && | ||
| collected_orch_summary_.start_time < base_time_cycles) { | ||
| base_time_cycles = collected_orch_summary_.start_time; | ||
| } | ||
| } |
There was a problem hiding this comment.
The logic to find the minimum base_time_cycles is repeated across several loops. To improve readability and maintainability, you could extract this logic into a small helper function or a lambda.
| if (has_phase_data_) { | |
| for (const auto& thread_records : collected_phase_records_) { | |
| for (const auto& pr : thread_records) { | |
| if (pr.start_time > 0 && pr.start_time < base_time_cycles) { | |
| base_time_cycles = pr.start_time; | |
| } | |
| } | |
| } | |
| for (const auto& pr : collected_orch_phase_records_) { | |
| if (pr.start_time > 0 && pr.start_time < base_time_cycles) { | |
| base_time_cycles = pr.start_time; | |
| } | |
| } | |
| if (collected_orch_summary_.magic == AICPU_PHASE_MAGIC && | |
| collected_orch_summary_.start_time > 0 && | |
| collected_orch_summary_.start_time < base_time_cycles) { | |
| base_time_cycles = collected_orch_summary_.start_time; | |
| } | |
| } | |
| if (has_phase_data_) { | |
| auto update_base_time = [&](uint64_t new_time) { | |
| if (new_time > 0 && new_time < base_time_cycles) { | |
| base_time_cycles = new_time; | |
| } | |
| }; | |
| for (const auto& thread_records : collected_phase_records_) { | |
| for (const auto& pr : thread_records) { | |
| update_base_time(pr.start_time); | |
| } | |
| } | |
| for (const auto& pr : collected_orch_phase_records_) { | |
| update_base_time(pr.start_time); | |
| } | |
| if (collected_orch_summary_.magic == AICPU_PHASE_MAGIC) { | |
| update_base_time(collected_orch_summary_.start_time); | |
| } | |
| } |
…nd dependency arrows (hw-native-sys#150) - Add scheduler phase profiling: record COMPLETE/DISPATCH/SCAN/EARLY_READY phases per loop iteration with per-thread buffers in shared memory - Add per-task orchestrator phase recording (sync/alloc/params/lookup/heap/ insert/fanin/finalize/scope_end) using AicpuPhaseRecord with dedicated buffer slot, exported as aicpu_orchestrator_phases JSON array - Write cumulative AicpuOrchSummary to shared memory for backward compat - Host-side collection reads both scheduler and orchestrator phase records, exports version 2 JSON with scheduler phases, orchestrator summary, and per-task orchestrator phases - Swimlane converter renders scheduler phases as color-coded bars on pid=3, per-task orchestrator phases on pid=4 with per-phase colors - Add AICPU View (pid=2) fanout dependency arrows mirroring AICore View - Add Scheduler DISPATCH to AICore/AICPU task execution flow arrows - Set process sort order: Orchestrator, Scheduler, AICPU View, AICore View
Summary
AicpuPhaseRecord,AicpuOrchSummary,AicpuPhaseHeader) appended after DoubleBuffer array in shared memoryaicpu_executorcollect_phase_data) and version 2 JSON export withphase_us(microseconds)Testing