
Feature: AICPU scheduler phase profiling and orchestrator summary #150

Merged
ChaoWao merged 1 commit into main from profiling-refactor-and-phase on Mar 2, 2026

@ChaoWao (Collaborator) commented Mar 1, 2026

Summary

  • Add phase profiling data structures (AicpuPhaseRecord, AicpuOrchSummary, AicpuPhaseHeader) appended after DoubleBuffer array in shared memory
  • Implement AICPU-side recording API with cached pointers for hot-path efficiency
  • Instrument 4 scheduler phases (COMPLETE, DISPATCH, SCAN, EARLY_READY) and orchestrator cumulative summary in aicpu_executor
  • Host-side collection (collect_phase_data) and version 2 JSON export with phase_us (microseconds)
  • Perfetto visualization: pid=3 scheduler phase bars, pid=4 orchestrator with phase sub-events
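The recording path described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the struct fields, the capacity constants, and the function names (`init_phase_profiling`, `record_phase`) are assumptions made for the example, and the real `AicpuPhaseRecord` and `AicpuPhaseHeader` layouts may differ. It shows the idea of caching the header and record pointers once so that the per-phase call only does an index computation and a store.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical capacities; the real values live in perf_profiling.h.
constexpr uint32_t kMaxThreads = 4;
constexpr uint32_t kRecordsPerThread = 8;

enum class PhaseId : uint32_t { COMPLETE, DISPATCH, SCAN, EARLY_READY };

// Illustrative record layout (assumed, not the PR's AicpuPhaseRecord).
struct PhaseRecord {
    uint64_t start_time;  // cycle counter at phase start
    uint64_t end_time;    // cycle counter at phase end
    PhaseId phase;
    uint32_t task_count;  // tasks handled during this phase
};

// alignas(8) keeps the record array that follows the header 8-byte aligned.
struct alignas(8) PhaseHeader {
    uint32_t num_threads;
    uint32_t buffer_counts[kMaxThreads];  // records written so far, per thread
};

// Cached once at init so the hot path does not recompute offsets.
static PhaseHeader* s_header = nullptr;
static PhaseRecord* s_records = nullptr;

// `region` must be 8-byte aligned and large enough for header plus records.
void init_phase_profiling(void* region, uint32_t num_threads) {
    assert(region != nullptr && num_threads <= kMaxThreads);
    s_header = static_cast<PhaseHeader*>(region);
    s_records = reinterpret_cast<PhaseRecord*>(s_header + 1);
    s_header->num_threads = num_threads;
    for (uint32_t i = 0; i < kMaxThreads; ++i) {
        s_header->buffer_counts[i] = 0;
    }
}

// Appends one record to the calling thread's buffer.
// Returns false when profiling is not initialized, the thread index is out
// of range, or that thread's buffer is already full.
bool record_phase(uint32_t thread, PhaseId phase, uint64_t start, uint64_t end,
                  uint32_t tasks) {
    if (s_header == nullptr || thread >= s_header->num_threads) return false;
    uint32_t& count = s_header->buffer_counts[thread];
    if (count >= kRecordsPerThread) return false;
    s_records[thread * kRecordsPerThread + count] = {start, end, phase, tasks};
    ++count;
    return true;
}
```

Because each thread only touches its own slice of the record array and its own counter, this sketch needs no locking between scheduler threads; whether the PR relies on the same property is not stated in the thread.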

Testing

  • Simulation tests pass (10/10)
  • Hardware tests pass (if applicable)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance profiling capabilities for the AICPU scheduler and orchestrator. It introduces new shared memory data structures and APIs to capture fine-grained timing information for critical scheduler phases and the orchestrator's internal operations. The collected data is then integrated into the existing host-side performance collection and export mechanism, enabling detailed visualization in Perfetto. This allows developers to gain deeper insights into the performance bottlenecks and behavior of the AICPU scheduling process.

Highlights

  • AICPU Phase Profiling Data Structures: Added new data structures (AicpuPhaseRecord, AicpuOrchSummary, AicpuPhaseHeader) to shared memory for detailed AICPU scheduler phase and orchestrator profiling.
  • AICPU-side Recording APIs and Instrumentation: Implemented AICPU-side APIs to record scheduler phases (COMPLETE, DISPATCH, SCAN, EARLY_READY) and orchestrator cumulative summary, with cached pointers for hot-path efficiency.
  • Host-side Data Collection and JSON Export: Introduced host-side logic to collect the new phase profiling data and updated the swimlane JSON export to version 2, including phase_us (microseconds) for visualization.
  • Perfetto Visualization Support: Enhanced the swimlane_converter.py script to visualize AICPU scheduler phases (pid=3) and orchestrator activity (pid=4 with phase sub-events) in Perfetto.
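The export step reports phase times in microseconds, while the device records raw cycle-counter values. A conversion of that kind could look like the sketch below. The function name `cycles_to_us` and the `freq_hz` parameter are assumptions for illustration; the thread does not show how the PR obtains the counter frequency or which timestamp it uses as the base.

```cpp
#include <cassert>
#include <cstdint>

// Converts a cycle-counter timestamp to microseconds relative to a base
// timestamp. `freq_hz` is a hypothetical counter frequency in Hz supplied by
// the caller.
double cycles_to_us(uint64_t cycles, uint64_t base_cycles, uint64_t freq_hz) {
    assert(freq_hz > 0 && cycles >= base_cycles);
    return static_cast<double>(cycles - base_cycles) * 1e6 /
           static_cast<double>(freq_hz);
}
```

For example, with a 50 MHz counter, a timestamp 50 cycles after the base maps to 1.0 µs.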


Changelog
  • src/platform/a2a3/host/device_runner.cpp
    • Invoked perf_collector_.collect_phase_data() when profiling is enabled.
  • src/platform/a2a3sim/host/device_runner.cpp
    • Invoked perf_collector_.collect_phase_data() when profiling is enabled in the simulator.
  • src/platform/include/aicpu/performance_collector_aicpu.h
    • Added perf_aicpu_init_phase_profiling function to initialize AICPU phase profiling.
    • Introduced perf_aicpu_record_phase function to record individual scheduler phases.
    • Added perf_aicpu_write_orch_summary function to write orchestrator cumulative summary data.
  • src/platform/include/common/perf_profiling.h
    • Updated memory layout documentation to include the optional phase profiling region.
    • Defined AicpuPhaseId enum for scheduler phase identification.
    • Introduced AicpuPhaseRecord struct for single scheduler phase records.
    • Added AicpuOrchSummary struct for orchestrator cumulative profiling data.
    • Defined AicpuPhaseHeader struct for phase profiling metadata.
    • Provided helper functions (calc_perf_data_size_with_phases, get_phase_header, get_phase_records) for accessing phase profiling data in shared memory.
  • src/platform/include/host/performance_collector.h
    • Declared collect_phase_data method to gather AICPU phase profiling data.
    • Added collected_phase_records_, collected_orch_summary_, and has_phase_data_ members to store collected phase profiling data.
  • src/platform/src/aicpu/performance_collector_aicpu.cpp
    • Implemented perf_aicpu_init_phase_profiling to set up phase header and clear record buffers.
    • Implemented perf_aicpu_record_phase to append phase records to thread-specific buffers.
    • Implemented perf_aicpu_write_orch_summary to store orchestrator summary in shared memory.
    • Introduced static cached pointers (s_phase_header, s_phase_records) for efficient hot-path access.
  • src/platform/src/host/performance_collector.cpp
    • Updated initialize to calculate total shared memory size including the new phase profiling region.
    • Implemented collect_phase_data to read and validate phase records and orchestrator summary from shared memory.
    • Modified export_swimlane_json to include phase record timestamps in base time calculation and export version 2 JSON with aicpu_scheduler_phases and aicpu_orchestrator data.
    • Cleared collected phase data in the finalize method.
  • src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Called perf_aicpu_init_phase_profiling during one-time initialization for scheduler threads.
    • Added local counters (phase_complete_count, phase_dispatch_count, phase_scan_count, phase_early_ready_count) for tasks processed in each phase.
    • Instrumented scheduler loop phases (COMPLETE, DISPATCH, SCAN, EARLY_READY) with perf_aicpu_record_phase calls.
    • Wrote orchestrator cumulative summary to shared memory using perf_aicpu_write_orch_summary.
  • tools/swimlane_converter.py
    • Updated read_perf_data to support version 2 of the JSON format.
    • Modified generate_chrome_trace_json to accept scheduler_phases and orchestrator_data.
    • Added logic to generate Perfetto trace events for AICPU scheduler phases (pid=3) with distinct colors.
    • Added logic to generate Perfetto trace events for AICPU orchestrator summary (pid=4) including sub-events for different internal phases.
    • Updated main function to extract and pass version 2 profiling data to the trace generation.
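The changelog mentions helpers that compute the total shared-memory size and locate the phase region after the DoubleBuffer array. The sketch below illustrates that kind of offset arithmetic under assumed types and sizes. The stub structs and the function names (`base_size`, `size_with_phases`, `phase_header`, `phase_records`) are placeholders and do not reproduce the signatures of `calc_perf_data_size_with_phases`, `get_phase_header`, or `get_phase_records`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder types; the sizes are illustrative only.
struct DoubleBufferStub { uint64_t payload[16]; };
struct PhaseHeaderStub { uint32_t magic; uint32_t num_threads; };
struct PhaseRecordStub { uint64_t start_time; uint64_t end_time; };

// The record array directly follows the header, so the header size must
// preserve the record alignment.
static_assert(sizeof(PhaseHeaderStub) % alignof(PhaseRecordStub) == 0,
              "records would be misaligned after the header");

// Size of the base region: one double buffer per core.
constexpr size_t base_size(size_t num_cores) {
    return num_cores * sizeof(DoubleBufferStub);
}

// Total size once the optional phase region is appended.
constexpr size_t size_with_phases(size_t num_cores, size_t num_threads,
                                  size_t records_per_thread) {
    return base_size(num_cores) + sizeof(PhaseHeaderStub) +
           num_threads * records_per_thread * sizeof(PhaseRecordStub);
}

// The phase header starts right after the DoubleBuffer array.
inline PhaseHeaderStub* phase_header(void* base, size_t num_cores) {
    assert(base != nullptr);
    return reinterpret_cast<PhaseHeaderStub*>(
        static_cast<unsigned char*>(base) + base_size(num_cores));
}

// The records start right after the header.
inline PhaseRecordStub* phase_records(void* base, size_t num_cores) {
    return reinterpret_cast<PhaseRecordStub*>(phase_header(base, num_cores) + 1);
}
```

Keeping the size calculation and the accessors in one header means the host and the device derive the same offsets from the same inputs, which is presumably why the PR places its helpers in the shared `perf_profiling.h`.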
Activity
  • Simulation tests have passed (10/10).
  • Hardware tests are marked as pending or not applicable.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces detailed phase profiling for the AICPU scheduler and a cumulative summary for the orchestrator. A critical security vulnerability was identified where the host-side collection logic trusts num_sched_threads from shared memory for loop bounds and array indexing without validation, potentially leading to an out-of-bounds read. It is essential to implement bounds checking for all data retrieved from shared memory. Additionally, there are a few suggestions to enhance code conciseness and idiomatic C++ usage.
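The bounds-checking concern can be illustrated with a short sketch. Everything here is an assumption for the sake of the example: the capacity constants, the struct layouts, and the function `collect_phase_records` are hypothetical and are not the PR's `collect_phase_data`. The point is only that counters read from shared memory are validated or clamped before they are used as loop bounds or indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kMaxSchedThreads = 4;      // hypothetical capacity
constexpr uint32_t kMaxRecordsPerThread = 8;  // hypothetical capacity

struct PhaseRecord { uint64_t start_time; uint64_t end_time; };

struct PhaseHeader {
    uint32_t num_sched_threads;
    uint32_t buffer_counts[kMaxSchedThreads];
};

// Copies records out of device-written memory without trusting its counters.
// Returns false (and leaves `out` empty) if the thread count is implausible.
bool collect_phase_records(const PhaseHeader& header, const PhaseRecord* records,
                           std::vector<std::vector<PhaseRecord>>& out) {
    assert(records != nullptr);
    out.clear();
    const uint32_t threads = header.num_sched_threads;  // read once, then validate
    if (threads == 0 || threads > kMaxSchedThreads) return false;
    for (uint32_t t = 0; t < threads; ++t) {
        // Clamp the per-thread count to the buffer capacity.
        const uint32_t count = std::min(header.buffer_counts[t], kMaxRecordsPerThread);
        const PhaseRecord* first = records + static_cast<size_t>(t) * kMaxRecordsPerThread;
        out.emplace_back(first, first + count);
    }
    return true;
}
```

Reading `num_sched_threads` into a local before checking it also avoids the case where the value changes between the check and the loop.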

Comment thread src/platform/src/host/performance_collector.cpp
Comment thread src/platform/src/aicpu/performance_collector_aicpu.cpp Outdated
Comment thread src/platform/src/host/performance_collector.cpp Outdated
@ChaoWao ChaoWao force-pushed the profiling-refactor-and-phase branch 3 times, most recently from 8be37ca to 9cb82e0 Compare March 1, 2026 13:13
…nd dependency arrows

- Add scheduler phase profiling: record COMPLETE/DISPATCH/SCAN/EARLY_READY
  phases per loop iteration with per-thread buffers in shared memory
- Add per-task orchestrator phase recording (sync/alloc/params/lookup/heap/
  insert/fanin/finalize/scope_end) using AicpuPhaseRecord with dedicated
  buffer slot, exported as aicpu_orchestrator_phases JSON array
- Write cumulative AicpuOrchSummary to shared memory for backward compat
- Host-side collection reads both scheduler and orchestrator phase records,
  exports version 2 JSON with scheduler phases, orchestrator summary, and
  per-task orchestrator phases
- Swimlane converter renders scheduler phases as color-coded bars on pid=3,
  per-task orchestrator phases on pid=4 with per-phase colors
- Add AICPU View (pid=2) fanout dependency arrows mirroring AICore View
- Add Scheduler DISPATCH to AICore/AICPU task execution flow arrows
- Set process sort order: Orchestrator, Scheduler, AICPU View, AICore View
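For readers unfamiliar with the trace format, a phase bar in Perfetto can be expressed as a Chrome Trace Event "complete" event (`"ph": "X"`) with a `pid`, a `tid`, a start `ts`, and a `dur`, both in microseconds. The sketch below formats one such event. It is written in C++ for consistency with the rest of this thread, whereas the actual converter is the Python script `tools/swimlane_converter.py`; the function name `phase_event_json` and the choice of fields are assumptions, and the name is not JSON-escaped, so it only suits plain identifiers such as phase names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats one Chrome Trace Event "complete" event (ph = "X").
// ts_us and dur_us are in microseconds. `name` is assumed to contain no
// characters that need JSON escaping.
std::string phase_event_json(const std::string& name, int pid, int tid,
                             double ts_us, double dur_us) {
    char buf[256];
    const int written = std::snprintf(
        buf, sizeof(buf),
        "{\"name\":\"%s\",\"ph\":\"X\",\"pid\":%d,\"tid\":%d,\"ts\":%.3f,\"dur\":%.3f}",
        name.c_str(), pid, tid, ts_us, dur_us);
    // Guard against formatting errors and truncation.
    assert(written > 0 && static_cast<size_t>(written) < sizeof(buf));
    return std::string(buf);
}
```

A DISPATCH phase on scheduler thread 0 would then be emitted with `pid` 3, matching the process id this commit assigns to scheduler phases.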
@ChaoWao ChaoWao force-pushed the profiling-refactor-and-phase branch from 9cb82e0 to a7b6126 Compare March 2, 2026 01:41
@ChaoWao (Collaborator, Author) commented Mar 2, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces detailed phase profiling for the AICPU scheduler and a summary for the orchestrator. The changes include new data structures for profiling, instrumentation in the AICPU executor and orchestrator, and updates to the host-side collector and visualization script to support the new data.

My review focuses on C++ best practices and code maintainability. I've suggested replacing a memset call with modern C++ value initialization, which is safer and also resolves a compiler warning that was being suppressed. I've also recommended a small refactoring to reduce code duplication when calculating the base timestamp for profiling data.

Overall, the changes are well-structured and add valuable profiling capabilities.

s_phase_header->buffer_counts[i] = 0;
}

memset(&s_phase_header->orch_summary, 0, sizeof(AicpuOrchSummary));

medium

Using memset to zero out a C++ struct can be problematic if the struct is not a trivial, POD (Plain Old Data) type, and it can trigger compiler warnings such as -Wclass-memaccess. A safer and more idiomatic C++ way to zero-initialize the struct is to assign a value-initialized object. This change will also allow you to remove the -Wno-error=class-memaccess flag from src/platform/a2a3sim/aicpu/CMakeLists.txt.

Suggested change
- memset(&s_phase_header->orch_summary, 0, sizeof(AicpuOrchSummary));
+ s_phase_header->orch_summary = {};
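A self-contained illustration of the suggested reset follows. The stub struct stands in for `AicpuOrchSummary`, whose real field list is not shown in this thread, and `reset_summary` is a hypothetical wrapper added only so the behavior can be exercised.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for AicpuOrchSummary; the fields are assumed for illustration.
struct OrchSummaryStub {
    uint64_t start_time;
    uint64_t end_time;
    uint32_t magic;
    uint32_t task_count;
};

struct HeaderStub {
    OrchSummaryStub orch_summary;
};

// Resets the summary by assigning a value-initialized temporary. No raw
// memory function is involved, so -Wclass-memaccess has nothing to flag.
inline void reset_summary(HeaderStub* header) {
    assert(header != nullptr);
    header->orch_summary = {};
}
```

For a struct without default member initializers, as in this stub, `= {}` leaves every member zero. If the real struct declares non-zero default member initializers, the same assignment would restore those defaults instead, which differs from what `memset` does.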

Comment on lines +412 to +430
if (has_phase_data_) {
    for (const auto& thread_records : collected_phase_records_) {
        for (const auto& pr : thread_records) {
            if (pr.start_time > 0 && pr.start_time < base_time_cycles) {
                base_time_cycles = pr.start_time;
            }
        }
    }
    for (const auto& pr : collected_orch_phase_records_) {
        if (pr.start_time > 0 && pr.start_time < base_time_cycles) {
            base_time_cycles = pr.start_time;
        }
    }
    if (collected_orch_summary_.magic == AICPU_PHASE_MAGIC &&
        collected_orch_summary_.start_time > 0 &&
        collected_orch_summary_.start_time < base_time_cycles) {
        base_time_cycles = collected_orch_summary_.start_time;
    }
}

medium

The logic to find the minimum base_time_cycles is repeated across several loops. To improve readability and maintainability, you could extract this logic into a small helper function or a lambda.

Suggested change
- if (has_phase_data_) {
-     for (const auto& thread_records : collected_phase_records_) {
-         for (const auto& pr : thread_records) {
-             if (pr.start_time > 0 && pr.start_time < base_time_cycles) {
-                 base_time_cycles = pr.start_time;
-             }
-         }
-     }
-     for (const auto& pr : collected_orch_phase_records_) {
-         if (pr.start_time > 0 && pr.start_time < base_time_cycles) {
-             base_time_cycles = pr.start_time;
-         }
-     }
-     if (collected_orch_summary_.magic == AICPU_PHASE_MAGIC &&
-         collected_orch_summary_.start_time > 0 &&
-         collected_orch_summary_.start_time < base_time_cycles) {
-         base_time_cycles = collected_orch_summary_.start_time;
-     }
- }
+ if (has_phase_data_) {
+     // Lowers base_time_cycles to new_time when new_time is valid and earlier.
+     auto update_base_time = [&](uint64_t new_time) {
+         if (new_time > 0 && new_time < base_time_cycles) {
+             base_time_cycles = new_time;
+         }
+     };
+     for (const auto& thread_records : collected_phase_records_) {
+         for (const auto& pr : thread_records) {
+             update_base_time(pr.start_time);
+         }
+     }
+     for (const auto& pr : collected_orch_phase_records_) {
+         update_base_time(pr.start_time);
+     }
+     if (collected_orch_summary_.magic == AICPU_PHASE_MAGIC) {
+         update_base_time(collected_orch_summary_.start_time);
+     }
+ }

@ChaoWao ChaoWao merged commit 7df48e9 into main Mar 2, 2026
3 checks passed
@ChaoWao ChaoWao deleted the profiling-refactor-and-phase branch March 2, 2026 01:56
PKUZHOU pushed a commit to PKUZHOU/simpler that referenced this pull request Mar 31, 2026
…nd dependency arrows (hw-native-sys#150)
