
Refactor: migrate executor to scheduler API and simplify orchestrator#159

Closed
ChaoWao wants to merge 1 commit into main from refactor/executor-scheduler-api-migration
Closed

Refactor: migrate executor to scheduler API and simplify orchestrator#159
ChaoWao wants to merge 1 commit into
mainfrom
refactor/executor-scheduler-api-migration

Conversation

ChaoWao (Collaborator) commented Mar 2, 2026

Summary

  • Replace inline scheduling logic in aicpu_executor (ready queue,
    SpinLock, fanout traversal) with scheduler API calls
    (get_ready_task, on_task_complete)
  • Restructure orchestrator STEP 5 fanin path to use scheduler
    task_state for early-return optimization with +1 fanin protocol
  • Remove orch_ready_queue fields and pto2_add_consumer_to_producer
    declaration from orchestrator header
  • Remove dead pto2_alloc_packed_buffer free function implementation

This is the final commit in the 3-part split of #155.

Testing

  • Simulation tests pass (./ci.sh -p a2a3sim — 10/10)
  • Hardware tests pass (if applicable)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the task scheduling and orchestration mechanisms within the system. By migrating the aicpu_executor to utilize a dedicated scheduler API, it centralizes and streamlines task management, removing complex inline logic. The orchestrator has also been simplified by integrating with the new scheduler's state tracking, leading to cleaner dependency handling and the removal of obsolete components. This change improves the modularity and maintainability of the scheduling infrastructure.

Highlights

  • Executor Migration to Scheduler API: The aicpu_executor's inline scheduling logic, including ready queue management and spinlocks, has been replaced with calls to the new scheduler API functions like get_ready_task and on_task_complete.
  • Orchestrator Simplification: The orchestrator's fanin path (STEP 5) has been restructured to leverage the scheduler's task_state for early-return optimizations, simplifying dependency management.
  • Removal of Redundant Code: Several unused or deprecated components have been removed, including orch_ready_queue fields and the pto2_add_consumer_to_producer declaration from the orchestrator header, as well as a dead pto2_alloc_packed_buffer implementation.


Changelog
  • src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Removed custom ready queue structures, spinlocks, and related management logic.
    • Introduced a static PTO2Runtime pointer for scheduler interaction.
    • Eliminated various atomic flags and pointers related to orchestrator-executor synchronization and ready queues.
    • Removed the enqueue_ready_task_with_profiling helper function.
    • Removed static arrays for s_pto2_fanin_refcount, s_pto2_task_completed, and s_pto2_completed_by_task.
    • Updated the init function to remove ready queue initialization.
    • Removed sm_header_ready_ and orch_pointers_ready_ wait loops.
    • Simplified profiling counters and removed phase-specific profiling variables.
    • Integrated pto2_scheduler_on_task_complete for task completion handling.
    • Replaced manual ready queue polling and work-stealing with pto2_scheduler_get_ready_task.
    • Removed the incremental scan for root tasks and the early-ready drain logic.
    • Updated stall diagnosis to query the scheduler's task state and fanin refcount.
    • Adjusted profiling summary output to reflect new metrics.
    • Added runtime_init_ready_ atomic flag for synchronization.
    • Removed ready queue and orchestrator pointer resets from the deinit function.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Removed a redundant comment block in pto2_add_consumer_to_producer.
    • Deleted the unused pto2_alloc_packed_buffer function implementation.
    • Refactored output buffer allocation logic within pto2_submit_task to occur earlier.
    • Modified the fanin list finalization (STEP 5) to use the scheduler API for dependency tracking and early-return optimization when a scheduler is present.
    • Removed the explicit pto2_add_consumer_to_producer call within the pto2_submit_task loop when a scheduler is active.
    • Eliminated the 'STEP 5b: Check if task is already ready' block, as its functionality is now handled by the scheduler.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
    • Removed orch_ready_queue related member variables and constants from PTO2OrchestratorState.
    • Removed the declaration of the pto2_add_consumer_to_producer helper function.
Activity
  • Simulation tests passed with 10/10 results, indicating functional correctness of the refactored logic.


gemini-code-assist (Bot) left a comment


Code Review

This pull request is a significant refactoring that migrates the scheduling logic out of aicpu_executor and into a dedicated scheduler API. The changes remove the executor's internal ready queues, spinlocks, and fanout traversal logic, replacing them with calls to pto2_scheduler_get_ready_task and pto2_scheduler_on_task_complete. The orchestrator's fanin logic is also updated to integrate with the new scheduler state, including an early-return optimization for already-completed producer tasks. This centralization of scheduling logic greatly simplifies the executor and improves the overall structure and maintainability of the code. The changes are well-implemented. I have a couple of minor suggestions to improve code clarity and encapsulation.

I am having trouble creating individual review comments; my feedback is inlined below.

src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp (419)

medium

The comment on this line appears to be incorrect. It says "prepend consumer to producer's fanout list", but the code on the next line (420) is prepending the producer to the current task's (the consumer's) fanin list. This could be confusing for future maintainers.

            // Normal path: prepend producer to this task's fanin list

src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h (240-241)

medium

This function declaration is being removed from the header, which is a good step for encapsulation. To fully restrict its visibility to pto_orchestrator.cpp, please also declare the function as static in its definition in pto_orchestrator.cpp. This will prevent other parts of the codebase from accidentally linking against it and improve maintainability.

ChaoWao force-pushed the refactor/executor-scheduler-api-migration branch from c24e92d to b0b1ad4 on March 2, 2026 13:25
ChaoWao force-pushed the refactor/executor-scheduler-api-migration branch 2 times, most recently from 90317a2 to df9618a on March 3, 2026 02:21
- Replace inline scheduling logic in aicpu_executor (ready queue,
  SpinLock, fanout traversal) with scheduler API calls
  (get_ready_task, on_task_complete)
- Restructure orchestrator STEP 5 fanin path to use scheduler
  task_state for early-return optimization with +1 fanin protocol
- Add lock-free orch_pending queue (SPSC producer, MPMC consumer)
  so orchestrator never touches ready_queue spinlock in init_task
- Split release_fanin_and_check_ready into mark_ready (CAS only)
  and check_ready (CAS + enqueue) for orch vs scheduler paths
- Reorder scheduler loop: complete -> scan (drain orch_pending) ->
  dispatch -> idle, ensuring pending tasks dispatch same iteration
- Add scan_enqueue_count profiling counter with [enqueue: N] output
- Rename SCHED_EARLY_READY phase to SCHED_IDLE_WAIT for clarity
- Move PTO-ISA fallback retry into ci.sh: -c flag specifies fallback
  commit, first failure pins and retries, subsequent examples reuse
- Update sched_overhead_analysis.py parser for new log format
ChaoWao force-pushed the refactor/executor-scheduler-api-migration branch from df9618a to 3be34a0 on March 3, 2026 02:24
ChaoWao (Collaborator, Author) commented Mar 3, 2026

/gemini review


gemini-code-assist (Bot) left a comment


Code Review

This pull request refactors the inline scheduling logic from aicpu_executor to a new, cleaner scheduler API, aiming to simplify the executor and improve modularity. However, it introduces several critical security vulnerabilities related to integer overflows and missing null checks in the memory management and scheduling logic. Specifically, unvalidated 64-bit integers from shared memory are used in size calculations for heap allocations, which can lead to heap overflows. Furthermore, 32-bit signed integers are used to accumulate 64-bit sizes, potentially causing overflows and subsequent null pointer dereferences or invalid memory address usage, which could be exploited for memory corruption or arbitrary code execution. Additionally, while profiling and analysis tools were updated, the ci.sh script has an opportunity for improvement to reduce code duplication.


// Initialize orch_pending lock-free queue
sched->orch_pending_capacity = window_size;
sched->orch_pending = (int32_t*)malloc(window_size * sizeof(int32_t));

security-critical critical

The allocation of the orch_pending buffer does not account for potential integer overflow when multiplying window_size by sizeof(int32_t). Since window_size is a 64-bit integer obtained from shared memory and is not capped before this allocation, a sufficiently large value can cause the multiplication to wrap around, resulting in a much smaller buffer being allocated than expected. Subsequent operations that use window_size as the capacity for indexing into this buffer (e.g., in orch_pending_push and orch_pending_try_pop in pto_scheduler.h) will then cause a heap-based buffer overflow.

Comment on lines +309 to +323
int32_t total_output_size = 0;
for (int i = 0; i < num_params; i++) {
PTOParam& p = task->params[i];
if (p.type != PTOParamType::OUTPUT) {
continue;
}
auto& tensor_data = p.tensor.data();
// Only allocate from ring buffer when caller did not provide an address
if (tensor_data.buffer.addr == 0) {
total_output_size += PTO2_ALIGN_UP(tensor_data.buffer.size, PTO2_PACKED_OUTPUT_ALIGN);
}
}

if (total_output_size > 0) {
task->packed_buffer_base = orch->pto2_alloc_packed_buffer(total_output_size);

security-high high

The variable total_output_size is declared as a signed 32-bit integer (int32_t), but it is used to accumulate the sizes of multiple output tensors, which are 64-bit integers (uint64_t). If the sum of these sizes exceeds the maximum value of a 32-bit signed integer, total_output_size will overflow and potentially become negative. If it is negative, the check if (total_output_size > 0) at line 322 will fail, leaving task->packed_buffer_base as NULL. Subsequently, at line 377, this NULL pointer is used in pointer arithmetic: (char*)task->packed_buffer_base + offset. This results in alloc_addr being set to a small integer value (the offset), which is then used as a memory address for tensor data. This can lead to memory corruption or crashes when the hardware attempts to access these invalid addresses.

Comment on lines +377 to +378
uint64_t alloc_addr = reinterpret_cast<uint64_t>((char*)task->packed_buffer_base + offset);
tensor_data.buffer.addr = alloc_addr;

security-high high

The result of orch->pto2_alloc_packed_buffer(total_output_size) is used here without checking if it returned NULL. If total_output_size was invalid or if the allocation failed, task->packed_buffer_base will be NULL, leading to invalid memory addresses being calculated and used in subsequent steps.

Comment thread ci.sh
Comment on lines +243 to +251
elif pin_pto_isa_on_failure; then
echo "[CI] Retrying: $name with pinned PTO-ISA"
if python examples/scripts/run_example.py \
-k "${dir}/kernels" -g "${dir}/golden.py" \
-p a2a3sim "${commit_flag[@]}"; then
echo "${name}:a2a3sim|PASS" >> "$RESULTS_FILE"
else
echo "${name}:a2a3sim|FAIL" >> "$RESULTS_FILE"
fi

medium

The logic to run the simulation test is duplicated. The python examples/scripts/run_example.py ... command and its success/failure handling appear both in the initial if block and again here in the elif block for retrying. This duplication makes the script harder to maintain.

Consider extracting the test execution into a shell function to avoid repetition. For example:

run_sim_test() {
    if python examples/scripts/run_example.py \
        -k "${dir}/kernels" -g "${dir}/golden.py" \
        -p a2a3sim "${commit_flag[@]}"; then
        echo "${name}:a2a3sim|PASS" >> "$RESULTS_FILE"
        return 0
    else
        return 1
    fi
}

The main loop could then be simplified, making the intent of the retry logic clearer.

jvjhfhg closed this in #160 on Mar 3, 2026
ChaoWao reopened this Mar 3, 2026
ChaoWao added a commit that referenced this pull request Mar 3, 2026
- Restore perf_aicpu_record_phase calls for all scheduler phases
  (complete, dispatch, scan, idle) lost during scheduler API migration
- Replace old yield/orch_drain counters with notify/pop/idle metrics
- Upgrade DEV_ALWAYS output with per-phase breakdown and detailed stats
  (notify edges, pop hit/miss rates)
- Add "Thread 3:" prefix to orchestrator/TensorMap log output
- Add TensorMap lookup/insert profiling counters behind new
  PTO2_TENSORMAP_PROFILING macro (default off, enable with -D flag)
- Fix include ordering bug where inline lookup/insert methods were
  compiled before PTO2_ORCH_PROFILING was defined
- Add task count validation in swimlane_converter to suppress misleading
  Sched CPU metric when device log mismatches perf JSON
- Rewrite sched_overhead_analysis.py parser to match new output format
- Update device_log_profiling.md examples

Closes #159
ChaoWao added a commit that referenced this pull request Mar 3, 2026
- Restore perf_aicpu_record_phase calls for all scheduler phases
  (complete, dispatch, scan, idle) lost during scheduler API migration
- Replace old yield/orch_drain counters with notify/pop/idle metrics
- Upgrade DEV_ALWAYS output with per-phase breakdown and detailed stats
  (notify edges, pop hit/miss rates)
- Add "Thread 3:" prefix to orchestrator/TensorMap log output
- Add TensorMap lookup/insert profiling counters behind new
  PTO2_TENSORMAP_PROFILING macro (default off, enable with -D flag)
- Rename PTO2_ORCH_PROFILING to PTO2_PROFILING for clarity since it
  controls both orchestrator and scheduler profiling
- Fix include ordering bug where inline lookup/insert methods were
  compiled before the profiling macro was defined
- Add task count validation in swimlane_converter to suppress misleading
  Sched CPU metric when device log mismatches perf JSON
- Rewrite sched_overhead_analysis.py parser to match new output format
- Update device_log_profiling.md examples

Closes #159
ChaoWao closed this in #167 on Mar 3, 2026
ChaoWao deleted the refactor/executor-scheduler-api-migration branch March 5, 2026 13:40