
Refactor: migrate scheduling to scheduler API, C++ member API, inline hot paths #155

Closed
ChaoWao wants to merge 3 commits into main from refactor/scheduler-api-tensormap-cpp-inline

Conversation

ChaoWao (Collaborator) commented Mar 2, 2026

Summary

  • Scheduler API migration: Move scheduling logic (fanin/fanout traversal, ready queue management, watermark advancement) from aicpu_executor.cpp into pto_scheduler.h/cpp, reducing the executor from ~1600 to ~1100 lines
  • TensorMap C++ member API: Convert free functions to struct member functions with hot paths (lookup, insert, hash) inlined in the header
  • Ring buffer inlining: Move HeapRing, TaskRing, DepListPool alloc/try_alloc implementations from .cpp to .h for inlining
  • Atomic fanin arrive: Use __ATOMIC_ACQ_REL for fanin refcount increments with CAS-based PENDING→READY state transition to prevent double-enqueue
  • ACK/FIN protocol: Update the executor to use bit-packed task state registers (EXTRACT_TASK_STATE/EXTRACT_TASK_ID) instead of AICoreStatus::IDLE/BUSY
  • Unified logging: Replace fprintf/printf with LOG_INFO/LOG_WARN/LOG_ERROR macros throughout ring buffer and tensormap modules
  • Phase profiling integration: Add perf_aicpu_init_phase_profiling, TensorMap lookup stats, and orchestrator summary export to shared memory
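The atomic fanin-arrive bullet above can be sketched in a few lines. This is a minimal illustration of the pattern described (acq_rel refcount increment plus a CAS-guarded PENDING→READY transition), not the PR's actual code; all names here are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative task states mirroring the PENDING -> READY transition
// described above; names are invented for this sketch.
enum TaskState : uint32_t { PENDING = 0, READY = 1 };

struct TaskSlot {
    std::atomic<int32_t> fanin_refcount{0};   // producers arrived so far
    std::atomic<uint32_t> state{PENDING};
};

// Called once by each finishing producer. Returns true for exactly one
// caller: the one that both completes the fanin count and wins the CAS.
// The CAS is what prevents the consumer from being enqueued twice.
inline bool arrive_and_try_ready(TaskSlot& t, int32_t fanin_count) {
    // acq_rel: release this producer's output writes, acquire the
    // writes of producers that arrived earlier.
    if (t.fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1 == fanin_count) {
        uint32_t expected = PENDING;
        return t.state.compare_exchange_strong(
            expected, READY, std::memory_order_acq_rel);
    }
    return false;
}
```

Only the `true` caller would push the task onto a ready queue, so concurrent arrivals cannot double-enqueue it.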

Testing

  • Simulation tests pass (./ci.sh -p a2a3sim — 10/10)
  • Hardware tests pass (./ci.sh -p a2a3 -d 4-7 --parallel)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring of the PTO2 runtime, centralizing scheduling logic into a dedicated API and modernizing memory management components. By converting TensorMap and ring buffer operations to C++ member functions and inlining critical code paths, the changes aim to enhance code structure, boost performance, and improve maintainability. The update also standardizes logging and expands profiling capabilities to provide deeper insights into runtime behavior.

Highlights

  • Scheduler API Migration: Core scheduling logic, including fanin/fanout traversal, ready queue management, and watermark advancement, was extracted from aicpu_executor.cpp into a new pto_scheduler.h/cpp API, significantly reducing the executor's codebase from ~1600 to ~1100 lines.
  • TensorMap C++ Member API: The pto_tensormap free functions were refactored into PTO2TensorMap class member functions, with performance-critical paths such as lookup, insert, and hash operations inlined directly in the header for improved efficiency.
  • Ring Buffer Inlining: Implementations for HeapRing, TaskRing, and DepListPool allocation and try-allocation were moved from their respective .cpp files to their .h headers, enabling compiler inlining for hot paths.
  • Atomic Fanin Handling: The fanin refcount increments now utilize __ATOMIC_ACQ_REL memory ordering, coupled with a CAS-based PENDING to READY state transition, to prevent tasks from being double-enqueued in ready queues.
  • ACK/FIN Protocol Update: The executor was updated to use bit-packed task states (EXTRACT_TASK_STATE/EXTRACT_TASK_ID) instead of the older AICoreStatus::IDLE/BUSY enumeration for task acknowledgment and finalization.
  • Unified Logging: All fprintf/printf calls within the ring buffer and tensormap modules were replaced with standardized LOG_INFO/LOG_WARN/LOG_ERROR macros for consistent logging practices.
  • Profiling Enhancements: New profiling integrations include perf_aicpu_init_phase_profiling, detailed TensorMap lookup statistics, and the export of orchestrator summary data to shared memory.
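The ACK/FIN bit-packing highlight can be made concrete with a small sketch. The actual EXTRACT_TASK_STATE/EXTRACT_TASK_ID bit layout is not shown in this PR page, so the field widths below are assumptions chosen only to illustrate the pack/extract pattern.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for illustration: low 2 bits carry the task state,
// the remaining bits carry the task id. The PR's real encoding may differ.
constexpr uint32_t STATE_BITS = 2;
constexpr uint32_t STATE_MASK = (1u << STATE_BITS) - 1;

constexpr uint32_t extract_task_state(uint32_t reg) { return reg & STATE_MASK; }
constexpr uint32_t extract_task_id(uint32_t reg)    { return reg >> STATE_BITS; }
constexpr uint32_t pack_task(uint32_t id, uint32_t state) {
    return (id << STATE_BITS) | (state & STATE_MASK);
}
```

Packing both fields into one register read lets the executor observe task id and state atomically, which is the point of replacing the separate IDLE/BUSY enum.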


Changelog
  • src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Removed legacy ready queue structures and related logic.
    • Migrated task completion and ready task retrieval to pto_scheduler API calls.
    • Streamlined initialization and deinitialization by removing orchestrator-specific synchronization flags and ready queue pointers.
    • Updated profiling counters and stall diagnosis to reflect the new scheduler architecture.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Updated calls to TensorMap and TaskRing functions to use their new C++ member API.
    • Refactored output buffer allocation within pto2_submit_task.
    • Adjusted fanin list finalization to integrate with the new scheduler state management.
    • Replaced direct fprintf/printf calls with unified logging macros.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
    • Removed orchestrator-specific ready queue declarations.
    • Converted pto2_alloc_packed_buffer into a member function of PTO2OrchestratorState.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.cpp
    • Removed standalone implementations of heap and task ring allocation/query functions.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
    • Moved heap and task ring allocation/query functions into their respective PTO2HeapRing and PTO2TaskRing structs as inline member functions.
    • Consolidated spinlock-related definitions.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
    • Removed output_index and num_outputs fields from PTO2TaskDescriptor.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
    • Added spinlocks to PTO2ReadyQueue push and pop operations for thread safety.
    • Adjusted memory allocation sizes for per-task state arrays.
    • Refactored task state management functions (init_task, check_ready) into PTO2SchedulerState member methods.
    • Updated atomic memory orders for fanin/fanout refcounts and task state transitions.
    • Introduced fanout locking during task completion to ensure consistency.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
    • Introduced a spinlock member to PTO2ReadyQueue.
    • Converted PTO2SchedulerState to a struct and integrated task slot calculation and fanin/task initialization logic as member functions.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.cpp
    • Refactored all TensorMap-related free functions into PTO2TensorMap member functions.
    • Updated internal logic to use pointers instead of integer offsets for linked list management within the TensorMap.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
    • Converted the header to use #pragma once.
    • Transformed PTO2TensorMapEntry and PTO2TensorMap into C++ classes/structs with member functions for initialization, destruction, lookup, insertion, and cleanup.
    • Updated linked list management within PTO2TensorMapEntry to use direct pointers.
  • src/runtime/tensormap_and_ringbuffer/runtime/tensor_pool.h
    • Initialized ref_count in TensorData constructors.
    • Simplified overlap detection logic in is_overlap.
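The changelog entry "Added spinlocks to PTO2ReadyQueue push and pop operations" describes a standard pattern; here is a minimal sketch of a spinlock-guarded ring queue. The layout, names, and fixed power-of-two capacity are assumptions for illustration, not the actual PTO2ReadyQueue.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simple test-and-set spinlock, acquire on lock / release on unlock.
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

template <uint32_t N>  // N must be a power of two
struct ReadyQueue {
    int32_t  slots[N];
    uint32_t head = 0, tail = 0;   // monotonically increasing; wrap via mask
    SpinLock lock;

    bool push(int32_t task_id) {
        lock.lock();
        bool ok = (tail - head) < N;            // not full
        if (ok) slots[tail++ & (N - 1)] = task_id;
        lock.unlock();
        return ok;
    }
    bool pop(int32_t* out) {
        lock.lock();
        bool ok = head != tail;                 // not empty
        if (ok) *out = slots[head++ & (N - 1)];
        lock.unlock();
        return ok;
    }
};
```

Guarding both ends with one lock keeps head/tail mutually consistent when multiple AICPU threads push and pop concurrently.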
Activity
  • Simulation tests passed (./ci.sh -p a2a3sim — 10/10).
  • Hardware tests are pending (./ci.sh -p a2a3 -d 4-7 --parallel).

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


gemini-code-assist (Bot) left a comment


Code Review

This pull request significantly refactors the scheduling logic by migrating it from aicpu_executor.cpp to a dedicated scheduler API (pto_scheduler.h/cpp). It also converts several free functions into C++ member functions for TensorMap, HeapRing, and TaskRing, improving modularity and enabling inlining for hot paths. The changes enhance thread safety by introducing spinlocks for ready queue operations and refining memory ordering for atomic operations. Overall, these changes contribute to better code organization, maintainability, and correctness in a multi-threaded environment.

Comment on lines +74 to +75

static PTO2Runtime *rt{nullptr};

medium

The use of a global static PTO2Runtime *rt{nullptr}; can introduce challenges for testability and maintainability, especially in a multi-threaded context where its lifecycle and access patterns need careful management. While static limits its scope to the translation unit, its global nature still warrants careful consideration for potential side effects or unexpected interactions if not strictly controlled.

Comment on lines +659 to +662
#if PTO2_ORCH_PROFILING
sched_yield_count++;
#endif
CYCLE_COUNT_LAP(sched_yield_cycle);

medium

The comment // RISK: Multiple entries on line 662 is vague. It would be beneficial to elaborate on the specific risk associated with multiple entries, such as potential race conditions, data corruption, or performance implications, and how this risk is mitigated or handled within the current design.

Comment on lines 215 to -218
if (__atomic_load_n(&orch->aicpu_task_completed[prod_slot], __ATOMIC_ACQUIRE) >= 2 &&
// RELAXED is sufficient: the ACQUIRE on aicpu_task_completed above
// synchronizes with the RELEASE on task_completed in the scheduler,
// and completed_by_task is stored (with RELEASE) sequenced-before
// task_completed — so it is visible after the ACQUIRE load above.

medium

The removal of comments explaining the memory ordering for aicpu_completed_by_task and aicpu_fanin_refcount is concerning. These comments provided crucial context for understanding the synchronization guarantees in a multi-threaded environment. If the underlying logic has changed such that these specific explanations are no longer relevant, it should be explicitly stated, or the new synchronization rationale should be documented.

int32_t slot = sched->pto2_task_slot(task_id);

int32_t early_finished = 0;
task->fanin_count = fanin_count + 1; // +1 redundance for not being ready too early

medium

The comment // +1 redundance for not being ready too early is unclear. Please clarify why this redundancy is needed and how it prevents premature readiness. A more detailed explanation would improve understanding of this specific synchronization mechanism.
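One plausible reading of the `+1` trick, sketched below: the extra count acts as a guard that keeps the task from becoming ready while its fanin list is still being wired up; whoever submits the task drops the guard after wiring completes. This is an assumed interpretation of the quoted comment, with invented names, not the PR's verified design.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct Task {
    std::atomic<int32_t> fanin_remaining{0};
};

// Returns true exactly when the count drops to zero (task becomes ready).
inline bool release_one(Task& t) {
    return t.fanin_remaining.fetch_sub(1, std::memory_order_acq_rel) == 1;
}

inline void init_task(Task& t, int32_t fanin_count) {
    // +1 guard: even if every producer finishes while the fanin list is
    // still being built, the count stays >= 1, so the task cannot be
    // declared ready prematurely.
    t.fanin_remaining.store(fanin_count + 1, std::memory_order_relaxed);
}

// Called once after all producers are registered; drops the guard.
// May itself be the call that makes the task ready.
inline bool submit_done(Task& t) {
    return release_one(t);
}
```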

pto2_fanout_unlock(producer);
}
if (early_finished > 0) {
__atomic_fetch_add(&sched->fanin_refcount[slot], early_finished, __ATOMIC_SEQ_CST);

medium

The use of __ATOMIC_SEQ_CST for __atomic_fetch_add(&sched->fanin_refcount[slot], early_finished, __ATOMIC_SEQ_CST); might be overly strong. In pto_scheduler.h, the release_fanin_and_check_ready function uses __ATOMIC_ACQ_REL for similar fanin_refcount operations. If ACQ_REL is sufficient there, it might also be sufficient here, potentially offering a performance improvement. Please ensure consistency or provide a justification for the stronger memory order here.
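The two orderings the reviewer compares, written in `std::atomic` terms for context. Whether acq_rel suffices depends on whether the add must participate in a single total order with unrelated atomics, which only seq_cst provides; this sketch just shows the two forms, it does not settle that question for the PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<int32_t> fanin_refcount{0};

// Equivalent of __atomic_fetch_add(..., __ATOMIC_SEQ_CST): the add joins
// the single total order over all seq_cst operations.
void add_seq_cst(int32_t n) {
    fanin_refcount.fetch_add(n, std::memory_order_seq_cst);
}

// Equivalent of __atomic_fetch_add(..., __ATOMIC_ACQ_REL): acquire+release
// on this variable only; typically cheaper on weakly ordered hardware.
void add_acq_rel(int32_t n) {
    fanin_refcount.fetch_add(n, std::memory_order_acq_rel);
}
```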

// Try to advance ring pointers
if (task_id == sched->last_task_alive) {
pto2_scheduler_advance_ring_pointers(sched);
pto2_scheduler_advance_ring_pointers(sched); // RISK: Multiple entries

medium

The comment // RISK: Multiple entries indicates a potential issue or complexity. Please provide more context on what this risk entails, such as potential race conditions, incorrect pointer advancement, or performance bottlenecks, and how it is being addressed or mitigated.

if (!is_same_memref(pre_task_output)) {
return OverlapStatus::NO_OVERLAP;
}
debug_assert(is_same_memref(pre_task_output));

medium

Changing the is_overlap function to assert is_same_memref rather than returning NO_OVERLAP implies a stronger precondition for calling this function. This is a correctness improvement, but it relies on all calling sites guaranteeing is_same_memref is true before calling is_overlap. Ensure that all call sites adhere to this new invariant to prevent assertion failures in production.
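The shape of the change the reviewer describes, sketched with hypothetical names: the same-memref check becomes a precondition the caller must establish, rather than a case the function tolerates by returning NO_OVERLAP.

```cpp
#include <cassert>

enum class OverlapStatus { NO_OVERLAP, OVERLAP };

struct Range { long begin; long end; };  // half-open [begin, end)

// Precondition-style version: callers must guarantee both ranges refer
// to the same underlying buffer before asking about overlap. Violations
// now fail loudly in debug builds instead of being silently absorbed.
inline OverlapStatus overlaps(const Range& a, const Range& b, bool same_memref) {
    assert(same_memref && "caller must guarantee same underlying memref");
    return (a.begin < b.end && b.begin < a.end) ? OverlapStatus::OVERLAP
                                                : OverlapStatus::NO_OVERLAP;
}
```

The trade-off is exactly the reviewer's point: the assert documents the invariant, but every call site must be audited to establish it.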

ChaoWao and others added 3 commits March 2, 2026 17:17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the refactor/scheduler-api-tensormap-cpp-inline branch from cb8b4eb to d4752ff Compare March 2, 2026 12:00
@ChaoWao ChaoWao marked this pull request as draft March 2, 2026 12:27
@ChaoWao
Copy link
Copy Markdown
Collaborator Author

ChaoWao commented Mar 3, 2026

Split into #156 #158 #160 .

@ChaoWao ChaoWao closed this Mar 3, 2026
@ChaoWao ChaoWao deleted the refactor/scheduler-api-tensormap-cpp-inline branch March 3, 2026 06:46
