
Refactor: migrate scheduling to scheduler API, C++ member API, inline hot paths #155

Closed
ChaoWao wants to merge 3 commits into main from refactor/scheduler-api-tensormap-cpp-inline

Conversation

ChaoWao (Collaborator) commented Mar 2, 2026

Summary

  • Scheduler API migration: Move scheduling logic (fanin/fanout traversal, ready queue management, watermark advancement) from aicpu_executor.cpp into pto_scheduler.h/cpp, reducing the executor from ~1600 to ~1100 lines
  • TensorMap C++ member API: Convert free functions to struct member functions with hot paths (lookup, insert, hash) inlined in the header
  • Ring buffer inlining: Move HeapRing, TaskRing, DepListPool alloc/try_alloc implementations from .cpp to .h for inlining
  • Atomic fanin arrive: Use __ATOMIC_ACQ_REL for fanin refcount increments with CAS-based PENDING→READY state transition to prevent double-enqueue
  • ACK/FIN protocol: Update the executor to use bit-packed task state registers (EXTRACT_TASK_STATE/EXTRACT_TASK_ID) instead of AICoreStatus::IDLE/BUSY
  • Unified logging: Replace fprintf/printf with LOG_INFO/LOG_WARN/LOG_ERROR macros throughout ring buffer and tensormap modules
  • Phase profiling integration: Add perf_aicpu_init_phase_profiling, TensorMap lookup stats, and orchestrator summary export to shared memory
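The atomic fanin-arrive bullet above can be sketched in a few lines. This is a minimal illustration of the pattern described (acq_rel refcount increment plus a CAS-guarded PENDING→READY transition), not the PR's actual code; all names here are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative task states mirroring the PENDING -> READY transition
// described above; names are invented for this sketch.
enum TaskState : uint32_t { PENDING = 0, READY = 1 };

struct TaskSlot {
    std::atomic<int32_t> fanin_refcount{0};   // producers arrived so far
    std::atomic<uint32_t> state{PENDING};
};

// Called once by each finishing producer. Returns true for exactly one
// caller: the one that both completes the fanin count and wins the CAS.
// The CAS is what prevents the consumer from being enqueued twice.
inline bool arrive_and_try_ready(TaskSlot& t, int32_t fanin_count) {
    // acq_rel: release this producer's output writes, acquire the
    // writes of producers that arrived earlier.
    if (t.fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1 == fanin_count) {
        uint32_t expected = PENDING;
        return t.state.compare_exchange_strong(
            expected, READY, std::memory_order_acq_rel);
    }
    return false;
}
```

Only the `true` caller would push the task onto a ready queue, so concurrent arrivals cannot double-enqueue it.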

Testing

  • Simulation tests pass (./ci.sh -p a2a3sim — 10/10)
  • Hardware tests pass (./ci.sh -p a2a3 -d 4-7 --parallel)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring of the PTO2 runtime, centralizing scheduling logic into a dedicated API and modernizing memory management components. By converting TensorMap and ring buffer operations to C++ member functions and inlining critical code paths, the changes aim to enhance code structure, boost performance, and improve maintainability. The update also standardizes logging and expands profiling capabilities to provide deeper insights into runtime behavior.

Highlights

  • Scheduler API Migration: Core scheduling logic, including fanin/fanout traversal, ready queue management, and watermark advancement, was extracted from aicpu_executor.cpp into a new pto_scheduler.h/cpp API, significantly reducing the executor's codebase from ~1600 to ~1100 lines.
  • TensorMap C++ Member API: The pto_tensormap free functions were refactored into PTO2TensorMap class member functions, with performance-critical paths such as lookup, insert, and hash operations inlined directly in the header for improved efficiency.
  • Ring Buffer Inlining: Implementations for HeapRing, TaskRing, and DepListPool allocation and try-allocation were moved from their respective .cpp files to their .h headers, enabling compiler inlining for hot paths.
  • Atomic Fanin Handling: The fanin refcount increments now utilize __ATOMIC_ACQ_REL memory ordering, coupled with a CAS-based PENDING to READY state transition, to prevent tasks from being double-enqueued in ready queues.
  • ACK/FIN Protocol Update: The executor was updated to use bit-packed task states (EXTRACT_TASK_STATE/EXTRACT_TASK_ID) instead of the older AICoreStatus::IDLE/BUSY enumeration for task acknowledgment and finalization.
  • Unified Logging: All fprintf/printf calls within the ring buffer and tensormap modules were replaced with standardized LOG_INFO/LOG_WARN/LOG_ERROR macros for consistent logging practices.
  • Profiling Enhancements: New profiling integrations include perf_aicpu_init_phase_profiling, detailed TensorMap lookup statistics, and the export of orchestrator summary data to shared memory.
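The ACK/FIN bit-packing highlight can be made concrete with a small sketch. The actual EXTRACT_TASK_STATE/EXTRACT_TASK_ID bit layout is not shown in this PR page, so the field widths below are assumptions chosen only to illustrate the pack/extract pattern.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for illustration: low 2 bits carry the task state,
// the remaining bits carry the task id. The PR's real encoding may differ.
constexpr uint32_t STATE_BITS = 2;
constexpr uint32_t STATE_MASK = (1u << STATE_BITS) - 1;

constexpr uint32_t extract_task_state(uint32_t reg) { return reg & STATE_MASK; }
constexpr uint32_t extract_task_id(uint32_t reg)    { return reg >> STATE_BITS; }
constexpr uint32_t pack_task(uint32_t id, uint32_t state) {
    return (id << STATE_BITS) | (state & STATE_MASK);
}
```

Packing both fields into one register read lets the executor observe task id and state atomically, which is the point of replacing the separate IDLE/BUSY enum.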


Changelog
  • src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Removed legacy ready queue structures and related logic.
    • Migrated task completion and ready task retrieval to pto_scheduler API calls.
    • Streamlined initialization and deinitialization by removing orchestrator-specific synchronization flags and ready queue pointers.
    • Updated profiling counters and stall diagnosis to reflect the new scheduler architecture.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Updated calls to TensorMap and TaskRing functions to use their new C++ member API.
    • Refactored output buffer allocation within pto2_submit_task.
    • Adjusted fanin list finalization to integrate with the new scheduler state management.
    • Replaced direct fprintf/printf calls with unified logging macros.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
    • Removed orchestrator-specific ready queue declarations.
    • Converted pto2_alloc_packed_buffer into a member function of PTO2OrchestratorState.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.cpp
    • Removed standalone implementations of heap and task ring allocation/query functions.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
    • Moved heap and task ring allocation/query functions into their respective PTO2HeapRing and PTO2TaskRing structs as inline member functions.
    • Consolidated spinlock-related definitions.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
    • Removed output_index and num_outputs fields from PTO2TaskDescriptor.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
    • Added spinlocks to PTO2ReadyQueue push and pop operations for thread safety.
    • Adjusted memory allocation sizes for per-task state arrays.
    • Refactored task state management functions (init_task, check_ready) into PTO2SchedulerState member methods.
    • Updated atomic memory orders for fanin/fanout refcounts and task state transitions.
    • Introduced fanout locking during task completion to ensure consistency.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
    • Introduced a spinlock member to PTO2ReadyQueue.
    • Converted PTO2SchedulerState to a struct and integrated task slot calculation and fanin/task initialization logic as member functions.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.cpp
    • Refactored all TensorMap-related free functions into PTO2TensorMap member functions.
    • Updated internal logic to use pointers instead of integer offsets for linked list management within the TensorMap.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
    • Converted the header to use #pragma once.
    • Transformed PTO2TensorMapEntry and PTO2TensorMap into C++ classes/structs with member functions for initialization, destruction, lookup, insertion, and cleanup.
    • Updated linked list management within PTO2TensorMapEntry to use direct pointers.
  • src/runtime/tensormap_and_ringbuffer/runtime/tensor_pool.h
    • Initialized ref_count in TensorData constructors.
    • Simplified overlap detection logic in is_overlap.
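The changelog entry "Added spinlocks to PTO2ReadyQueue push and pop operations" describes a standard pattern; here is a minimal sketch of a spinlock-guarded ring queue. The layout, names, and fixed power-of-two capacity are assumptions for illustration, not the actual PTO2ReadyQueue.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simple test-and-set spinlock, acquire on lock / release on unlock.
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

template <uint32_t N>  // N must be a power of two
struct ReadyQueue {
    int32_t  slots[N];
    uint32_t head = 0, tail = 0;   // monotonically increasing; wrap via mask
    SpinLock lock;

    bool push(int32_t task_id) {
        lock.lock();
        bool ok = (tail - head) < N;            // not full
        if (ok) slots[tail++ & (N - 1)] = task_id;
        lock.unlock();
        return ok;
    }
    bool pop(int32_t* out) {
        lock.lock();
        bool ok = head != tail;                 // not empty
        if (ok) *out = slots[head++ & (N - 1)];
        lock.unlock();
        return ok;
    }
};
```

Guarding both ends with one lock keeps head/tail mutually consistent when multiple AICPU threads push and pop concurrently.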
Activity
  • Simulation tests passed (./ci.sh -p a2a3sim — 10/10).
  • Hardware tests are pending (./ci.sh -p a2a3 -d 4-7 --parallel).

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


gemini-code-assist (Bot) left a comment


Code Review

This pull request significantly refactors the scheduling logic by migrating it from aicpu_executor.cpp to a dedicated scheduler API (pto_scheduler.h/cpp). It also converts several free functions into C++ member functions for TensorMap, HeapRing, and TaskRing, improving modularity and enabling inlining for hot paths. The changes enhance thread safety by introducing spinlocks for ready queue operations and refining memory ordering for atomic operations. Overall, these changes contribute to better code organization, maintainability, and correctness in a multi-threaded environment.

Comment on lines +74 to +75

static PTO2Runtime *rt{nullptr};

medium

The use of a global static PTO2Runtime *rt{nullptr}; can introduce challenges for testability and maintainability, especially in a multi-threaded context where its lifecycle and access patterns need careful management. While static limits its scope to the translation unit, its global nature still warrants careful consideration for potential side effects or unexpected interactions if not strictly controlled.

Comment on lines +659 to +662
#if PTO2_ORCH_PROFILING
sched_yield_count++;
#endif
CYCLE_COUNT_LAP(sched_yield_cycle);

medium

The comment // RISK: Multiple entries on line 662 is vague. It would be beneficial to elaborate on the specific risk associated with multiple entries, such as potential race conditions, data corruption, or performance implications, and how this risk is mitigated or handled within the current design.

Comment on lines 215 to -218
if (__atomic_load_n(&orch->aicpu_task_completed[prod_slot], __ATOMIC_ACQUIRE) >= 2 &&
// RELAXED is sufficient: the ACQUIRE on aicpu_task_completed above
// synchronizes with the RELEASE on task_completed in the scheduler,
// and completed_by_task is stored (with RELEASE) sequenced-before
// task_completed — so it is visible after the ACQUIRE load above.

medium

The removal of comments explaining the memory ordering for aicpu_completed_by_task and aicpu_fanin_refcount is concerning. These comments provided crucial context for understanding the synchronization guarantees in a multi-threaded environment. If the underlying logic has changed such that these specific explanations are no longer relevant, it should be explicitly stated, or the new synchronization rationale should be documented.

int32_t slot = sched->pto2_task_slot(task_id);

int32_t early_finished = 0;
task->fanin_count = fanin_count + 1; // +1 redundance for not being ready too early

medium

The comment // +1 redundance for not being ready too early is unclear. Please clarify why this redundancy is needed and how it prevents premature readiness. A more detailed explanation would improve understanding of this specific synchronization mechanism.
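One plausible reading of the `+1` trick, sketched below: the extra count acts as a guard that keeps the task from becoming ready while its fanin list is still being wired up; whoever submits the task drops the guard after wiring completes. This is an assumed interpretation of the quoted comment, with invented names, not the PR's verified design.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct Task {
    std::atomic<int32_t> fanin_remaining{0};
};

// Returns true exactly when the count drops to zero (task becomes ready).
inline bool release_one(Task& t) {
    return t.fanin_remaining.fetch_sub(1, std::memory_order_acq_rel) == 1;
}

inline void init_task(Task& t, int32_t fanin_count) {
    // +1 guard: even if every producer finishes while the fanin list is
    // still being built, the count stays >= 1, so the task cannot be
    // declared ready prematurely.
    t.fanin_remaining.store(fanin_count + 1, std::memory_order_relaxed);
}

// Called once after all producers are registered; drops the guard.
// May itself be the call that makes the task ready.
inline bool submit_done(Task& t) {
    return release_one(t);
}
```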

pto2_fanout_unlock(producer);
}
if (early_finished > 0) {
__atomic_fetch_add(&sched->fanin_refcount[slot], early_finished, __ATOMIC_SEQ_CST);

medium

The use of __ATOMIC_SEQ_CST for __atomic_fetch_add(&sched->fanin_refcount[slot], early_finished, __ATOMIC_SEQ_CST); might be overly strong. In pto_scheduler.h, the release_fanin_and_check_ready function uses __ATOMIC_ACQ_REL for similar fanin_refcount operations. If ACQ_REL is sufficient there, it might also be sufficient here, potentially offering a performance improvement. Please ensure consistency or provide a justification for the stronger memory order here.
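The two orderings the reviewer compares, written in `std::atomic` terms for context. Whether acq_rel suffices depends on whether the add must participate in a single total order with unrelated atomics, which only seq_cst provides; this sketch just shows the two forms, it does not settle that question for the PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<int32_t> fanin_refcount{0};

// Equivalent of __atomic_fetch_add(..., __ATOMIC_SEQ_CST): the add joins
// the single total order over all seq_cst operations.
void add_seq_cst(int32_t n) {
    fanin_refcount.fetch_add(n, std::memory_order_seq_cst);
}

// Equivalent of __atomic_fetch_add(..., __ATOMIC_ACQ_REL): acquire+release
// on this variable only; typically cheaper on weakly ordered hardware.
void add_acq_rel(int32_t n) {
    fanin_refcount.fetch_add(n, std::memory_order_acq_rel);
}
```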

// Try to advance ring pointers
if (task_id == sched->last_task_alive) {
pto2_scheduler_advance_ring_pointers(sched);
pto2_scheduler_advance_ring_pointers(sched); // RISK: Multiple entries

medium

The comment // RISK: Multiple entries indicates a potential issue or complexity. Please provide more context on what this risk entails, such as potential race conditions, incorrect pointer advancement, or performance bottlenecks, and how it is being addressed or mitigated.

if (!is_same_memref(pre_task_output)) {
return OverlapStatus::NO_OVERLAP;
}
debug_assert(is_same_memref(pre_task_output));

medium

Changing the is_overlap function to assert is_same_memref rather than returning NO_OVERLAP implies a stronger precondition for calling this function. This is a correctness improvement, but it relies on all calling sites guaranteeing is_same_memref is true before calling is_overlap. Ensure that all call sites adhere to this new invariant to prevent assertion failures in production.
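The shape of the change the reviewer describes, sketched with hypothetical names: the same-memref check becomes a precondition the caller must establish, rather than a case the function tolerates by returning NO_OVERLAP.

```cpp
#include <cassert>

enum class OverlapStatus { NO_OVERLAP, OVERLAP };

struct Range { long begin; long end; };  // half-open [begin, end)

// Precondition-style version: callers must guarantee both ranges refer
// to the same underlying buffer before asking about overlap. Violations
// now fail loudly in debug builds instead of being silently absorbed.
inline OverlapStatus overlaps(const Range& a, const Range& b, bool same_memref) {
    assert(same_memref && "caller must guarantee same underlying memref");
    return (a.begin < b.end && b.begin < a.end) ? OverlapStatus::OVERLAP
                                                : OverlapStatus::NO_OVERLAP;
}
```

The trade-off is exactly the reviewer's point: the assert documents the invariant, but every call site must be audited to establish it.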

ChaoWao and others added 3 commits March 2, 2026 17:17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the refactor/scheduler-api-tensormap-cpp-inline branch from cb8b4eb to d4752ff Compare March 2, 2026 12:00
@ChaoWao ChaoWao marked this pull request as draft March 2, 2026 12:27
@ChaoWao
Copy link
Copy Markdown
Collaborator Author

ChaoWao commented Mar 3, 2026

Split into #156 #158 #160 .

@ChaoWao ChaoWao closed this Mar 3, 2026
@ChaoWao ChaoWao deleted the refactor/scheduler-api-tensormap-cpp-inline branch March 3, 2026 06:46
