Refactor: migrate executor to scheduler API with CAS-based ring advancement by ChaoWao · Pull Request #160 · hw-native-sys/simpler

ChaoWao · 2026-03-02T15:29:00Z

Summary

- Migrate the AICPU executor from local state arrays to the centralized PTO2SchedulerState API (on_task_complete, get_ready_task, init_task).
- Replace the CONSUMED-dependent advance_ring_pointers with CAS-based, lock-free direct advancement of header->last_task_alive, matching the pre-migration concurrency model.
- Fix a task ring deadlock with small windows (16 slots), where the scope_end dependency created a circular wait: orchestrator blocked on full ring → tasks can't reach CONSUMED → scope_end can't run.
- Change prod_state == COMPLETED to prod_state >= COMPLETED in the orchestrator early-return path.
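
For reviewers unfamiliar with the pattern, here is a minimal sketch of CAS-based advancement of a ring tail such as header->last_task_alive. It is an illustration under assumptions, not the code in this PR: `RingHeader`, `Ring`, `consumed_id`, `kWindow`, and `mark_consumed_and_advance` are hypothetical names, and default (sequentially consistent) atomics are used so that two threads finishing adjacent tasks cannot both miss each other's update.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr int32_t kWindow = 16;  // assumed ring size, matching the 16-slot test window

// Hypothetical stand-in for the shared-memory header.
struct RingHeader {
    std::atomic<int32_t> last_task_alive{0};  // oldest task id still holding a slot
};

struct Ring {
    RingHeader header;
    // Each slot records the id of the last task consumed in it, so slot reuse
    // needs no separate "clear" step.
    std::array<std::atomic<int32_t>, kWindow> consumed_id;
    Ring() {
        for (auto& c : consumed_id) c.store(-1);
    }
};

// Mark task_id as consumed, then advance last_task_alive past every
// contiguous consumed task. Safe to call from several threads at once.
// Returns the tail value observed when this call stopped advancing.
inline int32_t mark_consumed_and_advance(Ring& ring, int32_t task_id) {
    assert(task_id >= 0);
    ring.consumed_id[task_id % kWindow].store(task_id);

    int32_t tail = ring.header.last_task_alive.load();
    for (;;) {
        // Stop at the first task that has not been consumed yet.
        if (ring.consumed_id[tail % kWindow].load() != tail) break;
        // Only one thread wins the CAS for a given tail value. On failure,
        // `tail` is reloaded with the current value and the loop re-checks.
        if (ring.header.last_task_alive.compare_exchange_weak(tail, tail + 1)) {
            ++tail;
        }
    }
    return tail;
}
```

In this sketch, finishing task 1 before task 0 leaves the tail at 0, and finishing task 0 afterwards moves it straight to 2, with no scope_end step involved.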

Testing

- Simulation tests pass (4/4 tensormap_and_ringbuffer)
- Hardware device tests pass (7/7 tensormap_and_ringbuffer, including paged_attention_taskring with a 16-slot window, which previously deadlocked)
- Python unit tests pass (21/21)

Fixes #159

gemini-code-assist · 2026-03-02T15:29:35Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring of the AICPU executor's task scheduling and ring buffer management. The core changes involve migrating the executor to a new, centralized PTO2SchedulerState API, which simplifies task state handling and ready queue management. A key improvement is the adoption of a lock-free, CAS-based mechanism for advancing ring pointers, replacing the previous CONSUMED-dependent logic. This change, along with decoupling scope_end from ring advancement, directly addresses and resolves a critical task ring deadlock issue that occurred with small task windows. Additionally, a minor but important fix to the orchestrator's task state checking enhances the overall robustness of the system. These updates collectively aim to improve concurrency, stability, and efficiency in task execution.

Highlights

- Scheduler API Migration: The AICPU executor has been migrated from using local state arrays and custom ready queues to a centralized PTO2SchedulerState API, streamlining task management through on_task_complete, get_ready_task, and init_task functions.
- CAS-based Ring Advancement: The previous CONSUMED-dependent advance_ring_pointers mechanism has been replaced with a lock-free, Compare-and-Swap (CAS) based direct advancement on header->last_task_alive, which aligns with the pre-migration concurrency model and improves efficiency.
- Deadlock Resolution: A task ring deadlock, previously occurring with small window sizes (e.g., 16 slots) due to scope_end dependencies creating circular waits, has been fixed by decoupling scope_end from ring advancement and handling task consumption directly upon completion.
- Robust Task State Check: The orchestrator's early-return path now uses prod_state >= COMPLETED instead of prod_state == COMPLETED, making task state checking more robust against various completion states.
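
To illustrate the state check, here is a small sketch that assumes task states are encoded as increasing integers with CONSUMED ordered after COMPLETED. Only COMPLETED and CONSUMED are named in this PR; the other enumerators, the numeric values, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed encoding: later lifecycle stages have larger values.
enum TaskState : int32_t { PENDING = 0, READY = 1, RUNNING = 2, COMPLETED = 3, CONSUMED = 4 };

// Sanity check on the assumption above: CONSUMED must sort after COMPLETED.
inline void check_assumed_ordering() { assert(COMPLETED < CONSUMED); }

// Equality check: treats a producer that already moved on to CONSUMED as unfinished.
constexpr bool producer_done_eq(int32_t prod_state) { return prod_state == COMPLETED; }

// Ordered check: any state at or past COMPLETED counts as finished.
constexpr bool producer_done_ge(int32_t prod_state) { return prod_state >= COMPLETED; }
```

Under that assumption, a producer that has already reached CONSUMED fails the equality test but passes the ordered one, which is the case the early-return path needs to handle once tasks can be consumed immediately on completion.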

Changelog

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
- Removed local ready queue structures and related constants.
- Eliminated the SpinLock struct and its usage for local ready queues.
- Removed the enqueue_ready_task_with_profiling helper function.
- Removed sm_header_ready_, orch_pointers_ready_, and orchestrator ready queue pointers from the AicpuExecutor struct.
- Added a runtime_init_ready_ atomic flag to synchronize runtime initialization.
- Removed the initialization of local ready queues in the init() method.
- Removed the wait for sm_header_ready_ in resolve_and_dispatch_pto2().
- Removed initializations for s_pto2_fanin_refcount, s_pto2_task_completed, and s_pto2_completed_by_task.
- Removed the wait for orch_pointers_ready_.
- Updated profiling counters and removed detailed phase profiling.
- Integrated pto2_scheduler_on_task_complete for task completion handling.
- Replaced the old fanout traversal and last_task_alive advancement logic with calls to the new scheduler API.
- Removed next_scan_index_ and the associated task scanning logic.
- Removed the orch_ready_queue_ draining mechanism.
- Updated the stall diagnosis logic to utilize the PTO2SchedulerState.
- Revised the profiling output format for clarity.
- Initialized the global rt pointer to nullptr in the orchestrator thread.
- Added an argument count validation check during orchestrator setup.
- Removed a redundant argument count check.
- Removed the signal for sm_header_ready_.
- Removed the orchestrator's direct access to aicpu_fanin_refcount, aicpu_task_completed, aicpu_completed_by_task, and orch_ready_queue.
- Implemented runtime_init_ready_ signaling and waiting for proper synchronization.
- Removed local ready queue cleanup in the deinit() method.
- Removed resets for next_scan_index_, sm_header_ready_, and orch_pointers_ready_ in deinit().
- Updated diagnose_stuck_state to query ready queue counts from the scheduler.
src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
- Removed specific comments regarding RELAXED memory order in pto2_add_consumer_to_producer.
- Removed the pto2_alloc_packed_buffer function.
- Refactored pto2_submit_task to remove total_output_size initialization and related logic from its first pass.
- Moved the output buffer allocation and address assignment earlier within pto2_submit_task.
- Removed the direct call to pto2_add_consumer_to_producer from pto2_submit_task when a scheduler is active.
- Modified pto2_submit_task to directly manage fanin/fanout lists and integrate with PTO2SchedulerState when a scheduler is present.
- Removed the STEP 5b logic for early-ready task draining.
src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
- Removed the ORCH_READY_QUEUE and associated members from the PTO2OrchestratorState struct.
- Removed the declaration of pto2_add_consumer_to_producer.
src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
- Initialized sched->last_heap_consumed to zero.
- Reset sched->last_heap_consumed to zero during scheduler reset.
- Removed the check_and_handle_consumed static helper function.
- Refactored pto2_scheduler_on_task_complete to directly manage the CONSUMED state transition and implement CAS-based ring advancement.
- Modified pto2_scheduler_on_scope_end to become a no-operation.
- Removed the pto2_scheduler_release_producer function.
- Modified pto2_scheduler_advance_ring_pointers to become a no-operation.
- Modified pto2_scheduler_sync_to_sm to become a no-operation.
- Updated pto2_scheduler_is_done to directly read last_task_alive from shared memory.
src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
- Added last_heap_consumed to the PTO2SchedulerState struct.
- Removed the declaration of pto2_scheduler_release_producer.
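
The runtime_init_ready_ handshake listed above can be sketched as a release/acquire flag. This is a minimal illustration under assumptions, not the executor's code: `Runtime` stands in for PTO2Runtime, and `Executor`, `orchestrator_thread`, `scheduler_thread`, and `scheduler_slots` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for PTO2Runtime.
struct Runtime {
    int scheduler_slots = 0;
};

struct Executor {
    Runtime* rt = nullptr;                         // written once by the orchestrator
    std::atomic<bool> runtime_init_ready_{false};  // publishes `rt` to other threads

    // Orchestrator side: install the runtime, then publish it.
    void orchestrator_thread(Runtime* created) {
        rt = created;
        // Release: the write to `rt` is visible to any thread that
        // observes the flag as true with an acquire load.
        runtime_init_ready_.store(true, std::memory_order_release);
    }

    // Scheduler side: wait for the flag before touching `rt`.
    int scheduler_thread() {
        while (!runtime_init_ready_.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        assert(rt != nullptr);
        return rt->scheduler_slots;
    }
};
```

In a real run the orchestrator and scheduler methods would execute on different threads; calling `orchestrator_thread` first and `scheduler_thread` second on a single thread is enough to exercise the sketch.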

Activity

ChaoWao opened this pull request.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request refactors the PTO2 task scheduling mechanism by migrating task completion, ready queue management, and ring buffer advancement logic from aicpu_executor.cpp to a new centralized scheduler API (pto_scheduler.cpp). Key changes include removing local sharded ready queues, spinlocks, and associated profiling counters from AicpuExecutor, replacing them with calls to pto2_scheduler_get_ready_task and pto2_scheduler_on_task_complete. The orchestrator's pto2_submit_task function is updated to interact directly with the new scheduler for fanin/fanout management and task state updates, eliminating the need for an orchestrator-specific ready queue and associated synchronization flags. Profiling output is simplified, and AicpuExecutor now uses a global PTO2Runtime pointer for scheduler interaction, with a new runtime_init_ready_ flag for synchronization during initialization. Review comments highlight the removal of redundant code blocks and variables that are no longer necessary due to the adoption of the new scheduler API.
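
As a rough picture of the dispatch loop this summary describes, the sketch below uses a toy, single-threaded scheduler. It is an assumption-laden illustration: `ToyScheduler` and `run_all` are hypothetical, the method names only echo the API names mentioned in this PR, and the real PTO2SchedulerState signatures are not shown here. The toy also assumes every task is registered before any task runs.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Toy stand-in for a centralized scheduler state (single-threaded).
struct ToyScheduler {
    struct Task {
        int32_t fanin = 0;            // producers that have not completed yet
        std::vector<int32_t> fanout;  // consumers to notify on completion
    };
    std::vector<Task> tasks;
    std::deque<int32_t> ready;

    // Register a task and its producers; it is ready at once if it has none.
    int32_t init_task(const std::vector<int32_t>& producers) {
        const int32_t id = static_cast<int32_t>(tasks.size());
        tasks.emplace_back();
        tasks[id].fanin = static_cast<int32_t>(producers.size());
        for (int32_t p : producers) {
            assert(p >= 0 && p < id);  // producers must be registered first
            tasks[p].fanout.push_back(id);
        }
        if (producers.empty()) ready.push_back(id);
        return id;
    }

    // Pop one ready task; returns false when nothing is ready.
    bool get_ready_task(int32_t& out) {
        if (ready.empty()) return false;
        out = ready.front();
        ready.pop_front();
        return true;
    }

    // Notify consumers; a consumer becomes ready when its last producer finishes.
    void on_task_complete(int32_t id) {
        for (int32_t c : tasks[id].fanout) {
            if (--tasks[c].fanin == 0) ready.push_back(c);
        }
    }
};

// Executor loop: the caller keeps no refcounts or ready queues of its own.
inline std::vector<int32_t> run_all(ToyScheduler& s) {
    std::vector<int32_t> order;
    int32_t id = 0;
    while (s.get_ready_task(id)) {
        order.push_back(id);  // "execute" the task
        s.on_task_complete(id);
    }
    return order;
}
```

The point of the shape is that the executor only asks for work and reports completion; fanin tracking and ready-queue bookkeeping live behind the scheduler interface.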

I am having trouble creating individual review comments, so my feedback is listed below by file and line range.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (657-790)

This entire block of code, which handles task completion, fanout traversal, and TaskRing advancement, can be removed as it's now managed by the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (970-1008)

This entire block of code, which handles early-ready drain, can be removed as it's no longer needed with the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (817-889)

This entire block of code, which handles task dispatch from local ready queues, can be removed as it's now managed by the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1081)

Consider removing tasks_per_loop since it's not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (637)

This line is no longer needed, as the task completion logic is now handled by the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (573)

Consider removing ready_pop_own and ready_pop_steal since they are not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (570-573)

Consider removing sched_complete_ready_wait and related variables since they are not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1079)

Consider removing sched_early_ready_cycle since it's not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (558)

Consider removing sched_early_ready_cycle since it's not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1093-1094)

Consider removing the early_ready-related logs, since they are no longer used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1100-1102)

Consider updating the dispatch log message to reflect the new task dispatch mechanism.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1105-1109)

Consider removing the lock contention logs since the local ready queues are removed.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1110-1127)

Consider removing the lock contention logs since the local ready queues are removed.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1445-1450)

This block of code, which cleans up runtime execution state, can be removed as it's no longer needed with the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1465-1471)

This block of code, which resets orchestrator ready queue pointers, can be removed as it's no longer needed with the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1507-1510)

This entire block of code, which calculates ready queue sizes from local state, can be removed as it's now managed by the scheduler API.

…cement

Migrate the AICPU executor from local state arrays (sharded ready queues, fanin refcounts, task completion tracking) to the centralized scheduler API (pto2_scheduler_on_task_complete, pto2_scheduler_get_ready_task, init_task). Ring advancement uses CAS-based lock-free writes to header->last_task_alive with ticket-based heap_tail serialization, matching the pre-migration executor's concurrency model and avoiding the scope_end dependency that would deadlock with small task windows.

- Replace per-thread state arrays with PTO2SchedulerState member calls
- Thread 3 (orchestrator) creates PTO2Runtime; threads 0-2 wait on init flag
- Tasks transition directly to CONSUMED in on_task_complete after fanout notifications, enabling immediate slot reuse without scope_end
- Fix prod_state == COMPLETED to >= COMPLETED in orchestrator early-return
- Simplify orchestrator by removing non-scheduler codepath for AICPU mode
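
One plausible reading of "ticket-based heap_tail serialization" is that each writer holds a sequence number and may publish heap_tail only when its number comes up, so updates land in order even though the writers themselves race. The sketch below shows that idea only; it is a guess at the technique, not the PR's implementation, and `HeapTailGate`, `next_ticket`, and `try_publish` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct HeapTailGate {
    std::atomic<uint64_t> next_ticket{0};  // ticket allowed to publish next
    std::atomic<uint64_t> heap_tail{0};    // value being serialized

    // Try to publish new_tail on behalf of `ticket`.
    // Returns false if an earlier ticket has not published yet; the caller
    // is expected to retry later. Each ticket must be used by one writer only.
    bool try_publish(uint64_t ticket, uint64_t new_tail) {
        const uint64_t turn = next_ticket.load(std::memory_order_acquire);
        assert(ticket >= turn);  // a ticket publishes at most once
        if (turn != ticket) return false;
        heap_tail.store(new_tail, std::memory_order_relaxed);
        // Release: the heap_tail write is visible before the next ticket's turn.
        next_ticket.store(ticket + 1, std::memory_order_release);
        return true;
    }
};
```

With this gate, a writer holding ticket 1 is turned away until ticket 0 has published, so heap_tail never moves out of order.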

gemini-code-assist Bot reviewed Mar 2, 2026

ChaoWao force-pushed the fix/executor-scheduler-migration branch from 56891b1 to d2d5fd0 on March 3, 2026 03:24

poursoul approved these changes Mar 3, 2026

jvjhfhg approved these changes Mar 3, 2026

jvjhfhg merged commit 268c337 into main Mar 3, 2026
3 checks passed

jvjhfhg deleted the fix/executor-scheduler-migration branch March 3, 2026 06:37

This was referenced Mar 3, 2026

Refactor: migrate scheduling to scheduler API, C++ member API, inline hot paths #155

Closed

Support: upgrade profiling pipeline with TensorMap instrumentation #167

Merged
