
Refactor: migrate executor to scheduler API with CAS-based ring advancement #160

Merged: jvjhfhg merged 1 commit into main from fix/executor-scheduler-migration on Mar 3, 2026
@ChaoWao (Collaborator) commented Mar 2, 2026

Summary

  • Migrate AICPU executor from local state arrays to centralized PTO2SchedulerState API (on_task_complete, get_ready_task, init_task)
  • Replace CONSUMED-dependent advance_ring_pointers with CAS-based lock-free direct advancement on header->last_task_alive, matching the pre-migration concurrency model
  • Fix task ring deadlock with small windows (16 slots) where scope_end dependency created a circular wait: orchestrator blocked on full ring → tasks can't reach CONSUMED → scope_end can't run
  • Change the orchestrator early-return check from prod_state == COMPLETED to prod_state >= COMPLETED
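
To make the second bullet concrete, here is a minimal sketch of CAS-based advancement of last_task_alive. Only the field name last_task_alive and the 16-slot window come from this PR; RingHeader, TaskRing, try_submit, complete_and_advance, next_task_id, and the state constants are hypothetical stand-ins, and the real code operates on a shared-memory header rather than a local struct. The sketch uses the default sequentially consistent ordering for simplicity.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins; only `last_task_alive` and the 16-slot window
// are taken from the PR text.
constexpr int32_t kWindow = 16;
constexpr int kStateFree = 0;
constexpr int kStateRunning = 1;
constexpr int kStateConsumed = 2;

struct RingHeader {
    std::atomic<int32_t> last_task_alive{0};  // oldest task id still holding a slot
    std::atomic<int32_t> next_task_id{0};     // next id the orchestrator will submit
};

struct TaskRing {
    RingHeader header;
    std::atomic<int> state[kWindow];
    TaskRing() {
        for (auto& s : state) s.store(kStateFree);
    }
};

// Orchestrator side: returns false when every slot in the window is in use.
bool try_submit(TaskRing& r) {
    int32_t id = r.header.next_task_id.load();
    if (id - r.header.last_task_alive.load() >= kWindow) return false;
    r.state[id % kWindow].store(kStateRunning);
    r.header.next_task_id.store(id + 1);
    return true;
}

// Scheduler side: any thread that finishes a task marks it consumed and then
// tries to move last_task_alive past every leading consumed slot. No lock is
// taken; a failed CAS means another thread advanced first, and `alive` is
// refreshed with the current value before the loop condition is re-checked.
void complete_and_advance(TaskRing& r, int32_t task_id) {
    r.state[task_id % kWindow].store(kStateConsumed);
    int32_t alive = r.header.last_task_alive.load();
    while (alive < r.header.next_task_id.load() &&
           r.state[alive % kWindow].load() == kStateConsumed) {
        if (r.header.last_task_alive.compare_exchange_weak(alive, alive + 1)) {
            ++alive;
        }
    }
}
```

In this sketch a full ring is released as soon as the oldest task completes, with no scope_end step in between, which is the property the deadlock fix relies on.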

Testing

  • Simulation tests pass (4/4 tensormap_and_ringbuffer)
  • Hardware device tests pass (7/7 tensormap_and_ringbuffer, including paged_attention_taskring with 16-slot window that was previously deadlocking)
  • Python unit tests pass (21/21)

Fixes #159

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring of the AICPU executor's task scheduling and ring buffer management. The core changes involve migrating the executor to a new, centralized PTO2SchedulerState API, which simplifies task state handling and ready queue management. A key improvement is the adoption of a lock-free, CAS-based mechanism for advancing ring pointers, replacing the previous CONSUMED-dependent logic. This change, along with decoupling scope_end from ring advancement, directly addresses and resolves a critical task ring deadlock issue that occurred with small task windows. Additionally, a minor but important fix to the orchestrator's task state checking enhances the overall robustness of the system. These updates collectively aim to improve concurrency, stability, and efficiency in task execution.

Highlights

  • Scheduler API Migration: The AICPU executor has been migrated from using local state arrays and custom ready queues to a centralized PTO2SchedulerState API, streamlining task management through on_task_complete, get_ready_task, and init_task functions.
  • CAS-based Ring Advancement: The previous CONSUMED-dependent advance_ring_pointers mechanism has been replaced with a lock-free, Compare-and-Swap (CAS) based direct advancement on header->last_task_alive, which aligns with the pre-migration concurrency model and improves efficiency.
  • Deadlock Resolution: A task ring deadlock, previously occurring with small window sizes (e.g., 16 slots) due to scope_end dependencies creating circular waits, has been fixed by decoupling scope_end from ring advancement and handling task consumption directly upon completion.
  • Robust Task State Check: The orchestrator's early-return path now uses prod_state >= COMPLETED instead of prod_state == COMPLETED, making task state checking more robust against various completion states.
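
The last highlight can be illustrated with a small sketch. It assumes the task states are ordered integers with CONSUMED greater than COMPLETED; the enum values and both function names below are hypothetical, and only the == versus >= comparison comes from the PR. Because tasks now move to CONSUMED inside on_task_complete, an orchestrator that checks for exactly COMPLETED could presumably miss a producer that has already finished.

```cpp
// Hypothetical ordering; the only property relied on is COMPLETED < CONSUMED.
enum TaskState : int {
    PENDING = 0,
    READY = 1,
    RUNNING = 2,
    COMPLETED = 3,
    CONSUMED = 4
};

// Old check: true only while the producer sits in exactly COMPLETED.
inline bool producer_done_strict(int prod_state) {
    return prod_state == COMPLETED;
}

// New check: true for COMPLETED and for any later state such as CONSUMED.
inline bool producer_done(int prod_state) {
    return prod_state >= COMPLETED;
}
```

With this ordering, a producer observed in CONSUMED fails the strict check and passes the relaxed one.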

Changelog
  • src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Removed local ready queue structures and related constants.
    • Eliminated the SpinLock struct and its usage for local ready queues.
    • Removed the enqueue_ready_task_with_profiling helper function.
    • Removed sm_header_ready_, orch_pointers_ready_, and orchestrator ready queue pointers from the AicpuExecutor struct.
    • Added a runtime_init_ready_ atomic flag to synchronize runtime initialization.
    • Removed the initialization of local ready queues in the init() method.
    • Removed the wait for sm_header_ready_ in resolve_and_dispatch_pto2().
    • Removed initializations for s_pto2_fanin_refcount, s_pto2_task_completed, and s_pto2_completed_by_task.
    • Removed the wait for orch_pointers_ready_.
    • Updated profiling counters and removed detailed phase profiling.
    • Integrated pto2_scheduler_on_task_complete for task completion handling.
    • Replaced the old fanout traversal and last_task_alive advancement logic with calls to the new scheduler API.
    • Removed next_scan_index_ and the associated task scanning logic.
    • Removed the orch_ready_queue_ draining mechanism.
    • Updated the stall diagnosis logic to utilize the PTO2SchedulerState.
    • Revised the profiling output format for clarity.
    • Initialized the global rt pointer to nullptr in the orchestrator thread.
    • Added an argument count validation check during orchestrator setup.
    • Removed a redundant argument count check.
    • Removed the signal for sm_header_ready_.
    • Removed the orchestrator's direct access to aicpu_fanin_refcount, aicpu_task_completed, aicpu_completed_by_task, and orch_ready_queue.
    • Implemented runtime_init_ready_ signaling and waiting for proper synchronization.
    • Removed local ready queue cleanup in the deinit() method.
    • Removed resets for next_scan_index_, sm_header_ready_, and orch_pointers_ready_ in deinit().
    • Updated diagnose_stuck_state to query ready queue counts from the scheduler.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Removed specific comments regarding RELAXED memory order in pto2_add_consumer_to_producer.
    • Removed the pto2_alloc_packed_buffer function.
    • Refactored pto2_submit_task to remove total_output_size initialization and related logic from its first pass.
    • Moved the output buffer allocation and address assignment earlier within pto2_submit_task.
    • Removed the direct call to pto2_add_consumer_to_producer from pto2_submit_task when a scheduler is active.
    • Modified pto2_submit_task to directly manage fanin/fanout lists and integrate with PTO2SchedulerState when a scheduler is present.
    • Removed the STEP 5b logic for early-ready task draining.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
    • Removed the ORCH_READY_QUEUE and associated members from the PTO2OrchestratorState struct.
    • Removed the declaration of pto2_add_consumer_to_producer.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
    • Initialized sched->last_heap_consumed to zero.
    • Reset sched->last_heap_consumed to zero during scheduler reset.
    • Removed the check_and_handle_consumed static helper function.
    • Refactored pto2_scheduler_on_task_complete to directly manage the CONSUMED state transition and implement CAS-based ring advancement.
    • Modified pto2_scheduler_on_scope_end to become a no-operation.
    • Removed the pto2_scheduler_release_producer function.
    • Modified pto2_scheduler_advance_ring_pointers to become a no-operation.
    • Modified pto2_scheduler_sync_to_sm to become a no-operation.
    • Updated pto2_scheduler_is_done to directly read last_task_alive from shared memory.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
    • Added last_heap_consumed to the PTO2SchedulerState struct.
    • Removed the declaration of pto2_scheduler_release_producer.
Activity
  • ChaoWao opened this pull request.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the PTO2 task scheduling mechanism by migrating task completion, ready queue management, and ring buffer advancement logic from aicpu_executor.cpp to a new centralized scheduler API (pto_scheduler.cpp). Key changes include removing local sharded ready queues, spinlocks, and associated profiling counters from AicpuExecutor, replacing them with calls to pto2_scheduler_get_ready_task and pto2_scheduler_on_task_complete. The orchestrator's pto2_submit_task function is updated to interact directly with the new scheduler for fanin/fanout management and task state updates, eliminating the need for an orchestrator-specific ready queue and associated synchronization flags. Profiling output is simplified, and AicpuExecutor now uses a global PTO2Runtime pointer for scheduler interaction, with a new runtime_init_ready_ flag for synchronization during initialization. Review comments highlight the removal of redundant code blocks and variables that are no longer necessary due to the adoption of the new scheduler API.
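
The runtime_init_ready_ synchronization mentioned above can be sketched as a release/acquire flag. This is an assumption about how the flag is used, not the PR's code: Runtime stands in for PTO2Runtime, and g_rt, publish_runtime, and wait_for_runtime are hypothetical names.

```cpp
#include <atomic>
#include <thread>

struct Runtime {          // hypothetical stand-in for PTO2Runtime
    int window_size;
};

static Runtime* g_rt = nullptr;                        // global runtime pointer
static std::atomic<bool> runtime_init_ready{false};    // init flag

// Orchestrator thread: publish the runtime, then raise the flag.
// The release store makes the write to g_rt visible to any thread
// that later observes the flag as true with an acquire load.
void publish_runtime(Runtime* rt) {
    g_rt = rt;
    runtime_init_ready.store(true, std::memory_order_release);
}

// Scheduler threads: spin until the flag is set, then read the pointer.
Runtime* wait_for_runtime() {
    while (!runtime_init_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return g_rt;
}
```

In this arrangement the orchestrator thread would call publish_runtime once, and each scheduler thread would call wait_for_runtime before touching scheduler state.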

I am having trouble creating individual review comments, so my feedback is listed below.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (657-790)

high

This entire block of code, which handles task completion, fanout traversal, and TaskRing advancement, can be removed as it's now managed by the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (970-1008)

high

This entire block of code, which handles early-ready drain, can be removed as it's no longer needed with the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (817-889)

high

This entire block of code, which handles task dispatch from local ready queues, can be removed as it's now managed by the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1081)

medium

Consider removing tasks_per_loop since it's not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (637)

medium

This line is no longer needed, as the task completion logic is now handled by the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (573)

medium

Consider removing ready_pop_own and ready_pop_steal since they are not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (570-573)

medium

Consider removing sched_complete_ready_wait and related variables since they are not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1079)

medium

Consider removing sched_early_ready_cycle since it's not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (558)

medium

Consider removing sched_early_ready_cycle since it's not being used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1093-1094)

medium

Consider removing the early_ready related logs since they are no longer used after this change.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1100-1102)

medium

Consider updating the dispatch log message to reflect the new task dispatch mechanism.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1105-1109)

medium

Consider removing the lock contention logs since the local ready queues are removed.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1110-1127)

medium

Consider removing the lock contention logs since the local ready queues are removed.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1445-1450)

medium

This block of code, which cleans up runtime execution state, can be removed as it's no longer needed with the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1465-1471)

medium

This block of code, which resets orchestrator ready queue pointers, can be removed as it's no longer needed with the scheduler API.

src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (1507-1510)

medium

This entire block of code, which calculates ready queue sizes from local state, can be removed as it's now managed by the scheduler API.

…cement

Migrate the AICPU executor from local state arrays (sharded ready queues,
fanin refcounts, task completion tracking) to the centralized scheduler API
(pto2_scheduler_on_task_complete, pto2_scheduler_get_ready_task, init_task).

Ring advancement uses CAS-based lock-free writes to header->last_task_alive
with ticket-based heap_tail serialization, matching the pre-migration
executor's concurrency model and avoiding the scope_end dependency that
would deadlock with small task windows.

- Replace per-thread state arrays with PTO2SchedulerState member calls
- Thread 3 (orchestrator) creates PTO2Runtime; threads 0-2 wait on init flag
- Tasks transition directly to CONSUMED in on_task_complete after fanout
  notifications, enabling immediate slot reuse without scope_end
- Fix prod_state == COMPLETED to >= COMPLETED in orchestrator early-return
- Simplify orchestrator by removing non-scheduler codepath for AICPU mode
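
The completion path described in the bullets above (notify fanout, then go straight to CONSUMED) might look roughly like the following sketch. It is a single-threaded illustration under assumed names: Task, Scheduler, add_dependency, and the state constants are hypothetical, the ready queue is a plain vector, and the real pto2_scheduler_on_task_complete also performs ring advancement, which is omitted here.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

enum State { kPending, kReady, kRunning, kCompleted, kConsumed };

struct Task {
    std::atomic<int> fanin_remaining{0};  // producers this task still waits on
    std::atomic<int> state{kPending};
    std::vector<int> fanout;              // ids of consumer tasks
};

struct Scheduler {
    std::vector<Task> tasks;
    std::vector<int> ready;  // simplified stand-in for the ready queue
    explicit Scheduler(std::size_t n) : tasks(n) {}
};

// Record that `consumer` depends on `producer`.
void add_dependency(Scheduler& s, int producer, int consumer) {
    s.tasks[producer].fanout.push_back(consumer);
    s.tasks[consumer].fanin_remaining.fetch_add(1);
}

// Completion: mark COMPLETED, release each consumer whose last pending
// producer this was, then move directly to CONSUMED so the slot can be
// reused without waiting for a scope_end call.
void on_task_complete(Scheduler& s, int id) {
    Task& t = s.tasks[id];
    t.state.store(kCompleted);
    for (int c : t.fanout) {
        if (s.tasks[c].fanin_remaining.fetch_sub(1) == 1) {
            s.tasks[c].state.store(kReady);
            s.ready.push_back(c);
        }
    }
    t.state.store(kConsumed);
}
```

A consumer with two producers becomes ready only after the second producer completes, while each producer is already CONSUMED when its own call returns.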
@ChaoWao ChaoWao force-pushed the fix/executor-scheduler-migration branch from 56891b1 to d2d5fd0 Compare March 3, 2026 03:24
@jvjhfhg jvjhfhg merged commit 268c337 into main Mar 3, 2026
3 checks passed
@jvjhfhg jvjhfhg deleted the fix/executor-scheduler-migration branch March 3, 2026 06:37
PKUZHOU pushed a commit to PKUZHOU/simpler that referenced this pull request Mar 31, 2026
…cement (hw-native-sys#160)
