Update: consolidate PTO2 runtime refactor and scheduler fixes #153

Closed
jvjhfhg wants to merge 2 commits into hw-native-sys:main from jvjhfhg:rebase-main-refactor-aicpu-runtime

Collaborator

@jvjhfhg jvjhfhg commented Mar 2, 2026

  • Inline TensorMap, orchestrator, and ring buffer hot paths while moving runtime data-flow operations to C++ member APIs.
  • Unify task lifecycle handling in scheduler APIs to remove executor deadlock paths and keep fan-in signaling on relaxed atomics.
  • Restore PTO2 scheduler profiling hooks after rebase so runtime tracing and diagnostics remain intact.

Co-authored-by: poursoul <poursoul@126.com>
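
The fan-in signaling mentioned in the second bullet can be pictured as a per-task countdown: each completed producer decrements the consumer's counter, and the caller that takes it to zero is the one that marks the task ready. The following is a minimal sketch under assumptions, not the PR's code: `TaskSlot`, `release_fanin`, and `count_ready_signals` are hypothetical names, and the sketch uses acquire/release ordering (as the later commit message describes) rather than relaxed ordering.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical per-task slot; the real descriptor layout is not shown in this thread.
struct TaskSlot {
    std::atomic<int32_t> fanin_remaining{0};
};

// Called once per completed producer. Returns true for exactly one caller:
// the one whose decrement takes the counter from 1 to 0.
inline bool release_fanin(TaskSlot& slot) {
    // acq_rel (an assumption of this sketch) so the caller that observes zero
    // also observes the writes made by the other producers before they decremented.
    return slot.fanin_remaining.fetch_sub(1, std::memory_order_acq_rel) == 1;
}

// Single-threaded walk-through: `fanin` producers feed one consumer.
// Counts how many calls reported "ready".
inline int count_ready_signals(int32_t fanin) {
    TaskSlot slot;
    slot.fanin_remaining.store(fanin, std::memory_order_relaxed);
    int ready = 0;
    for (int32_t i = 0; i < fanin; ++i) {
        if (release_fanin(slot)) {
            ++ready;
        }
    }
    return ready;
}
```

Because `fetch_sub` returns the previous value, only one caller can ever see `1`, so the task is pushed to the ready queue once regardless of how many threads complete producers concurrently.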

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a substantial refactoring of the PTO2 runtime and scheduler components, primarily focusing on enhancing performance, robustness, and code structure. Key changes include inlining critical data paths, migrating global functions to class member APIs, and unifying task lifecycle management within the scheduler to prevent deadlocks. Additionally, profiling capabilities were reinstated, and analysis tools were improved to provide more accurate and readable performance metrics.

Highlights

  • Runtime Refactoring: TensorMap, orchestrator, and ring buffer hot paths have been inlined, and runtime data-flow operations were moved to C++ member APIs for improved encapsulation and performance.
  • Scheduler Fixes: Task lifecycle handling in scheduler APIs was unified to eliminate executor deadlock paths and ensure fan-in signaling uses relaxed atomics for efficiency.
  • Profiling Restoration: PTO2 scheduler profiling hooks were restored after a rebase, ensuring runtime tracing and diagnostics remain intact.
  • Code Modernization: Significant portions of the codebase were refactored to use C++ member functions and remove global helper functions, improving code organization and maintainability.
  • Performance Tooling Enhancement: Python analysis scripts for scheduler overhead and swimlane conversion were updated to handle floating-point time values and format small values more precisely.
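
As an illustration of the "free function to member API" move described in the highlights, here is a small sketch of a task ring whose allocation routine is a member method instead of a global function taking a pointer. Everything in it is an assumption for illustration: `TaskRing`, `try_alloc`, `slot_of`, and `fill_ring_and_count` are hypothetical, and only the `last_task_alive` field name is borrowed from the summary.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified task ring. Capacity is assumed to be a power of two.
struct TaskRing {
    int32_t capacity;         // number of slots
    int32_t next_task_id;     // next id to hand out
    int32_t last_task_alive;  // oldest id whose slot has not been reclaimed

    // Member form of what would otherwise be a free function
    // such as `task_ring_try_alloc(TaskRing* ring)`.
    // Returns the new task id, or -1 if every slot is still in flight.
    int32_t try_alloc() {
        if (next_task_id - last_task_alive >= capacity) {
            return -1;
        }
        return next_task_id++;
    }

    // Maps a task id to its slot index.
    int32_t slot_of(int32_t task_id) const { return task_id & (capacity - 1); }
};

// Allocates until the ring reports it is full and returns the number of ids handed out.
inline int32_t fill_ring_and_count(int32_t capacity) {
    TaskRing ring{capacity, 0, 0};
    int32_t allocated = 0;
    while (ring.try_alloc() != -1) {
        ++allocated;
    }
    return allocated;
}
```

Defining the method in the struct body makes it implicitly inline, which is one way a hot path can be kept in the header while the state it touches stays encapsulated.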

Changelog
  • AGENTS.md
    • Updated AI instruction rules to clarify the role of CLAUDE.md as the source of truth and introduce new directories for conventions, skills, and commands.
  • src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Removed spin_hint.h include and SpinLock struct, consolidating locking mechanisms.
    • Eliminated AICPU_MAX_READY_TASKS, AICPU_READY_MASK, and shard-based ready queue structures from AicpuExecutor.
    • Removed enqueue_ready_task_with_profiling helper function and its declaration.
    • Replaced AICPU_TASK_INVALID with -1 for executing_task_ids_ initialization.
    • Removed orchestrator-specific shared memory flags (sm_header_ready_, orch_pointers_ready_) and ready queue pointers.
    • Simplified task initialization by removing s_pto2_fanin_refcount, s_pto2_task_completed, and s_pto2_completed_by_task arrays.
    • Integrated new pto2_scheduler_on_task_complete and pto2_scheduler_get_ready_task API calls, including profiling parameters.
    • Added runtime_init_ready_ atomic flag to synchronize runtime initialization.
    • Updated profiling metrics and logging format for scheduler phases and lock contention.
    • Removed logic for advancing last_task_alive and heap_tail directly within AicpuExecutor, delegating to the scheduler.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Removed global pto2_add_consumer_to_producer and pto2_alloc_packed_buffer functions.
    • Converted pto2_tensormap_init_default, pto2_tensormap_destroy, pto2_tensormap_reset, pto2_tensormap_lookup, pto2_tensormap_insert, pto2_tensormap_remove_entry, pto2_tensormap_valid_count, pto2_orchestrator_sync_tensormap to member methods of PTO2TensorMap.
    • Updated task submission logic to use orch->tensor_map.lookup and orch->tensor_map.insert.
    • Modified task submission to use orch->task_ring.pto2_task_ring_alloc() and orch->pto2_alloc_packed_buffer() member methods.
    • Simplified fanin list finalization by integrating dependency addition directly with the scheduler API.
    • Replaced LOG_INFO with fprintf(stdout, ...) for orchestrator output.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
    • Moved pto2_alloc_packed_buffer into PTO2OrchestratorState as a member method.
    • Removed AICPU parallel mode specific members (aicpu_fanin_refcount, aicpu_task_completed, aicpu_completed_by_task, aicpu_window_mask) from PTO2OrchestratorState.
    • Removed orchestrator ready queue members (orch_ready_queue, orch_ready_tail, orch_ready_head, ORCH_READY_QUEUE_SIZE) from PTO2OrchestratorState.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.cpp
    • Removed global pto2_heap_ring_alloc, pto2_heap_ring_try_alloc, pto2_heap_ring_available functions.
    • Removed global pto2_task_ring_alloc, pto2_task_ring_try_alloc functions.
    • Removed PTO2_SPIN_VERBOSE_LOGGING and related LOG_INFO/LOG_WARN/LOG_ERROR messages, replacing with fprintf(stderr, ...).
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
    • Moved pto2_heap_ring_alloc, pto2_heap_ring_try_alloc, pto2_heap_ring_available into PTO2HeapRing as member methods.
    • Moved pto2_task_ring_alloc, pto2_task_ring_try_alloc into PTO2TaskRing as member methods.
    • Defined PTO2_SPIN_VERBOSE_LOGGING, PTO2_BLOCK_NOTIFY_INTERVAL, PTO2_HEAP_SPIN_LIMIT, PTO2_FLOW_CONTROL_SPIN_LIMIT macros.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
    • Removed output_index and num_outputs members from PTO2TaskDescriptor.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
    • Added get_sys_cnt_aicpu weak fallback for non-AICPU builds.
    • Added spinlock member to PTO2ReadyQueue and implemented thread-safe push/pop operations with optional profiling.
    • Removed global pto2_scheduler_init_task and pto2_scheduler_check_ready functions.
    • Updated pto2_scheduler_on_task_complete to use the new scheduler API and include profiling parameters for ready queue operations.
    • Modified check_and_handle_consumed and pto2_scheduler_release_producer to use __ATOMIC_ACQ_REL memory order for atomic operations.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
    • Added spinlock to PTO2ReadyQueue struct.
    • Updated pto2_ready_queue_push and pto2_ready_queue_pop function signatures to include optional profiling parameters.
    • Converted PTO2SchedulerState to a struct with member methods: pto2_task_slot, release_fanin_and_check_ready, and init_task.
    • Removed global pto2_scheduler_init_task and pto2_scheduler_check_ready function declarations.
    • Updated pto2_scheduler_get_ready_task and pto2_scheduler_on_task_complete function signatures to include optional profiling parameters.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.cpp
    • Converted pto2_tensormap_init, pto2_tensormap_init_default, pto2_tensormap_destroy, pto2_tensormap_reset, pto2_tensormap_lookup, pto2_tensormap_insert, pto2_tensormap_cleanup_retired, pto2_tensormap_print_stats, pto2_tensormap_valid_count, pto2_orchestrator_sync_tensormap to member methods of PTO2TensorMap.
    • Removed global profiling variables for TensorMap lookups and inserts.
    • Updated linked list manipulation within TensorMap to use pointers (PTO2TensorMapEntry*) instead of integer offsets (int32_t).
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
    • Changed #ifndef/#define guards to #pragma once.
    • Modified PTO2TensorMapEntry to use pointers (PTO2TensorMapEntry*) for next_in_bucket, prev_in_bucket, next_in_task, prev_in_task instead of int32_t offsets.
    • Defined PTO2LookupResult struct for stack-allocated lookup results.
    • Converted all pto2_tensormap_ and pto2_orchestrator_sync_tensormap functions into member methods of the PTO2TensorMap struct.
    • Removed PTO2TensorMapProfilingData struct.
  • src/runtime/tensormap_and_ringbuffer/runtime/tensor_pool.h
    • Added ref_count = 1 initialization to TensorData constructors.
    • Modified is_overlap method in TensorData to assert is_same_memref and simplify overlap detection logic.
    • Removed redundant data[index].ref_count = 1 assignment from TensorPool::new_tensor_data.
  • tools/sched_overhead_analysis.py
    • Added format_us_value helper function to format microsecond values, especially small non-zero ones.
    • Updated regular expressions to correctly parse floating-point microsecond values from log output.
    • Applied format_us_value to various output metrics for consistent and precise display.
  • tools/swimlane_converter.py
    • Added format_us_value helper function to format microsecond values, including a force_sub_threshold option for zero values.
    • Applied format_us_value to task statistics output for better readability of small time durations.
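
The changelog's "spinlock member on the ready queue with thread-safe push/pop" can be sketched roughly as below. This is a hedged illustration only: `ReadyQueue`, its fixed capacity, and the `-1` empty sentinel are assumptions, and the optional profiling parameters mentioned in the changelog are omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical fixed-capacity ready queue guarded by a simple spinlock.
struct ReadyQueue {
    static constexpr int32_t kCapacity = 8;  // assumed size for the sketch

    std::atomic<bool> locked{false};
    int32_t items[kCapacity] = {};
    int32_t head = 0;   // index of the next element to pop
    int32_t count = 0;  // number of queued task ids

    // Spin until this thread flips the flag from false to true.
    void acquire() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            // busy-wait
        }
    }
    void release() { locked.store(false, std::memory_order_release); }

    // Returns false when the queue is full.
    bool push(int32_t task_id) {
        acquire();
        const bool ok = count < kCapacity;
        if (ok) {
            items[(head + count) % kCapacity] = task_id;
            ++count;
        }
        release();
        return ok;
    }

    // Returns -1 when the queue is empty.
    int32_t pop() {
        acquire();
        int32_t task_id = -1;
        if (count > 0) {
            task_id = items[head];
            head = (head + 1) % kCapacity;
            --count;
        }
        release();
        return task_id;
    }
};
```

Keeping the lock inside the queue means callers such as an executor thread never touch head or tail indices directly, which is consistent with the summary's description of removing the executor's own ready-queue structures.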

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant and valuable refactoring of the PTO2 runtime, particularly in the scheduler and memory management components. The changes move C-style free functions to C++ member functions, encapsulating logic for the TensorMap, ring buffers, and scheduler state. This greatly improves code organization, maintainability, and adherence to C++ best practices. Key improvements include unifying task lifecycle handling into a new scheduler API, which resolves potential deadlocks, and restoring profiling hooks for better diagnostics. My review includes a few suggestions for improving clarity and maintainability in the new concurrent logic, but overall, this is an excellent set of changes that modernizes the codebase and makes it more robust.

Comment thread src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Comment thread src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
@jvjhfhg jvjhfhg force-pushed the rebase-main-refactor-aicpu-runtime branch from e7e88f2 to 0966e0c Compare March 2, 2026 04:04
@jvjhfhg jvjhfhg marked this pull request as draft March 2, 2026 04:04
@jvjhfhg jvjhfhg force-pushed the rebase-main-refactor-aicpu-runtime branch from 0966e0c to a391550 Compare March 2, 2026 04:04
@jvjhfhg jvjhfhg marked this pull request as ready for review March 2, 2026 04:05
Comment thread src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h Outdated
@jvjhfhg jvjhfhg force-pushed the rebase-main-refactor-aicpu-runtime branch from a391550 to 05bc82a Compare March 2, 2026 04:25
Comment thread tools/sched_overhead_analysis.py Outdated
@jvjhfhg jvjhfhg marked this pull request as draft March 2, 2026 06:45
@jvjhfhg jvjhfhg force-pushed the rebase-main-refactor-aicpu-runtime branch from 05bc82a to 79ab2f0 Compare March 2, 2026 06:45
@jvjhfhg jvjhfhg marked this pull request as ready for review March 2, 2026 06:46
jvjhfhg and others added 2 commits March 2, 2026 17:41
- Inline TensorMap, orchestrator, and ring buffer hot paths while moving runtime data-flow operations to C++ member APIs.
- Unify task lifecycle handling in scheduler APIs to remove executor deadlock paths and keep fan-in signaling on relaxed atomics.
- Restore PTO2 scheduler profiling hooks after rebase so runtime tracing and diagnostics remain intact.

Co-authored-by: poursoul <poursoul@126.com>
- Port AICPU ring-buffer slot state wiring into the current runtime so
  orchestrator and scheduler share completion/fanin data in device mode.
- Reset per-slot scheduler/AICPU state on submit and validate task_id on
  dispatch/fanout paths to ignore stale entries after slot reuse.
- Keep producer/consumer lifecycle balanced when links are added after a
  producer already completed, preventing premature reclamation.
- Relax hot-path atomics from seq_cst to acquire/release and use plain
  fanin_count reads where the descriptor field is immutable.
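
The second commit's "validate task_id ... to ignore stale entries after slot reuse" can be illustrated with a small sketch. It assumes a ring of slots indexed by `task_id & (slots - 1)`, where each slot remembers which task currently owns it; `SlotState`, `SlotTable`, `submit`, and `mark_completed` are hypothetical names, not identifiers from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-slot state; fields are illustrative.
struct SlotState {
    int32_t task_id = -1;    // id of the task that currently owns the slot
    bool completed = false;  // completion flag for that task
};

struct SlotTable {
    static constexpr int32_t kSlots = 4;  // assumed power-of-two ring size
    SlotState slots[kSlots];

    // Reset the per-slot state when a new task takes over the slot.
    void submit(int32_t task_id) {
        SlotState& s = slots[task_id & (kSlots - 1)];
        s.task_id = task_id;
        s.completed = false;
    }

    // Returns false and changes nothing if the slot now belongs to a
    // different task, i.e. the caller is holding a stale id.
    bool mark_completed(int32_t task_id) {
        SlotState& s = slots[task_id & (kSlots - 1)];
        if (s.task_id != task_id) {
            return false;  // stale entry: slot was reused
        }
        s.completed = true;
        return true;
    }
};
```

With four slots, task 5 reuses the slot of task 1, so a late completion for task 1 is rejected instead of being applied to task 5.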
@jvjhfhg jvjhfhg force-pushed the rebase-main-refactor-aicpu-runtime branch from 79ab2f0 to b3b9073 Compare March 2, 2026 09:42
@jvjhfhg jvjhfhg marked this pull request as draft March 2, 2026 12:10
@jvjhfhg jvjhfhg closed this Mar 3, 2026
@jvjhfhg jvjhfhg deleted the rebase-main-refactor-aicpu-runtime branch March 3, 2026 03:22

2 participants