Skip to content

Fix: separate onboard run context cleanup from final device teardown#532

Open
ChaoZheng109 wants to merge 1 commit intohw-native-sys:mainfrom
ChaoZheng109:fix/onboard-teardown-order
Open

Fix: separate onboard run context cleanup from final device teardown#532
ChaoZheng109 wants to merge 1 commit intohw-native-sys:mainfrom
ChaoZheng109:fix/onboard-teardown-order

Conversation

@ChaoZheng109
Copy link
Copy Markdown
Collaborator

@ChaoZheng109 ChaoZheng109 commented Apr 13, 2026

Summary

  • Remove resource cleanup from run_runtime() lambda — let finalize_device() handle it via DeviceRunner::finalize()
  • Add rtSetDevice() at finalize() entry so rtFree succeeds even when called from a thread without an existing device context
  • Move rtDeviceReset() from reset_device_context() into finalize(), after mem_alloc_.finalize(), to guarantee all device memory is freed before the context is torn down
  • Drop MemoryAllocator::finalized_ flag — an empty ptr_set_ is already a natural no-op, so repeated and interleaved finalize() calls are idempotent without an explicit guard

Root Cause

The original teardown sequence was:

  1. run_runtime() lambda → reset_device_context() → destroys streams + rtDeviceReset
  2. Outer finalize_device()DeviceRunner::finalize()mem_alloc_.finalize()rtFree

Step 2's rtFree calls happened after rtDeviceReset in step 1, causing error 507899 (invalid device context).

The fix centralizes all resource cleanup in finalize(), called by finalize_device():

finalize_device() (main thread):
  → finalize()
    rtSetDevice(device_id_)    ← establish context
    streams + rtFree           ← context valid
    rtDeviceReset(device_id_)  ← last step

Testing

  • All 19 a2a3sim examples pass

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the device cleanup logic by moving the rtDeviceReset call from reset_device_context to the finalize method in DeviceRunner. It also updates the MemoryAllocator to reset its finalized_ state upon new allocations and modifies the runtime API to call finalize instead of reset_device_context. Feedback was provided to guard the rtDeviceReset call with a check for a valid device_id_ to ensure the finalize method is idempotent and avoids passing an invalid ID to the runtime.

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp Outdated
Comment thread src/a5/platform/onboard/host/device_runner.cpp Outdated
@ChaoZheng109 ChaoZheng109 force-pushed the fix/onboard-teardown-order branch 7 times, most recently from 2655f49 to 8220425 Compare April 15, 2026 08:34
  - split onboard device lifecycle into thread attach, run-context setup,
    run-context release, and final device teardown
  - keep run-scoped streams available through init_runtime_impl so kernel
    binary upload succeeds before execution starts
  - release run-scoped streams after each run without calling
    rtDeviceReset, and defer the final device reset to finalize()
  - preserve the guarantee that all tracked device memory is freed before
    rtDeviceReset runs
  - align onboard API comments and error messages with the new lifecycle
@ChaoZheng109 ChaoZheng109 force-pushed the fix/onboard-teardown-order branch from 8220425 to 53f3f96 Compare April 15, 2026 09:00
@ChaoZheng109 ChaoZheng109 changed the title Fix: finalize onboard device resources before device reset Fix: separate onboard run context cleanup from final device teardown Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant