Skip to content

[WebGPU] Fix stale buffer bindings on first graph-capture replay#28325

Draft
hariharans29 wants to merge 3 commits into
mainfrom
hari/webgpu_graph_capture_buffer_fix
Draft

[WebGPU] Fix stale buffer bindings on first graph-capture replay#28325
hariharans29 wants to merge 3 commits into
mainfrom
hari/webgpu_graph_capture_buffer_fix

Conversation

@hariharans29
Copy link
Copy Markdown
Member

Lazily create the graph-mode BufferManager and gate routing through it on a graph_buffer_mgr_active_ flag set only between OnRunStart/OnRunEnd. Refactor GpuBufferAllocator to take a std::function<const BufferManager&()> getter so the allocator can resolve the active buffer manager dynamically.

Pre-existing latent bug: when graph capture is enabled, captured commands held buffer pointers from the capture run. If the allocator freed and reused a slot before the first Replay(), the dispatch read stale memory and produced one wrong logits tensor. Mostly hidden in fixed-shape decode because the allocator typically re-handed out the same addresses; exposed by MatMulNBits MLP/QKV fusions which churn the allocation pattern (#28280)

Also adds WebGpuContext::WaitForQueueIdle() and ComputeContextBase::FlushAndWait() helpers used by the buffer-mgr lifetime management.

Motivation and Context

Fix correctness on first replay

Lazily create the graph-mode BufferManager and gate routing through it on a graph_buffer_mgr_active_ flag set only between OnRunStart/OnRunEnd. Refactor GpuBufferAllocator to take a std::function<const BufferManager&()> getter so the allocator can resolve the active buffer manager dynamically.

Pre-existing latent bug: when graph capture is enabled, captured commands held buffer pointers from the capture run. If the allocator freed and reused a slot before the first Replay(), the dispatch read stale memory and produced one wrong logits tensor. Mostly hidden in fixed-shape decode because the allocator typically re-handed out the same addresses; exposed by upcoming MatMulNBits MLP/QKV fusions which churn the allocation pattern.

Also adds WebGpuContext::WaitForQueueIdle() and ComputeContextBase::FlushAndWait() helpers used by the buffer-mgr lifetime management.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a WebGPU graph-capture correctness issue where the first replay could read stale buffer bindings. It does this by routing allocations through a dedicated “graph-mode” BufferManager only during the capture window, and by letting the allocator resolve the currently-active buffer manager dynamically.

Changes:

  • Lazily create the graph-mode BufferManager and gate WebGpuExecutionProvider::BufferManager() via a new graph_buffer_mgr_active_ flag (active only between OnRunStart/OnRunEnd).
  • Refactor GpuBufferAllocator to optionally take a std::function<const BufferManager&()> getter, enabling dynamic buffer-manager selection.
  • Add WebGpuContext::WaitForQueueIdle() (implementation) and a ComputeContextBase::FlushAndWait() convenience helper.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
onnxruntime/core/providers/webgpu/webgpu_execution_provider.h Adds graph_buffer_mgr_active_ flag to control graph buffer manager routing.
onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc Lazily instantiates graph buffer manager, toggles active flag during capture, and updates allocator creation to use a getter.
onnxruntime/core/providers/webgpu/webgpu_context.cc Implements WebGpuContext::WaitForQueueIdle().
onnxruntime/core/providers/webgpu/compute_context.h Adds FlushAndWait() helper and expands BufferManagerAccessor friendship.
onnxruntime/core/providers/webgpu/allocator.h Adds <functional> and a new allocator constructor taking a buffer-manager getter; stores the getter.
onnxruntime/core/providers/webgpu/allocator.cc Implements new constructor and switches alloc/free to use the getter.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/providers/webgpu/compute_context.h
Comment thread onnxruntime/core/providers/webgpu/compute_context.h Outdated
Comment thread onnxruntime/core/providers/webgpu/webgpu_context.cc
Fixes build error introduced when ComputeContextBase::FlushAndWait() was added without the corresponding WebGpuContext public declaration. Spotted by Copilot review on the PR.
@hariharans29 hariharans29 changed the title [WebGPU] Fix stale buffer bindings on first graph-capture replay WIP: [WebGPU] Fix stale buffer bindings on first graph-capture replay May 2, 2026
@hariharans29 hariharans29 changed the title WIP: [WebGPU] Fix stale buffer bindings on first graph-capture replay [WebGPU] Fix stale buffer bindings on first graph-capture replay May 3, 2026
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label May 5, 2026
return webgpu_context_.Run(*this, program);
}

inline Status FlushAndWait() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is FlushAndWait() used for?

@qjia7
Copy link
Copy Markdown
Contributor

qjia7 commented May 7, 2026

I have a PR #28260 to support multi-captured graph which can be used when multi generators are used in GenAI. In that PR, I also need to dynamically adjust the buffer manager so that the correct buffer manager is used. I thought the incorrect only happens for multi generators. But I'm not clear that why you meet the incorrect issue if you are using one generator. You mentioned that ' If the allocator freed and reused a slot before the first Replay(), the dispatch read stale memory and produced one wrong logits tensor. ', I am confused that why it happens. Can you give me a concrete repro to help me understand the issue here?

@hariharans29
Copy link
Copy Markdown
Member Author

I have a PR #28260 to support multi-captured graph which can be used when multi generators are used in GenAI. In that PR, I also need to dynamically adjust the buffer manager so that the correct buffer manager is used. I thought the incorrect only happens for multi generators. But I'm not clear that why you meet the incorrect issue if you are using one generator. You mentioned that ' If the allocator freed and reused a slot before the first Replay(), the dispatch read stale memory and produced one wrong logits tensor. ', I am confused that why it happens. Can you give me a concrete repro to help me understand the issue here?

Please ignore this PR for now. I hit this issue when I was using a correctness benchmark harness for the QKV and MLP fused kernels and AI suggested this fix.

@hariharans29 hariharans29 marked this pull request as draft May 10, 2026 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ep:WebGPU ort-web webgpu provider

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants