Skip to content

Fix DeepCompile ZeRO-3 release parameter lifetime#8032

Open
tohtana wants to merge 1 commit into
deepspeedai:masterfrom
tohtana:tohtana/fix/deepcompile-release-return-storage
Open

Fix DeepCompile ZeRO-3 release parameter lifetime#8032
tohtana wants to merge 1 commit into
deepspeedai:masterfrom
tohtana:tohtana/fix/deepcompile-release-return-storage

Conversation

@tohtana
Copy link
Copy Markdown
Collaborator

@tohtana tohtana commented May 28, 2026

PR #7489 made ZeRO-3 all-gather allocate a padded base buffer for uneven shards and return a true-shape view into that buffer. That means the registry tensor and the tensor returned to the compiled graph no longer necessarily share the same TensorImpl, although they still share the same underlying storage.

The existing release path only did set_data(empty) on the registry tensor before unregistering it. With the new base/view relationship, that clears the registry-side tensor metadata but does not resize the shared StorageImpl still referenced by returned views. As a result, the padded gathered allocation can remain live after the final release_param.

This patch keeps the release graph ordering unchanged and makes final non-persistent release resize the registered gathered storage to 0 bytes before unregistering it.

@tohtana tohtana requested review from loadams and tjruwase as code owners May 28, 2026 16:20
@tohtana tohtana force-pushed the tohtana/fix/deepcompile-release-return-storage branch from 7b2ae31 to 04ca757 Compare May 28, 2026 16:21
@tohtana tohtana requested a review from eternalNight May 29, 2026 22:09
@eternalNight
Copy link
Copy Markdown
Contributor

The gathered buffer is registered in DSParamRegistry which holds a reference to the tensor until release_param removes it from the registry. So I think torch will not release the buffer storage even inductor drops the gathered view early. I was wondering why that ref-count-based mechanism fails in your case?

@tohtana tohtana force-pushed the tohtana/fix/deepcompile-release-return-storage branch from 7af789e to f075f74 Compare May 30, 2026 05:05
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana force-pushed the tohtana/fix/deepcompile-release-return-storage branch from f075f74 to d715e86 Compare May 30, 2026 05:46
@tohtana
Copy link
Copy Markdown
Collaborator Author

tohtana commented May 30, 2026

Thank you for your review, @eternalNight!

The actual issue is that, after #7489, the release op does not release the underlying gathered buffer storage. It is not an early-release issue. I updated the PR description to clarify this.
I also reverted the graph modification after verifying that the existing ordering is still safe.

This PR is much simpler now.

@eternalNight
Copy link
Copy Markdown
Contributor

The actual issue is that, after #7489, the release op does not release the underlying gathered buffer storage. It is not an early-release issue. I updated the PR description to clarify this. I also reverted the graph modification after verifying that the existing ordering is still safe.

Thanks for the clarification! That makes the picture much clearer. I still have one doubt, though.

The DSParamRegistry holds a reference to the raw, possibly-padded buffer. The buffer is expected to be released once that reference is dropped in unregisterGatheredParam. If it still persists after being unregistered, there must be a reference alive somewhere else.

You may capture a torch memory history which records exactly when (by recording the call stack) each storage is finally released. There may be more hints in the call stack about the residual reference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants