[Sequence Parallelism] Add AutoSP scaffolding for multimodal models (ViT + LLM) #8
Draft
nathon-lee wants to merge 1 commit into
Signed-off-by: leejane <121294318@qq.com>
Summary
This PR introduces an initial AutoSP path for multimodal sequence parallelism in DeepSpeed. It adds model auto-detection and automatic attention wrapping for ViT and LLM branches, plus a fusion adapter skeleton for cross-modal sequence reshaping.
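The detect-then-wrap flow described above can be sketched in a few lines. This is a dependency-free illustration that uses plain Python objects rather than torch.nn.Module; `Attention`, `SPWrapped`, `find_attention`, and `wrap_in_place` are hypothetical names, and inferring the branch from the attribute path is an assumption made only for this toy model. The actual detector and `auto_wrap_model_for_sp` in this PR may work differently.

```python
class Attention:
    """Stand-in for an attention layer; returns its input unchanged."""
    def __call__(self, x):
        return x


class SPWrapped:
    """Hypothetical wrapper marking where SP communication would surround the call."""
    def __init__(self, inner, branch):
        self.inner = inner
        self.branch = branch  # "vit" or "llm"

    def __call__(self, x):
        # A real wrapper would redistribute x across the process group here.
        return self.inner(x)


class Branch:
    def __init__(self):
        self.attn = Attention()
        self.hidden = 8  # non-module attribute, skipped by the walker


class ToyModel:
    def __init__(self):
        self.vision = Branch()
        self.language = Branch()


def find_attention(root, prefix=""):
    """Yield (parent, attr_name, dotted_path) for each unwrapped Attention."""
    for name, child in vars(root).items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(child, SPWrapped):
            continue  # already wrapped, keeps the operation idempotent
        if isinstance(child, Attention):
            yield root, name, path
        elif hasattr(child, "__dict__"):
            yield from find_attention(child, path)


def wrap_in_place(model):
    """Replace each detected attention attribute on its parent object."""
    wrapped = []
    # Materialize the matches first so attributes are not replaced mid-walk.
    for parent, name, path in list(find_attention(model)):
        branch = "vit" if path.startswith("vision") else "llm"
        setattr(parent, name, SPWrapped(getattr(parent, name), branch))
        wrapped.append((path, branch))
    return wrapped
```

Because the replacement happens on the parent object, callers keep using the same model instance, which is what "in-place" wrapping implies.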
Motivation
Multimodal training produces much longer effective sequence lengths than text-only training, because visual tokens are added to the text sequence, so sequence parallelism (SP) is critical for throughput and memory efficiency. Existing workflows require substantial manual engineering to enable SP across both the vision and language branches.
What is included
AutoSP detector for multimodal architectures:
  detects ViT attention modules
  detects LLM attention modules
  detects a candidate vision-language projection module
ViT SP wrapper:
  adds UlyssesSPViTAttention
  supports CLS token replication
  preserves tuple outputs of the wrapped module
AutoSP entrypoint:
  adds auto_wrap_model_for_sp(model, process_group)
  performs in-place module wrapping
Fusion adapter scaffold:
  adds the ModalityFusionSPAdapter interface and an SP gather/scatter flow
  keeps token splicing architecture-specific via an override hook
Exports:
  exposes AutoSP APIs from deepspeed.sequence
Tests:
  adds unit tests for detector, wrapper, and auto-wrap behavior
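One plausible reading of "CLS token replication" is that the patch tokens are split across ranks while every rank keeps its own copy of the class token. The sketch below simulates that layout with plain lists. The function names and the contiguous-chunk layout are assumptions for illustration, not the behavior of UlyssesSPViTAttention as implemented in this PR.

```python
def shard_with_cls(tokens, world_size):
    """Split [cls, patch_0, ...] into per-rank shards, replicating cls."""
    cls, patches = tokens[0], tokens[1:]
    if len(patches) % world_size != 0:
        raise ValueError("patch count must be divisible by world_size")
    chunk = len(patches) // world_size
    # Each rank gets the cls token plus its contiguous slice of patches.
    return [[cls] + patches[r * chunk:(r + 1) * chunk]
            for r in range(world_size)]


def unshard_with_cls(shards):
    """Rebuild the full sequence, keeping a single copy of cls."""
    cls = shards[0][0]
    patches = [tok for shard in shards for tok in shard[1:]]
    return [cls] + patches
```

With this layout each rank holds `1 + num_patches / world_size` tokens, so the replicated token adds a small constant overhead per rank.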
What is not included
Architecture-specific visual token splice implementations (LLaVA/InternVL/Qwen2-VL) are not part of this PR and will be added in follow-up work.
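To illustrate what a follow-up splice implementation might plug into, here is a minimal sketch of a gather, splice, scatter flow with an overridable hook. It simulates ranks with plain lists. `FusionAdapterSketch`, `PlaceholderSplice`, and the `<image>` placeholder rule are hypothetical and do not reflect the ModalityFusionSPAdapter signature or how any named model actually splices visual tokens.

```python
class FusionAdapterSketch:
    """Toy gather -> splice -> scatter flow over simulated ranks."""

    def __init__(self, world_size):
        self.world_size = world_size

    def splice(self, text_tokens, visual_tokens):
        # Override hook: each architecture decides where visual tokens go.
        raise NotImplementedError("splice is model-specific")

    def fuse(self, text_shards, visual_shards):
        # Gather: rebuild the full sequence of each modality.
        text = [tok for shard in text_shards for tok in shard]
        visual = [tok for shard in visual_shards for tok in shard]
        fused = self.splice(text, visual)
        # Scatter: hand each rank an equal contiguous slice of the result.
        if len(fused) % self.world_size != 0:
            raise ValueError("fused length must be divisible by world_size")
        chunk = len(fused) // self.world_size
        return [fused[r * chunk:(r + 1) * chunk]
                for r in range(self.world_size)]


class PlaceholderSplice(FusionAdapterSketch):
    """Toy rule: replace one "<image>" placeholder with the visual tokens."""

    def splice(self, text_tokens, visual_tokens):
        i = text_tokens.index("<image>")
        return text_tokens[:i] + visual_tokens + text_tokens[i + 1:]
```

Keeping `splice` as the only overridden method means the gather and scatter steps stay shared across architectures, which matches the stated intent of keeping token splicing architecture-specific.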
Compatibility and risk
No behavior change unless users explicitly call auto_wrap_model_for_sp
Current implementation is additive and isolated to new sequence modules
Fusion logic remains opt-in and extensible
Validation
Added unit tests in test_autosp.py
Verified no API break in existing sequence module import paths
Follow-ups
Add model-specific fusion splice adapters
Add end-to-end multimodal SP integration tests
Add benchmark report (throughput/memory/scaling)