Run DreamID-Omni (an 11.66B-parameter identity-preserving audio-video model) on multiple consumer 24GB GPUs using layer sharding. This fork replaces the original sequence parallelism (which duplicates the full model on each GPU) with accelerate's `dispatch_model`, which splits the model's layers across your cards.
https://github.com/user-attachments/assets/luther_demo.mp4
Martin Luther discussing the weather with Aristotle at a Victorian train station. 50 denoising steps, 512x992, ~17 minutes on consumer hardware.
Fork of DreamID-Omni by Tsinghua University / ByteDance. Layer sharding technique from H100 Not Required: 32B FLUX.2-dev on 2017 Hardware.
| Original | This Fork |
|---|---|
| Needs 1x 80GB GPU (A100/H100) or 8-GPU sequence parallelism | Runs on 3-6x 24GB consumer GPUs (RTX 3090, A5000, RTX 4090) |
| Sequence parallelism duplicates full model per GPU | Layer sharding — each GPU holds a different slice of the model |
| Requires flash_attn (ABI issues with newer PyTorch) | PyTorch SDPA fallback — works out of the box |
| CLI only | Gradio UI included |
┌────────────────────────────────────────────────────────────────┐
│ Layer Sharding: 11.66B model split across 5x 24GB GPUs         │
│ │
│ GPU 0: Video+Audio Blocks 0-5 + Embeddings + VAE ~11 GB │
│ GPU 1: Video+Audio Blocks 6-11 ~7 GB │
│ GPU 2: Video+Audio Blocks 12-17 ~7 GB │
│ GPU 3: Video+Audio Blocks 18-23 + T5 Encoder ~13 GB │
│ GPU 4: Video+Audio Blocks 24-29 + Heads ~7 GB │
│ │
│ Video block[i] and Audio block[i] stay CO-LOCATED │
│ (they cross-attend each other — must be same GPU) │
└────────────────────────────────────────────────────────────────┘
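The co-location constraint above is enforced when building the `device_map` handed to accelerate's `dispatch_model`. A minimal sketch, using hypothetical module names (`video_blocks.N`, `audio_blocks.N`, and so on; the real map lives in `inference_sharded.py`):

```python
# Sketch of an explicit device_map for accelerate's dispatch_model.
# Module names are illustrative, not the repo's actual attribute names.

def build_device_map(n_blocks: int = 30, n_gpus: int = 5) -> dict:
    """Assign video block i and audio block i to the SAME GPU, so their
    cross-attention never has to hop across a device boundary."""
    per_gpu = -(-n_blocks // n_gpus)  # ceiling division
    device_map = {}
    for i in range(n_blocks):
        gpu = min(i // per_gpu, n_gpus - 1)
        device_map[f"video_blocks.{i}"] = gpu  # co-located pair:
        device_map[f"audio_blocks.{i}"] = gpu  # same GPU by construction
    # Non-block modules pinned to match the diagram above
    device_map["embeddings"] = 0
    device_map["vae"] = 0
    device_map["text_encoder"] = 3  # T5
    device_map["heads"] = n_gpus - 1
    return device_map

# Usage with accelerate (not run here):
# from accelerate import dispatch_model
# model = dispatch_model(model, device_map=build_device_map())
```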
- `inference_sharded.py` — New inference script with explicit device map (replaces `torchrun` sequence parallelism)
- `dreamid_omni/modules/fusion.py` — Device-aware forward loop: moves tensors between GPUs as data flows through blocks
- `dreamid_omni/modules/model.py` — RoPE frequency tensors follow data across devices
- `dreamid_omni/modules/attention.py` — SDPA fallback when flash_attn is unavailable
- `app.py` — Gradio web interface
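The device-aware loop in `fusion.py` can be sketched as follows; the function and call signatures here are illustrative, not the repo's exact code:

```python
import torch

# Sketch of a device-aware forward loop: the hidden states follow the
# blocks, hopping to whichever GPU holds the next video/audio pair.
def fused_forward(video_x, audio_x, video_blocks, audio_blocks):
    for vb, ab in zip(video_blocks, audio_blocks):
        dev = next(vb.parameters()).device  # GPU holding this block pair
        video_x = video_x.to(dev)           # explicit inter-GPU transfer
        audio_x = audio_x.to(dev)
        # Old values are read on the right before both are reassigned,
        # so each block's cross-attention sees consistent inputs.
        video_x, audio_x = vb(video_x, audio_x), ab(audio_x, video_x)
    return video_x, audio_x
```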
| Config | GPUs | Total VRAM | Status |
|---|---|---|---|
| 6x RTX 3090/A5000 (24GB) | 6 | 144 GB | Tested, works |
| 4x RTX 3090 (24GB) | 4 | 96 GB | Should work |
| 3x RTX 4090 (24GB) | 3 | 72 GB | Should work |
| 2x 24GB cards | 2 | 48 GB | Tight, may need reduced resolution |
| 1x A100/H100 (80GB) | 1 | 80 GB | Use original repo instead |
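A back-of-envelope check shows why the 2-GPU configuration is tight: the 11.66B-parameter fusion model alone needs roughly 23 GB in bfloat16, before the T5 encoder, VAE, activations, and CUDA overhead are counted.

```python
# Rough VRAM estimate for the transformer weights alone (bfloat16 = 2
# bytes/param). Ignores the T5 encoder, VAE, activations, and overhead.
params = 11.66e9
weights_gb = params * 2 / 1e9  # decimal GB, matching the table above
print(f"Transformer weights alone: {weights_gb:.2f} GB")  # ~23.32 GB
```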
All GPUs must be Ampere or newer (compute capability >= 8.0). Turing cards (RTX 2080, Quadro RTX 8000) will fail: they have no bfloat16 support.
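A quick preflight along these lines can catch an unsupported card before the model loads. A sketch using PyTorch's device-capability query; the helper name is ours, not the repo's:

```python
import torch

# List every visible GPU below compute capability 8.0 (i.e. pre-Ampere),
# since the pipeline runs in bfloat16 and Turing (SM 7.5) lacks it.
def unsupported_gpus():
    bad = []
    for i in range(torch.cuda.device_count()):
        if torch.cuda.get_device_capability(i) < (8, 0):
            bad.append((i, torch.cuda.get_device_name(i)))
    return bad  # empty list means every visible GPU can run the model
```

Any card this lists should be hidden via `CUDA_VISIBLE_DEVICES` before launching.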
# 1. Clone
git clone https://github.com/cronos3k/dreamid-omni-consumer-gpu.git
cd dreamid-omni-consumer-gpu
# 2. Setup
python -m venv venv && source venv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install accelerate imageio[ffmpeg]
# 3. Download weights (~60GB)
python download_weights.py --output-dir ./ckpts
mv ckpts/DreamI_Omni ckpts/DreamID_Omni # fix upstream typo
# 4. Run — set CUDA_VISIBLE_DEVICES to your Ampere GPUs
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
python app.py
# Opens Gradio UI at http://localhost:7866
# Or headless:
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
python inference_sharded.py

On 6x NVIDIA A5000 (24GB each):
| Steps | Resolution | Time/clip | Notes |
|---|---|---|---|
| 10 | 512x992 | ~3.5 min | Fast preview |
| 50 | 512x992 | ~17 min | Full quality |
~20 seconds per denoising step. Slower than single-GPU due to inter-GPU data transfers, but it runs — which is the point.
Upload a face image + voice sample, describe the scene, and generate identity-preserved video with synchronized audio. All from a web browser.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python app.py

- Paired blocks must be co-located. DreamID-Omni's FusionModel cross-attends video_block[i] with audio_block[i]. They must be on the same GPU.
- RoPE frequencies get stuck. Rotary embeddings are stored as buffers that don't move with `dispatch_model`. Fix: `freqs = freqs.to(x.device)`.
- flash_attn ABI mismatch. Won't compile against torch 2.10+. Solution: remove it — PyTorch's built-in SDPA works fine.
- Download script typo. Official script saves to `DreamI_Omni`, engine expects `DreamID_Omni`.
- Turing GPUs break everything. Exclude non-Ampere cards via `CUDA_VISIBLE_DEVICES`.
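The flash_attn-to-SDPA fallback mentioned above can be sketched roughly like this; the wrapper name and layout handling are illustrative rather than the repo's exact code:

```python
import torch
import torch.nn.functional as F

# flash_attn_func takes tensors as (batch, seq, heads, head_dim), while
# PyTorch's built-in SDPA wants (batch, heads, seq, head_dim), hence the
# transposes in the fallback path.
try:
    from flash_attn import flash_attn_func
    HAS_FLASH = True
except ImportError:
    HAS_FLASH = False

def attention(q, k, v):
    if HAS_FLASH:
        return flash_attn_func(q, k, v)  # (B, S, H, D) in and out
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
    )
    return out.transpose(1, 2)  # back to (B, S, H, D)
```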
- DreamID-Omni — Original model by Xu Guo et al., Tsinghua University / ByteDance
- Layer sharding technique from H100 Not Required
- Built by Gregor Hubert
Apache 2.0 (following original DreamID-Omni license)
Everything below is from the upstream DreamID-Omni repository.
Xu Guo et al. Tsinghua University | Intelligent Creation Team, ByteDance
Project Page | Paper | Models
Prompts use special tags to control characters and speech:
- Subject Identity: `<sub1>`, `<sub2>` — Represents character IPs from input images
- Speech: `<S>Your speech content here<E>` — Converted to speech using the character's reference audio
See `test_case/oneip/captions/9.json` (single person) or `test_case/twoip/captions/20.json` (multi-person) for prompt format examples.
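As a hypothetical illustration (not taken from the repo's caption files), a prompt combining both tags might look like:

```
<sub1> stands on a Victorian station platform, smiling. <S>Lovely weather we're having, isn't it?<E>
```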
Built upon Ovi, Wan2.2, MMAudio, Phantom, HuMo, OpenHumanVid.
@misc{guo2026dreamidomni,
title={DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation},
author={Xu Guo and Fulong Ye and Qichao Sun and Liyang Chen and Bingchuan Li and Pengze Zhang and Jiawei Liu and Songtao Zhao and Qian He and Xiangwang Hou},
year={2026},
eprint={2602.12160},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.12160},
}