Run DreamID-Omni (an 11.66B-parameter identity-preserving audio-video model) on multiple consumer 24GB GPUs using layer sharding. This fork replaces the original sequence parallelism (which duplicates the full model on each GPU) with accelerate's `dispatch_model`, which splits the model's layers across your cards.
https://github.com/user-attachments/assets/luther_demo.mp4
Martin Luther discussing the weather with Aristotle at a Victorian train station. 50 denoising steps, 512x992, ~17 minutes on consumer hardware.
Fork of DreamID-Omni by Tsinghua University / ByteDance. Layer sharding technique from H100 Not Required: 32B FLUX.2-dev on 2017 Hardware.
| Original | This Fork |
|---|---|
| Needs 1x 80GB GPU (A100/H100) or 8-GPU sequence parallelism | Runs on 3-6x 24GB consumer GPUs (RTX 3090, A5000, RTX 4090) |
| Sequence parallelism duplicates full model per GPU | Layer sharding — each GPU holds a different slice of the model |
| Requires flash_attn (ABI issues with newer PyTorch) | PyTorch SDPA fallback — works out of the box |
| CLI only | Gradio UI included |
┌────────────────────────────────────────────────────────────────┐
│ Layer Sharding: 11.66B model split across 5x 24GB GPUs         │
│ │
│ GPU 0: Video+Audio Blocks 0-5 + Embeddings + VAE ~11 GB │
│ GPU 1: Video+Audio Blocks 6-11 ~7 GB │
│ GPU 2: Video+Audio Blocks 12-17 ~7 GB │
│ GPU 3: Video+Audio Blocks 18-23 + T5 Encoder ~13 GB │
│ GPU 4: Video+Audio Blocks 24-29 + Heads ~7 GB │
│ │
│ Video block[i] and Audio block[i] stay CO-LOCATED │
│ (they cross-attend each other — must be same GPU) │
└────────────────────────────────────────────────────────────────┘
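The co-location constraint above is enforced when building the `device_map` handed to accelerate's `dispatch_model`. A minimal sketch, using hypothetical module names (`video_blocks.N`, `audio_blocks.N`, and so on; the real map lives in `inference_sharded.py`):

```python
# Sketch of an explicit device_map for accelerate's dispatch_model.
# Module names are illustrative, not the repo's actual attribute names.

def build_device_map(n_blocks: int = 30, n_gpus: int = 5) -> dict:
    """Assign video block i and audio block i to the SAME GPU, so their
    cross-attention never has to hop across a device boundary."""
    per_gpu = -(-n_blocks // n_gpus)  # ceiling division
    device_map = {}
    for i in range(n_blocks):
        gpu = min(i // per_gpu, n_gpus - 1)
        device_map[f"video_blocks.{i}"] = gpu  # co-located pair:
        device_map[f"audio_blocks.{i}"] = gpu  # same GPU by construction
    # Non-block modules pinned to match the diagram above
    device_map["embeddings"] = 0
    device_map["vae"] = 0
    device_map["text_encoder"] = 3  # T5
    device_map["heads"] = n_gpus - 1
    return device_map

# Usage with accelerate (not run here):
# from accelerate import dispatch_model
# model = dispatch_model(model, device_map=build_device_map())
```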
- `inference_sharded.py` — New inference script with explicit device map (replaces `torchrun` sequence parallelism)
- `dreamid_omni/modules/fusion.py` — Device-aware forward loop: moves tensors between GPUs as data flows through blocks
- `dreamid_omni/modules/model.py` — RoPE frequency tensors follow data across devices
- `dreamid_omni/modules/attention.py` — SDPA fallback when flash_attn is unavailable
- `app.py` — Gradio web interface
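The device-aware loop in `fusion.py` can be sketched as follows; the function and call signatures here are illustrative, not the repo's exact code:

```python
import torch

# Sketch of a device-aware forward loop: the hidden states follow the
# blocks, hopping to whichever GPU holds the next video/audio pair.
def fused_forward(video_x, audio_x, video_blocks, audio_blocks):
    for vb, ab in zip(video_blocks, audio_blocks):
        dev = next(vb.parameters()).device  # GPU holding this block pair
        video_x = video_x.to(dev)           # explicit inter-GPU transfer
        audio_x = audio_x.to(dev)
        # Old values are read on the right before both are reassigned,
        # so each block's cross-attention sees consistent inputs.
        video_x, audio_x = vb(video_x, audio_x), ab(audio_x, video_x)
    return video_x, audio_x
```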
| Config | GPUs | Total VRAM | Status |
|---|---|---|---|
| 6x RTX 3090/A5000 (24GB) | 6 | 144 GB | Tested, works |
| 4x RTX 3090 (24GB) | 4 | 96 GB | Should work |
| 3x RTX 4090 (24GB) | 3 | 72 GB | Should work |
| 2x 24GB cards | 2 | 48 GB | Tight, may need reduced resolution |
| 1x A100/H100 (80GB) | 1 | 80 GB | Use original repo instead |
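A back-of-envelope check shows why the 2-GPU configuration is tight: the 11.66B-parameter fusion model alone needs roughly 23 GB in bfloat16, before the T5 encoder, VAE, activations, and CUDA overhead are counted.

```python
# Rough VRAM estimate for the transformer weights alone (bfloat16 = 2
# bytes/param). Ignores the T5 encoder, VAE, activations, and overhead.
params = 11.66e9
weights_gb = params * 2 / 1e9  # decimal GB, matching the table above
print(f"Transformer weights alone: {weights_gb:.2f} GB")  # ~23.32 GB
```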
All GPUs must be Ampere or newer (compute capability >= 8.0). Turing cards (RTX 2080, Quadro RTX 8000) will fail: they have no bfloat16 support.
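A quick preflight along these lines can catch an unsupported card before the model loads. A sketch using PyTorch's device-capability query; the helper name is ours, not the repo's:

```python
import torch

# List every visible GPU below compute capability 8.0 (i.e. pre-Ampere),
# since the pipeline runs in bfloat16 and Turing (SM 7.5) lacks it.
def unsupported_gpus():
    bad = []
    for i in range(torch.cuda.device_count()):
        if torch.cuda.get_device_capability(i) < (8, 0):
            bad.append((i, torch.cuda.get_device_name(i)))
    return bad  # empty list means every visible GPU can run the model
```

Any card this lists should be hidden via `CUDA_VISIBLE_DEVICES` before launching.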
# 1. Clone
git clone https://github.com/cronos3k/dreamid-omni-consumer-gpu.git
cd dreamid-omni-consumer-gpu
# 2. Setup
python -m venv venv && source venv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install accelerate imageio[ffmpeg]
# 3. Download weights (~60GB)
python download_weights.py --output-dir ./ckpts
mv ckpts/DreamI_Omni ckpts/DreamID_Omni # fix upstream typo
# 4. Run — set CUDA_VISIBLE_DEVICES to your Ampere GPUs
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
python app.py
# Opens Gradio UI at http://localhost:7866
# Or headless:
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
python inference_sharded.py

On 6x NVIDIA A5000 (24GB each):
| Steps | Resolution | Time/clip | Notes |
|---|---|---|---|
| 10 | 512x992 | ~3.5 min | Fast preview |
| 50 | 512x992 | ~17 min | Full quality |
~20 seconds per denoising step. Slower than single-GPU due to inter-GPU data transfers, but it runs — which is the point.
Upload a face image + voice sample, describe the scene, and generate identity-preserved video with synchronized audio. All from a web browser.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python app.py

- Paired blocks must be co-located. DreamID-Omni's FusionModel cross-attends video_block[i] with audio_block[i]. They must be on the same GPU.
- RoPE frequencies get stuck. Rotary embeddings are stored as buffers that don't move with `dispatch_model`. Fix: `freqs = freqs.to(x.device)`.
- flash_attn ABI mismatch. Won't compile against torch 2.10+. Solution: remove it — PyTorch's built-in SDPA works fine.
- Download script typo. Official script saves to `DreamI_Omni`, engine expects `DreamID_Omni`.
- Turing GPUs break everything. Exclude non-Ampere cards via `CUDA_VISIBLE_DEVICES`.
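The flash_attn-to-SDPA fallback mentioned above can be sketched roughly like this; the wrapper name and layout handling are illustrative rather than the repo's exact code:

```python
import torch
import torch.nn.functional as F

# flash_attn_func takes tensors as (batch, seq, heads, head_dim), while
# PyTorch's built-in SDPA wants (batch, heads, seq, head_dim), hence the
# transposes in the fallback path.
try:
    from flash_attn import flash_attn_func
    HAS_FLASH = True
except ImportError:
    HAS_FLASH = False

def attention(q, k, v):
    if HAS_FLASH:
        return flash_attn_func(q, k, v)  # (B, S, H, D) in and out
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
    )
    return out.transpose(1, 2)  # back to (B, S, H, D)
```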
- DreamID-Omni — Original model by Xu Guo et al., Tsinghua University / ByteDance
- Layer sharding technique from H100 Not Required
- Built by Gregor Hubert
Apache 2.0 (following original DreamID-Omni license)
Everything below is from the upstream DreamID-Omni repository.
Xu Guo et al. Tsinghua University | Intelligent Creation Team, ByteDance
Project Page | Paper | Models
Prompts use special tags to control characters and speech:
- Subject Identity: `<sub1>`, `<sub2>` — Represents character IPs from input images
- Speech: `<S>Your speech content here<E>` — Converted to speech using the character's reference audio
See `test_case/oneip/captions/9.json` (single person) or `test_case/twoip/captions/20.json` (multi-person) for prompt format examples.
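As a hypothetical illustration (not taken from the repo's caption files), a prompt combining both tags might look like:

```
<sub1> stands on a Victorian station platform, smiling. <S>Lovely weather we're having, isn't it?<E>
```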
Built upon Ovi, Wan2.2, MMAudio, Phantom, HuMo, OpenHumanVid.
@misc{guo2026dreamidomni,
title={DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation},
author={Xu Guo and Fulong Ye and Qichao Sun and Liyang Chen and Bingchuan Li and Pengze Zhang and Jiawei Liu and Songtao Zhao and Qian He and Xiangwang Hou},
year={2026},
eprint={2602.12160},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.12160},
}