ComfyUI custom nodes for ID-LoRA-2.3 inference — audio+video generation with speaker identity transfer, built on top of LTX-2.3. Supports both one-stage (single resolution) and two-stage (2x spatial upsampling) pipelines.
Two-stage output (`max_resolution=1024`, HQ mode, 242 frames @ 25 fps): `ComfyUI_demo_.mp4`

Input voice sample: `reference.mp3`
ID-LoRA transfers a speaker's vocal identity from a reference audio clip into a generated talking-head video. This package wraps the inference pipelines as ComfyUI nodes:
One-stage — generates at a single resolution (lower quality):
```
IDLoraModelLoader --> IDLoraPromptEncoder --> IDLoraOneStageSampler --> SaveVideo
                                                    ^           ^
                                              first_frame  reference_audio
```
Two-stage — generates at target resolution then refines at 2x with spatial upsampling (higher quality):
```
IDLoraTwoStageModelLoader --> IDLoraPromptEncoder --> IDLoraTwoStageSampler --> SaveVideo
                                                            ^           ^
                                                      first_frame  reference_audio
```
The two-stage pipeline produces higher-resolution output (e.g. 512x512 -> 1024x1024) by:
- Stage 1: Generating video+audio at target resolution with ID-LoRA + full guidance (CFG, STG, identity guidance, A/V bimodal)
- Transition: Freeing stage-1 models, 2x spatial upsampling the video latent, re-encoding the first frame at 2x
- Stage 2: Refining at 2x resolution with a distilled LoRA only (no guidance), audio frozen from stage 1
An optional HQ mode uses the res2s second-order sampler instead of Euler for higher quality.
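The stage hand-off described above can be sketched at the shape level. This is a toy sketch with plain tuples, not the real pipeline; the 128-channel latent and 32x spatial VAE compression are assumptions about LTX's latent space, not taken from this README:

```python
def two_stage_shapes(h=512, w=512):
    """Shape-level sketch of the two-stage hand-off (illustrative only)."""
    # Stage 1: denoise video+audio latents at the target resolution with
    # full guidance (CFG, STG, identity, A/V bimodal).
    video_latent = (128, h // 32, w // 32)  # (channels, lat_h, lat_w) -- assumed layout
    audio_finalized = True                  # audio is finalized in stage 1

    # Transition: stage-1 models are freed, then the video latent is
    # spatially upsampled 2x and the first frame is re-encoded at 2x.
    c, lh, lw = video_latent
    video_latent = (c, lh * 2, lw * 2)

    # Stage 2: short distilled-LoRA refinement with no guidance;
    # the audio latent stays frozen from stage 1.
    return video_latent, audio_finalized

print(two_stage_shapes())  # decoding the 2x latent yields e.g. 1024x1024 video
```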
- GPU: NVIDIA GPU with >=24 GB VRAM (48 GB recommended for non-quantized; 80 GB recommended for two-stage non-quantized at high resolutions)
- Python: 3.10+
- ComfyUI: Recent version with `comfy_api.latest` support
- Disk: ~51 GB for one-stage models, ~60 GB for two-stage models (additional upsampler + distilled LoRA)
This package depends on the ID-LoRA repo for its Python packages and model download script. ID-LoRA-2.3 is a subdirectory inside it.

```bash
git clone https://github.com/ID-LoRA/ID-LoRA.git
```

Run the download script from the repository root so models end up where the nodes expect them:

```bash
bash ID-LoRA/ID-LoRA-2.3/scripts/download_models.sh models/
```

This downloads all required models from HuggingFace:
| Model | HuggingFace repo | Size |
|---|---|---|
| LTX-2.3 base checkpoint | Lightricks/LTX-2.3 | ~44 GB |
| Gemma 3 12B text encoder | google/gemma-3-12b-it-qat-q4_0-unquantized | ~6 GB |
| ID-LoRA CelebV-HQ weights | AviadDahan/LTX-2.3-ID-LoRA-CelebVHQ-3K | ~1.1 GB |
| ID-LoRA TalkVid weights | AviadDahan/LTX-2.3-ID-LoRA-TalkVid-3K | ~1.1 GB |
For two-stage, also download (included in the download script):
| Model | HuggingFace repo | Size |
|---|---|---|
| Spatial upsampler | Lightricks/LTX-2.3 | ~1 GB |
| Distilled LoRA | Lightricks/LTX-2.3 | ~7.6 GB |
Note: Gemma requires accepting the license at https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized and logging in with `huggingface-cli login` before downloading.
After downloading, your directory should contain:
```
models/
├── ltx-2.3-22b-dev.safetensors                   # 44 GB — base model
├── gemma-3-12b-it-qat-q4_0-unquantized/          # 6 GB — text encoder
│   ├── config.json
│   ├── model*.safetensors
│   ├── tokenizer.model
│   └── ...
├── id-lora-celebvhq-ltx2.3/                      # 1.1 GB — ID-LoRA
│   └── lora_weights.safetensors
├── ltx-2.3-spatial-upscaler-x2-1.1.safetensors   # 1 GB — spatial upsampler (two-stage)
└── ltx-2.3-22b-distilled-lora-384.safetensors    # 7.6 GB — distilled LoRA (two-stage)
```
```bash
pip install -e ID-LoRA/ID-LoRA-2.3/packages/ltx-core
pip install -e ID-LoRA/ID-LoRA-2.3/packages/ltx-pipelines
pip install -e ID-LoRA/ID-LoRA-2.3/packages/ltx-trainer
```

```bash
cd ComfyUI/custom_nodes
git clone <this-repo-url> comfyui-id-lora-ltx
```

```bash
python ComfyUI/main.py
```

Then open http://127.0.0.1:8188 in your browser. The five nodes will appear under the ID-LoRA category in the node menu.
Two ready-to-use workflows are included in the example_workflows/ directory:
- `id_lora_one_stage.json` — one-stage pipeline
- `id_lora_two_stage.json` — two-stage pipeline with 2x upsampling
In ComfyUI, go to Browse Templates and look for comfyui-id-lora-ltx to load them. They wire all nodes together with LoadImage, LoadAudio, and SaveVideo.
The example workflows use poster_image.png (first frame) and reference.mp3 (speaker reference audio), which are included in the example_inputs/ directory. Copy them into ComfyUI's input/ folder so the LoadImage and LoadAudio nodes can find them:
```bash
cp ComfyUI/custom_nodes/comfyui-id-lora-ltx/example_inputs/poster_image.png ComfyUI/input/
cp ComfyUI/custom_nodes/comfyui-id-lora-ltx/example_inputs/reference.mp3 ComfyUI/input/
```

Then load either workflow template from ComfyUI and click Queue Prompt to run.
Loads the base LTX-2.3 checkpoint, Gemma text encoder, and ID-LoRA weights into a one-stage pipeline object. This is the slow/expensive node — ComfyUI caches its output when inputs don't change.
| Input | Type | Default | Description |
|---|---|---|---|
| `checkpoint_path` | String | `models/ltx-2.3-22b-dev.safetensors` | Path to LTX-2.3 checkpoint |
| `text_encoder_path` | String | `models/gemma-3-12b-it-qat-q4_0-unquantized` | Path to Gemma text encoder directory |
| `lora_path` | String | (empty) | Path to ID-LoRA `.safetensors` |
| `lora_strength` | Float | 1.0 | LoRA strength (0.0-2.0) |
| `quantize` | Combo | `none` | `none`, `int8`, or `fp8` |
| `stg_scale` | Float | 1.0 | Spatio-Temporal Guidance scale (0 disables) |
| `identity_guidance_scale` | Float | 3.0 | Identity guidance strength |
| `av_bimodal_scale` | Float | 3.0 | Audio-video bimodal CFG scale |
Output: ID_LORA_PIPELINE
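For a rough sense of what the `quantize` option buys, here is back-of-envelope weight-memory arithmetic. The bytes-per-parameter figures are standard for each dtype, but actual savings depend on which layers get quantized and on activation memory:

```python
PARAMS = 22e9  # 22B-parameter transformer

def weight_gb(bytes_per_param):
    """Weight memory in GB for the transformer at a given precision."""
    return PARAMS * bytes_per_param / 1e9

print(f"bf16 weights: ~{weight_gb(2):.0f} GB")  # ~44 GB, matches the checkpoint size
print(f"int8 weights: ~{weight_gb(1):.0f} GB")  # ~22 GB
print(f"fp8 weights:  ~{weight_gb(1):.0f} GB")  # fp8 is also 1 byte per weight
```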
Loads the base checkpoint, text encoder, ID-LoRA, spatial upsampler, and distilled LoRA for two-stage generation. Like the one-stage loader, ComfyUI caches this node.
| Input | Type | Default | Description |
|---|---|---|---|
| `checkpoint_path` | String | `models/ltx-2.3-22b-dev.safetensors` | Path to LTX-2.3 checkpoint |
| `text_encoder_path` | String | `models/gemma-3-12b-it-qat-q4_0-unquantized` | Path to Gemma text encoder directory |
| `lora_path` | String | (empty) | Path to ID-LoRA `.safetensors` |
| `lora_strength` | Float | 1.0 | LoRA strength (0.0-2.0) |
| `upsampler_path` | String | `models/ltx-2.3-spatial-upscaler-x2-1.1.safetensors` | Path to spatial upsampler |
| `distilled_lora_path` | String | `models/ltx-2.3-22b-distilled-lora-384.safetensors` | Path to distilled LoRA for stage 2 |
| `quantize` | Combo | `none` | `none`, `int8`, or `fp8` |
| `stg_scale` | Float | 1.0 | Spatio-Temporal Guidance scale (0 disables) |
| `identity_guidance_scale` | Float | 3.0 | Identity guidance strength |
| `av_bimodal_scale` | Float | 3.0 | Audio-video bimodal CFG scale |
Output: ID_LORA_PIPELINE
Encodes positive and negative text prompts into conditioning tensors. Works with both one-stage and two-stage model loaders. Loads the text encoder temporarily, encodes, then frees it to save VRAM.
| Input | Type | Default | Description |
|---|---|---|---|
| `pipeline` | ID_LORA_PIPELINE | — | From either Model Loader |
| `prompt` | String (multiline) | (empty) | Positive prompt |
| `negative_prompt` | String (multiline) | (default negative) | Negative prompt |
Output: ID_LORA_CONDITIONING
Runs the one-stage generation pipeline: denoising loop with identity guidance, then decodes video and audio. Outputs a ComfyUI VIDEO that you wire directly to the built-in Save Video node.
| Input | Type | Default | Description |
|---|---|---|---|
| `pipeline` | ID_LORA_PIPELINE | — | From Model Loader |
| `conditioning` | ID_LORA_CONDITIONING | — | From Prompt Encoder |
| `first_frame` | IMAGE (optional) | None | First-frame image for face conditioning |
| `reference_audio` | AUDIO (optional) | None | Reference speaker audio for identity transfer |
| `seed` | Int | 42 | Random seed |
| `height` | Int | 512 | Output height (multiple of 32) |
| `width` | Int | 512 | Output width (multiple of 32) |
| `num_frames` | Int | 121 | Number of frames to generate |
| `num_inference_steps` | Int | 30 | Denoising steps |
| `frame_rate` | Float | 25.0 | Output frame rate |
| `video_guidance_scale` | Float | 3.0 | Video CFG scale |
| `audio_guidance_scale` | Float | 7.0 | Audio CFG scale |
| `auto_resolution` | Boolean | True | Auto-detect resolution from first frame, preserving aspect ratio |
| `max_resolution` | Int | 512 | Target long-side resolution when `auto_resolution` is on — increase for higher-res output |
Output: VIDEO (wire to Save Video)
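The `auto_resolution` behavior (cap the long side at `max_resolution`, preserve the aspect ratio, keep dimensions at multiples of 32) can be sketched as a small helper. This is a hypothetical reimplementation; the node's actual rounding rules may differ:

```python
def auto_resolution(frame_w, frame_h, max_resolution=512, multiple=32):
    """Scale so the long side is at most max_resolution, keeping the
    aspect ratio and snapping both sides to a multiple of 32."""
    scale = min(1.0, max_resolution / max(frame_w, frame_h))
    w = max(multiple, round(frame_w * scale / multiple) * multiple)
    h = max(multiple, round(frame_h * scale / multiple) * multiple)
    return w, h

print(auto_resolution(1920, 1080))  # 16:9 first frame capped at 512 -> (512, 288)
print(auto_resolution(512, 512))    # already within the cap -> (512, 512)
```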
Runs the two-stage generation pipeline. Stage 1 generates at the specified resolution with full guidance, then stage 2 refines at 2x resolution with the distilled LoRA. Output resolution is 2x the specified height/width.
| Input | Type | Default | Description |
|---|---|---|---|
| `pipeline` | ID_LORA_PIPELINE | — | From Two-Stage Model Loader |
| `conditioning` | ID_LORA_CONDITIONING | — | From Prompt Encoder |
| `first_frame` | IMAGE (optional) | None | First-frame image for face conditioning |
| `reference_audio` | AUDIO (optional) | None | Reference speaker audio for identity transfer |
| `seed` | Int | 42 | Random seed |
| `height` | Int | 512 | Stage 1 height (output = 2x this value) |
| `width` | Int | 512 | Stage 1 width (output = 2x this value) |
| `num_frames` | Int | 121 | Number of frames to generate |
| `num_inference_steps` | Int | 30 | Stage 1 denoising steps (stage 2 uses 3 fixed steps) |
| `frame_rate` | Float | 25.0 | Output frame rate |
| `video_guidance_scale` | Float | 3.0 | Video CFG scale |
| `audio_guidance_scale` | Float | 7.0 | Audio CFG scale |
| `auto_resolution` | Boolean | True | Auto-detect resolution from first frame, preserving aspect ratio |
| `max_resolution` | Int | 512 | Target long-side resolution when `auto_resolution` is on — increase for higher-res output (final output is 2x) |
| `hq_mode` | Boolean | True | Use res2s second-order sampler (higher quality, slower) |
Output: VIDEO (wire to Save Video)
Prompts use a structured `[VISUAL]` / `[SPEECH]` / `[SOUNDS]` format:

```
[VISUAL]: A medium shot of a young man with curly brown hair, sitting on a beige couch.
He is wearing a light blue shirt and speaking warmly.
[SPEECH]: We are proud to introduce ID-LoRA.
[SOUNDS]: The speaker has a moderate volume and conversational tone. Light instrumental
background music plays softly.
```
All three sections are optional but recommended for best results.
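A small helper for assembling the structured prompt. This is a hypothetical convenience function, not part of the package; it simply skips sections you leave empty, matching the note that all three are optional:

```python
def build_prompt(visual=None, speech=None, sounds=None):
    """Assemble a [VISUAL]/[SPEECH]/[SOUNDS] prompt, skipping empty sections."""
    sections = (("VISUAL", visual), ("SPEECH", speech), ("SOUNDS", sounds))
    return "\n".join(f"[{tag}]: {text.strip()}" for tag, text in sections if text)

print(build_prompt(
    visual="A medium shot of a young man with curly brown hair.",
    speech="We are proud to introduce ID-LoRA.",
))
```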
- Add ID-LoRA Model Loader — set paths to checkpoint, text encoder, and LoRA
- Add ID-LoRA Prompt Encoder — connect the pipeline, write your prompt
- Add Load Image + Load Audio — load face image and reference audio
- Add ID-LoRA One-Stage Sampler — connect everything
- Add Save Video — connect the video output
- Add ID-LoRA Two-Stage Model Loader — set paths (including upsampler and distilled LoRA)
- Add ID-LoRA Prompt Encoder — connect the pipeline, write your prompt
- Add Load Image + Load Audio — load face image and reference audio
- Add ID-LoRA Two-Stage Sampler — connect everything, optionally enable hq_mode
- Add Save Video — connect the video output
The output resolution will be 2x the stage 1 resolution (e.g. 512x512 stage 1 -> 1024x1024 output).
The full pipeline is memory-intensive — the 22B-parameter transformer alone requires significant VRAM. If you run into out-of-memory errors, here are several ways to reduce usage, roughly ordered from least to most impact on quality:

1. Enable quantization — set `quantize` to `int8` or `fp8` on the Model Loader node. This reduces the transformer's memory footprint substantially with a small quality tradeoff. `int8` is recommended as a good balance; `fp8` saves slightly more memory but may introduce more artifacts.
2. Lower `max_resolution` — reducing from 768 to 512 (or lower) significantly reduces VRAM during generation, since memory scales quadratically with resolution. For two-stage, this also reduces the stage-2 resolution proportionally (e.g. 512 → 1024 output instead of 768 → 1536).
3. Reduce `num_frames` — fewer frames means less latent state in memory. Try 121 instead of 242 if memory is tight.
4. Disable HQ mode (two-stage only) — the res2s sampler uses more memory than Euler due to its second-order midpoint evaluation. Turning off `hq_mode` saves memory at the cost of some quality.
5. Use one-stage instead of two-stage — the two-stage pipeline loads two separate transformers sequentially and upsamples the latent to 2x, which has a higher peak memory than one-stage.

These options can be combined. For example, `int8` quantization + `max_resolution=512` + one-stage should run on 24 GB GPUs.
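The scaling claims above in numbers. The constants here are illustrative and only the relative factors are meaningful: activation memory grows roughly quadratically with spatial resolution and linearly with frame count:

```python
def relative_memory(resolution, frames, base_res=512, base_frames=121):
    """Activation memory relative to the 512px / 121-frame baseline."""
    return (resolution / base_res) ** 2 * (frames / base_frames)

print(relative_memory(512, 121))  # 1.0  -- baseline
print(relative_memory(768, 121))  # 2.25 -- 1.5x resolution costs 2.25x memory
print(relative_memory(512, 242))  # 2.0  -- doubling frames doubles latent state
```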
- Model paths can be relative or absolute. Relative paths are resolved from the repository root (3 levels above the custom node directory).
- Auto-resolution (enabled by default) matches the first frame's aspect ratio while capping the long side at `max_resolution` (default 512 px). To generate at higher resolution, increase `max_resolution` on the sampler node — the aspect ratio is always preserved. For the two-stage pipeline, this controls stage 1; the final output will be 2x that (e.g. `max_resolution=768` produces up to 1536 px output).
- Two-stage VRAM: Stage-1 models (transformer, audio encoder) are fully freed before loading stage-2 models. The video encoder is shared for upsampling, then freed.
- HQ mode (two-stage only): Uses the res2s second-order sampler instead of Euler in both stages for higher quality at the cost of speed.
- Reusable pipeline (one-stage only): The video encoder is stashed to CPU (not deleted) after encoding, so re-running with different prompts/seeds doesn't require a full reload.
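The relative-path rule in the first note can be sketched as follows. This is a hypothetical helper; only the "3 levels above the custom node directory" rule is taken from this README:

```python
from pathlib import Path

def resolve_model_path(path_str, node_dir):
    """Absolute paths pass through; relative paths resolve against the
    repository root, 3 levels above the custom node directory."""
    p = Path(path_str)
    if p.is_absolute():
        return p
    # 3 levels up: custom_nodes -> ComfyUI -> repository root
    repo_root = Path(node_dir).resolve().parents[2]
    return repo_root / p

print(resolve_model_path(
    "models/ltx-2.3-22b-dev.safetensors",
    "/repo/ComfyUI/custom_nodes/comfyui-id-lora-ltx",
))
```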
```
comfyui-id-lora-ltx/
├── __init__.py              # Extension entry point (registers all 5 nodes)
├── nodes_model_loader.py    # IDLoraModelLoader + IDLoraTwoStageModelLoader nodes
├── nodes_prompt_encoder.py  # IDLoraPromptEncoder node (shared)
├── nodes_sampler.py         # IDLoraOneStageSampler + IDLoraTwoStageSampler nodes
├── pipeline_wrapper.py      # _IDLoraBase, IDLoraOneStagePipeline, IDLoraTwoStagePipeline
├── example_inputs/
│   ├── poster_image.png     # Example first-frame image
│   └── reference.mp3        # Example reference audio
├── example_outputs/
│   └── two_stage_demo.mp4   # Sample two-stage output
├── example_workflows/
│   ├── id_lora_one_stage.json  # One-stage workflow template
│   └── id_lora_two_stage.json  # Two-stage workflow template
├── pyproject.toml           # Package metadata
├── requirements.txt         # Dependencies
└── README.md
```
```bibtex
@misc{dahan2026idloraidentitydrivenaudiovideopersonalization,
  title         = {ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA},
  author        = {Aviad Dahan and Moran Yanuka and Noa Kraicer and Lior Wolf and Raja Giryes},
  year          = {2026},
  eprint        = {2603.10256},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2603.10256}
}
```