Run the LTX‑2 19B model locally to generate videos from both an image and an audio track using Intel Arc (XPU) GPU or CPU. This project demonstrates local AI video generation without sending data to the cloud, leveraging extended shared GPU memory for high-parameter models.
While most local AI models are optimized for NVIDIA GPUs (CUDA), this project demonstrates how to run LTX-2 on Intel Arc (XPU) hardware. This is a significant milestone, as XPU support is currently rare, providing a functional pathway for Intel Arc users to leverage high-performance video generation.
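For context, recent PyTorch builds expose Intel GPUs through the `torch.xpu` backend, which is what makes this possible without CUDA. Here is a minimal device-detection sketch, assuming a PyTorch build with XPU support (which the repo's install script is expected to set up):

```python
import torch

# Minimal device-detection sketch, assuming a PyTorch build with the
# Intel XPU backend (torch.xpu). Falls back to CPU otherwise.
use_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
device = torch.device("xpu" if use_xpu else "cpu")
print(f"Running on: {device}")
```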
The model itself isn’t just one giant block. It’s more like a carefully choreographed pipeline where each part has its own job. Gemma‑3‑12B handles text prompts, turning your words into embeddings that guide what the video should look like. The Video VAE compresses the video into a latent space, making it easier and faster for the transformer to process, and then decodes it back into frames. The Audio VAE does a similar thing for sound, capturing pitch, rhythm, and timbre while ignoring unnecessary details. Connectors act like translators between the different inputs, aligning text, audio, and images into a shared representation that the transformer can understand.
At the center is the 19B-parameter transformer. This is the “brain” that fuses all the inputs and generates coherent video and audio in one pass. The scheduler is part of this process too—it’s like a timing coordinator that tells the transformer how to gradually refine the output. Instead of trying to generate the video and audio perfectly in one shot, the scheduler loops through multiple denoising steps, slowly turning noisy latent representations into clean, synchronized video and sound. This ensures that everything stays aligned and natural over time. Finally, the vocoder takes the audio latents and converts them into actual high-quality sound, so it doesn’t end up robotic or synthetic.
All of these components together make it possible for LTX‑2 to take an image, an audio track, and a prompt, and generate a synchronized video with sound, all in one go. It’s like a team where everyone knows their role, the scheduler keeps the timing right, and the transformer is the conductor making everything come together seamlessly.
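To make that flow concrete, here is a self-contained toy sketch of the encode → denoise → decode loop described above. Every function in it is a stand-in written for illustration, not the actual LTX-2 API:

```python
import torch

# Toy stand-ins for the real components: Gemma-3 text encoder,
# video/audio VAEs, connectors, the 19B transformer, and the scheduler.

def encode_inputs(prompt, image, audio):
    # Stand-in for text embedding and VAE encoding of image and audio;
    # real shapes and values come from the actual encoders.
    text_emb  = torch.randn(1, 77, 64)
    video_lat = torch.randn(1, 16, 8, 8)
    audio_lat = torch.randn(1, 16, 32)
    return text_emb, video_lat, audio_lat

def transformer_step(video_lat, audio_lat, text_emb, t):
    # The real transformer denoises both modalities in one pass,
    # conditioned on the aligned inputs; this toy just decays the noise.
    return video_lat * 0.9, audio_lat * 0.9

text_emb, video_lat, audio_lat = encode_inputs("a singer", None, None)

num_steps = 8
for i in range(num_steps):
    t = 1.0 - i / num_steps  # scheduler timestep, from noisy to clean
    video_lat, audio_lat = transformer_step(video_lat, audio_lat, text_emb, t)

# In the real pipeline, the video VAE decodes video_lat into frames
# and the vocoder converts audio_lat into a waveform.
print(video_lat.abs().mean(), audio_lat.abs().mean())
```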
Place these models in the models/ folder (a quick layout check is sketched after the list). LTX-2 generates the video, Gemma encodes text prompts, and the LTX-2 components handle video/audio encoding and decoding.
- 📦 ltx-2-19b-distilled-fp8.safetensors — Download
- 📁 ltx2_components/
- 📁 gemma-3-12b-it-qat-q4_0-unquantized/ — Download
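A quick way to confirm the layout is a small Python check. This helper is ours, not part of the repo; adjust the paths if you placed the models elsewhere:

```python
from pathlib import Path

# Layout check (not part of the repo): confirm the expected model
# files and folders exist under models/ before running the worker.
expected = [
    "models/ltx-2-19b-distilled-fp8.safetensors",
    "models/ltx2_components",
    "models/gemma-3-12b-it-qat-q4_0-unquantized",
]
for p in expected:
    print(("OK     " if Path(p).exists() else "MISSING"), p)
```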
```bash
# Clone the repository
git clone https://github.com/Ashot72/LTX-2-Audio-to-Video-Local-XPU
cd LTX-2-Audio-to-Video-Local-XPU
```
Install (Windows, Intel XPU/Arc):

```bash
install_all.bat
```
Add inputs:

- `inputs/singer.png` (image)
- `inputs/track.mp3` (audio)
Run:

```bash
python worker.py
```
CPU-only:

```bash
python worker.py --cpu
```
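The `--cpu` flag forces CPU execution even when an XPU is present. A plausible sketch of how such a flag could route device selection follows; the actual worker.py may differ:

```python
import argparse
import torch

# Plausible sketch of --cpu handling; not the actual worker.py code.
parser = argparse.ArgumentParser()
parser.add_argument("--cpu", action="store_true", help="force CPU execution")
args = parser.parse_args()

use_xpu = hasattr(torch, "xpu") and torch.xpu.is_available() and not args.cpu
device = torch.device("xpu" if use_xpu else "cpu")
print(f"Selected device: {device}")
```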
Output: `final_music_video.mp4`

📺 Video: Watch on YouTube
