This repository provides inference code and open weights for the sound effect generative models developed at Sony AI. The current public release includes four models addressing the text-to-audio (T2A) and video-to-audio (V2A) tasks:
- Audio encoder/decoder (Woosh-AE): High-quality latent encoder/decoder providing latents for generative modeling and decoding audio from generated latents.
- Text conditioning (Woosh-CLAP): Multimodal text-audio alignment model providing token latents for diffusion model conditioning.
- T2A generation (Woosh-Flow and Woosh-DFlow): Original and distilled LDMs generating audio unconditionally or from a given text prompt.
- V2A generation (Woosh-VFlow): Multimodal LDM generating audio from a video sequence, with optional text prompts.
Start by installing uv:

```
pip install uv
```

Then install the Woosh environment with either CPU support:

```
uv sync --extra cpu
```

or CUDA support:

```
uv sync --extra cuda
```

Open model weights are available for all Woosh models trained on public datasets. You can download and unzip the pretrained weights from the releases page, or use the GitHub CLI:

```
gh release download v1.0.0
unzip '*.zip'
```

The checkpoints should be located in folders named checkpoints/MODEL_NAME, each containing config and weight files.
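Assuming the checkpoints/MODEL_NAME layout described above, a quick sanity check can confirm that each unzipped checkpoint folder contains both a config and a weight file. This helper is not part of the repository — the function name and the weight-file extensions it looks for are our assumptions:

```python
from pathlib import Path


def check_checkpoints(root="checkpoints"):
    """Return the names of checkpoint folders missing a config or weight file.

    Hypothetical helper: assumes each checkpoints/MODEL_NAME folder should
    contain a file with "config" in its name and a weight file with a common
    extension (.pt, .ckpt, or .safetensors).
    """
    incomplete = []
    for model_dir in sorted(Path(root).iterdir()):
        if not model_dir.is_dir():
            continue
        names = [p.name for p in model_dir.iterdir()]
        has_config = any("config" in n for n in names)
        has_weights = any(n.endswith((".pt", ".ckpt", ".safetensors")) for n in names)
        if not (has_config and has_weights):
            incomplete.append(model_dir.name)
    return incomplete
```

An empty return value means every model folder looks complete.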
We provide audio samples to be used as inputs to our test_Woosh-*.py test scripts. You can download and unzip the file samples.zip from the releases page, or use the GitHub CLI:

```
gh release download v1.0.0 -p 'samples.zip'
unzip samples.zip
```

An inference test script is provided for every model. Just run any of the following:

```
uv run test_Woosh-AE.py
uv run test_Woosh-Flow.py
uv run test_Woosh-DFlow.py
uv run test_Woosh-VFlow.py
uv run test_Woosh-DVFlow.py
uv run test_Woosh-CLAP.py
```

The generated audio/video will be written to outputs/ as .wav/.mp4 files.
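To verify a test run produced output, you can enumerate the generated files under outputs/. This is a hypothetical convenience helper, not part of the repository; it only assumes the .wav/.mp4 extensions stated above:

```python
from pathlib import Path


def list_outputs(root="outputs"):
    """Return sorted paths of generated .wav/.mp4 files under `root`.

    Hypothetical helper based on the documented output location and formats.
    """
    out = Path(root)
    if not out.exists():
        return []
    return sorted(str(p) for p in out.rglob("*") if p.suffix in {".wav", ".mp4"})
```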
Check our tech report on arxiv.org for a description of all models.
Two basic Gradio demos, for the Woosh-Flow and Woosh-DFlow models, are available. To launch a Gradio demo locally, run one of the following:

```
uv run gradio_Woosh-Flow.py
uv run gradio_Woosh-DFlow.py
```

Then open a web browser on the same machine and access the demo at http://127.0.0.1:7860.
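If the browser cannot reach the demo, a minimal reachability check can tell whether anything is listening. This sketch assumes the demo is served over plain HTTP on port 7860 (Gradio's default); the helper name is ours:

```python
import urllib.request


def demo_is_up(url="http://127.0.0.1:7860"):
    """Return True if a local Gradio demo responds at `url`.

    Hypothetical check assuming Gradio's default local address and port.
    """
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, etc.
        return False
```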
Woosh models can be served via our API server. Check the API folder for usage details.
For details about model architecture, training and evaluation, please check our tech report available on arxiv.org.
```
@misc{hadjeres2026,
  title={Woosh: A Sound Effects Foundation Model},
  author={Gaetan Hadjeres and Marc Ferras and Khaled Koutini and Benno Weck and Alexandre Bittar and Thomas Hummel and Zineb Lahrichi and Hakim Missoum and Joan Serrà and Yuki Mitsufuji},
  year={2026},
  eprint={2604.01929},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2604.01929},
}
```

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
- The majority of the code in this repository is released under the MIT license. The video-to-audio Woosh-VFlow and Woosh-DVFlow models use adapted code from MM-AUDIO and MotionFormer; the code for these models is made available under Apache v2 license terms.
- The open weights in the releases page are released under the CC-BY-NC license.
- The test audio and video samples in the releases page contain their individual license terms in the corresponding download file.