⭐ If DreamPRVR is helpful to your projects, please consider starring this repo. Thanks! 🤗
We sincerely invite readers to refer to our previous work ICCV25-HLFormer, as well as our curated Awesome-PRVR.
This repository contains the implementation of our work at the CVPR 2026 main conference:

**Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval**

Jun Li, Xuhang Lou, Jinpeng Wang, Yuting Wang, Yaowei Wang, Shu-Tao Xia, Bin Chen
We propose DreamPRVR, which adopts a coarse-to-fine learning paradigm:
(i) The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model.
(ii) During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception.
(iii) The registers are then fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware learning.
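The register-augmented Gaussian attention blocks above are defined in the paper and the source code. As a rough, hypothetical illustration of the underlying idea only (dot-product attention scores biased by a Gaussian locality prior over temporal distance), a dependency-free toy version might look like the following; `gaussian_attention` and its arguments are our own names for this sketch, not the repo's API:

```python
import math

def gaussian_attention(q, k, v, sigma=2.0):
    """Toy self-attention whose scores are modulated by a Gaussian
    window over temporal distance (illustrative only; not the exact
    block used in DreamPRVR).

    q, k, v: lists of T token vectors, each a list of d floats.
    sigma:   width of the Gaussian locality prior.
    """
    T, d = len(q), len(q[0])
    out = []
    for i in range(T):
        # dot-product scores plus a Gaussian bias in log-space:
        # nearby positions j are penalized less than distant ones
        scores = [
            sum(qa * kb for qa, kb in zip(q[i], k[j])) / math.sqrt(d)
            - (i - j) ** 2 / (2.0 * sigma ** 2)
            for j in range(T)
        ]
        m = max(scores)                       # numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        # attended output: convex combination of the value vectors
        out.append([sum(w[j] * v[j][c] for j in range(T)) for c in range(d)])
    return out
```

Smaller `sigma` concentrates each query on its temporal neighborhood; larger `sigma` recovers ordinary global attention.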
```
git clone https://github.com/lijun2005/CVPR26-DreamPRVR.git
cd CVPR26-DreamPRVR/
```

We train Charades-STA on an Nvidia 3080 Ti with the following environment:
- python==3.11.8
- pytorch==2.0.1
We train TVR and ActivityNet Captions on an Nvidia A100-40G with the following environment:
- python==3.9.17
- pytorch==2.0.1
All features can be downloaded from Baidu Pan or Google Drive (thanks to MS-SL).
!!! Please note that we did not use any features derived from ViT.
The dataset directory is organized as follows:

```
DreamPRVR/
├── activitynet/
│   ├── FeatureData/
│   ├── TextData/
│   ├── val_1.json
│   └── val_2.json
├── charades/
│   ├── FeatureData/
│   └── TextData/
└── tvr/
    ├── FeatureData/
    └── TextData/
```

We convert `feature.bin` into `feature.hdf5`. Please refer to `src/Utils/convert_hdf5.py` (thanks to FAWL).
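The authoritative conversion script is `src/Utils/convert_hdf5.py`, which reads the released `feature.bin` format. Purely to illustrate what the resulting HDF5 layout looks like (one dataset per video ID), here is a hypothetical sketch that starts from an in-memory dict instead of `feature.bin`; the function names are ours, not the repo's:

```python
import h5py
import numpy as np

def convert_to_hdf5(features, out_path):
    """Write a {video_id: (n_frames, dim) feature matrix} dict to one
    HDF5 file, one dataset per video. Illustrative sketch only; the
    repo's actual script parses the released feature.bin format."""
    with h5py.File(out_path, "w") as f:
        for vid, feat in features.items():
            f.create_dataset(vid, data=np.asarray(feat, dtype=np.float32))

def load_video_feature(path, vid):
    # Read one video's feature matrix back from the HDF5 file.
    with h5py.File(path, "r") as f:
        return f[vid][:]
```

Per-video datasets keyed by video ID make random access during training cheap, since only the requested video's features are read from disk.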
Finally, set `root` and `data_root` in the config files (e.g., `cfg['root']` and `cfg['data_root']` in `./src/Configs/tvr.py`).
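Assuming the config files store these settings as plain dictionary entries, as the `cfg['root']` notation above suggests, the edit would look something like this; both paths are placeholders for your machine:

```python
# ./src/Configs/tvr.py (excerpt; both paths are placeholders)
cfg['root'] = '/path/to/CVPR26-DreamPRVR'   # repository root
cfg['data_root'] = '/path/to/tvr/features'  # directory with the downloaded features
```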
To train DreamPRVR on ActivityNet Captions:

```
cd src
python main.py -d act --gpu 0
```

To train DreamPRVR on Charades-STA:

```
cd src
python main.py -d cha --gpu 0
```

To train DreamPRVR on TVR:

```
cd src
python main.py -d tvr --gpu 0
```
For this repository, the expected performance is:
| Dataset | R@1 | R@5 | R@10 | R@100 | SumR | Log | Ckpt |
|---|---|---|---|---|---|---|---|
| ActivityNet Captions | 8.7 | 27.5 | 40.3 | 79.5 | 156.1 | act-log | act-ckpt |
| Charades-STA | 2.6 | 8.7 | 14.5 | 54.2 | 80.0 | cha-log | cha-ckpt |
| TVR | 17.4 | 39.0 | 50.4 | 86.2 | 193.1 | tvr-log | tvr-ckpt |
If you find our code useful or use the toolkit in your work, please consider citing:
```
@misc{li2026dreamprvr,
      title={Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval},
      author={Jun Li and Xuhang Lou and Jinpeng Wang and Yuting Wang and Yaowei Wang and Shu-Tao Xia and Bin Chen},
      year={2026},
      eprint={2604.03653},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.03653},
}
```
This code is based on HLFormer and GMMFormerV2. We are also grateful to other teams for open-sourcing code that inspired our work, including MS-SL and DiffIR.
If you have any questions, you can raise an issue or email Jun Li (220110924@stu.hit.edu.cn) or Jinpeng Wang (wangjp26@gmail.com).