This repo contains the official code for the paper "Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG"
- 🔥 We propose RAP, a training-free framework designed to enhance Multimodal Large Language Models' (MLLMs) ability to process high-resolution images effectively.
[2026.03.15] The LLaVA-1.5 and Qwen3VL series are now supported in our code! Additionally, to address previously reported issues with the LLaVA and PyTorch versions, we have specified the required versions more precisely in requirements.txt.
[2025.06.28] We added a demo script, play.py, for inference on a single image.
[2025.06.07] Our paper was accepted to ICML 2025 as an Oral paper (Top 1%)! 🎉
[2025.05.05] RAP code is available!
[2025.03.04] We released the arXiv paper. 🚀
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To overcome the limitations of existing methods, this paper shifts away from prior dedicated heuristic approaches and revisits the most fundamental idea to HR perception by enhancing the long-context capability of MLLMs, driven by recent advances in long-context techniques like retrieval-augmented generation (RAG) for general LLMs. Towards this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on
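The retrieve-then-arrange idea described above can be illustrated with a toy sketch. It is not the code in this repo: the function names `select_top_k` and `compact_layout` are hypothetical, the retrieval scores are made-up numbers, and the layout rule (dropping grid rows and columns that contain no selected crop, so the kept crops keep their relative positions) is an assumption about how a spatially aware layout could work. The number of crops is fixed here, whereas RE-Search chooses it dynamically.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col) of a crop in the original image grid


def select_top_k(scores: List[List[float]], k: int) -> List[Cell]:
    """Return the grid positions of the k highest-scoring crops, in reading order."""
    ranked = sorted(
        ((score, r, c) for r, row in enumerate(scores) for c, score in enumerate(row)),
        key=lambda item: item[0],
        reverse=True,
    )
    return sorted((r, c) for _, r, c in ranked[:k])


def compact_layout(selected: List[Cell]) -> List[List[Optional[Cell]]]:
    """Drop rows and columns without any selected crop (assumed layout rule).

    The result is a smaller grid in which every kept crop is still above,
    below, left of, or right of the same neighbours as in the original image.
    None marks a position that would be left blank.
    """
    rows = sorted({r for r, _ in selected})
    cols = sorted({c for _, c in selected})
    chosen = set(selected)
    return [[(r, c) if (r, c) in chosen else None for c in cols] for r in rows]


if __name__ == "__main__":
    # Made-up query-to-crop retrieval scores for a 3x3 grid of crops.
    scores = [
        [0.1, 0.9, 0.2],
        [0.0, 0.1, 0.0],
        [0.8, 0.1, 0.7],
    ]
    kept = select_top_k(scores, k=3)
    for row in compact_layout(kept):
        print(row)
```

In this toy grid the middle row holds no selected crop, so it disappears from the compact layout while the remaining crops stay in their original left-to-right and top-to-bottom order.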
- Clone this repository and navigate into the codebase
git clone https://github.com/DreamMr/RAP.git
cd RAP
- Install Packages
conda create -n RAP python=3.10 -y
conda activate RAP
pip install -e .
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git # llava-1.7.0.dev0
In this repo, we implement RAP with LLaVA-OneVision (ov) series, LLaVA-1.5 series and Qwen3VL series. You can either download these checkpoints manually beforehand or let them be fetched automatically when calling the from_pretrained method in transformers.
Download the LMUData:
export LMUData=YOUR_DATASET_PATH
cp vstar.tsv $LMUData
cp hr_bench_4k_single.tsv $LMUData
cp hr_bench_8k_single.tsv $LMUData
# Note: need to modify the md5 in ./rap/dataset/image_mcq.py
# We provide the code to calculate the md5 in ./rap/smp/file.py
# example:
# from rap.smp import md5
# file_path = r'LMUData/vstar.tsv'
# print(md5(file_path))
cd scripts
## LLaVA-OneVision-0.5B
bash run_llava_ov_hrbench.sh
## LLaVA-1.5-7B
## Note: For LLaVA-1.5-7B, with rag_image_size=112 and max_step=200, vstar=91.6
bash run_llava1d5_7b_rap.sh # HR-Bench 4K: 56.5, HR-Bench 8K: 53.6, vstar: 88.9
## LLaVA-1.5-13B
bash run_llava1d5_13 # HR-Bench 4K: 61.9, HR-Bench 8K: 58.9, vstar: 90.2
## Qwen3VL-8B-Instruct `pip install -U transformers==4.57.6`
bash run_qwen3vl_8b_rap.sh # HR-Bench 4K: 76.9, HR-Bench 8K: 74.9, vstar: 92.0
Note: Since the official HR-Bench uses Cyclic Permutation, in order to improve evaluation efficiency, we adopt a two-stage approach: 1) First, for each image and query, we use RAP to obtain key image crops; 2) Then, we use the images obtained in 1) to replace the original images as input.
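To make the cost argument in the note concrete, here is a minimal sketch of cyclic permutation for a multiple-choice question. It only shows how one question expands into several rotated variants; the function name `cyclic_permutations` is hypothetical and this is not the evaluation code used in this repo or in HR-Bench. Because every variant shares the same image and question, the crops retrieved once in stage 1 can be reused for all of them.

```python
from typing import List, Tuple


def cyclic_permutations(options: List[str], answer_idx: int) -> List[Tuple[List[str], int]]:
    """Rotate the option list once per position and track where the answer moves.

    Returns one (rotated_options, new_answer_index) pair per rotation, so a
    question with n options yields n variants.
    """
    n = len(options)
    variants = []
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # An option at index i moves to index (i - shift) mod n after rotation.
        variants.append((rotated, (answer_idx - shift) % n))
    return variants


if __name__ == "__main__":
    labels = "ABCD"
    # Hypothetical question: the correct option is "red" (index 1).
    for opts, idx in cyclic_permutations(["blue", "red", "green", "black"], answer_idx=1):
        print(opts, "-> correct label:", labels[idx])
```

With four options, each question is evaluated four times, which is why running the retrieval step once per image and query and reusing its output saves most of the repeated work.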
To enable better comparison, we also provide evaluation code without RAP.
cd scripts
## LLaVA-OneVision-0.5B
bash run_llava_ov_vanilla.sh
## LLaVA-1.5-7B
bash run_llava1d5_7b_vanilla.sh
## LLaVA-1.5-13B
bash run_llava1d5_13b_vanilla.sh
## Qwen3VL-8B-Instruct `pip install -U transformers==4.57.6`
bash run_qwen3vl_8b_vanilla.sh # HR-Bench 4K: 71.5, HR-Bench 8K: 64.1, vstar: 81.3
Note: If an OOM (Out of Memory) error occurs during evaluation, please try reducing the number of `workers` (in `rap/inference.py`, line 107) and the `max_batch_size` (in `rap/vlm/base.py`, line 24).
We offer a demo file for RAP that can process any given Image-Question pair.
python play.py --model llava_onevision_qwen2_0.5b_ov --image_path ./demo.jpg --input "What's the color of the umbrella?"
python play.py --model llava_onevision_qwen2_0.5b_ov --image_path ./demo.jpg --use_rap --input "What's the color of the umbrella?"
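To compare the two modes on the same Image-Question pair from a script, the commands above can be assembled programmatically. The sketch below uses only the flags shown in the two commands (`--model`, `--image_path`, `--use_rap`, `--input`); the helper name `build_play_command` is hypothetical and not part of this repo.

```python
import shlex
from typing import List


def build_play_command(model: str, image_path: str, question: str, use_rap: bool) -> List[str]:
    """Build the argument list for play.py, with or without the --use_rap flag."""
    cmd = ["python", "play.py", "--model", model, "--image_path", image_path]
    if use_rap:
        cmd.append("--use_rap")
    cmd += ["--input", question]
    return cmd


if __name__ == "__main__":
    for use_rap in (False, True):
        cmd = build_play_command(
            model="llava_onevision_qwen2_0.5b_ov",
            image_path="./demo.jpg",
            question="What's the color of the umbrella?",
            use_rap=use_rap,
        )
        # Print a shell-safe version of the command instead of running it.
        print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` from the repository root would execute the demo; keeping the arguments as a list avoids shell-quoting problems with the apostrophe in the question.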
- Wenbin Wang: wangwenbin97@whu.edu.cn
If you use RAP in your research, please cite our work:
@inproceedings{wangretrieval,
title={Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG},
author={Wang, Wenbin and Jing, Yongcheng and Ding, Liang and Wang, Yingjie and Shen, Li and Luo, Yong and Du, Bo and Tao, Dacheng},
booktitle={Forty-second International Conference on Machine Learning},
url={https://arxiv.org/abs/2503.01222}
}
- VLMEvalKit: Our code builds on the VLMEvalKit codebase.

