This repository contains the code for our submission to the MARS2 Workshop @ ICCV 2025.
We are proud to announce our achievements in the Multimodal Reasoning Competition:
- 🥇 1st Place in Track 1: VG-RS (Visual Grounding in Real-world Scenarios)
- 🥉 3rd Place in Track 2: VQA-SA (Visual Question Answering with Spatial Awareness)
- 🥉 3rd Place in Track 3: VR-Ads (Visual Reasoning in Creative Advertisement Videos)
This repository provides the code to reproduce our results for all three tracks.
We have released the models for the VG-RS and VQA-SA tracks on Hugging Face. You can download them from the following links:
- VG-RS Model: Zach996/ActiveAlphaAgent-VG-RS
- VQA-SA Model: Zach996/ActiveAlphaAgent-VQA-SA
- VR-Ads Model: Qwen/Qwen2.5-VL-72B-Instruct
To download the models, you can use Git LFS:

```bash
# Make sure you have git-lfs installed
git lfs install

# Clone the repository for the VG-RS model
git clone https://huggingface.co/Zach996/ActiveAlphaAgent-VG-RS

# Clone the repository for the VQA-SA model
git clone https://huggingface.co/Zach996/ActiveAlphaAgent-VQA-SA
```

Requirements:

- GPU: The experiments were conducted on servers equipped with 8x NVIDIA A100 (80 GB) or H800 (80 GB) GPUs. The provided scripts are configured for an 8-GPU setup.
- Python Version: This project requires Python 3.10 or higher.
- PyTorch: Ensure you have a version of PyTorch that is compatible with your CUDA environment.
- Dependencies: Install all required packages from `requirements.txt`:

```bash
pip install -r requirements.txt
```
This project relies on several key libraries. The most important ones, with versions pinned in `requirements.txt`, are:

- `torch==2.6.0+cu124`
- `transformers==4.51.3`
- `vllm==0.8.2`
- `flash-attn==2.7.4.post1`
- `xformers==0.0.29.post2`
- `decord==0.6.0` and `pyav==14.2.1` (for video decoding)
This task performs visual grounding on the VG-RS dataset.
We provide a convenient script to run the entire inference pipeline.
1. Configure the script: Open `run_grounding.sh` and modify the variables in the Configuration section to match your environment, especially `INFERENCE_MODE`, `MODEL_PATH`, and the data paths.
2. Run the script:

   ```bash
   bash run_grounding.sh
   ```
The script handles both `hf` and `client` modes. In `client` mode, it manages the VLLM service automatically: it starts the service if it is not running, checks its health, and reuses an existing service if one is available.
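The reuse-or-start decision in `client` mode can be sketched roughly as follows. This is illustrative, not the script's exact logic; the port and the `/health` endpoint of the VLLM OpenAI-compatible server are assumptions:

```shell
#!/usr/bin/env bash
# Rough sketch of the client-mode service check (illustrative only).
# Assumes the VLLM server exposes a /health endpoint on $PORT.
PORT="${PORT:-8000}"

service_healthy() {
    # -s silent, -f fail on HTTP errors, -m 2 two-second timeout
    curl -sf -m 2 "http://localhost:${PORT}/health" > /dev/null
}

if service_healthy; then
    echo "Reusing existing VLLM service on port ${PORT}"
else
    echo "No healthy service on port ${PORT}; the script would start one here"
fi
```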
This task performs VQA on the VQA-SA dataset.
For context-aware VQA (`--prompt_version v2`), the `run_vqa.sh` script will automatically check for and generate a question file with context if it doesn't exist. You just need to ensure the original question file (e.g., `VQA-SA-question.json`) is present at the path specified in the script.
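That check-and-generate step can be sketched as below; the `_with_context` filename suffix and the commented-out generator command are illustrative assumptions, not the script's actual names:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the v2 context check in run_vqa.sh.
# The context filename and the generator helper are assumptions.
QUESTION_FILE="VQA-SA-question.json"
CONTEXT_FILE="${QUESTION_FILE%.json}_with_context.json"

if [ ! -f "$QUESTION_FILE" ]; then
    echo "Error: ${QUESTION_FILE} not found" >&2
elif [ ! -f "$CONTEXT_FILE" ]; then
    echo "Generating ${CONTEXT_FILE} from ${QUESTION_FILE}"
    # python generate_context.py --in "$QUESTION_FILE" --out "$CONTEXT_FILE"  # hypothetical helper
fi
```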
1. Configure the script: Open `run_vqa.sh` and modify the variables in the Configuration section to match your environment, especially `INFERENCE_MODE`, `MODEL_PATH`, and the data paths.
2. Run the script:

   ```bash
   bash run_vqa.sh
   ```
The script handles both `hf` and `client` modes, automatically managing the VLLM service in `client` mode.
This task performs video question answering on the VR-Ads dataset.
1. Configure the script: Open `run_video_reasoning.sh` and modify the variables in the Configuration section to match your environment, especially `MODEL_PATH` and the data paths.
2. Run the script:

   ```bash
   bash run_video_reasoning.sh
   ```
The script first checks whether a compatible VLLM service is already running. If not, it starts one, waits for it to become ready, and then runs the inference. After the task completes, it reminds you to stop the service if the script started it.
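The wait-for-readiness behaviour can be sketched as follows. The port, retry count, and the example `vllm serve` command are placeholders; the real script waits far longer, since loading a 72B model takes minutes:

```shell
#!/usr/bin/env bash
# Illustrative start-and-wait loop for the VLLM service (values are placeholders).
PORT="${PORT:-8000}"
MAX_RETRIES=3   # the real script would allow a much longer startup window

wait_for_service() {
    local retries=0
    until curl -sf -m 2 "http://localhost:${PORT}/health" > /dev/null; do
        retries=$((retries + 1))
        if [ "$retries" -ge "$MAX_RETRIES" ]; then
            return 1
        fi
        sleep 1
    done
}

if ! wait_for_service; then
    echo "Service not ready; the script would launch one here, e.g.:"
    echo "vllm serve Qwen/Qwen2.5-VL-72B-Instruct --port ${PORT} --tensor-parallel-size 8"
fi
```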
- `--json_path`: Path to the input JSON file containing the evaluation data.
- `--task`: The task to run; choices are `grounding` or `vqa`.
- `--output_dir`: Directory in which to save the output results.
- `--image_base_dir`: Root directory where the images are stored.
- `--model_name`: A name for your model configuration, used to generate the default output filename.
- `--inference_mode`: The inference framework; choices are `hf` and `client`. Note: for the `grounding` and `vqa` tasks, the `vllm`/`client` modes score significantly lower (2-4 pp) than `hf` mode due to implementation differences. To reproduce the official scores, please use the `hf` inference mode.
- `--model_path`: Path to the trained model checkpoint. (Applicable to `hf` mode.)
- `--gpu_ids`: Comma-separated list of GPU IDs to use for inference. (Applicable to `hf` mode.)
- `--port`: Port number for the API service. (Applicable to `client` mode.)
- `--num_workers`: Number of worker threads for data processing in `client` mode.
- `--prompt_version`: Prompt version for the VQA task; choices are `v1` and `v2`. (Applicable to the `vqa` task.)
- `--min_pixels`, `--max_pixels`: Minimum/maximum number of pixels for image resizing during preprocessing.
- `--output_path`: Full path for the output JSON file. If not provided, it is generated automatically in `--output_dir` from the input filename and model name.
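Putting these flags together, an `hf`-mode grounding run could look like the command below. The entry-point name `main.py` and all paths are illustrative assumptions; in practice `run_grounding.sh` assembles the real invocation for you:

```shell
#!/usr/bin/env bash
# Illustrative hf-mode grounding invocation. The entry-point name
# (main.py) and all paths are assumptions; run_grounding.sh wires
# up the real values.
CMD=(python main.py
    --task grounding
    --inference_mode hf
    --json_path data/VG-RS-question.json
    --image_base_dir data/images
    --model_path checkpoints/ActiveAlphaAgent-VG-RS
    --model_name ActiveAlphaAgent-VG-RS
    --gpu_ids 0,1,2,3,4,5,6,7
    --output_dir results)
echo "${CMD[@]}"    # print the command instead of running it here
```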
- `--api_url`: API endpoint of the VLLM server.
- `--model_name`: Name of the model being evaluated.
- `--video_root_path`: Root directory where the video files are stored.
- `--question_file_path`: Full path to the JSON file containing the questions.
- `--output_dir`: Directory in which to save the output results.
- `--fps`: Frames per second to sample from each video.
- `--max_workers`: Number of worker threads for data processing.
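For the video task, a `client`-mode invocation might look like the sketch below. The entry-point name `video_qa.py`, the paths, and the `--fps` value are illustrative assumptions; `run_video_reasoning.sh` supplies the real values:

```shell
#!/usr/bin/env bash
# Illustrative client-mode invocation for the VR-Ads task. The entry-point
# name (video_qa.py), the paths, and the fps value are assumptions.
CMD=(python video_qa.py
    --api_url http://localhost:8000/v1
    --model_name Qwen/Qwen2.5-VL-72B-Instruct
    --video_root_path data/videos
    --question_file_path data/VR-Ads-question.json
    --output_dir results
    --fps 2
    --max_workers 8)
echo "${CMD[@]}"    # print the command instead of running it here
```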