This repository contains the code for our submission to the MARS2 Workshop @ ICCV 2025.
We are proud to announce our achievements in the Multimodal Reasoning Competition:
- 🥇 1st Place in Track 1: VG-RS (Visual Grounding in Real-world Scenarios)
- 🥉 3rd Place in Track 2: VQA-SA (Visual Question Answering with Spatial Awareness)
- 🥉 3rd Place in Track 3: VR-Ads (Visual Reasoning in Creative Advertisement Videos)
This repository provides the code to reproduce our results for all three tracks.
We have released the models for the VG-RS and VQA-SA tracks on Hugging Face. You can download them from the following links:
- VG-RS Model: Zach996/ActiveAlphaAgent-VG-RS
- VQA-SA Model: Zach996/ActiveAlphaAgent-VQA-SA
- VR-Ads Model: Qwen/Qwen2.5-VL-72B-Instruct
To download the models, you can use Git LFS:

```bash
# Make sure you have git-lfs installed
git lfs install

# Clone the repository for the VG-RS model
git clone https://huggingface.co/Zach996/ActiveAlphaAgent-VG-RS

# Clone the repository for the VQA-SA model
git clone https://huggingface.co/Zach996/ActiveAlphaAgent-VQA-SA
```

Requirements:

- GPU: The experiments were conducted on servers equipped with 8x NVIDIA A100 (80 GB) or H800 (80 GB) GPUs. The provided scripts are configured for an 8-GPU setup.
- Python Version: This project requires Python 3.10 or higher.
- PyTorch: Ensure you have a version of PyTorch that is compatible with your CUDA environment.
- Dependencies: Install all required packages from `requirements.txt`:

```bash
pip install -r requirements.txt
```
This project relies on several key libraries. The most important ones, with versions pinned in `requirements.txt`, are:

- `torch==2.6.0+cu124`
- `transformers==4.51.3`
- `vllm==0.8.2`
- `flash-attn==2.7.4.post1`
- `xformers==0.0.29.post2`
- `decord==0.6.0` and `pyav==14.2.1` (for video decoding)
This task performs visual grounding on the VG-RS dataset.
We provide a convenient script to run the entire inference pipeline.
1. Configure the script: Open `run_grounding.sh` and modify the variables in the Configuration section to match your environment, especially `INFERENCE_MODE`, `MODEL_PATH`, and the data paths.
2. Run the script:

   ```bash
   bash run_grounding.sh
   ```
The script handles both `hf` and `client` modes. In `client` mode, it manages the VLLM service automatically: it starts the service if it is not running, checks its health, and reuses an existing service if one is available.
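The reuse-or-start decision in `client` mode can be sketched roughly as follows. This is illustrative, not the script's exact logic; the port and the `/health` endpoint of the VLLM OpenAI-compatible server are assumptions:

```shell
#!/usr/bin/env bash
# Rough sketch of the client-mode service check (illustrative only).
# Assumes the VLLM server exposes a /health endpoint on $PORT.
PORT="${PORT:-8000}"

service_healthy() {
    # -s silent, -f fail on HTTP errors, -m 2 two-second timeout
    curl -sf -m 2 "http://localhost:${PORT}/health" > /dev/null
}

if service_healthy; then
    echo "Reusing existing VLLM service on port ${PORT}"
else
    echo "No healthy service on port ${PORT}; the script would start one here"
fi
```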
This task performs VQA on the VQA-SA dataset.
For context-aware VQA (`--prompt_version v2`), the `run_vqa.sh` script will automatically check for and generate a question file with context if it doesn't exist. You just need to ensure the original question file (e.g., `VQA-SA-question.json`) is present at the path specified in the script.
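That check-and-generate step can be sketched as below; the `_with_context` filename suffix and the commented-out generator command are illustrative assumptions, not the script's actual names:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the v2 context check in run_vqa.sh.
# The context filename and the generator helper are assumptions.
QUESTION_FILE="VQA-SA-question.json"
CONTEXT_FILE="${QUESTION_FILE%.json}_with_context.json"

if [ ! -f "$QUESTION_FILE" ]; then
    echo "Error: ${QUESTION_FILE} not found" >&2
elif [ ! -f "$CONTEXT_FILE" ]; then
    echo "Generating ${CONTEXT_FILE} from ${QUESTION_FILE}"
    # python generate_context.py --in "$QUESTION_FILE" --out "$CONTEXT_FILE"  # hypothetical helper
fi
```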
1. Configure the script: Open `run_vqa.sh` and modify the variables in the Configuration section to match your environment, especially `INFERENCE_MODE`, `MODEL_PATH`, and the data paths.
2. Run the script:

   ```bash
   bash run_vqa.sh
   ```
The script handles both `hf` and `client` modes, automatically managing the VLLM service in `client` mode.
This task performs video question answering on the VR-Ads dataset.
1. Configure the script: Open `run_video_reasoning.sh` and modify the variables in the Configuration section to match your environment, especially `MODEL_PATH` and the data paths.
2. Run the script:

   ```bash
   bash run_video_reasoning.sh
   ```
The script first checks whether a compatible VLLM service is already running. If not, it starts one, waits for it to become ready, and then runs the inference. After the task completes, it reminds you to stop the service if the script started it.
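The wait-for-readiness behaviour can be sketched as follows. The port, retry count, and the example `vllm serve` command are placeholders; the real script waits far longer, since loading a 72B model takes minutes:

```shell
#!/usr/bin/env bash
# Illustrative start-and-wait loop for the VLLM service (values are placeholders).
PORT="${PORT:-8000}"
MAX_RETRIES=3   # the real script would allow a much longer startup window

wait_for_service() {
    local retries=0
    until curl -sf -m 2 "http://localhost:${PORT}/health" > /dev/null; do
        retries=$((retries + 1))
        if [ "$retries" -ge "$MAX_RETRIES" ]; then
            return 1
        fi
        sleep 1
    done
}

if ! wait_for_service; then
    echo "Service not ready; the script would launch one here, e.g.:"
    echo "vllm serve Qwen/Qwen2.5-VL-72B-Instruct --port ${PORT} --tensor-parallel-size 8"
fi
```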
- `--json_path`: Path to the input JSON file containing the evaluation data.
- `--task`: The task to run; choices are `grounding` or `vqa`.
- `--output_dir`: Directory in which to save the output results.
- `--image_base_dir`: Root directory where the images are stored.
- `--model_name`: A name for your model configuration, used to generate the default output filename.
- `--inference_mode`: The inference framework; choices are `hf` and `client`. Note: for the `grounding` and `vqa` tasks, the `vllm`/`client` modes score significantly lower (2-4 pp) than `hf` mode due to implementation differences. To reproduce the official scores, please use the `hf` inference mode.
- `--model_path`: Path to the trained model checkpoint. (Applicable to `hf` mode.)
- `--gpu_ids`: Comma-separated list of GPU IDs to use for inference. (Applicable to `hf` mode.)
- `--port`: Port number for the API service. (Applicable to `client` mode.)
- `--num_workers`: Number of worker threads for data processing in `client` mode.
- `--prompt_version`: Prompt version for the VQA task; choices are `v1` and `v2`. (Applicable to the `vqa` task.)
- `--min_pixels`, `--max_pixels`: Minimum/maximum number of pixels for image resizing during preprocessing.
- `--output_path`: Full path for the output JSON file. If not provided, it is generated automatically in `--output_dir` from the input filename and model name.
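Putting these flags together, an `hf`-mode grounding run could look like the command below. The entry-point name `main.py` and all paths are illustrative assumptions; in practice `run_grounding.sh` assembles the real invocation for you:

```shell
#!/usr/bin/env bash
# Illustrative hf-mode grounding invocation. The entry-point name
# (main.py) and all paths are assumptions; run_grounding.sh wires
# up the real values.
CMD=(python main.py
    --task grounding
    --inference_mode hf
    --json_path data/VG-RS-question.json
    --image_base_dir data/images
    --model_path checkpoints/ActiveAlphaAgent-VG-RS
    --model_name ActiveAlphaAgent-VG-RS
    --gpu_ids 0,1,2,3,4,5,6,7
    --output_dir results)
echo "${CMD[@]}"    # print the command instead of running it here
```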
- `--api_url`: API endpoint of the VLLM server.
- `--model_name`: Name of the model being evaluated.
- `--video_root_path`: Root directory where the video files are stored.
- `--question_file_path`: Full path to the JSON file containing the questions.
- `--output_dir`: Directory in which to save the output results.
- `--fps`: Frames per second to sample from each video.
- `--max_workers`: Number of worker threads for data processing.
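For the video task, a `client`-mode invocation might look like the sketch below. The entry-point name `video_qa.py`, the paths, and the `--fps` value are illustrative assumptions; `run_video_reasoning.sh` supplies the real values:

```shell
#!/usr/bin/env bash
# Illustrative client-mode invocation for the VR-Ads task. The entry-point
# name (video_qa.py), the paths, and the fps value are assumptions.
CMD=(python video_qa.py
    --api_url http://localhost:8000/v1
    --model_name Qwen/Qwen2.5-VL-72B-Instruct
    --video_root_path data/videos
    --question_file_path data/VR-Ads-question.json
    --output_dir results
    --fps 2
    --max_workers 8)
echo "${CMD[@]}"    # print the command instead of running it here
```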