VCBench is a streaming counting benchmark for long videos. It treats counting as a minimal probe for diagnosing spatial-temporal state maintenance in video-language models. The benchmark queries a model at multiple time points during playback and measures how its predictions evolve over time, rather than only checking a single final answer.
VCBench decomposes counting into eight subcategories across two axes: object counting and event counting. Object counting includes current-state snapshots, state deltas, identity-tracking counts, and windowed gains. Event counting includes atomic actions, state transitions, episodic segments, and periodic actions. The dataset contains 406 videos, 1,000 questions, 4,576 query points, and 10,071 annotated event or state-change moments. The evaluation protocol uses three complementary metrics: GPA for numerical precision, MoC for monotonic consistency, and UDA for update-direction accuracy.
Included files: data/vcbench_eval.jsonl, data/vcbench_data.jsonl, eval/demo_gemini.py, eval/unify_results.py, eval/compute_metrics.py, run_gemini_eval.sh, requirements.txt
- Run a Gemini evaluation demo on VCBench
- Convert raw per-query-point results into unified per-question format
- Compute GPA, MoC, and UDA
This release is designed so that someone who clones the repo can follow the README and run the provided scripts.
Download the benchmark videos from the Hugging Face dataset:
    huggingface-cli download buaaplay/VCBench --repo-type dataset --local-dir data/videos

The demo script expects the source videos to be organized like this:
data/videos/
    RoomTour3D/
        -FZTi5EfPSQ.mp4
    scannetpp/
        09c1414f1b.mp4
    ...
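
To check that a download matches this layout before running the demo, a small helper along the following lines can list what is present. This is a sketch and not part of the release: the function name `index_videos` is hypothetical, and it assumes every video is an `.mp4` file exactly one folder below the video directory.

```python
from collections import defaultdict
from pathlib import Path


def index_videos(video_dir):
    """Map each source subfolder (e.g. 'RoomTour3D') to the sorted video IDs inside it."""
    index = defaultdict(list)
    # Assumption: layout is <video_dir>/<source>/<video_id>.mp4, one level deep.
    for path in Path(video_dir).glob("*/*.mp4"):
        index[path.parent.name].append(path.stem)
    return {source: sorted(ids) for source, ids in index.items()}


if __name__ == "__main__":
    # Print a per-source summary for the default download location.
    for source, ids in sorted(index_videos("data/videos").items()):
        print(f"{source}: {len(ids)} videos")
```

An empty summary suggests the videos landed in a different folder than the one passed via `--video-dir`.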
    pip install -r requirements.txt

Set your Gemini key:
    export GEMINI_API_KEY="your-gemini-api-key"

Then run the provided shell script:
    bash run_gemini_eval.sh --video-dir data/videos --limit 5

You can also override the defaults with environment variables such as VIDEO_DIR, INPUT_JSONL, LIMIT, MODEL, and FPS.
The script will:
- Run Gemini on a small demo slice of VCBench
- Write raw per-query-point outputs to outputs/
- Convert the raw file to unified format
- Compute GPA, MoC, and UDA
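
The conversion step in this list can be pictured roughly as follows. This is only an illustrative sketch, not the logic of `eval/unify_results.py`: the field names `question_id`, `query_time`, and `prediction` are hypothetical, and the real raw and unified schemas may differ.

```python
import json
from collections import defaultdict


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def unify(raw_records):
    """Group per-query-point records into one record per question, ordered by query time."""
    grouped = defaultdict(list)
    for rec in raw_records:
        # Hypothetical key: each raw record is assumed to name its question.
        grouped[rec["question_id"]].append(rec)

    unified = []
    for qid, recs in grouped.items():
        recs.sort(key=lambda r: r["query_time"])
        unified.append({
            "question_id": qid,
            "query_times": [r["query_time"] for r in recs],
            "predictions": [r["prediction"] for r in recs],
        })
    return unified
```

Keeping the predictions of one question together and in time order is what allows metrics that compare consecutive query points to be computed per question.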
If you want to run the pieces separately:
    python eval/demo_gemini.py \
        --video-dir data/videos \
        --input data/vcbench_eval.jsonl \
        --limit 5

    python eval/unify_results.py outputs/vcbench_gemini_demo_*.jsonl outputs/vcbench_gemini_demo_unified.jsonl

    python eval/compute_metrics.py outputs/vcbench_gemini_demo_unified.jsonl data/vcbench_eval.jsonl

The metric script prints raw 0-1 values. Multiply by 100 if you want paper-style percentages.
- GPA: Gaussian Precision Accuracy. Higher is better.
- MoC: Monotonicity Consistency. Higher is better.
- UDA: Update Direction Accuracy. Higher is better.
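
For intuition about what the three names measure, the sketch below scores one question's sequence of predictions against its ground-truth counts. It is a loose reading of the metric names only: the exact definitions are those in the paper and in `eval/compute_metrics.py`. The Gaussian form, the `sigma` parameter, and all function names here are assumptions.

```python
import math


def gaussian_score(pred, truth, sigma=1.0):
    """Assumed precision score: 1.0 for an exact count, decaying with the error."""
    return math.exp(-((pred - truth) ** 2) / (2 * sigma ** 2))


def monotonic_fraction(preds):
    """Assumed consistency score: share of consecutive steps where the count does not drop."""
    pairs = list(zip(preds, preds[1:]))
    if not pairs:
        return 1.0
    return sum(b >= a for a, b in pairs) / len(pairs)


def _sign(x):
    return (x > 0) - (x < 0)


def direction_agreement(preds, truths):
    """Assumed direction score: share of steps where prediction and truth move the same way."""
    n = min(len(preds), len(truths)) - 1
    if n <= 0:
        return 1.0
    hits = sum(
        _sign(preds[i + 1] - preds[i]) == _sign(truths[i + 1] - truths[i])
        for i in range(n)
    )
    return hits / n


if __name__ == "__main__":
    preds, truths = [1, 2, 2, 1], [1, 2, 3, 3]
    precision = sum(gaussian_score(p, t) for p, t in zip(preds, truths)) / len(preds)
    # Scores are on a 0-1 scale; multiply by 100 for percentages.
    print(f"precision={100 * precision:.1f}")
    print(f"monotonic={100 * monotonic_fraction(preds):.1f}")
    print(f"direction={100 * direction_agreement(preds, truths):.1f}")
```

The monotonic check only makes sense for cumulative counts that cannot decrease, so a real implementation would likely apply it to a subset of question types.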
data/
    vcbench_eval.jsonl
    vcbench_data.jsonl
eval/
    demo_gemini.py
    unify_results.py
    compute_metrics.py
run_gemini_eval.sh
requirements.txt
@misc{liu2026vcbenchstreamingcountingbenchmark,
title={VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos},
author={Pengyiang Liu and Zhongyue Shi and Hongye Hao and Qi Fu and Xueting Bi and Siwei Zhang and Xiaoyang Hu and Zitian Wang and Linjiang Huang and Si Liu},
year={2026},
eprint={2603.12703},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.12703},
}

This dataset and code are released under CC BY 4.0.