Official Repository for NeurIPS'25 Paper "Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task"
Updated May 18, 2026 - Python
Official codebase for the paper "WorldMemArena: Evaluating Multimodal Agent Memory Through Action–World Interaction"
MedCTA: A Benchmark for Clinical Tool Agents
A modular framework for benchmarking multimodal AI agents in a reproducible, full-OS environment. It uses an adaptation of smolagents' CodeAgent, creates the VMs with QEMU, and runs them inside Docker containers.
Adaptive Recursive Image Grid for Multimodal Agents.
Fully local AI desktop application that uses multi-agent orchestration to analyze short videos (~1 min) through natural language queries. All AI inference runs offline with no cloud dependencies.
Evaluation of GPT-4o-mini on the OSWorld desktop automation benchmark. Compares screenshot-only and accessibility-tree-enhanced approaches across 10 tasks (Chrome, LibreOffice, file operations, etc.). Documents critical coordinate extraction failures and provides architectural recommendations for GUI agents.
CoSee is a research prototype for diagnosing shared-state collaboration in resource-constrained visual agents using an auditable Board-based multimodal VQA workflow.