multimodal-agents

Here are 8 public repositories matching this topic...

fansunqi / VideoTool

Official Repository for NeurIPS'25 Paper "Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task"

agent computer-vision deep-learning video-understanding video-analysis multimodal videoqa llm mllm tool-learning multimodal-agents video-agents

Updated May 18, 2026
Python

UCSB-AI / WorldMemArena

Official codebase for the paper "WorldMemArena: Evaluating Multimodal Agent Memory Through Action–World Interaction"

memory ai-agents agentic-memory multimodal-agents multimodal-memory memory-evaluation

Updated May 29, 2026
Python

IVUL-KAUST / MedCTA

MedCTA: A Benchmark for Clinical Tool Agents

benchmark controllers tool-use medical-ai vlms agentic-ai reasoning-language-models multimodal-agents tool-augmented-agents

Updated Jun 11, 2026
Python

melchiorhering / GUI-OS-AI-Agent-Benchmarking

A modular framework for benchmarking multimodal AI agents in a reproducible, full-OS environment. It uses an adaptation of smolagents' CodeAgent, with VMs created using QEMU and run inside Docker containers.

docker benchmarking qemu gui-automation ai-agents smolagents multimodal-agents spider2-v

Updated Jun 15, 2026
Jupyter Notebook

Yeqi99 / CanpGrid

Adaptive Recursive Image Grid for Multimodal Agents.

python image-grid visual-grounding agent-tools gui-agents computer-use multimodal-agents canpai

Updated Jun 16, 2026
Python

haja-k / agentic-video-analyst

Fully local AI desktop application that uses multi-agent orchestration to analyze short videos (~1 min) through natural language queries. All AI inference runs offline with no cloud dependencies.

video transcribe edge-computing agentic multimodal-agents

Updated Feb 11, 2026
Python

suraj-ranganath / osworld-gpt4o-mini-benchmark

Evaluation of GPT-4o-mini on the OSWorld desktop automation benchmark. Compares screenshot-only and accessibility-tree-enhanced approaches across 10 tasks (Chrome, LibreOffice, file operations, etc.). Documents critical coordinate extraction failures and provides architectural recommendations for GUI agents.

desktop-automation gpt-4o-mini multimodal-agents osworld

Updated Nov 20, 2025
Python

Clayca / CoSee

CoSee is a research prototype for diagnosing shared-state collaboration in resource-constrained visual agents using an auditable Board-based multimodal VQA workflow.

multi-agent-systems visual-question-answering failure-mode-analysis multimodal-agents

Updated May 15, 2026
Python
