Molmo2-HF-Demo

A Gradio-based demonstration for the AllenAI Molmo2-8B multimodal model, enabling image QA, multi-image pointing, video QA, and temporal tracking. Users upload images or videos, provide natural language prompts (e.g., "Point to the boats" or "Track the player"), and receive generated responses with visual annotations: red circles and labels for detected points on images/videos. Supports multi-image galleries and MP4 videos with frame sampling.

Features

Multi-Image Processing: Upload galleries for comparative QA or pointing tasks; extracts coordinates from model output and annotates with red ellipses and "target" labels.
Video Analysis: Handles MP4 inputs for QA or tracking; samples frames, detects temporal points (timestamps), and overlays circles on annotated output videos.
Point/Track Extraction: Parses <points|tracks> XML-like coords from responses; scales to original dimensions; supports IDs for multi-element detection.
Custom Theme: OrangeRedTheme with gradients and enhanced typography for a modern interface.
Examples Integration: Pre-loaded samples for quick testing (e.g., cat/dog pointing, football tracking).
Queueing Support: Handles concurrent inferences with error display.
Efficient Inference: Uses auto dtype/device_map; up to 2048 new tokens for verbose video responses.

Screenshot 2025-12-20 at 13-57-26 Molmo2 HF Demo - a Hugging Face Space by prithivMLmods

Screenshot 2025-12-20 at 13-59-03 Molmo2 HF Demo - a Hugging Face Space by prithivMLmods

annotated_video.mp4

Prerequisites

Python 3.10 or higher.
CUDA-compatible GPU (recommended for auto dtype; falls back to CPU).
pip >= 23.0.0 (see pre-requirements.txt).
Stable internet for initial model download (~8B params).

Installation

Clone the repository:

git clone https://github.com/PRITHIVSAKTHIUR/Molmo2-HF-Demo.git
cd Molmo2-HF-Demo

Install pre-requirements (for pip version): Create a pre-requirements.txt file with the following content, then run:
```
pip install -r pre-requirements.txt
```
pre-requirements.txt content:
```
pip>=23.0.0
```

Install dependencies: Create a requirements.txt file with the following content, then run:

pip install -r requirements.txt

requirements.txt content:

transformers==4.57.1
huggingface_hub
qwen-vl-utils
pyvips-binary
sentencepiece
opencv-python
torch==2.6.0
docling-core
molmo_utils
python-docx
torchvision
supervision
accelerate
matplotlib
pdf2image
reportlab
markdown
requests
pymupdf
decord2
hf_xet
spaces
pyvips
pillow
gradio
einops
httpx
fpdf
peft
timm
av

Start the application:
```
python app.py
```
The demo launches at http://localhost:7860 (or the provided URL if using Spaces).

Usage

Switch between "Images" and "Video" tabs.

Images Tab

Upload Gallery: Select multiple images (e.g., via file paths).
Enter Prompt: E.g., "Point to the cats" or "Compare these two images."
Run Analysis: Click "Run Image Analysis."
Output: Text response + annotated gallery (red circles on pointed elements).

Video Tab

Upload Video: Select an MP4 file.
Enter Prompt: E.g., "Track the football" or "Explain the video."
Run Analysis: Click "Run Video Analysis" (note: longer due to frame sampling).
Output: Text response + annotated MP4 (circles on tracked points).

Examples

Tab	Example Inputs	Prompt Example
Images	["example-images/cat1.jpg", "example-images/cat2.jpg", "example-images/dog1.jpg"]	"Point to the cats."
Images	["example-images/compare1.jpg", "example-images/compare2.jpeg"]	"Compare these two images."
Video	"example-videos/sample_video.mp4"	"Track the football."
Video	"example-videos/drink.mp4"	"Explain the video."

Troubleshooting

Model Loading Errors: Verify transformers 4.57.1 and torch 2.6.0; check device_map="auto" for multi-GPU. Use dtype=torch.float32 if bfloat16 fails.
Point Extraction Fails: Ensure prompt requests "point" or "track"; regex handles <points coords="..."> format. Console logs raw text.
Video Annotation Issues: OpenCV requires MP4V codec; scale factors applied if metadata differs. Test with short clips.
Gallery Paths: Use 'filepath' type; tuples handled for older Gradio.
OOM on GPU: Reduce max_new_tokens or batch size; clear cache with torch.cuda.empty_cache().
Utils Missing: Install molmo_utils for video processing; process_vision_info samples frames.
UI Rendering: Set ssr_mode=True if gradients fail; CSS for container max-width.

Contributing

Contributions encouraged! Fork the repo, add examples or enhance parsing (e.g., for multi-video), and submit PRs with tests. Focus areas:

Support for more modalities (e.g., audio).
Custom annotation styles.
Batch video processing.

Repository: https://github.com/PRITHIVSAKTHIUR/Molmo2-HF-Demo.git

License

Apache License 2.0. See LICENSE for details.

Built by Prithiv Sakthi. Report issues via the repository.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Molmo2-HF-Demo

Features

Prerequisites

Installation

Usage

Images Tab

Video Tab

Examples

Troubleshooting

Contributing

License

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
example-images		example-images
example-videos		example-videos
ipynb		ipynb
LICENSE		LICENSE
README.md		README.md
app.py		app.py
pre-requirements.txt		pre-requirements.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Molmo2-HF-Demo

Features

Prerequisites

Installation

Usage

Images Tab

Video Tab

Examples

Troubleshooting

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages