# Herculis-CUA-GUI-Actioner-4B (Gradio Demo)

Herculis-CUA-GUI-Actioner-4B is a Computer Use Agent (CUA) multimodal model designed for GUI understanding, UI localization, and action execution across web, desktop, and mobile environments. It focuses on visual grounding, intent-driven actioning, and UI-based visual question answering (VQA), enabling reliable interaction with real-world software interfaces. The model is optimized for efficient inference while maintaining strong accuracy on complex UI workflows.
## Features

- UI Localization: Upload screenshots and provide natural language prompts (e.g., "Locate the `microsoft/Fara-7B` model") to predict precise click coordinates in the format `Click(x, y)`.
- Visual Grounding: Outputs annotated images with red ellipses marking predicted action points, scaled to the input resolution.
- Efficient Inference: Uses bfloat16 precision on CUDA for fast generation (max 128 new tokens); deterministic output with `do_sample=False`.
- Prompt Engineering: Structured localization prompts ensure focused responses without extraneous text.
- Error Handling: Graceful fallbacks for model loading, resizing, or parsing issues; detailed console logging.
- Gradio Interface: Simple UI with image upload, prompt input, and real-time visualization; supports public sharing via `share=True`.
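Because the model replies in plain text, the demo has to pull the `Click(x, y)` coordinates out of the reply before it can draw a marker. A minimal sketch of that parsing step (the helper name and regex are illustrative assumptions, not the repo's actual code):

```python
import re

def parse_click(text: str):
    """Extract (x, y) from a model reply containing Click(x, y).

    Returns None when no coordinate pair is found, so callers can fall
    back to showing the raw text. Hypothetical helper for illustration.
    """
    match = re.search(r"Click\(\s*(\d+)\s*,\s*(\d+)\s*\)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_click("Click(450, 300)"))      # -> (450, 300)
print(parse_click("no coordinates here"))  # -> None
```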
## Requirements

- Python 3.10 or higher.
- CUDA-compatible GPU (recommended for bfloat16; falls back to CPU, but inference is slower).
- Stable internet connection for the initial model download from Hugging Face.
## Installation

```bash
git clone https://github.com/PRITHIVSAKTHIUR/Herculis-CUA-GUI-Actioner-4B-Demo.git
cd Herculis-CUA-GUI-Actioner-4B-Demo
```

Create a `requirements.txt` file with the following content, then run:

```bash
pip install -r requirements.txt
```

`requirements.txt` content:

```text
gradio==6.1.0
transformers==4.57.1
numpy
torch
torchvision
accelerate
qwen-vl-utils
requests
pillow
spaces
```
## Running the Demo

```bash
python app.py
```

The demo launches at http://localhost:7860 (or a public URL if using `share=True`).
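Under the hood, `app.py` follows the standard transformers multimodal generation pattern. A hedged sketch of what one inference pass can look like: the Hugging Face model id, the model/processor classes, and the prompt template wording are assumptions; only the bfloat16 precision, the 128-token cap, and `do_sample=False` come from this README.

```python
def build_localization_prompt(target: str) -> str:
    """Structured localization prompt; the exact template wording is an
    assumption, not the repo's actual code."""
    return ("Locate the element described below and reply only with "
            f"Click(x, y).\n{target}")

def run_localization(image_path: str, target: str) -> str:
    """One inference pass. Heavy imports live inside the function so the
    prompt helper above stays importable without a GPU."""
    import torch
    from PIL import Image
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "prithivMLmods/Herculis-CUA-GUI-Actioner-4B"  # assumed HF id
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto",
        trust_remote_code=True,
    )
    messages = [{"role": "user", "content": [
        {"type": "image", "image": Image.open(image_path)},
        {"type": "text", "text": build_localization_prompt(target)},
    ]}]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    # Deterministic decoding capped at 128 new tokens, as in this demo.
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
```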
## Usage

1. Upload Image: Provide a UI screenshot (e.g., a web page or app interface; a pre-loaded example is available).
2. Enter Prompt: Describe the target element (e.g., "Locate the `microsoft/Fara-7B` model." or "Find the search bar.").
3. Localize: Click "Localize" to run inference.
4. View Results:
   - Text: Model output with `Click(x, y)` coordinates.
   - Image: Annotated screenshot with a red ellipse at the predicted position.
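The red-ellipse annotation has to account for resizing: when the model saw a downscaled image, its predicted point lives in that resized space and must be mapped back to the screenshot's original resolution before drawing. A minimal sketch of that mapping and of the ellipse bounding box (function names and the fixed radius are assumptions, not the repo's code):

```python
def map_to_original(x, y, resized_wh, original_wh):
    """Scale a point from the model's resized space back to the
    uploaded screenshot's resolution (hypothetical helper)."""
    rw, rh = resized_wh
    ow, oh = original_wh
    return round(x * ow / rw), round(y * oh / rh)

def ellipse_bbox(x, y, radius=20):
    """Bounding box for an ellipse centred on the predicted point,
    in PIL's (left, top, right, bottom) order."""
    return (x - radius, y - radius, x + radius, y + radius)

# Model predicted Click(450, 300) on a 1000x700 resized image;
# the original screenshot was 2000x1400.
x, y = map_to_original(450, 300, (1000, 700), (2000, 1400))
print((x, y), ellipse_bbox(x, y))  # -> (900, 600) (880, 580, 920, 620)
```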
| Type | Preview |
|---|---|
| Input Image | ![]() |
| Output Image | ![]() |
## Example

- Upload a screenshot of a Hugging Face models page.
- Prompt: "Locate the `microsoft/Fara-7B` model."
- Output: `Click(450, 300)` and the image with a red marker on the model card.
## Troubleshooting

- Model Loading Errors: Check your network connection and ensure transformers 4.57.1 is installed. For flash-attention issues, install it via `pip install flash-attn`. Verify CUDA availability with `torch.cuda.is_available()`.
- Resizing Fails: The image processor uses `smart_resize`; input dimensions must be positive. The demo falls back to the original image if resizing errors occur.
- No Coordinates Parsed: Ensure the prompt ends with the target description; the model's output is deterministic text. Check the console for the raw response.
- OOM on GPU: Reduce the batch size or run on CPU; clear the cache with `torch.cuda.empty_cache()`.
- Gradio Share Issues: Run with `debug=True`; public links expire after 72 hours.
- PIL Warnings: Update Pillow if resampling errors appear.
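The `smart_resize` mentioned above comes from qwen-vl-utils: it snaps both image dimensions to multiples of a patch factor while keeping the total pixel count within bounds. A simplified sketch of that rule (the constants are the library's defaults as I understand them; the actual implementation may differ in detail):

```python
import math

def smart_resize_simplified(height: int, width: int, factor: int = 28,
                            min_pixels: int = 56 * 56,
                            max_pixels: int = 14 * 14 * 4 * 1280):
    """Simplified sketch of qwen-vl-utils' smart_resize: snap both sides
    to multiples of `factor`, then rescale if the area falls outside
    [min_pixels, max_pixels]."""
    if height <= 0 or width <= 0:
        raise ValueError("input dimensions must be positive")
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w

# A 1080x1920 screenshot exceeds max_pixels and gets scaled down;
# both sides stay multiples of 28.
print(smart_resize_simplified(1080, 1920))
```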
## Contributing

Contributions are welcome! Fork the repo, create a feature branch, and submit a pull request with tests. Potential enhancements:
- Multi-step action chains.
- Support for type/scroll actions.
- Integration with automation tools like Selenium.
Repository: https://github.com/PRITHIVSAKTHIUR/Herculis-CUA-GUI-Actioner-4B-Demo.git
## License

Apache License 2.0. See LICENSE for details.
Built by Prithiv Sakthi. Report issues via the repository.

