A Gradio-based demonstration for the prithivMLmods/Gliese-CUA-Tool-Call-8B model, specialized in GUI element localization. Users upload UI screenshots, provide task instructions (e.g., "Click on the search bar"), and receive predicted click coordinates in `Click(x, y)` format, visualized as crosshairs and labels on the image. Features model download to a local directory for offline use, smart image resizing, and coordinate scaling back to the original resolution.
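The `Click(x, y)` responses described above can be extracted with a small regular expression. A minimal sketch (the `parse_click` helper is illustrative, not the app's actual function):

```python
import re

def parse_click(text: str):
    """Extract the first Click(x, y) action from raw model output.

    Returns an (x, y) tuple of ints, or None if no click is found.
    """
    match = re.search(r"Click\((\d+),\s*(\d+)\)", text)
    if match:
        return int(match.group(1)), int(match.group(2))
    return None

print(parse_click("I will Click(250, 150) on the search bar."))  # (250, 150)
```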
- Element Localization: Natural language tasks predict precise pixel coordinates for UI components (e.g., buttons, inputs).
- Action Visualization: Overlays red crosshairs with yellow labels on the output image using PIL for clear action points.
- Smart Resizing: Automatically resizes inputs based on model processor params (min/max pixels, patch/merge sizes) for optimal inference.
- Coordinate Scaling: Adjusts resized coordinates back to original image dimensions for accurate absolute positioning.
- Efficient Inference: Uses bfloat16/float32 precision on CUDA; generates up to 128 new tokens with deterministic output.
- Local Model Storage: Downloads the model via a Hugging Face Hub snapshot to `./model/` for faster reloads and offline capability.
- Custom Theme: OrangeRedTheme with gradients for an intuitive interface.
- Queueing Support: Handles up to 50 concurrent inferences.
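The coordinate-scaling feature above is simple proportional mapping from the resized inference image back to the original pixel grid. A hedged sketch (the function name and rounding choice are illustrative):

```python
def scale_to_original(x, y, resized_wh, original_wh):
    """Map a predicted (x, y) on the resized image back to the
    original image's absolute pixel coordinates."""
    rw, rh = resized_wh
    ow, oh = original_wh
    return round(x * ow / rw), round(y * oh / rh)

# A point predicted at (250, 150) on a 1000x500 resize maps back
# to (500, 300) on the 2000x1000 original.
print(scale_to_original(250, 150, (1000, 500), (2000, 1000)))  # (500, 300)
```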
- Python 3.10 or higher.
- CUDA-compatible GPU (recommended for bfloat16; falls back to CPU).
- Stable internet for initial model download (subsequent runs use local cache).
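The bfloat16/float32 fallback noted above can be sketched as follows (assumes `torch` is installed; the exact device and dtype handling in `app.py` may differ):

```python
import torch

# Prefer bfloat16 on CUDA for memory savings; fall back to float32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
print(device, dtype)
```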
1. Clone the repository:

   ```bash
   git clone https://github.com/PRITHIVSAKTHIUR/Gliese-CUA-Tool-Call-8B-Localization.git
   cd Gliese-CUA-Tool-Call-8B-Localization
   ```

2. Install dependencies: create a `requirements.txt` file with the following content, then run `pip install -r requirements.txt`:

   ```
   gradio==6.1.0
   transformers==4.57.1
   huggingface-hub
   numpy
   torch
   torchvision
   accelerate
   qwen-vl-utils
   requests
   pillow
   spaces
   ```

3. Start the application:

   ```bash
   python app.py
   ```

   The demo launches at `http://localhost:7860` (or the provided URL if using Spaces). The first run downloads the model (~8B params) to `./model/Gliese-CUA-Tool-Call-8B`.
1. Upload Image: Provide a UI screenshot (e.g., PNG of a web page or app; height up to 500px).
2. Enter Task: Describe the target (e.g., "Locate the search bar" or "Find the submit button").
3. Call CUA Agent: Click the button to run inference.
4. View Results:
   - Text: Raw model response with the parsed `Click(x, y)`.
   - Image: Annotated screenshot with crosshair visualization.
- Upload a browser screenshot.
- Task: "Click on the search bar."
- Output: `Click(250, 150)` and an image with a red crosshair on the bar.
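The crosshair-and-label overlay can be reproduced with basic PIL drawing calls. A minimal sketch assuming Pillow (arm length, colors, and the `draw_crosshair` name are illustrative, not the app's exact styling):

```python
from PIL import Image, ImageDraw

def draw_crosshair(img: Image.Image, x: int, y: int, arm: int = 15) -> Image.Image:
    """Draw a red crosshair centered at (x, y) with a yellow text label."""
    out = img.convert("RGB")
    draw = ImageDraw.Draw(out)
    draw.line([(x - arm, y), (x + arm, y)], fill="red", width=3)
    draw.line([(x, y - arm), (x, y + arm)], fill="red", width=3)
    # PIL's built-in default font is used when no TTF file is available.
    draw.text((x + arm + 4, y - arm), f"Click({x}, {y})", fill="yellow")
    return out

annotated = draw_crosshair(Image.new("RGB", (400, 300), "white"), 250, 150)
```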
- Model Download Fails: Check your internet connection; resume with `resume_download=True`. Verify `allow_patterns="Localization-8B/**"`.
- Loading Errors: Ensure transformers 4.57.1 is installed; check CUDA with `torch.cuda.is_available()`. Use `torch.float32` if bfloat16 causes an OOM.
- No Coordinates Parsed: The task must be localization-focused; the raw output is printed to the console. Increase `max_new_tokens` if needed.
- Resizing Issues: `smart_resize` enforces min/max pixel bounds; it falls back to the original size on errors.
- Visualization Problems: A PIL font fallback is used; ensure images are RGB.
- Queue Full: Increase `max_size` in `demo.queue()`.
- Spaces Deployment: Install `spaces`; set `show_error=True` for debugging.
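The `smart_resize` behavior referenced above follows the Qwen-VL convention of rounding dimensions to a patch multiple while clamping total pixels. A simplified sketch, not the library's exact implementation (the `factor`, `min_pixels`, and `max_pixels` defaults are assumptions):

```python
import math

def smart_resize(height, width, factor=28,
                 min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Round (height, width) to multiples of `factor`, rescaling so the
    total pixel count stays within [min_pixels, max_pixels] while roughly
    preserving the aspect ratio."""
    h = round(height / factor) * factor
    w = round(width / factor) * factor
    if h * w > max_pixels:
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w

print(smart_resize(1080, 1920))
```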
Contributions are encouraged! Fork the repo, create a feature branch (e.g., for multi-target support), and submit PRs with tests. Focus areas:
- Extension to tool-calling beyond localization.
- Batch image processing.
- Custom prompt templates.
Repository: https://github.com/PRITHIVSAKTHIUR/Gliese-CUA-Tool-Call-8B-Localization.git
Apache License 2.0. See LICENSE for details.
Built by Prithiv Sakthi. Report issues via the repository.