Gliese-CUA-Tool-Call-8B-Demo

A Gradio-based demonstration for the prithivMLmods/Gliese-CUA-Tool-Call-8B model, a Computer Use Agent (CUA) specialized in GUI understanding and tool-calling actions. Users upload a UI screenshot (e.g., a desktop or app interface) and provide a task instruction (e.g., "Click on the search bar"). The demo parses the model's response into actions (e.g., clicks or typed text with coordinates) and visualizes them as crosshairs and labels on the image. The model emits structured JSON tool calls inside <tool_call> blocks for precise interactions.

Features

GUI Action Inference: Natural language tasks generate JSON-formatted tool calls (e.g., {"action": "click", "coordinate": [400, 300]}).
Action Visualization: Overlays red crosshairs for clicks (or blue for others) with yellow labels on the output image using PIL.
Tool-Call Parsing: Extracts actions from regex-matched <tool_call> blocks; supports coordinates and text inputs.
Efficient Processing: Uses float16 precision on CUDA; generates up to 512 new tokens with Qwen2.5-VL architecture.
Custom Theme: OrangeRedTheme with gradients for an engaging interface.
Queueing Support: Queues up to 50 pending requests (max_size in demo.queue()) for smooth usage under load.
Error Resilience: Fallbacks for missing inputs; console logging for raw responses and parsed actions.
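
The tool-call parsing described above can be sketched as follows. This is a minimal illustration, not the code in app.py: the regex, the function name parse_tool_calls, and the sample response are assumptions, and only the {"action": ..., "coordinate": [...]} shape comes from the feature list.

```python
import json
import re

# Assumed pattern: capture everything between <tool_call> and </tool_call>,
# across newlines, without being greedy over multiple blocks.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(response: str) -> list[dict]:
    """Return every JSON object found inside <tool_call> blocks."""
    actions = []
    for payload in TOOL_CALL_RE.findall(response):
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            # Skip malformed blocks instead of failing the whole request.
            continue
        if isinstance(data, dict):
            actions.append(data)
    return actions


# Hypothetical model response, for illustration only.
sample = (
    "I will click the search bar.\n"
    "<tool_call>\n"
    '{"action": "click", "coordinate": [400, 300]}\n'
    "</tool_call>"
)
print(parse_tool_calls(sample))
```

Skipping malformed blocks rather than raising keeps one bad block from discarding the other actions in the same response.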

Prerequisites

Python 3.10 or higher.
CUDA-compatible GPU (recommended for float16; falls back to CPU).
Stable internet for initial model download from Hugging Face.
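
A quick way to see which device and precision the demo is likely to use is to check CUDA availability before launching. The sketch below is an assumption about how such a check could look, not the demo's own logic; the helper name pick_device_and_dtype is hypothetical, and it returns the dtype as a string so it also runs where torch is not installed yet.

```python
import importlib.util


def pick_device_and_dtype() -> tuple[str, str]:
    """Return ("cuda", "float16") when a CUDA GPU is usable, else ("cpu", "float32")."""
    # Without torch installed there is nothing to probe, so report the CPU defaults.
    if importlib.util.find_spec("torch") is None:
        return "cpu", "float32"

    import torch

    if torch.cuda.is_available():
        return "cuda", "float16"
    return "cpu", "float32"


device, dtype = pick_device_and_dtype()
print(f"device={device}, dtype={dtype}")
```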

Installation

Clone the repository:

```
git clone https://github.com/PRITHIVSAKTHIUR/Gliese-CUA-Tool-Call-8B-Demo.git
cd Gliese-CUA-Tool-Call-8B-Demo
```

Install dependencies: Create a requirements.txt file with the following content, then run:

```
pip install -r requirements.txt
```

requirements.txt content:

```
gradio==6.1.0
transformers==4.57.1
numpy
torch
torchvision
accelerate
qwen-vl-utils
requests
pillow
spaces
```

Start the application:
```
python app.py
```
The demo launches at http://localhost:7860 (or the provided URL if using Spaces).

Usage

Upload Image: Provide a UI screenshot (e.g., PNG of a desktop or app window; height up to 500px).
Enter Task: Describe the action (e.g., "Click on the search bar" or "Type 'Hello World' in the input field").
Call CUA: Click "Call CUA" to run inference.
View Results:
- Text: Raw model response with parsed JSON actions.
- Image: Annotated screenshot showing action points (crosshairs with labels).

Example Workflow

Upload a Windows desktop image.
Task: "Click on the start menu."
Output: Response with <tool_call> block; image with red crosshair labeled "Click" on the start button.
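
The annotation step in this workflow can be approximated with PIL as shown below. This is a sketch under stated assumptions: the function names, crosshair size, line width, and label placement are hypothetical, and drawing the label as yellow text is only one reading of "yellow labels". The red-for-click, blue-for-other colour rule comes from the feature list.

```python
def crosshair_segments(x: int, y: int, size: int = 15):
    """Return the horizontal and vertical segments of a crosshair centred at (x, y)."""
    return [((x - size, y), (x + size, y)), ((x, y - size), (x, y + size))]


def annotate(image, action: dict, size: int = 15):
    """Draw a crosshair and label for one parsed action on a copy of a PIL image."""
    # Imported lazily so the geometry helper above has no third-party dependency.
    from PIL import ImageDraw

    x, y = action["coordinate"]
    name = action.get("action", "action")
    color = "red" if name == "click" else "blue"

    out = image.convert("RGB")  # convert() returns a new image, so the input is untouched
    draw = ImageDraw.Draw(out)
    for start, end in crosshair_segments(x, y, size):
        draw.line([start, end], fill=color, width=2)
    # Label drawn with PIL's default font, offset so it does not cover the crosshair.
    draw.text((x + size, y - size), name.capitalize(), fill="yellow")
    return out
```

For example, annotate(Image.open("desktop.png"), {"action": "click", "coordinate": [400, 300]}) would return a new RGB image with a red crosshair at (400, 300); desktop.png is a placeholder file name.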

Troubleshooting

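Model Loading Errors: Verify transformers 4.57.1; check CUDA with torch.cuda.is_available(). Use torch.float32 if float16 is unsupported on your device (e.g., CPU). Note that float32 roughly doubles memory use, so it will not resolve out-of-memory errors.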
No Actions Parsed: Ensure task includes actionable elements; raw output logged in console. Adjust max_new_tokens if truncated.
Visualization Issues: PIL font fallback used; ensure images are RGB.
Queue Full: Increase max_size in demo.queue() for higher traffic.
Vision Utils: Install qwen-vl-utils for image processing; process_vision_info handles inputs.
UI Rendering: Set ssr_mode=True if gradients fail; check CSS for custom styles.
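
When no actions are parsed, one cheap diagnostic is to check whether the raw response was cut off inside a <tool_call> block, which suggests raising max_new_tokens. The helper below is a hypothetical sketch, not part of app.py; it only compares the number of opening and closing tags.

```python
def looks_truncated(response: str) -> bool:
    """Return True if a <tool_call> block was opened but never closed."""
    opened = response.count("<tool_call>")
    closed = response.count("</tool_call>")
    return opened > closed


# Hypothetical raw responses, for illustration only.
complete = '<tool_call>{"action": "click", "coordinate": [400, 300]}</tool_call>'
cut_off = '<tool_call>{"action": "click", "coordi'

print(looks_truncated(complete), looks_truncated(cut_off))
```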

Contributing

Contributions encouraged! Fork the repo, create a feature branch (e.g., for multi-action support), and submit PRs with tests. Focus areas:
