Zero-shot VLMs fail on GUI tasks not due to lack of capability, but due to **ambiguity in UI affordances**. OpenAdapt resolves this by conditioning agents on human demonstrations — "show, don't tell."
| Traditional Agent | OpenAdapt Agent |
|-------------------|-----------------|
| User writes prompts | User records demonstration |
| Ambiguous instructions | Grounded in actual UI |

| | Without demos | With demos |
|---|---|---|
|**Prompting**| Zero-shot baseline |**Retrieval-augmented prompting** (Phase 2, validated) |
|**Fine-tuning**| Standard SFT baseline |**Demo-conditioned FT** (core goal) |
The bottom-right cell is OpenAdapt's unique value: training models to **use** demonstrations they haven't seen before, combining retrieval with fine-tuning for maximum accuracy. Phase 2 (retrieval-only prompting) is validated; Phase 3 (demo-conditioned fine-tuning) is in progress.
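
For illustration, a training example in that bottom-right cell might be shaped like the sketch below. The field names and values are assumptions for exposition, not the actual openadapt-ml training schema:

```python
# Hypothetical shape of one demo-conditioned SFT example (field names are
# assumptions): the model learns to map (retrieved demo, new task, screen)
# -> next action, so at deployment it can exploit demos it never saw in training.
example = {
    "context_demo": "click 'System Settings' > click 'Displays' > toggle 'Night Shift'",
    "task": "turn on True Tone",              # related but unseen task
    "observation": "screenshot_0421.png",     # current screen capture
    "target_action": {"type": "click", "element": "Displays"},
}
print(example["target_action"])
```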
**Validated result**: On a controlled macOS benchmark (45 System Settings tasks sharing a common navigation entry point), demo-conditioned prompting improved first-action accuracy from 46.7% to 100%. A length-matched control (only +11.1 pp) confirms the benefit is semantic, not an artifact of prompt length. See the [research thesis](https://github.com/OpenAdaptAI/openadapt-ml/blob/main/docs/research_thesis.md) for methodology and the [publication roadmap](docs/publication-roadmap.md) for limitations.
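
To make demo-conditioned prompting concrete, here is a minimal sketch of the retrieve-then-condition step: the most similar recorded demonstration is retrieved and prepended to the VLM prompt. The `Demo` class, the lexical `similarity` stub, and `build_prompt` are illustrative assumptions, not the openadapt-ml API; a real system would use embedding similarity over a demo library.

```python
# Sketch of runtime demo conditioning: retrieve the nearest recorded demo
# and prepend it to the prompt so the agent sees "how a human did it".
from dataclasses import dataclass

@dataclass
class Demo:
    task: str        # natural-language description of the recorded task
    transcript: str  # serialized action trace

def similarity(a: str, b: str) -> float:
    """Toy lexical overlap; a real system would use embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def build_prompt(task: str, library: list[Demo]) -> str:
    best = max(library, key=lambda d: similarity(task, d.task))
    return (
        f"Reference demonstration for a similar task ({best.task}):\n"
        f"{best.transcript}\n\n"
        f"Current task: {task}\n"
        "Respond with the next UI action."
    )

library = [Demo("enable Night Shift",
                "click 'System Settings' > click 'Displays' > click 'Night Shift...'")]
print(build_prompt("turn on Night Shift from the menu", library))
```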
**Industry validation**: [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight, XLANG Lab) built their cross-platform capture tool on OpenAdapt, but uses demos only for training — not runtime conditioning. No open-source CUA framework currently does demo-conditioned inference, which remains OpenAdapt's architectural differentiator.
---

`docs/design/landing-page-strategy.md`
4. **Open Source (MIT License)**: Full transparency, no vendor lock-in
**Key Innovation**:
- **Trajectory-conditioned disambiguation of UI affordances** — the only open-source CUA framework that conditions agents on recorded demonstrations at runtime (validated: 46.7% → 100% first-action accuracy)
- **Specialization over scale** — fine-tuned Qwen3-VL-2B outperforms Claude Sonnet 4.5 and GPT-5.1 on action accuracy (42.9% vs 11.2% vs 23.2%) on an internal benchmark
- **Capture-to-deployment pipeline** — record → retrieve → train → deploy, used by [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight) as foundation for their capture tooling
- **Set-of-Marks (SoM) mode**: 100% accuracy on synthetic benchmarks using element IDs instead of coordinates (see the sketch below)
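
As a rough illustration of why SoM mode reaches 100% on synthetic benchmarks, the sketch below contrasts coordinate regression with element-ID selection. The element table and action dict are hypothetical, not OpenAdapt's actual schema:

```python
# Set-of-Marks sketch: the screen is annotated with numbered marks, and the
# model emits an element ID instead of raw pixel coordinates.
elements = {
    1: {"role": "button", "name": "General", "bbox": (40, 120, 200, 152)},
    2: {"role": "button", "name": "Displays", "bbox": (40, 160, 200, 192)},
}

def execute(action: dict) -> tuple[int, int]:
    """Resolve a SoM action to a click point at the element's center."""
    x0, y0, x1, y1 = elements[action["element_id"]]["bbox"]
    return ((x0 + x1) // 2, (y0 + y1) // 2)

# Coordinate mode: the model must regress exact pixels -> brittle.
# SoM mode: the model picks from enumerated IDs -> exact by construction.
print(execute({"type": "click", "element_id": 2}))  # -> (120, 176)
```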
1. **Capture-to-Deployment Pipeline**
   - Not: "Prompt the AI to do your task"
   - But: "Record the task once. OpenAdapt handles the rest — retrieval, training, deployment."
   - Proof: 7 modular packages (capture, ML, evals, grounding, retrieval, privacy, viewer); OpenCUA (NeurIPS 2025) built on OpenAdapt's capture infrastructure (a minimal sketch of this loop follows the list)

2. **Demonstration-Conditioned Agents**
   - Not: "Zero-shot reasoning about what to click"
   - But: "Agents conditioned on relevant demos — at inference AND during training"
   - Proof: 46.7% → 100% first-action accuracy with demo conditioning (validated, n=45). No other open-source CUA framework does runtime demo conditioning.
   - Note: This is first-action accuracy on tasks sharing a navigation entry point. Multi-step and cross-domain evaluation is ongoing on Windows Agent Arena.

3. **Specialization Over Scale**
   - Not: "Use the biggest model available"
   - But: "A 2B model fine-tuned on your workflows outperforms frontier models"
   - Proof: Qwen3-VL-2B (42.9%) vs Claude Sonnet 4.5 (11.2%) and GPT-5.1 (23.2%) on action accuracy (internal benchmark, synthetic login task)
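
A minimal sketch of the record → retrieve → train → deploy loop named in pillar 1. Every function here is a placeholder standing in for one of the modular packages (capture, retrieval, ML, evals); none are real entry points:

```python
# Placeholder pipeline: capture demos, index them for retrieval, fine-tune a
# small model on them, then serve the model with the demo index so it can be
# conditioned on relevant demonstrations at inference time.
def record(task: str) -> dict:
    """Stand-in for screen/input capture of one demonstration."""
    return {"task": task, "actions": ["click ...", "type ..."]}

def build_index(demos: list[dict]) -> list[dict]:
    """Stand-in for the retrieval index (a real one would embed the demos)."""
    return demos

def fine_tune(base_model: str, demos: list[dict]) -> str:
    """Stand-in for SFT on (demo context, task) -> action pairs."""
    return f"{base_model}-ft"

def deploy(model: str, index: list[dict]) -> None:
    """Serve the tuned model with the demo index for runtime conditioning."""
    print(f"serving {model} with {len(index)} demos available at inference")

demos = [record("enable Night Shift"), record("add a printer")]
deploy(fine_tune("qwen3-vl-2b", demos), build_index(demos))
```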
| Competitor | Strengths | Weaknesses | OpenAdapt Advantage |
|------------|-----------|------------|---------------------|
|**Anthropic Computer Use**| 72.5% OSWorld (near-human), simple API | Proprietary, cloud-only, no customization, per-action cost | Open source, model-agnostic, trainable, runs locally |
|**Agent S3 (Simular)**| 72.6% OSWorld (superhuman), open source | Zero-shot only, no demo conditioning, no fine-tuning pipeline | Demo-conditioned agents, capture-to-train pipeline |
|**OpenCUA (XLANG Lab)**| NeurIPS Spotlight, 45% OSWorld, open models (7B-72B) | Zero-shot at inference — demos used only for training, not runtime | Runtime demo conditioning (unique), OpenCUA built on our capture tool |
|**Browser Use**| 50k+ GitHub stars, 89% WebVoyager | Browser-only, no desktop, no training pipeline | Full desktop support, fine-tuning, demo library |
|**UI-TARS (ByteDance)**| Local models (2B-72B), Apache 2.0 | No demo conditioning, no capture pipeline | End-to-end record→train→deploy, demo retrieval |
|**CUA / Bytebot**| Container infra, YC-backed | Infrastructure-only, no ML training pipeline | Full pipeline: capture + train + eval + deploy |
|**Traditional RPA (UiPath, etc.)**| Enterprise-proven, UiPath Screen Agent #1 on OSWorld | Brittle selectors, expensive ($10K+/yr), requires scripting | AI-first, learns from demos, open source |
### 4.2 Positioning Statement
```
Show it once. Let it handle the rest.
```
## Why OpenAdapt?
### Record Once, Automate Forever
Capture any workflow. OpenAdapt retrieves relevant demos to guide agents AND trains specialized models on your recordings.

[Stat: 46.7% → 100% first-action accuracy with demo conditioning]
### Small Models, Big Results
A 2B model fine-tuned on your workflows outperforms frontier models.