
Commit 6fa74e0

abrichr and claude committed
docs: reframe positioning with multi-pillar strategy and honest scoping
- README: Replace "Demo-Conditioned Prompting" with "Trajectory-Conditioned Disambiguation", showing the 2x2 experimental matrix (prompting validated, fine-tuning in progress). Add OpenCUA industry validation.
- Landing page strategy: Lead with capture-to-deployment pipeline, add specialization pillar, update competitor table for March 2026 landscape (Agent S3, OpenCUA, Browser Use, CUA/Bytebot). Add honesty notes for proof points.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Parent: a12355a · Commit: 6fa74e0

2 files changed: 63 additions & 43 deletions

README.md

Lines changed: 11 additions & 9 deletions
@@ -285,18 +285,20 @@ flowchart TB
     class L0,L1,L2 implemented
 ```
 
-### Core Approach: Demo-Conditioned Prompting
+### Core Approach: Trajectory-Conditioned Disambiguation
 
-OpenAdapt explores **demonstration-conditioned automation** - "show, don't tell":
+Zero-shot VLMs fail on GUI tasks not due to lack of capability, but due to **ambiguity in UI affordances**. OpenAdapt resolves this by conditioning agents on human demonstrations — "show, don't tell."
 
-| Traditional Agent | OpenAdapt Agent |
-|-------------------|-----------------|
-| User writes prompts | User records demonstration |
-| Ambiguous instructions | Grounded in actual UI |
-| Requires prompt engineering | Reduced prompt engineering |
-| Context-free | Context from similar demos |
+| | No Retrieval | With Retrieval |
+|---|---|---|
+| **No Fine-tuning** | 33–47% (zero-shot baseline) | **100%** (validated, n=45) |
+| **Fine-tuning** | Standard SFT baseline | **Demo-conditioned FT** (core goal) |
 
-**Retrieval powers BOTH training AND evaluation**: Similar demonstrations are retrieved as context for the VLM. In early experiments on a controlled macOS benchmark, this improved first-action accuracy from 46.7% to 100% - though all 45 tasks in that benchmark share the same navigation entry point. See the [publication roadmap](docs/publication-roadmap.md) for methodology and limitations.
+The bottom-right cell is OpenAdapt's unique value: training models to **use** demonstrations they haven't seen before, combining retrieval with fine-tuning for maximum accuracy. Phase 2 (retrieval-only prompting) is validated; Phase 3 (demo-conditioned fine-tuning) is in progress.
+
+**Validated result**: On a controlled macOS benchmark (45 System Settings tasks sharing a common navigation entry point), demo-conditioned prompting improved first-action accuracy from 46.7% to 100%. A length-matched control (+11.1 pp only) confirms the benefit is semantic, not token-length. See the [research thesis](https://github.com/OpenAdaptAI/openadapt-ml/blob/main/docs/research_thesis.md) for methodology and the [publication roadmap](docs/publication-roadmap.md) for limitations.
+
+**Industry validation**: [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight, XLANG Lab) built their cross-platform capture tool on OpenAdapt, but uses demos only for training — not runtime conditioning. No open-source CUA framework currently does demo-conditioned inference, which remains OpenAdapt's architectural differentiator.
 
 ### Key Concepts
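The retrieval-conditioned mechanism the README hunk describes (similar demonstrations retrieved and prepended as context for the VLM) can be sketched in a few lines. This is a hedged illustration only: the names (`retrieve_demo`, `build_prompt`) and the toy bag-of-words similarity are hypothetical stand-ins, not the OpenAdapt API.

```python
# Minimal sketch of demo-conditioned (retrieval-augmented) prompting.
# All names here are hypothetical -- NOT the OpenAdapt API.
from collections import Counter
from math import sqrt


def _vec(text):
    # Toy bag-of-words vector; a real system would embed screenshots too.
    return Counter(text.lower().split())


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_demo(task, demo_library):
    """Return the recorded demonstration most similar to the new task."""
    return max(demo_library, key=lambda d: _cosine(_vec(task), _vec(d["task"])))


def build_prompt(task, demo):
    """Condition the model on the retrieved trajectory ("show, don't tell")."""
    steps = "\n".join(f"  {i + 1}. {a}" for i, a in enumerate(demo["actions"]))
    return (
        f"Reference demonstration for a similar task ({demo['task']}):\n"
        f"{steps}\n"
        f"New task: {task}\n"
        "Predict the first action."
    )


demos = [
    {"task": "enable dark mode in System Settings",
     "actions": ["open System Settings", "click Appearance", "click Dark"]},
    {"task": "change wallpaper in System Settings",
     "actions": ["open System Settings", "click Wallpaper", "pick image"]},
]
demo = retrieve_demo("turn on dark mode", demos)
prompt = build_prompt("turn on dark mode", demo)
```

A production retriever would use a learned encoder over screenshots and action trajectories rather than word overlap, but the conditioning step itself (prepend the retrieved trajectory, then ask for the next action) has the same shape.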

docs/design/landing-page-strategy.md

Lines changed: 52 additions & 34 deletions
@@ -40,7 +40,9 @@ OpenAdapt has evolved from a monolithic application (v0.46.0) to a **modular met
 4. **Open Source (MIT License)**: Full transparency, no vendor lock-in
 
 **Key Innovation**:
-- **Trajectory-conditioned disambiguation of UI affordances** - validated experiment showing 33% -> 100% first-action accuracy with demo conditioning
+- **Trajectory-conditioned disambiguation of UI affordances** — the only open-source CUA framework that conditions agents on recorded demonstrations at runtime (validated: 46.7% → 100% first-action accuracy)
+- **Specialization over scale** — fine-tuned Qwen3-VL-2B outperforms Claude Sonnet 4.5 and GPT-5.1 on action accuracy (42.9% vs 11.2% vs 23.2%) on an internal benchmark
+- **Capture-to-deployment pipeline** — record → retrieve → train → deploy, used by [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight) as foundation for their capture tooling
 - **Set-of-Marks (SoM) mode**: 100% accuracy on synthetic benchmarks using element IDs instead of coordinates
 
 ### 1.2 Current Landing Page Assessment
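The record → retrieve → train → deploy pipeline named in the key-innovation bullet can be sketched as four stages. A minimal stand-in sketch follows; the function names and the keyword-index "training" step are hypothetical illustrations of the pipeline's shape, not OpenAdapt's implementation.

```python
# Hypothetical sketch of a capture-to-deployment pipeline
# (record -> retrieve -> train -> deploy). Names are illustrative
# only -- this is NOT the OpenAdapt API.

def record(task, actions):
    """Capture a demonstration as a (task, action trajectory) pair."""
    return {"task": task, "actions": list(actions)}


def retrieve(library, task):
    """Rank demos by word overlap with the new task (toy similarity)."""
    overlap = lambda d: len(set(d["task"].split()) & set(task.split()))
    return sorted(library, key=overlap, reverse=True)


def train(library):
    """Stand-in for fine-tuning: index demos by task keywords."""
    return {word: d for d in library for word in d["task"].split()}


def deploy(model, task):
    """Stand-in for inference: return the first action for a known task."""
    for word in task.split():
        if word in model:
            return model[word]["actions"][0]
    return None


library = [record("open settings", ["click apple menu", "click System Settings"])]
model = train(library)
first_action = deploy(model, "open settings")
```

The point of the sketch is the data flow, not the internals: each stage consumes the previous stage's artifact (recordings, a ranked demo set, a trained model), which is what lets the same captured demos power both retrieval-conditioned prompting and fine-tuning.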
@@ -218,25 +220,31 @@ Why: Clear 3-step process, action-oriented
 
 ### 3.3 Key Differentiators to Emphasize
 
-1. **Demonstration-Based Learning**
-   - Not: "Use natural language to describe tasks"
-   - But: "Just do the task and OpenAdapt learns from watching"
-   - Proof: 33% -> 100% first-action accuracy with demo conditioning
+1. **Capture-to-Deployment Pipeline**
+   - Not: "Prompt the AI to do your task"
+   - But: "Record the task once. OpenAdapt handles the rest — retrieval, training, deployment."
+   - Proof: 7 modular packages (capture, ML, evals, grounding, retrieval, privacy, viewer); OpenCUA (NeurIPS 2025) built on OpenAdapt's capture infrastructure
 
-2. **Model Agnostic**
+2. **Demonstration-Conditioned Agents**
+   - Not: "Zero-shot reasoning about what to click"
+   - But: "Agents conditioned on relevant demos — at inference AND during training"
+   - Proof: 46.7% → 100% first-action accuracy with demo conditioning (validated, n=45). No other open-source CUA framework does runtime demo conditioning.
+   - Note: This is first-action accuracy on tasks sharing a navigation entry point. Multi-step and cross-domain evaluation is ongoing on Windows Agent Arena.
+
+3. **Specialization Over Scale**
+   - Not: "Use the biggest model available"
+   - But: "A 2B model fine-tuned on your workflows outperforms frontier models"
+   - Proof: Qwen3-VL-2B (42.9%) vs Claude Sonnet 4.5 (11.2%) on action accuracy (internal benchmark, synthetic login task)
+
+4. **Model Agnostic**
    - Not: "Works with [specific AI]"
    - But: "Your choice: Claude, GPT-4V, Gemini, Qwen, or custom models"
    - Proof: Adapters for multiple VLM backends
 
-3. **Runs Anywhere**
-   - Not: "Cloud-powered automation"
-   - But: "Run locally, in the cloud, or hybrid"
-   - Proof: CLI-based, works offline
-
-4. **Open Source**
-   - Not: "Try our free tier"
-   - But: "MIT licensed, fully transparent, community-driven"
-   - Proof: GitHub, PyPI, active Discord
+5. **Runs Anywhere & Open Source**
+   - Not: "Cloud-powered automation" / "Try our free tier"
+   - But: "Run locally, in the cloud, or hybrid. MIT licensed, fully transparent."
+   - Proof: CLI-based, works offline; GitHub, PyPI, active Discord
 
 ### 3.4 Messaging Framework
 
@@ -256,14 +264,17 @@ Why: Clear 3-step process, action-oriented
 
 ## 4. Competitive Positioning
 
-### 4.1 Primary Competitors
+### 4.1 Primary Competitors (Updated March 2026)
 
 | Competitor | Strengths | Weaknesses | Our Advantage |
 |------------|-----------|------------|---------------|
-| **Anthropic Computer Use** | First-mover, Claude integration, simple API | Proprietary, cloud-only, no customization | Open source, model-agnostic, trainable |
-| **UI-TARS (ByteDance)** | Strong benchmark scores, research backing | Closed source, not productized | Open source, deployable, extensible |
-| **Traditional RPA (UiPath, etc.)** | Enterprise-proven, large ecosystems | Brittle selectors, no AI reasoning, expensive | AI-first, learns from demos, affordable |
-| **GPT-4V + Custom Code** | Powerful model, flexibility | Requires building everything, no structure | Ready-made SDK, training pipeline, benchmarks |
+| **Anthropic Computer Use** | 72.5% OSWorld (near-human), simple API | Proprietary, cloud-only, no customization, per-action cost | Open source, model-agnostic, trainable, runs locally |
+| **Agent S3 (Simular)** | 72.6% OSWorld (superhuman), open source | Zero-shot only, no demo conditioning, no fine-tuning pipeline | Demo-conditioned agents, capture-to-train pipeline |
+| **OpenCUA (XLANG Lab)** | NeurIPS Spotlight, 45% OSWorld, open models (7B-72B) | Zero-shot at inference — demos used only for training, not runtime | Runtime demo conditioning (unique), OpenCUA built on our capture tool |
+| **Browser Use** | 50k+ GitHub stars, 89% WebVoyager | Browser-only, no desktop, no training pipeline | Full desktop support, fine-tuning, demo library |
+| **UI-TARS (ByteDance)** | Local models (2B-72B), Apache 2.0 | No demo conditioning, no capture pipeline | End-to-end record→train→deploy, demo retrieval |
+| **CUA / Bytebot** | Container infra, YC-backed | Infrastructure-only, no ML training pipeline | Full pipeline: capture + train + eval + deploy |
+| **Traditional RPA (UiPath, etc.)** | Enterprise-proven, UiPath Screen Agent #1 on OSWorld | Brittle selectors, expensive ($10K+/yr), requires scripting | AI-first, learns from demos, open source |
 
 ### 4.2 Positioning Statement
 
@@ -352,21 +363,19 @@ Show it once. Let it handle the rest.
 ```
 ## Why OpenAdapt?
 
-### Demonstration-Based Learning
-No prompt engineering required. OpenAdapt learns from how you actually do tasks.
-[Stat: 33% -> 100% first-action accuracy with demo conditioning]
+### Record Once, Automate Forever
+Capture any workflow. OpenAdapt retrieves relevant demos to guide agents
+AND trains specialized models on your recordings.
+[Stat: 46.7% → 100% first-action accuracy with demo conditioning]
 
-### Model Agnostic
-Your choice of AI: Claude, GPT-4V, Gemini, Qwen-VL, or fine-tune your own.
-Not locked to any single provider.
+### Small Models, Big Results
+A 2B model fine-tuned on your workflows outperforms frontier models.
+Specialization beats scale for GUI tasks.
+[Stat: 42.9% action accuracy (Qwen 2B FT) vs 11.2% (Claude Sonnet 4.5)]
 
-### Run Anywhere
-CLI-based, works offline. Deploy locally, in the cloud, or hybrid.
-Your data stays where you want it.
-
-### Fully Open Source
-MIT licensed. Transparent, auditable, community-driven.
-No vendor lock-in, ever.
+### Model Agnostic & Open Source
+Your choice of AI: Claude, GPT-4V, Gemini, Qwen-VL, or fine-tune your own.
+MIT licensed. Run locally, in the cloud, or hybrid.
 ```
 
 ### 5.5 For Developers Section
@@ -486,13 +495,22 @@ Example: Onboarding guides for complex internal tools.
 
 ### 6.3 Proof Points to Include
 
-- "33% -> 100% first-action accuracy with demonstration conditioning"
+- "46.7% → 100% first-action accuracy with demo conditioning (n=45, same model, no training)"
+- "Fine-tuned 2B model outperforms Claude Sonnet 4.5 on action accuracy (42.9% vs 11.2%, internal benchmark)"
+- "OpenCUA (NeurIPS 2025 Spotlight) built their capture tool on OpenAdapt"
+- "Only open-source CUA framework with runtime demo-conditioned inference"
 - "[X,XXX] PyPI downloads this month" (dynamic)
 - "[XXX] GitHub stars" (dynamic)
 - "7 modular packages, 1 unified CLI"
 - "Integrated with Windows Agent Arena, WebArena, OSWorld benchmarks"
 - "MIT licensed, fully open source"
 
+**Honesty notes for proof points**:
+- The 46.7%→100% result is first-action only on 45 macOS tasks sharing the same navigation entry point
+- The 42.9% vs 11.2% result is on a controlled internal synthetic login benchmark (~3 UI elements)
+- Multi-step episode success on real-world benchmarks (WAA) is under active evaluation
+- Frame these as "validated signal" not "production-proven"
+
 ---
 
 ## 7. Wireframe Concepts
