
Commit d4b7ca0

abrichr and claude authored
Add Section 11: Path to Main Track Publication (Parallel Track) (#975)
This section provides a rigorous and honest assessment of what would be required to elevate the current work from workshop-level to main track publication at venues like NeurIPS, ICML, or ICLR. Key additions:

- 11.1: Honest assessment of why current work is workshop-level (prompt engineering, not ML research) with table of reviewer concerns
- 11.2: Four technical contribution options to elevate the work:
  - Option A: Learned Demo Retrieval (RECOMMENDED, 2-3 months)
  - Option B: Learned Prompt Synthesis (3-4 months)
  - Option C: Behavioral Cloning with Demo-Augmentation (4-6 months)
  - Option D: Theoretical Analysis (2-3 months)
- 11.3: Additional experiments required (WAA 50+ tasks, WebArena 100+, multi-model, ablations, statistical significance)
- 11.4: Timeline and resource estimates (6-7 months minimum, 1-2 FTE, $5-10k compute/API costs)
- 11.5: Honest recommendation based on team resources
- 11.6: Additional references (REALM, Atlas, DocPrompting, APE, DSPy, CogAgent, SeeClick, RT-2)

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent 37170ee commit d4b7ca0

1 file changed

Lines changed: 231 additions & 0 deletions

File tree

docs/publication-roadmap.md

@@ -27,6 +27,7 @@ This document is written from the perspective of a skeptical reviewer at a top v
8. [Realistic Timeline](#8-realistic-timeline)
9. [Risk Mitigation](#9-risk-mitigation)
10. [Action Items](#10-action-items)
11. [Path to Main Track Publication (Parallel Track)](#11-path-to-main-track-publication-parallel-track)

---

@@ -442,6 +443,236 @@ Based on related work, likely reviewers include researchers from:

---

## 11. Path to Main Track Publication (Parallel Track)

This section provides a rigorous assessment of what would be required to publish in a main track venue (NeurIPS, ICML, ICLR) rather than a workshop. This is a parallel track that requires substantially more investment.

### 11.1 Honest Assessment: Why Current Work is Workshop-Level

Our current contribution is fundamentally **prompt engineering**, not machine learning research. While valuable for practitioners, this positions us poorly for ML venues that expect learned components, theoretical insights, or architectural innovations.

**Table: Anticipated Reviewer Concerns for Main Track Submission**

| Concern | Severity | Our Current Status | What Main Track Requires |
|---------|----------|-------------------|--------------------------|
| No learned component | **Critical** | True - retrieval uses heuristic similarity | Train retrieval end-to-end for downstream task |
| Single demo format | **High** | True - behavior-only format hardcoded | Learn optimal format/compression |
| Heuristic retrieval (BM25/embedding) | **High** | True - not optimized for action accuracy | Retrieval that optimizes task success, not similarity |
| Limited evaluation | **High** | 45 tasks, 1 model, 1 platform | 200+ tasks, 3+ models, 2+ benchmarks |
| No comparison to fine-tuning | **High** | True | Show when prompting beats/complements fine-tuning |
| No theoretical analysis | **Medium** | True - purely empirical | Information-theoretic or PAC-learning analysis |
| Engineering focus | **Medium** | True - system building, not research | Clear algorithmic or theoretical contribution |
| No ablation of demo components | **Medium** | Partial | Systematic ablation with significance tests |

**Bottom line**: A main track reviewer at NeurIPS/ICML will likely say: "This is a well-executed engineering project with an empirical evaluation, but where is the research contribution? Adding demos to prompts is not novel."

### 11.2 Required Technical Contributions (Options to Elevate)

To elevate from workshop to main track, we need at least ONE of the following technical contributions:

#### Option A: Learned Demo Retrieval (RECOMMENDED)

**Effort**: 2-3 months | **Risk**: Medium | **Novelty**: High

**Core idea**: Train the retrieval system to optimize action accuracy, not semantic similarity.

**Why this works**: Current retrieval uses off-the-shelf embeddings (CLIP, text similarity) that optimize for semantic match. But the best demo for a task may not be the most semantically similar; it may be one that provides the right procedural template or spatial priors.

**Technical approach** (a minimal sketch follows the list):
1. Collect retrieval training data: (query, demo, action_accuracy) tuples
2. Train retrieval scorer to predict action accuracy given (query, demo) pair
3. Use contrastive learning: demos that help should score higher than demos that don't
4. Evaluate: Does learned retrieval outperform heuristic retrieval?

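A minimal sketch of steps 2-3, assuming queries and demos are already embedded with an off-the-shelf encoder and that per-demo action accuracy has been logged; the `DemoScorer` module, its dimensions, and the helper names are illustrative assumptions, not existing code:

```python
# Sketch only: learned retrieval scorer trained so that demos which improved
# action accuracy outrank demos which did not (Option A, steps 2-3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DemoScorer(nn.Module):
    """Scores a (query, demo) embedding pair; higher = more useful demo."""
    def __init__(self, embed_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, query_emb: torch.Tensor, demo_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([query_emb, demo_emb], dim=-1)).squeeze(-1)

def contrastive_step(scorer, optimizer, query_emb, pos_demo_emb, neg_demo_emb, margin: float = 0.2):
    """One update: a demo that helped (pos) should outscore one that did not (neg)."""
    pos_score = scorer(query_emb, pos_demo_emb)
    neg_score = scorer(query_emb, neg_demo_emb)
    loss = F.relu(margin - (pos_score - neg_score)).mean()  # pairwise margin ranking loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the scorer would simply re-rank the top-k candidates from the existing heuristic retriever, so it slots in without changing the rest of the pipeline.
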
**Key experiments**:
- Retrieval recall@k vs action accuracy correlation
- Learned vs heuristic retrieval on held-out tasks
- Analysis of what the model learns (which demo features matter?)

**Related work to cite**:
- REALM (Guu et al., 2020) - Retrieval-augmented language model pretraining
- Atlas (Izacard et al., 2022) - Few-shot learning with retrieval
- DocPrompting (Zhou et al., 2022) - Retrieve docs for code generation

**Why reviewers would accept**: "First demonstration that learned retrieval improves demo-conditioned GUI agents, with analysis of what retrieval features matter."

#### Option B: Learned Prompt Synthesis

**Effort**: 3-4 months | **Risk**: Medium-High | **Novelty**: High

**Core idea**: Learn to synthesize optimal demo prompts rather than using fixed templates.

**Technical approach** (a minimal sketch follows the list):
1. Define prompt template space (what to include, how to format, compression level)
2. Use LLM-in-the-loop optimization (APE-style) to find optimal templates
3. Alternatively, train a small model to select/compress demo content
4. Evaluate: Does learned synthesis outperform hand-crafted templates?

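A minimal sketch of steps 1-2 under simplifying assumptions: the template space is a small discrete grid, and candidates are scored by dev-set accuracy rather than proposed by an LLM; `render_demo` and `eval_accuracy` are hypothetical hooks into the agent harness, not existing functions:

```python
# Sketch only: score candidate demo-prompt templates on a dev set (Option B, steps 1-2).
# render_demo() and eval_accuracy() are hypothetical hooks; an APE-style LLM proposer
# could replace the exhaustive product() enumeration below.
from itertools import product

TEMPLATE_SPACE = {
    "content": ["full_trace", "behavior_only", "actions_only"],
    "format": ["markdown", "xml"],
    "compression": [1.0, 0.5, 0.25],  # fraction of demo steps retained
}

def search_templates(dev_tasks, render_demo, eval_accuracy):
    """Return the template configuration with the best dev-set accuracy."""
    best_cfg, best_acc = None, -1.0
    for content, fmt, keep in product(*TEMPLATE_SPACE.values()):
        cfg = {"content": content, "format": fmt, "compression": keep}
        acc = eval_accuracy(dev_tasks, lambda demo, cfg=cfg: render_demo(demo, **cfg))
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```

Step 3's learned selector would replace the fixed `compression` setting with a small model that decides per-demo what to keep.
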
**Key experiments**:
- Template ablation with learned selection
- Compression ratio vs accuracy tradeoff
- Cross-task transfer of learned templates

**Related work to cite**:
- APE (Zhou et al., 2022) - Automatic prompt engineering
- DSPy (Khattab et al., 2023) - Programmatic prompt optimization
- PromptBreeder (Fernando et al., 2023) - Self-referential prompt evolution

**Why reviewers would accept**: "Novel prompt synthesis method that learns to format demonstrations for maximal downstream utility."

#### Option C: Behavioral Cloning with Demo-Augmentation

**Effort**: 4-6 months | **Risk**: High | **Novelty**: Very High

**Core idea**: Fine-tune a VLM using demonstration-augmented behavioral cloning.

**Technical approach** (a minimal sketch follows the list):
1. Collect behavioral cloning dataset: (screenshot, task, action) tuples
2. Augment each example with retrieved demonstration context
3. Fine-tune VLM with demo in context vs without
4. Compare: Does demo-augmented fine-tuning outperform standard fine-tuning?

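A minimal sketch of steps 1-3, showing only how a behavioral-cloning example might be packed with or without retrieved demo context before being handed to a standard supervised fine-tuning loop; the field names and the `retrieve_demo` helper are hypothetical:

```python
# Sketch only: build demo-augmented vs plain behavioral-cloning examples (Option C, steps 1-3).
# Field names and retrieve_demo() are hypothetical; the actual trainer is a standard
# supervised fine-tuning loop over (image, prompt, target_action) records.
from dataclasses import dataclass

@dataclass
class BCExample:
    screenshot_path: str   # input observation
    task: str              # natural-language instruction
    action: str            # target action, e.g. 'click(x=412, y=88)'

def build_example(ex: BCExample, retrieve_demo=None) -> dict:
    """Return a training record; include retrieved demo text when available."""
    prompt = f"Task: {ex.task}\n"
    if retrieve_demo is not None:
        demo_text = retrieve_demo(ex.task)  # e.g. top-1 demo from the Option A scorer
        prompt = f"Demonstration:\n{demo_text}\n\n" + prompt
    return {
        "image": ex.screenshot_path,
        "prompt": prompt + "Next action:",
        "target": ex.action,
    }
```

Because the control condition simply passes `retrieve_demo=None`, both arms share the same data, trainer, and hyperparameters, which keeps the comparison in step 4 clean.
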
**Key experiments**:
- Fine-tuning with/without demo augmentation
- Sample efficiency: Do demos reduce required training data?
- Analysis of attention patterns: Does the model attend to demos?

**Related work to cite**:
- CogAgent (Hong et al., 2023) - GUI agent fine-tuning
- SeeClick (Cheng et al., 2024) - Visual grounding for GUI
- RT-2 (Brohan et al., 2023) - Vision-language-action models

**Why reviewers would accept**: "First demonstration that demo-augmentation improves fine-tuned GUI agents, with analysis of when prompting vs fine-tuning is preferred."

**Caveat**: This requires significant compute ($2-5k GPU, 4-6 weeks training) and expertise in VLM fine-tuning.

#### Option D: Theoretical Analysis

**Effort**: 2-3 months | **Risk**: High | **Novelty**: Medium

**Core idea**: Provide theoretical analysis of why demonstrations help GUI agents.

**Technical approach** (a formalization sketch follows the list):
1. Information-theoretic analysis: How much information do demos provide?
2. PAC-learning analysis: Sample complexity with/without demos
3. Formal model of GUI task space and demo utility

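One way to make step 1 concrete, stated as a framing assumption rather than an established result: treat the retrieved demo as side information and measure its utility as the reduction in uncertainty about the correct action.

```latex
% Sketch only: an information-theoretic framing of demo utility (Option D, step 1).
% A* = correct action, S = screenshot, T = task instruction, D = retrieved demo.
\[
  \mathrm{Gain}(D) \;=\; I(A^{*}; D \mid S, T)
  \;=\; H(A^{*} \mid S, T) \;-\; H(A^{*} \mid S, T, D)
\]
% The true mutual information is non-negative, so "demos hurt" cases must come from
% the agent's approximate posterior rather than from the information content of D;
% characterizing that gap is where the help-vs-hurt analysis would live.
```
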
**Key contributions**:
- Theoretical bound on demo utility
- Characterization of when demos help vs hurt
- Connection to few-shot learning theory

**Related work to cite**:
- Brown et al. (2020) - GPT-3 few-shot capabilities
- Xie et al. (2021) - Why in-context learning works
- Min et al. (2022) - Rethinking demonstration role

**Why reviewers would accept**: "Theoretical understanding of demonstration utility for GUI agents, with empirical validation."

**Caveat**: Requires theoretical ML expertise; risk of disconnect between theory and practice.

### 11.3 Additional Experiments Required

Beyond the technical contribution, main track requires substantially more empirical evidence:

**Benchmark Coverage**:

| Benchmark | Tasks Required | Current Status | Effort |
|-----------|---------------|----------------|--------|
| Windows Agent Arena (WAA) | 50+ tasks | 8 tasks (incomplete) | 3-4 weeks |
| WebArena | 100+ tasks | 0 tasks | 4-6 weeks |
| OSWorld (optional) | 50+ tasks | 0 tasks | 4-6 weeks |

**Evaluation Metrics** (a small computation sketch follows the list):
- **First-action accuracy**: Already measured, but on non-standard tasks
- **Episode success rate**: Not measured - REQUIRED for main track
- **Step efficiency**: Actions per successful task
- **Grounding accuracy**: Correct element identification rate

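A small sketch, under an assumed log structure, of how these episode-level metrics could be computed; the `Episode` record and its fields are hypothetical:

```python
# Sketch only: episode success rate, step efficiency, and grounding accuracy
# from run logs. The Episode record and its fields are hypothetical.
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool            # task verifier outcome
    num_actions: int         # actions executed in the episode
    grounded_correct: int    # actions that hit the intended element
    grounded_total: int      # actions with a ground-truth element label

def summarize(episodes):
    succ = [e for e in episodes if e.success]
    return {
        "episode_success_rate": len(succ) / len(episodes),
        "step_efficiency": (sum(e.num_actions for e in succ) / len(succ)) if succ else None,
        "grounding_accuracy": sum(e.grounded_correct for e in episodes)
                              / max(1, sum(e.grounded_total for e in episodes)),
    }
```
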
**Multi-Model Comparison**:

| Model | Priority | Status |
|-------|----------|--------|
| Claude Sonnet 4.5 | Required | Tested |
| GPT-4V | Required | Not tested |
| Gemini 1.5 Pro | Required | Not tested |
| Qwen-VL | Nice to have | Not tested |
| Open-source (LLaVA) | Nice to have | Not tested |

**Ablation Studies** (an enumeration sketch follows the list):
1. Demo format: full trace vs behavior-only vs action-only
2. Number of demos: k=1, 3, 5, 10
3. Demo relevance: exact match vs same-domain vs random
4. Demo recency: fresh demos vs stale demos
5. Model scale: Does demo benefit scale with model size?

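A small sketch of how the demo-related factors above could be enumerated systematically, varying one factor at a time around an assumed default configuration; the `DEFAULTS` values are illustrative and model scale would be a separate sweep:

```python
# Sketch only: enumerate ablation conditions by varying one factor at a time
# around an assumed default configuration (the defaults are illustrative).
ABLATIONS = {
    "demo_format": ["full_trace", "behavior_only", "action_only"],
    "num_demos": [1, 3, 5, 10],
    "demo_relevance": ["exact", "same_domain", "random"],
    "demo_recency": ["fresh", "stale"],
}
DEFAULTS = {"demo_format": "behavior_only", "num_demos": 3,
            "demo_relevance": "exact", "demo_recency": "fresh"}

def ablation_conditions():
    """Yield one condition per (factor, value), holding the other factors at DEFAULTS."""
    for factor, values in ABLATIONS.items():
        for value in values:
            yield {**DEFAULTS, factor: value}
```
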
**Statistical Requirements** (see the snippet below for the paired test and effect size):
- 3+ seeds per experiment for variance estimation
- 95% confidence intervals on all metrics
- Statistical significance tests (McNemar's, permutation tests)
- Effect sizes (Cohen's h, odds ratios)

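A small sketch of the paired significance test and effect size named above, assuming per-task success/failure outcomes for the baseline and the demo-conditioned agent on the same task list:

```python
# Sketch only: exact McNemar test and Cohen's h for paired per-task outcomes.
# Assumes outcome lists are aligned: entry i of each list refers to the same task.
import math
from scipy.stats import binomtest

def mcnemar_exact(baseline: list[bool], treated: list[bool]) -> float:
    """Exact McNemar p-value computed from the discordant pairs."""
    b = sum(1 for x, y in zip(baseline, treated) if x and not y)  # baseline-only successes
    c = sum(1 for x, y in zip(baseline, treated) if y and not x)  # treated-only successes
    if b + c == 0:
        return 1.0
    return binomtest(min(b, c), b + c, 0.5, alternative="two-sided").pvalue

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions (Cohen's h)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```
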
### 11.4 Timeline and Resources

**Minimum timeline for main track submission**:

| Phase | Duration | Activities |
|-------|----------|------------|
| **Phase 1**: Technical contribution | 2-4 months | Implement learned retrieval or prompt synthesis |
| **Phase 2**: Large-scale evaluation | 2-3 months | WAA (50+), WebArena (100+), multi-model |
| **Phase 3**: Analysis & writing | 1-2 months | Ablations, significance tests, paper writing |
| **Total** | **6-9 months** | From start to submission-ready |

**Resource requirements**:

| Resource | Estimate | Notes |
|----------|----------|-------|
| Dedicated researchers | 1-2 FTE | Cannot be done part-time |
| GPU compute | $2-5k | For fine-tuning experiments (Option C) |
| API credits | $1-3k | Multi-model evaluation at scale |
| Azure VM (WAA) | $200-500 | Extended evaluation runs |
| Human annotation | $500-1k | Demo quality labels, retrieval training data |

**Total estimated cost**: $5-10k (excluding researcher time)

### 11.5 Honest Recommendation

**For a small team with limited resources**:
- **Focus on workshop paper**. The workshop contribution is solid and achievable.
- Do NOT attempt main track unless you can dedicate 1-2 researchers full-time for 6+ months.
- A rejected main track submission wastes 6-9 months and demoralizes the team.

**For a team with dedicated resources**:
- **Pursue Option A (Learned Retrieval)** as the most tractable path to main track.
- This adds a clear learned component while building on existing infrastructure.
- Expected timeline: 6-7 months to submission-ready.
- Honest acceptance probability: 25-35% at NeurIPS/ICML (still challenging).

**Do NOT attempt main track if**:
- You cannot dedicate 1-2 researchers full-time to this project
- You do not have ML research expertise (vs engineering expertise)
- You need a publication in < 6 months
- You are not prepared for likely rejection and iteration

**The workshop path is not a consolation prize**. Top workshops at NeurIPS/ICML have excellent visibility, lead to valuable feedback, and establish priority for your ideas. Many impactful papers started as workshop papers.

### 11.6 Additional References for Main Track

**Retrieval-Augmented Learning**:
- Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M. W. (2020). REALM: Retrieval-augmented language model pre-training. *ICML 2020*.
- Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., ... & Grave, E. (2022). Atlas: Few-shot learning with retrieval augmented language models. *arXiv preprint arXiv:2208.03299*.
- Zhou, S., Alon, U., Xu, F. F., Wang, Z., Jiang, Z., & Neubig, G. (2022). DocPrompting: Generating code by retrieving the docs. *ICLR 2023*.

**Automatic Prompt Engineering**:
- Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. (2022). Large language models are human-level prompt engineers. *ICLR 2023*.
- Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C., & Zaharia, M. (2023). DSPy: Compiling declarative language model calls into self-improving pipelines. *arXiv preprint arXiv:2310.03714*.
- Fernando, C., Banarse, D., Michalewski, H., Osindero, S., & Rocktäschel, T. (2023). PromptBreeder: Self-referential self-improvement via prompt evolution. *arXiv preprint arXiv:2309.16797*.

**GUI Agent Fine-Tuning**:
- Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., ... & Tang, J. (2023). CogAgent: A visual language model for GUI agents. *arXiv preprint arXiv:2312.08914*.
- Cheng, K., Sun, Q., Chu, Y., Xu, F., Li, Y., Zhang, J., & Wu, Z. (2024). SeeClick: Harnessing GUI grounding for advanced visual GUI agents. *arXiv preprint arXiv:2401.10935*.
- Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., ... & Zitkovich, B. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control. *arXiv preprint arXiv:2307.15818*.

---
## Appendix A: Honest Framing for Paper

### Abstract Template
