Add Section 11: Path to Main Track Publication (Parallel Track) (#975)
This section provides a rigorous and honest assessment of what would be
required to elevate the current work from workshop-level to main track
publication at venues like NeurIPS, ICML, or ICLR.
Key additions:
- 11.1: Honest assessment of why current work is workshop-level (prompt
engineering, not ML research) with table of reviewer concerns
- 11.2: Four technical contribution options to elevate the work:
- Option A: Learned Demo Retrieval (RECOMMENDED, 2-3 months)
- Option B: Learned Prompt Synthesis (3-4 months)
- Option C: Behavioral Cloning with Demo-Augmentation (4-6 months)
- Option D: Theoretical Analysis (2-3 months)
- 11.3: Additional experiments required (WAA 50+ tasks, WebArena 100+,
multi-model, ablations, statistical significance)
- 11.4: Timeline and resource estimates (6-7 months minimum, 1-2 FTE,
$5-10k compute/API costs)
- 11.5: Honest recommendation based on team resources
- 11.6: Additional references (REALM, Atlas, DocPrompting, APE, DSPy,
CogAgent, SeeClick, RT-2)
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
docs/publication-roadmap.md (+231 lines)
8. [Realistic Timeline](#8-realistic-timeline)
9. [Risk Mitigation](#9-risk-mitigation)
10. [Action Items](#10-action-items)
11. [Path to Main Track Publication (Parallel Track)](#11-path-to-main-track-publication-parallel-track)
---
## 11. Path to Main Track Publication (Parallel Track)

This section provides a rigorous assessment of what would be required to publish in a main track venue (NeurIPS, ICML, ICLR) rather than a workshop. This is a parallel track that requires substantially more investment.
### 11.1 Honest Assessment: Why Current Work is Workshop-Level

Our current contribution is fundamentally **prompt engineering**, not machine learning research. While valuable for practitioners, this positions us poorly for ML venues that expect learned components, theoretical insights, or architectural innovations.
**Table: Anticipated Reviewer Concerns for Main Track Submission**

| Concern | Severity | Our Current Status | What Main Track Requires |
| --- | --- | --- | --- |
| No comparison to fine-tuning | **High** | True | Show when prompting beats/complements fine-tuning |
| No theoretical analysis | **Medium** | True - purely empirical | Information-theoretic or PAC-learning analysis |
| Engineering focus | **Medium** | True - system building, not research | Clear algorithmic or theoretical contribution |
| No ablation of demo components | **Medium** | Partial | Systematic ablation with significance tests |
**Bottom line**: A main track reviewer at NeurIPS/ICML will likely say: "This is a well-executed engineering project with an empirical evaluation, but where is the research contribution? Adding demos to prompts is not novel."
### 11.2 Required Technical Contributions (Options to Elevate)

To elevate from workshop to main track, we need at least ONE of the following technical contributions:

#### Option A: Learned Demo Retrieval (RECOMMENDED)

**Effort**: 2-3 months | **Risk**: Medium | **Novelty**: High
**Core idea**: Train the retrieval system to optimize action accuracy, not semantic similarity.

**Why this works**: Current retrieval uses off-the-shelf embeddings (CLIP, text similarity) that optimize for semantic match. But the best demo for a task may not be the most semantically similar - it may be one that provides the right procedural template or spatial priors.
**Technical approach**:
1. Collect retrieval training data: (query, demo, action_accuracy) tuples
2. Train a retrieval scorer to predict action accuracy for a given (query, demo) pair
3. Use contrastive learning: demos that help should score higher than demos that don't (a minimal sketch follows this list)
4. Evaluate: Does learned retrieval outperform heuristic retrieval?
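A minimal sketch of steps 2-3, assuming PyTorch and precomputed query/demo embeddings. `DemoScorer`, the triple format, and the margin value are illustrative assumptions, not existing components of our system:

```python
import torch
import torch.nn as nn

class DemoScorer(nn.Module):
    """Scores a (query, demo) pair; trained so that demos which improved
    downstream action accuracy outscore demos which did not."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, query_emb: torch.Tensor, demo_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the precomputed embeddings and score the pair.
        return self.mlp(torch.cat([query_emb, demo_emb], dim=-1)).squeeze(-1)

def contrastive_step(scorer, optimizer, query, pos_demo, neg_demo, margin=0.5):
    """One training step: a demo that helped (pos) should outscore a demo
    that did not (neg) by at least `margin` (hinge ranking loss)."""
    pos = scorer(query, pos_demo)
    neg = scorer(query, neg_demo)
    loss = torch.clamp(margin - (pos - neg), min=0).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The (pos, neg) triples would come directly from step 1's tuples: demos whose inclusion raised action accuracy on a task serve as positives for that task's query, the rest as negatives.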
**Key experiments**:
- Retrieval recall@k vs action accuracy correlation (see the sketch after this list)
- Learned vs heuristic retrieval on held-out tasks
- Analysis of what the model learns (which demo features matter?)
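A sketch of the first experiment, assuming per-task logs that pair the retrieval score of the demo used with the resulting action accuracy (the values below are placeholders, not measurements):

```python
from scipy.stats import spearmanr

# Hypothetical per-task logs: similarity score of the retrieved demo and
# the action accuracy the agent achieved on that task.
retrieval_scores = [0.91, 0.84, 0.77, 0.69, 0.55]
action_accuracy  = [0.80, 0.85, 0.60, 0.40, 0.45]

rho, p = spearmanr(retrieval_scores, action_accuracy)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
# A weak correlation would support the claim that semantic similarity is a
# poor proxy for demo usefulness, motivating learned retrieval.
```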
**Related work to cite**:
- REALM (Guu et al., 2020) - Retrieval-augmented language model pretraining
- Atlas (Izacard et al., 2022) - Few-shot learning with retrieval
- DocPrompting (Zhou et al., 2022) - Retrieve docs for code generation

**Why reviewers would accept**: "First demonstration that learned retrieval improves demo-conditioned GUI agents, with analysis of what retrieval features matter."
#### Option B: Learned Prompt Synthesis

**Effort**: 3-4 months | **Risk**: Medium-High | **Novelty**: High

**Core idea**: Learn to synthesize optimal demo prompts rather than using fixed templates.
**Technical approach**:
1. Define the prompt template space (what to include, how to format, compression level)
2. Use LLM-in-the-loop optimization (APE-style) to find optimal templates (a simplified search loop is sketched after this list)
3. Alternatively, train a small model to select/compress demo content
4. Evaluate: Does learned synthesis outperform hand-crafted templates?
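A sketch of steps 1-2 under stated assumptions: the template dimensions are invented for illustration, `evaluate_template` stands in for a real agent run on a dev split, and the loop is a plain scored search (APE additionally uses an LLM to propose candidates):

```python
import itertools
import random

# Hypothetical template space (step 1): what demo content to include and how.
TEMPLATE_SPACE = {
    "include_screenshots": [True, False],
    "action_format": ["natural_language", "structured_json"],
    "compression": ["full_trace", "key_steps_only"],
}

def evaluate_template(template: dict, dev_tasks: list) -> float:
    """Placeholder: render demos with `template`, run the agent on the dev
    split, return mean action accuracy. Real implementation not shown."""
    return random.random()  # stand-in for an actual evaluation run

def search_templates(dev_tasks: list, budget: int = 8):
    """Score candidate templates on a dev split and keep the best (step 2)."""
    candidates = [dict(zip(TEMPLATE_SPACE, vals))
                  for vals in itertools.product(*TEMPLATE_SPACE.values())]
    scored = [(evaluate_template(t, dev_tasks), t)
              for t in random.sample(candidates, min(budget, len(candidates)))]
    return max(scored, key=lambda x: x[0])
```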
**Key experiments**:
- Template ablation with learned selection
- Compression ratio vs accuracy tradeoff
- Cross-task transfer of learned templates

**Related work to cite**:
- APE (Zhou et al., 2022) - Automatic prompt engineering
- DSPy (Khattab et al., 2023) - Programmatic prompt optimization
- PromptBreeder (Fernando et al., 2023) - Self-referential prompt evolution

**Why reviewers would accept**: "Novel prompt synthesis method that learns to format demonstrations for maximal downstream utility."
#### Option C: Behavioral Cloning with Demo-Augmentation

**Effort**: 4-6 months | **Risk**: High | **Novelty**: Very High

**Core idea**: Fine-tune a VLM using demonstration-augmented behavioral cloning.

**Technical approach**:
1. Collect behavioral cloning data: (observation, action) pairs from task executions
2. Augment each example with retrieved demonstration context (a sketch of this step follows the list)
3. Fine-tune the VLM with the demo in context vs without
4. Compare: Does demo-augmented fine-tuning outperform standard fine-tuning?
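A minimal sketch of step 2, showing how a fine-tuning record might be augmented with retrieved demo context. The `Example` fields, prompt wording, and `retrieve` callable are hypothetical, not part of an existing pipeline:

```python
from dataclasses import dataclass

@dataclass
class Example:
    screenshot: bytes    # current GUI observation
    instruction: str     # task instruction
    target_action: str   # ground-truth next action

def augment_with_demo(example: Example, retrieve) -> dict:
    """Build one demo-augmented training record. `retrieve` is any function
    mapping an instruction to a formatted demonstration string."""
    demo_text = retrieve(example.instruction)
    prompt = (
        f"Demonstration of a related task:\n{demo_text}\n\n"
        f"Current task: {example.instruction}\nNext action:"
    )
    # The ablation in step 3 trains on the same records without `demo_text`.
    return {"image": example.screenshot, "prompt": prompt,
            "label": example.target_action}
```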
**Key experiments**:
- Fine-tuning with/without demo augmentation
- Sample efficiency: Do demos reduce required training data?
- Analysis of attention patterns: Does the model attend to demos?

**Related work to cite**:
- CogAgent (Hong et al., 2023) - GUI agent fine-tuning
- SeeClick (Cheng et al., 2024) - Visual grounding for GUI
- RT-2 (Brohan et al., 2023) - Vision-language-action models

**Why reviewers would accept**: "First demonstration that demo-augmentation improves fine-tuned GUI agents, with analysis of when prompting vs fine-tuning is preferred."

**Caveat**: This requires significant compute ($2-5k GPU, 4-6 weeks training) and expertise in VLM fine-tuning.
#### Option D: Theoretical Analysis

**Effort**: 2-3 months | **Risk**: High | **Novelty**: Medium

**Core idea**: Provide theoretical analysis of why demonstrations help GUI agents.
**Technical approach**:
1. Information-theoretic analysis: How much information do demos provide? (one candidate formalization follows)
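One way step 1 could be formalized, in our own notation (an assumption, not a result from any cited paper): measure the conditional mutual information between the demonstration and the correct action, given the current observation.

```latex
% Let S = current GUI state, A = correct next action, D = retrieved demo.
% The value of a demo is the uncertainty about A it removes beyond what S
% already provides:
\[
  I(A; D \mid S) \;=\; H(A \mid S) \;-\; H(A \mid S, D)
\]
% A demo helps exactly when I(A; D | S) > 0. Heuristic retrieval maximizes
% similarity(S, D), which need not maximize I(A; D | S) -- consistent with
% the motivation for Option A.
```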
### 11.5 Honest Recommendation

**For a resource-constrained team**:
- **Focus on the workshop paper**. The workshop contribution is solid and achievable.
- Do NOT attempt main track unless you can dedicate 1-2 researchers full-time for 6+ months.
- A rejected main track submission wastes 6-9 months and demoralizes the team.
**For a team with dedicated resources**:
- **Pursue Option A (Learned Retrieval)** as the most tractable path to main track.
- This adds a clear learned component while building on existing infrastructure.
- Expected timeline: 6-7 months to submission-ready.
- Honest acceptance probability: 25-35% at NeurIPS/ICML (still challenging).
**Do NOT attempt main track if**:
- You cannot dedicate 1-2 researchers full-time to this project
- You do not have ML research expertise (vs engineering expertise)
- You need a publication in < 6 months
- You are not prepared for likely rejection and iteration

**The workshop path is not a consolation prize**. Top workshops at NeurIPS/ICML have excellent visibility, lead to valuable feedback, and establish priority for your ideas. Many impactful papers started as workshop papers.
### 11.6 Additional References for Main Track

**Retrieval-Augmented Learning**:
- Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M. W. (2020). REALM: Retrieval-augmented language model pre-training. *ICML 2020*.
- Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., ... & Grave, E. (2022). Atlas: Few-shot learning with retrieval augmented language models. *arXiv preprint arXiv:2208.03299*.
- Zhou, S., Alon, U., Xu, F. F., Wang, Z., Jiang, Z., & Neubig, G. (2022). DocPrompting: Generating code by retrieving the docs. *ICLR 2023*.

**Automatic Prompt Engineering**:
- Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. (2022). Large language models are human-level prompt engineers. *ICLR 2023*.
- Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C., & Zaharia, M. (2023). DSPy: Compiling declarative language model calls into self-improving pipelines. *arXiv preprint arXiv:2310.03714*.
- Fernando, C., Banarse, D., Michalewski, H., Osindero, S., & Rocktäschel, T. (2023). PromptBreeder: Self-referential self-improvement via prompt evolution. *arXiv preprint arXiv:2309.16797*.

**GUI Agent Fine-Tuning**:
- Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., ... & Tang, J. (2023). CogAgent: A visual language model for GUI agents. *arXiv preprint arXiv:2312.08914*.
- Cheng, K., Sun, Q., Chu, Y., Xu, F., Li, Y., Zhang, J., & Wu, Z. (2024). SeeClick: Harnessing GUI grounding for advanced visual GUI agents. *arXiv preprint arXiv:2401.10935*.
- Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., ... & Zitkovich, B. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control. *arXiv preprint arXiv:2307.15818*.