Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation
Tong Wang1,2, Siwen Wang2, Yaolei Qi1, Jinxing Zhou2, Yuting He3, Guanyu Yang1, Yutong Xie2
1 Southeast University, China
2 Mohamed bin Zayed University of Artificial Intelligence, UAE
3 Case Western Reserve University, USA
Guanyu Yang and Yutong Xie are the corresponding authors. This work was done while Tong Wang was visiting MBZUAI.
- [2026/06] ARTEMIS is now available on arXiv.
- [2026/06] Pseudo-label data, prediction maps, and model snapshots are released for reproducibility.
- [2026/06] The ARTEMIS repository page has been initialized for the submission-stage manuscript.
Imperfectly supervised video polyp segmentation (VPS) aims to learn dense and temporally consistent masks from inexpensive supervision, including weak annotations such as points and scribbles, as well as semi-supervision with only a small subset of densely labeled frames. Although SAM2 can convert sparse or partial annotations into dense masks, direct pseudo labeling remains limited by geometry-degraded masks, underused temporal propagation, and reliability-blind supervision.
We propose ARTEMIS, a unified framework for imperfectly supervised VPS driven by agent-guided reliability-aware temporal mask evolution. ARTEMIS first initializes coarse masks from the available supervision, then uses a debate-and-judge vision-language agent to select reliable temporal anchors under weak supervision. These anchors are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, ARTEMIS trains the segmenter with temporal reliability-aware robust learning, which combines reliability-guided reference selection, a Reference Prototype Transport Module, and a reliability-aware robust loss. Experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings demonstrate state-of-the-art performance.
- Unified imperfect supervision. ARTEMIS handles weakly supervised and semi-supervised VPS in one complete-then-learn framework.
- Agent-guided anchor selection. A debate-and-judge vision-language agent identifies reliable temporal anchors from noisy SAM2-generated masks.
- Bidirectional temporal mask evolution. Reliable anchors are propagated forward and backward with SAM2 to complete sparse or missing annotations.
- Reliability-aware robust learning. Reliability-guided reference selection, the Reference Prototype Transport Module (RPTM), and a robust loss suppress residual pseudo-label noise while preserving difficult samples.
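How reliable anchors feed bidirectional propagation can be illustrated with a small scheduling sketch. It is not the authors' implementation: the per-frame `scores`, the `threshold`, and both function names are hypothetical stand-ins for the agent's debate-and-judge decision, and the SAM2 propagation step itself is not shown.

```python
from typing import List, Optional, Tuple


def select_anchors(scores: List[float], threshold: float = 0.8) -> List[int]:
    """Return indices of frames whose (hypothetical) reliability score passes the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def nearest_anchors(
    num_frames: int, anchors: List[int]
) -> List[Tuple[Optional[int], Optional[int]]]:
    """For each frame, find the closest anchor at or before it and at or after it.

    The first element is the source for forward propagation, the second the
    source for backward propagation; None means no anchor exists on that side.
    """
    anchor_set = set(anchors)
    prev: List[Optional[int]] = [None] * num_frames
    nxt: List[Optional[int]] = [None] * num_frames

    last: Optional[int] = None
    for t in range(num_frames):            # forward sweep
        if t in anchor_set:
            last = t
        prev[t] = last

    last = None
    for t in reversed(range(num_frames)):  # backward sweep
        if t in anchor_set:
            last = t
        nxt[t] = last

    return list(zip(prev, nxt))


# Toy clip of five frames; scores are made up for illustration.
scores = [0.9, 0.4, 0.3, 0.85, 0.5]
anchors = select_anchors(scores)
plan = nearest_anchors(len(scores), anchors)
```

In this toy clip, frames 1 and 2 sit between two anchors and can be refined from both directions, while frame 4 only has an anchor behind it and would rely on forward propagation alone.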
ARTEMIS follows a two-stage pipeline.
- Stage 1: Agent-guided bidirectional mask evolution. Available point, scribble, or sparse dense labels are converted into temporally consistent pseudo masks through reliable anchor selection and SAM2-based propagation.
- Stage 2: Temporal reliability-aware robust learning. The final segmenter is trained with reliability-guided reference selection, reference prototype transport, and reliability-aware robust supervision.
Stage 1: Reliable temporal anchors are selected and propagated bidirectionally to evolve pseudo masks.
Stage 2: Reliable reference identity is transported across frames and noisy supervision is down-weighted.
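The idea of down-weighting noisy supervision can be sketched with a generic reliability-weighted binary cross-entropy. This is an assumption for illustration only: the exact form of the reliability-aware robust loss is defined in the paper, and the function name, the `reliability` map, and the toy values below are hypothetical.

```python
import numpy as np


def reliability_weighted_bce(logits, pseudo_mask, reliability, eps=1e-7):
    """Per-pixel binary cross-entropy averaged with reliability weights in [0, 1].

    Pixels with low reliability contribute less to the loss, so a wrong
    pseudo label there has less influence on training.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    probs = np.clip(probs, eps, 1.0 - eps)
    bce = -(pseudo_mask * np.log(probs) + (1.0 - pseudo_mask) * np.log(1.0 - probs))
    return float((reliability * bce).sum() / (reliability.sum() + eps))


# Toy 2x2 example: the bottom-left pseudo label (0) contradicts a confident
# positive prediction, mimicking a noisy pseudo-label pixel.
logits = np.array([[4.0, -4.0], [4.0, -4.0]])
pseudo_mask = np.array([[1.0, 0.0], [0.0, 0.0]])

uniform = reliability_weighted_bce(logits, pseudo_mask, np.ones((2, 2)))
weighted = reliability_weighted_bce(
    logits, pseudo_mask, np.array([[1.0, 1.0], [0.0, 1.0]])
)
```

Setting the reliability of the suspicious pixel to zero removes its large penalty, so `weighted` is much smaller than `uniform`. A real system would use soft weights rather than a hard zero, so that difficult but correct samples still contribute.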
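We evaluate ARTEMIS on SUN-SEG under weakly supervised and semi-supervised settings. The two tables below report the complete SUN-SEG results of ARTEMIS across the Easy/Hard and Seen/Unseen splits: the first covers weak supervision (scribble and point annotations), and the second covers semi-supervision (1/8 and 1/16 labeled training data). Full comparisons with competing methods and ablation studies are provided in the paper.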
| Supervision | Split | Sα↑ | Eφ↑ | Fβ↑ | Dice↑ | IoU↑ | MAE↓ |
|---|---|---|---|---|---|---|---|
| Scribble | Easy-Seen | 89.7 | 92.4 | 84.1 | 85.2 | 78.3 | 3.4 |
| Scribble | Easy-Unseen | 78.6 | 79.7 | 65.9 | 66.7 | 58.7 | 4.3 |
| Scribble | Hard-Seen | 84.5 | 87.8 | 77.3 | 78.8 | 71.3 | 7.4 |
| Scribble | Hard-Unseen | 79.6 | 81.6 | 67.5 | 68.7 | 60.9 | 4.8 |
| Point | Easy-Seen | 86.3 | 88.7 | 65.4 | 81.2 | 73.2 | 6.8 |
| Point | Easy-Unseen | 77.0 | 76.9 | 52.9 | 62.8 | 53.9 | 8.0 |
| Point | Hard-Seen | 81.5 | 84.0 | 59.2 | 74.8 | 66.0 | 10.0 |
| Point | Hard-Unseen | 77.0 | 78.6 | 52.7 | 64.4 | 55.5 | 8.0 |
| Supervision | Split | Sα↑ | Eφ↑ | Fβ↑ | Dice↑ | IoU↑ | MAE↓ |
|---|---|---|---|---|---|---|---|
| 1/8 labeled | Easy-Seen | 90.7 | 93.6 | 85.2 | 86.5 | 79.9 | 2.7 |
| 1/8 labeled | Easy-Unseen | 79.4 | 81.1 | 67.1 | 68.2 | 60.4 | 4.8 |
| 1/8 labeled | Hard-Seen | 86.2 | 90.1 | 78.2 | 80.0 | 72.2 | 4.5 |
| 1/8 labeled | Hard-Unseen | 80.8 | 83.9 | 69.4 | 70.9 | 63.1 | 4.6 |
| 1/16 labeled | Easy-Seen | 89.9 | 92.4 | 84.3 | 85.5 | 78.9 | 3.4 |
| 1/16 labeled | Easy-Unseen | 78.3 | 79.8 | 66.0 | 66.9 | 58.9 | 4.9 |
| 1/16 labeled | Hard-Seen | 84.5 | 87.9 | 77.4 | 79.1 | 71.5 | 7.6 |
| 1/16 labeled | Hard-Unseen | 79.2 | 81.4 | 67.1 | 68.2 | 60.2 | 5.0 |
All values are percentages (%). Higher is better for every metric except MAE, where lower is better.
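For readers comparing their own predictions against these numbers, the three simplest metrics can be sketched as follows. This is a minimal single-frame illustration using the standard definitions of Dice, IoU, and MAE. The function name and the fixed `threshold` are assumptions, the benchmark's official evaluation toolbox may handle thresholds and averaging differently, and Sα, Eφ, and Fβ are not covered here.

```python
import numpy as np


def dice_iou_mae(pred_prob, gt_mask, threshold=0.5, eps=1e-8):
    """Return Dice, IoU, and MAE (all in %) for one predicted probability map."""
    gt = gt_mask.astype(float)
    pred_bin = pred_prob >= threshold
    gt_bin = gt > 0.5

    inter = np.logical_and(pred_bin, gt_bin).sum()
    union = np.logical_or(pred_bin, gt_bin).sum()

    dice = (2.0 * inter + eps) / (pred_bin.sum() + gt_bin.sum() + eps)
    iou = (inter + eps) / (union + eps)
    mae = np.abs(pred_prob - gt).mean()  # computed on soft predictions
    return {"Dice": 100.0 * dice, "IoU": 100.0 * iou, "MAE": 100.0 * mae}


# Toy 2x2 example: one true positive, one false positive.
pred = np.array([[0.9, 0.8], [0.2, 0.1]])
gt = np.array([[1, 0], [0, 0]])
metrics = dice_iou_mae(pred, gt)
```

Here the prediction overlaps the ground truth on one of two predicted pixels, giving a Dice of about 66.7, an IoU of 50.0, and an MAE of 30.0.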
Qualitative comparison under scribble supervision. ARTEMIS better preserves blurry, drifting, and small polyp regions.
Qualitative comparison under the 1/8 labeled training data setting. ARTEMIS reduces over-segmentation and under-segmentation for background-like polyps.
We provide the processed pseudo-label data, prediction maps, and trained model snapshots via OneDrive. These resources are released for reproducibility and comparison while the source code remains unavailable during peer review.
| Resource | Description | Link |
|---|---|---|
| Dataset | Dataset package used by ARTEMIS experiments | Download |
| Scribble / Point | Weak annotation files | Download |
| Point2Mask | Coarse masks generated from point prompts | Download |
| Mask2Mask Forward | Forward-evolved pseudo masks | Download |
| Mask2Mask Forward Logit | Forward propagation logits | Download |
| Mask2Mask Backward | Backward-evolved pseudo masks | Download |
| Mask2Mask Backward Logit | Backward propagation logits | Download |
| Reliable Reference Cache | Cached reliability-guided reference information | Download |
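Since forward and backward logits are released separately, one plausible way to combine them for inspection is sketched below. This is an assumption, not the procedure used in the paper: the function name, the simple probability averaging, and the agreement score are hypothetical, and the arrays are built in memory because the file format of the released logits is not described here.

```python
import numpy as np


def fuse_bidirectional(fwd_logit, bwd_logit, threshold=0.5):
    """Average forward/backward probabilities and measure their agreement.

    Returns a binary mask from the averaged probability and a per-pixel
    agreement score in [0, 1] (1 means both directions predict the same value).
    """
    p_fwd = 1.0 / (1.0 + np.exp(-fwd_logit))
    p_bwd = 1.0 / (1.0 + np.exp(-bwd_logit))
    fused = 0.5 * (p_fwd + p_bwd)
    mask = fused >= threshold
    agreement = 1.0 - np.abs(p_fwd - p_bwd)
    return mask, agreement


# Toy 2x2 logits: the two directions disagree only at the bottom-left pixel.
fwd = np.array([[3.0, -3.0], [3.0, -3.0]])
bwd = np.array([[3.0, -3.0], [-1.0, -3.0]])
mask, agreement = fuse_bidirectional(fwd, bwd)
```

Pixels where the two directions disagree receive a low agreement score, which could serve as a rough cue for where the pseudo mask deserves less trust.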
| Resource | Link |
|---|---|
| All Predictions | Download |
| ARTEMIS Scribble Predictions | Download |
| ARTEMIS Point Predictions | Download |
| ARTEMIS 1/8 Data Predictions | Download |
| ARTEMIS 1/16 Data Predictions | Download |
| Resource | Link |
|---|---|
| All Snapshots | Download |
| ARTEMIS Scribble Snapshot | Download |
| ARTEMIS Point Snapshot | Download |
| ARTEMIS 1/8 Data Snapshot | Download |
| ARTEMIS 1/16 Data Snapshot | Download |
This repository is currently maintained for the submission-stage manuscript. Source code, training scripts, and testing scripts are not released during peer review.
The full code will be made available after acceptance.
If you find ARTEMIS useful, please consider citing our work.
@article{wang2026artemis,
title={ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation},
author={Wang, Tong and Wang, Siwen and Qi, Yaolei and Zhou, Jinxing and He, Yuting and Yang, Guanyu and Xie, Yutong},
journal={arXiv preprint arXiv:2606.20161},
year={2026}
}
For questions about the paper or future code release, please contact tongwangnj@qq.com.




