Commit 43c0f51: support Oriented ViLD
1 parent a8d0e3c

11 files changed: 1145 additions & 0 deletions

README.md (2 additions & 0 deletions)

@@ -6,6 +6,8 @@
 ## ✨ Latest Updates

+📆 [**2025-06-05**] : The code for **Oriented GLIP**, **Oriented GroundingDINO**, and **Oriented ViLD** is now available!
+
 📆 [**2025-02-08**] : The code for **Oriented CastDet** is now available! 🎉 CastDet now supports Open-vocabulary Oriented Aerial Object Detection. Stay tuned—**Oriented GLIP**, **Oriented GroundingDINO**, and **Oriented ViLD** are coming soon! 🚀

 📆 [**2024-11-04**] : Our paper ["Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation"](https://arxiv.org/abs/2411.02057) is now available on arXiv!

projects/ViLD/README.md (97 additions & 0 deletions)

# [Oriented ViLD] Open-Vocabulary Detection via Vision and Language Knowledge Distillation

- [Open-Vocabulary Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921)
## Introduction

Open-vocabulary object detection detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data: existing object detection datasets only contain hundreds of categories, and it is costly to scale further. To overcome this challenge, we propose ViLD. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. We then train a student detector whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher.

![vild_framework](resources/vild_framework.png)

Here is the implementation of **Oriented ViLD**.
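The alignment described above can be sketched as a pair of losses: a classification loss against the category text embeddings and a distillation loss toward the teacher's image embeddings. This is a simplified illustration, not the repo's actual `RotatedViLD` loss; the tensor shapes, temperature, and distillation weight are assumptions:

```python
import torch
import torch.nn.functional as F

def vild_losses(region_emb, text_emb, labels, clip_image_emb,
                temperature=0.01, distill_weight=0.5):
    """Sketch of ViLD-style losses.

    region_emb:     (N, D) region embeddings from the detector head (student)
    text_emb:       (C, D) CLIP text embeddings of the category names
    labels:         (N,)   ground-truth class indices of the proposals
    clip_image_emb: (N, D) CLIP image embeddings of cropped proposals (teacher)
    """
    # Normalize embeddings onto the unit sphere, as CLIP does.
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # ViLD-text: classify each region against the text embeddings.
    logits = region_emb @ text_emb.t() / temperature
    loss_text = F.cross_entropy(logits, labels)

    # ViLD-image: distill the teacher's image embeddings into the regions.
    loss_image = F.l1_loss(region_emb, F.normalize(clip_image_emb, dim=-1))

    return loss_text + distill_weight * loss_image
```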
## Quick Start

```shell
bash projects/ViLD/run.sh
```
## Training

1. Train the base detector:

```shell
exp1="oriented-rcnn_r50-fpn_20k_visdronezsd_base-set"
python tools/train.py \
    projects/ViLD/configs/$exp1.py
```
2. Merge the weights:

```shell
python projects/CastDetv2/tools/merge_weights.py \
    --clip_path checkpoints/RemoteCLIP-RN50.pt \
    --base_path work_dirs/$exp1/iter_20000.pth \
    --save_path work_dirs/$exp1/merged_vild_init_iter20k.pth \
    --target_model vild
```
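The internals of `merge_weights.py` are not shown in this commit; as a rough idea of what such a merge could look like, here is a hypothetical sketch that copies only a CLIP checkpoint's visual tower (the `visual.` namespace used by this project's configs) into the base detector's state dict. All names and the checkpoint layout are assumptions:

```python
import torch

def merge_vild_weights(clip_sd, base_ckpt, prefix='visual.'):
    """Hypothetical sketch: graft the CLIP visual-encoder weights into the
    base detector checkpoint so the merged file can initialize RotatedViLD.
    The real merge_weights.py may differ."""
    merged = dict(base_ckpt['state_dict'])
    for name, tensor in clip_sd.items():
        if name.startswith(prefix):  # keep only the visual tower
            merged[name] = tensor
    return {'state_dict': merged}
```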
3. Prepare the pseudo labels:

```shell
exp2="vild_oriented-rcnn_r50_fpn_visdronezsd_step1_prepare"
python tools/test.py \
    projects/ViLD/configs/$exp2.py \
    work_dirs/$exp1/merged_vild_init_iter20k.pth
```
4. Self-training:

```shell
exp3="vild_oriented-rcnn_r50_fpn_visdronezsd_step2_finetune"
python tools/train.py \
    projects/ViLD/configs/$exp3.py
```
## Evaluation

```shell
python tools/test.py \
    projects/ViLD/configs/$exp3.py \
    work_dirs/$exp3/iter_10000.pth \
    --work-dir work_dirs/$exp3/dior_test
```
## Acknowledgement

Thanks to the wonderful open-source projects [MMRotate](https://github.com/open-mmlab/mmrotate) and [ViLD](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild)!
## Citation

```
% Oriented ViLD (this repo)
@misc{li2024exploitingunlabeleddatamultiple,
    title={Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation},
    author={Yan Li and Weiwei Guo and Xue Yang and Ning Liao and Shaofeng Zhang and Yi Yu and Wenxian Yu and Junchi Yan},
    year={2024},
    eprint={2411.02057},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.02057},
}

% ViLD (horizontal detection)
@article{gu2021open,
    title={Open-vocabulary object detection via vision and language knowledge distillation},
    author={Gu, Xiuye and Lin, Tsung-Yi and Kuo, Weicheng and Cui, Yin},
    journal={arXiv preprint arXiv:2104.13921},
    year={2021}
}
```
Lines changed: 114 additions & 0 deletions
```python
_base_ = [
    'mmrotate::_base_/models/oriented-rcnn-le90_r50_fpn.py',
    'mmrotate::_base_/default_runtime.py',
    'vild_visdronezsd.py'
]

work_dir = 'work_dirs/vild_oriented-rcnn_r50_fpn_visdronezsd_step1_prepare'

custom_imports = dict(
    imports=['projects.ViLD.vild'], allow_failed_imports=False)

batch_size = 2
num_workers = 2
train_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
)

test_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
    dataset=dict(
        ann_file='ImageSets/Main/dior_trainval.txt',
        data_prefix=dict(img_path='JPEGImages-trainval'),
    )
)

model = dict(
    type='RotatedViLD',
    data_preprocessor=dict(
        type='mmdet.DetDataPreprocessor',
        mean=[122.7709383, 116.7460125, 104.09373615],
        std=[68.5005327, 66.6321579, 70.32316305],
        bgr_to_rgb=True,
        pad_size_divisor=32,
        boxtype2tensor=False),
    visual=dict(
        type='ModifiedResNet2',
        layers=[3, 4, 6, 3],
        width=64,
        output_dim=1024,
        heads=32,
        image_size=224,
    ),
    pseudo_cfg=dict(
        semi_weight=0.5,  # weight of the semi-supervised branch
        vector_path="projects/CastDetv2/resources/remoteCLIP_embeddings_normalized.npy",
        proposal_path=work_dir + '/proposals_300',
        pseudo_nms=True,
        iou_threshold=0.6,
        pre_keep=1000,
        post_keep=300,
        initialize=True,
        mini_batch_size=128
    ),
    roi_head=dict(
        bbox_head=dict(
            type='Shared2FCBBoxHeadZSD',
            num_classes=20,
            fc_cls=dict(
                type='Projection2',
                vector_path="projects/CastDetv2/resources/remoteCLIP_embeddings_normalized.npy",
                is_scale=True,
                is_grad_bg=True,
                is_grad=False
            ),
        ),
    )
)

# training schedule for 20k iterations
train_cfg = dict(
    type='IterBasedTrainLoop', max_iters=20000, val_interval=4000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

# learning rate policy
param_scheduler = [
    dict(
        type='LinearLR', start_factor=1.0 / 3, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=20000,
        by_epoch=False,
        milestones=[16000, 18000],
        gamma=0.1)
]

# optimizer: the CLIP visual branch is frozen (zero lr and weight decay)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001),
    clip_grad=dict(max_norm=35, norm_type=2),
    paramwise_cfg=dict(
        custom_keys={
            'visual': dict(decay_mult=0., lr_mult=0.)
        },
        norm_decay_mult=0.)
)

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=20),
    checkpoint=dict(by_epoch=False, interval=4000, max_keep_ckpts=5))
log_processor = dict(by_epoch=False)

visualizer = dict(
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend')
    ])

load_from = "checkpoints/merged_ori_20k_vild.pth"
```
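The `pseudo_cfg` block above implies a pseudo-label filtering pipeline: keep the top `pre_keep` proposals by score, apply NMS at `iou_threshold` (`pseudo_nms=True`), then keep at most `post_keep` boxes. A minimal sketch of that pipeline, using axis-aligned `[x1, y1, x2, y2]` boxes for brevity (the actual implementation would use rotated-box IoU), with all function names hypothetical:

```python
import torch

def box_iou(a, b):
    """IoU of two axis-aligned [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, float(x2 - x1)) * max(0.0, float(y2 - y1))
    area_a = float((a[2] - a[0]) * (a[3] - a[1]))
    area_b = float((b[2] - b[0]) * (b[3] - b[1]))
    return inter / (area_a + area_b - inter + 1e-6)

def filter_pseudo_labels(boxes, scores, iou_threshold=0.6,
                         pre_keep=1000, post_keep=300):
    """pre_keep -> greedy NMS -> post_keep, mirroring pseudo_cfg."""
    # 1. keep the top `pre_keep` proposals by score
    order = scores.argsort(descending=True)[:pre_keep]
    boxes, scores = boxes[order], scores[order]

    # 2. greedy NMS at `iou_threshold` (boxes are already score-sorted)
    keep, idxs = [], list(range(len(boxes)))
    while idxs:
        i = idxs.pop(0)
        keep.append(i)
        idxs = [j for j in idxs if box_iou(boxes[i], boxes[j]) <= iou_threshold]

    # 3. keep at most `post_keep` surviving boxes
    keep = keep[:post_keep]
    return boxes[keep], scores[keep]
```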
Lines changed: 112 additions & 0 deletions
```python
_base_ = [
    'mmrotate::_base_/models/oriented-rcnn-le90_r50_fpn.py',
    'mmrotate::_base_/default_runtime.py',
    'vild_visdronezsd.py'
]

work_dir = 'work_dirs/vild_oriented-rcnn_r50_fpn_visdronezsd_step2_finetune'

custom_imports = dict(
    imports=['projects.ViLD.vild'], allow_failed_imports=False)

batch_size = 12
num_workers = 2
train_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
)

test_dataloader = dict(
    dataset=dict(
        ann_file='ImageSets/Main/test.txt',
    )
)

model = dict(
    type='RotatedViLD',
    data_preprocessor=dict(
        type='mmdet.DetDataPreprocessor',
        mean=[122.7709383, 116.7460125, 104.09373615],
        std=[68.5005327, 66.6321579, 70.32316305],
        bgr_to_rgb=True,
        pad_size_divisor=32,
        boxtype2tensor=False),
    visual=dict(
        type='ModifiedResNet2',
        layers=[3, 4, 6, 3],
        width=64,
        output_dim=1024,
        heads=32,
        image_size=224,
    ),
    pseudo_cfg=dict(
        semi_weight=0.5,  # weight of the semi-supervised branch
        vector_path="projects/CastDetv2/resources/remoteCLIP_embeddings_normalized.npy",
        proposal_path='work_dirs/vild_oriented-rcnn_r50_fpn_visdronezsd_step1_prepare/proposals_300',
        pseudo_nms=True,
        iou_threshold=0.6,
        pre_keep=1000,
        post_keep=300,
        initialize=False,
        mini_batch_size=128,
        filter_empty_instances=True
    ),
    roi_head=dict(
        bbox_head=dict(
            type='Shared2FCBBoxHeadZSD',
            num_classes=20,
            fc_cls=dict(
                type='Projection2',
                vector_path="projects/CastDetv2/resources/remoteCLIP_embeddings_normalized.npy",
                is_scale=True,
                is_grad_bg=True,
                is_grad=False
            ),
        ),
    )
)

# training schedule for 10k iterations
train_cfg = dict(
    type='IterBasedTrainLoop', max_iters=10000, val_interval=2000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

# learning rate policy
param_scheduler = [
    dict(
        type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=180000,
        by_epoch=False,
        # note: these milestones exceed max_iters (10000), so the decay
        # steps are never reached during this 10k fine-tuning run
        milestones=[120000, 160000],
        gamma=0.1)
]

# optimizer: the CLIP visual branch is frozen (zero lr and weight decay)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001),
    clip_grad=dict(max_norm=35, norm_type=2),
    paramwise_cfg=dict(
        custom_keys={
            'visual': dict(decay_mult=0., lr_mult=0.)
        },
        norm_decay_mult=0.)
)

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=20),
    checkpoint=dict(by_epoch=False, interval=2000, max_keep_ckpts=1))
log_processor = dict(by_epoch=False)

visualizer = dict(
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend')
    ])

load_from = "work_dirs/oriented-rcnn_r50-fpn_20k_visdronezsd_base-set/merged_vild_init_iter20k.pth"
```
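Both configs set `fc_cls=dict(type='Projection2', ...)`, which replaces a learned linear classifier with frozen CLIP text embeddings: `is_grad=False` freezes the class embeddings, `is_grad_bg=True` leaves the background embedding trainable, and `is_scale=True` applies a learnable logit scale. A sketch of such a head; the real `Projection2` interface is not shown in this commit, so everything here is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEmbeddingClassifier(nn.Module):
    """Hypothetical sketch of a Projection2-style classification head."""

    def __init__(self, text_emb):
        super().__init__()
        # frozen, normalized category text embeddings: (C, D)  (is_grad=False)
        self.register_buffer('text_emb', F.normalize(text_emb, dim=-1))
        # learnable background embedding: (1, D)  (is_grad_bg=True)
        self.bg_emb = nn.Parameter(torch.randn(1, text_emb.shape[1]) * 0.01)
        # learnable logit scale, as in CLIP  (is_scale=True)
        self.logit_scale = nn.Parameter(torch.tensor(100.0).log())

    def forward(self, region_emb):
        region_emb = F.normalize(region_emb, dim=-1)
        # classes + background -> (C + 1, D)
        weights = torch.cat([self.text_emb, F.normalize(self.bg_emb, dim=-1)])
        return self.logit_scale.exp() * region_emb @ weights.t()
```

Swapping the class names at test time only requires swapping `text_emb`, which is what makes the detector open-vocabulary.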
