Commit 43c0f51: support Oriented ViLD
1 parent a8d0e3c

11 files changed: 1145 additions & 0 deletions

README.md (2 additions & 0 deletions)

@@ -6,6 +6,8 @@
 ## ✨ Latest Updates

+📆 [**2025-06-05**] : The code for **Oriented GLIP**, **Oriented GroundingDINO**, and **Oriented ViLD** is now available!
+
 📆 [**2025-02-08**] : The code for **Oriented CastDet** is now available! 🎉 CastDet now supports Open-vocabulary Oriented Aerial Object Detection. Stay tuned—**Oriented GLIP**, **Oriented GroundingDINO**, and **Oriented ViLD** are coming soon! 🚀

 📆 [**2024-11-04**] : Our paper ["Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation"](https://arxiv.org/abs/2411.02057) is now available on arXiv!

projects/ViLD/README.md (97 additions & 0 deletions)

# [Oriented ViLD] Open-Vocabulary Detection via Vision and Language Knowledge Distillation

- [Open-Vocabulary Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921)
## Introduction

Open-vocabulary object detection detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data: existing object detection datasets only contain hundreds of categories, and it is costly to scale further. To overcome this challenge, we propose ViLD. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. We then train a student detector whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher.

![vild_framework](resources/vild_framework.png)

Here is the implementation of **Oriented ViLD**.
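The alignment described above can be sketched as a pair of losses: a classification loss against the category text embeddings and a distillation loss toward the teacher's image embeddings. This is a simplified illustration, not the repo's actual `RotatedViLD` loss; the tensor shapes, temperature, and distillation weight are assumptions:

```python
import torch
import torch.nn.functional as F

def vild_losses(region_emb, text_emb, labels, clip_image_emb,
                temperature=0.01, distill_weight=0.5):
    """Sketch of ViLD-style losses.

    region_emb:     (N, D) region embeddings from the detector head (student)
    text_emb:       (C, D) CLIP text embeddings of the category names
    labels:         (N,)   ground-truth class indices of the proposals
    clip_image_emb: (N, D) CLIP image embeddings of cropped proposals (teacher)
    """
    # Normalize embeddings onto the unit sphere, as CLIP does.
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # ViLD-text: classify each region against the text embeddings.
    logits = region_emb @ text_emb.t() / temperature
    loss_text = F.cross_entropy(logits, labels)

    # ViLD-image: distill the teacher's image embeddings into the regions.
    loss_image = F.l1_loss(region_emb, F.normalize(clip_image_emb, dim=-1))

    return loss_text + distill_weight * loss_image
```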
## Quick Start

```shell
bash projects/ViLD/run.sh
```
## Training

1. Train the base detector:

```shell
exp1="oriented-rcnn_r50-fpn_20k_visdronezsd_base-set"
python tools/train.py \
    projects/ViLD/configs/$exp1.py
```
2. Merge the weights:

```shell
python projects/CastDetv2/tools/merge_weights.py \
    --clip_path checkpoints/RemoteCLIP-RN50.pt \
    --base_path work_dirs/$exp1/iter_20000.pth \
    --save_path work_dirs/$exp1/merged_vild_init_iter20k.pth \
    --target_model vild
```
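The internals of `merge_weights.py` are not shown in this commit; as a rough idea of what such a merge could look like, here is a hypothetical sketch that copies only a CLIP checkpoint's visual tower (the `visual.` namespace used by this project's configs) into the base detector's state dict. All names and the checkpoint layout are assumptions:

```python
import torch

def merge_vild_weights(clip_sd, base_ckpt, prefix='visual.'):
    """Hypothetical sketch: graft the CLIP visual-encoder weights into the
    base detector checkpoint so the merged file can initialize RotatedViLD.
    The real merge_weights.py may differ."""
    merged = dict(base_ckpt['state_dict'])
    for name, tensor in clip_sd.items():
        if name.startswith(prefix):  # keep only the visual tower
            merged[name] = tensor
    return {'state_dict': merged}
```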
3. Prepare the pseudo labels:

```shell
exp2="vild_oriented-rcnn_r50_fpn_visdronezsd_step1_prepare"
python tools/test.py \
    projects/ViLD/configs/$exp2.py \
    work_dirs/$exp1/merged_vild_init_iter20k.pth
```
4. Self-training:

```shell
exp3="vild_oriented-rcnn_r50_fpn_visdronezsd_step2_finetune"
python tools/train.py \
    projects/ViLD/configs/$exp3.py
```
## Evaluation

```shell
python tools/test.py \
    projects/ViLD/configs/$exp3.py \
    work_dirs/$exp3/iter_10000.pth \
    --work-dir work_dirs/$exp3/dior_test
```
## Acknowledgement

Thanks to the wonderful open-source projects [MMRotate](https://github.com/open-mmlab/mmrotate) and [ViLD](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild)!
## Citation

```
% Oriented ViLD (this repo)
@misc{li2024exploitingunlabeleddatamultiple,
    title={Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation},
    author={Yan Li and Weiwei Guo and Xue Yang and Ning Liao and Shaofeng Zhang and Yi Yu and Wenxian Yu and Junchi Yan},
    year={2024},
    eprint={2411.02057},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.02057},
}

% ViLD (horizontal detection)
@article{gu2021open,
    title={Open-vocabulary object detection via vision and language knowledge distillation},
    author={Gu, Xiuye and Lin, Tsung-Yi and Kuo, Weicheng and Cui, Yin},
    journal={arXiv preprint arXiv:2104.13921},
    year={2021}
}
```
Lines changed: 114 additions & 0 deletions
```python
_base_ = [
    'mmrotate::_base_/models/oriented-rcnn-le90_r50_fpn.py',
    'mmrotate::_base_/default_runtime.py',
    'vild_visdronezsd.py'
]

work_dir = 'work_dirs/vild_oriented-rcnn_r50_fpn_visdronezsd_step1_prepare'

custom_imports = dict(
    imports=['projects.ViLD.vild'], allow_failed_imports=False)

batch_size = 2
num_workers = 2
train_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
)

test_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
    dataset=dict(
        ann_file='ImageSets/Main/dior_trainval.txt',
        data_prefix=dict(img_path='JPEGImages-trainval'),
    )
)

model = dict(
    type='RotatedViLD',
    data_preprocessor=dict(
        type='mmdet.DetDataPreprocessor',
        mean=[122.7709383, 116.7460125, 104.09373615],
        std=[68.5005327, 66.6321579, 70.32316305],
        bgr_to_rgb=True,
        pad_size_divisor=32,
        boxtype2tensor=False),
    visual=dict(
        type='ModifiedResNet2',
        layers=[3, 4, 6, 3],
        width=64,
        output_dim=1024,
        heads=32,
        image_size=224,
    ),
    pseudo_cfg=dict(
        semi_weight=0.5,  # weight of the semi-supervised branch
        vector_path="projects/CastDetv2/resources/remoteCLIP_embeddings_normalized.npy",
        proposal_path=work_dir + '/proposals_300',
        pseudo_nms=True,
        iou_threshold=0.6,
        pre_keep=1000,
        post_keep=300,
        initialize=True,
        mini_batch_size=128
    ),
    roi_head=dict(
        bbox_head=dict(
            type='Shared2FCBBoxHeadZSD',
            num_classes=20,
            fc_cls=dict(
                type='Projection2',
                vector_path="projects/CastDetv2/resources/remoteCLIP_embeddings_normalized.npy",
                is_scale=True,
                is_grad_bg=True,
                is_grad=False
            ),
        ),
    )
)

# training schedule for 20k iterations
train_cfg = dict(
    type='IterBasedTrainLoop', max_iters=20000, val_interval=4000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

# learning rate policy
param_scheduler = [
    dict(
        type='LinearLR', start_factor=1.0 / 3, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=20000,
        by_epoch=False,
        milestones=[16000, 18000],
        gamma=0.1)
]

# optimizer: the CLIP visual branch is frozen (zero lr and weight decay)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001),
    clip_grad=dict(max_norm=35, norm_type=2),
    paramwise_cfg=dict(
        custom_keys={
            'visual': dict(decay_mult=0., lr_mult=0.)
        },
        norm_decay_mult=0.)
)

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=20),
    checkpoint=dict(by_epoch=False, interval=4000, max_keep_ckpts=5))
log_processor = dict(by_epoch=False)

visualizer = dict(
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend')
    ])

load_from = "checkpoints/merged_ori_20k_vild.pth"
```
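The `pseudo_cfg` block above implies a pseudo-label filtering pipeline: keep the top `pre_keep` proposals by score, apply NMS at `iou_threshold` (`pseudo_nms=True`), then keep at most `post_keep` boxes. A minimal sketch of that pipeline, using axis-aligned `[x1, y1, x2, y2]` boxes for brevity (the actual implementation would use rotated-box IoU), with all function names hypothetical:

```python
import torch

def box_iou(a, b):
    """IoU of two axis-aligned [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, float(x2 - x1)) * max(0.0, float(y2 - y1))
    area_a = float((a[2] - a[0]) * (a[3] - a[1]))
    area_b = float((b[2] - b[0]) * (b[3] - b[1]))
    return inter / (area_a + area_b - inter + 1e-6)

def filter_pseudo_labels(boxes, scores, iou_threshold=0.6,
                         pre_keep=1000, post_keep=300):
    """pre_keep -> greedy NMS -> post_keep, mirroring pseudo_cfg."""
    # 1. keep the top `pre_keep` proposals by score
    order = scores.argsort(descending=True)[:pre_keep]
    boxes, scores = boxes[order], scores[order]

    # 2. greedy NMS at `iou_threshold` (boxes are already score-sorted)
    keep, idxs = [], list(range(len(boxes)))
    while idxs:
        i = idxs.pop(0)
        keep.append(i)
        idxs = [j for j in idxs if box_iou(boxes[i], boxes[j]) <= iou_threshold]

    # 3. keep at most `post_keep` surviving boxes
    keep = keep[:post_keep]
    return boxes[keep], scores[keep]
```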
Lines changed: 112 additions & 0 deletions
```python
_base_ = [
    'mmrotate::_base_/models/oriented-rcnn-le90_r50_fpn.py',
    'mmrotate::_base_/default_runtime.py',
    'vild_visdronezsd.py'
]

work_dir = 'work_dirs/vild_oriented-rcnn_r50_fpn_visdronezsd_step2_finetune'

custom_imports = dict(
    imports=['projects.ViLD.vild'], allow_failed_imports=False)

batch_size = 12
num_workers = 2
train_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
)

test_dataloader = dict(
    dataset=dict(
        ann_file='ImageSets/Main/test.txt',
    )
)

model = dict(
    type='RotatedViLD',
    data_preprocessor=dict(
        type='mmdet.DetDataPreprocessor',
        mean=[122.7709383, 116.7460125, 104.09373615],
        std=[68.5005327, 66.6321579, 70.32316305],
        bgr_to_rgb=True,
        pad_size_divisor=32,
        boxtype2tensor=False),
    visual=dict(
        type='ModifiedResNet2',
        layers=[3, 4, 6, 3],
        width=64,
        output_dim=1024,
        heads=32,
        image_size=224,
    ),
    pseudo_cfg=dict(
        semi_weight=0.5,  # weight of the semi-supervised branch
        vector_path="projects/CastDetv2/resources/remoteCLIP_embeddings_normalized.npy",
        proposal_path='work_dirs/vild_oriented-rcnn_r50_fpn_visdronezsd_step1_prepare/proposals_300',
        pseudo_nms=True,
        iou_threshold=0.6,
        pre_keep=1000,
        post_keep=300,
        initialize=False,
        mini_batch_size=128,
        filter_empty_instances=True
    ),
    roi_head=dict(
        bbox_head=dict(
            type='Shared2FCBBoxHeadZSD',
            num_classes=20,
            fc_cls=dict(
                type='Projection2',
                vector_path="projects/CastDetv2/resources/remoteCLIP_embeddings_normalized.npy",
                is_scale=True,
                is_grad_bg=True,
                is_grad=False
            ),
        ),
    )
)

# training schedule for 10k iterations
train_cfg = dict(
    type='IterBasedTrainLoop', max_iters=10000, val_interval=2000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

# learning rate policy
param_scheduler = [
    dict(
        type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=180000,
        by_epoch=False,
        # note: these milestones exceed max_iters (10000), so the decay
        # steps are never reached during this 10k fine-tuning run
        milestones=[120000, 160000],
        gamma=0.1)
]

# optimizer: the CLIP visual branch is frozen (zero lr and weight decay)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001),
    clip_grad=dict(max_norm=35, norm_type=2),
    paramwise_cfg=dict(
        custom_keys={
            'visual': dict(decay_mult=0., lr_mult=0.)
        },
        norm_decay_mult=0.)
)

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=20),
    checkpoint=dict(by_epoch=False, interval=2000, max_keep_ckpts=1))
log_processor = dict(by_epoch=False)

visualizer = dict(
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend')
    ])

load_from = "work_dirs/oriented-rcnn_r50-fpn_20k_visdronezsd_base-set/merged_vild_init_iter20k.pth"
```
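Both configs set `fc_cls=dict(type='Projection2', ...)`, which replaces a learned linear classifier with frozen CLIP text embeddings: `is_grad=False` freezes the class embeddings, `is_grad_bg=True` leaves the background embedding trainable, and `is_scale=True` applies a learnable logit scale. A sketch of such a head; the real `Projection2` interface is not shown in this commit, so everything here is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEmbeddingClassifier(nn.Module):
    """Hypothetical sketch of a Projection2-style classification head."""

    def __init__(self, text_emb):
        super().__init__()
        # frozen, normalized category text embeddings: (C, D)  (is_grad=False)
        self.register_buffer('text_emb', F.normalize(text_emb, dim=-1))
        # learnable background embedding: (1, D)  (is_grad_bg=True)
        self.bg_emb = nn.Parameter(torch.randn(1, text_emb.shape[1]) * 0.01)
        # learnable logit scale, as in CLIP  (is_scale=True)
        self.logit_scale = nn.Parameter(torch.tensor(100.0).log())

    def forward(self, region_emb):
        region_emb = F.normalize(region_emb, dim=-1)
        # classes + background -> (C + 1, D)
        weights = torch.cat([self.text_emb, F.normalize(self.bg_emb, dim=-1)])
        return self.logit_scale.exp() * region_emb @ weights.t()
```

Swapping the class names at test time only requires swapping `text_emb`, which is what makes the detector open-vocabulary.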
