You can find the dataset instructions in DATASET. We have provided all the metadata files of our data.
You can find all the models and the scripts in MODEL_ZOO.
We use CLIP pretrained models as the unmasked teachers by default:
- Follow extract.ipynb to extract the visual encoder from CLIP.
- Change `MODEL_PATH` in clip.py.
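The extraction step amounts to keeping only the visual-encoder weights from a full CLIP checkpoint. Below is a minimal sketch of that filtering; the `visual.` key prefix follows the OpenAI CLIP naming convention, and a dummy dict stands in for a real checkpoint:

```python
# Sketch: keep only the visual-encoder entries of a CLIP state dict.
# Key names assume the OpenAI CLIP convention of a "visual." prefix;
# a dummy dict stands in for a real checkpoint here.

def extract_visual_encoder(state_dict):
    """Return only the visual-encoder entries, with the prefix stripped."""
    prefix = "visual."
    return {k[len(prefix):]: v
            for k, v in state_dict.items()
            if k.startswith(prefix)}

# Dummy checkpoint containing both visual and text-encoder keys.
clip_ckpt = {
    "visual.conv1.weight": "...",
    "visual.transformer.resblocks.0.attn.in_proj_weight": "...",
    "transformer.resblocks.0.attn.in_proj_weight": "...",  # text encoder, dropped
    "token_embedding.weight": "...",                       # text encoder, dropped
}

visual_sd = extract_visual_encoder(clip_ckpt)
print(sorted(visual_sd))  # only the two stripped visual keys remain
```

The filtered dict can then be saved (e.g. with `torch.save`) and pointed to by `MODEL_PATH`.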
For training, you can simply run the pretraining scripts as follows:
```shell
# masked pretraining
bash ./exp_pt/videomamba_middle_5m/run.sh
# further unmasked pretraining for 1 epoch
bash ./exp_pt/videomamba_middle_5m_unmasked/run.sh
```

Notes:
- Set `data_dir` and `your_data_path` like `your_webvid_path` in data.py before running the scripts.
- Set `vision_encoder.pretrained` in the corresponding config files.
- Set `--rdzv_endpoint` to your `MASTER_NODE:MASTER_PORT` in torchrun.sh.
- `save_latest=True` will automatically save the latest checkpoint while training.
- `auto_resume=True` will automatically load the best or latest checkpoint while training.
- For unmasked pretraining, please set `pretrained_path` to load the checkpoint from masked pretraining.
For zero-shot evaluation, you can simply run the evaluation scripts as follows:

```shell
bash ./exp_zs/msrvtt/run.sh
```

Notes:
- Set `pretrained_path` in the running scripts before running them.
- Set `zero_shot=True` and `evaluate=True` for zero-shot evaluation.
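Zero-shot retrieval of this kind reduces to ranking candidate text embeddings against a video embedding by cosine similarity. A toy illustration with hand-made vectors — in the real pipeline the embeddings come from the pretrained encoders, so everything below is illustrative:

```python
import math

def cosine(u, v):
    # Cosine similarity between two plain-list vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_texts(video_emb, text_embs):
    """Indices of text embeddings sorted by similarity to the video, best first."""
    sims = [cosine(video_emb, t) for t in text_embs]
    return sorted(range(len(text_embs)), key=lambda i: sims[i], reverse=True)

video = [0.9, 0.1, 0.0]
texts = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.1], [0.5, 0.5, 0.5]]
print(rank_texts(video, texts))  # → [1, 2, 0]
```

Retrieval metrics such as R@1 then just check whether the ground-truth caption lands at rank 0.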