⚠️ Due to a bug, the current video models were fine-tuned without layer decay; adding it may improve performance, as in MAE. We have fixed the bug but do not plan to retrain the models. (We tried layer decay for VideoMamba-M, and it did not help.)

- For all pretraining and fine-tuning, we adopt sparse/uniform sampling.
- `#Frame` $=$ `#input_frame` $\times$ `#crop` $\times$ `#clip`
- `#input_frame` means how many frames are input to the model per inference
- `#crop` means spatial crops (e.g., 3 for left/right/center)
- `#clip` means temporal clips (e.g., 4 means repeated sampling of four clips with different start indices)
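A minimal sketch of how the sparse/uniform sampling and the `#Frame` accounting above can be read (function and variable names are illustrative, not the repo's API):

```python
def uniform_clip_indices(video_len, num_frames, num_clips):
    """Sparse/uniform sampling (illustrative sketch): divide the video into
    `num_frames` equal segments and take one frame per segment; each of the
    `num_clips` clips uses a different, evenly spaced offset inside the
    segments, i.e. a different start index."""
    seg = video_len / num_frames
    clips = []
    for c in range(num_clips):
        offset = seg * (c + 0.5) / num_clips  # evenly spaced clip offsets
        clips.append([min(int(i * seg + offset), video_len - 1)
                      for i in range(num_frames)])
    return clips

# #Frame = #input_frame x #crop x #clip, e.g. the 8x3x4 setting:
num_frames, num_crops, num_clips = 8, 3, 4
frame_clips = uniform_clip_indices(300, num_frames, num_clips)
total_frames = num_frames * num_crops * num_clips  # 96 frames scored per video
```

The final prediction averages the scores over all `#crop` $\times$ `#clip` views.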
| Model | Setting | Checkpoint | Shell |
|---|---|---|---|
| VideoMamba-M | K400 800e | aliyun, 🤗HF | run.sh |
| VideoMamba-M | SthSthV2 200e | aliyun, 🤗HF | run.sh |
| Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
|---|---|---|---|---|---|---|
| VideoMamba-Ti | ImageNet-1K | 224 | 8x3x4 | 76.9 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 224 | 16x3x4 | 78.1 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 224 | 32x3x4 | 78.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 224 | 64x3x4 | 79.6 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 384 | 64x3x4 | 80.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 8x3x4 | 79.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 16x3x4 | 80.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 32x3x4 | 81.5 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 64x3x4 | 81.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 384 | 64x3x4 | 82.7 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 8x3x4 | 80.6 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 16x3x4 | 81.9 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 32x3x4 | 82.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 64x3x4 | 82.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 384 | 64x3x4 | 83.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 8x3x4 | 82.0 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 16x3x4 | 83.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 32x3x4 | 83.9 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 64x3x4 | 84.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 384 | 64x3x4 | 85.0 | aliyun, 🤗HF | run.sh |
| Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
|---|---|---|---|---|---|---|
| VideoMamba-Ti | ImageNet-1K | 224 | 8x3x4 | 65.1 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 224 | 16x3x4 | 66.0 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 288 | 16x3x4 | 66.2 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 8x3x4 | 66.6 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 16x3x4 | 67.7 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 288 | 16x3x4 | 68.1 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 8x3x4 | 67.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 16x3x4 | 68.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 288 | 16x3x4 | 68.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 8x3x4 | 70.2 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 16x3x4 | 71.0 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 288 | 16x3x4 | 71.4 | aliyun, 🤗HF | run.sh |
| Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
|---|---|---|---|---|---|---|
| VideoMamba-Ti | K400 | 224 | 32x3x4 | 94.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | K400 | 224 | 64x3x4 | 94.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | K400 | 224 | 32x3x4 | 95.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | K400 | 224 | 64x3x4 | 97.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | K400 | 224 | 32x3x4 | 94.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | K400 | 224 | 64x3x4 | 95.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK+K400 | 224 | 32x3x4 | 97.9 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK+K400 | 224 | 64x3x4 | 96.9 | aliyun, 🤗HF | run.sh |
| Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
|---|---|---|---|---|---|---|
| VideoMamba-Ti | K400 | 224 | 32x3x10 | 86.2 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | K400 | 224 | 64x3x10 | 87.0 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | K400 | 224 | 32x3x10 | 88.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | K400 | 224 | 64x3x10 | 88.7 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | K400 | 224 | 32x3x10 | 88.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | K400 | 224 | 64x3x10 | 89.5 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK+K400 | 224 | 32x3x10 | 89.6 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK+K400 | 224 | 64x3x10 | 90.4 | aliyun, 🤗HF | run.sh |
For LVU, we originally sampled frames sparsely from the raw videos, but the results were unstable due to the limited number of videos. However, we found that ViS4mer uses trimmed clips with a sliding window, which may improve the results. We will also provide the related dataset with sliding-window clips. Stay tuned!
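The sliding-window alternative could be sketched as follows (an illustrative sketch, not the actual ViS4mer or dataset code; the function name and parameters are assumptions):

```python
def sliding_window_clips(video_len, clip_len, stride):
    """Trimmed clips via a sliding window: clip start indices step forward
    by `stride` frames, and each clip spans `clip_len` consecutive frames,
    so neighboring clips overlap when stride < clip_len."""
    last_start = max(video_len - clip_len, 0)
    return [(s, s + clip_len) for s in range(0, last_start + 1, stride)]

# e.g. a 600-frame video, 64-frame clips, stride 32 -> overlapping clips
clips = sliding_window_clips(600, 64, 32)
```

Unlike sparse sampling, every clip sees densely consecutive frames, which gives more stable results on long videos at the cost of more inference passes.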