# SEVERE Benchmark
## 📰 News
**[2023.8.22]** Code and pre-trained models of [Tubelet Contrast](https://arxiv.org/abs/2303.11003) will be released soon! Keep an eye on this [repo](https://github.com/fmthoker/tubelet-contrast)!
**[2023.8.22]** Code for evaluating [Tubelet Contrast](https://arxiv.org/abs/2303.11003) pretrained models has been added to this repo. 🎉
**[2023.7.13]** Our [Tubelet Contrast](https://arxiv.org/abs/2303.11003) paper has been accepted at **ICCV 2023**! 🎉
Official code for our ECCV 2022 paper [How Severe is Benchmark-Sensitivity in Video
Self-Supervised Learning?](https://bpiyush.github.io/SEVERE-website/)
TL;DR. We propose the SEVERE (SEnsitivity of VidEo REpresentations) benchmark for evaluating the generalizability of representations obtained by existing and future self-supervised video learning methods.

## Overview of Experiments
We evaluate 9 video self-supervised learning (VSSL) methods on 7 video datasets for 6 video understanding tasks.
### Evaluated VSSL models
Below are the video self-supervised learning methods that we evaluate.
| Model | URL |
|-------|-----|
| SeLaVi| https://github.com/facebookresearch/selavi |
| MoCo| https://github.com/tinapan-pt/VideoMoCo |
| VideoMoCo | https://github.com/tinapan-pt/VideoMoCo |
| Pretext-Contrast | https://github.com/BestJuly/Pretext-Contrastive-Learning |
| RSPNet | https://github.com/PeihaoChen/RSPNet |
| AVID-CMA | https://github.com/facebookresearch/AVID-CMA |
| CtP | https://github.com/microsoft/CtP |
| TCLR | https://github.com/DAVEISHAN/TCLR |
| GDT | https://github.com/facebookresearch/GDT |
| Supervised | https://pytorch.org/vision/0.8/_modules/torchvision/models/video/resnet.html#r2plus1d_18 |
* For SeLaVi, MoCo, VideoMoCo, Pretext-Contrast, CtP, TCLR and GDT, we use the Kinetics-400 pretrained R(2+1D)-18 weights provided by the authors.
* For RSPNet and AVID-CMA, the author-provided R(2+1D)-18 weights differ from the R(2+1D)-18 architecture defined in ['A Closer Look at Spatiotemporal Convolutions for Action Recognition'](https://arxiv.org/abs/1711.11248). Thus, we use the official implementations of RSPNet and AVID-CMA to pretrain with the common R(2+1D)-18 backbone on the Kinetics-400 dataset.
* For Supervised, we use the Kinetics-400 pretrained R(2+1D)-18 weights from the PyTorch (torchvision) library.
Download the Kinetics-400 pretrained R(2+1D)-18 weights for each method from [here](https://surfdrive.surf.nl/files/index.php/s/Zw9tbuOYAInzVQC). Unzipping the downloaded file creates a folder `checkpoints_pretraining/` containing all the pretrained model weights.
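The checkpoint layout varies per method; below is a minimal, hedged sketch of how one of these R(2+1D)-18 checkpoints could be loaded into the torchvision backbone. The file path, the `state_dict` wrapper key, and the use of `strict=False` are assumptions about the checkpoint format, not the benchmark's exact loading code.
```python
# Minimal sketch (not the benchmark's exact loading code): load a Kinetics-400
# pretrained R(2+1)D-18 checkpoint into the torchvision backbone.
import torch
from torchvision.models.video import r2plus1d_18

model = r2plus1d_18(num_classes=400)  # architecture from Tran et al., 2018

# hypothetical file name; inspect checkpoints_pretraining/ for the per-method files
ckpt = torch.load("checkpoints_pretraining/some_method/checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # some checkpoints wrap the weights

# strict=False tolerates method-specific heads/projection layers that the plain
# torchvision backbone does not have
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```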
## Experiments
We divide these downstream evaluations across four axes:
### I. Downstream domain-shift
We evaluate the sensitivity of self-supervised methods to the domain shift of the downstream dataset with respect to the pre-training dataset, i.e. [Kinetics](https://arxiv.org/abs/1705.06950).
Please refer to [action_recognition/README.md](./action_recognition/README.md) for steps to reproduce the experiments on downstream datasets with varying domain shift.
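The exact training commands and configs live in [action_recognition/README.md](./action_recognition/README.md). As a rough, hedged illustration of what finetuning a pretrained R(2+1D)-18 on a downstream dataset involves, here is a sketch; the class count, optimizer settings, and the synthetic `downstream_loader` are illustrative placeholders, not the benchmark's settings.
```python
# Hedged sketch of downstream finetuning: replace the 400-way Kinetics head
# with an N-way head for the downstream dataset and train the full backbone.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

NUM_CLASSES = 101  # e.g. a UCF101-sized label space; adjust per dataset

model = r2plus1d_18(num_classes=400)
# ... load the self-supervised pretrained weights here (see the loading sketch above) ...
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # fresh classification head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# stand-in for a real video DataLoader yielding (B, C, T, H, W) clips and labels
downstream_loader = [(torch.randn(2, 3, 16, 112, 112),
                      torch.randint(0, NUM_CLASSES, (2,)))]

model.train()
for clips, labels in downstream_loader:
    optimizer.zero_grad()
    loss = criterion(model(clips), labels)
    loss.backward()
    optimizer.step()
```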
### II. Downstream sample-sizes
We evaluate the sensitivity of self-supervised methods to the number of downstream samples available for finetuning.
Please refer to [action_recognition/README.md](./action_recognition/README.md) for steps to reproduce the experiments with varying downstream samples.
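The sample-size experiments are reproduced via the same [action_recognition](./action_recognition/README.md) scripts. As a minimal, hedged sketch of the underlying idea (subsampling the labelled finetuning set), here is one way to do it; the placeholder dataset and the 10% fraction are illustrative, not the benchmark's actual splits.
```python
# Hedged sketch: restrict a downstream training set to a fixed fraction of its
# samples before finetuning, as a stand-in for the benchmark's sample-size splits.
import random
import torch
from torch.utils.data import Subset, TensorDataset

# tiny placeholder dataset standing in for a real video dataset such as UCF101
full_train_set = TensorDataset(torch.randn(100, 8), torch.randint(0, 101, (100,)))

fraction = 0.1  # e.g. keep only 10% of the labelled finetuning clips
k = max(1, int(fraction * len(full_train_set)))
indices = random.sample(range(len(full_train_set)), k)
small_train_set = Subset(full_train_set, indices)
print(len(small_train_set))  # 10
```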
### III. Downstream action granularities
We investigate whether self-supervised methods can learn fine-grained features required for recognizing semantically similar actions.
Please refer to [action_recognition/README.md](./action_recognition/README.md) for steps to reproduce the experiments with varying downstream actions.
### IV. Downstream task-shift
We study the sensitivity of video self-supervised methods to the nature of the downstream task.
**In-domain task shift**: For task-shift within-domain, we evaluate on the UCF dataset for the task of repetition counting. Please refer to [Repetition-Counting/README.md](./Repetition-Counting/README.md) for steps to reproduce the experiments.
**Out-of-domain task shift**: For task-shift as well as domain shift, we evaluate on multi-label action classification on Charades and action detection on AVA. Please refer to [action_detection_multi_label_classification/README.md](./action_detection_multi_label_classification/README.md) for steps to reproduce the experiments.
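For intuition, multi-label classification on Charades treats each of its action classes as an independent binary prediction. Below is a hedged sketch of that setup; the clip shape and training details are illustrative placeholders, and the full pipeline is in the linked README.
```python
# Hedged sketch of the multi-label action classification setup (Charades-style):
# one sigmoid output per action class, trained with binary cross-entropy.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

NUM_CLASSES = 157  # Charades action classes

model = r2plus1d_18(num_classes=400)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.BCEWithLogitsLoss()  # multi-label: independent sigmoid per class

clips = torch.randn(2, 3, 16, 112, 112)                   # (B, C, T, H, W)
targets = torch.randint(0, 2, (2, NUM_CLASSES)).float()   # multi-hot labels
loss = criterion(model(clips), targets)
loss.backward()
```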
## The SEVERE Benchmark
From our analysis, we distill the SEVERE benchmark, a subset of our experiments that can be useful for evaluating current and future video representations beyond standard benchmarks.
### Citation
If you use our work or code, kindly consider citing our paper:
```
@inproceedings{thoker2022severe,
author = {Thoker, Fida Mohammad and Doughty, Hazel and Bagad, Piyush and Snoek, Cees},
title = {How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?},
booktitle = {ECCV},
year = {2022},
}
```
### Acknowledgements
### Maintainers
* [Fida Thoker](https://fmthoker.github.io/)
* [Piyush Bagad](https://bpiyush.github.io/)
:bell: If you face an issue or have suggestions, please create a GitHub issue and we will try our best to address it soon.