
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/all-in-one-exploring-unified-video-language/visual-question-answering-on-msrvtt-qa-1)](
https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-1?p=all-in-one-exploring-unified-video-language)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/all-in-one-exploring-unified-video-language/visual-question-answering-on-msvd-qa-1)](
https://paperswithcode.com/sota/visual-question-answering-on-msvd-qa-1?p=all-in-one-exploring-unified-video-language)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/all-in-one-exploring-unified-video-language/tgif-frame-on-tgif-qa)](
https://paperswithcode.com/sota/tgif-frame-on-tgif-qa?p=all-in-one-exploring-unified-video-language)

[comment]: <> ([![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/all-in-one-exploring-unified-video-language/video-retrieval-on-msr-vtt)]()

[comment]: <> (https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=all-in-one-exploring-unified-video-language))

# All-in-one

Code for the paper: All in One: Exploring Unified Video-Language Pre-training [arXiv](https://arxiv.org/abs/2203.07303)
---

![ppl](figures/ppl.jpg)

## News
- 2022.06.07 Released the AllInOne+ model pre-trained on eight datasets (YTT+WebVid+HowTo+CC3+CC12+CoCo+VG+SBU).
- 2022.05.07 AllInOne+ is released. The main difference from AllInOne is the image and video co-training.
- 2022.03.25 Updated the README.
- 2022.03.14 The first version of AllInOne is released.

## Install

### 1. PyTorch Lightning
In this work, we use PyTorch Lightning for distributed training with mixed precision.
Install PyTorch and PyTorch Lightning first.

```bash
conda create -n allinone python=3.7
source activate allinone
cd [Path_To_This_Code]
pip install -r requirements.txt
```
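
For reference, distributed mixed-precision training in PyTorch Lightning is configured on the `Trainer`. The sketch below is illustrative only: `VideoLanguageModel` and `dm` are hypothetical placeholders rather than this repository's actual entry points, and it uses the Lightning 1.x-era API that matches this environment.

```python
# Minimal sketch of a Lightning trainer with DDP and fp16 mixed precision.
# Model/datamodule names are placeholders, not classes from this repo.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=8,             # GPUs per node; adjust to your hardware
    num_nodes=1,        # number of machines
    accelerator="ddp",  # distributed data parallel (Lightning 1.x API;
                        # newer releases use strategy="ddp" instead)
    precision=16,       # mixed-precision training
    max_steps=100_000,
)
# trainer.fit(VideoLanguageModel(), datamodule=dm)
```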

If all required packages, including ffmpeg, are already installed, skip step 2.

### 2. On-the-fly decode (may skip)
To speed up pre-training, we adopt on-the-fly decoding for fast IO.
Install ffmpeg as below.

#### 1. ffmpeg
```bash
sudo conda install -y ffmpeg
```

Please install any required packages that are not covered by requirements.txt.

If your server has no network access or installing ffmpeg is slow, download a static binary from [FFmpeg Static Builds](https://johnvansickle.com/ffmpeg/) and add it to your PATH, as follows:

```bash
export PATH=[PATH_TO_Dir/]ffmpeg-git-20220108-amd64-static:$PATH
```
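
Either way, you can confirm the binary is visible before training with a quick Python check (nothing repo-specific):

```python
# Confirm ffmpeg is reachable on PATH, whichever install route you took.
import shutil
import subprocess

assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
subprocess.run(["ffmpeg", "-version"], check=True)  # prints the version banner
```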

#### 2. pytorchvideo
Install pytorchvideo (for data augmentation) as below:

```bash
pip install ffmpeg-python
pip install pytorchvideo
```
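
To illustrate what these two packages are used for here, the sketch below decodes a short clip in memory with ffmpeg-python and applies pytorchvideo transforms. The file name, clip length, and sizes are arbitrary placeholders, not values from this repository's configs.

```python
# Hedged sketch: on-the-fly decode with ffmpeg-python, then pytorchvideo
# transforms. Path and dimensions below are hypothetical.
import ffmpeg
import numpy as np
import torch
from pytorchvideo.transforms import ShortSideScale, UniformTemporalSubsample

path, width, height = "video.mp4", 320, 240      # placeholder input
out, _ = (
    ffmpeg.input(path, ss=0, t=2)                # seek to 0 s, take a 2 s clip
    .output("pipe:", format="rawvideo", pix_fmt="rgb24", s=f"{width}x{height}")
    .run(capture_stdout=True, quiet=True)        # decode straight to memory
)
frames = np.frombuffer(out, np.uint8).reshape(-1, height, width, 3)  # T,H,W,C
video = torch.from_numpy(frames.copy()).permute(3, 0, 1, 2).float()  # C,T,H,W
video = UniformTemporalSubsample(3)(video)       # keep 3 frames, as in the paper
video = ShortSideScale(224)(video)               # rescale short side to 224
print(video.shape)                               # (C, T, H, W), short side 224
```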

## Download Pretrained Weights
We provide pretrained weights for several model variants on Google Drive.

| Model | PT Data | Parameters | Pretrained Weight | Training Log | Hparams |
| ---- | ---- | ---- | ---- | ---- | ---- |
| All-in-one-Ti | WebVid+HowTo | 12M | [Google Drive](https://drive.google.com/file/d/1-mS9U1xRnvumaftjhxJsr_t4WjJ-gp7t/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1j27-i7WsNDtj9k0CSnDC9sThMMjMRF-U/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1DmZ5apWqIuUMRg7igdN2sHM2INrT_UZo/view?usp=sharing) |
| All-in-one-S | WebVid+HowTo | 33M | [Google Drive](https://drive.google.com/file/d/1ntyEsFWLG8XQZ9oliYsrRZmhp_OMbQJ-/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/10uJZUMH10D1QD_o2g0WmXfv47xTAV5hJ/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/12levE9kXQbWykJHUKqXNQZz32vtOPRLt/view?usp=sharing) |
| All-in-one-B | WebVid+HowTo | 110M | [Google Drive](https://drive.google.com/file/d/1z3g891ND6CGCUkVzCXr2647wVG-15uUS/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1FBs6HOeXr3Bo_UZLDq13qscLTMqITGWC/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1D7OiF9HpIIsFk20LkCUWYThpXo_NPzT0/view?usp=sharing) |
| All-in-one-B+ | WebVid+HowTo+CC3 | 110M | [Google Drive](https://drive.google.com/file/d/1t-yWNjXJxGslBkKujlyYh-HUIdCc_gF7/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1EN1D0KjqOze9tDW15raC2AULIEqfd2DQ/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1uxtfWhVmi1BAhHzOzJMXjmwE6H3go2L9/view?usp=sharing) |
| All-in-one-B+ | WebVid+YTT+HowTo+CC3+CC12+CoCo+VG+SBU | 110M | [Google Drive](https://drive.google.com/file/d/1Yd2lKppaduqG_RO1gCA6OpAfB0_IXDoX/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1azTwITjlo7YA1pLP42mlJ45K9IV4JSxR/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1ddz8wtd0VSnqhu3Dd0MKiHqcNklG3NSv/view?usp=sharing) |

After downloading these pretrained weights, move them into the `pretrained` directory.
```bash
mkdir pretrained
cp *.ckpt pretrained/
```
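
To sanity-check a download, you can peek inside a checkpoint with plain PyTorch. This is a sketch: the file name is a placeholder for whichever `.ckpt` you fetched, and the exact keys depend on the training code.

```python
# Inspect a downloaded Lightning checkpoint; file name is a placeholder.
import torch

ckpt = torch.load("pretrained/all_in_one_base.ckpt", map_location="cpu")
print(list(ckpt.keys()))           # Lightning checkpoints usually include
                                   # 'state_dict', 'epoch', 'global_step', ...
state = ckpt["state_dict"]
print(len(state), "parameter tensors")
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```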

### Comparison with the state of the art

|Model|Params|Data|Frames|TGIF-Action|TGIF-Frame|MSR-VTT R@5|MSR-VTT R@10|
|---|---|---|---|---|---|---|---|
|ClipBERT|137M|I: CoCo+VG|8 x 2|82.9|59.4|49.2|63.5|
|VIOLET|198M|V: WebVid + I: CC3|16|87.1|-|63.0|73.4|
|All-in-one-S|33M|V: WebVid+HowTo|3|91.2|64.0|61.5|70.9|
|All-in-one-B|110M|V: WebVid+HowTo|3|**92.9**|**64.2**|**67.0**|**77.1**|
|All-in-one-B+|110M|V: WebVid + I: CC3|3|**95.4**|**67.2**|**68.1**|**77.3**|
|All-in-one-B+|110M|V: WebVid+YTT+HowTo + I: CC3+CC12+CoCo+VG+SBU|3|**96.3**|**68.5**|**70.3**|**79.2**|

In this table, I is short for Image and V is short for Video.

## Dataset Preparation
See [`DATA.md`](DATA.md)

## Pre-training
### Full Video Pre-training
See [`TRAIN.md`](TRAIN.md)
### Co-training with Image Dataset (All-in-one+)
See [`COTRAIN.md`](COTRAIN.md)

## Evaluation on Downstream Tasks
See [`EVAL.md`](EVAL.md)

Thanks to its unified design and sparse sampling, AllInOne requires far fewer FLOPs than prior methods.

![](figures/introduction.jpg)

## Citation
If you find our work helpful, please cite our paper.

```bibtex
@article{wang2022allinone,
  title={All in One: Exploring Unified Video-Language Pre-training},
  author={Wang, Alex Jinpeng and Ge, Yixiao and Yan, Rui and Ge, Yuying and Lin, Xudong and Cai, Guanyu and Wu, Jianping and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
```

## Contact

Email: _awinyimgprocess at gmail dot com_

If you have any problems or difficulty reproducing the results reported here, you can email me or open an issue.
We are also happy to merge code that transfers All-in-one to different tasks or datasets.

## Acknowledgement
This work is mainly based on [ViLT](https://github.com/dandelin/ViLT), [Frozen](https://github.com/m-bain/frozen-in-time) and [Merlot](https://github.com/rowanz/merlot).

## License
MIT