https://github.com/showlab/all-in-one
[CVPR2023] All in One: Exploring Unified Video-Language Pre-training
- Host: GitHub
- URL: https://github.com/showlab/all-in-one
- Owner: showlab
- Created: 2022-03-14T13:35:03.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-03-25T11:46:31.000Z (over 1 year ago)
- Last Synced: 2024-04-28T05:08:04.532Z (7 months ago)
- Topics: codebase, pre-training, pytorch, video-language
- Language: Python
- Homepage: https://arxiv.org/abs/2203.07303
- Size: 1.53 MB
- Stars: 272
- Watchers: 6
- Forks: 16
- Open Issues: 4
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/all-in-one-exploring-unified-video-language/visual-question-answering-on-msrvtt-qa-1)](https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-1?p=all-in-one-exploring-unified-video-language)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/all-in-one-exploring-unified-video-language/visual-question-answering-on-msvd-qa-1)](https://paperswithcode.com/sota/visual-question-answering-on-msvd-qa-1?p=all-in-one-exploring-unified-video-language)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/all-in-one-exploring-unified-video-language/tgif-frame-on-tgif-qa)](https://paperswithcode.com/sota/tgif-frame-on-tgif-qa?p=all-in-one-exploring-unified-video-language)

[comment]: <> ([![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/all-in-one-exploring-unified-video-language/video-retrieval-on-msr-vtt)]()
[comment]: <> (https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=all-in-one-exploring-unified-video-language))

# All-in-one
Code for the paper: All in One: Exploring Unified Video-Language Pre-training [arXiv](https://arxiv.org/abs/2203.07303)

---

![ppl](figures/ppl.jpg)

## News
- 2022.03.25 Update Readme.
- 2022.06.07 Released the AllInOne+ model pre-trained on eight datasets (YTT+WebVid+HowTo+CC3+CC12+CoCo+VG+SBU).
- 2022.05.07 AllInOne+ is released. The main difference from AllInOne is image and video co-training.
- 2022.03.14 The first version of AllInOne is released.

## Install
### 1. PyTorch Lightning
In this work, we use PyTorch Lightning for distributed training with mixed precision.
Install PyTorch and PyTorch Lightning first.

```bash
# Create and activate a dedicated conda environment
conda create -n allinone python=3.7
source activate allinone
# Install the Python dependencies from the repository root
cd [Path_To_This_Code]
pip install -r requirements.txt
```

If all packages, including ffmpeg, are already installed, skip step 2.
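
To decide whether step 2 can be skipped, a quick check along the following lines may help. This is a minimal sketch and not part of the repository: the helper name `missing_requirements` is hypothetical, and it assumes the import names `torch` and `pytorch_lightning` and an `ffmpeg` binary on `PATH`.

```python
import importlib.util
import shutil


def missing_requirements(modules=("torch", "pytorch_lightning"), binaries=("ffmpeg",)):
    """Return a list describing every module or binary that cannot be found.

    Hypothetical helper for illustration; the default names are assumptions.
    """
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(f"python module: {name}")
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            missing.append(f"binary: {name}")
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("Everything found; step 2 can probably be skipped.")
```

An empty result only shows that the names resolve; it does not verify versions or that ffmpeg was built with the codecs your videos need.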
### 2. On-the-fly decoding (optional)
To speed up pre-training, we decode videos on the fly for fast I/O.
Install ffmpeg as below.

#### 1. ffmpeg
```bash
sudo conda install -y ffmpeg
```

Install any other required packages that are not listed in requirements.txt.
If your server cannot connect over HTTP, or the ffmpeg installation is slow, download a static binary from [FFmpeg Static Builds](https://johnvansickle.com/ffmpeg/) and add it to your `PATH` variable, as follows:
```bash
export PATH=[PATH_TO_Dir/]ffmpeg-git-20220108-amd64-static:$PATH
```

#### 2. pytorchvideo
Install pytorchvideo (for data augmentation) as below:

```bash
pip install ffmpeg-python
pip install pytorchvideo
```

## Download Pretrained Weights
We provide the following pretrained weights on Google Drive.

| Model | PT Data | Parameters | Pretrained Weight | Training Log | Hparams |
| ---- | ---- | ---- | ---- | ---- | ---- |
| All-in-one-Ti | Webvid+HowTo | 12M | [Google Drive](https://drive.google.com/file/d/1-mS9U1xRnvumaftjhxJsr_t4WjJ-gp7t/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1j27-i7WsNDtj9k0CSnDC9sThMMjMRF-U/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1DmZ5apWqIuUMRg7igdN2sHM2INrT_UZo/view?usp=sharing) |
| All-in-one-S | Webvid+HowTo | 33M | [Google Drive](https://drive.google.com/file/d/1ntyEsFWLG8XQZ9oliYsrRZmhp_OMbQJ-/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/10uJZUMH10D1QD_o2g0WmXfv47xTAV5hJ/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/12levE9kXQbWykJHUKqXNQZz32vtOPRLt/view?usp=sharing) |
| All-in-one-B | Webvid+HowTo | 110M | [Google Drive](https://drive.google.com/file/d/1z3g891ND6CGCUkVzCXr2647wVG-15uUS/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1FBs6HOeXr3Bo_UZLDq13qscLTMqITGWC/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1D7OiF9HpIIsFk20LkCUWYThpXo_NPzT0/view?usp=sharing) |
| All-in-one-B+ | Webvid+HowTo+CC3 | 110M | [Google Drive](https://drive.google.com/file/d/1t-yWNjXJxGslBkKujlyYh-HUIdCc_gF7/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1EN1D0KjqOze9tDW15raC2AULIEqfd2DQ/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1uxtfWhVmi1BAhHzOzJMXjmwE6H3go2L9/view?usp=sharing) |
| All-in-one-B+ | Webvid+YTT+HowTo+CC3+CC12+Coco+VG+SBU | 110M | [Google Drive](https://drive.google.com/file/d/1Yd2lKppaduqG_RO1gCA6OpAfB0_IXDoX/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1azTwITjlo7YA1pLP42mlJ45K9IV4JSxR/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1ddz8wtd0VSnqhu3Dd0MKiHqcNklG3NSv/view?usp=sharing) |

After downloading these pretrained weights, move them into the `pretrained` directory.
```bash
mkdir pretrained
cp *.ckpt pretrained/
```

### Comparison with state-of-the-art methods
|Model|Params|Data|Frames|TGIF-Action|TGIF-Frame|MSR-VTT R@5|MSR-VTT R@10|
|---|---|---|---|---|---|---|---|
|ClipBERT|137M|I:Coco+VG|8 x 2|82.9|59.4|49.2|63.5|
|VIOLET|198M|V:Webvid + I:CC3|16|87.1|-|63.0|73.4|
|All-in-one-S|33M|V:WebVid+Howto|3|91.2|64.0|61.5|70.9|
|All-in-one-B|110M|V:WebVid+Howto|3|**92.9**|**64.2**|**67.0**|**77.1**|
|All-in-one-B+|110M|V:Webvid + I:CC3|3|**95.4**|**67.2**|**68.1**|**77.3**|
|All-in-one-B+|110M|V:Webvid+YTT+HowTo + I:CC3+CC12+Coco+VG+SBU|3|**96.3**|**68.5**|**70.3**|**79.2**|

In this table, I stands for image data and V for video data.

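
To read the table as arithmetic, the sketch below copies the parameter counts and TGIF-Action scores of four rows and computes the parameter ratio and accuracy gap between two models. The names `RESULTS` and `compare` are hypothetical and used only for illustration.

```python
# Values copied from the comparison table: (parameters in millions, TGIF-Action accuracy)
RESULTS = {
    "ClipBERT": (137, 82.9),
    "VIOLET": (198, 87.1),
    "All-in-one-S": (33, 91.2),
    "All-in-one-B": (110, 92.9),
}


def compare(model, baseline, results=RESULTS):
    """Return (parameter ratio, TGIF-Action gain in points) of model vs. baseline."""
    model_params, model_acc = results[model]
    base_params, base_acc = results[baseline]
    return round(model_params / base_params, 2), round(model_acc - base_acc, 1)


if __name__ == "__main__":
    for model in ("All-in-one-S", "All-in-one-B"):
        for baseline in ("ClipBERT", "VIOLET"):
            ratio, gain = compare(model, baseline)
            print(f"{model} vs {baseline}: {ratio:.2f}x params, {gain:+.1f} TGIF-Action")
```

These are not controlled comparisons: the models differ in pre-training data and in the number of input frames, as the Data and Frames columns show.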
## Dataset Preparation
See [`DATA.md`](DATA.md)

## Pre-training
### Full Video Pre-training
See [`TRAIN.md`](TRAIN.md)
### Co-training with Image Dataset (All-in-one+)
See [`COTRAIN.md`](COTRAIN.md)

## Evaluation on Downstream Tasks
See [`EVAL.md`](EVAL.md)

Thanks to its unified design and sparse sampling, AllInOne requires far fewer FLOPs.
![](figures/introduction.jpg)
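
The comparison table lists 3 input frames for the All-in-one models. One common way to sample frames sparsely is to split a clip into equal segments and take the middle frame of each. The sketch below illustrates that idea only; it is an assumption, not the repository's actual sampler, and `sparse_sample_indices` is a hypothetical name.

```python
from typing import List


def sparse_sample_indices(num_frames: int, num_samples: int = 3) -> List[int]:
    """Pick the middle frame index of each of `num_samples` equal segments.

    Illustrative only: the real data loader may sample differently
    (for example, randomly within each segment during training).
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Midpoint of segment i is (i + 0.5) * num_frames / num_samples;
    # integer arithmetic avoids floating-point rounding surprises.
    return [((2 * i + 1) * num_frames) // (2 * num_samples) for i in range(num_samples)]


if __name__ == "__main__":
    print(sparse_sample_indices(90, 3))
```

With only a few frames per clip, the cost of the visual input no longer grows with the clip length, which is the intuition behind the FLOPs reduction mentioned above.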
## Citation
If you find our work helpful, please cite our paper.

```bibtex
@article{wang2022allinone,
  title={All in One: Exploring Unified Video-Language Pre-training},
  author={Wang, Alex Jinpeng and Ge, Yixiao and Yan, Rui and Ge, Yuying and Lin, Xudong and Cai, Guanyu and Wu, Jianping and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
```

## Contact
Email: _awinyimgprocess at gmail dot com_

If you have any problems or have difficulty reproducing the results reported in this code, you can email me or open an issue.
We are also happy to merge your code if you transfer All-in-one to different tasks or datasets.

## Acknowledgement
This work is mainly based on [ViLT](https://github.com/dandelin/ViLT), [Frozen](https://github.com/m-bain/frozen-in-time) and [Merlot](https://github.com/rowanz/merlot).

## License
MIT