Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/zhaoyue-zephyrus/avion

[arXiv:2309.16669] Code release for "Training a Large Video Model on a Single Machine in a Day"
https://github.com/zhaoyue-zephyrus/avion

efficient-training video-understanding

Last synced: 4 days ago
JSON representation

[arXiv:2309.16669] Code release for "Training a Large Video Model on a Single Machine in a Day"

Awesome Lists containing this project

README

        

# :airplane: avion
AVION is short for A VIdeo model in ONe day. AVION (meaning plane in French and Spanish) is fast.

[**Training a Large Video Model on a Single Machine in a Day**](http://arxiv.org/abs/2309.16669)
Yue Zhao, Philipp Krähenbühl
UT Austin
[arxiv](http://arxiv.org/abs/2309.16669) | [bibtex](#citing-avion)

## Installation

See [INSTALL.md](docs/INSTALL.md) to install this code.

## Main results

1. AVION enables video-language contrastive pre-training on Ego4D (original narratives) **on a single node of 8× consumer-grade GPUs within a day**.

| Method | Backbone | batch-size
per GPU | GPU memory | Hardware | GPU×hour^ | EK100 MIR
0-shot Avg. mAP |
| :----: | :------: | :-------------------: | :--------: | :------: | :-------: | :--------------------------: |
| EgoVLP | TSF-B | 16 | 22 | 32× A100 | 1536 | 22.1 |
| Ours | ViT-B | 256 | 19 | 8× A5000 | 130 | 27.4 |

^The reported GPU×hour is *not* normalized for GPU generations. The cost for EgoVLP is obtained from the [original paper](https://arxiv.org/abs/2206.01670) (Sec 6.1).

2. AVION speeds up LLM-augmented video-language contrastive pre-training (LaViLa) on Ego4D.

a. Pretraining cost and performance.

| Method | Backbone | batch-size
per GPU | GPU memory | Hardware | GPU×hour^ | EK100 MIR
0-shot Avg. mAP |
| :----: | :------: | :-------------------: | :--------: | :------: | :-------: | :--------------------------: |
| LaViLa | TSF-B | 32 | 25 | 32× V100 | 1824 | 30.9 |
| Ours | ViT-B | 256 | 19 | 8× A5000 | 260 | 33.2 |

^The reported GPU×hour is *not* normalized for GPU generations.

b. Downstream performance.

| Method | Backbone | EK100 MIR
Avg. mAP | EK100 MIR
Avg. nDCG | EK100 CLS
Action Top-1 |
| :----: | :------: | :-------------------: | :--------------------: | :-----------------------: |
| LaViLa | TSF-B | 50.5 | 65.0 | 46.9 |
| Ours | ViT-B | 51.7 | 66.8 | 49.5 |
| LaViLa | TSF-L | 50.9 | 66.5 | 51.0 |
| Ours | ViT-L | 54.5 | 69.0 | 54.5 |

:trophy: LaViLa+AVION helps us win [CVPR 2023 EPIC-Kitchens Challenges](https://epic-kitchens.github.io/2023#results) in both Action Recognition and Multi-Instance Retrieval Tasks by a significant margin.

3. AVION speeds up VideoMAE pre-training.

| Method | Backbone | Epochs | GPU×hour^^ | top-1/top-5 (w/. FT) |
| :------: | :------: | :----: | :--------: | :------------------: |
| VideoMAE | ViT-B | 800 | 995 | 80.0/94.4 |
| Ours | ViT-B | 800 | 583 | 80.1/94.5 |

^^Both GPU×hour are measured on the same hardware environment (4× A5000 GPU).

For more details, please refer to [MODEL_ZOO](./docs/MODEL_ZOO.md).

## License

[MIT License](./LICENSE).

## Acknowledgements

* The vision-language contrastive pretraining part is refactored from [LaViLa](https://github.com/facebookresearch/LaViLa).
* The MAE-style self-supervised pre-training part is built upon [VideoMAE](https://github.com/MCG-NJU/VideoMAE/).

## Citing AVION

```bibtex
@article{zhao2023training,
title={Training a large video model on a single machine in a day},
author={Zhao, Yue and Kr{\"a}henb{\"u}hl, Philipp},
journal={arXiv preprint arXiv:2309.16669},
year={2023}
}
```

```bibtex
@inproceedings{zhao2023lavila,
title={Learning Video Representations from Large Language Models},
author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
booktitle={CVPR},
year={2023}
}
```