Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zhaoyue-zephyrus/avion
Code release for "Training a Large Video Model on a Single Machine in a Day"
- Host: GitHub
- URL: https://github.com/zhaoyue-zephyrus/avion
- Owner: zhaoyue-zephyrus
- License: MIT
- Created: 2023-08-30T19:50:21.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-31T07:20:49.000Z (3 months ago)
- Last Synced: 2024-07-31T08:44:18.021Z (3 months ago)
- Topics: efficient-training, video-understanding
- Language: Python
- Homepage: http://arxiv.org/abs/2309.16669
- Size: 1.33 MB
- Stars: 102
- Watchers: 1
- Forks: 4
- Open Issues: 5
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# :airplane: avion
AVION is short for A VIdeo model in ONe day. AVION (meaning plane in French and Spanish) is fast.

[**Training a Large Video Model on a Single Machine in a Day**](http://arxiv.org/abs/2309.16669)
Yue Zhao, Philipp Krähenbühl
UT Austin
[arxiv](http://arxiv.org/abs/2309.16669) | [bibtex](#citing-avion)

## Installation
See [INSTALL.md](docs/INSTALL.md) to install this code.
## Main results
1. AVION enables video-language contrastive pre-training on Ego4D (original narratives) **on a single node of 8× consumer-grade GPUs within a day** (a minimal sketch of the contrastive objective follows these results).
| Method | Backbone | Batch size per GPU | GPU memory (GB) | Hardware | GPU×hour^ | EK100 MIR 0-shot Avg. mAP |
| :----: | :------: | :----------------: | :-------------: | :------: | :-------: | :-----------------------: |
| EgoVLP | TSF-B | 16 | 22 | 32× A100 | 1536 | 22.1 |
| Ours | ViT-B | 256 | 19 | 8× A5000 | 130 | 27.4 |

^The reported GPU×hour is *not* normalized for GPU generations. The cost for EgoVLP is obtained from the [original paper](https://arxiv.org/abs/2206.01670) (Sec. 6.1).
2. AVION speeds up LLM-augmented video-language contrastive pre-training (LaViLa) on Ego4D.
a. Pretraining cost and performance.
| Method | Backbone | Batch size per GPU | GPU memory (GB) | Hardware | GPU×hour^ | EK100 MIR 0-shot Avg. mAP |
| :----: | :------: | :----------------: | :-------------: | :------: | :-------: | :-----------------------: |
| LaViLa | TSF-B | 32 | 25 | 32× V100 | 1824 | 30.9 |
| Ours | ViT-B | 256 | 19 | 8× A5000 | 260 | 33.2 |

^The reported GPU×hour is *not* normalized for GPU generations.
b. Downstream performance.
| Method | Backbone | EK100 MIR Avg. mAP | EK100 MIR Avg. nDCG | EK100 CLS Action Top-1 |
| :----: | :------: | :----------------: | :-----------------: | :--------------------: |
| LaViLa | TSF-B | 50.5 | 65.0 | 46.9 |
| Ours | ViT-B | 51.7 | 66.8 | 49.5 |
| LaViLa | TSF-L | 50.9 | 66.5 | 51.0 |
| Ours | ViT-L | 54.5 | 69.0 | 54.5 |

:trophy: LaViLa+AVION helped us win the [CVPR 2023 EPIC-Kitchens Challenges](https://epic-kitchens.github.io/2023#results) in both the Action Recognition and Multi-Instance Retrieval tasks by a significant margin.
3. AVION speeds up VideoMAE pre-training (a masking sketch follows the table below).
| Method | Backbone | Epochs | GPU×hour^^ | Top-1/Top-5 (w/ fine-tuning) |
| :------: | :------: | :----: | :--------: | :------------------: |
| VideoMAE | ViT-B | 800 | 995 | 80.0/94.4 |
| Ours | ViT-B | 800 | 583 | 80.1/94.5 |

^^Both GPU×hour figures are measured in the same hardware environment (4× A5000 GPUs).
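
To make the MAE-style setup in result 3 concrete, the snippet below is a minimal sketch of tube masking, the masking pattern VideoMAE-style pre-training relies on: one high-ratio spatial mask is sampled per clip and repeated across all temporal positions, so only the small visible subset of patch tokens has to be encoded. This is an illustration written for this README, not code from AVION or VideoMAE; the mask ratio, token counts, and shapes are assumptions.

```python
import torch

def tube_mask(batch_size, num_time_steps, patches_per_frame, mask_ratio=0.9):
    """VideoMAE-style tube mask: sample one random spatial mask per clip and
    repeat it across all temporal positions. Returns a bool tensor of shape
    (batch_size, num_time_steps * patches_per_frame); True means "masked"."""
    num_masked = int(mask_ratio * patches_per_frame)
    noise = torch.rand(batch_size, patches_per_frame)
    ranks = noise.argsort(dim=1).argsort(dim=1)   # random rank of each spatial patch
    spatial_mask = ranks < num_masked             # exactly num_masked True entries per clip
    tube = spatial_mask.unsqueeze(1).expand(-1, num_time_steps, -1)  # repeat over time
    return tube.reshape(batch_size, -1)

# Toy usage: 8 temporal tokens (e.g. 16 frames with tubelet depth 2),
# 14*14 = 196 spatial patches for a 224x224 frame with 16x16 patches.
mask = tube_mask(batch_size=4, num_time_steps=8, patches_per_frame=196)
print(mask.shape, mask.float().mean().item())  # ~90% of tokens are masked
```
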
For more details, please refer to [MODEL_ZOO](./docs/MODEL_ZOO.md).
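
For results 1 and 2, the pre-training objective is a CLIP-style video-text contrastive loss over batches of paired clips and narrations. The sketch below is a minimal, self-contained version of such a symmetric InfoNCE loss, written for illustration only; the function name, embedding dimensions, and fixed temperature are assumptions and do not reflect the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, logit_scale=100.0):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (batch, dim) outputs of the video and text encoders;
    the i-th video and i-th text form a positive pair, and all other
    combinations in the batch serve as negatives.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * video_emb @ text_emb.t()   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)       # video -> matching text
    loss_t2v = F.cross_entropy(logits.t(), targets)   # text -> matching video
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage with random features standing in for encoder outputs.
video_emb = torch.randn(256, 512)   # e.g. a per-GPU batch of 256 clip embeddings
text_emb = torch.randn(256, 512)    # the paired narration embeddings
print(clip_style_contrastive_loss(video_emb, text_emb).item())
```

With a per-GPU batch of 256, as in the tables above, every clip is contrasted against the other 255 in-batch samples as negatives.
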
## License
[MIT License](./LICENSE).
## Acknowledgements
* The vision-language contrastive pretraining part is refactored from [LaViLa](https://github.com/facebookresearch/LaViLa).
* The MAE-style self-supervised pre-training part is built upon [VideoMAE](https://github.com/MCG-NJU/VideoMAE/).

## Citing AVION
```bibtex
@article{zhao2023training,
title={Training a large video model on a single machine in a day},
author={Zhao, Yue and Kr{\"a}henb{\"u}hl, Philipp},
journal={arXiv preprint arXiv:2309.16669},
year={2023}
}
```

```bibtex
@inproceedings{zhao2023lavila,
title={Learning Video Representations from Large Language Models},
author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
booktitle={CVPR},
year={2023}
}
```