Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zinengtang/perceiver_vl
PyTorch code for "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention" (WACV 2023)
- Host: GitHub
- URL: https://github.com/zinengtang/perceiver_vl
- Owner: zinengtang
- License: MIT
- Created: 2022-03-14T02:25:39.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-02-05T12:44:38.000Z (almost 2 years ago)
- Last Synced: 2024-04-28T05:14:30.743Z (8 months ago)
- Topics: efficiency, retrieval, scalability, video-language, vision-and-language
- Language: Python
- Homepage:
- Size: 1.09 MB
- Stars: 32
- Watchers: 3
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Perceiver-VL
### **[Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention](https://arxiv.org/abs/2211.11701) [WACV 2023 [bib](https://github.com/zinengtang/Perceiver_VL#citation)]**
[Zineng Tang*](https://zinengtang.github.io/), [Jaemin Cho*](https://j-min.io/), [Jie Lei](https://jayleicn.github.io/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)

Learning vision-and-language representations via iterative latent attention that scales linearly with long inputs.
## Introduction
![Perceiver-VL Architecture Overview](assets/architecture.png)
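The efficiency claim above comes from the Perceiver-style attention pattern: a small, fixed-size set of latent vectors cross-attends to the full vision-and-language input, so per-layer cost grows linearly with input length rather than quadratically. Below is a minimal, illustrative PyTorch sketch of that pattern; it is not the repository's implementation (see `model/modules/perceiver_vl.py` for the real model), and all sizes are made up.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the actual configuration lives in model/config.py.
num_latents, dim, input_len = 64, 768, 4096

latents = torch.randn(1, num_latents, dim)   # fixed-size latent array
inputs = torch.randn(1, input_len, dim)      # long sequence of video/text tokens

# Latents are the queries, inputs are keys/values, so attention cost is
# O(num_latents * input_len): linear in input length for a fixed latent size.
cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
latents, _ = cross_attn(query=latents, key=inputs, value=inputs)

# Iterating this cross-attention (plus self-attention over the latents)
# refines the latent representation while keeping each step linear in input length.
print(latents.shape)  # torch.Size([1, 64, 768])
```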
## Install
### Setup `python` environment
```
conda create -n Perceiver-VL python=3.8  # You can also use a different environment.
conda activate Perceiver-VL
```

### Install other dependencies
```
pip install -r requirements.txt
```
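Optionally, a quick sanity check that the core dependencies installed correctly (this assumes `requirements.txt` pulls in `torch` and `pytorch-lightning`, which the code structure below relies on):

```python
# Sanity check after `pip install -r requirements.txt`
# (assumes torch and pytorch-lightning are among the requirements).
import torch
import pytorch_lightning as pl

print("torch", torch.__version__, "| lightning", pl.__version__, "| cuda", torch.cuda.is_available())
```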
## Training

TODO: Finish datasets/tasks instructions and scripts
### Pretraining (scripts)
```
# Pretrain on Webvid + GCC
bash scripts/co_pretrain.sh
```

```
# Pretrain on Webvid
bash scripts/webvid_pretrain.sh
```

```
# Pretrain on GCC
bash scripts/gcc_pretrain.sh
```

```
# Pretrain on ImageNet
bash scripts/imagenet_pretrain.sh
```

### Pretrained Checkpoint
Download Checkpoint [[link]](https://huggingface.co/murgelab/PerceiverVL/resolve/main/perceivervl_mlm_itm_vtm.ckpt)
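Once downloaded, the checkpoint can be inspected with plain PyTorch before finetuning. This is a generic sketch, assuming the file follows the usual PyTorch Lightning layout with weights stored under a `state_dict` key (an assumption, not something documented here):

```python
# Inspect the downloaded checkpoint (sketch; assumes a standard
# PyTorch Lightning .ckpt file with a "state_dict" entry).
import torch

ckpt = torch.load("perceivervl_mlm_itm_vtm.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
print(f"{len(state_dict)} parameter tensors")
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))
```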
### Finetuning on Downstream (scripts)

```
# Finetune on MSRVTT Retrieval
bash scripts/msrvtt_vrtr_finetune.sh
```

```
# Finetune on VQA
bash scripts/vqa_finetune.sh
```

## Code Structure
```
Perceiver_VL
│
├── assets # illustrations
│ └── architecture.png
│
├── model # main source
│ ├── datamodules # pytorch-lightning wrap
│ │ ├── datamodule_base.py
│ │ └── ...
│ ├── datasets # Datasets
│ │ ├── vqa_dataset.py
│ │ └── ...
│ ├── gadgets
│ │ └── my_metrics.py # metric utils
│ ├── modules
│ │ ├── heads.py # model heads
│ │ ├── model_module.py # pytorch-lightning wrap for model
│ │ ├── model_utils.py # pytorch-lightning wrap for training metrics
│ │ ├── objectives.py # pretraining/finetuning objectives
│ │ └── perceiver_vl.py # main model
│ ├── transforms # image transformation utils
│ │ └── ...
│ └── config.py # all configurations
│
├── scripts # all scripts
│ ├── vqa_finetune.sh
│ ├── co_pretrain.sh
│ └── ...
│
├── run.py # main
└── requirements.txt
```

## Citation
```
@inproceedings{tang2023wacv,
title = {Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention},
author = {Zineng Tang and Jaemin Cho and Jie Lei and Mohit Bansal},
booktitle = {WACV},
year = {2023}
}
```

## Acknowledgement
Our codebase is based on [ViLT](https://github.com/dandelin/ViLT).
We thank the authors for their open-source contributions.

## Contact
Zineng Tang ([email protected])