
# Perceiver-VL

### **[Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention](https://arxiv.org/abs/2211.11701) [WACV 2023 [bib](https://github.com/zinengtang/Perceiver_VL#citation)]**
[Zineng Tang*](https://zinengtang.github.io/), [Jaemin Cho*](https://j-min.io/), [Jie Lei](https://jayleicn.github.io/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)

Learning vision-and-language representations with iterative latent attention that scales linearly with long inputs.

## Introduction

![Perceiver-VL Architecture Overview](assets/architecture.png)
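
The core idea: a small, fixed set of learned latents cross-attends to the long sequence of video/text input tokens, so attention cost grows linearly with input length rather than quadratically. The snippet below is only a conceptual PyTorch sketch of that latent cross-attention step, not the repository's implementation (see `model/modules/perceiver_vl.py`); all names and sizes in it are illustrative.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Toy illustration: a fixed number of latents attends to a long input sequence.

    Cost is O(num_latents * seq_len), i.e. linear in the input length,
    instead of O(seq_len^2) for full self-attention over the inputs.
    """
    def __init__(self, num_latents=64, dim=256, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs):                                   # inputs: (B, seq_len, dim)
        b = inputs.size(0)
        z = self.latents.unsqueeze(0).expand(b, -1, -1)          # (B, num_latents, dim)
        z, _ = self.cross_attn(query=z, key=inputs, value=inputs)  # latents gather info from inputs
        z, _ = self.self_attn(z, z, z)                           # cheap processing in latent space
        return z

# Long video+text token sequence -> compact latent representation.
tokens = torch.randn(2, 4096, 256)
print(LatentCrossAttention()(tokens).shape)                      # torch.Size([2, 64, 256])
```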



## Install
### Setup `python` environment
```
conda create -n Perceiver-VL python=3.8  # you can also use another environment name or manager
conda activate Perceiver-VL
```

### Install other dependencies
```
pip install -r requirements.txt
```
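
Optionally, a quick sanity check that PyTorch installed correctly and can see your GPU (assuming PyTorch is pulled in by `requirements.txt`):

```python
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # True if a CUDA GPU is usable
```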

## Training

TODO: Finish datasets/tasks instructions and scripts

### Pretraining (scripts)

```
# Pretrain on WebVid + GCC
bash scripts/co_pretrain.sh
```

```
# Pretrain on WebVid
bash scripts/webvid_pretrain.sh
```

```
# Pretrain on GCC
bash scripts/gcc_pretrain.sh
```

```
# Pretrain on ImageNet
bash scripts/imagenet_pretrain.sh
```

### Pretrained Checkpoint
Download the pretrained checkpoint [[link]](https://huggingface.co/murgelab/PerceiverVL/resolve/main/perceivervl_mlm_itm_vtm.ckpt)
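
If you prefer to fetch and inspect the checkpoint programmatically, here is a hedged sketch using `huggingface_hub` (not necessarily included in `requirements.txt`; the repo id and filename are read off the link above, and the `state_dict` layout is an assumption based on the pytorch-lightning setup):

```python
import torch
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Repo id and filename taken from the download link above.
ckpt_path = hf_hub_download(
    repo_id="murgelab/PerceiverVL",
    filename="perceivervl_mlm_itm_vtm.ckpt",
)

# The checkpoint is saved by pytorch-lightning, so the weights are expected
# under the "state_dict" key (assumption; adjust if the layout differs).
ckpt = torch.load(ckpt_path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
print(len(state_dict), "tensors")
```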

### Finetuning on Downstream Tasks (scripts)

```
# Finetune on MSRVTT Retrieval
bash scripts/msrvtt_vrtr_finetune.sh
```

```
# Finetune on VQA
bash scripts/vqa_finetune.sh
```

## Code Structure

```
Perceiver_VL

├── assets # illustrations
│ └── architecture.png

├── model # main source
│ ├── datamodules # pytorch-lightning data wrappers
│ │ ├── datamodule_base.py
│ │ └── ...
│ ├── datasets # dataset definitions
│ │ ├── vqa_dataset.py
│ │ └── ...
│ ├── gadgets
│ │ └── my_metrics.py # metric utils
│ ├── modules
│ │ ├── heads.py # model heads
│ │ ├── model_module.py # pytorch-lightning wrapper for the model
│ │ ├── model_utils.py # utilities for training metrics
│ │ ├── objectives.py # pretraining/finetuning objectives
│ │ └── perceiver_vl.py # main model
│ ├── transforms # image transformation utils
│ │ └── ...
│ └── config.py # all configurations

├── scripts # all scripts
│ ├── vqa_finetune.sh
│ ├── co_pretrain.sh
│ └── ...

├── run.py # main
└── requirements.txt
```

## Citation
```
@inproceedings{tang2023wacv,
  title     = {Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention},
  author    = {Zineng Tang and Jaemin Cho and Jie Lei and Mohit Bansal},
  booktitle = {WACV},
  year      = {2023}
}
```

## Acknowledgement

Our codebase is based on [ViLT](https://github.com/dandelin/ViLT).
We thank the authors for their open-source contributions.

## Contact

Zineng Tang ([email protected])