Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/jiasenlu/vilbert_beta

Host: GitHub
URL: https://github.com/jiasenlu/vilbert_beta
Owner: jiasenlu
Created: 2019-08-17T00:49:01.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2022-11-21T21:32:12.000Z (over 1 year ago)
Last Synced: 2024-01-27T19:45:33.777Z (5 months ago)
Language: Jupyter Notebook
Size: 69.7 MB
Stars: 469
Watchers: 15
Forks: 96
Open Issues: 44
Metadata Files:
- Readme: README.md

Lists

awesome-vision-and-language - vilbert

README

# ViLBERT

#### ViLBERT_beta has been deprecated. Please see [vilbert-multi-task](https://github.com/facebookresearch/vilbert-multi-task), which includes implementations for [12-in-1: Multi-Task Vision and Language Representation Learning](https://arxiv.org/abs/1912.02315)

Code and pre-trained models for **[ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks](https://arxiv.org/abs/1908.02265)**.

*Note: This codebase is still in beta release to replicate the paper's performance.*

## Repository Setup

1. Create a fresh conda environment, and install all dependencies.

```text
conda create -n vilbert python=3.6
conda activate vilbert
git clone https://github.com/jiasenlu/vilbert_beta
cd vilbert_beta
pip install -r requirements.txt
```

2. Install PyTorch.

```
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
```

3. Install apex, following the instructions at https://github.com/NVIDIA/apex.

4. Compile the tools.

```
cd tools/refer
make
```
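
As a quick sanity check after the setup steps, the sketch below reports which of the packages installed above are visible from the active environment. It is a minimal illustration rather than part of the repository: the helper name `missing_packages` is hypothetical, and the package list only reflects what steps 2 and 3 install.

```python
import importlib.util
import sys

# Packages installed in steps 2 and 3 above (top-level import names).
REQUIRED = ("torch", "torchvision", "apex")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it inside the activated `vilbert` environment. A non-empty list means the corresponding install step needs to be repeated.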

## Data Setup

See `README.md` under `data` for details on preparing the data, and `vlbert_tasks.yml` for the per-task settings.

## Pre-trained model for Evaluation

## Evaluation

### Zero-Shot Image Retrieval

We can directly use the pre-trained ViLBERT model for zero-shot image retrieval on Flickr30k.

1: Download the pretrained model with objective `Conceptual Caption` and put it under `save`

2: Update `features_h5path1` and `val_annotations_jsonpath` in `vlbert_tasks.yml` to load the Flickr30k test set image features and JSON file (the default is the training features).

3: Use the following command to evaluate the pre-trained 6-layer ViLBERT model (evaluation currently supports only a single GPU):

```bash
python eval_retrieval.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect/pytorch_model_9.bin --config_file config/bert_base_6layer_6conect.json --task 3 --split test --batch_size 1 --zero_shot
```

### Image Retrieval

1: Download the pretrained model with objective `Image Retrieval` and put it under `save`

2: Update `features_h5path1` and `val_annotations_jsonpath` in `vlbert_tasks.yml` to load the Flickr30k test set image features and JSON file (the default is the training features).

3: Use the following command to evaluate the pre-trained 6-layer ViLBERT model (evaluation currently supports only a single GPU):

```bash
python eval_retrieval.py --bert_model bert-base-uncased --from_pretrained save/RetrievalFlickr30k_bert_base_6layer_6conect-pretrained/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --task 3 --split test --batch_size 1
```
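
The two retrieval commands above differ only in the checkpoint path and the `--zero_shot` flag. The sketch below assembles them from one helper so the shared arguments are written once. The function name `build_retrieval_eval_command` is hypothetical. Every flag and default value is copied from the commands in this README, and nothing is executed, the command is only printed.

```python
import shlex


def build_retrieval_eval_command(checkpoint, zero_shot=False,
                                 config_file="config/bert_base_6layer_6conect.json"):
    """Assemble the eval_retrieval.py command from this README as an argument list."""
    cmd = [
        "python", "eval_retrieval.py",
        "--bert_model", "bert-base-uncased",
        "--from_pretrained", checkpoint,
        "--config_file", config_file,
        "--task", "3",
        "--split", "test",
        "--batch_size", "1",
    ]
    if zero_shot:
        # Only the zero-shot evaluation adds this flag.
        cmd.append("--zero_shot")
    return cmd


# Print the zero-shot command for copy-pasting into a shell.
zero_shot_cmd = build_retrieval_eval_command(
    "save/bert_base_6_layer_6_connect/pytorch_model_9.bin", zero_shot=True
)
print(" ".join(shlex.quote(arg) for arg in zero_shot_cmd))
```

The returned list can also be passed to `subprocess.run` from the repository root if you prefer to launch the evaluation from Python.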

### VQA

1: Download the pretrained model with objective `VQA` and put it under `save`

2: To test on the held-out validation split, use the following command:

```
python eval_tasks.py --bert_model bert-base-uncased --from_pretrained save/VQA_bert_base_6layer_6conect-pretrained/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --task 0 --split minval
```

### VCR

1: Download the pretrained model with objective `VCR` and put it under `save`

2: To test on VCR Q->A:

```
python eval_tasks.py --bert_model bert-base-uncased --from_pretrained save/VCR_Q-A-VCR_QA-R_bert_base_6layer_6conect-pretrained/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --task 1 --split val
```

3: To test on VCR QA->R

### RefCOCO+

1: Download the pretrained model with objective `RefCOCO+` and put it under `save`

2: We use the pre-computed detections/masks from [MAttNet](https://github.com/lichengunc/MAttNet) for the fully automatic comprehension task. Check the MAttNet repository for more details.

3: To test on the RefCOCO+ val set, use the following command:

```bash
python eval_tasks.py --bert_model bert-base-uncased --from_pretrained save/refcoco+_bert_base_6layer_6conect-pretrained/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --task 4
```

## Visiolinguistic Pre-training

Once you have extracted all the image features, train a 6-layer ViLBERT model on Conceptual Captions with:

```bash
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_concap.py --from_pretrained bert-base-uncased --bert_model bert-base-uncased --config_file config/bert_base_6layer_6conect.json --learning_rate 1e-4 --train_batch_size 512 --save_name pretrained
```

## Train ViLBERT for Downstream Tasks

### VQA

To fine-tune a 6-layer ViLBERT model for VQA with 8 GPUs, run the following command. `--tasks 0` selects the VQA task. Check `vlbert_tasks.yml` for more VQA settings.

```bash
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin --config_file config/bert_base_6layer_6conect.json --learning_rate 4e-5 --num_workers 16 --tasks 0 --save_name pretrained
```

### VCR

Similarly, to fine-tune a 6-layer ViLBERT model for the VCR task, run the following command. Here we jointly train the `Q->A` and `QA->R` tasks, so the tasks are specified as `--tasks 1-2`.

```bash
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin --config_file config/bert_base_6layer_6conect.json --learning_rate 2e-5 --num_workers 16 --tasks 1-2 --save_name pretrained
```
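
For reference, the task ids used across the commands in this README can be collected in one place. The sketch below is illustrative only: the dictionary keys are descriptive labels, not names from `vlbert_tasks.yml`. Id 2 for `QA->R` is inferred from `--tasks 1-2` and is not confirmed by a standalone command here. Splitting the flag on `-` is an assumption about how the training script reads it.

```python
# Task ids as they appear in the commands of this README.
TASK_IDS = {
    "VQA": 0,
    "VCR Q->A": 1,
    "VCR QA->R": 2,  # inferred from `--tasks 1-2`; no command here uses it alone
    "Flickr30k retrieval": 3,
    "RefCOCO+": 4,
}


def tasks_flag(*names):
    """Build a `--tasks` value such as "1-2" from task labels."""
    return "-".join(str(TASK_IDS[name]) for name in names)


def parse_tasks_flag(value):
    """Split a `--tasks` value such as "1-2" back into integer ids (assumed format)."""
    return [int(part) for part in value.split("-")]


print(tasks_flag("VCR Q->A", "VCR QA->R"))  # prints 1-2
print(parse_tasks_flag("1-2"))              # prints [1, 2]
```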

### Image Retrieval

### Refer Expression

- For single-GPU training, use a smaller batch size and simply remove `-m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0` from the command.
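
The single-GPU note amounts to deleting the launcher arguments from any of the training commands. A minimal sketch of that rewrite follows. The helper name `to_single_gpu` is hypothetical, and it only handles launcher options written in the `--name=value` form used in this README. It does not change the batch size, which still has to be lowered by hand.

```python
import shlex

# Options that belong to torch.distributed.launch in the commands of this README.
LAUNCHER_OPTIONS = ("--nproc_per_node", "--nnodes", "--node_rank")


def to_single_gpu(command):
    """Return `command` with the torch.distributed.launch wrapper removed."""
    kept = []
    for arg in shlex.split(command):
        if arg in ("-m", "torch.distributed.launch"):
            continue  # the launcher module itself
        if arg.split("=", 1)[0] in LAUNCHER_OPTIONS:
            continue  # launcher options written as --name=value
        kept.append(arg)
    return " ".join(shlex.quote(arg) for arg in kept)


distributed = (
    "python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 "
    "--node_rank=0 train_tasks.py --bert_model bert-base-uncased --tasks 0"
)
print(to_single_gpu(distributed))
# prints: python train_tasks.py --bert_model bert-base-uncased --tasks 0
```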

## References

If you find this code useful for your research, please cite our paper:

```
@article{lu2019vilbert,
title={ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks},
author={Lu, Jiasen and Batra, Dhruv and Parikh, Devi and Lee, Stefan},
journal={arXiv preprint arXiv:1908.02265},
year={2019}
}
```

## What does ViLBERT look like?