# VisualGPT

Our paper: [VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning](https://arxiv.org/abs/2102.10407)

## Main Architecture of Our VisualGPT
![image](images/final_architecture.jpg)

## Download the GPT-2 pretrained weights
```
curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
```
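
If you want to confirm the download completed correctly before training, a minimal sketch like the following (assuming PyTorch is installed, which the environment below provides) prints a few parameter names and shapes from the checkpoint:

```
# Sanity check on the downloaded GPT-2 weights (filename matches the curl command above).
import torch

state_dict = torch.load("gpt2-pytorch_model.bin", map_location="cpu")
print(len(state_dict), "parameter tensors")
for name in list(state_dict)[:5]:            # peek at the first few entries
    print(name, tuple(state_dict[name].shape))
```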

## Environment setup
Clone the repository and create the `visualgpt` conda environment:

```
conda env create -f environment.yml
conda activate visualgpt
```

Then download the spaCy English language data:

```
python -m spacy download en
```
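
As a quick check that the language data is available, you can load it from Python (a minimal sketch; `en` is the shortcut name installed by the download command above):

```
# Sanity check that the spaCy English data loads.
import spacy

nlp = spacy.load("en")                       # "en" shortcut installed by `python -m spacy download en`
doc = nlp("A man rides a horse on the beach.")
print([token.text for token in doc])         # simple tokenization check
```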

## Data preparation
We provide the COCO dataset for download. Please download and extract the annotations file [annotations.zip](https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing), as well as [coco_detections.hdf5](https://drive.google.com/open?id=1MV6dSnqViQfyvgyHrmAT_lLpFbkzp3mx), in which the precomputed detection features are stored as key-value pairs: the key is based on the image id and the value is a tensor of shape (N, 2048), where N is the number of detections.
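
To inspect the detection file after downloading, a short sketch like the one below (assuming the `h5py` package is available in the environment) lists a few keys and the shapes of their feature tensors:

```
# Inspect coco_detections.hdf5: each entry maps an image-id-based key to an
# (N, 2048) feature tensor, where N is the number of detections.
import h5py

with h5py.File("coco_detections.hdf5", "r") as f:
    for key in list(f.keys())[:5]:           # peek at the first few entries
        print(key, f[key].shape)
```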

## Code structure

Create the log folder with ``mkdir logs``, then start training as described below.

## Train the model
```
python train_visualGPT.py --batch_size 50 --head 12 --tau 0.2 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data
```
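
With `--gradient_accumulation_steps 2`, gradients from two consecutive batches are summed before each optimizer update, so the effective batch size is 100; `--train_percentage 0.001` with `--split_train_data` trains on a small fraction of the COCO captions, which is the data-efficient setting studied in the paper. The accumulation pattern is roughly the following (a generic, self-contained sketch, not the repository's actual training loop):

```
# Generic gradient-accumulation sketch: gradients from `accumulation_steps`
# small batches are summed before one optimizer update, so the effective
# batch size is batch_size * accumulation_steps.
import torch

model = torch.nn.Linear(10, 1)                       # stand-in for the captioning model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accumulation_steps = 2                               # matches --gradient_accumulation_steps 2

optimizer.zero_grad()
for step in range(8):                                # stand-in for iterating the dataloader
    x, y = torch.randn(50, 10), torch.randn(50, 1)   # batch_size 50, as in the command above
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()           # scale so the sum matches one large batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                             # one update per 2 batches, effective batch 100
        optimizer.zero_grad()
```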

## Acknowledgement
This code uses resources from [Meshed Memory Transformer](https://github.com/aimagelab/meshed-memory-transformer) and [Transformers](https://github.com/huggingface/transformers).

Please cite our paper using the following BibTeX:

```
@InProceedings{Chen_2022_CVPR,
author = {Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
title = {VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {18030-18040}
}
```