https://github.com/vision-cair/visualgpt
VisualGPT, CVPR 2022 Proceeding, GPT as a decoder for vision-language models
data-efficient-image-caption image-caption visualgpt
- Host: GitHub
- URL: https://github.com/vision-cair/visualgpt
- Owner: Vision-CAIR
- License: mit
- Created: 2021-02-15T08:45:53.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2023-05-16T06:13:12.000Z (about 3 years ago)
- Last Synced: 2025-07-29T06:55:00.820Z (10 months ago)
- Topics: data-efficient-image-caption, image-caption, visualgpt
- Language: Python
- Homepage:
- Size: 6.18 MB
- Stars: 336
- Watchers: 13
- Forks: 54
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# VisualGPT
Our paper: [VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning](https://arxiv.org/abs/2102.10407)
## Main Architecture of Our VisualGPT

## Download the GPT-2 pretrained weights
```
curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
```
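For scripted setups, the same download can be done from Python. The following is a minimal sketch using only the standard library; `download_if_missing` is a hypothetical helper name, not part of this repository, and the URL is the one from the `curl` command above.

```python
import os
import shutil
import urllib.request

# URL copied from the curl command above.
GPT2_URL = "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin"


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper for illustration; not part of the repository.)
    """
    if os.path.exists(dest):
        return False
    tmp = dest + ".part"
    # Stream to a temporary file so an interrupted download never
    # leaves a truncated file under the final name.
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp, dest)
    return True


if __name__ == "__main__":
    fetched = download_if_missing(GPT2_URL, "gpt2-pytorch_model.bin")
    print("downloaded" if fetched else "already present")
```

Re-running the script is safe: an existing `gpt2-pytorch_model.bin` is left untouched.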
## Environment setup
Clone the repository and create the `visualgpt` conda environment:
```
conda env create -f environment.yml
conda activate visualgpt
```
Then download the spaCy English data:
```
python -m spacy download en
```
## Data preparation
We provide the COCO data for download. Download the annotations file [annotations.zip](https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing) and extract it.
Then download [coco_detections.hdf5](https://drive.google.com/open?id=1MV6dSnqViQfyvgyHrmAT_lLpFbkzp3mx), which stores the detection features as key-value pairs: each key is an image id and each value is a tensor of shape (N, 2048), where N is the number of detections for that image.
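To sanity-check the downloaded features file, a short script can list a few keys and confirm the (N, 2048) layout described above. This is a sketch under assumptions: `detection_shapes` and `check_feature_dim` are hypothetical helper names, `h5py` is assumed to be installed in the `visualgpt` environment, and the datasets are assumed to sit at the top level of the file.

```python
from typing import Dict, Mapping, Tuple


def detection_shapes(store: Mapping, limit: int = 5) -> Dict[str, Tuple[int, ...]]:
    """Return {key: shape} for the first `limit` entries of a key-value store.

    Works with any mapping whose values expose a `.shape` attribute,
    which includes an open h5py.File that holds datasets at its top level.
    """
    shapes = {}
    for i, key in enumerate(store):
        if i >= limit:
            break
        shapes[key] = tuple(store[key].shape)
    return shapes


def check_feature_dim(shapes, feature_dim=2048):
    """Raise ValueError if any entry is not a 2-D (N, feature_dim) array."""
    for key, shape in shapes.items():
        if len(shape) != 2 or shape[1] != feature_dim:
            raise ValueError(f"{key}: expected (N, {feature_dim}), got {shape}")


if __name__ == "__main__":
    import h5py  # third-party; assumed available in the conda environment

    with h5py.File("coco_detections.hdf5", "r") as f:
        shapes = detection_shapes(f)
        check_feature_dim(shapes)
        for key, shape in shapes.items():
            print(key, shape)
```

Only dataset metadata is read, so the check stays fast even though the file is large.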
## Code structure
Create the log folder with ``mkdir logs``, then start the training.
## Train the model
```
python train_visualGPT.py --batch_size 50 --head 12 --tau 0.2 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data
```
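Because the command is long, it can help to assemble it from a dictionary so that each flag appears exactly once and is easy to vary between runs. The sketch below is illustrative only: `TRAIN_ARGS` and `build_command` are hypothetical names, every flag and value is copied from the command above, and `--split_train_data` is treated as a boolean switch because the command passes it without a value.

```python
import shlex

# Flags and values copied from the training command above.
TRAIN_ARGS = {
    "batch_size": 50,
    "head": 12,
    "tau": 0.2,
    "features_path": "coco_detections.hdf5",
    "annotation_folder": "annotations",
    "lr": "1e-4",  # kept as a string to preserve the original notation
    "gpt_model_type": "gpt",
    "random_seed": 42,
    "log_file": "logs/log",
    "exp_name": "experiment_log",
    "decoder_layer": 12,
    "optimizer_type": "adamw",
    "gradient_accumulation_steps": 2,
    "train_percentage": 0.001,
    "split_train_data": True,  # assumed boolean switch (no value in the command)
}


def build_command(script, args):
    """Turn a {flag: value} dict into an argv list for `python <script>`."""
    argv = ["python", script]
    for name, value in args.items():
        if value is True:
            argv.append(f"--{name}")  # boolean switch: flag only
        elif value is not False and value is not None:
            argv.extend([f"--{name}", str(value)])
    return argv


if __name__ == "__main__":
    argv = build_command("train_visualGPT.py", TRAIN_ARGS)
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(a) for a in argv))
```

If the script follows the usual gradient accumulation convention (an assumption, not confirmed here), a batch size of 50 with 2 accumulation steps corresponds to an effective batch size of 100 per optimizer update.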
## Acknowledgement
This code uses resources from [Meshed Memory Transformer](https://github.com/aimagelab/meshed-memory-transformer) and [Transformers](https://github.com/huggingface/transformers).
Please cite our paper using the following BibTeX entry:
```
@InProceedings{Chen_2022_CVPR,
author = {Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
title = {VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {18030-18040}
}
```