https://github.com/salesforce/albef

Code for ALBEF: a new vision-language pre-training method

Host: GitHub
URL: https://github.com/salesforce/albef
Owner: salesforce
License: bsd-3-clause
Created: 2021-07-13T00:07:09.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2022-09-20T04:57:34.000Z (over 3 years ago)
Last Synced: 2025-04-01T10:08:11.850Z (11 months ago)
Topics: contrastive-learning, image-text, representation-learning, vision-and-language, weakly-supervised-learning
Language: Python
Homepage:
Size: 69.9 MB
Stars: 1,625
Watchers: 12
Forks: 205
Open Issues: 65
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: CODEOWNERS
- Security: SECURITY.md

## Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, NeurIPS 2021 Spotlight (Salesforce Research).

## Announcement: ALBEF is now officially integrated into [LAVIS](https://github.com/salesforce/LAVIS) - a one-stop library for language-and-vision research and applications!

This is the official PyTorch implementation of the ALBEF paper [Blog].
This repository supports pre-training on custom datasets, as well as finetuning on VQA, SNLI-VE, NLVR2, Image-Text Retrieval on MSCOCO and Flickr30k,
and visual grounding on RefCOCO+. Pre-trained and finetuned checkpoints are released.

### Requirements:
* pytorch 1.8.0
* transformers 4.8.1
* timm 0.4.9
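
The pins above can be checked against the active environment before training. The following is a minimal sketch, not part of the repository; it assumes the PyPI distribution name for PyTorch is `torch` and treats a local build tag such as `+cu111` as matching the pinned version.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Requirements list. Mapping "pytorch" to the
# distribution name "torch" is an assumption of this sketch.
PINNED = {"torch": "1.8.0", "transformers": "4.8.1", "timm": "0.4.9"}


def check_requirements(pinned=PINNED):
    """Return {package: (installed_or_None, wanted, matches)} for each pin."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Strip a local build tag ("1.8.0+cu111" -> "1.8.0") before comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, wanted, matches)
    return report


if __name__ == "__main__":
    for name, (installed, wanted, ok) in check_requirements().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={installed} wanted={wanted} [{status}]")
```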

### Download:

* Pre-trained checkpoint [[14M](https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/ALBEF.pth)] / [[4M](https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/ALBEF_4M.pth)]
* Dataset json files for downstream tasks
* Dataset json files for pre-training (the image paths in each json file need to be changed to your own directory)
* Finetuned checkpoint for retrieval on MSCOCO
* Finetuned checkpoint for retrieval on Flickr30k
* Finetuned checkpoint for VQA
* Finetuned checkpoint for visual grounding on RefCOCO+
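
Since the image paths in the pre-training json files need to point at your own directory, a small script can rewrite them in bulk. This is a sketch under assumptions: it relies only on each item having an `image` key, and the directory names in the usage note are hypothetical placeholders.

```python
import json


def rebase_image_paths(items, old_root, new_root):
    """Return copies of items whose 'image' path has old_root swapped for new_root."""
    # Compare against "root/" so that "/data/img" does not match "/data/img2/...".
    old_prefix = old_root.rstrip("/") + "/"
    new_prefix = new_root.rstrip("/") + "/"
    rebased = []
    for item in items:
        path = item["image"]
        if path.startswith(old_prefix):
            path = new_prefix + path[len(old_prefix):]
        # Copy the item so the input list is left unmodified.
        rebased.append({**item, "image": path})
    return rebased


def rebase_json_file(src, dst, old_root, new_root):
    """Load a dataset json file, rewrite its image paths, and save the result."""
    with open(src) as f:
        items = json.load(f)
    with open(dst, "w") as f:
        json.dump(rebase_image_paths(items, old_root, new_root), f)
```

For example, `rebase_json_file("coco.json", "coco_local.json", "/original/root", "/my/images")` would write a copy of `coco.json` whose paths start with `/my/images/`; all four arguments here are illustrative.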

### Visualization:
We provide code in visualize.ipynb to visualize the important areas in an image for each word in a text.
Here is an example visualization using the visual grounding checkpoint.

Try the Replicate demo here [![Replicate](https://replicate.com/salesforce/albef/badge)](https://replicate.com/salesforce/albef).

### Pre-training on custom datasets:
1. Prepare training json files, each containing a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}.
2. In configs/Pretrain.yaml, set the paths for the json files.
3. Pre-train the model using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain
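
The json layout from step 1 can be produced and sanity-checked with a few lines of Python. This is a minimal sketch; the helper names and the file names in the usage note are hypothetical, and the check only enforces what step 1 states (a list of dictionaries with `image` and `caption` keys).

```python
import json


def build_pretrain_items(pairs):
    """Turn (image_path, caption) pairs into the list-of-dicts layout from step 1."""
    return [{"image": str(path), "caption": str(text)} for path, text in pairs]


def validate_pretrain_items(items):
    """Raise ValueError unless items is a list of dicts with 'image' and 'caption'."""
    if not isinstance(items, list):
        raise ValueError("top-level JSON value must be a list")
    for i, item in enumerate(items):
        if not isinstance(item, dict) or not {"image", "caption"} <= set(item):
            raise ValueError(f"item {i} is missing 'image' or 'caption'")
    return items


def write_pretrain_json(pairs, dst):
    """Write the pairs to dst as one training json file."""
    with open(dst, "w") as f:
        json.dump(validate_pretrain_items(build_pretrain_items(pairs)), f)
```

Calling `write_pretrain_json([("/my/images/0001.jpg", "a dog on a beach")], "custom.json")` would create a file that can then be listed in configs/Pretrain.yaml.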

### Image-Text Retrieval:

1. Download MSCOCO or Flickr30k datasets from the original websites.
2. Download and extract the provided dataset json files.
3. In configs/Retrieval_coco.yaml or configs/Retrieval_flickr.yaml, set the paths for the json files and the image path.
4. Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/Retrieval_flickr \
--checkpoint [Pretrained checkpoint]
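
With `--use_env`, `torch.distributed.launch` passes the local rank to each worker through the `LOCAL_RANK` environment variable instead of a `--local_rank` argument, alongside `RANK` and `WORLD_SIZE`. The sketch below shows how a script could read that layout and fall back to a single process when launched directly. It illustrates the mechanism and is not the repository's own initialization code; the function name is hypothetical.

```python
import os


def read_distributed_env(environ=os.environ):
    """Read the rank layout that the launcher exports to each worker process."""
    if "RANK" in environ and "WORLD_SIZE" in environ:
        return {
            "distributed": True,
            "rank": int(environ["RANK"]),              # global index of this process
            "world_size": int(environ["WORLD_SIZE"]),  # total number of processes
            "local_rank": int(environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        }
    # No launcher variables found: behave as a single, non-distributed process.
    return {"distributed": False, "rank": 0, "world_size": 1, "local_rank": 0}


if __name__ == "__main__":
    print(read_distributed_env())
```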

### VQA:
1. Download VQA v2 dataset and Visual Genome dataset from the original websites.
2. Download and extract the provided dataset json files.
3. In configs/VQA.yaml, set the paths for the json files and the image paths.
4. Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env VQA.py \
--config ./configs/VQA.yaml \
--output_dir output/vqa \
--checkpoint [Pretrained checkpoint]

5. Evaluate the result using the official evaluation server.
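
Before uploading, it can help to check the result file locally. The sketch below assumes the evaluation server accepts a JSON list with one `{"question_id": int, "answer": str}` entry per question, which is the layout used by the public VQA challenge; confirm against the server's own instructions. The function name is hypothetical.

```python
import json


def check_vqa_results(results):
    """Validate a list of {'question_id': int, 'answer': str} entries; return its length."""
    if not isinstance(results, list):
        raise ValueError("results must be a list")
    seen = set()
    for i, entry in enumerate(results):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i} is not a dict")
        qid, answer = entry.get("question_id"), entry.get("answer")
        if not isinstance(qid, int) or not isinstance(answer, str):
            raise ValueError(f"entry {i} needs an int question_id and a str answer")
        if qid in seen:
            raise ValueError(f"question_id {qid} appears more than once")
        seen.add(qid)
    return len(results)


def check_vqa_result_file(path):
    """Load a result json file and validate its contents."""
    with open(path) as f:
        return check_vqa_results(json.load(f))
```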

### Visual Entailment:
1. Download SNLI-VE dataset from the original website.
2. Download and extract the provided dataset json files.
3. In configs/VE.yaml, set the paths for the json files and the image path.
4. Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env VE.py \
--config ./configs/VE.yaml \
--output_dir output/VE \
--checkpoint [Pretrained checkpoint]

### Visual Grounding on RefCOCO+:
1. Download MSCOCO dataset from the original website.
2. Download and extract the provided dataset json files.
3. In configs/Grounding.yaml, set the paths for the json files and the image path.
4. Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Grounding.py \
--config ./configs/Grounding.yaml \
--output_dir output/RefCOCO \
--gradcam_mode itm \
--block_num 8 \
--checkpoint [Pretrained checkpoint]

### NLVR2:
NLVR2 requires an additional pre-training step with text-assignment (TA) to adapt the model for image-pair inputs. In order to perform TA, first set the paths for the json training files in configs/NLVR_pretrain.yaml, then run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain_nlvr.py \
--config ./configs/NLVR_pretrain.yaml \
--output_dir output/NLVR_pretrain \
--checkpoint [Pretrained checkpoint]

We provide the checkpoint after TA pre-training, which can be fine-tuned with the following steps.
1. Download NLVR2 dataset from the original website.
2. Download and extract the provided dataset json files.
3. In configs/NLVR.yaml, set the paths for the json files and the image path.
4. Finetune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env NLVR.py \
--config ./configs/NLVR.yaml \
--output_dir output/NLVR \
--checkpoint [TA pretrained checkpoint]
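
All of the launch commands in this README share one shape: the launcher, a task script, `--config`, `--output_dir`, optional task-specific flags, and an optional `--checkpoint`. The sketch below assembles such a command as an argument list, which can help when scripting several runs. The helper name is hypothetical and the checkpoint paths in the usage note are placeholders, not files shipped with the repository.

```python
import shlex


def build_launch_command(script, config, output_dir, checkpoint=None,
                         nproc_per_node=8, extra_args=()):
    """Return the launch command for one task as a list of arguments."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}", "--use_env",
        script, "--config", config, "--output_dir", output_dir,
    ]
    # Task-specific flags, e.g. the Grad-CAM options used for grounding.
    cmd += list(extra_args)
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    return cmd


if __name__ == "__main__":
    grounding = build_launch_command(
        "Grounding.py", "./configs/Grounding.yaml", "output/RefCOCO",
        checkpoint="path/to/ALBEF.pth",  # placeholder path
        extra_args=["--gradcam_mode", "itm", "--block_num", "8"],
    )
    # Print a copy-pastable shell line; pass the list to subprocess.run to execute it.
    print(shlex.join(grounding))
```

Lowering `nproc_per_node` adapts the command to machines with fewer GPUs; the README's own runs use 8.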

### Citation
If you find this code to be useful for your research, please consider citing.


@inproceedings{ALBEF,
  title={Align before Fuse: Vision and Language Representation Learning with Momentum Distillation},
  author={Junnan Li and Ramprasaath R. Selvaraju and Akhilesh Deepak Gotmare and Shafiq Joty and Caiming Xiong and Steven Hoi},
  year={2021},
  booktitle={NeurIPS},
}
