{"id":18950248,"url":"https://github.com/salesforce/albef","last_synced_at":"2025-04-08T11:13:09.629Z","repository":{"id":37369335,"uuid":"385418235","full_name":"salesforce/ALBEF","owner":"salesforce","description":"Code for ALBEF: a new vision-language pre-training method","archived":false,"fork":false,"pushed_at":"2022-09-20T04:57:34.000Z","size":73284,"stargazers_count":1625,"open_issues_count":65,"forks_count":205,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-04-01T10:08:11.850Z","etag":null,"topics":["contrastive-learning","image-text","representation-learning","vision-and-language","weakly-supervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/salesforce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":"SECURITY.md","support":null}},"created_at":"2021-07-13T00:07:09.000Z","updated_at":"2025-03-31T16:45:43.000Z","dependencies_parsed_at":"2022-07-12T16:17:38.803Z","dependency_job_id":null,"html_url":"https://github.com/salesforce/ALBEF","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FALBEF","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FALBEF/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FALBEF/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FALBEF/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/salesforce","download_url":"h
ttps://codeload.github.com/salesforce/ALBEF/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247829512,"owners_count":21002997,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["contrastive-learning","image-text","representation-learning","vision-and-language","weakly-supervised-learning"],"created_at":"2024-11-08T13:21:58.679Z","updated_at":"2025-04-08T11:13:09.609Z","avatar_url":"https://github.com/salesforce.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, NeurIPS 2021 Spotlight (Salesforce Research).\n\n## Announcement: ALBEF is now officially integrated into [LAVIS](https://github.com/salesforce/LAVIS) - a one-stop library for language-and-vision research and applications!\n\nThis is the official PyTorch implementation of the \u003ca href=\"https://arxiv.org/abs/2107.07651\"\u003eALBEF paper\u003c/a\u003e \u003ca href=\"https://blog.salesforceairesearch.com/align-before-fuse/\"\u003e[Blog]\u003c/a\u003e. \nThis repository supports pre-training on custom datasets, as well as finetuning on VQA, SNLI-VE, NLVR2, Image-Text Retrieval on MSCOCO and Flickr30k,\nand visual grounding on RefCOCO+. 
Pre-trained and finetuned checkpoints are released.\n\n\u003cimg src=\"img.png\" width=\"600\"\u003e\n\n\n### Requirements:\n* pytorch 1.8.0\n* transformers 4.8.1\n* timm 0.4.9\n\n### Download:\n\n* Pre-trained checkpoint [[14M](https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/ALBEF.pth)] / [[4M](https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/ALBEF_4M.pth)]\n* \u003ca href=\"https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/data.tar.gz\"\u003e Dataset json files for downstream tasks\u003c/a\u003e\n* \u003ca href=\"https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/json_pretrain.zip\"\u003e Dataset json files for pre-training\u003c/a\u003e (the image paths in each json file need to be changed to your own directory)\n* \u003ca href=\"https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/mscoco.pth\"\u003e Finetuned checkpoint for retrieval on MSCOCO \u003c/a\u003e\n* \u003ca href=\"https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/flickr30k.pth\"\u003e Finetuned checkpoint for retrieval on Flickr30k \u003c/a\u003e\n* \u003ca href=\"https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/vqa.pth\"\u003e Finetuned checkpoint for VQA \u003c/a\u003e\n* \u003ca href=\"https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/refcoco.pth\"\u003e Finetuned checkpoint for visual grounding on RefCOCO+ \u003c/a\u003e\n\n### Visualization:\nWe provide code in visualize.ipynb to visualize the important areas in an image for each word in a text. \nHere is an example visualization using the visual grounding checkpoint.\n\nTry the Replicate demo here [![Replicate](https://replicate.com/salesforce/albef/badge)](https://replicate.com/salesforce/albef).\n\n\u003cimg src=\"examples/visualization.png\" width=\"700\"\u003e\n\n### Pre-training on custom datasets:\n1. Prepare training json files where each json file contains a list. 
Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}. \n2. In configs/Pretrain.yaml, set the paths for the json files.\n3. Pre-train the model using 8 A100 GPUs:\n\u003cpre\u003epython -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain \u003c/pre\u003e \n\n### Image-Text Retrieval:\n\n1. Download MSCOCO or Flickr30k datasets from the original websites.\n2. Download and extract the provided dataset json files.\n3. In configs/Retrieval_coco.yaml or configs/Retrieval_flickr.yaml, set the paths for the json files and the image path.\n4. Finetune the pre-trained checkpoint using 8 A100 GPUs:\n\u003cpre\u003epython -m torch.distributed.launch --nproc_per_node=8 --use_env Retrieval.py \\\n--config ./configs/Retrieval_flickr.yaml \\\n--output_dir output/Retrieval_flickr \\\n--checkpoint [Pretrained checkpoint]\u003c/pre\u003e \n\n### VQA:\n1. Download VQA v2 dataset and Visual Genome dataset from the original websites.\n2. Download and extract the provided dataset json files.\n3. In configs/VQA.yaml, set the paths for the json files and the image paths.\n4. Finetune the pre-trained checkpoint using 8 A100 GPUs:\n\u003cpre\u003epython -m torch.distributed.launch --nproc_per_node=8 --use_env VQA.py \\\n--config ./configs/VQA.yaml \\\n--output_dir output/vqa \\\n--checkpoint [Pretrained checkpoint]\u003c/pre\u003e \n5. Evaluate the result using the official evaluation server.\n\n### Visual Entailment:\n1. Download SNLI-VE dataset from the original website.\n2. Download and extract the provided dataset json files.\n3. In configs/VE.yaml, set the paths for the json files and the image path.\n4. 
Finetune the pre-trained checkpoint using 8 A100 GPUs:\n\u003cpre\u003epython -m torch.distributed.launch --nproc_per_node=8 --use_env VE.py \\\n--config ./configs/VE.yaml \\\n--output_dir output/VE \\\n--checkpoint [Pretrained checkpoint]\u003c/pre\u003e \n\n### Visual Grounding on RefCOCO+:\n1. Download MSCOCO dataset from the original website.\n2. Download and extract the provided dataset json files.\n3. In configs/Grounding.yaml, set the paths for the json files and the image path.\n4. Finetune the pre-trained checkpoint using 8 A100 GPUs:\n\u003cpre\u003epython -m torch.distributed.launch --nproc_per_node=8 --use_env Grounding.py \\\n--config ./configs/Grounding.yaml \\\n--output_dir output/RefCOCO \\\n--gradcam_mode itm \\\n--block_num 8 \\\n--checkpoint [Pretrained checkpoint]\u003c/pre\u003e \n\n### NLVR2:\nNLVR2 requires an additional pre-training step with text-assignment (TA) to adapt the model for image-pair inputs. In order to perform TA, first set the paths for the json training files in configs/NLVR_pretrain.yaml, then run:\n\u003cpre\u003epython -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain_nlvr.py \\\n--config ./configs/NLVR_pretrain.yaml \\\n--output_dir output/NLVR_pretrain \\\n--checkpoint [Pretrained checkpoint]\u003c/pre\u003e \n\nWe provide the \u003ca href=\"https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/pretrain_model_nlvr.pth\"\u003e checkpoint \u003c/a\u003e after TA pre-training, which can be fine-tuned with the following steps.\n1. Download NLVR2 dataset from the original website.\n2. Download and extract the provided dataset json files.\n3. In configs/NLVR.yaml, set the paths for the json files and the image path.\n4. 
Finetune the pre-trained checkpoint using 8 A100 GPUs:\n\u003cpre\u003epython -m torch.distributed.launch --nproc_per_node=8 --use_env NLVR.py \\\n--config ./configs/NLVR.yaml \\\n--output_dir output/NLVR \\\n--checkpoint [TA pretrained checkpoint]\u003c/pre\u003e \n\n### Citation\nIf you find this code to be useful for your research, please consider citing.\n\u003cpre\u003e\n@inproceedings{ALBEF,\n      title={Align before Fuse: Vision and Language Representation Learning with Momentum Distillation}, \n      author={Junnan Li and Ramprasaath R. Selvaraju and Akhilesh Deepak Gotmare and Shafiq Joty and Caiming Xiong and Steven Hoi},\n      year={2021},\n      booktitle={NeurIPS},\n}\u003c/pre\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2Falbef","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsalesforce%2Falbef","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2Falbef/lists"}