{"id":13429631,"url":"https://github.com/salesforce/BLIP","last_synced_at":"2025-03-16T03:32:01.954Z","repository":{"id":37429773,"uuid":"451691984","full_name":"salesforce/BLIP","owner":"salesforce","description":"PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation  ","archived":false,"fork":false,"pushed_at":"2024-08-05T12:45:49.000Z","size":6647,"stargazers_count":5086,"open_issues_count":132,"forks_count":675,"subscribers_count":31,"default_branch":"main","last_synced_at":"2025-03-11T23:09:14.443Z","etag":null,"topics":["image-captioning","image-text-retrieval","vision-and-language-pre-training","vision-language","vision-language-transformer","visual-question-answering","visual-reasoning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/salesforce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-25T01:19:25.000Z","updated_at":"2025-03-11T20:54:04.000Z","dependencies_parsed_at":"2024-11-19T07:15:20.503Z","dependency_job_id":null,"html_url":"https://github.com/salesforce/BLIP","commit_stats":{"total_commits":57,"total_committers":4,"mean_commits":14.25,"dds":"0.29824561403508776","last_synced_commit":"3a29b7410476bf5f2ba0955827390eb6ea1f4f9d"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FBLIP","tags_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FBLIP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FBLIP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FBLIP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/salesforce","download_url":"https://codeload.github.com/salesforce/BLIP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243822309,"owners_count":20353496,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image-captioning","image-text-retrieval","vision-and-language-pre-training","vision-language","vision-language-transformer","visual-question-answering","visual-reasoning"],"created_at":"2024-07-31T02:00:42.819Z","updated_at":"2025-03-16T03:32:01.329Z","avatar_url":"https://github.com/salesforce.png","language":"Jupyter Notebook","funding_links":[],"categories":["2 Foundation Models","Jupyter Notebook","Papers","其他_机器视觉","Multimodal Models","[:robot: machine-learning]([robot-machine-learning)](\u003chttps://github.com/stars/ketsapiwiq/lists/robot-machine-learning\u003e))","Vision LLM for Generation","Vision-Language Models (VLMs)"],"sub_categories":["2.3 Multimodal Foundation Models","2022","网络服务_其他"],"readme":"## BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation\n\n## Announcement: BLIP is now officially integrated into [LAVIS](https://github.com/salesforce/LAVIS) - a one-stop library for language-and-vision research and 
applications!\n\n\u003cimg src=\"BLIP.gif\" width=\"700\"\u003e\n\nThis is the PyTorch code of the \u003ca href=\"https://arxiv.org/abs/2201.12086\"\u003eBLIP paper\u003c/a\u003e [[blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/)]. The code has been tested on PyTorch 1.10.\nTo install the dependencies, run \u003cpre\u003epip install -r requirements.txt\u003c/pre\u003e \n\nCatalog:\n- [x] Inference demo\n- [x] Pre-trained and finetuned checkpoints\n- [x] Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2\n- [x] Pre-training code\n- [x] Zero-shot video-text retrieval\n- [x] Download of bootstrapped pre-training datasets \n\n\n### Inference demo:\nRun our interactive demo using [Colab notebook](https://colab.research.google.com/github/salesforce/BLIP/blob/main/demo.ipynb) (no GPU needed).\nThe demo includes code for: \n1. Image captioning\n2. Open-ended visual question answering\n3. Multimodal / unimodal feature extraction\n4. Image-text matching\n\nTry out the [Web demo](https://huggingface.co/spaces/Salesforce/BLIP), integrated into [Huggingface Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). \n\nA Replicate web demo and Docker image are also available at [![Replicate](https://replicate.com/salesforce/blip/badge)](https://replicate.com/salesforce/blip)\n\n### Pre-trained checkpoints:\nNum. 
pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L \n--- | :---: | :---: | :---: \n14M | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_14M.pth\"\u003eDownload\u003c/a\u003e| - | -\n129M | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth\"\u003eDownload\u003c/a\u003e| \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth\"\u003eDownload\u003c/a\u003e | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large.pth\"\u003eDownload\u003c/a\u003e\n\n### Finetuned checkpoints:\nTask | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L \n--- | :---: | :---: | :---:\nImage-Text Retrieval (COCO) | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_coco.pth\"\u003eDownload\u003c/a\u003e| - | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_retrieval_coco.pth\"\u003eDownload\u003c/a\u003e\nImage-Text Retrieval (Flickr30k) | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_flickr.pth\"\u003eDownload\u003c/a\u003e|  - | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_retrieval_flickr.pth\"\u003eDownload\u003c/a\u003e\nImage Captioning (COCO) | - | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption_capfilt_large.pth\"\u003eDownload\u003c/a\u003e| \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth\"\u003eDownload\u003c/a\u003e | \nVQA | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_vqa.pth\"\u003eDownload\u003c/a\u003e| \u003ca 
href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth\"\u003eDownload\u003c/a\u003e | - \nNLVR2 | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_nlvr.pth\"\u003eDownload\u003c/a\u003e| - | - \n\n\n### Image-Text Retrieval:\n1. Download COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.\n2. To evaluate the finetuned BLIP model on COCO, run:\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \\\n--config ./configs/retrieval_coco.yaml \\\n--output_dir output/retrieval_coco \\\n--evaluate\u003c/pre\u003e \n3. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as \"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth\". Then run:\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \\\n--config ./configs/retrieval_coco.yaml \\\n--output_dir output/retrieval_coco \u003c/pre\u003e \n\n### Image-Text Captioning:\n1. Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.\n2. To evaluate the finetuned BLIP model on COCO, run:\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate\u003c/pre\u003e \n3. To evaluate the finetuned BLIP model on NoCaps, generate results with: (evaluation needs to be performed on official server)\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py \u003c/pre\u003e \n4. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as \"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth\". 
Then run:\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=8 train_caption.py \u003c/pre\u003e \n\n### VQA:\n1. Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.\n2. To evaluate the finetuned BLIP model, generate results with: (evaluation needs to be performed on official server)\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate\u003c/pre\u003e \n3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as \"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth\". Then run:\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=16 train_vqa.py \u003c/pre\u003e \n\n### NLVR2:\n1. Download NLVR2 dataset from the original websites, and set 'image_root' in configs/nlvr.yaml.\n2. To evaluate the finetuned BLIP model, run\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate\u003c/pre\u003e \n3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as \"https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth\". Then run:\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=16 train_nlvr.py \u003c/pre\u003e \n\n### Finetune with ViT-L:\nIn order to finetune a model with ViT-L, simply change the config file to set 'vit' as large. Batch size and learning rate may also need to be adjusted accordingly (please see the paper's appendix for hyper-parameter details). \u003ca href=\"https://github.com/facebookresearch/fairscale\"\u003eGradient checkpoint\u003c/a\u003e can also be activated in the config file to reduce GPU memory usage. \n\n### Pre-train:\n1. Prepare training json files where each json file contains a list. 
Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}. \n2. In configs/pretrain.yaml, set 'train_file' as the paths for the json files.\n3. Pre-train the model using 8 A100 GPUs:\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/pretrain.yaml --output_dir output/Pretrain \u003c/pre\u003e \n\n### Zero-shot video-text retrieval:\n1. Download MSRVTT dataset following the instructions from https://github.com/salesforce/ALPRO, and set 'video_root' accordingly in configs/retrieval_msrvtt.yaml.\n2. Install [decord](https://github.com/dmlc/decord) with \u003cpre\u003epip install decord\u003c/pre\u003e \n3. To perform zero-shot evaluation, run\n\u003cpre\u003epython -m torch.distributed.run --nproc_per_node=8 eval_retrieval_video.py\u003c/pre\u003e \n\n### Pre-training datasets download:\nWe provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}. 
\n\nImage source | Filtered web caption | Filtered synthetic caption by ViT-B | Filtered synthetic caption by ViT-L\n--- | :---: | :---: | :---:\nCC3M+CC12M+SBU |  \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_filtered.json\"\u003eDownload\u003c/a\u003e|  \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered.json\"\u003eDownload\u003c/a\u003e|  \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json\"\u003eDownload\u003c/a\u003e\nLAION115M | \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_filtered.json\"\u003eDownload\u003c/a\u003e|  \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_synthetic_filtered.json\"\u003eDownload\u003c/a\u003e|  \u003ca href=\"https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_synthetic_filtered_large.json\"\u003eDownload\u003c/a\u003e\n\n### Citation\nIf you find this code to be useful for your research, please consider citing.\n\u003cpre\u003e\n@inproceedings{li2022blip,\n      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, \n      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},\n      year={2022},\n      booktitle={ICML},\n}\u003c/pre\u003e\n\n### Acknowledgement\nThe implementation of BLIP relies on resources from \u003ca href=\"https://github.com/salesforce/ALBEF\"\u003eALBEF\u003c/a\u003e, \u003ca href=\"https://github.com/huggingface/transformers\"\u003eHuggingface Transformers\u003c/a\u003e, and \u003ca href=\"https://github.com/rwightman/pytorch-image-models/tree/master/timm\"\u003etimm\u003c/a\u003e. 
We thank the original authors for their open-sourcing.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2FBLIP","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsalesforce%2FBLIP","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2FBLIP/lists"}