# BridgeTower

This repo is the official `PyTorch` implementation of ["BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning"](https://arxiv.org/abs/2206.08657).

## Updates

- Feb. 2023: BridgeTower was integrated into [Hugging Face Transformers](https://github.com/huggingface/transformers/).
  - The [Model Hub](https://huggingface.co/BridgeTower), [code](https://github.com/huggingface/transformers/tree/main/src/transformers/models/bridgetower) and [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/bridgetower) are available; see the usage sketch after this list.
  - Thanks to [Anahita Bhiwandiwalla](https://github.com/abhiwand), [Tiep Le](https://github.com/tileintel) and [Shaoyen Tseng](https://github.com/shaoyent-IL) from [Intel Labs](https://github.com/IntelLabs) for their great work!
- Nov. 2022: BridgeTower was accepted at [AAAI'23](https://aaai.org/Conferences/AAAI-23/). Code and checkpoints are released.
- Jun. 2022: We released the preprint on [arXiv](https://arxiv.org/abs/2206.08657).
- May 2022: BridgeTower (single model, 4M data) achieved 78.73% and 81.15% (base and large, respectively) on the [VQAv2 Challenge](https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278) test-std set.
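With the Hugging Face integration, BridgeTower can be used directly from `transformers`. Below is a minimal image-text matching sketch adapted from the Hugging Face documentation; it assumes a `transformers` release that includes BridgeTower and the `BridgeTower/bridgetower-base-itm-mlm` checkpoint from the Model Hub.

```python
# Minimal image-text matching sketch via the Hugging Face integration.
# Assumes a transformers release with BridgeTower support and internet access
# to fetch the checkpoint and the example image.
import requests
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["two cats sleeping on a couch", "a football match"]

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")

# Score each caption against the image with the image-text matching (ITM) head.
for text in texts:
    encoding = processor(image, text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoding)
    itm_score = outputs.logits[0, 1].item()  # logit of the "match" class
    print(f"{text!r}: {itm_score:.3f}")
```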

## Abstract

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.

## Architecture

![Architecture](images/framework.jpg)
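For intuition only, here is a toy sketch of the bridging idea shown in the figure: each bridge layer combines the output of a top uni-modal encoder layer with the corresponding cross-modal layer's input stream. This is a simplified add-and-LayerNorm illustration, not the repo's actual module; see the paper and code for the exact bridge designs.

```python
# Toy illustration of a bridge layer: fuse a uni-modal representation with the
# cross-modal stream before a cross-modal encoder layer. Simplified sketch
# (add & LayerNorm), not the actual implementation in this repo.
import torch
import torch.nn as nn

class ToyBridgeLayer(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, cross_modal_states: torch.Tensor, uni_modal_states: torch.Tensor) -> torch.Tensor:
        # Element-wise fusion of the two streams, then normalization.
        return self.norm(cross_modal_states + uni_modal_states)

# Example: fuse features from a top text-encoder layer into the cross-modal
# stream (batch of 2, sequence length 16, hidden size 768).
bridge = ToyBridgeLayer(hidden_size=768)
fused = bridge(torch.randn(2, 16, 768), torch.randn(2, 16, 768))
print(fused.shape)  # torch.Size([2, 16, 768])
```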

## Main Results

![Result1](images/result1.jpg)

![Result2](images/result2.jpg)

## Deployment

- Run `setup.sh` to set up the environment.
- [Optional] We use [wandb](https://wandb.ai/) to track experiments. Remember to run `wandb login` and paste your API token before running the scripts.

## Dataset Preparation

- We follow [ViLT](https://github.com/dandelin/ViLT) and use pyarrow to serialize the datasets. See [here](https://github.com/dandelin/ViLT/blob/master/DATA.md) for details; a quick way to inspect the resulting arrow files is sketched after this list.
- For the SNLI-VE dataset, follow the instructions [here](https://github.com/necla-ml/SNLI-VE).
- For the VG-QA dataset, in addition to the [VG](https://visualgenome.org/api/v0/api_home.html) image-text pairs obtained as described [here](https://github.com/dandelin/ViLT/blob/master/DATA.md), the [image metadata](https://visualgenome.org/static/data/dataset/image_data_v1.json.zip), [question-answer data](https://visualgenome.org/static/data/dataset/question_answers.json.zip) and [COCO split information](https://github.com/peteanderson80/bottom-up-attention/tree/master/data/genome/coco_splits) also need to be downloaded.
- The final file structure of the datasets is shown in `setup.sh`.
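Once the arrow files are generated, you can sanity-check them with `pyarrow`. The snippet below is an inspection sketch; the path and file name are hypothetical examples of the ViLT-style naming and may differ in your setup.

```python
# Quick sanity check of a ViLT-style pyarrow dataset file.
# The path below is a hypothetical example; point it at your dataset root.
import pyarrow as pa

path = "datasets/coco_caption_karpathy_train.arrow"  # hypothetical file name

with pa.memory_map(path, "r") as source:
    table = pa.ipc.open_file(source).read_all()

print(table.num_rows)      # number of serialized examples
print(table.schema.names)  # column names, e.g. image bytes and captions
```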

## Checkpoints

- Pre-trained checkpoints on 4M data: [BASE](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_pt_base.ckpt?sv=2020-10-02&st=2022-11-24T12%3A18%3A49Z&se=2027-11-25T12%3A18%3A00Z&sr=b&sp=r&sig=BJigddAMHfNUtQuTGH8bJUrzAO3LfaeSm48AXUqZngY%3D) and [LARGE](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_pt_large.ckpt?sv=2020-10-02&st=2022-11-24T12%3A19%3A19Z&se=2027-11-25T12%3A19%3A00Z&sr=b&sp=r&sig=8yWqesQACrJSi0JMLIA0uAbNlMQKb653gOXjXjQuIW4%3D)
- Fine-tuned checkpoints for downstream tasks:
- Visual Question Answering on VQAv2: [BASE](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_ftfpt_base_vqav2.ckpt?sv=2020-10-02&st=2022-11-24T12%3A16%3A38Z&se=2027-11-25T12%3A16%3A00Z&sr=b&sp=r&sig=t35v4kezDcSOm9Q9E767PhNGAQRsiYm%2FMSDgHIz%2Fvto%3D), [BASE(w/ VGQA)](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_ftfpt_base_vqav2_vgqa.ckpt?sv=2020-10-02&st=2022-11-24T12%3A17%3A18Z&se=2027-11-25T12%3A17%3A00Z&sr=b&sp=r&sig=BD%2BOsI%2F6R905vBJUlrWlgx3%2BmaBRsa2rQcHBChhW0eE%3D), [LARGE](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_ftfpt_large_vqav2.ckpt?sv=2020-10-02&st=2022-11-24T12%3A17%3A47Z&se=2027-11-25T12%3A17%3A00Z&sr=b&sp=r&sig=RqL7Eeye4385oaO1nvVvRwC4d%2ByhpEVGM3xmS4GcKkQ%3D), [LARGE(w/ VGQA)](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_ftfpt_large_vqav2_vgqa.ckpt?sv=2020-10-02&st=2022-11-24T12%3A18%3A29Z&se=2027-11-25T12%3A18%3A00Z&sr=b&sp=r&sig=xtI8rmEqjMmN1b1bcE0KB9ePUax3SuRfOt%2Bp2ATH9ng%3D)
- Image-Text Retrieval on Flickr30k: [BASE](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_ftfpt_base_irtr_itm_itc_f30k.ckpt?sv=2020-10-02&st=2022-11-24T12%3A13%3A42Z&se=2027-11-25T12%3A13%3A00Z&sr=b&sp=r&sig=0BP3pOiE4AFkK4BTgQl5Dy6iJWxHuJffpjU4LFMTfWY%3D)
- Visual Entailment on SNLI-VE: [BASE](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_ftfpt_base_snlive.ckpt?sv=2020-10-02&st=2022-11-24T12%3A15%3A27Z&se=2027-11-25T12%3A15%3A00Z&sr=b&sp=r&sig=IccPmnxQYIpWO8m6kwtEFir9wmVq1SsLOqmw0FRc9hY%3D)
- Visual Reasoning on NLVR$^2$: [BASE](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_ftfpt_base_nlvr2.ckpt?sv=2020-10-02&st=2022-11-24T12%3A15%3A09Z&se=2027-11-25T12%3A15%3A00Z&sr=b&sp=r&sig=AL3q15eyhPBHaWY0FOop9goHVq8CbNluABDk%2FS94rkI%3D)
- Image-Text Retrieval on MSCOCO: [BASE](https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_ftfpt_base_irtr_itm_itc_coco.ckpt?sv=2020-10-02&st=2022-11-24T12%3A13%3A18Z&se=2027-11-25T12%3A13%3A00Z&sr=b&sp=r&sig=ahM%2FyI8fg9D4obCZsNKaxLzPVz2y8RX8ydZNToGavC4%3D)
- Here is an example of downloading a checkpoint with `azcopy`:

```bash
# download azcopy
wget https://aka.ms/downloadazcopy-v10-linux
tar -xvf downloadazcopy-v10-linux
sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
sudo chmod -R 777 /usr/bin/azcopy
# azcopy copy [remote path] [local path]
azcopy copy "https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_pt_base.ckpt?sv=2020-10-02&st=2022-11-24T12%3A18%3A49Z&se=2027-11-25T12%3A18%3A00Z&sr=b&sp=r&sig=BJigddAMHfNUtQuTGH8bJUrzAO3LfaeSm48AXUqZngY%3D" "./BridgeTower_pt_base.ckpt"
```
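After downloading, a quick way to verify a checkpoint is to load it with PyTorch and inspect its contents. This is an illustrative sketch assuming a PyTorch Lightning-style `.ckpt` whose weights live under a `state_dict` key; adjust the key access if the layout differs.

```python
# Inspect a downloaded checkpoint. Assumes a Lightning-style .ckpt whose
# weights live under "state_dict"; falls back to treating it as a raw state dict.
import torch

# On recent PyTorch you may need torch.load(..., weights_only=False).
ckpt = torch.load("BridgeTower_pt_base.ckpt", map_location="cpu")
print(type(ckpt), list(ckpt.keys())[:10])

state_dict = ckpt.get("state_dict", ckpt)
num_params = sum(p.numel() for p in state_dict.values() if isinstance(p, torch.Tensor))
print(f"{len(state_dict)} tensors, ~{num_params / 1e6:.1f}M parameters")
```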

## Pre-training on Image-Text Datasets

```bash
# Pre-train BridgeTower Base Model
bash scripts/pre_train.sh
# Pre-train BridgeTower Large Model
bash scripts/pre_train_large.sh
```

## Fine-tuning on Downstream VL Tasks

- For VQAv2 evaluation, submit the generated `json` file in the `logs/` directory to the [eval.ai](https://eval.ai/web/challenges/challenge-page/830/overview) evaluation server to get the test-dev and/or test-std scores; the expected submission format is sketched below.
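For reference, the eval.ai VQA server expects a JSON list of records with `question_id` and `answer` fields. The snippet below is a small sanity check before uploading; the file name is hypothetical, so point it at the result `json` produced under `logs/`.

```python
# Sanity-check a VQAv2 submission file before uploading to eval.ai.
# The path is hypothetical; use the json produced under logs/ by the script.
import json

path = "logs/vqa_submit_example.json"  # hypothetical file name

with open(path) as f:
    results = json.load(f)

assert isinstance(results, list), "submission must be a JSON list"
for entry in results[:5]:
    assert set(entry) >= {"question_id", "answer"}, entry
print(f"{len(results)} answers, e.g. {results[0]}")
```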

```bash
# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh

# Large Model on VQAv2 without VLP
bash scripts/ftfs_large_vqa.sh

# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh

# Large Model on VQAv2 with VLP
bash scripts/ftfpt_large_vqa.sh

# Base Model on IRTR-Flickr30K with VLP (directly use ITM with multiple false texts)
bash scripts/ftfpt_base_irtr_f30k.sh

# Base Model on IRTR-Flickr30K with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_f30k.sh

# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh

# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh

# Base Model on IRTR-MSCOCO with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_coco.sh
```

## Fine-tuning on Uni-Modal Tasks

```bash
# Base Model on CIFAR with VLP
bash scripts/ftfpt_base_cifar.sh

# Base Model on GLUE with VLP
bash scripts/ftfpt_base_glue.sh
```

## Citation

```bibtex
@article{xu2022bridge,
  title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
  author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
  journal={arXiv preprint arXiv:2206.08657},
  year={2022}
}
```

## Acknowledgement

We are highly grateful for the public code of the following papers, on which our code is partly based:

- Main Code: [METER](https://github.com/zdou0830/METER), [ViLT](https://github.com/dandelin/ViLT)
- Others: [CLIP](https://github.com/openai/CLIP), [ALBEF](https://github.com/salesforce/ALBEF), [BLIP](https://github.com/salesforce/BLIP)