{"id":13958439,"url":"https://github.com/j-min/VL-T5","last_synced_at":"2025-07-21T00:30:46.571Z","repository":{"id":39578437,"uuid":"336132980","full_name":"j-min/VL-T5","owner":"j-min","description":"PyTorch code for \"Unifying Vision-and-Language Tasks via Text Generation\" (ICML 2021)","archived":false,"fork":false,"pushed_at":"2023-07-29T20:37:22.000Z","size":867,"stargazers_count":372,"open_issues_count":17,"forks_count":58,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-05-25T19:09:11.829Z","etag":null,"topics":["pretraining","transformers","vision-and-language","vl-bart","vl-t5"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2102.02779","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/j-min.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-05T01:44:13.000Z","updated_at":"2025-04-28T21:48:36.000Z","dependencies_parsed_at":"2024-11-19T13:43:43.104Z","dependency_job_id":"49c8d4a1-a1dc-41dc-9a19-93eb97011333","html_url":"https://github.com/j-min/VL-T5","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/j-min/VL-T5","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j-min%2FVL-T5","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j-min%2FVL-T5/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j-min%2FVL-T5/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j-min%2FVL-T5/manifests","owner_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/j-min","download_url":"https://codeload.github.com/j-min/VL-T5/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j-min%2FVL-T5/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221246,"owners_count":23894964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pretraining","transformers","vision-and-language","vl-bart","vl-t5"],"created_at":"2024-08-08T13:01:35.443Z","updated_at":"2025-07-21T00:30:46.299Z","avatar_url":"https://github.com/j-min.png","language":"Python","funding_links":[],"categories":["其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# Unifying Vision-and-Language Tasks via Text Generation\n\n* Authors: [Jaemin Cho](https://j-min.io), [Jie Lei](https://www.cs.unc.edu/~jielei/), [Hao Tan](https://www.cs.unc.edu/~airsplay/), and [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)\n* [Paper](https://arxiv.org/abs/2102.02779) (To appear in [ICML 2021](https://icml.cc/Conferences/2021))\n* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/j-min/VL-T5/blob/main/inference_example.ipynb) (VQA inference using pretrained model on custom image/question)\n* Try web demo and docker image on VQA here [![Replicate](https://replicate.com/j-min/vl-t5/badge)](https://replicate.com/j-min/vl-t5)\n\n![teaser image](./assets/teaser_square.png)\n\n## Setup\n```bash\n# Create python environment (optional)\nconda create -n vlt5 python=3.7\nsource activate vlt5\n\n# Install python 
dependencies\npip install -r requirements.txt\n\n# Download T5/BART backbone checkpoint\npython download_backbones.py\n\n# For MSCOCO captioning evaluation (optional; for captioning only)\npython -c \"import language_evaluation; language_evaluation.download('coco')\"\n```\n\n## Code structure\n```bash\n# Store images, features, and annotations\n./datasets\n    COCO/\n        images/\n        features/\n    VG/\n        images/\n        features/\n    GQA/\n        images/\n        features/\n    nlvr/\n        images/\n        features/\n    RefCOCO/\n\n    ...\n\n# Run feature extraction\n./feature_extraction\n\n# Train VL-T5\n./VL-T5/\n    src/\n        modeling_t5.py modeling_bart.py                       \u003c= VL-T5/VL-BART model classes\n        pretrain.py, pretrain_data.py, pretrain_model.py      \u003c= pretraining\n        vqa.py, vqa_data.py vqa_model.py ...                  \u003c= fine-tuning on downstream tasks (ex. VQA, GQA, NLVR2)\n        multitask.py, multitask_data.py multitask_model.py    \u003c= multitask learning on 7 downstream tasks\n        param.py                                              \u003c= (argparse) configuration\n        tokenization.py                                       \u003c= custom tokenizer\n        utils.py, dist_utils.py                               \u003c= utility functions\n    snap/                                                     \u003c= store weight checkpoints\n    scripts/                                                  \u003c= bash scripts for pretraining and finetuning\n```\n\n## API\n```python\nimport sys\nsys.path.append('./VL-T5/src')\n\n# Parse configuration\nfrom param import parse_args\nargs = parse_args(\n    backbone='t5-base', # Backbone architecture\n    load='./snap/pretrain/VLT5/Epoch30', # Pretrained checkpoint\n    parse=False, # False for interactive env (ex. 
jupyter)\n)\n# Assign GPU\nargs.gpu = 0\n\n# Load data loaders\nfrom vqa_data import get_loader\ntrain_loader = get_loader(\n    args,\n    split=args.train,\n    ...\n)\nval_loader = get_loader(\n    args,\n    split=args.valid,\n    ...\n)\ntest_loader = get_loader(\n    args,\n    split=args.test,\n    ...\n)\n\n# Import trainer\nfrom vqa import Trainer\ntrainer = Trainer(\n    args,\n    train_loader=train_loader,\n    val_loader=val_loader,\n    test_loader=test_loader,\n)\n\n# The model is attached to the trainer\nmodel = trainer.model\n\n# Each task-specific model class inherits from the VLT5/VLBart classes, which in turn inherit from the Hugging Face transformers T5/BART classes\nprint(model)\n\u003e\u003e\u003e VLT5VQA(\n    (shared): Embedding(...)\n    (encoder): JointEncoder(...)\n    ...\n)\n\n# Training\ntrain_batch = next(iter(train_loader))\nmodel.train_step(train_batch)\n\u003e\u003e\u003e {'loss': ... }\n\n# Inference\ntest_batch = next(iter(test_loader))\nmodel.test_step(test_batch)\n\u003e\u003e\u003e {'pred_ans': ... 
}\n```\n\nTo add a new task, you can start by writing three files, adapting them from the existing ones.\n``` bash\nNEW_TASK_model.py # Define a VLT5NewTask/VLBartNewTask model which inherits VLT5/VLBart class\nNEW_TASK_data.py # Define Dataset/DataLoader/Evaluator\nNEW_TASK.py # Define a trainer which inherits TrainerBase (trainer_base.py)\n```\n\n## Download Pre-trained models / Pre-extracted features\nWe host model checkpoints and features on Google Drive.\nWe recommend using [gdrive](https://github.com/prasmussen/gdrive) to download them.\n\n## Pretrained Models\n- Download `snap/` from [Google Drive](https://drive.google.com/drive/folders/1_SBj4sZ0gUqfBon1gFBiNRAmfHv5w_ph?usp=sharing)\n```bash\ngdrive download 1_SBj4sZ0gUqfBon1gFBiNRAmfHv5w_ph --recursive\n```\n\n### COCO+VG pretraining (default)\n* `VL-T5/snap/pretrain/VLT5/Epoch30.pth`: VL-T5 pretrained for 30 epochs on COCO+VG\n* `VL-T5/snap/pretrain/VLBart/Epoch30.pth`: VL-BART pretrained for 30 epochs on COCO+VG\n\n### VCR pretraining (2nd stage)\n* `VL-T5/snap/vcr_pretrain/VLT5/Epoch20.pth`: VL-T5 further pretrained for 20 epochs on VCR\n* `VL-T5/snap/vcr_pretrain/VLBart/Epoch20.pth`: VL-BART further pretrained for 20 epochs on VCR\n\n\n## Dataset Preparation / Feature extraction\n- Download `datasets/` from [Google Drive](https://drive.google.com/drive/folders/1MBBhlkP83VMKS2Qe0SmFfzkHhMpIG5wf?usp=sharing)\n```bash\ngdrive download 1MBBhlkP83VMKS2Qe0SmFfzkHhMpIG5wf --recursive\n```\n\n  - Multi30K only\n    - `git clone --recursive https://github.com/multi30k/dataset ./datasets/multi30k-dataset`\n    - unzip `train.en.gz`, `val.en.gz`, `test_2017_flickr.en.gz`, `test_2018_flickr.en.gz` in `./datasets/multi30k-dataset/data/task1/raw/`\n    - unzip `train.de.gz`, `val.de.gz`, `test_2017_flickr.de.gz`, `test_2018_flickr.de.gz` in `./datasets/multi30k-dataset/data/task1/raw/`\n- For manual feature extraction, please check out [./feature_extraction](./feature_extraction)\n\n## Pretraining on COCO+VG\n```bash\n# 
Pretraining with 4 gpus\ncd VL-T5/\nbash scripts/COCOVG_pretrain_VLT5.sh 4\nbash scripts/COCOVG_pretrain_VLBart.sh 4\n```\n\n## Downstream tasks\n\n### [VQA](https://visualqa.org/)\n```bash\n# Finetuning with 4 gpus\ncd VL-T5/\nbash scripts/VQA_VLT5.sh 4\nbash scripts/VQA_VLBart.sh 4\n```\n\n### [GQA](https://cs.stanford.edu/people/dorarad/gqa/)\n```bash\n# Finetuning with 4 gpus\ncd VL-T5/\nbash scripts/GQA_VLT5.sh 4\nbash scripts/GQA_VLBart.sh 4\n```\n\n### [NLVR2](http://lil.nlp.cornell.edu/nlvr/)\n```bash\n# Finetuning with 4 gpus\ncd VL-T5/\nbash scripts/NLVR_VLT5.sh 4\nbash scripts/NLVR_VLBart.sh 4\n```\n\n### [RefCOCOg](https://github.com/mjhucla/Google_Refexp_toolbox)\n```bash\n# Finetuning with 4 gpus\ncd VL-T5/\nbash scripts/RefCOCOg_VLT5.sh 4\nbash scripts/RefCOCOG_VLBart.sh 4\n```\n\n### [VCR](https://visualcommonsense.com/)\n```bash\n# Pretraining on VCR with 4 gpus (optional)\ncd VL-T5/\nbash scripts/VCR_pretrain_VLT5.sh 4\nbash scripts/VCR_pretrain_VLBart.sh 4\n\n# Finetuning with 4 gpus\ncd VL-T5/\nbash scripts/VCR_VLT5.sh 4\nbash scripts/VCR_VLBart.sh 4\n```\n\n### [COCO Caption](https://cocodataset.org/)\n```bash\n# Finetuning with 4 gpus\ncd VL-T5/\nbash scripts/COCOCaption_VLT5.sh 4\nbash scripts/COCOCaption_VLBart.sh 4\n```\n\n### [Multi30K](https://github.com/multi30k/dataset)\n```bash\n# Finetuning with 4 gpus\ncd VL-T5/\nbash scripts/Multi30K_VLT5.sh 4\nbash scripts/Multi30K_VLBart.sh 4\n```\n\n\n# Reference\nPlease cite our paper if you use our models in your works:\n```bibtex\n@inproceedings{cho2021vlt5,\n  title     = {Unifying Vision-and-Language Tasks via Text Generation},\n  author    = {Jaemin Cho and Jie Lei and Hao Tan and Mohit Bansal},\n  booktitle = {ICML},\n  year      = 
{2021}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj-min%2FVL-T5","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fj-min%2FVL-T5","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj-min%2FVL-T5/lists"}