{"id":13540751,"url":"https://github.com/jackroos/VL-BERT","last_synced_at":"2025-04-02T08:30:45.337Z","repository":{"id":36316321,"uuid":"223335609","full_name":"jackroos/VL-BERT","owner":"jackroos","description":"Code for ICLR 2020 paper \"VL-BERT: Pre-training of Generic Visual-Linguistic Representations\".","archived":false,"fork":false,"pushed_at":"2023-05-22T22:33:35.000Z","size":5672,"stargazers_count":738,"open_issues_count":20,"forks_count":110,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-11-03T06:32:49.700Z","etag":null,"topics":["bert","iclr2020","pre-training","pytorch","representation-learning","self-supervised-learning","vision-and-language","vl-bert"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jackroos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-11-22T06:08:58.000Z","updated_at":"2024-10-30T10:17:23.000Z","dependencies_parsed_at":"2023-01-17T01:17:10.121Z","dependency_job_id":"dd5f39cc-6803-4ee8-a080-96c6d4390691","html_url":"https://github.com/jackroos/VL-BERT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackroos%2FVL-BERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackroos%2FVL-BERT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackroos%2FVL-BERT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackroos%2FVL-BERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jackroos","download_url":"https://codeload.github.com/jackroos/VL-BERT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246781818,"owners_count":20832910,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","iclr2020","pre-training","pytorch","representation-learning","self-supervised-learning","vision-and-language","vl-bert"],"created_at":"2024-08-01T10:00:32.202Z","updated_at":"2025-04-02T08:30:44.646Z","avatar_url":"https://github.com/jackroos.png","language":"Jupyter Notebook","readme":"# VL-BERT\n\nBy \n[Weijie Su](https://www.weijiesu.com/), \n[Xizhou Zhu](https://scholar.google.com/citations?user=02RXI00AAAAJ\u0026hl=en), \n[Yue Cao](http://yue-cao.me/), \n[Bin Li](http://staff.ustc.edu.cn/~binli/), \n[Lewei Lu](https://www.linkedin.com/in/lewei-lu-94015977/), \n[Furu Wei](http://mindio.org/), \n[Jifeng Dai](https://jifengdai.org/).\n\nThis repository is an official implementation of the paper \n[VL-BERT: Pre-training of Generic Visual-Linguistic Representations](https://arxiv.org/abs/1908.08530).\n\n\n\n*Update on 2020/01/16* Add code of visualization.\n\n\n\n*Update on 2019/12/20* Our VL-BERT got accepted by ICLR 2020.\n\n## Introduction\n\nVL-BERT is a simple yet powerful pre-trainable generic representation for visual-linguistic tasks. \nIt is pre-trained on the massive-scale caption dataset and text-only corpus, \nand can be fine-tuned for various down-stream visual-linguistic tasks, \nsuch as Visual Commonsense Reasoning, Visual Question Answering and Referring Expression Comprehension.\n\n![](./figs/pretrain.png)\n\n![](./figs/attention_viz.png)\n\nThanks to PyTorch and its 3rd-party libraries, this codebase also contains following features:\n* Distributed Training\n* FP16 Mixed-Precision Training\n* Various Optimizers and Learning Rate Schedulers\n* Gradient Accumulation\n* Monitoring the Training Using TensorboardX\n\n## Citing VL-BERT\n```bibtex\n@inproceedings{\n  Su2020VL-BERT:,\n  title={VL-BERT: Pre-training of Generic Visual-Linguistic Representations},\n  author={Weijie Su and Xizhou Zhu and Yue Cao and Bin Li and Lewei Lu and Furu Wei and Jifeng Dai},\n  booktitle={International Conference on Learning Representations},\n  year={2020},\n  url={https://openreview.net/forum?id=SygXPaEYvH}\n}\n```\n\n## Prepare\n\n### Environment\n* Ubuntu 16.04, CUDA 9.0, GCC 4.9.4\n* Python 3.6.x\n    ```bash\n    # We recommend you to use Anaconda/Miniconda to create a conda environment\n    conda create -n vl-bert python=3.6 pip\n    conda activate vl-bert\n    ```\n* PyTorch 1.0.0 or 1.1.0\n    ```bash\n    conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch\n    ```\n* Apex (optional, for speed-up and fp16 training)\n    ```bash\n    git clone https://github.com/jackroos/apex\n    cd ./apex\n    pip install -v --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./  \n    ```\n* Other requirements:\n    ```bash\n    pip install Cython\n    pip install -r requirements.txt\n    ```\n* Compile\n    ```bash\n    ./scripts/init.sh\n    ```\n\n### Data\n\nSee [PREPARE_DATA.md](data/PREPARE_DATA.md).\n\n### Pre-trained Models\n\nSee [PREPARE_PRETRAINED_MODELS.md](model/pretrained_model/PREPARE_PRETRAINED_MODELS.md).\n\n\n\n## Training\n\n### Distributed Training on Single-Machine\n\n```\n./scripts/dist_run_single.sh \u003cnum_gpus\u003e \u003ctask\u003e/train_end2end.py \u003cpath_to_cfg\u003e \u003cdir_to_store_checkpoint\u003e\n```\n* ```\u003cnum_gpus\u003e```: number of gpus to use.\n* ```\u003ctask\u003e```: pretrain/vcr/vqa/refcoco.\n* ```\u003cpath_to_cfg\u003e```: config yaml file under ```./cfgs/\u003ctask\u003e```.\n* ```\u003cdir_to_store_checkpoint\u003e```: root directory to store checkpoints.\n\n\nFollowing is a more concrete example:\n```\n./scripts/dist_run_single.sh 4 vcr/train_end2end.py ./cfgs/vcr/base_q2a_4x16G_fp32.yaml ./\n```\n\n### Distributed Training on Multi-Machine\n\nFor example, on 2 machines (A and B), each with 4 GPUs, \n\nrun following command on machine A:\n```\n./scripts/dist_run_multi.sh 2 0 \u003cip_addr_of_A\u003e 4 \u003ctask\u003e/train_end2end.py \u003cpath_to_cfg\u003e \u003cdir_to_store_checkpoint\u003e\n```\n\nrun following command on machine B:\n```\n./scripts/dist_run_multi.sh 2 1 \u003cip_addr_of_A\u003e 4 \u003ctask\u003e/train_end2end.py \u003cpath_to_cfg\u003e \u003cdir_to_store_checkpoint\u003e\n```\n\n\n### Non-Distributed Training\n```\n./scripts/nondist_run.sh \u003ctask\u003e/train_end2end.py \u003cpath_to_cfg\u003e \u003cdir_to_store_checkpoint\u003e\n```\n\n***Note***:\n\n1. In yaml files under ```./cfgs```, we set batch size for GPUs with at least 16G memory, you may need to adapt the batch size and \ngradient accumulation steps according to your actual case, e.g., if you decrease the batch size, you should also \nincrease the gradient accumulation steps accordingly to keep 'actual' batch size for SGD unchanged.\n\n2. For efficiency, we recommend you to use distributed training even on single-machine. But for RefCOCO+, you may meet deadlock\nusing distributed training due to unknown reason (it may be related to [PyTorch dataloader deadloack](https://github.com/pytorch/pytorch/issues/1355)), you can simply use\nnon-distributed training to solve this problem.\n\n## Evaluation\n\n### VCR\n* Local evaluation on val set:\n  ```\n  python vcr/val.py \\\n    --a-cfg \u003ccfg_of_q2a\u003e --r-cfg \u003ccfg_of_qa2r\u003e \\\n    --a-ckpt \u003ccheckpoint_of_q2a\u003e --r-ckpt \u003ccheckpoint_of_qa2r\u003e \\\n    --gpus \u003cindexes_of_gpus_to_use\u003e \\\n    --result-path \u003cdir_to_save_result\u003e --result-name \u003cresult_file_name\u003e\n  ```\n  ***Note***: ```\u003cindexes_of_gpus_to_use\u003e``` is gpu indexes, e.g., ```0 1 2 3```.\n\n* Generate prediction results on test set for [leaderboard submission](https://visualcommonsense.com/leaderboard/):\n  ```\n  python vcr/test.py \\\n    --a-cfg \u003ccfg_of_q2a\u003e --r-cfg \u003ccfg_of_qa2r\u003e \\\n    --a-ckpt \u003ccheckpoint_of_q2a\u003e --r-ckpt \u003ccheckpoint_of_qa2r\u003e \\\n    --gpus \u003cindexes_of_gpus_to_use\u003e \\\n    --result-path \u003cdir_to_save_result\u003e --result-name \u003cresult_file_name\u003e\n  ```\n\n### VQA\n* Generate prediction results on test set for [EvalAI submission](https://evalai.cloudcv.org/web/challenges/challenge-page/163/overview):\n  ```\n  python vqa/test.py \\\n    --cfg \u003ccfg_file\u003e \\\n    --ckpt \u003ccheckpoint\u003e \\\n    --gpus \u003cindexes_of_gpus_to_use\u003e \\\n    --result-path \u003cdir_to_save_result\u003e --result-name \u003cresult_file_name\u003e\n  ```\n\n### RefCOCO+\n\n* Local evaluation on val/testA/testB set:\n  ```\n  python refcoco/test.py \\\n    --split \u003cval|testA|testB\u003e \\\n    --cfg \u003ccfg_file\u003e \\\n    --ckpt \u003ccheckpoint\u003e \\\n    --gpus \u003cindexes_of_gpus_to_use\u003e \\\n    --result-path \u003cdir_to_save_result\u003e --result-name \u003cresult_file_name\u003e\n  ```\n\n## Visualization\nSee [VISUALIZATION.md](./viz/VISUALIZATION.md).\n\n## Acknowledgements\n\nMany thanks to following codes that help us a lot in building this codebase:\n* [transformers (pytorch-pretrained-bert)](https://github.com/huggingface/transformers) \n* [Deformable-ConvNets](https://github.com/msracver/Deformable-ConvNets/)\n* [maskrcnn-benchmark](https://github.com/facebookresearch/maskrcnn-benchmark)\n* [mmdetection](https://github.com/open-mmlab/mmdetection)\n* [r2c](https://github.com/rowanz/r2c)\n* [allennlp](https://github.com/allenai/allennlp)\n* [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention)\n* [pythia](https://github.com/facebookresearch/pythia)\n* [MAttNet](https://github.com/lichengunc/MAttNet)\n* [bertviz](https://github.com/jessevig/bertviz)\n","funding_links":[],"categories":["Representation Learning","Jupyter Notebook","其他_机器视觉","Fundamental MIM Methods","NLP"],"sub_categories":["网络服务_其他","MIM for Multi-Modality","2024"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjackroos%2FVL-BERT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjackroos%2FVL-BERT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjackroos%2FVL-BERT/lists"}