{"id":11563613,"url":"https://github.com/airsplay/lxmert","last_synced_at":"2025-10-03T14:30:57.797Z","repository":{"id":40505214,"uuid":"203417155","full_name":"airsplay/lxmert","owner":"airsplay","description":"PyTorch code for EMNLP 2019 paper \"LXMERT: Learning Cross-Modality Encoder Representations from Transformers\".","archived":false,"fork":false,"pushed_at":"2022-10-22T00:05:56.000Z","size":248,"stargazers_count":926,"open_issues_count":54,"forks_count":158,"subscribers_count":18,"default_branch":"master","last_synced_at":"2024-09-29T14:31:38.769Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/airsplay.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-08-20T16:50:29.000Z","updated_at":"2024-09-22T08:29:24.000Z","dependencies_parsed_at":"2022-07-13T13:30:28.671Z","dependency_job_id":null,"html_url":"https://github.com/airsplay/lxmert","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airsplay%2Flxmert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airsplay%2Flxmert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airsplay%2Flxmert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airsplay%2Flxmert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/airsplay","download_url":"https://codeload.github.com/airsplay/lxmert/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.co
m","kind":"github","repositories_count":235139069,"owners_count":18942104,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-06-23T05:57:09.885Z","updated_at":"2025-10-03T14:30:52.406Z","avatar_url":"https://github.com/airsplay.png","language":"Python","funding_links":[],"categories":["其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# LXMERT: Learning Cross-Modality Encoder Representations from Transformers\n\n**Our servers broke again :(. I have updated the links so that they should work fine now. Sorry for the inconvenience. Please let me know of any further issues. Thanks! --Hao, Dec 03**\n\n## Introduction\nPyTorch code for the EMNLP 2019 paper [\"LXMERT: Learning Cross-Modality Encoder Representations from Transformers\"](https://arxiv.org/abs/1908.07490). Slides of our EMNLP 2019 talk are available [here](http://www.cs.unc.edu/~airsplay/EMNLP_2019_LXMERT_slides.pdf). \n\n- To analyze the output of the pre-trained model (instead of fine-tuning on downstream tasks), please load the weight `https://nlp.cs.unc.edu/data/github_pretrain/lxmert20/Epoch20_LXRT.pth`, which is trained as in section [pre-training](#pre-training). 
The default weight [here](#pre-trained-models) is trained with a slightly different protocol than this code.\n\n\n## Results (with this Github version)\n\n| Split            | [VQA](https://visualqa.org/)     | [GQA](https://cs.stanford.edu/people/dorarad/gqa/)     | [NLVR2](http://lil.nlp.cornell.edu/nlvr/)  |\n|-----------       |:----:   |:---:    |:------:|\n| Local Validation | 69.90%  | 59.80%  | 74.95% |\n| Test-Dev         | 72.42%  | 60.00%  | 74.45% (Test-P) |\n| Test-Standard    | 72.54%  | 60.33%  | 76.18% (Test-U) |\n\nAll the results in the table are produced exactly with this code base.\nSince [VQA](https://evalai.cloudcv.org/web/challenges/challenge-page/163/overview) and [GQA](https://evalai.cloudcv.org/web/challenges/challenge-page/225/overview) test servers only allow a limited number of 'Test-Standard' submissions,\nwe use our remaining submission entry from the [VQA](https://visualqa.org/challenge.html)/[GQA](https://cs.stanford.edu/people/dorarad/gqa/challenge.html) challenges 2019 to get these results.\nFor [NLVR2](http://lil.nlp.cornell.edu/nlvr/), we only test once on the unpublished test set (test-U).\n\nWe used this code (with model ensemble) to participate in the [VQA 2019](https://visualqa.org/roe.html) and [GQA 2019](https://drive.google.com/open?id=1CtFk0ldbN5w2qhwvfKrNzAFEj-I9Tjgy) challenges in May 2019.\nWe were the **only** team ranking **top-3** in both challenges.\n\n\n## Pre-trained models\nThe pre-trained model (870 MB) is available at http://nlp.cs.unc.edu/data/model_LXRT.pth, and can be downloaded with:\n```bash\nmkdir -p snap/pretrained \nwget https://nlp.cs.unc.edu/data/model_LXRT.pth -P snap/pretrained\n```\n\n\nIf the download speed is slower than expected, the pre-trained model could also be downloaded from [other sources](#alternative-dataset-and-features-download-links).\nPlease put the downloaded file at `snap/pretrained/model_LXRT.pth`.\n\nWe also provide data and commands to pre-train the model in 
[pre-training](#pre-training). The default setup needs 4 GPUs and takes around a week to finish. The pre-trained weights with this code base could be downloaded from `https://nlp.cs.unc.edu/data/github_pretrain/lxmert/EpochXX_LXRT.pth`, with `XX` from 01 to 12. It is pre-trained for 12 epochs (instead of 20 in the EMNLP paper), so the fine-tuned results are about 0.3% lower on each dataset. \n\n\n\n## Fine-tune on Vision-and-Language Tasks\nWe fine-tune our LXMERT pre-trained model on each task with the following hyper-parameters:\n\n|Dataset      | Batch Size   | Learning Rate   | Epochs  | Load Answers  |\n|---   |:---:|:---:   |:---:|:---:|\n|VQA   | 32  | 5e-5   | 4   | Yes |\n|GQA   | 32  | 1e-5   | 4   | Yes |\n|NLVR2 | 32  | 5e-5   | 4   | No  |\n\nAlthough the fine-tuning processes are almost the same except for different hyper-parameters,\nwe provide descriptions for each dataset to take care of all details.\n\n### General \nThe code requires **Python 3**. Please install the Python dependencies with the command:\n```bash\npip install -r requirements.txt\n```\n\nBy the way, a Python 3 virtual environment could be set up and run with:\n```bash\nvirtualenv name_of_environment -p python3\nsource name_of_environment/bin/activate\n```\n### VQA\n#### Fine-tuning\n1. Please make sure the LXMERT pre-trained model is either [downloaded](#pre-trained-models) or [pre-trained](#pre-training).\n\n2. Download the re-distributed json files for the VQA 2.0 dataset. The raw VQA 2.0 dataset could be downloaded from the [official website](https://visualqa.org/download.html).\n    ```bash\n    mkdir -p data/vqa\n    wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/train.json -P data/vqa/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/nominival.json -P  data/vqa/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/minival.json -P data/vqa/\n    ```\n3. 
Download faster-rcnn features for MS COCO train2014 (17 GB) and val2014 (8 GB) images (VQA 2.0 is collected on MS COCO dataset).\nThe image features are\nalso available on Google Drive and Baidu Drive (see [Alternative Download](#alternative-dataset-and-features-download-links) for details).\n    ```bash\n    mkdir -p data/mscoco_imgfeat\n    wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/mscoco_imgfeat\n    unzip data/mscoco_imgfeat/train2014_obj36.zip -d data/mscoco_imgfeat \u0026\u0026 rm data/mscoco_imgfeat/train2014_obj36.zip\n    wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/mscoco_imgfeat\n    unzip data/mscoco_imgfeat/val2014_obj36.zip -d data \u0026\u0026 rm data/mscoco_imgfeat/val2014_obj36.zip\n    ```\n\n4. Before fine-tuning on whole VQA 2.0 training set, verifying the script and model on a small training set (512 images) is recommended. \nThe first argument `0` is GPU id. The second argument `vqa_lxr955_tiny` is the name of this experiment.\n    ```bash\n    bash run/vqa_finetune.bash 0 vqa_lxr955_tiny --tiny\n    ```\n5. If no bug came out, then the model is ready to be trained on the whole VQA corpus:\n    ```bash\n    bash run/vqa_finetune.bash 0 vqa_lxr955\n    ```\nIt takes around 8 hours (2 hours per epoch * 4 epochs) to converge. \nThe **logs** and **model snapshots** will be saved under folder `snap/vqa/vqa_lxr955`. \nThe validation result after training will be around **69.7%** to **70.2%**. \n\n#### Local Validation\nThe results on the validation set (our minival set) are printed while training.\nThe validation result is also saved to `snap/vqa/[experiment-name]/log.log`.\nIf the log file was accidentally deleted, the validation result in training is also reproducible from the model snapshot:\n```bash\nbash run/vqa_test.bash 0 vqa_lxr955_results --test minival --load snap/vqa/vqa_lxr955/BEST\n```\n#### Submitted to VQA test server\n1. 
Download our re-distributed json file containing VQA 2.0 test data.\n    ```bash\n    wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/test.json -P data/vqa/\n    ```\n2. Download the faster rcnn features for MS COCO test2015 split (16 GB).\n    ```bash\n    wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/test2015_obj36.zip -P data/mscoco_imgfeat\n    unzip data/mscoco_imgfeat/test2015_obj36.zip -d data \u0026\u0026 rm data/mscoco_imgfeat/test2015_obj36.zip\n    ```\n3. Since the VQA submission system requires submitting the whole test data, we need to run inference over all test splits \n(i.e., test dev, test standard, test challenge, and test held-out). \nIt takes around 10~15 mins to run test inference (448K instances to run).\n    ```bash\n    bash run/vqa_test.bash 0 vqa_lxr955_results --test test --load snap/vqa/vqa_lxr955/BEST\n    ```\n The test results will be saved in `snap/vqa_lxr955_results/test_predict.json`. \nThe VQA 2.0 challenge for this year is hosted on [EvalAI](https://evalai.cloudcv.org/) at [https://evalai.cloudcv.org/web/challenges/challenge-page/163/overview](https://evalai.cloudcv.org/web/challenges/challenge-page/163/overview).\nIt still allows submissions after the challenge has ended.\nPlease check the official website of [VQA Challenge](https://visualqa.org/challenge.html) for detailed information and \nfollow the instructions in [EvalAI](https://evalai.cloudcv.org/web/challenges/challenge-page/163/overview) to submit.\nIn general, after registration, the only thing remaining is to upload the `test_predict.json` file and wait for the result.\n\nThe testing accuracy with exactly this code is **72.42%** for test-dev and **72.54%**  for test-standard.\nThe results with the code base are also publicly shown on the [VQA 2.0 leaderboard](\nhttps://evalai.cloudcv.org/web/challenges/challenge-page/163/leaderboard/498\n) with entry `LXMERT github version`.\n\n\n### GQA\n\n#### Fine-tuning\n1. 
Please make sure the LXMERT pre-trained model is either [downloaded](#pre-trained-models) or [pre-trained](#pre-training).\n\n2. Download the re-distributed json files for GQA balanced version dataset.\nThe original GQA dataset is available [in the Download section of its website](https://cs.stanford.edu/people/dorarad/gqa/download.html)\nand the script to preprocess these datasets is under `data/gqa/process_raw_data_scripts`.\n    ```bash\n    mkdir -p data/gqa\n    wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/train.json -P data/gqa/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/valid.json -P data/gqa/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/testdev.json -P data/gqa/\n    ```\n3. Download Faster R-CNN features for Visual Genome and GQA testing images (30 GB).\nGQA's training and validation data are collected from Visual Genome.\nIts testing images come from MS COCO test set (I have verified this with one of GQA authors [Drew A. Hudson](https://www.linkedin.com/in/drew-a-hudson/)).\nThe image features are\nalso available on Google Drive and Baidu Drive (see [Alternative Download](#alternative-dataset-and-features-download-links) for details).\n    ```bash\n    mkdir -p data/vg_gqa_imgfeat\n    wget https://nlp.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/vg_gqa_obj36.zip -P data/vg_gqa_imgfeat\n    unzip data/vg_gqa_imgfeat/vg_gqa_obj36.zip -d data \u0026\u0026 rm data/vg_gqa_imgfeat/vg_gqa_obj36.zip\n    wget https://nlp.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/gqa_testdev_obj36.zip -P data/vg_gqa_imgfeat\n    unzip data/vg_gqa_imgfeat/gqa_testdev_obj36.zip -d data \u0026\u0026 rm data/vg_gqa_imgfeat/gqa_testdev_obj36.zip\n    ```\n\n4. Before fine-tuning on whole GQA training+validation set, verifying the script and model on a small training set (512 images) is recommended. \nThe first argument `0` is GPU id. 
The second argument `gqa_lxr955_tiny` is the name of this experiment.\n    ```bash\n    bash run/gqa_finetune.bash 0 gqa_lxr955_tiny --tiny\n    ```\n\n5. If no bugs appear, then the model is ready to be trained on the whole GQA corpus (train + validation), and validated on \nthe testdev set:\n    ```bash\n    bash run/gqa_finetune.bash 0 gqa_lxr955\n    ```\nIt takes around 16 hours (4 hours per epoch * 4 epochs) to converge. \nThe **logs** and **model snapshots** will be saved under folder `snap/gqa/gqa_lxr955`. \nThe validation result after training will be around **59.8%** to **60.1%**. \n\n#### Local Validation\nThe results on testdev are printed out while training and saved in `snap/gqa/gqa_lxr955/log.log`.\nThey can also be re-calculated with the command:\n```bash\nbash run/gqa_test.bash 0 gqa_lxr955_results --load snap/gqa/gqa_lxr955/BEST --test testdev --batchSize 1024\n```\n\n\u003e Note: Our local testdev result is usually 0.1% to 0.5% lower than the \nsubmitted testdev result. \nThe reason is that the test server uses an [advanced \nevaluation system](https://cs.stanford.edu/people/dorarad/gqa/evaluate.html) while our local evaluator only \ncalculates exact matches.\nPlease use [this official evaluator](https://nlp.stanford.edu/data/gqa/eval.zip) (784 MB) if you \nwant to have the exact number without submitting.\n\n\n#### Submitted to GQA test server\n1. Download our re-distributed json file containing GQA test data.\n    ```bash\n    wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/submit.json -P data/gqa/\n    ```\n\n2. Since the GQA submission system requires submitting the whole test data, \nwe need to run inference over all test splits.\nIt takes around 30~60 mins to run test inference (4.2M instances to run).\n    ```bash\n    bash run/gqa_test.bash 0 gqa_lxr955_results --load snap/gqa/gqa_lxr955/BEST --test submit --batchSize 1024\n    ```\n\n3. 
After running the test script, a json file `submit_predict.json` under `snap/gqa/gqa_lxr955_results` will contain \nall the prediction results and is ready to be submitted.\nThe GQA challenge 2019 is hosted by [EvalAI](https://evalai.cloudcv.org/) at [https://evalai.cloudcv.org/web/challenges/challenge-page/225/overview](https://evalai.cloudcv.org/web/challenges/challenge-page/225/overview).\nAfter registering an account, the only remaining steps are to upload `submit_predict.json` and wait for the results.\nPlease also check the [GQA official website](https://cs.stanford.edu/people/dorarad/gqa/) \nin case the test server is changed.\n\nThe testing accuracy with exactly this code is **60.00%** for test-dev and **60.33%**  for test-standard.\nThe results with the code base are also publicly shown on the [GQA leaderboard](\nhttps://evalai.cloudcv.org/web/challenges/challenge-page/225/leaderboard\n) with entry `LXMERT github version`.\n\n### NLVR2\n\n#### Fine-tuning\n\n1. Download the NLVR2 data from the official [GitHub repo](https://github.com/lil-lab/nlvr).\n    ```bash\n    git submodule update --init\n    ```\n\n\n2. Process the NLVR2 data to json files.\n    ```bash\n    bash -c \"cd data/nlvr2/process_raw_data_scripts \u0026\u0026 python process_dataset.py\"\n    ```\n\n3. Download the NLVR2 image features for train (21 GB) \u0026 valid (1.6 GB) splits. \nThe image features are\nalso available on Google Drive and Baidu Drive (see [Alternative Download](#alternative-dataset-and-features-download-links) for details).\nTo access the original images, please follow the instructions on [NLVR2 official Github](https://github.com/lil-lab/nlvr/tree/master/nlvr2).\nThe images could either be downloaded with the urls or by signing an agreement form for data usage. 
And the feature could be extracted as described in [feature extraction](#faster-r-cnn-feature-extraction)\n    ```bash\n    mkdir -p data/nlvr2_imgfeat\n    wget https://nlp.cs.unc.edu/data/lxmert_data/nlvr2_imgfeat/train_obj36.zip -P data/nlvr2_imgfeat\n    unzip data/nlvr2_imgfeat/train_obj36.zip -d data \u0026\u0026 rm data/nlvr2_imgfeat/train_obj36.zip\n    wget https://nlp.cs.unc.edu/data/lxmert_data/nlvr2_imgfeat/valid_obj36.zip -P data/nlvr2_imgfeat\n    unzip data/nlvr2_imgfeat/valid_obj36.zip -d data \u0026\u0026 rm data/nlvr2_imgfeat/valid_obj36.zip\n    ```\n\n4. Before fine-tuning on whole NLVR2 training set, verifying the script and model on a small training set (512 images) is recommended. \nThe first argument `0` is GPU id. The second argument `nlvr2_lxr955_tiny` is the name of this experiment.\nDo not worry if the result is low (50~55) on this tiny split, \nthe whole training data would bring the performance back.\n    ```bash\n    bash run/nlvr2_finetune.bash 0 nlvr2_lxr955_tiny --tiny\n    ```\n\n5. If no bugs are popping up from the previous step, \nit means that the code, the data, and image features are ready.\nPlease use this command to train on the full training set. \nThe result on NLVR2 validation (dev) set would be around **74.0** to **74.5**.\n    ```bash\n    bash run/nlvr2_finetune.bash 0 nlvr2_lxr955\n    ```\n\n#### Inference on Public Test Split\n1. Download NLVR2 image features for the public test split (1.6 GB).\n    ```bash\n    wget https://nlp.cs.unc.edu/data/lxmert_data/nlvr2_imgfeat/test_obj36.zip -P data/nlvr2_imgfeat\n    unzip data/nlvr2_imgfeat/test_obj36.zip -d data/nlvr2_imgfeat \u0026\u0026 rm data/nlvr2_imgfeat/test_obj36.zip\n    ```\n\n2. Test on the public test set (corresponding to 'test-P' on [NLVR2 leaderboard](http://lil.nlp.cornell.edu/nlvr/)) with:\n    ```bash\n    bash run/nlvr2_test.bash 0 nlvr2_lxr955_results --load snap/nlvr2/nlvr2_lxr955/BEST --test test --batchSize 1024\n    ```\n\n3. 
The test accuracy would be shown on the screen after around 5~10 minutes.\nIt also saves the predictions in the file `test_predict.csv` \nunder `snap/nlvr2/nlvr2_lxr955_results`, which is compatible with the NLVR2 [official evaluation script](https://github.com/lil-lab/nlvr/tree/master/nlvr2/eval).\nThe official eval script also calculates consistency ('Cons') besides the accuracy.\nWe could use this official script to verify the results by running:\n    ```bash\n    python data/nlvr2/nlvr/nlvr2/eval/metrics.py snap/nlvr2/nlvr2_lxr955_results/test_predict.csv data/nlvr2/nlvr/nlvr2/data/test1.json\n    ```\n\nThe accuracy on the public test set ('test-P') should be almost the same as on the validation set ('dev'),\nwhich is around 74.0% to 74.5%.\n\n\n#### Unreleased Test Sets\nTo be tested on the unreleased held-out test set (test-U on the \n[leaderboard](http://lil.nlp.cornell.edu/nlvr/)\n),\nthe code needs to be sent.\nPlease check the [NLVR2 official github](https://github.com/lil-lab/nlvr/tree/master/nlvr2) \nand [NLVR project website](http://lil.nlp.cornell.edu/nlvr/) for details.\n\n\n### General Debugging Options\nSince it takes a few minutes to load the features, the code has an option to prototype with a small amount of\ntraining data. \n```bash\n# Training with 512 images:\nbash run/vqa_finetune.bash 0 --tiny \n# Training with 4096 images:\nbash run/vqa_finetune.bash 0 --fast\n```\n\n## Pre-training\n\n1. Download our aggregated LXMERT dataset from MS COCO, Visual Genome, VQA, and GQA (around 700MB in total). 
The joint answer labels are saved in `data/lxmert/all_ans.json`.\n    ```bash\n    mkdir -p data/lxmert\n    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/\n    wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/\n    ```\n\n2. [*Skip this if you have run [VQA fine-tuning](#vqa).*] Download the detection features for MS COCO images.\n    ```bash\n    mkdir -p data/mscoco_imgfeat\n    wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/mscoco_imgfeat\n    unzip data/mscoco_imgfeat/train2014_obj36.zip -d data/mscoco_imgfeat \u0026\u0026 rm data/mscoco_imgfeat/train2014_obj36.zip\n    wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/mscoco_imgfeat\n    unzip data/mscoco_imgfeat/val2014_obj36.zip -d data \u0026\u0026 rm data/mscoco_imgfeat/val2014_obj36.zip\n    ```\n\n3. [*Skip this if you have run [GQA fine-tuning](#gqa).*] Download the detection features for Visual Genome images.\n    ```bash\n    mkdir -p data/vg_gqa_imgfeat\n    wget https://nlp.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/vg_gqa_obj36.zip -P data/vg_gqa_imgfeat\n    unzip data/vg_gqa_imgfeat/vg_gqa_obj36.zip -d data \u0026\u0026 rm data/vg_gqa_imgfeat/vg_gqa_obj36.zip\n    ```\n\n4. Test on a small split of the MS COCO + Visual Genome datasets:\n    ```bash\n    bash run/lxmert_pretrain.bash 0,1,2,3 --multiGPU --tiny\n    ```\n\n5. 
Run on the whole [MS COCO](http://cocodataset.org) and [Visual Genome](https://visualgenome.org/) related datasets (i.e., [VQA](https://visualqa.org/), [GQA](https://cs.stanford.edu/people/dorarad/gqa/index.html), [COCO caption](http://cocodataset.org/#captions-2015), [VG Caption](https://visualgenome.org/), [VG QA](https://github.com/yukezhu/visual7w-toolkit)). \nHere, we take a simple single-stage pre-training strategy (20 epochs with all pre-training tasks) rather than the two-stage strategy in our paper (10 epochs without image QA and 10 epochs with image QA).\nThe pre-training finishes in **8.5 days** on **4 GPUs**.  By the way, I hope that [my experience](experience_in_pretraining.md) in this project would help anyone with limited computational resources.\n    ```bash\n    bash run/lxmert_pretrain.bash 0,1,2,3 --multiGPU\n    ```\n    \u003e Multiple GPUs: Argument `0,1,2,3` indicates taking 4 GPUs to pre-train LXMERT. If the server does not have 4 GPUs (I am sorry to hear that), please consider halving the batch-size or using the [NVIDIA/apex](https://github.com/NVIDIA/apex) library to support half-precision computation. \n    The code uses the default data parallelism in PyTorch and is thus extensible to fewer/more GPUs. The python main thread would take charge of the data loading. On 4 GPUs, we do not find that the data loading becomes a bottleneck (around 5% overhead). \n    \u003e\n    \u003e GPU Types: We find that Titan XP, GTX 2080, and Titan V can all support this pre-training. However, GTX 1080, with its 11G memory, is a little bit small, so please change the batch_size to 224 (instead of 256).\n\n6. I have **verified these pre-training commands** with 12 epochs. The pre-trained weights from this process could be downloaded from `https://nlp.cs.unc.edu/data/github_pretrain/lxmert/EpochXX_LXRT.pth`, with XX from `01` to `12`. The results are roughly the same (around 0.3% lower in downstream tasks because of fewer epochs). \n\n7. 
Explanation of arguments in the pre-training script `run/lxmert_pretrain.bash`:\n    ```bash\n    python src/pretrain/lxmert_pretrain_new.py \\\n        # The pre-training tasks\n        --taskMaskLM --taskObjPredict --taskMatched --taskQA \\  \n        \n        # Vision subtasks\n        # obj / attr: detected object/attribute label prediction.\n        # feat: RoI feature regression.\n        --visualLosses obj,attr,feat \\\n        \n        # Mask rate for words and objects\n        --wordMaskRate 0.15 --objMaskRate 0.15 \\\n        \n        # Training and validation sets\n        # mscoco_nominival + mscoco_minival = mscoco_val2014\n        # visual genome - mscoco = vgnococo\n        --train mscoco_train,mscoco_nominival,vgnococo --valid mscoco_minival \\\n        \n        # Number of layers in each encoder\n        --llayers 9 --xlayers 5 --rlayers 5 \\\n        \n        # Train from scratch (using initialized weights) instead of loading BERT weights.\n        --fromScratch \\\n    \n        # Hyper parameters\n        --batchSize 256 --optim bert --lr 1e-4 --epochs 20 \\\n        --tqdm --output $output ${@:2}\n    ```\n\n\n## Alternative Dataset and Features Download Links \nAll default download links are provided by our servers in the [UNC CS department](https://cs.unc.edu) and under \nour [NLP group website](https://nlp.cs.unc.edu) but the network bandwidth might be limited. 
\nWe thus provide a few other options with Google Drive and Baidu Drive.\n\nThe files in online drives are almost structured in the same way \nas our repo but have a few differences due to specific policies.\nAfter downloading the data and features from the drives, \nplease re-organize them under the `data/` folder according to the following example:\n```\nREPO ROOT\n |\n |-- data                  \n |    |-- vqa\n |    |    |-- train.json\n |    |    |-- minival.json\n |    |    |-- nominival.json\n |    |    |-- test.json\n |    |\n |    |-- mscoco_imgfeat\n |    |    |-- train2014_obj36.tsv\n |    |    |-- val2014_obj36.tsv\n |    |    |-- test2015_obj36.tsv\n |    |\n |    |-- vg_gqa_imgfeat -- *.tsv\n |    |-- gqa -- *.json\n |    |-- nlvr2_imgfeat -- *.tsv\n |    |-- nlvr2 -- *.json\n |    |-- lxmert -- *.json          # Pre-training data\n | \n |-- snap\n |-- src\n```\n\nPlease also kindly contact us if anything is missing!\n\n### Google Drive\nAs an alternative way to download features from our UNC server,\nyou could also download the features from Google Drive with link [https://drive.google.com/drive/folders/1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing](https://drive.google.com/drive/folders/1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing).\nThe structure of the folders on drive is:\n```\nGoogle Drive Root\n |-- data                  # The raw data and image features without compression\n |    |-- vqa\n |    |-- gqa\n |    |-- mscoco_imgfeat\n |    |-- ......\n |\n |-- image_feature_zips    # The image-feature zip files (Around 45% compressed)\n |    |-- mscoco_imgfeat.zip\n |    |-- nlvr2_imgfeat.zip\n |    |-- vg_gqa_imgfeat.zip\n |\n |-- snap -- pretrained -- model_LXRT.pth # The pytorch pre-trained model weights.\n```\nNote: image features in zip files (e.g., `mscoco_imgfeat.zip`) are the same as those in `data/` (i.e., `data/mscoco_imgfeat`). 
\nIf you want to save network bandwidth, please download the feature zips and skip downloading the `*_imgfeat` folders under `data/`.\n### Baidu Drive\n\nSince [Google Drive](\nhttps://drive.google.com/drive/folders/1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing\n) is not officially available across the world,\nwe also create a mirror on Baidu drive (i.e., Baidu PAN). \nThe dataset and features could be downloaded with shared link \n[https://pan.baidu.com/s/1m0mUVsq30rO6F1slxPZNHA](https://pan.baidu.com/s/1m0mUVsq30rO6F1slxPZNHA) \nand access code `wwma`.\n```\nBaidu Drive Root\n |\n |-- vqa\n |    |-- train.json\n |    |-- minival.json\n |    |-- nominival.json\n |    |-- test.json\n |\n |-- mscoco_imgfeat\n |    |-- train2014_obj36.zip\n |    |-- val2014_obj36.zip\n |    |-- test2015_obj36.zip\n |\n |-- vg_gqa_imgfeat -- *.zip.*  # Please read README.txt under this folder\n |-- gqa -- *.json\n |-- nlvr2_imgfeat -- *.zip.*   # Please read README.txt under this folder\n |-- nlvr2 -- *.json\n |-- lxmert -- *.json\n | \n |-- pretrained -- model_LXRT.pth\n```\n\nSince Baidu Drive does not support extremely large files, \nwe `split` a few features zips into multiple small files. \nPlease follow the `README.txt` under `baidu_drive/vg_gqa_imgfeat` and \n`baidu_drive/nlvr2_imgfeat` to concatenate back to the feature zips with command `cat`.\n\n\n## Code and Project Explanation\n- All code is in folder `src`. 
The basics are in `lxrt`.\nThe python files related to pre-training and fine-tuning are saved in `src/pretrain` and `src/tasks` respectively.\n- I kept folders containing image features (e.g., mscoco_imgfeat) separated from vision-and-language datasets (e.g., vqa, lxmert) because\nmultiple vision-and-language datasets would share common images.\n- We use the name `lxmert` for our framework and use the name `lxrt`\n(Language, Cross-Modality, and object-Relationship Transformers) to refer to our models.\n- To be consistent with the name `lxrt` (Language, Cross-Modality, and object-Relationship Transformers), \nwe use `lxrXXX` to denote the number of layers.\nE.g., `lxr955` (used in the current pre-trained model) indicates \na model with 9 Language layers, 5 cross-modality layers, and 5 object-Relationship layers. \nIf we consider a single-modality layer as half of a cross-modality layer, \nthe total number of layers is `(9 + 5) / 2 + 5 = 12`, which is the same as `BERT_BASE`.\n- We share the weight between the two cross-modality attention sub-layers. Please check the [`visual_attention` variable](blob/master/src/lxrt/modeling.py#L521), which is used to compute both `lang-\u003evisn` attention and `visn-\u003elang` attention. (I am sorry that the name `visual_attention` is misleading because I deleted the `lang_attention` there.) Sharing weights is mostly used for saving computational resources and it also (intuitively) helps force the features from visn/lang into a joint subspace.\n- The box coordinates are not normalized from [0, 1] to [-1, 1], which looks like a typo but actually is not ;). Normalizing the coordinates would not affect the output of the box encoder (mathematically and almost numerically). 
~~(Hint: consider the LayerNorm in positional encoding)~~\n\n\n## Faster R-CNN Feature Extraction\n\n\nWe use the Faster R-CNN feature extractor demonstrated in [\"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering\", CVPR 2018](https://arxiv.org/abs/1707.07998)\nand its released code at [Bottom-Up-Attention github repo](https://github.com/peteanderson80/bottom-up-attention).\nIt was trained on the [Visual Genome](https://visualgenome.org/) dataset and implemented based on a specific [Caffe](https://caffe.berkeleyvision.org/) version.\n\n\nTo extract features with this Caffe Faster R-CNN, we publicly release a docker image `airsplay/bottom-up-attention` on docker hub that takes care of all the dependencies and library installation. Instructions and examples are demonstrated below. You could also follow the installation instructions in the bottom-up attention github to set up the tool: [https://github.com/peteanderson80/bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention). \n\nThe BUTD feature extractor is widely used in many other projects. 
If you want to reproduce the results from their paper, feel free to use our docker as a tool.\n\n\n### Feature Extraction with Docker\n[Docker](https://www.docker.com/) is an easy-to-use virtualization tool which allows you to plug and play without installing libraries.\n\nThe built docker image for bottom-up-attention is released on [docker hub](https://hub.docker.com/r/airsplay/bottom-up-attention) and could be downloaded with command: \n```bash\nsudo docker pull airsplay/bottom-up-attention\n```\n\u003e The `Dockerfile` could be downloaded [here](https://drive.google.com/file/d/1KJjwQtqisXvinWm8OORk-_3XYLBHYCIK/view?usp=sharing), which allows using other CUDA versions.\n\nAfter pulling the docker, you could test running the docker container with command:\n```bash\ndocker run --gpus all --rm -it airsplay/bottom-up-attention bash\n``` \n\n\nIf errors about `--gpus all` pop up, please read the next section.\n\n#### Docker GPU Access\nNote that the purpose of the argument `--gpus all` is to expose GPU devices to the docker container, and it requires Docker \u003e= 19.03 along with `nvidia-container-toolkit`:\n1. [Docker CE 19.03](https://docs.docker.com/install/linux/docker-ce/ubuntu/)\n2. [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-docker)\n\nFor running Docker with an older version, either update it to 19.03 or use the flag `--runtime=nvidia` instead of `--gpus all`.\n\n#### An Example: Feature Extraction for NLVR2 \nWe demonstrate how to extract Faster R-CNN features of NLVR2 images.\n\n1. Please first follow the instructions on the [NLVR2 official repo](https://github.com/lil-lab/nlvr/tree/master/nlvr2) to get the images.\n\n2. Download the pre-trained Faster R-CNN model. Instead of using the default pre-trained model (trained with 10 to 100 boxes), we use the ['alternative pretrained model'](https://github.com/peteanderson80/bottom-up-attention#demo) which was trained with 36 boxes. 
\n    ```bash\n    wget 'https://www.dropbox.com/s/2h4hmgcvpaewizu/resnet101_faster_rcnn_final_iter_320000.caffemodel?dl=1' -O data/nlvr2_imgfeat/resnet101_faster_rcnn_final_iter_320000.caffemodel\n    ```\n\n3. Run the Docker container with the command:\n    ```bash\n    docker run --gpus all -v /path/to/nlvr2/images:/workspace/images:ro -v /path/to/lxrt_public/data/nlvr2_imgfeat:/workspace/features --rm -it airsplay/bottom-up-attention bash\n    ```\n    `-v` mounts folders on the host OS into the Docker container.\n    \u003e Note0: If it says something about 'privilege', add `sudo` before the command.\n    \u003e\n    \u003e Note1: If it says something about '--gpus all', it means that the GPU options are not correctly set. Please read [Docker GPU Access](#docker-gpu-access) for the instructions to allow GPU access.\n    \u003e\n    \u003e Note2: /path/to/nlvr2/images should contain the subfolders `train`, `dev`, `test1` and `test2`.\n    \u003e\n    \u003e Note3: Both '/path/to/nlvr2/images/' and '/path/to/lxrt_public' must be absolute paths.\n\n\n4. Extract the features **inside the Docker container**. The extraction script is copied from [butd/tools/generate_tsv.py](https://github.com/peteanderson80/bottom-up-attention/blob/master/tools/generate_tsv.py) and modified by [Jie Lei](http://www.cs.unc.edu/~jielei/) and me.\n    ```bash\n    cd /workspace/features\n    CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split train\n    CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split valid\n    CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split test\n    ```\n\n5. It takes around 5 to 6 hours for the training split and 1 to 2 hours for the valid and test splits. Since it is slow, I recommend running them in parallel if there are multiple GPUs. 
This can be done by changing the `gpu_id` in `CUDA_VISIBLE_DEVICES=$gpu_id`.\n\nThe features will be saved in `train.tsv`, `valid.tsv`, and `test.tsv` under the directory `data/nlvr2_imgfeat`, outside the Docker container. I have verified that the extracted image features are the same as the ones I provided in [NLVR2 fine-tuning](#nlvr2).\n\n#### Yet Another Example: Feature Extraction for MS COCO Images\n1. Download the MS COCO train2014, val2014, and test2015 images from the [MS COCO official website](http://cocodataset.org/#download).\n\n2. Download the pre-trained Faster R-CNN model. \n    ```bash\n    mkdir -p data/mscoco_imgfeat\n    wget 'https://www.dropbox.com/s/2h4hmgcvpaewizu/resnet101_faster_rcnn_final_iter_320000.caffemodel?dl=1' -O data/mscoco_imgfeat/resnet101_faster_rcnn_final_iter_320000.caffemodel\n    ```\n\n3. Run the Docker container with the command:\n    ```bash\n    docker run --gpus all -v /path/to/mscoco/images:/workspace/images:ro -v $(pwd)/data/mscoco_imgfeat:/workspace/features --rm -it airsplay/bottom-up-attention bash\n    ```\n    \u003e Note: Option `-v` mounts folders outside the container to paths inside the container.\n    \u003e \n    \u003e Note1: Please use the **absolute path** to the MS COCO images folder `images`. The `images` folder should contain the `train2014`, `val2014`, and `test2015` sub-folders. (It's the standard way to save MS COCO images.)\n\n4. Extract the features **inside the Docker container**.\n    ```bash\n    cd /workspace/features\n    CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split train\n    CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split valid\n    CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split test\n    ```\n \n5. Exit from the Docker container (by executing the `exit` command in bash). The extracted features will be saved under the folder `data/mscoco_imgfeat`. 
\n\n\n## Reference\nIf you find this project helpful, please cite our paper :)\n\n```bibtex\n@inproceedings{tan2019lxmert,\n  title={LXMERT: Learning Cross-Modality Encoder Representations from Transformers},\n  author={Tan, Hao and Bansal, Mohit},\n  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},\n  year={2019}\n}\n```\n\n## Acknowledgement\nWe are grateful for the funding support from ARO-YIP Award #W911NF-18-1-0336, and awards from Google, Facebook, Salesforce, and Adobe.\n\nWe thank [Peter Anderson](https://panderson.me/) for providing the Faster R-CNN code and pre-trained models under the\n[Bottom-Up-Attention GitHub repo](https://github.com/peteanderson80/bottom-up-attention).  We thank [Hengyuan Hu](https://www.linkedin.com/in/hengyuan-hu-8963b313b) for his [PyTorch VQA](https://github.com/hengyuan-hu/bottom-up-attention-vqa) implementation; our VQA implementation borrows its pre-processed answers.\nWe thank [Hugging Face](https://github.com/huggingface) for releasing the excellent PyTorch code \n[PyTorch Transformers](https://github.com/huggingface/pytorch-transformers).  \n\nWe thank [Drew A. Hudson](https://www.linkedin.com/in/drew-a-hudson/) for answering all our questions about the GQA specification.\nWe thank [Alane Suhr](http://alanesuhr.com/) for helping test LXMERT on the NLVR2 unreleased test split and providing [a detailed analysis](http://lil.nlp.cornell.edu/nlvr/NLVR2BiasAnalysis.html).\n\nWe thank all the authors and annotators of vision-and-language datasets \n(i.e., \n[MS COCO](http://cocodataset.org/#home), \n[Visual Genome](https://visualgenome.org/),\n[VQA](https://visualqa.org/),\n[GQA](https://cs.stanford.edu/people/dorarad/gqa/),\n[NLVR2](http://lil.nlp.cornell.edu/nlvr/)\n), \nwhich allow us to develop a pre-trained model for vision-and-language tasks.\n\nWe thank [Jie Lei](http://www.cs.unc.edu/~jielei/) and [Licheng Yu](http://www.cs.unc.edu/~licheng/) for their helpful discussions. 
I also want to thank [Shaoqing Ren](https://www.shaoqingren.com/) for teaching me vision knowledge when I was at MSRA.  We also thank you for helping look into our code. Please kindly contact us if you find any issues. Comments are always welcome.\n\nLXRThanks.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairsplay%2Flxmert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fairsplay%2Flxmert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairsplay%2Flxmert/lists"}