{"id":13958461,"url":"https://github.com/ChenRocks/UNITER","last_synced_at":"2025-07-21T00:30:49.007Z","repository":{"id":37698874,"uuid":"236624100","full_name":"ChenRocks/UNITER","owner":"ChenRocks","description":"Research code for ECCV 2020 paper \"UNITER: UNiversal Image-TExt Representation Learning\"","archived":false,"fork":false,"pushed_at":"2021-06-30T21:46:55.000Z","size":176,"stargazers_count":786,"open_issues_count":46,"forks_count":109,"subscribers_count":18,"default_branch":"master","last_synced_at":"2024-11-28T02:34:44.076Z","etag":null,"topics":["pre-training","pytorch","transformers","vision-and-language"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1909.11740","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ChenRocks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-28T00:09:59.000Z","updated_at":"2024-11-28T00:30:53.000Z","dependencies_parsed_at":"2022-08-03T09:30:15.649Z","dependency_job_id":null,"html_url":"https://github.com/ChenRocks/UNITER","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ChenRocks/UNITER","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenRocks%2FUNITER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenRocks%2FUNITER/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenRocks%2FUNITER/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenRocks%2FUNITER/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ChenRocks","download_url":"h
ttps://codeload.github.com/ChenRocks/UNITER/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenRocks%2FUNITER/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221246,"owners_count":23894964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pre-training","pytorch","transformers","vision-and-language"],"created_at":"2024-08-08T13:01:36.867Z","updated_at":"2025-07-21T00:30:48.690Z","avatar_url":"https://github.com/ChenRocks.png","language":"Python","funding_links":[],"categories":["其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# UNITER: UNiversal Image-TExt Representation Learning\nThis is the official repository of [UNITER](https://arxiv.org/abs/1909.11740) (ECCV 2020).\nThis repository currently supports finetuning UNITER on\n[NLVR2](http://lil.nlp.cornell.edu/nlvr/), [VQA](https://visualqa.org/), [VCR](https://visualcommonsense.com/),\n[SNLI-VE](https://github.com/necla-ml/SNLI-VE), \nImage-Text Retrieval for [COCO](https://cocodataset.org/#home) and\n[Flickr30k](http://shannon.cs.illinois.edu/DenotationGraph/), and\n[Referring Expression Comprehension](https://github.com/lichengunc/refer) (RefCOCO, RefCOCO+, and RefCOCO-g).\nBoth UNITER-base and UNITER-large pre-trained checkpoints are released.\nUNITER-base pre-training with in-domain data is also available.\n\n![Overview of UNITER](https://acvrpublicycchen.blob.core.windows.net/uniter/uniter_overview_v2.png)\n\nSome code in this repo is copied/modified from open-source implementations made available 
by\n[PyTorch](https://github.com/pytorch/pytorch),\n[HuggingFace](https://github.com/huggingface/transformers),\n[OpenNMT](https://github.com/OpenNMT/OpenNMT-py),\nand [Nvidia](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch).\nThe image features are extracted using [BUTD](https://github.com/peteanderson80/bottom-up-attention).\n\n\n## Requirements\nWe provide a Docker image for easier reproduction. Please install the following:\n  - [nvidia driver](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-installation) (418+), \n  - [Docker](https://docs.docker.com/install/linux/docker-ce/ubuntu/) (19.03+), \n  - [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-docker#quickstart).\n\nOur scripts require the user to have the [docker group membership](https://docs.docker.com/install/linux/linux-postinstall/)\nso that docker commands can be run without sudo.\nWe only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards.\nWe use mixed-precision training, so GPUs with Tensor Cores are recommended.\n\n## Quick Start\n*NOTE*: Please run `bash scripts/download_pretrained.sh $PATH_TO_STORAGE` to get our latest pretrained\ncheckpoints. This will download both the base and large models.\n\nWe use NLVR2 as an end-to-end example for using this code base.\n\n1. 
Download processed data and pretrained models with the following command.\n    ```bash\n    bash scripts/download_nlvr2.sh $PATH_TO_STORAGE\n    ```\n    After downloading you should see the following folder structure:\n    ```\n    ├── ann\n    │   ├── dev.json\n    │   └── test1.json\n    ├── finetune\n    │   ├── nlvr-base\n    │   └── nlvr-base.tar\n    ├── img_db\n    │   ├── nlvr2_dev\n    │   ├── nlvr2_dev.tar\n    │   ├── nlvr2_test\n    │   ├── nlvr2_test.tar\n    │   ├── nlvr2_train\n    │   └── nlvr2_train.tar\n    ├── pretrained\n    │   └── uniter-base.pt\n    └── txt_db\n        ├── nlvr2_dev.db\n        ├── nlvr2_dev.db.tar\n        ├── nlvr2_test1.db\n        ├── nlvr2_test1.db.tar\n        ├── nlvr2_train.db\n        └── nlvr2_train.db.tar\n    ```\n\n2. Launch the Docker container for running the experiments.\n    ```bash\n    # docker image should be automatically pulled\n    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \\\n        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained\n    ```\n    The launch script respects $CUDA_VISIBLE_DEVICES environment variable.\n    Note that the source code is mounted into the container under `/src` instead \n    of built into the image so that user modification will be reflected without\n    re-building the image. (Data folders are mounted into the container separately\n    for flexibility on folder structures.)\n\n\n3. Run finetuning for the NLVR2 task.\n    ```bash\n    # inside the container\n    python train_nlvr2.py --config config/train-nlvr2-base-1gpu.json\n\n    # for more customization\n    horovodrun -np $N_GPU python train_nlvr2.py --config $YOUR_CONFIG_JSON\n    ```\n\n4. Run inference for the NLVR2 task and then evaluate.\n    ```bash\n    # inference\n    python inf_nlvr2.py --txt_db /txt/nlvr2_test1.db/ --img_db /img/nlvr2_test/ \\\n        --train_dir /storage/nlvr-base/ --ckpt 6500 --output_dir . 
--fp16\n\n    # evaluation\n    # run this command outside docker (tested with python 3.6)\n    # or copy the annotation json into mounted folder\n    python scripts/eval_nlvr2.py ./results.csv $PATH_TO_STORAGE/ann/test1.json\n    ```\n    The above command runs inference on the model we trained. Feel free to replace\n    `--train_dir` and `--ckpt` with your own model trained in step 3.\n    Currently we only support single-GPU inference.\n\n\n5. Customization\n    ```bash\n    # training options\n    python train_nlvr2.py --help\n    ```\n    - command-line arguments override values in the JSON config file\n    - JSON config values override `argparse` defaults\n    - use horovodrun to run multi-GPU training\n    - `--gradient_accumulation_steps` emulates multi-GPU training\n\n\n6. Misc.\n    ```bash\n    # text annotation preprocessing\n    bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann\n\n    # image feature extraction (Tested on Titan-Xp; may not run on latest GPUs)\n    bash scripts/extract_imgfeat.sh $PATH_TO_IMG_FOLDER $PATH_TO_IMG_NPY\n\n    # image preprocessing\n    bash scripts/create_imgdb.sh $PATH_TO_IMG_NPY $PATH_TO_STORAGE/img_db\n    ```\n    Use these scripts if you would like to reproduce the whole preprocessing pipeline.\n\n## Downstream Tasks Finetuning\n\n### VQA\nNOTE: training and inference should be run inside the docker container\n1. download data\n    ```\n    bash scripts/download_vqa.sh $PATH_TO_STORAGE\n    ```\n2. train\n    ```\n    horovodrun -np 4 python train_vqa.py --config config/train-vqa-base-4gpu.json \\\n        --output_dir $VQA_EXP\n    ```\n3. 
inference\n    ```\n    python inf_vqa.py --txt_db /txt/vqa_test.db --img_db /img/coco_test2015 \\\n        --output_dir $VQA_EXP --checkpoint 6000 --pin_mem --fp16\n    ```\n    The result file will be written at `$VQA_EXP/results_test/results_6000_all.json`, which can be\n    submitted to the evaluation server.\n\n### VCR\nNOTE: training and inference should be run inside the docker container\n1. download data\n    ```\n    bash scripts/download_vcr.sh $PATH_TO_STORAGE\n    ```\n2. train\n    ```\n    horovodrun -np 4 python train_vcr.py --config config/train-vcr-base-4gpu.json \\\n        --output_dir $VCR_EXP\n    ```\n3. inference\n    ```\n    horovodrun -np 4 python inf_vcr.py --txt_db /txt/vcr_test.db \\\n        --img_db \"/img/vcr_gt_test/;/img/vcr_test/\" \\\n        --split test --output_dir $VCR_EXP --checkpoint 8000 \\\n        --pin_mem --fp16\n    ```\n    The result file will be written at `$VCR_EXP/results_test/results_8000_all.csv`, which can be\n    submitted to the VCR leaderboard for evaluation.\n\n### VCR 2nd Stage Pre-training\nNOTE: pre-training should be run inside the docker container\n1. download VCR data if you haven't\n    ```\n    bash scripts/download_vcr.sh $PATH_TO_STORAGE\n    ```\n2. 2nd stage pre-train\n    ```\n    horovodrun -np 4 python pretrain_vcr.py --config config/pretrain-vcr-base-4gpu.json \\\n        --output_dir $PRETRAIN_VCR_EXP\n    ```\n\n### Visual Entailment (SNLI-VE)\nNOTE: training should be run inside the docker container\n1. download data\n    ```\n    bash scripts/download_ve.sh $PATH_TO_STORAGE\n    ```\n2. 
train\n    ```\n    horovodrun -np 2 python train_ve.py --config config/train-ve-base-2gpu.json \\\n        --output_dir $VE_EXP\n    ```\n\n### Image-Text Retrieval\ndownload data\n```\nbash scripts/download_itm.sh $PATH_TO_STORAGE\n```\nNOTE: Image-Text Retrieval is computationally heavy, especially on COCO.\n#### Zero-shot Image-Text Retrieval (Flickr30k)\n```\n# every image-text pair has to be ranked; please use as many GPUs as possible\nhorovodrun -np $NGPU python inf_itm.py \\\n    --txt_db /txt/itm_flickr30k_test.db --img_db /img/flickr30k \\\n    --checkpoint /pretrain/uniter-base.pt --model_config /src/config/uniter-base.json \\\n    --output_dir $ZS_ITM_RESULT --fp16 --pin_mem\n```\n#### Image-Text Retrieval (Flickr30k)\n- normal finetune\n    ```\n    horovodrun -np 8 python train_itm.py --config config/train-itm-flickr-base-8gpu.json\n    ```\n- finetune with hard negatives\n    ```\n    horovodrun -np 16 python train_itm_hard_negatives.py \\\n        --config config/train-itm-flickr-base-16gpu-hn.json\n    ```\n#### Image-Text Retrieval (COCO)\n- finetune with hard negatives\n    ```\n    horovodrun -np 16 python train_itm_hard_negatives.py \\\n        --config config/train-itm-coco-base-16gpu-hn.json\n    ```\n### Referring Expressions\n1. download data\n    ```\n    bash scripts/download_re.sh $PATH_TO_STORAGE\n    ```\n2. train\n    ```\n    python train_re.py --config config/train-refcoco-base-1gpu.json \\\n        --output_dir $RE_EXP\n    ```\n3. 
inference and evaluation\n    ```\n    source scripts/eval_refcoco.sh $RE_EXP\n    ```\n    The result files will be written under `$RE_EXP/results_test/`.\n\nSimilarly, change the corresponding configs/scripts to run RefCOCO+/RefCOCOg.\n\n\n## Pre-training\ndownload\n```\nbash scripts/download_indomain.sh $PATH_TO_STORAGE\n```\npre-train\n```\nhorovodrun -np 8 python pretrain.py --config config/pretrain-indomain-base-8gpu.json \\\n    --output_dir $PRETRAIN_EXP\n```\nUnfortunately, we cannot host CC/SBU features due to their large size. Users will need to process\nthem on their own. We will provide a smaller sample for easier reference to the expected format soon.\n\n\n## Citation\n\nIf you find this code useful for your research, please consider citing:\n```\n@inproceedings{chen2020uniter,\n  title={Uniter: Universal image-text representation learning},\n  author={Chen, Yen-Chun and Li, Linjie and Yu, Licheng and Kholy, Ahmed El and Ahmed, Faisal and Gan, Zhe and Cheng, Yu and Liu, Jingjing},\n  booktitle={ECCV},\n  year={2020}\n}\n```\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FChenRocks%2FUNITER","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FChenRocks%2FUNITER","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FChenRocks%2FUNITER/lists"}