{"id":15035931,"url":"https://github.com/ofa-sys/ofa","last_synced_at":"2025-05-15T05:04:31.929Z","repository":{"id":37319871,"uuid":"453343381","full_name":"OFA-Sys/OFA","owner":"OFA-Sys","description":"Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework","archived":false,"fork":false,"pushed_at":"2024-04-24T06:20:34.000Z","size":125586,"stargazers_count":2491,"open_issues_count":112,"forks_count":249,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-04-14T05:17:27.013Z","etag":null,"topics":["chinese","image-captioning","multimodal","pretrained-models","pretraining","prompt","prompt-tuning","referring-expression-comprehension","text-to-image-synthesis","vision-language","visual-question-answering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OFA-Sys.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-29T08:46:04.000Z","updated_at":"2025-04-14T02:38:39.000Z","dependencies_parsed_at":"2023-02-09T02:45:24.035Z","dependency_job_id":"e44791cf-2c39-4143-a921-e9d030da3e4c","html_url":"https://github.com/OFA-Sys/OFA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FOFA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FOFA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/OFA-Sys%2FOFA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FOFA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OFA-Sys","download_url":"https://codeload.github.com/OFA-Sys/OFA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248824694,"owners_count":21167345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese","image-captioning","multimodal","pretrained-models","pretraining","prompt","prompt-tuning","referring-expression-comprehension","text-to-image-synthesis","vision-language","visual-question-answering"],"created_at":"2024-09-24T20:29:47.496Z","updated_at":"2025-04-14T05:17:45.828Z","avatar_url":"https://github.com/OFA-Sys.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!---\nCopyright 2022 The OFA-Sys Team. 
\nAll rights reserved.\nThis source code is licensed under the Apache 2.0 license found in the LICENSE file in the root directory.\n--\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"examples/OFA_logo_tp_path.svg\" width=\"150\" /\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\u003cbr\u003e\n\n\u003cp align=\"center\"\u003e\n        \u003ca href=\"modelscope.md\"\u003eModelScope\u003c/a\u003e\u0026nbsp ｜ \u0026nbsp\u003ca href=\"checkpoints.md\"\u003eCheckpoints\u003c/a\u003e\u0026nbsp ｜ \u0026nbsp\u003ca href=\"colab.md\"\u003eColab\u003c/a\u003e\u0026nbsp ｜ \u0026nbsp\u003ca href=\"https://huggingface.co/ofa-sys\"\u003eDemo\u003c/a\u003e\u0026nbsp ｜ \u0026nbsp\u003ca href=\"http://arxiv.org/abs/2202.03052\"\u003ePaper \u003c/a\u003e\u0026nbsp ｜ \u0026nbspBlog\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"examples/demo.gif\" width=\"800\" /\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\n[colab]: \u003chttps://colab.research.google.com/assets/colab-badge.svg\u003e\n\nOFA is a unified sequence-to-sequence pretrained model (support **English** and **Chinese**) that unifies modalities (i.e., cross-modality, vision, language) and tasks (**finetuning** and **prompt tuning** are supported): image captioning (1st at the [MSCOCO Leaderboard](https://competitions.codalab.org/competitions/3221#results)), VQA ([link](https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278)), visual grounding, text-to-image generation, text classification, text generation, image classification, etc. We provide **step-by-step** instructions for pretraining and finetuning and corresponding checkpoints (check official ckpt \\[[EN](checkpoints.md)|[CN](checkpoints_cn.md)\\] or [Hugging Face ckpt](https://huggingface.co/OFA-Sys)).\n\nWe sincerely welcome contributions to our project. 
Feel free to contact us or send us issues / PRs!\n\u003cbr\u003e\u003c/br\u003e\n\n\n# Online Demos\nWe provide online demo via Hugging Face Spaces for you to interact with our pretrained and finetuned models. Below are the links to the demos:\n* Image Captioning \\[[ModelScope](https://modelscope.cn/#/models/damo/ofa_image-caption_coco_large_en/summary)  |  [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Image_Caption)\\]\n* Visual Grounding \\[[ModelScope](https://modelscope.cn/#/models/damo/ofa_visual-grounding_refcoco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Grounding)\\]\n* Visual Question Answering \\[[ModelScope](https://modelscope.cn/#/models/damo/ofa_visual-question-answering_pretrain_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Visual_Question_Answering)\\]\n* Text-to-Image Generation \\[[ModelScope](https://modelscope.cn/#/models/damo/ofa_text-to-image-synthesis_coco_large_en/summary) | [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Text2Image_Generation)\\]\n* Generic Interface \\[[Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-Generic_Interface)\\]\n* Chinese OCR \\[[ModelScope](https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary)  |  [Spaces](https://huggingface.co/spaces/OFA-Sys/OFA-OCR)\\]\n\n\nAlso we provide Colab notebooks for you to better perceive the procedures. Click [here](colab.md) to check them out!\n\u003cbr\u003e\u003c/br\u003e\n\n# Use in Hugging Face Transformers\nWe support the inference of OFA in Hugging Face Transformers. Check the [README](transformers.md) and [Colab Notebook](https://colab.research.google.com/drive/1Ho81RBV8jysZ7e0FhsSCk_v938QeDuy3?usp=sharing) for more information. 
The code is released in this branch: https://github.com/OFA-Sys/OFA/tree/feature/add_transformers\n\u003cbr\u003e\u003cbr\u003e\n\n\n# News\n* 2023.5.11: Two papers ([OFA-OCR](https://arxiv.org/abs/2212.09297) and [OFA-prompt](https://arxiv.org/abs/2208.02532)) were accepted by ACL. The evaluation scripts and checkpoints of OFA-OCR are released.\n* 2023.1.11: Released MuE (https://arxiv.org/abs/2211.11152), which significantly accelerates OFA with little performance degradation. Many thanks to the first author, Shengkun Tang (@Tangshengku). See the branch `feature/MuE` and [PR](https://github.com/OFA-Sys/OFA/pull/336) for more information.\n* 2022.12.20: Released OFA-OCR, a model for Chinese text recognition based on OFA. Check our [paper](https://arxiv.org/abs/2212.09297) and [demo](https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary).\n* 2022.12.7: Released MMSpeech, an ASR pre-training method based on OFA. Check our paper [here](https://arxiv.org/abs/2212.00500)! Please see [README_mmspeech.md](README_mmspeech.md) for further details.\n* 2022.8.16: Released the **Chinese** version of OFA. To use **OFA-CN**, switch to `bpe_dir=../../utils/BERT_CN_dict` and `bpe=bert`, and use our provided Chinese checkpoints in [checkpoints_cn.md](checkpoints_cn.md). For now, we only provide base-size and large-size pretrained checkpoints and finetuned checkpoints on [MUGE Caption](https://tianchi.aliyun.com/muge) and the Chinese version of RefCOCO(-/+/g) (to release soon). \n* 2022.8.5: Released support of **prompt tuning** for OFA. Check our paper [here](https://arxiv.org/abs/2208.02532)! Please see [prompt_tuning.md](prompt_tuning.md) for further details.\n* 2022.7.7: Updated support of OFA on **Hugging Face transformers** (fixed bugs in forward, added the sequence generator from Fairseq to ensure performance, etc.). Refer to the doc [transformers.md](transformers.md) and the branch `feature/add_transformers`. 
\n* 2022.6.17: Released the pretrained checkpoint of **OFA-Huge**. To use it, set `--arch=ofa_huge` in the script.\n* 2022.5.15: OFA was accepted by **ICML 2022**\n\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003eMore News\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        \u003cul\u003e\n            \u003cli\u003e2022.4.28: Add support of inference on **Hugging Face transformers**. For how to use it, please refer to the doc [transformers.md](transformers.md) and our [Hugging Face models](https://huggingface.co/OFA-Sys).\u003c/li\u003e\n            \u003cli\u003e2022.4.16: Released lightweight pretrained models **OFA-Medium** (~93M params) and **OFA-Tiny** (~33M params) in [checkpoints.md](checkpoints.md). To use them, you just need to load the corresponding checkpoint and set `--arch=ofa_medium` or `--arch=ofa_tiny` in the scripts.\u003c/li\u003e\n            \u003cli\u003e2022.3.23: Added [Encouraging Loss](https://arxiv.org/pdf/2110.06537.pdf) as a feature. See [README_EncouragingLoss.md](README_EncouragingLoss.md). 
Leveraging this feature, OFA-Large has achieved improved results in both VQA (**test-std acc: 80.67**) and Image Classification (**test acc: 85.6**) recently.\u003c/li\u003e\n            \u003cli\u003e2022.3.21: Released codes for pretraining OFA.\u003c/li\u003e\n            \u003cli\u003e2022.3.18: Released the finetuned \u003cb\u003eOFA-Base\u003c/b\u003e (~180M parameters) checkpoints and running scripts for vision \u0026 language tasks, including: \u003cb\u003eCaption (146.4 CIDEr), VQA (78.07 on test-std), SNLI-VE (89.3 on dev), RefCOCO (90.67 on testA), RefCOCO+ (87.15 on testA) and RefCOCOg (82.31 on test-u)\u003c/b\u003e.\u003c/li\u003e\n            \u003cli\u003e2022.3.11: Released the finetuning \u0026 inference code/checkpoints for \u003cb\u003eGigaword\u003c/b\u003e.\u003c/li\u003e\n            \u003cli\u003e2022.3.08: Released the pretrained checkpoint of \u003cb\u003eOFA-Base\u003c/b\u003e in \u003ca href=\"https://github.com/OFA-Sys/OFA/blob/main/checkpoints.md\"\u003echeckpoints.md\u003c/a\u003e. 
To use OFA-Base, you just need to load \u003ccode\u003eofa_base.pt\u003c/code\u003e and change \u003ccode\u003e--arch=ofa_large\u003c/code\u003e to \u003ccode\u003e--arch=ofa_base\u003c/code\u003e in the training scripts.\u003c/li\u003e\n            \u003cli\u003e2022.3.07: Released the finetuning \u0026 inference code/checkpoints for \u003cb\u003eImage Classification\u003c/b\u003e, which achieves \u003cb\u003e85.0\u003c/b\u003e accuracy on ImageNet-1K, slightly better than reported in the OFA paper.\u003c/li\u003e\n            \u003cli\u003e2022.3.04: Released the finetuning \u0026 inference code/checkpoints for \u003cb\u003eText-to-Image Generation\u003c/b\u003e.\u003c/li\u003e\n            \u003cli\u003e2022.3.03: Released the finetuning \u0026 inference code/checkpoints for \u003cb\u003eSNLI-VE\u003c/b\u003e and \u003cb\u003eGLUE\u003c/b\u003e.\u003c/li\u003e\n            \u003cli\u003e2022.2.22: Released the finetuning \u0026 inference code/checkpoints for \u003cb\u003eVisual Question Answering\u003c/b\u003e, which can reproduce \u003cb\u003ethe reported VQA accuracy in the OFA paper (80.02 on test-std)\u003c/b\u003e. Check our results on the \u003ca href=\"https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278\"\u003eVQA Challenge\u003c/a\u003e.\u003c/li\u003e\n            \u003cli\u003e2022.2.15: Released finetuning \u0026 inference code/checkpoints for \u003cb\u003eReferring Expression Comprehension\u003c/b\u003e\u003c/li\u003e\n            \u003cli\u003e2022.2.10: Released the inference code \u0026 finetuned checkpoint for \u003cb\u003eImage Captioning\u003c/b\u003e, which can reproduce \u003cb\u003ethe results on COCO Karpathy test split (149.6 CIDEr)\u003c/b\u003e. 
OFA also achieves No.1 on the COCO image captioning online leaderboard \u003ca href='https://competitions.codalab.org/competitions/3221#results'\u003eLink\u003c/a\u003e (marked as M6-Team).\u003c/li\u003e\n        \u003c/ul\u003e\n    \u003c/p\u003e\n\u003c/details\u003e\n\u003cbr\u003e\u003c/br\u003e\n\n\n# Model Card\nWe list the parameters and pretrained checkpoints of OFAs below. For finetuned checkpoints, please refer to [checkpoints.md](checkpoints.md). \n\n\u003ctable border=\"1\" width=\"100%\"\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eModel\u003c/th\u003e\u003cth\u003eCkpt\u003c/th\u003e\u003cth\u003eParams\u003c/th\u003e\u003cth\u003eBackbone\u003c/th\u003e\u003cth\u003eHidden size\u003c/th\u003e\u003cth\u003eIntermediate size\u003c/th\u003e\u003cth\u003eNum. of heads\u003c/th\u003e\u003cth\u003eEnc layers\u003c/th\u003e\u003cth\u003eDec layers\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eOFA\u003csub\u003eTiny\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_tiny.pt\"\u003eDownload\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e33M\u003c/td\u003e\u003ctd\u003eResNet50\u003c/td\u003e\u003ctd\u003e256\u003c/td\u003e\u003ctd\u003e1024\u003c/td\u003e\u003ctd\u003e4\u003c/td\u003e\u003ctd\u003e4\u003c/td\u003e\u003ctd\u003e4\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eOFA\u003csub\u003eMedium\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_medium.pt\"\u003eDownload\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e93M\u003c/td\u003e\u003ctd\u003eResNet101\u003c/td\u003e\u003ctd\u003e512\u003c/td\u003e\u003c/td\u003e\u003ctd\u003e2048\u003c/td\u003e\u003ctd\u003e8\u003c/td\u003e\u003ctd\u003e4\u003c/td\u003e\u003ctd\u003e4\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n       
 \u003ctd\u003eOFA\u003csub\u003eBase\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_base.pt\"\u003eDownload\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e180M\u003c/td\u003e\u003ctd\u003eResNet101\u003c/td\u003e\u003ctd\u003e768\u003c/td\u003e\u003c/td\u003e\u003ctd\u003e3072\u003c/td\u003e\u003ctd\u003e12\u003c/td\u003e\u003ctd\u003e6\u003c/td\u003e\u003ctd\u003e6\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eOFA\u003csub\u003eLarge\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_large.pt\"\u003eDownload\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e470M\u003c/td\u003e\u003ctd\u003eResNet152\u003c/td\u003e\u003ctd\u003e1024\u003c/td\u003e\u003c/td\u003e\u003ctd\u003e4096\u003c/td\u003e\u003ctd\u003e16\u003c/td\u003e\u003ctd\u003e12\u003c/td\u003e\u003ctd\u003e12\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eOFA\u003csub\u003eHuge\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_huge.pt\"\u003eDownload\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e930M\u003c/td\u003e\u003ctd\u003eResNet152\u003c/td\u003e\u003ctd\u003e1280\u003c/td\u003e\u003c/td\u003e\u003ctd\u003e5120\u003c/td\u003e\u003ctd\u003e16\u003c/td\u003e\u003ctd\u003e24\u003c/td\u003e\u003ctd\u003e12\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\u003cbr\u003e\u003c/br\u003e\n\n# Results\nBelow we demonstrate the results of OFAs on cross-modal understanding and generation. 
\n\n\u003ctable border=\"1\" width=\"100%\"\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eTask\u003c/th\u003e\u003cth\u003eImage Captioning\u003c/th\u003e\u003cth\u003eVQA\u003c/th\u003e\u003cth\u003eVisual Entailment\u003c/th\u003e\u003cth colspan=\"3\"\u003eReferring Expression Comprehension\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eDataset\u003c/td\u003e\u003ctd\u003eCOCO\u003c/td\u003e\u003ctd\u003eVQA v2\u003c/td\u003e\u003ctd\u003eSNLI-VE\u003c/td\u003e\u003ctd\u003eRefCOCO\u003c/td\u003e\u003ctd\u003eRefCOCO+\u003c/td\u003e\u003ctd\u003eRefCOCOg\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eSplit\u003c/td\u003e\u003ctd\u003eKarpathy test (CE/CIDEr)\u003c/td\u003e\u003ctd\u003etest-dev/test-std\u003c/td\u003e\u003ctd\u003eval/test\u003c/td\u003e\u003ctd\u003eval/test-a/test-b\u003c/td\u003e\u003ctd\u003eval/test-a/test-b\u003c/td\u003e\u003ctd\u003eval-u/test-u\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eMetric\u003c/td\u003e\u003ctd\u003eCIDEr\u003c/td\u003e\u003ctd\u003eAcc.\u003c/td\u003e\u003ctd\u003eAcc.\u003c/td\u003e\u003ctd colspan=\"3\"\u003eAcc.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eOFA\u003csub\u003eTiny\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e119.0 / 128.7\u003c/td\u003e\u003ctd\u003e70.3 / 70.4\u003c/td\u003e\u003ctd\u003e85.3 / 85.2\u003c/td\u003e\u003ctd\u003e80.20 / 84.07 / 75.00\u003c/td\u003e\u003ctd\u003e68.22 / 75.13 / 57.66\u003c/td\u003e\u003ctd\u003e72.02 / 69.74\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eOFA\u003csub\u003eMedium\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e130.4 / 140.3\u003c/td\u003e\u003ctd\u003e75.4 / 75.5\u003c/td\u003e\u003ctd\u003e86.6 / 87.0\u003c/td\u003e\u003ctd\u003e85.34 / 87.68 / 77.92\u003c/td\u003e\u003ctd\u003e76.09 / 
83.04 / 66.25\u003c/td\u003e\u003ctd\u003e78.76 / 78.58\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eOFA\u003csub\u003eBase\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e138.2 / 146.7\u003c/td\u003e\u003ctd\u003e78.0 / 78.1\u003c/td\u003e\u003ctd\u003e89.3 / 89.2\u003c/td\u003e\u003ctd\u003e88.48 / 90.67 / 83.30\u003c/td\u003e\u003ctd\u003e81.39 / 87.15 / 74.29\u003c/td\u003e\u003ctd\u003e82.29 / 82.31\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eOFA\u003csub\u003eLarge\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e142.2 / 150.7\u003c/td\u003e\u003ctd\u003e80.4 / 80.7\u003c/td\u003e\u003ctd\u003e90.3 / 90.2\u003c/td\u003e\u003ctd\u003e90.05 / 92.93 / 85.26\u003c/td\u003e\u003ctd\u003e85.80 / 89.87 / 79.22\u003c/td\u003e\u003ctd\u003e85.89 / 86.55\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eOFA\u003csub\u003eHuge\u003c/sub\u003e\u003c/td\u003e\u003ctd\u003e145.3 / 154.9\u003c/td\u003e\u003ctd\u003e82.0 / 82.0\u003c/td\u003e\u003ctd\u003e91.0 / 91.2\u003c/td\u003e\u003ctd\u003e92.04 / 94.03 / 88.44\u003c/td\u003e\u003ctd\u003e87.86 / 91.70 / 80.71\u003c/td\u003e\u003ctd\u003e88.07 / 88.78\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\u003cbr\u003e\u003c/br\u003e\n\n# Requirements\n* python 3.7.4\n* pytorch 1.8.1\n* torchvision 0.9.1\n* JAVA 1.8 (for COCO evaluation)\n\u003cbr\u003e\u003c/br\u003e\n\n# Installation\n```bash\ngit clone https://github.com/OFA-Sys/OFA\npip install -r requirements.txt\n```\n\u003cbr\u003e\u003c/br\u003e\n\n# Datasets and Checkpoints\nSee [datasets.md](datasets.md) and [checkpoints.md](checkpoints.md).\n\u003cbr\u003e\u003c/br\u003e\n\n# Training \u0026 Inference\nBelow we provide methods for training and inference on different tasks. We provide both pretrained OFA-Large and OFA-Base in [checkpoints.md](checkpoints.md). 
The scripts mentioned in this section are prepared for OFA-Large. To reproduce the downstream results of OFA-Base, we have also provided the corresponding finetuning and inference scripts for OFA-Base in the `run_scripts/` folder.\n\nWe recommend organizing your workspace directory like this: \n```\nOFA/\n├── checkpoints/\n│   ├── ofa_base.pt\n│   ├── ofa_large.pt\n│   ├── caption_large_best_clean.pt\n│   └── ...\n├── criterions/\n├── data/\n├── dataset/\n│   ├── caption_data/\n│   ├── gigaword_data/\n│   └── ...\n├── fairseq/\n├── models/\n├── run_scripts/\n├── tasks/\n├── train.py\n├── trainer.py\n└── utils/\n```\n\n\n## Image Processing\nTo process data efficiently, we do not store images as many small files; instead, we encode them as base64 strings.\nTransforming image files to base64 strings is simple. Run the following code:\n```python\nfrom PIL import Image\nfrom io import BytesIO\nimport base64\n\nimg = Image.open(file_name) # path to file\nimg_buffer = BytesIO()\nimg.save(img_buffer, format=img.format)\nbyte_data = img_buffer.getvalue()\nbase64_str = base64.b64encode(byte_data) # bytes\nbase64_str = base64_str.decode(\"utf-8\") # str\n```\n\n## Pretraining\nBelow we provide methods for pretraining OFA.\n\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e1. Prepare the Dataset\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        To pretrain OFA, you should first download the dataset we provide (\u003ca href=\"https://ofa-beijing.oss-cn-beijing.aliyuncs.com/datasets/pretrain_data/pretrain_data_examples.zip\"\u003epretrain_data_examples.zip\u003c/a\u003e, a small subset of the original pretraining data). For your custom pretraining datasets, please prepare your training samples in the same format. 
\u003ccode\u003epretrain_data_examples.zip\u003c/code\u003e contains 4 TSV files: \u003ccode\u003evision_language_examples.tsv\u003c/code\u003e, \u003ccode\u003etext_examples.tsv\u003c/code\u003e, \u003ccode\u003eimage_examples.tsv\u003c/code\u003e and \u003ccode\u003edetection_examples.tsv\u003c/code\u003e. Details of these files are as follows: \n        \u003cbr /\u003e\n        \u003cul type=\"circle\"\u003e\n            \u003cli\u003e\u003cb\u003evision_language_examples.tsv\u003c/b\u003e:\n    Each line contains uniq-id, image (base64 string), caption, question, answer, ground-truth objects (objects appearing in the caption or question), dataset name (source of the data) and task type (caption, qa or visual grounding). Prepared for the pretraining tasks of visual grounding, grounded captioning, image-text matching, image captioning and visual question answering. \u003c/li\u003e\n            \u003cli\u003e\u003cb\u003etext_examples.tsv\u003c/b\u003e: Each line contains uniq-id and text. Prepared for the pretraining task of text infilling. \u003c/li\u003e \n            \u003cli\u003e\u003cb\u003eimage_examples.tsv\u003c/b\u003e: Each line contains uniq-id, image (base64 string, should be resized to 256*256 resolution) and image-code (generate the sparse codes for the central part of image through VQ-GAN). Prepared for the pretraining task of image infilling. \u003c/li\u003e\n            \u003cli\u003e\u003cb\u003edetection_examples.tsv\u003c/b\u003e: Each line contains uniq-id, image (base64 string) and bounding box annotations (contains the top-left and bottom-right coordinates of the bounding box, object_id and object_name, separated by commas). Prepared for the pretraining task of detection. 
\u003c/li\u003e\n        \u003c/ul\u003e\n        In addition, the folder negative_sample in pretrain_data_examples.zip contains three files \u003ccode\u003eall_captions.txt\u003c/code\u003e, \u003ccode\u003eobject.txt\u003c/code\u003e and \u003ccode\u003etype2ans.json\u003c/code\u003e. The data in these files are used as negative samples for the image-text matching (ITM) task.\n    \u003c/p\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e2. Pretraining\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        By default, the pretraining script will attempt to restore the released pretrained checkpoints of OFA-Base or OFA-Large and perform continuous pretraining. Continuous pretraining is more recommended, which achieves much better results compared with pretraining from scratch. For continuous pretraining, please download the pretrained weights in advance (see \u003ca href='checkpoints.md'\u003echeckpoints.md\u003c/a\u003e) and put them in the correct directory \u003ccode\u003eOFA/checkpoints/\u003c/code\u003e. If not, the pretraining will begin from scratch.\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/pretraining\nbash pretrain_ofa_large.sh # Pretrain OFA-Large. For OFA-Base, use pretrain_ofa_base.sh\n\u003c/pre\u003e\n    \u003cp\u003e\n        If the pretrained OFA checkpoint is restored successfully, you will see the following information in the log:\n    \u003c/p\u003e\n\u003cpre\u003e\nINFO: Loaded checkpoint ../../checkpoints/ofa_large.pt\n\u003c/pre\u003e\n\u003c/details\u003e\n\n## Image Captioning\nWe provide procedures to reproduce our results of image captioning on our paper below.\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e1. 
Prepare the Dataset \u0026 Checkpoints\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Download data (see \u003ca href='datasets.md'\u003edatasets.md\u003c/a\u003e) and models (see \u003ca href='checkpoints.md'\u003echeckpoints.md\u003c/a\u003e) and put them in the correct directory. The dataset zipfile \u003ccode\u003ecaption_data.zip\u003c/code\u003e contains caption_stage1_train.tsv, caption_stage2_train.tsv, caption_val.tsv and caption_test.tsv. Each image corresponds to only 1 caption in \u003ccode\u003ecaption_stage1_train.tsv\u003c/code\u003e and corresponds to multiple captions in other TSV files (about 5 captions per image). Each line of the dataset represents a caption sample with the following format. The information of uniq-id, image-id, caption, predicted object labels (taken from \u003ca href='https://github.com/pzzhang/VinVL'\u003eVinVL\u003c/a\u003e, not used), image base64 string are separated by tabs.\n    \u003c/p\u003e\n\u003cpre\u003e\n162365  12455   the sun sets over the trees beyond some docks.  sky\u0026\u0026water\u0026\u0026dock\u0026\u0026pole  /9j/4AAQSkZJ....UCP/2Q==\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e2. Finetuning\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Following previous standard practice, we divide the finetuning process of image captioning into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 NVIDIA-V100 GPUs with 32GB memory (expected to obtain ~139.5 CIDEr on the validation set at this stage). In stage 2, we select the best checkpoint of stage 1 and train with CIDEr optimization on 8 NVIDIA-V100 GPUs. \u003cb\u003eNote that CIDEr optimization is very unstable and requires careful hyperparameter tuning. If you encounter training errors in the stage2 finetuning, you can increase the batch size or reduce the learning rate. 
If neither of these works, you can directly set \u003c/b\u003e\u003ccode\u003e--freeze-resnet\u003c/code\u003e\u003cb\u003e to freeze the inner states of batch normalization.\u003c/b\u003e\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/caption\nnohup sh train_caption_stage1.sh \u003e train_stage1.out \u0026  # stage 1, train with cross-entropy loss\nnohup sh train_caption_stage2.sh \u003e train_stage2.out \u0026  # stage 2, load the best ckpt of stage1 and train with CIDEr optimization \n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e3. Inference\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Run the following commands to get your results and evaluate your model.\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/caption ; sh evaluate_caption.sh  # inference \u0026 evaluate\n\u003c/pre\u003e\n\u003c/details\u003e\n\n## Text-to-Image Generation \nThis part provides procedures for the finetuning and inference of text-to-image generation. See below.\n\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e1. Prepare the Dataset \u0026 Checkpoints\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Download data (see \u003ca href=\"datasets.md\"\u003edatasets.md\u003c/a\u003e) and models (see \u003ca href=\"checkpoints.md\"\u003echeckpoints.md\u003c/a\u003e) and put them in the correct directory. The dataset zipfile \u003ccode\u003ecoco_image_gen.zip\u003c/code\u003e contains \u003ccode\u003ecoco_vqgan_train.tsv\u003c/code\u003e, \u003ccode\u003ecoco_vqgan_dev.tsv\u003c/code\u003e and \u003ccode\u003ecoco_vqgan_full_test.tsv\u003c/code\u003e. Each line of the dataset represents a sample with the following format. 
The information of uniq-id, image-code (produced by \u003ca href=\"https://github.com/CompVis/taming-transformers\"\u003evqgan\u003c/a\u003e, a list of integers separated by single-whitespaces), lowercased caption are separated by tabs.\n    \u003c/p\u003e\n\u003cpre\u003e\n1\t6674 4336 4532 5334 3251 5461 3615 2469 ...4965 4190 1846\tthe people are posing for a group photo.\n\u003c/pre\u003e\n    \u003cp\u003e\n        The checkpoint zipfile \u003ccode\u003eimage_gen_large_best.zip\u003c/code\u003e contains \u003ccode\u003eimage_gen_large_best.pt\u003c/code\u003e, \u003ccode\u003evqgan/last.ckpt\u003c/code\u003e, \u003ccode\u003evqgan/model.yaml\u003c/code\u003e and \u003ccode\u003eclip/Vit-B-16.pt\u003c/code\u003e. \n    \u003c/p\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e2. Shuffle the Training Data\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        (Optional, but achieves better result): If the disk storage is sufficient, we recommend to prepare the shuffled training data for each epoch in advance. \n    \u003c/p\u003e\n\u003cpre\u003e\ncd dataset/image_gen\nln coco_vqgan_train.tsv coco_vqgan_train_1.tsv\nfor idx in `seq 1 9`;do shuf coco_vqgan_train_${idx}.tsv \u003e coco_vqgan_train_$[${idx}+1].tsv;done # each file is used for an epoch\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e3. Finetuning\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Following previous practice, we divide the finetuning process of image generating into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 8-V100-32G-GPU servers (expected to obtain ~32.5+ CLIP Score on the validation set at this stage). In stage 2, we select the last checkpoint of stage 1 and train with CLIP Score optimization on 4 8-V100-32G-GPU servers (expected to obtain ~34.0+ CLIP Score on the validation set at this stage). 
During the validation, the generated image will be dumped into \u003ccode\u003e_GEN_IMAGE_PATH_\u003c/code\u003e. \n    \u003c/p\u003e\n\u003cpre\u003e\n# run on each worker after the distributed and data configs have been correctly set following the guide in train_image_gen_stage1_distributed.sh \ncd run_scripts/image_gen\nnohup sh train_image_gen_stage1_distributed.sh # stage 1, train with cross-entropy loss\nnohup sh train_image_gen_stage2_distributed.sh # stage 2, load the last ckpt of stage1 and train with CLIP Score optimization \n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e4. Inference\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Run the command below to generate your images. \n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/image_gen ; sh evaluate_image_gen.sh  # inference \u0026 evaluate (FID, IS and CLIP Score)\n\u003c/pre\u003e\n\u003c/details\u003e\n\n## Visual Question Answering\nHere we provide the finetuning and inference codes to reproduce the VQAv2 result reported in our paper (**test-std 80.02**). We believe much improvement on accuracy can still be achieved based on this codebase :)\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e1. Prepare the Dataset \u0026 Checkpoints\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Download data (see \u003ca href=\"datasets.md\"\u003edatasets.md\u003c/a\u003e) and models (see \u003ca href=\"checkpoints.md\"\u003echeckpoints.md\u003c/a\u003e) and put them in the correct directory. The dataset zipfile \u003ccode\u003evqa_data.zip\u003c/code\u003e is around 100G and the decompressed data costs around 135G disk storage, which contains the training, validation and testing samples together with other necessary data resources. (Since \u003ccode\u003evqa_data.zip\u003c/code\u003e is large in size, we have also provided chunked parts of the dataset files for more convenient and stable downloading. 
Please refer to \u003ca href=\"https://github.com/OFA-Sys/OFA/issues/68#issuecomment-1096837349\"\u003eissue #68\u003c/a\u003e.) Following common practice, VG-QA samples are also included in the training data. To adapt to the seq2seq paradigm of OFA, we transform original VQA training questions with multiple golden answers into multiple training samples. For the original VQA validation set, we keep around 10k samples for our validation and use the other samples for training. Each line of the dataset represents a VQA sample with the following format. The fields question-id, image-id, question, answer (with confidence), predicted object labels (taken from \u003ca href=\"https://github.com/pzzhang/VinVL\"\u003eVinVL\u003c/a\u003e; they bring around +0.1 accuracy improvement), and image base64 string are separated by tabs. \n    \u003c/p\u003e\n\u003cpre\u003e\n79459   79459   is this person wearing shorts?  0.6|!+no    house\u0026\u0026short\u0026\u0026...\u0026\u0026sky  /9j/4AAQS...tigZ/9k=\n\u003c/pre\u003e\n    \u003cp\u003e\n        For fine-tuning on custom VQA-formulated tasks, please refer to issues \u003ca href=\"https://github.com/OFA-Sys/OFA/issues/76\"\u003e#76\u003c/a\u003e, \u003ca href=\"https://github.com/OFA-Sys/OFA/issues/105\"\u003e#105\u003c/a\u003e and \u003ca href=\"https://github.com/OFA-Sys/OFA/issues/73\"\u003e#73\u003c/a\u003e for more information.\n    \u003c/p\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e2. Shuffle the Training Data\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        (Optional, but achieves better finetuning accuracy): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance. 
In our experiments, shuffling brings around \u003cb\u003e+0.3\u003c/b\u003e improvement on VQA accuracy.\n    \u003c/p\u003e\n\u003cpre\u003e\ncd dataset/vqa_data\nln vqa_train.tsv vqa_train_1.tsv\nfor idx in $(seq 1 9); do shuf vqa_train_${idx}.tsv \u003e vqa_train_$((idx+1)).tsv; done # each file is used for an epoch\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e3. Finetuning\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        In our experiments, the VQA finetuning is performed on 4 8-A100-GPU servers (\u003ci\u003ewith RDMA\u003c/i\u003e). We provide the finetuning script \u003ccode\u003etrain_vqa_distributed.sh\u003c/code\u003e, which supports multi-server distributed training (as well as single-server training). Please refer to the comments at the beginning of the script and set the configs correctly according to your distributed environment. If you have shuffled the training data in the previous step, please correctly specify the training data path following the guide in the script comments. \u003cb\u003eThe command should be run on each worker.\u003c/b\u003e \n    \u003c/p\u003e\n\u003cpre\u003e\n# run on each worker after the distributed and data configs have been correctly set following the guide in train_vqa_distributed.sh \ncd run_scripts/vqa\nbash train_vqa_distributed.sh \n\u003c/pre\u003e\n    \u003cp\u003e\n        In our experiments, the finetuning takes around 36 hours (for 12 epochs). After each epoch, an evaluation on the validation set is performed. The best validation accuracy during finetuning will be around 80.8. The log is saved in \u003ccode\u003e${log_dir}\u003c/code\u003e.\n    \u003c/p\u003e\n    \u003cp\u003e\n        \u003ci\u003e(Update on validation time-cost)\u003c/i\u003e As described in the \u003ci\u003e4. Inference\u003c/i\u003e section, we provide 2 types of inference: beam-search and all-candidate inference. 
By default, all-candidate inference is used for validation during fine-tuning, which achieves better accuracy but takes much longer. We have added an option to the training scripts, \u003ccode\u003e--val-inference-type\u003c/code\u003e, to switch the validation inference type during fine-tuning. If you feel the validation takes too long, you can refer to \u003ca href=\"https://github.com/OFA-Sys/OFA/pull/79\"\u003ePR #79\u003c/a\u003e to activate beam-search validation, which takes much less time, at the cost of a validation score around 0.5-0.6 lower than with all-candidate validation.\n    \u003c/p\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e4. Inference\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        We provide 2 types of inference, \u003cb\u003ebeam-search\u003c/b\u003e (much faster but with sub-optimal accuracy) and \u003cb\u003eall-candidate evaluation\u003c/b\u003e (slower but with the best accuracy). \u003cbr\u003e\u003c/br\u003e\n        For beam-search inference, use the script \u003ccode\u003eevaluate_vqa_beam.sh\u003c/code\u003e. Refer to the command below. Inference on the test set takes around 16 GPU hours. After inference on the test set, the result JSON file will be dumped to the \u003ccode\u003e${result_path}\u003c/code\u003e defined in the shell script. You can submit the result \u003ccode\u003etest_predict.json\u003c/code\u003e to \u003ca href=\"https://eval.ai/web/challenges/challenge-page/830/overview\"\u003eEvalAI\u003c/a\u003e. 
Using our released finetuned checkpoint, beam-search inference gets 80.15 validation accuracy, 79.36 test-dev accuracy and 79.48 test-std accuracy (around 0.6 lower than all-candidate evaluation).\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/vqa\nbash evaluate_vqa_beam.sh val # specify 'val' or 'test'\n\u003c/pre\u003e\n    \u003cp\u003e\n        For all-candidate evaluation, we recommend using the distributed script \u003ccode\u003eevaluate_vqa_allcand_distributed.sh\u003c/code\u003e. Please refer to the guide in the script to set the distributed configs before running. The result JSON file will be dumped to the \u003ccode\u003e${result_path}\u003c/code\u003e defined in the shell script on the rank-0 server. All-candidate evaluation computes scores on all the candidate answers in the VQA dataset, which achieves \u003cb\u003e80.82\u003c/b\u003e validation accuracy, \u003cb\u003e79.87\u003c/b\u003e test-dev accuracy and \u003cb\u003e80.02\u003c/b\u003e test-std accuracy, reproducing the results reported in our paper. However, inference on the test set takes around 1k GPU hours, which is much slower than beam-search inference.\n    \u003c/p\u003e\n\u003cpre\u003e\n# run on each worker after the distributed configs have been correctly set following the guide in evaluate_vqa_allcand_distributed.sh\ncd run_scripts/vqa\nbash evaluate_vqa_allcand_distributed.sh val # specify 'val' or 'test'\n\u003c/pre\u003e\n\u003c/details\u003e\n\n## Visual Grounding (Referring Expression Comprehension)\nHere we provide procedures for you to prepare data, train, and evaluate your model on visual grounding. \n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e1. Prepare the Dataset \u0026 Checkpoints\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Download data (see \u003ca href='datasets.md'\u003edatasets.md\u003c/a\u003e) and models (see \u003ca href='checkpoints.md'\u003echeckpoints.md\u003c/a\u003e) and put them in the correct directory. 
We provide RefCOCO (split by UNC), RefCOCO+ (split by UNC) and RefCOCOg (split by UMD) datasets. See \u003ca href='https://www.tensorflow.org/datasets/catalog/ref_coco'\u003eRefCOCO\u003c/a\u003e and \u003ca href=\"https://github.com/lichengunc/refer\"\u003eRefer\u003c/a\u003e for more details. Note that in the original dataset, each region-coord (or bounding box) may correspond to multiple descriptive texts. We split these texts into multiple samples so that the region-coord in each sample corresponds to only one text. Each line of the processed dataset represents a sample with the following format. The fields uniq-id, image-id, text, region-coord (separated by commas), and image base64 string are separated by tabs.\n    \u003c/p\u003e\n\u003cpre\u003e\n79_1    237367  A woman in a white blouse holding a glass of wine.  230.79,121.75,423.66,463.06 9j/4AAQ...1pAz/9k=\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e2. Finetuning\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Unlike the original paper, we finetune OFA with a drop-path rate of 0.2, which we found achieves better results. We will update the reported results of the paper later.\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/refcoco\nnohup sh train_refcoco.sh \u003e train_refcoco.out \u0026  # finetune for refcoco\nnohup sh train_refcocoplus.sh \u003e train_refcocoplus.out \u0026  # finetune for refcoco+\nnohup sh train_refcocog.sh \u003e train_refcocog.out \u0026  # finetune for refcocog\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e3. Inference\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Run the following command for the evaluation. 
\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/refcoco ; sh evaluate_refcoco.sh  # inference \u0026 evaluate for refcoco/refcoco+/refcocog\n\u003c/pre\u003e\n\u003c/details\u003e\n\n## Visual Entailment\nWe provide steps for you to reproduce our results in visual entailment. See the details below.\n\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e1. Prepare the Dataset \u0026 Checkpoints\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Download data (see \u003ca href=\"datasets.md\"\u003edatasets.md\u003c/a\u003e) and models (see \u003ca href=\"checkpoints.md\"\u003echeckpoints.md\u003c/a\u003e) and put them in the correct directory. Each line of the processed dataset represents a sample with the following format. The information of uniq-id, image-id, image base64 string, hypothesis, caption (or text premise), label are separated by tabs.\n    \u003c/p\u003e\n\u003cpre\u003e\n252244149.jpg#1r1n  252244149   /9j/4AAQ...MD/2Q==   a man in pink and gold is chewing on a wooden toothpick.   a man in pink is chewing a toothpick on the subway.   neutral \n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e2. Finetuning\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        In our experiments, the SNLI-VE finetuning is performed on 8 NVIDIA-V100 GPUs with 32GB memory. In this task, we experimented with only a few sets of hyperparameters. We believe that proper hyperparameter tuning can lead to further accuracy improvement.\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/snli_ve\nnohup sh train_snli_ve.sh \u003e train_snli_ve.out \u0026  # finetune for snli_ve\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e3. 
Inference\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Run the following command to obtain the results.\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/snli_ve ; sh evaluate_snli_ve.sh dev  # specify 'dev' or 'test'\n\u003c/pre\u003e\n\u003c/details\u003e\n   \n## GLUE\nHere we provide steps for you to finetune and evaluate our model on language understanding tasks. We demonstrate our practice for the GLUE benchmark. \n\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e1. Prepare the Dataset \u0026 Checkpoints\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Download data (see \u003ca href=\"datasets.md\"\u003edatasets.md\u003c/a\u003e) and models (see \u003ca href=\"checkpoints.md\"\u003echeckpoints.md\u003c/a\u003e) and put them in the correct directory. We provide 7 language understanding datasets from the GLUE benchmark, including COLA, MNLI, MRPC, QNLI, QQP, RTE and SST2. More details about these datasets can be found at this \u003ca href=\"https://openreview.net/pdf?id=rJ4km2R5t7\"\u003elink\u003c/a\u003e.\n    \u003c/p\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e2. Finetuning\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        For each task, we have tried multiple sets of hyperparameters (including learning rate, batch size and training epochs). 
The results under different sets of hyperparameters can be found in \u003ccode\u003e${log_dir}\u003c/code\u003e.\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/glue\nnohup sh train_cola.sh \u003e train_cola.out \u0026  # finetune for cola\nnohup sh train_mnli.sh \u003e train_mnli.out \u0026  # finetune for mnli\nnohup sh train_mrpc.sh \u003e train_mrpc.out \u0026  # finetune for mrpc\nnohup sh train_qnli.sh \u003e train_qnli.out \u0026  # finetune for qnli\nnohup sh train_qqp.sh \u003e train_qqp.out \u0026  # finetune for qqp\nnohup sh train_rte.sh \u003e train_rte.out \u0026  # finetune for rte\nnohup sh train_sst2.sh \u003e train_sst2.out \u0026  # finetune for sst2\n\u003c/pre\u003e\n\u003c/details\u003e\n\n## Image Classification on ImageNet-1K\nWe provide the finetuning and inference codes which reproduce **85.0 ImageNet-1K accuracy**, slightly better than reported in our paper. \n\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e1. Prepare the Dataset \u0026 Checkpoints\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Download data (see \u003ca href=\"datasets.md\"\u003edatasets.md\u003c/a\u003e) and models (see \u003ca href=\"checkpoints.md\"\u003echeckpoints.md\u003c/a\u003e) and put them in the correct directory. Our provided data is derived from the original \u003ca href=\"http://image-net.org/\"\u003eImageNet-1K\u003c/a\u003e (ILSVRC2012 train \u0026 validation) dataset and shares the same data split with it. To formulate the classification task into seq2seq paradigm, we use the \u003ca href=\"https://github.com/HoldenCaulfieldRye/caffe/blob/master/data/ilsvrc12/synset_words.txt\"\u003esynset words\u003c/a\u003e provided by Caffe as the generation target for each image class. Each line of the processed dataset represents a sample with the following format. 
The fields image base64 string, classification label (1-indexed, conforming to the order in \u003ccode\u003esynset_words.txt\u003c/code\u003e), and synset words of the label are separated by tabs.\n    \u003c/p\u003e\n\u003cpre\u003e\n_9j_4AAQS...fzX__Z  769 rugby ball\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e2. Shuffle the Training Data\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        (Optional, but achieves better finetuning accuracy): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance. In our experiments, shuffling brings around \u003cb\u003e+0.2\u003c/b\u003e improvement on ImageNet-1K accuracy.\n    \u003c/p\u003e\n\u003cpre\u003e\ncd dataset/imagenet_1k_data\nln imagenet_1k_train.tsv imagenet_1k_train_1.tsv\nfor idx in $(seq 1 9); do shuf imagenet_1k_train_${idx}.tsv \u003e imagenet_1k_train_$((idx+1)).tsv; done # each file is used for one epoch, in order\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e3. Finetuning\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        In our experiments, the ImageNet-1K finetuning is performed on 2 8-A100-GPU servers (\u003ci\u003ewith RDMA\u003c/i\u003e). We provide the finetuning script \u003ccode\u003etrain_imagenet_distributed.sh\u003c/code\u003e, which supports multi-server distributed training (as well as single-server training). Please refer to the comments at the beginning of the script and set the configs correctly according to your distributed environment. If you have shuffled the training data in the previous step, please correctly specify the training data path following the guide in the script comments. 
\u003cb\u003eThe command should be run on each worker.\u003c/b\u003e For quick evaluation during finetuning, by default we sample 20% of the original validation split and report accuracy on this subset after each epoch. The accuracy on the validation subset is generally ±0.1 relative to accuracy on the whole validation split.\n    \u003c/p\u003e\n\u003cpre\u003e\n# run on each worker after the distributed and data configs have been correctly set following the guide in train_imagenet_distributed.sh\ncd run_scripts/image_classify\nbash train_imagenet_distributed.sh\n\u003c/pre\u003e\n    \u003cp\u003e\n        In our experiments, the finetuning costs around 80 hours (for 32 epochs). The best accuracy on validation subset during finetuning will be around 85.0. The log is saved in \u003ccode\u003e${log_dir}\u003c/code\u003e.\n    \u003c/p\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e4. Inference\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        To get the validation accuracy on the whole ImageNet-1K validation set, run the following command. The evaluation costs around 10 GPU hours. The accuracy will be reported in the stdout (expected to be around \u003cb\u003e85.0\u003c/b\u003e).\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/image_classify ; sh evaluate_imagenet.sh  # inference \u0026 evaluate for imagenet-1k\n\u003c/pre\u003e\n\u003c/details\u003e \n\n## Gigaword\nWe provide steps for you to reproduce our results in Gigaword. See the details below.\n\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e1. Prepare the Dataset \u0026 Checkpoints\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Download data (see \u003ca href=\"datasets.md\"\u003edatasets.md\u003c/a\u003e) and models (see \u003ca href=\"checkpoints.md\"\u003echeckpoints.md\u003c/a\u003e) and put them in the correct directory. 
The original dataset is taken from \u003ca href=\"https://github.com/microsoft/unilm/\"\u003eUniLM\u003c/a\u003e and we organized the data into TSV format. Each line of the processed dataset represents a sample with the following format. The source and target texts are separated by tabs.\n    \u003c/p\u003e\n\u003cpre\u003e\nfactory orders for manufactured goods rose #.# percent in september...  us september factory orders up #.# percent\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e2. Finetuning\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Run the following command to train the model.\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/gigaword\nnohup sh train_gigaword.sh \u003e train_gigaword.out \u0026  # finetune for gigaword\n\u003c/pre\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003e3. Inference\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003e\n        Run the following command to obtain the results (~36.43 ROUGE-L).\n    \u003c/p\u003e\n\u003cpre\u003e\ncd run_scripts/gigaword ; sh evaluate_gigaword.sh  # inference \u0026 evaluate for gigaword\n\u003c/pre\u003e\n\u003c/details\u003e \n\n\u003cbr\u003e\u003c/br\u003e\n\n# Gallery\nBelow we provide examples of OFA in text-to-image generation and open-ended VQA. We also demonstrate its performance on an unseen task (Grounded QA) and in an unseen domain (Visual Grounding on images from unseen domains). 
\n\n## Text-to-Image Generation\n\n![case1](examples/case1.png)\n\n\n## Open-Ended VQA\n![open_vqa](examples/open_vqa.png)\n\n## Grounded QA (unseen task)\n![grounded_qa](examples/grounded_qa.png)\n\n## Visual Grounding (unseen domain)\n![vg](examples/viusal_grounding.png)\n\u003cbr\u003e\u003c/br\u003e\n\n# Related Codebase\n* [Fairseq](https://github.com/pytorch/fairseq)\n* [taming-transformers](https://github.com/CompVis/taming-transformers)\n\u003cbr\u003e\u003c/br\u003e\n\n\n# Getting Involved\nFeel free to submit GitHub issues or pull requests. Contributions to our project are welcome!\n\nTo contact us, do not hesitate to send an email to `zheluo.wp@alibaba-inc.com` or `junyang.ljy@alibaba-inc.com`!\n\u003cbr\u003e\u003c/br\u003e\n\n\n# Citation\nPlease cite our papers if you find them helpful :)\n\n```\n@article{wang2022ofa,\n  author    = {Peng Wang and\n               An Yang and\n               Rui Men and\n               Junyang Lin and\n               Shuai Bai and\n               Zhikang Li and\n               Jianxin Ma and\n               Chang Zhou and\n               Jingren Zhou and\n               Hongxia Yang},\n  title     = {OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence\n               Learning Framework},\n  journal   = {CoRR},\n  volume    = {abs/2202.03052},\n  year      = {2022}\n}\n```\n\u003cbr\u003e\u003c/br\u003e\n```\n@article{ofa_ocr,\n  author       = {Junyang Lin and\n                  Xuancheng Ren and\n                  Yichang Zhang and\n                  Gao Liu and\n                  Peng Wang and\n                  An Yang and\n                  Chang Zhou},\n  title        = {Transferring General Multimodal Pretrained Models to Text Recognition},\n  journal      = {CoRR},\n  volume       = {abs/2212.09297},\n  year         = {2022}\n}\n```\n\u003cbr\u003e\u003cbr\u003e\n```\n@article{ofa_prompt,\n  author       = {Hao Yang and\n                  Junyang Lin and\n                  An 
Yang and\n                  Peng Wang and\n                  Chang Zhou and\n                  Hongxia Yang},\n  title        = {Prompt Tuning for Generative Multimodal Pretrained Models},\n  journal      = {CoRR},\n  volume       = {abs/2208.02532},\n  year         = {2022}\n}\n```\n\u003cbr\u003e\u003cbr\u003e\n```\n@article{mmspeech,\n  title={MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition},\n  author={Zhou, Xiaohuan and Wang, Jiaming and Cui, Zeyu and Zhang, Shiliang and Yan, Zhijie and Zhou, Jingren and Zhou, Chang},\n  journal={arXiv preprint arXiv:2212.00500},\n  year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofa-sys%2Fofa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fofa-sys%2Fofa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofa-sys%2Fofa/lists"}