{"id":28102506,"url":"https://github.com/ofa-sys/one-peace","last_synced_at":"2025-05-13T19:55:14.909Z","repository":{"id":167048998,"uuid":"642596800","full_name":"OFA-Sys/ONE-PEACE","owner":"OFA-Sys","description":"A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities","archived":false,"fork":false,"pushed_at":"2024-10-06T04:13:22.000Z","size":31306,"stargazers_count":977,"open_issues_count":9,"forks_count":63,"subscribers_count":14,"default_branch":"main","last_synced_at":"2024-11-29T15:50:51.633Z","etag":null,"topics":["audio-language","contrastive-loss","foundation-models","multimodal","representation-learning","vision-and-language","vision-language","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OFA-Sys.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-18T23:53:24.000Z","updated_at":"2024-11-29T05:22:10.000Z","dependencies_parsed_at":"2023-10-16T19:42:41.846Z","dependency_job_id":"7609feca-2b4d-4515-91ba-5464754a18af","html_url":"https://github.com/OFA-Sys/ONE-PEACE","commit_stats":null,"previous_names":["ofa-sys/one-peace"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FONE-PEACE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FONE-PEACE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/OFA-Sys%2FONE-PEACE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FONE-PEACE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OFA-Sys","download_url":"https://codeload.github.com/OFA-Sys/ONE-PEACE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254019082,"owners_count":22000594,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-language","contrastive-loss","foundation-models","multimodal","representation-learning","vision-and-language","vision-language","vision-transformer"],"created_at":"2025-05-13T19:55:14.228Z","updated_at":"2025-05-13T19:55:14.898Z","avatar_url":"https://github.com/OFA-Sys.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!---\nCopyright 2023 The OFA-Sys Team. 
\nAll rights reserved.\nThis source code is licensed under the Apache 2.0 license found in the LICENSE file in the root directory.\n--\u003e\n\n\n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"assets/logo.png\" width=\"350\" /\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\u003cp align=\"center\"\u003e\n        📖 \u003ca href=\"https://arxiv.org/abs/2305.11172\"\u003ePaper\u003c/a\u003e\u0026nbsp\u0026nbsp ｜ \u0026nbsp🤗 \u003ca href=\"https://huggingface.co/spaces/OFA-Sys/ONE-PEACE_Multimodal_Retrieval\"\u003eDemo\u003c/a\u003e\u0026nbsp\u0026nbsp | \u0026nbsp\u0026nbsp🤖 \u003ca href=\"https://modelscope.cn/models/damo/ONE-PEACE-4B/summary\"\u003eModelScope\u003c/a\u003e\u0026nbsp\u0026nbsp | \u0026nbsp\u0026nbsp\u003ca href=\"checkpoints.md\"\u003eCheckpoints\u003c/a\u003e\u0026nbsp ｜ \u0026nbsp\u003ca href=\"datasets.md\"\u003eDatasets\u003c/a\u003e\n\u003c/p\u003e\n\u003cbr\u003e\n\nONE-PEACE is a general representation model across vision, audio, and language modalities.\nWithout using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results in vision, \naudio, audio-language, and vision-language tasks.\nFurthermore, ONE-PEACE possesses a strong emergent zero-shot retrieval capability, enabling it to align modalities\nthat are not paired in the training data.\n\nThe figure below shows the architecture and pretraining tasks of ONE-PEACE.\nWith the scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to expand to unlimited modalities.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/one_peace.png\" width=100%\u003e\n\u003c/p\u003e\n\n\u003cbr\u003e\n\n# Online Demo\nWe provide an [online demo](https://huggingface.co/spaces/OFA-Sys/ONE-PEACE_Multimodal_Retrieval) on Hugging Face Spaces. 
In this demo, you can combine multiple modalities to retrieve related images, such as audio-to-image, audio+text-to-image, audio+image-to-image, and even audio+image+text-to-image.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/demo.png\" width=100%\u003e\n\u003c/p\u003e\n\u003cbr\u003e\n\n# News\n* **2023.7.20:** Released the [visual grounding API](https://github.com/OFA-Sys/ONE-PEACE#visual-grounding), you can use it to locate objects from the picture.\n* **2023.6.23:** Released vision tasks fine-tuning scripts and checkpoints. See [guidance for vision tasks](one_peace_vision/README.md) for more details.\n* **2023.6.04:** Released the pretraining scripts. See [guidance for pretraining](one_peace/README.md/##Pretraining) for more details.\n* **2023.5.30:** Released the finetuned checkpoints and scripts for audio(-language) tasks.\n* **2023.5.29:** Released the finetuned checkpoints for vision-language tasks.\n* **2023.5.27:** 🔥 We have provided the [multimodal retrieval demo](https://huggingface.co/spaces/OFA-Sys/ONE-PEACE_Multimodal_Retrieval) in huggingface spaces. Have Fun!\n* **2023.5.25:** Released the [multimodal embedding API](https://github.com/OFA-Sys/ONE-PEACE#multi-modal-embedding), which enables the quick extraction for image, audio and text representations.\n* **2023.5.23:** Released the [pretrained checkpoint](checkpoints.md), as well as [finetuning \u0026 inference scripts](one_peace/README.md) for vision-language tasks.\n* **2023.5.19:** Released the paper and code. Pretrained \u0026 finetuned checkpoints, training \u0026 inference scripts, as well as demos will be released as soon as possible.\n\u003cbr\u003e\u003c/br\u003e\n\n# Models and Results\n## Model Card\nWe list the parameters and pretrained checkpoints of ONE-PEACE below. 
Note that ONE-PEACE can be disassembled into different branches to handle different tasks.\nWe also provide the vision-branch of ONE-PEACE, which can be used to perform vision tasks.\n\n\u003ctable border=\"1\" width=\"100%\"\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eModel\u003c/th\u003e\u003cth\u003eCkpt\u003c/th\u003e\u003cth\u003eParams\u003c/th\u003e\u003cth\u003eHidden size\u003c/th\u003e\u003cth\u003eIntermediate size\u003c/th\u003e\u003cth\u003eAttention heads\u003c/th\u003e\u003cth\u003eLayers\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eONE-PEACE\u003c/td\u003e\u003ctd\u003e\u003ca href=\"http://one-peace-shanghai.oss-accelerate.aliyuncs.com/one-peace.pt\"\u003eDownload\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e4B\u003c/td\u003e\u003ctd\u003e1536\u003c/td\u003e\u003ctd\u003e6144\u003c/td\u003e\u003ctd\u003e24\u003c/td\u003e\u003ctd\u003e40\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eONE-PEACE\u003cbr\u003e(Vision Branch)\u003c/td\u003e\u003ctd\u003e\u003ca href=\"https://one-peace-shanghai.oss-accelerate.aliyuncs.com/one_peace_checkpoints/one-peace-vision.pkl\"\u003eDownload\u003c/a\u003e\u003c/td\u003e\u003ctd\u003e1.5B\u003c/td\u003e\u003ctd\u003e1536\u003c/td\u003e\u003ctd\u003e6144\u003c/td\u003e\u003ctd\u003e24\u003c/td\u003e\u003ctd\u003e40\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\u003cbr\u003e\n\n## Results\n### Vision Tasks\n\u003ctable border=\"1\" width=\"100%\"\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eTask\u003c/th\u003e\u003cth\u003eImage classification\u003c/th\u003e\u003cth\u003eSemantic Segmentation\u003c/th\u003e\u003cth\u003eObject Detection (w/o Object365)\u003c/th\u003e\u003cth\u003eVideo Action Recognition\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        
\u003ctd\u003eDataset\u003c/td\u003e\u003ctd\u003eImagenet-1K\u003c/td\u003e\u003ctd\u003eADE20K\u003c/td\u003e\u003ctd\u003eCOCO\u003c/td\u003e\u003ctd\u003eKinetics 400\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eSplit\u003c/td\u003e\u003ctd\u003eval\u003c/td\u003e\u003ctd\u003eval\u003c/td\u003e\u003ctd\u003eval\u003c/td\u003e\u003ctd\u003eval\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eMetric\u003c/td\u003e\u003ctd\u003eAcc.\u003c/td\u003e\u003ctd\u003emIoU\u003csup\u003ess\u003c/sup\u003e / mIoU\u003csup\u003ems\u003c/sup\u003e\u003c/td\u003e\u003ctd\u003eAP\u003csup\u003ebox\u003c/sup\u003e / AP\u003csup\u003emask\u003c/sup\u003e\u003c/td\u003e\u003ctd\u003eTop-1 Acc. / Top-5 Acc.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eONE-PEACE\u003c/td\u003e\u003ctd\u003e89.8\u003c/td\u003e\u003ctd\u003e62.0 / 63.0\u003c/td\u003e\u003ctd\u003e60.4 / 52.9\u003c/td\u003e\u003ctd\u003e88.1 / 97.8\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n### Audio Tasks\n\u003ctable border=\"1\" width=\"100%\"\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eTask\u003c/th\u003e\u003cth colspan=\"4\"\u003eAudio-Text Retrieval\u003c/th\u003e\u003cth colspan=\"3\"\u003eAudio Classification\u003c/th\u003e\u003cth\u003eAudio Question Answering\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eDataset\u003c/td\u003e\u003ctd colspan=\"2\"\u003eAudioCaps\u003c/td\u003e\u003ctd colspan=\"2\"\u003eClotho\u003c/td\u003e\u003ctd\u003eESC-50\u003c/td\u003e\u003ctd\u003eFSD50K\u003c/td\u003e\u003ctd\u003eVGGSound (Audio-Visual)\u003c/td\u003e\u003ctd\u003eAVQA\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eSplit\u003c/td\u003e\u003ctd colspan=\"2\"\u003etest\u003c/td\u003e\u003ctd 
colspan=\"2\"\u003eevaluation\u003c/td\u003e\u003ctd\u003efull\u003c/td\u003e\u003ctd\u003eeval\u003c/td\u003e\u003ctd\u003etest\u003c/td\u003e\u003ctd\u003eval\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eMetric\u003c/td\u003e\u003ctd\u003eT2A R@1\u003c/td\u003e\u003ctd\u003eA2T R@1\u003c/td\u003e\u003ctd\u003eT2A R@1\u003c/td\u003e\u003ctd\u003eA2T R@1\u003c/td\u003e\u003ctd\u003eZero-shot Acc.\u003c/td\u003e\u003ctd\u003eMAP\u003c/td\u003e\u003ctd\u003eAcc.\u003c/td\u003e\u003ctd\u003eAcc.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eONE-PEACE\u003c/td\u003e\u003ctd\u003e42.5\u003c/td\u003e\u003ctd\u003e51.0\u003c/td\u003e\u003ctd\u003e22.4\u003c/td\u003e\u003ctd\u003e27.1\u003c/td\u003e\u003ctd\u003e91.8\u003c/td\u003e\u003ctd\u003e69.7\u003c/td\u003e\u003ctd\u003e68.2\u003c/td\u003e\u003ctd\u003e92.2\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n### Vision-Language Tasks\n\u003ctable border=\"1\" width=\"100%\"\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003cth\u003eTask\u003c/th\u003e\u003cth colspan=\"4\"\u003eImage-Text Retrieval (w/o ranking)\u003c/th\u003e\u003cth colspan=\"3\"\u003eVisual Grounding\u003c/th\u003e\u003cth\u003eVQA\u003c/th\u003e\u003cth\u003eVisual Reasoning\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eDataset\u003c/td\u003e\u003ctd colspan=\"2\"\u003eCOCO\u003c/td\u003e\u003ctd colspan=\"2\"\u003eFlickr30K\u003c/td\u003e\u003ctd\u003eRefCOCO\u003c/td\u003e\u003ctd\u003eRefCOCO+\u003c/td\u003e\u003ctd\u003eRefCOCOg\u003c/td\u003e\u003ctd\u003eVQAv2\u003c/td\u003e\u003ctd\u003eNLVR2\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eSplit\u003c/td\u003e\u003ctd colspan=\"2\"\u003etest\u003c/td\u003e\u003ctd colspan=\"2\"\u003etest\u003c/td\u003e\u003ctd\u003eval / testA / testB\u003c/td\u003e\u003ctd\u003eval / testA / 
testB\u003c/td\u003e\u003ctd\u003eval-u / test-u\u003c/td\u003e\u003ctd\u003etest-dev / test-std\u003c/td\u003e\u003ctd\u003edev / test-P\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eMetric\u003c/td\u003e\u003ctd\u003eI2T R@1\u003c/td\u003e\u003ctd\u003eT2I R@1\u003c/td\u003e\u003ctd\u003eI2T R@1\u003c/td\u003e\u003ctd\u003eT2I R@1\u003c/td\u003e\u003ctd colspan=\"3\"\u003eAcc@0.5\u003c/td\u003e\u003ctd\u003eAcc.\u003c/td\u003e\u003ctd\u003eAcc.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr align=\"center\"\u003e\n        \u003ctd\u003eONE-PEACE\u003c/td\u003e\u003ctd\u003e84.1\u003c/td\u003e\u003ctd\u003e65.4\u003c/td\u003e\u003ctd\u003e97.6\u003c/td\u003e\u003ctd\u003e89.6\u003c/td\u003e\u003ctd\u003e92.58 / 94.18 / 89.26\u003c/td\u003e\u003ctd\u003e88.77 / 92.21 / 83.23\u003c/td\u003e\u003ctd\u003e89.22 / 89.27\u003c/td\u003e\u003ctd\u003e82.6 / 82.5\u003c/td\u003e\u003ctd\u003e87.8 / 88.3\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\u003cbr\u003e\u003c/br\u003e\n\n\n# Requirements and Installation\n* 3.6 \u003c= Python \u003c=3.10\n* Pytorch \u003e= 1.10.0 (recommend 1.13.1)\n* CUDA Version \u003e= 10.2 (recommend 11.6)\n* Install required packages:\n```bash\ngit clone https://github.com/OFA-Sys/ONE-PEACE\ncd ONE-PEACE\npip install -r requirements.txt\n```\n* For faster training install [Apex](https://github.com/NVIDIA/apex) library (optional):\n```bash\ngit clone https://github.com/NVIDIA/apex\ncd apex \u0026\u0026 pip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" --global-option=\"--distributed_adam\" --global-option=\"--deprecated_fused_adam\" ./\n```\n* Install [Xformers](https://github.com/facebookresearch/xformers) library to use Memory-efficient attention (optional):\n```bash\nconda install xformers -c xformers\n```\n* Install [FlashAttention](https://github.com/HazyResearch/flash-attention) library to use faster 
LayerNorm (optional):\n```bash\ngit clone --recursive https://github.com/HazyResearch/flash-attention\ncd flash-attention \u0026\u0026 pip install .\ncd csrc/layer_norm \u0026\u0026 pip install .\n```\n\u003cbr\u003e\n\n# Datasets and Checkpoints\nSee [datasets.md](datasets.md) and [checkpoints.md](checkpoints.md).\n\u003cbr\u003e\u003c/br\u003e\n\n# Usage\n## API\nWe provide a simple code snippet to show how to use the API for ONE-PEACE.\n\n### Multi-modal Embedding\nWe use ONE-PEACE to compute embeddings for text, images, and audio, as well as their similarities:\n```python\nimport torch\nfrom one_peace.models import from_pretrained\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n# \"ONE-PEACE\" can also be replaced with ckpt path\nmodel = from_pretrained(\"ONE-PEACE\", device=device, dtype=\"float32\")\n\n# process raw data\nsrc_tokens = model.process_text([\"cow\", \"dog\", \"elephant\"])\nsrc_images = model.process_image([\"assets/dog.JPEG\", \"assets/elephant.JPEG\"])\nsrc_audios, audio_padding_masks = model.process_audio([\"assets/cow.flac\", \"assets/dog.flac\"])\n\nwith torch.no_grad():\n    # extract normalized features\n    text_features = model.extract_text_features(src_tokens)\n    image_features = model.extract_image_features(src_images)\n    audio_features = model.extract_audio_features(src_audios, audio_padding_masks)\n\n    # compute similarity\n    i2t_similarity = image_features @ text_features.T\n    a2t_similarity = audio_features @ text_features.T\n\nprint(\"Image-to-text similarities:\", i2t_similarity)\nprint(\"Audio-to-text similarities:\", a2t_similarity)\n```\n\n### Visual Grounding\nWe use ONE-PEACE to perform visual grounding on anime pictures:\n```python\nimport torch\nimport cv2\nfrom one_peace.models import from_pretrained\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nmodel = from_pretrained(\n\t\"ONE-PEACE_Grounding\",\n    model_type=\"one_peace_classify\",\n    device=device,\n    
dtype=\"float32\"\n)\n\n# process raw data\nimage_text_list = [\n    (\"assets/pokemons.jpg\", \"a blue turtle-like pokemon with round head\"),\n    (\"assets/pokemons.jpg\", \"Bulbasaur\"),\n    (\"assets/pokemons.jpg\", \"Charmander\"),\n    (\"assets/pokemons.jpg\", \"Squirtle\"),\n    (\"assets/one_piece.jpeg\", \"Brook\"),\n    (\"assets/one_piece.jpeg\", \"Franky\"),\n    (\"assets/one_piece.jpeg\", \"Monkey D. Luffy\"),\n    (\"assets/one_piece.jpeg\", \"Nami\"),\n    (\"assets/one_piece.jpeg\", \"Nico Robin\"),\n    (\"assets/one_piece.jpeg\", \"Roronoa Zoro\"),\n    (\"assets/one_piece.jpeg\", \"Tony Tony Chopper\"),\n    (\"assets/one_piece.jpeg\", \"Usopp\"),\n    (\"assets/one_piece.jpeg\", \"Vinsmoke Sanji\"),\n]\n(src_images, image_widths, image_heights), src_tokens  = model.process_image_text_pairs(\n    image_text_list, return_image_sizes=True\n)\n\nwith torch.no_grad():\n    # extract features\n    vl_features = model.extract_vl_features(src_images, src_tokens).sigmoid()\n    # extract coords\n    vl_features[:, ::2] *= image_widths.unsqueeze(1)\n    vl_features[:, 1::2] *= image_heights.unsqueeze(1)\n    coords = vl_features.cpu().tolist()\n\n# display results\nfor i, image_text_pair in enumerate(image_text_list):\n    image, text = image_text_pair\n    img = cv2.imread(image)\n    cv2.rectangle(\n        img,\n        (int(coords[i][0]), int(coords[i][1])),\n        (int(coords[i][2]), int(coords[i][3])),\n        (0, 255, 0),\n        3\n    )\n    cv2.imshow(text, img)\n    cv2.waitKey(3500)\n    cv2.destroyAllWindows()\n\n```\n\n### Audio Classification\nWe use ONE-PEACE to perform audio classification:\n\n```python\nimport torch\nimport json\nfrom one_peace.models import from_pretrained\n\nid2label = json.load(open(\"assets/vggsound_id2label.json\"))\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nmodel = from_pretrained(\n  \"ONE-PEACE_VGGSound\",\n    model_type=\"one_peace_classify\",\n    device=device,\n    
dtype=\"float32\"\n)\n\n# process audio\naudio_list = [\"assets/cow.flac\", \"assets/dog.flac\"]\nsrc_audios, audio_padding_masks = model.process_audio(audio_list)\n\nwith torch.no_grad():\n    # extract audio features\n    audio_logits = model.extract_audio_features(src_audios, audio_padding_masks)\n    print(audio_logits.size())\n    predict_label_ids = audio_logits.argmax(1).cpu().tolist()\n\nfor audio, predict_label_id in zip(audio_list, predict_label_ids):\n    predict_label = id2label[str(predict_label_id)]\n    print('audio: {}, predict label: {}'.format(audio, predict_label))\n\n```\n\n\n## Training \u0026 Inference\nIf you are not satisfied with only using the API, we offer comprehensive training and inference instructions for [audio \u0026 multimodal](one_peace/README.md) and [vision](one_peace_vision/README.md) tasks.\n\n\u003cbr\u003e\u003c/br\u003e\n\n# Gallery\n\n## Visual Grounding (unseen domain)\n![grounding](assets/grounding.png)\n\n## Emergent Zero-shot Retrieval\n![a2i](assets/audio2img.png)\n\n![a+t2i](assets/audio+text2img.png)\n\n![a+i2i](assets/audio+img2img.png)\n\u003cbr\u003e\u003c/br\u003e\n\n# Acknowledgement\n* [Fairseq](https://github.com/pytorch/fairseq) A sequence modeling toolkit with flexible configuration and highly extensible code structure.\n* [xFormers](https://github.com/facebookresearch/xformers) A toolbox to accelerate research on Transformers.\n* [FlashAttention](https://github.com/HazyResearch/flash-attention) A repository that provides the official implementation of FlashAttention, which greatly speeds up multi-head attention.\n* [Apex](https://github.com/NVIDIA/apex) A repository that provides useful model acceleration and memory optimization techniques.\n\u003cbr\u003e\u003c/br\u003e\n\n## Getting Involved\nFeel free to submit GitHub issues or pull requests. 
Contributions to our project are welcome!\n\nTo contact us, do not hesitate to send an email to `zheluo.wp@alibaba-inc.com` or `saimeng.wsj@alibaba-inc.com`!\n\u003cbr\u003e\u003c/br\u003e\n\n# Citation\n\nIf you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)\n\n```BibTeX\n@article{wang2023one,\n  title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},\n  author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},\n  journal={arXiv preprint arXiv:2305.11172},\n  year={2023}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofa-sys%2Fone-peace","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fofa-sys%2Fone-peace","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofa-sys%2Fone-peace/lists"}