{"id":15108750,"url":"https://github.com/facebookresearch/metaclip","last_synced_at":"2025-10-20T05:49:08.586Z","repository":{"id":197120401,"uuid":"697602443","full_name":"facebookresearch/MetaCLIP","owner":"facebookresearch","description":"ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering","archived":false,"fork":false,"pushed_at":"2025-03-13T18:20:44.000Z","size":26257,"stargazers_count":1415,"open_issues_count":24,"forks_count":63,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-04-13T01:56:04.688Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-28T04:56:22.000Z","updated_at":"2025-04-11T11:11:59.000Z","dependencies_parsed_at":"2023-10-05T03:48:35.710Z","dependency_job_id":"f4c1ddda-52de-4d6a-8d5a-5d2b15418bd1","html_url":"https://github.com/facebookresearch/MetaCLIP","commit_stats":null,"previous_names":["facebookresearch/metaclip"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FMetaCLIP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FMetaCLIP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FMetaCLIP/releases","man
ifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FMetaCLIP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/MetaCLIP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248654050,"owners_count":21140235,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-25T22:24:27.894Z","updated_at":"2025-10-20T05:49:08.567Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["多模态大模型"],"sub_categories":["资源传输下载"],"readme":"# Meta CLIP\n\n[FAIR, Meta](https://ai.meta.com/research/)\n\n[![arXiv](https://img.shields.io/badge/arXiv-2507.22062-b31b1b)](https://arxiv.org/abs/2507.22062) [![arXiv](https://img.shields.io/badge/arXiv-2309.16671-b31b1b)](https://arxiv.org/abs/2309.16671) [![Hugging Face Collection](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Collection-blue)](https://huggingface.co/collections/facebook/meta-clip-687e97787e9155bc480ef446) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1V0Rv1QQJkcolTjiwJuRsqWycROvYjOwg?usp=sharing) [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/activebus/MetaCLIP)\n\n\u003cimg src=\"docs/metaclip2_scaling.gif\" style=\"width: 50%; margin: 0 auto; display: block;\" /\u003e\n\u003cimg 
src=\"docs/metaclip2_teaser.png\" style=\"width: 80%; margin: 0 auto; display: block;\" /\u003e\n\nAfter years of advancements in English-centric CLIP development, Meta CLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing challenges:\n- large-scale non-English data curation pipelines are largely undeveloped;\n- the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP.\n\nWith a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can **mutually benefit** and elevate each other, achieving SoTA multilingual performance.\n\n\n## Updates\n* 09/18/2025: 🔥 paper [Meta CLIP 2 (worldwide)](https://arxiv.org/abs/2507.22062) accepted by NeurIPS as spotlight presentation.\n* 08/25/2025: 🔥 [Meta CLIP 2 (worldwide)](https://arxiv.org/abs/2507.22062) is on [open_clip](https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/pretrained.py) and [Huggingface](https://huggingface.co/collections/facebook/meta-clip-687e97787e9155bc480ef446).\n* 07/29/2025: 🔥 paper [Meta CLIP 2: A Worldwide Scaling Recipe](https://arxiv.org/abs/2507.22062) (aka Meta CLIP 2 worldwide) is released.\n* 12/10/2024: 🔥 Meta CLIP 1.2 (ViT-H/14) trained with Altogether synthetic captions is released.\n* 10/09/2024: 🔥 [Altogether: Image Captioning via Re-aligning Alt-text](https://arxiv.org/abs/2410.17251) (aka Meta CLIP 1.2) is accepted by EMNLP 2024 with [code](altogether/README.md) released.\n* 08/15/2024: [v0.1](https://github.com/facebookresearch/MetaCLIP/releases/tag/v0.1) released.\n* 04/25/2024: 🔥 paper [MoDE: CLIP Data Experts via Clustering](https://arxiv.org/abs/2404.16030) is accepted by CVPR 2024 with [code](mode/README.md) released.\n* 01/18/2024: 🔥 add [code](metaclip/README_metadata.md) for building metadata.\n* 01/16/2024: 🔥 paper [Demystifying CLIP Data](https://arxiv.org/pdf/2309.16671) accepted by 
ICLR as [spotlight presentation](https://openreview.net/group?id=ICLR.cc/2024/Conference#tab-accept-spotlight).\n* 12/25/2023: [Huggingface Space](https://huggingface.co/spaces/activebus/MetaCLIP) demo and [Colab](https://colab.research.google.com/drive/1V0Rv1QQJkcolTjiwJuRsqWycROvYjOwg?usp=sharing) released.\n* 12/21/2023: Meta CLIP 1.1 (ViT-G/14) released.\n* 09/28/2023: initial release.\n\n\n## Quick Start\nThe pre-trained MetaCLIP models are available in\n\n\u003cdetails\u003e\n\u003csummary\u003emini_clip (this repo)\u003c/summary\u003e\n\n```python\nimport torch\nfrom PIL import Image\nfrom src.mini_clip.factory import create_model_and_transforms, get_tokenizer\n\n\nmodel, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')\ntokenize = get_tokenizer(\"facebook/xlm-v-base\")\n\nimage = preprocess(Image.open(\"docs/CLIP.png\")).unsqueeze(0)\ntext = tokenize([\"a diagram\", \"a dog\", \"a cat\"])\n\nwith torch.no_grad():\n    image_features = model.encode_image(image)\n    text_features = model.encode_text(text)\n    image_features /= image_features.norm(dim=-1, keepdim=True)\n    text_features /= text_features.norm(dim=-1, keepdim=True)\n\n    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)\n\nprint(\"Label probs:\", text_probs)\n```\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eHuggingface\u003c/summary\u003e\n\n```python\nimport torch\nfrom PIL import Image\nfrom transformers import AutoProcessor, AutoModel\n\n\n# Meta CLIP 1\nprocessor = AutoProcessor.from_pretrained(\"facebook/metaclip-b32-400m\")\nmodel = AutoModel.from_pretrained(\"facebook/metaclip-b32-400m\")\n\n# Meta CLIP 2\n# model = AutoModel.from_pretrained(\"facebook/metaclip-2-worldwide-huge-quickgelu\")\n# processor = AutoProcessor.from_pretrained(\"facebook/metaclip-2-worldwide-huge-quickgelu\")\n\nimage = Image.open(\"docs/CLIP.png\")\ninputs = processor(text=[\"a diagram\", \"a dog\", \"a 
cat\"], images=image, return_tensors=\"pt\", padding=True)\n\nwith torch.no_grad():\n  outputs = model(**inputs)\n  logits_per_image = outputs.logits_per_image  # this is the image-text similarity score\n  text_probs = logits_per_image.softmax(dim=-1)\nprint(\"Label probs:\", text_probs)\n```\n\u003c/details\u003e\n\n## Pre-trained Models\n\nMeta CLIP closely adheres to the OpenAI CLIP training and model setup (you mostly just need to replace the weights): **to promote rigorous ablation studies and advance scientific understanding**, as in the old \"era of ImageNet\".\n\n\nMeta CLIP 2\n\n|    `model_name`     | `pretrained` | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |\n|:--------------------|:-------------|:---------:|:---------:|:---------:|:--------------:|\n| `ViT-H-14-quickgelu-worldwide` | [`metaclip2_worldwide`](https://dl.fbaipublicfiles.com/MMPT/metaclip/metaclip2_h14_quickgelu_224px_worldwide.pt) | Online Curation | 29B | 224 | 57.4 |\n| `ViT-H-14-378-worldwide` | [`metaclip2_worldwide`](https://dl.fbaipublicfiles.com/MMPT/metaclip/metaclip2_h14_378px_worldwide.pt) | Online Curation | 29B | 378 | 58.2 |\n| `ViT-bigG-14-worldwide` | [`metaclip2_worldwide`](https://dl.fbaipublicfiles.com/MMPT/metaclip/metaclip2_bigG14_224px_worldwide.pt) | Online Curation | 29B | 224 | 60.7 |\n| `ViT-bigG-14-378-worldwide` | [`metaclip2_worldwide`](https://dl.fbaipublicfiles.com/MMPT/metaclip/metaclip2_bigG14_378px_worldwide.pt) | Online Curation | 29B | 378 | 62.0 |\n\n\n(WIP): Meta CLIP 2: distilled smaller models and tokenizers.\n\n\nMeta CLIP 1\n\n|    `model_name`     | `pretrained` | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. 
|\n|:--------------------|:-------------|:---------:|:---------:|:---------:|:---------:|:--------------:|\n| `ViT-B-32-quickgelu` | [`metaclip_400m`](https://dl.fbaipublicfiles.com/MMPT/metaclip/b32_400m.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_400m.json) | 12.8B | 224 | 64 x V100 | 65.5 |\n| `ViT-B-16-quickgelu` | [`metaclip_400m`](https://dl.fbaipublicfiles.com/MMPT/metaclip/b16_400m.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_400m.json) | 12.8B | 224 | 64 x V100 | 70.8 |\n| `ViT-L-14-quickgelu` | [`metaclip_400m`](https://dl.fbaipublicfiles.com/MMPT/metaclip/l14_400m.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_400m.json) | 12.8B | 224 | 128 x V100 | 76.2 |\n| `ViT-B-32-quickgelu` | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/b32_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 64 x V100 | 67.6 |\n| `ViT-B-16-quickgelu` | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/b16_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 64 x V100 | 72.1 |\n| `ViT-L-14-quickgelu` | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/l14_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 128 x V100 | 79.2 |\n| `ViT-H-14-quickgelu` | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/h14_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 256 x A100 | 80.5 |\n| `ViT-bigG-14-quickgelu` (v1.1) | [`metaclip_2_5b`](https://dl.fbaipublicfiles.com/MMPT/metaclip/G14_fullcc2.5b.pt) | [data card](https://dl.fbaipublicfiles.com/MMPT/metaclip/datacard_fullcc2.5b.json) | 12.8B | 224 | 256 x A100 | 82.1 |\n| `ViT-H-14` (v1.2) | 
[`metaclip_v1_2_altogether`](https://dl.fbaipublicfiles.com/MMPT/metaclip/h14_v1.2_altogether.pt) | Online Curation | 35B | 224 | 256 x H100 | 82.0 |\n\n\n## Environment\n\nThis code is customized from [OpenCLIP](https://github.com/mlfoundations/open_clip) and will be maintained separately for research on MetaCLIP. The following command should install requirements for OpenCLIP and `submitit=1.2.1` used by this repo:\n\n```bash\nconda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \\\n    -c pytorch-nightly \\\n    -c nvidia \\\n    -c conda-forge \\\n    -c anaconda\n```\n\n## Curation\n\nSee [MetaCLIP 2](docs/metaclip2.md) and [MetaCLIP 1](docs/metaclip1.md).\n\n\n## Bugs or questions?\n\nIf you have any questions related to the code or the paper, feel free to email Hu Xu (`huxu@meta.com`).\n\n\n## Citation\n\nPlease cite the following papers if MetaCLIP helps your work:\n\n```bibtex\n@inproceedings{chuang2025metaclip2,\n   title={Meta CLIP 2: A Worldwide Scaling Recipe},\n   author={Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li and Hu Xu},\n   journal={arXiv preprint arXiv:2507.22062},\n   year={2025}\n}\n\n@inproceedings{xu2023metaclip,\n   title={Demystifying CLIP Data},\n   author={Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer and Christoph Feichtenhofer},\n   journal={arXiv preprint arXiv:2309.16671},\n   year={2023}\n}\n\n@inproceedings{xu2024altogether,\n   title={Altogether: Image Captioning via Re-aligning Alt-text},\n   author={Hu Xu, Po-Yao Huang, Xiaoqing Ellen Tan, Ching-Feng Yeh, Jacob Kahn, Christine Jou, Gargi Ghosh, Omer Levy, Luke Zettlemoyer, Wen-tau Yih, Shang-Wen Li, Saining Xie, Christoph Feichtenhofer},\n   journal={arXiv 
preprint arXiv:2410.17251},\n   year={2024}\n}\n\n@inproceedings{ma2024mode,\n  title={Mode: Clip data experts via clustering},\n  author={Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih and Hu Xu},\n  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},\n  year={2024}\n}\n```\n\n\n## Reference\n\nThe training code is developed based on [OpenCLIP](https://github.com/mlfoundations/open_clip), modified to the vanilla CLIP training setup.\n\n## TODO\n- pip installation of metaclip package;\n- refactor mini_clip with apps for MoDE and Altogether;\n- more updates for Meta CLIP 2: metadata, data loader, training code.\n\n## License\n\nThe majority of Meta CLIP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: open_clip is licensed under the https://github.com/mlfoundations/open_clip license.\n\n## Acknowledgement\nWe gratefully acknowledge the [OpenCLIP](https://github.com/mlfoundations/open_clip) team for the initial CLIP codebase and integration, and [NielsRogge](https://github.com/NielsRogge) for the integration into [Huggingface](https://huggingface.co/models?other=metaclip).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Fmetaclip","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2Fmetaclip","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Fmetaclip/lists"}