{"id":15036649,"url":"https://github.com/apple/ml-mobileclip","last_synced_at":"2025-05-15T05:06:06.708Z","repository":{"id":225899708,"uuid":"765465710","full_name":"apple/ml-mobileclip","owner":"apple","description":"This repository contains the official implementation of the research paper, \"MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training\" CVPR 2024","archived":false,"fork":false,"pushed_at":"2024-11-22T19:19:39.000Z","size":3383,"stargazers_count":918,"open_issues_count":1,"forks_count":71,"subscribers_count":19,"default_branch":"main","last_synced_at":"2025-05-03T20:02:43.861Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apple.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-01T01:07:30.000Z","updated_at":"2025-05-03T20:01:03.000Z","dependencies_parsed_at":"2024-06-14T02:13:33.351Z","dependency_job_id":"1b01d68f-9b59-4612-8e7d-33cf185776d9","html_url":"https://github.com/apple/ml-mobileclip","commit_stats":null,"previous_names":["apple/ml-mobileclip"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-mobileclip","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-mobileclip/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-mobileclip/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-mobileclip/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apple","download_url":"https://codeload.github.com/apple/ml-mobileclip/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254276447,"owners_count":22043867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-24T20:31:48.435Z","updated_at":"2025-05-15T05:06:01.695Z","avatar_url":"https://github.com/apple.png","language":"Python","funding_links":[],"categories":["🧠 SOTA 2024-2025: Mobile LLMs \u0026 Multimodal","Python"],"sub_categories":["🎨 Vision Models \u0026 Feature Extraction"],"readme":"# MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training\n\nThis is the official repository of\n**[MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training](https://arxiv.org/pdf/2311.17049.pdf). 
*Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.*
The repository contains code for inference, training, and evaluation of MobileCLIP models trained on DataCompDR datasets.

[//]: # (![MobileCLIP Performance]&#40;docs/fig_accuracy_latency.png&#41;)
<p align="center">
<img src="docs/fig_accuracy_latency.png" alt="Accuracy vs latency figure." width="400"/>
</p>

- **Update 2024/11/22:** Releasing an iOS app to demonstrate the use of our model for real-time zero-shot image classification. See [ios_app](./ios_app/).
- **Update 2024/06/13:** Releasing the code and scripts to train using [OpenCLIP](https://github.com/mlfoundations/open_clip/tree/main/src/open_clip) on DataCompDR datasets. See [training/](./training/).
- **Update 2024/06/13:** MobileCLIP models and DataCompDR datasets are now available on HuggingFace in the [MobileCLIP/DataCompDR Collection](https://huggingface.co/collections/apple/mobileclip-models-datacompdr-data-665789776e1aa2b59f35f7c8).

### Highlights
* Our smallest variant `MobileCLIP-S0` obtains zero-shot performance similar to [OpenAI](https://arxiv.org/abs/2103.00020)'s ViT-B/16 model while being 4.8x faster and 2.8x smaller.
* `MobileCLIP-S2` obtains better average zero-shot performance than [SigLIP](https://arxiv.org/abs/2303.15343)'s ViT-B/16 model while being 2.3x faster, 2.1x smaller, and trained on 3x fewer seen samples.
* `MobileCLIP-B` (LT) attains zero-shot ImageNet performance of **77.2%**, which is significantly better than recent works such as [DFN](https://arxiv.org/abs/2309.17425) and [SigLIP](https://arxiv.org/abs/2303.15343) with similar architectures, and even [OpenAI's ViT-L/14@336](https://arxiv.org/abs/2103.00020).
* An iOS app to demonstrate the superior performance of our model on a mobile device.

![Examples](ios_app/docs/app_screenshots/examples.png)

## Getting Started

### Setup
```bash
conda create -n clipenv python=3.10
conda activate clipenv
pip install -e .
```
To download the pretrained checkpoints, run the snippet below:
```bash
source get_pretrained_models.sh   # Files will be downloaded to the `checkpoints` directory.
```

### Usage Example
To use models from the official repo, follow the code snippet below:
```python
import torch
from PIL import Image
import mobileclip

# Load the model, preprocessing transforms, and tokenizer for a given variant.
model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_s0', pretrained='/path/to/mobileclip_s0.pt')
tokenizer = mobileclip.get_tokenizer('mobileclip_s0')

image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert('RGB')).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode both modalities and L2-normalize the embeddings.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scaled cosine similarities, converted to per-image label probabilities.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

For an example of loading the data from HuggingFace, see
[hf_dataset_example.py](./hf_dataset_example.py).
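For a quick, self-contained illustration, the sketch below streams a single DataCompDR sample with the HuggingFace `datasets` library. The dataset id `apple/DataCompDR-12M` and the streaming setup are assumptions made here for illustration; [hf_dataset_example.py](./hf_dataset_example.py) remains the authoritative reference for the exact dataset names and fields.
```python
# Illustrative sketch (not part of this repo): stream one DataCompDR sample
# with the HuggingFace `datasets` library. The dataset id is an assumption;
# see hf_dataset_example.py and the collection linked above for exact names.
from datasets import load_dataset

dataset = load_dataset("apple/DataCompDR-12M", split="train", streaming=True)
sample = next(iter(dataset))

# Inspect the available fields (e.g. image/caption data plus the reinforced
# synthetic captions and teacher embeddings described in the paper).
print(sample.keys())
```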
### OpenCLIP Support
Our models are now natively supported in OpenCLIP. To use MobileCLIP models in OpenCLIP, set up your environment as shown below:
```bash
conda create -n clipenv python=3.10
conda activate clipenv

pip install git+https://github.com/mlfoundations/open_clip
pip install git+https://github.com/huggingface/pytorch-image-models
```

To run inference, see the example below:
```python
import open_clip
from mobileclip.modules.common.mobileone import reparameterize_model

model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP-S2', pretrained='datacompdr')
tokenizer = open_clip.get_tokenizer('MobileCLIP-S2')

# For inference/model exporting purposes, please reparameterize first
model.eval()
model = reparameterize_model(model)

# ... follow examples in open_clip repo ...
```
Variants currently available in OpenCLIP (these can also be listed programmatically, as shown in the sketch below):
 `[('MobileCLIP-S1', 'datacompdr'),
  ('MobileCLIP-S2', 'datacompdr'),
  ('MobileCLIP-B', 'datacompdr'),
  ('MobileCLIP-B', 'datacompdr_lt')]`
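The exact set of MobileCLIP weights exposed by your install can be queried from OpenCLIP's pretrained registry. A minimal sketch, assuming a recent `open_clip` release:
```python
# List the (model_name, pretrained_tag) pairs known to the installed
# open_clip version and keep only the MobileCLIP variants.
import open_clip

mobileclip_variants = [(name, tag) for name, tag in open_clip.list_pretrained()
                       if "MobileCLIP" in name]
print(mobileclip_variants)
```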
## Evaluation
Please find the detailed evaluation results [here](./results).
To reproduce our results, we provide a script that performs zero-shot evaluation on the ImageNet-1k dataset.
To evaluate on all 38 datasets, please follow the instructions in [datacomp](https://github.com/mlfoundations/datacomp).
```bash
# Run evaluation with a single GPU
python eval/zeroshot_imagenet.py --model-arch mobileclip_s0 --model-path /path/to/mobileclip_s0.pt
```

Please refer to [Open CLIP Results](https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv) to compare with other models.

| Model             | # Seen <BR>Samples (B) | # Params (M) <BR> (img + txt) | Latency (ms) <BR> (img + txt) | IN-1k Zero-Shot <BR> Top-1 Acc. (%) | Avg. Perf. (%) <BR> on 38 datasets |                                            PyTorch Checkpoint (url)                                            |
|:------------------|:----------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------------:|:----------------------------------:|:---------------------------------------------------------------------------------------------------------------:|
| MobileCLIP-S0     |           13           |          11.4 + 42.4          |           1.5 + 1.6           |                67.8                 |                58.1                |  [mobileclip_s0.pt](https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_s0.pt)   |
| MobileCLIP-S1     |           13           |          21.5 + 63.4          |           2.5 + 3.3           |                72.6                 |                61.3                |  [mobileclip_s1.pt](https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_s1.pt)   |
| MobileCLIP-S2     |           13           |          35.7 + 63.4          |           3.6 + 3.3           |                74.4                 |                63.7                |  [mobileclip_s2.pt](https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_s2.pt)   |
| MobileCLIP-B      |           13           |          86.3 + 63.4          |          10.4 + 3.3           |                76.8                 |                65.2                |   [mobileclip_b.pt](https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_b.pt)    |
| MobileCLIP-B (LT) |           36           |          86.3 + 63.4          |          10.4 + 3.3           |                77.2                 |                65.8                | [mobileclip_blt.pt](https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_blt.pt) |

Note: MobileCLIP-B (LT) is trained for 300k iterations with a constant learning rate schedule and 300k iterations with a cosine learning rate schedule.

## Citation
If you found this code useful, please cite the following paper:
```
@InProceedings{mobileclip2024,
  author = {Pavan Kumar Anasosalu Vasu and Hadi Pouransari and Fartash Faghri and Raviteja Vemulapalli and Oncel Tuzel},
  title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2024},
}
```

## Acknowledgements
Our codebase is built using multiple open-source contributions; please see [ACKNOWLEDGEMENTS](ACKNOWLEDGEMENTS) for more details.