{"id":13563925,"url":"https://github.com/facebookresearch/SLIP","last_synced_at":"2025-04-03T20:32:17.083Z","repository":{"id":41097336,"uuid":"440678688","full_name":"facebookresearch/SLIP","owner":"facebookresearch","description":"Code release for SLIP Self-supervision meets Language-Image Pre-training","archived":true,"fork":false,"pushed_at":"2023-02-09T10:23:37.000Z","size":1816,"stargazers_count":735,"open_issues_count":18,"forks_count":67,"subscribers_count":18,"default_branch":"main","last_synced_at":"2024-08-01T13:30:22.226Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-12-21T23:38:04.000Z","updated_at":"2024-07-25T16:26:19.000Z","dependencies_parsed_at":"2023-01-19T23:17:59.185Z","dependency_job_id":"5660faa5-5f0a-47fa-a276-fa24f193380b","html_url":"https://github.com/facebookresearch/SLIP","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FSLIP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FSLIP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FSLIP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FSLIP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/SLIP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223030783,"owners_count":17076500,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T13:01:24.622Z","updated_at":"2024-11-04T16:31:30.769Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","readme":"# [SLIP: Self-supervision meets Language-Image Pre-training](https://arxiv.org/abs/2112.12750)\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"slip.png\" alt=\"SLIP framework\" width=\"400\"/\u003e\u003c/p\u003e\n\n\n## What you can find in this repo:\n- Pre-trained models (with ViT-Small, Base, Large) and code to reproduce results from our paper: **[SLIP: Self-supervision meets Language-Image Pre-training](https://arxiv.org/abs/2112.12750).** *[Norman Mu](https://normanmu.com), [Alexander Kirillov](https://alexander-kirillov.github.io/), [David Wagner](http://people.eecs.berkeley.edu/~daw/) and [Saining Xie](https://sainingxie.com)*, arXiv 2021\n\n- An improved CLIP baseline (31.3% → 34.6% ImageNet 0-shot w/ Modified ResNet-50) on YFCC15M dataset.\n- Zero-shot transfer and linear classification evaluation scripts on **26** downstream datasets.\n\n## Updates:\n\nJan 18 2022: Added support for training on RedCaps\n\nJan 17 2022: Released CC3M/CC12M CLIP/SLIP ViT-B checkpoints\n\n## Results and Pre-trained Models\nThe following models are pre-trained on YFCC15M and evaluated on ImageNet-1K (ILSVRC2012).\n\n### ViT-Small (MoCo v3 version w/ 12 vs. 6 heads)\n\u003ctable\u003e\u003ctbody\u003e\n\u003c!-- START TABLE --\u003e\n\u003c!-- TABLE HEADER --\u003e\n\u003cth valign=\"center\"\u003eMethod\u003c/th\u003e\n\u003cth valign=\"center\"\u003eEpochs\u003c/th\u003e\n\u003cth valign=\"center\"\u003e0-shot\u003c/th\u003e\n\u003cth valign=\"center\"\u003eLinear\u003c/th\u003e\n\u003cth valign=\"center\"\u003eFinetuned\u003c/th\u003e\n\u003cth valign=\"center\"\u003eWeights\u003c/th\u003e\n\n\u003c!-- TABLE BODY --\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eCLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e25\u003c/td\u003e\n\u003ctd align=\"center\"\u003e32.7\u003c/td\u003e\n\u003ctd align=\"center\"\u003e59.3\u003c/td\u003e\n\u003ctd align=\"center\"\u003e78.2\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/clip_small_25ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSimCLR\u003c/td\u003e\n\u003ctd align=\"center\"\u003e25\u003c/td\u003e\n\u003ctd align=\"center\"\u003e-\u003c/td\u003e\n\u003ctd align=\"center\"\u003e58.1\u003c/td\u003e\n\u003ctd align=\"center\"\u003e79.9\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/simclr_small_25ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e25\u003c/td\u003e\n\u003ctd align=\"center\"\u003e38.3\u003c/td\u003e\n\u003ctd align=\"center\"\u003e66.4\u003c/td\u003e\n\u003ctd align=\"center\"\u003e80.3\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_small_25ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e50\u003c/td\u003e\n\u003ctd align=\"center\"\u003e39.3\u003c/td\u003e\n\u003ctd align=\"center\"\u003e67.6\u003c/td\u003e\n\u003ctd align=\"center\"\u003e80.7\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_small_50ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e100\u003c/td\u003e\n\u003ctd align=\"center\"\u003e39.5\u003c/td\u003e\n\u003ctd align=\"center\"\u003e68.3\u003c/td\u003e\n\u003ctd align=\"center\"\u003e80.7\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_small_100ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\u003c/table\u003e\n\n### ViT-Base\n\u003ctable\u003e\u003ctbody\u003e\n\u003c!-- START TABLE --\u003e\n\u003c!-- TABLE HEADER --\u003e\n\u003cth valign=\"center\"\u003eMethod\u003c/th\u003e\n\u003cth valign=\"center\"\u003eEpochs\u003c/th\u003e\n\u003cth valign=\"center\"\u003e0-shot\u003c/th\u003e\n\u003cth valign=\"center\"\u003eLinear\u003c/th\u003e\n\u003cth valign=\"center\"\u003eFinetuned\u003c/th\u003e\n\u003cth valign=\"center\"\u003eWeights\u003c/th\u003e\n\n\u003c!-- TABLE BODY --\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eCLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e25\u003c/td\u003e\n\u003ctd align=\"center\"\u003e37.6\u003c/td\u003e\n\u003ctd align=\"center\"\u003e66.5\u003c/td\u003e\n\u003ctd align=\"center\"\u003e80.5\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/clip_base_25ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSimCLR\u003c/td\u003e\n\u003ctd align=\"center\"\u003e25\u003c/td\u003e\n\u003ctd align=\"center\"\u003e-\u003c/td\u003e\n\u003ctd align=\"center\"\u003e64.0\u003c/td\u003e\n\u003ctd align=\"center\"\u003e82.5\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/simclr_base_25ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e25\u003c/td\u003e\n\u003ctd align=\"center\"\u003e42.8\u003c/td\u003e\n\u003ctd align=\"center\"\u003e72.1\u003c/td\u003e\n\u003ctd align=\"center\"\u003e82.6\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_base_25ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e50\u003c/td\u003e\n\u003ctd align=\"center\"\u003e44.1\u003c/td\u003e\n\u003ctd align=\"center\"\u003e73.0\u003c/td\u003e\n\u003ctd align=\"center\"\u003e82.9\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_base_50ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e100\u003c/td\u003e\n\u003ctd align=\"center\"\u003e45.0\u003c/td\u003e\n\u003ctd align=\"center\"\u003e73.6\u003c/td\u003e\n\u003ctd align=\"center\"\u003e83.4\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_base_100ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\u003c/table\u003e\n\n### ViT-Large\n\u003ctable\u003e\u003ctbody\u003e\n\u003c!-- START TABLE --\u003e\n\u003c!-- TABLE HEADER --\u003e\n\u003cth valign=\"center\"\u003eMethod\u003c/th\u003e\n\u003cth valign=\"center\"\u003eEpochs\u003c/th\u003e\n\u003cth valign=\"center\"\u003e0-shot\u003c/th\u003e\n\u003cth valign=\"center\"\u003eLinear\u003c/th\u003e\n\u003cth valign=\"center\"\u003eFinetuned\u003c/th\u003e\n\u003cth valign=\"center\"\u003eWeights\u003c/th\u003e\n\n\u003c!-- TABLE BODY --\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eCLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e25\u003c/td\u003e\n\u003ctd align=\"center\"\u003e40.4\u003c/td\u003e\n\u003ctd align=\"center\"\u003e70.5\u003c/td\u003e\n\u003ctd align=\"center\"\u003e81.0\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/clip_large_25ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSimCLR\u003c/td\u003e\n\u003ctd align=\"center\"\u003e25\u003c/td\u003e\n\u003ctd align=\"center\"\u003e-\u003c/td\u003e\n\u003ctd align=\"center\"\u003e66.7\u003c/td\u003e\n\u003ctd align=\"center\"\u003e84.0\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/simclr_large_25ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e25\u003c/td\u003e\n\u003ctd align=\"center\"\u003e46.2\u003c/td\u003e\n\u003ctd align=\"center\"\u003e76.0\u003c/td\u003e\n\u003ctd align=\"center\"\u003e84.2\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_large_25ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e50\u003c/td\u003e\n\u003ctd align=\"center\"\u003e47.4\u003c/td\u003e\n\u003ctd align=\"center\"\u003e75.8\u003c/td\u003e\n\u003ctd align=\"center\"\u003e84.7\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_large_50ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003e100\u003c/td\u003e\n\u003ctd align=\"center\"\u003e47.9\u003c/td\u003e\n\u003ctd align=\"center\"\u003e75.1\u003c/td\u003e\n\u003ctd align=\"center\"\u003e84.8\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_large_100ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\u003c/table\u003e\n\n### Additional Datasets and Models\n\u003ctable\u003e\u003ctbody\u003e\n\u003c!-- START TABLE --\u003e\n\u003c!-- TABLE HEADER --\u003e\n\u003cth valign=\"center\"\u003eDataset\u003c/th\u003e\n\u003cth valign=\"center\"\u003eMethod\u003c/th\u003e\n\u003cth valign=\"center\"\u003eModel\u003c/th\u003e\n\u003cth valign=\"center\"\u003eEpochs\u003c/th\u003e\n\u003cth valign=\"center\"\u003e0-shot\u003c/th\u003e\n\u003cth valign=\"center\"\u003eLinear\u003c/th\u003e\n\u003cth valign=\"center\"\u003eFinetuned\u003c/th\u003e\n\u003cth valign=\"center\"\u003eWeights\u003c/th\u003e\n\n\u003c!-- TABLE BODY --\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eCC3M\u003c/td\u003e\n\u003ctd align=\"center\"\u003eCLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003eViT-B\u003c/td\u003e\n\u003ctd align=\"center\"\u003e40\u003c/td\u003e\n\u003ctd align=\"center\"\u003e17.1\u003c/td\u003e\n\u003ctd align=\"center\"\u003e53.3\u003c/td\u003e\n\u003ctd align=\"center\"\u003e79.5\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/clip_base_cc3m_40ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eCC3M\u003c/td\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003eViT-B\u003c/td\u003e\n\u003ctd align=\"center\"\u003e40\u003c/td\u003e\n\u003ctd align=\"center\"\u003e23.0\u003c/td\u003e\n\u003ctd align=\"center\"\u003e65.4\u003c/td\u003e\n\u003ctd align=\"center\"\u003e81.4\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_base_cc3m_40ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eCC12M\u003c/td\u003e\n\u003ctd align=\"center\"\u003eCLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003eViT-B\u003c/td\u003e\n\u003ctd align=\"center\"\u003e35\u003c/td\u003e\n\u003ctd align=\"center\"\u003e36.5\u003c/td\u003e\n\u003ctd align=\"center\"\u003e69.0\u003c/td\u003e\n\u003ctd align=\"center\"\u003e82.1\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/clip_base_cc12m_35ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eCC12M\u003c/td\u003e\n\u003ctd align=\"center\"\u003eSLIP\u003c/td\u003e\n\u003ctd align=\"center\"\u003eViT-B\u003c/td\u003e\n\u003ctd align=\"center\"\u003e35\u003c/td\u003e\n\u003ctd align=\"center\"\u003e40.7\u003c/td\u003e\n\u003ctd align=\"center\"\u003e73.7\u003c/td\u003e\n\u003ctd align=\"center\"\u003e83.1\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\u003ca href=\"https://dl.fbaipublicfiles.com/slip/slip_base_cc12m_35ep.pt\"\u003eurl\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003c/tbody\u003e\u003c/table\u003e\n\n## 1. Setup\nInstall [PyTorch](https://pytorch.org) and [timm](https://github.com/rwightman/pytorch-image-models). \nThe code has been tested with CUDA 11.3/CuDNN 8.2.0, PyTorch 1.10.0 and timm 0.5.0.\n\n### 1.1. YFCC15M Setup\nDownload the [YFCC100M dataset](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/).\nOur dataloader expects the following dataset directory structure with 100 folders containing 1000 zip archives of 1000 images each.\nThe concatenation of the folder, archive, and file names is the index of the image (i.e. image 12345678 is stored as `678.jpg` within `12/345.zip`):\n\n```\n/path/to/yfcc100m/\n├── images/\n│   ├── 00/\n│   │   └── 000.zip\n│   │   │   ├── 000.jpg\n│   │   │   │   ...\n│   │   │   └── 999.jpg\n│   │   ...\n│   │   └── 999.zip\n│   ...\n│   └── 99/\n...\n```\n\nPrepare the YFCC15M subset metadata pickle:\n1. Download and compile a list of downloaded images to `flickr_unique_ids.npy` ([ours](https://dl.fbaipublicfiles.com/deepcluster/flickr_unique_ids.npy))\n2. Download OpenAI's list of captioned YFCC100M images according to instructions [here](https://github.com/openai/CLIP/blob/8cad3a736a833bc4c9b4dd34ef12b52ec0e68856/data/yfcc100m.md)\n3. Run `python make_dataset.py` to create the `yfcc15m.pkl` metadata pickle\n\nWhen pre-training with YFCC15M, set `--dataset yfcc15m --root /path/to/yfcc100m --metadata /path/to/yfcc15m.pkl`.\n\n### 1.2. COCO Captions Setup\nDownload and unzip the 2017 Train [images](http://images.cocodataset.org/zips/train2017.zip) and [annotations](http://images.cocodataset.org/annotations/annotations_trainval2017.zip).\nWhen pre-training on COCO, set `--dataset coco --root /path/to/coco --metadata /path/to/captions_train2017.json`.\n\n### 1.3. Conceptual Captions Setup\n[CC3M](https://ai.google.com/research/ConceptualCaptions/download) and [CC12M](https://github.com/google-research-datasets/conceptual-12m) are published as tsv files listing original image urls and processed captions.\nDownload images and collect the captions of all available images (many will be missing due to broken links) into `cc3m.npy` and `cc12m.npy`.\n\nFor CC3M our dataloader expects `cc3m.npy` to contain a NumPy array of dicts in the following format:\n\n```\n{\n  'image_id': 1510438788,  # local file path relative to root\n  'captions': ['large field with pink tulips on a clear sunny summer day with a blue sky']\n}\n```\n\nFor CC12M our dataloader expects `cc12m.npy` to contain a NumPy array of dicts in the following format:\n\n```\n{\n  'image_name': '0.jpg',  # local file path relative to root\n  'image_id': 0,\n  'captions': ['Metal Design Within Reach Ivory Slipper Chairs - a Pair For Sale - Image 7 of 10']\n}\n```\n\nWhen pre-training on CC3M set `--dataset cc3m --root /path/to/cc3m --metadata /path/to/cc3m.npy`, and whe pre-training on CC12M set `--dataset cc12m --root /path/to/cc12m --metadata /path/to/cc12m.npy`.\n\n### 1.4. RedCaps Setup\n[RedCaps](https://redcaps.xyz) is published as a list of JSON annotation files containing image urls and raw/processed captions.\nImages can be downloaded from these annotations with a helpful [downloader tool](https://github.com/redcaps-dataset/redcaps-downloader).\nThen merge all per-subreddit annotations into a single file with the [combine_captions.py](redcaps/combine_captions.py) script:\n\n```\npython redcaps/combine_captions.py --input /path/to/redcaps/annotations --output /path/to/redcaps_v1.json\n```\n\nTo pre-train on RedCaps set `--dataset redcaps --root /path/to/redcaps --metadata /path/to/redcaps_v1.json`.\n\n\n### 1.4. Downstream Dataset Setup\nZero-shot (in [main.py](main.py) and [eval_zeroshot.py](eval_zeroshot.py)) and linear (in [main_linear.py](main_linear.py)) evaluations read dataset paths from [dataset_catalog.json](dataset_catalog.json).\nZero-shot evaluations read CLIP's class labels and caption templates from [labels.json](labels.json) and [templates.json](templates.json).\nIf just pre-training models on YFCC15M, only the ImageNet path is required for model validation between training epochs.\nSee Section 3 below on zero-shot transfer evaluation for dataset preparation details.\n\n## 2. Pre-training\nWe use the following pre-training recipes for SLIP, CLIP, and SimCLR.\nSee [main.py](main.py) for the full list of default arguments.\nWe use the same lr and wd settings for all model sizes within the same training framework, and different model sizes can be selected by passing in different strings to the `--model` argument such as `SLIP_VITS16` or `SLIP_VITL16`.\n\nIn our workflow we use [submitit](https://github.com/facebookincubator/submitit), which interfaces nicely with Slurm.\nFor local training with the [torchrun](https://pytorch.org/docs/stable/elastic/run.html) utility (supersedes `torch.distributed.launch`), replace `python run_with_submitit.py` with `torchrun --nproc_per_node=8 main.py`. \nLocal multi-node training with `torchrun` should also be possible.\n\nWe train most of our models on 8x 8-gpu nodes, but training with fewer gpus is possible by reducing the batch size and setting the `--update-freq` argument above 1 to enable gradient accumulation.\nNote that gradient accumulation will increase the variance of minibatch statistics and alter the training dynamics of batchnorm, which is used in SLIP and SimCLR.\n\n### SLIP ViT-Base with 8-nodes (batch size 4096)\n```\npython run_with_submitit.py \\\n  --root /path/to/yfcc100m \\\n  --model SLIP_VITB16 \\\n  --lr 3e-3 --wd 0.1\n```\n\n### CLIP ViT-Base with 8-nodes (batch size 4096)\n```\npython run_with_submitit.py \\\n  --root /path/to/yfcc100m \\\n  --model CLIP_VITB16 \\\n  --lr 5e-4 --wd 0.5\n```\n\n### SimCLR ViT-Base with 8-nodes (batch size 4096)\n```\npython run_with_submitit.py \\\n  --root /path/to/yfcc100m \\\n  --model SIMCLR_VITB16 \\\n  --ssl-mlp-dim 4096 --ssl-emb-dim 256 --ssl-temp 0.1 \\\n  --lr 3.2e-3 --wd 0.1 \n```\n\nSome important arguments:\n\n`--dataset`: pre-training dataset name. choices include `yfcc15m`, `cc12m`, `cc3m`, `coco`.\n\n`--root`: path to dataset root\n\n`--metadata`: path to metadata file (see section 1 for details)\n\n`--ssl-mlp-dim`: hidden dim of SimCLR mlp projection head\n\n`--ssl-emb-dim`: output embed dim of SimCLR mlp projection head\n\n`--ssl-scale`: loss scale for SimCLR objective\n\n`--ssl-temp`: softmax temperature for SimCLR objective\n\n`--batch-size`: number of samples per-device/per-gpu \n\n`--lr-start`: initial warmup lr\n\n`--lr-end`: minimum final lr\n\n`--update-freq`: optimizer update frequency, i.e. gradient accumulation steps\n\n`--disable-amp`: disable mixed-precision training (requires more memory and compute)\n\n## 3. Evaluation: Zero-shot Transfer\nFirst, prepare additional downstream classification datasets:\n- MNIST, CIFAR-10/100, STL-10: Automatic download via [torchvision datasets](https://pytorch.org/vision/stable/datasets.html)\n- HatefulMemes: Manual download from [official website](https://hatefulmemeschallenge.com/#download) and sort images according to `train.jsonl`/`dev.jsonl` into train/dev folder\n- Rendered SST2, Country211: Manual download from [CLIP repo](https://github.com/openai/CLIP/tree/main/data)\n- Other datasets: Use scripts from [VISSL](https://github.com/facebookresearch/vissl/tree/main/extra_scripts/datasets)\n\nThen set all dataset paths in [dataset_catalog.json](dataset_catalog.json).\n\nEvaluate zero-shot transfer to various classification benchmarks with [eval_zeroshot.py](eval_zeroshot.py), which reads labels and templates from [labels.json](labels.json)/[templates.json](templates.json) and dataset paths from [dataset_catalog.json](dataset_catalog.json). Inference is performed with a single gpu. By default, the script iterates through all datasets in [dataset_catalog.json](dataset_catalog.json) and evaluates zero-shot in order. Evaluation can be limited to a subset of datasets by replacing `for d in datasets:` with `for d in ['imagenet']:` on line 78.\n\n```\npython eval_zeroshot.py --resume /path/to/checkpoint.pt\n```\n\n## 4. Evaluation: Linear Classification\nWe use a modified version of the MoCo v3 ImageNet linear classification script, [main_linear.py](main_linear.py).\nWe use the same single node 8-gpu recipe for all model sizes.\nSee [main_linear.py](main_linear.py) for the full list of default arguments.\nAs with pre-training, our workflow uses [submitit](https://github.com/facebookincubator/submitit).\nFor local training with [torchrun](https://pytorch.org/docs/stable/elastic/run.html), replace `python run_with_submitit_linear.py` with `torchrun --nproc_per_node=8 main_linear.py`. \nThis script reads the ImageNet dataset path from the dataset catalog ([dataset_catalog.json](dataset_catalog.json)), which must be set properly before training.\n\n```\npython run_with_submitit_linear.py  \\\n  --arch vit_base_patch16_224 --dataset imagenet \\\n  --pretrained /path/to/checkpoint.pt\n```\n\nTo evaluate linear classification on other datasets, set `--dataset` to the corresponding dataset name listed in [dataset_catalog.json](dataset_catalog.json).\n\n## 5. Evaluation: End-to-End Finetuning\nWe use a modified version of the ImageNet finetuning script from [BeiT](https://github.com/microsoft/unilm/tree/f8f3df80c65eb5e5fc6d6d3c9bd3137621795d1e/beit).\nOur code has been tested with commit `f8f3df8`.\nWe have removed the explicit torch, torchvision, and timm dependencies from [beit_finetuning/requirements.txt](beit_finetuning/requirements.txt), as they conflict with the versions used in our SLIP code (CUDA 11.3/CuDNN 8.2.0, PyTorch 1.10.0 and timm 0.5.0).\nThe fintuning code has been modified and tested to work with these versions.\n\n### 5.1. Setup\nTo evaluate end-to-end finetuning on ImageNet, first clone the BeiT repo and checkout the correct commit:\n\n```\ngit clone git@github.com:microsoft/unilm.git\ncd unilm/beit\ngit checkout f8f3df8\n```\n\nNow copy over modified files from our [beit_finetuning](beit_finetuning) directory:\n\n```\ncp beit_finetuning/* unilm/beit\ncd unilm/beit\n```\n\nInstall pip dependencies and Nvidia Apex:\n\n```\npip install -r requirements.txt\ngit clone https://github.com/NVIDIA/apex\ncd apex\npip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./\n```\n\n\n### 5.2. Commands\nAs with pre-training, our workflow uses [submitit](https://github.com/facebookincubator/submitit).\nFor local training with [torchrun](https://pytorch.org/docs/stable/elastic/run.html), replace `python run_with_submitit_finetune.py` with `torchrun --nproc_per_node=8 run_class_finetuning.py`. \nWe established finetuning recipes based on the BeiT recipes with some light additional hyperparameter tuning.\nWe increase regularization with model size: ViT-S uses drop_path=0 and layer_decay=0.65, ViT-B uses drop_path=0.1 and layer_decay=0.65, and ViT-L uses drop_path=0.1 and layer_decay=0.75.\nNote the use of the `--finetune` argument instead of `--resume`.\n\n### ViT-Small (MoCo v3 version w/ 12 vs. 6 heads)\n\n```\npython run_with_submitit_finetune.py \\\n    --batch_size 128 --enable_deepspeed \\\n    --epochs 100 --warmup_epochs 20 \\\n    --model beit_small_patch16_224 --nb_classes 1000 \\\n    --imagenet_default_mean_and_std \\\n    --model_key state_dict --model_prefix module.visual. \\\n    --disable_rel_pos_bias --abs_pos_emb --use_cls \\\n    --mixup 0.8 --cutmix 1 \\\n    --layer_scale_init_value 0 \\\n    --lr 4e-3 --drop_path 0 --layer_decay 0.65 \\\n    --output_dir /path/to/output_dir --finetune /path/to/checkpoint.pt\n```\n\n### ViT-Base\n\n```\npython run_with_submitit_finetune.py \\\n    --batch_size 128 --enable_deepspeed \\\n    --epochs 100 --warmup_epochs 20 \\\n    --model beit_base_patch16_224 --nb_classes 1000 \\\n    --imagenet_default_mean_and_std \\\n    --model_key state_dict --model_prefix module.visual. \\\n    --disable_rel_pos_bias --abs_pos_emb --use_cls \\\n    --mixup 0.8 --cutmix 1 \\\n    --layer_scale_init_value 0 \\\n    --lr 4e-3 --drop_path 0.1 --layer_decay 0.65 \\\n    --output_dir /path/to/output_dir --finetune /path/to/checkpoint.pt\n```\n\n### ViT-Large\n\n```\npython run_with_submitit_finetune.py \\\n    --batch_size 128 --enable_deepspeed \\\n    --epochs 50 --warmup_epochs 5 \\\n    --model beit_large_patch16_224 --nb_classes 1000 \\\n    --imagenet_default_mean_and_std \\\n    --model_key state_dict --model_prefix module.visual. \\\n    --disable_rel_pos_bias --abs_pos_emb --use_cls \\\n    --mixup 0.8 --cutmix 1 \\\n    --layer_scale_init_value 0 \\\n    --lr 4e-3 --drop_path 0.1 --layer_decay 0.75 \\\n    --output_dir /path/to/output_dir --finetune /path/to/checkpoint.pt\n```\n\n\n### License\n\nThis project is under the MIT license. See [LICENSE](LICENSE) for details.\n\n### Citation\n```\n@Article{mu2021slip,\n  author  = {Norman Mu and Alexander Kirillov and David Wagner and Saining Xie},\n  title   = {SLIP: Self-supervision meets Language-Image Pre-training},\n  journal = {arXiv preprint arXiv:2112.12750},\n  year    = {2021},\n}\n```\n","funding_links":[],"categories":["Python","其他_机器视觉"],"sub_categories":["网络服务_其他"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FSLIP","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2FSLIP","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FSLIP/lists"}