{"id":13547521,"url":"https://github.com/lucidrains/x-clip","last_synced_at":"2025-04-01T11:01:39.989Z","repository":{"id":38743975,"uuid":"433655841","full_name":"lucidrains/x-clip","owner":"lucidrains","description":"A concise but complete implementation of CLIP with various experimental improvements from recent papers","archived":false,"fork":false,"pushed_at":"2023-10-16T15:02:57.000Z","size":1528,"stargazers_count":708,"open_issues_count":8,"forks_count":47,"subscribers_count":25,"default_branch":"main","last_synced_at":"2025-03-25T10:01:41.994Z","etag":null,"topics":["artificial-intelligence","contrastive-learning","deep-learning","multi-modal-learning","zero-shot-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-12-01T02:26:33.000Z","updated_at":"2025-03-08T20:06:53.000Z","dependencies_parsed_at":"2023-01-23T08:45:57.974Z","dependency_job_id":"956af52a-4f12-4b90-bce6-ea61161fb9b4","html_url":"https://github.com/lucidrains/x-clip","commit_stats":{"total_commits":67,"total_committers":1,"mean_commits":67.0,"dds":0.0,"last_synced_commit":"91fe1ed33fa8d0dcff7865e0a102f7d1728d6359"},"previous_names":[],"tags_count":63,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fx-clip","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fx-clip/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fx-clip/releases","manifests_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fx-clip/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/x-clip/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246531968,"owners_count":20792736,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","contrastive-learning","deep-learning","multi-modal-learning","zero-shot-learning"],"created_at":"2024-08-01T12:00:57.447Z","updated_at":"2025-04-01T11:01:39.940Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":["CLIP-related","Python"],"sub_categories":[],"readme":"\u003cimg src=\"./clip.png\" width=\"600px\"\u003e\u003c/img\u003e\n\n\u003ca href=\"https://discord.gg/xBPBXfcFHd\"\u003e\u003cimg alt=\"Join us on Discord\" src=\"https://img.shields.io/discord/823813159592001537?color=5865F2\u0026logo=discord\u0026logoColor=white\"\u003e\u003c/a\u003e\n\n## x-clip\n\nA concise but complete implementation of \u003ca href=\"https://openai.com/blog/clip/\"\u003eCLIP\u003c/a\u003e with various experimental improvements from recent papers\n\n## Install\n\n```bash\n$ pip install x-clip\n```\n\n## Usage\n\n```python\nimport torch\nfrom x_clip import CLIP\n\nclip = CLIP(\n    dim_text = 512,\n    dim_image = 512,\n    dim_latent = 512,\n    num_text_tokens = 10000,\n    text_enc_depth = 6,\n    text_seq_len = 256,\n    text_heads = 8,\n    visual_enc_depth = 6,\n    visual_image_size = 
256,\n    visual_patch_size = 32,\n    visual_heads = 8,\n    visual_patch_dropout = 0.5,             # patch dropout probability, used in Kaiming He's FLIP to save compute and improve end results - 0.5 is a good value, 0.75 on the high end is tolerable\n    use_all_token_embeds = False,           # whether to use fine-grained contrastive learning (FILIP)\n    decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)\n    extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)\n    use_visual_ssl = True,                  # whether to do self-supervised learning on images\n    use_mlm = False,                        # use masked language modeling (MLM) on text (DeCLIP)\n    text_ssl_loss_weight = 0.05,            # weight for text MLM loss\n    image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss\n)\n\n# mock data\n\ntext = torch.randint(0, 10000, (4, 256))\nimages = torch.randn(4, 3, 256, 256)\n\n# train\n\nloss = clip(\n    text,\n    images,\n    freeze_image_encoder = False,   # whether to freeze image encoder if using a pretrained image net, proposed by LiT paper\n    return_loss = True              # needs to be set to True to return contrastive loss\n)\n\nloss.backward()\n```\n\nYou can also pass in an external visual transformer / residual net. You simply have to make sure your image encoder returns a set of embeddings in the shape of `batch x seq x dim`, and make sure `dim_image` is properly specified as the dimension of the returned embeddings. 
Below is an example using a vision transformer from `vit_pytorch`\n\n```bash\n$ pip install 'vit_pytorch\u003e=0.25.6'\n```\n\n```python\nimport torch\nfrom x_clip import CLIP\n\nfrom vit_pytorch import ViT\nfrom vit_pytorch.extractor import Extractor\n\nbase_vit = ViT(\n    image_size = 256,\n    patch_size = 32,\n    num_classes = 1000,\n    dim = 512,\n    depth = 6,\n    heads = 16,\n    mlp_dim = 2048,\n    dropout = 0.1,\n    emb_dropout = 0.1\n)\n\nvit = Extractor(\n    base_vit,\n    return_embeddings_only = True\n)\n\nclip = CLIP(\n    image_encoder = vit,\n    dim_image = 512,           # must be set as the same dimensions as the vision transformer above\n    dim_text = 512,\n    dim_latent = 512,\n    num_text_tokens = 10000,\n    text_enc_depth = 6,\n    text_seq_len = 256,\n    text_heads = 8\n)\n\ntext = torch.randint(0, 10000, (4, 256))\nimages = torch.randn(4, 3, 256, 256)\n\nloss = clip(text, images, return_loss = True)\nloss.backward()\n```\n\nFinally, one can also have the text transformer be externally defined. 
It will need to return the embeddings including the CLS token, for now.\n\n```python\nimport torch\nfrom x_clip import CLIP, TextTransformer\n\nfrom vit_pytorch import ViT\nfrom vit_pytorch.extractor import Extractor\n\nbase_vit = ViT(\n    image_size = 256,\n    patch_size = 32,\n    num_classes = 1000,\n    dim = 512,\n    depth = 6,\n    heads = 16,\n    mlp_dim = 2048,\n    dropout = 0.1,\n    emb_dropout = 0.1\n)\n\nimage_encoder = Extractor(\n    base_vit,\n    return_embeddings_only = True\n)\n\ntext_encoder = TextTransformer(\n    dim = 512,\n    num_tokens = 10000,\n    max_seq_len = 256,\n    depth = 6,\n    heads = 8\n)\n\nclip = CLIP(\n    image_encoder = image_encoder,\n    text_encoder = text_encoder,\n    dim_image = 512,\n    dim_text = 512,\n    dim_latent = 512\n)\n\ntext = torch.randint(0, 10000, (4, 256))\nimages = torch.randn(4, 3, 256, 256)\n\nloss = clip(text, images, return_loss = True)\nloss.backward()\n```\n\n## Multiview CL Losses\n\nThis repository also supports multiview contrastive learning loss, as proposed in \u003ca href=\"https://arxiv.org/abs/2110.05208\"\u003eDeCLIP\u003c/a\u003e. 
Just pass in the augmented text and/or augmented image, and it will be auto-calculated, weighted by `multiview_loss_weight` set on initialization.\n\nFor example:\n\n```python\nimport torch\nfrom x_clip import CLIP, TextTransformer\n\nfrom vit_pytorch import ViT\nfrom vit_pytorch.extractor import Extractor\n\nbase_vit = ViT(\n    image_size = 256,\n    patch_size = 32,\n    num_classes = 1000,\n    dim = 512,\n    depth = 6,\n    heads = 16,\n    mlp_dim = 2048,\n    dropout = 0.1,\n    emb_dropout = 0.1\n)\n\nimage_encoder = Extractor(\n    base_vit,\n    return_embeddings_only = True\n)\n\ntext_encoder = TextTransformer(\n    dim = 512,\n    num_tokens = 10000,\n    max_seq_len = 256 + 1,\n    depth = 6,\n    heads = 8\n)\n\nclip = CLIP(\n    image_encoder = image_encoder,\n    text_encoder = text_encoder,\n    dim_image = 512,\n    dim_text = 512,\n    dim_latent = 512,\n    extra_latent_projection = True,\n    multiview_loss_weight = 0.1         # weight multiview contrastive loss by 0.1\n)\n\ntext = torch.randint(0, 10000, (4, 256))\nimages = torch.randn(4, 3, 256, 256)\n\naug_text = torch.randint(0, 10000, (4, 256))  # augmented text (backtranslation or EDA), same dimensions as text\naug_images = torch.randn(4, 3, 256, 256)      # augmented images, same dimensions as images above\nloss = clip(\n    text,\n    images,\n    aug_text = aug_text,           # pass in augmented texts\n    aug_image = aug_images,        # pass in augmented images\n    return_loss = True,\n    freeze_image_encoder = True\n)\n\nloss.backward()\n```\n\nYou can even pass in more than one augmented text or image\n\n```python\n# ...\n\naug_texts = (\n    torch.randint(0, 10000, (4, 256)),\n    torch.randint(0, 10000, (4, 256)),\n)\n\naug_images = (\n    torch.randn(4, 3, 256, 256),\n    torch.randn(4, 3, 256, 256),\n)\n\nloss = clip(\n    text,\n    images,\n    aug_text = aug_texts,\n    aug_image = aug_images,\n    return_loss = True,\n    freeze_image_encoder = 
True\n)\n\nloss.backward()\n```\n\n## Custom Vision Self-supervised Learning Module\n\nYou can pass in your own vision self-supervised learning module through the `visual_ssl` keyword as so\n\n```python\nimport torch\nfrom x_clip import CLIP\nfrom x_clip.visual_ssl import SimSiam\n\nfrom vit_pytorch import ViT\nfrom vit_pytorch.extractor import Extractor\n\nbase_vit = ViT(\n    image_size = 256,\n    patch_size = 32,\n    num_classes = 1000,\n    dim = 512,\n    depth = 6,\n    heads = 16,\n    mlp_dim = 2048,\n    dropout = 0.1,\n    emb_dropout = 0.1\n)\n\nimage_encoder = Extractor(\n    base_vit,\n    return_embeddings_only = True\n)\n\nvisual_ssl = SimSiam(                 # SimSiam defined externally - needs to be a module that accepts an image of the same dimensions as CLIP and returns a scalar loss\n    image_encoder,\n    image_size = 256,\n    hidden_layer = -1\n)\n\nclip = CLIP(\n    image_encoder = image_encoder,\n    dim_image = 512,\n    dim_text = 512,\n    dim_latent = 512,\n    use_mlm = True,\n    visual_ssl = visual_ssl,           # SSL module passed into CLIP\n    use_all_token_embeds = False,\n    extra_latent_projection = False,\n    mlm_random_token_prob = 0.1\n)\n\ntext = torch.randint(0, 10000, (4, 256))\nimages = torch.randn(4, 3, 256, 256)\n\nloss = clip(text, images, return_loss = True)\nloss.backward()\n\n```\n\n## Citations\n\n```bibtex\n@misc{radford2021learning,\n    title   = {Learning Transferable Visual Models From Natural Language Supervision}, \n    author  = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},\n    year    = {2021},\n    eprint  = {2103.00020},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{yao2021filip,\n    title   = {FILIP: Fine-grained Interactive Language-Image Pre-Training}, \n    author  = {Lewei Yao and 
Runhui Huang and Lu Hou and Guansong Lu and Minzhe Niu and Hang Xu and Xiaodan Liang and Zhenguo Li and Xin Jiang and Chunjing Xu},\n    year    = {2021},\n    eprint  = {2111.07783},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{fürst2021cloob,\n    title   = {CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP},\n    author  = {Andreas Fürst and Elisabeth Rumetshofer and Viet Tran and Hubert Ramsauer and Fei Tang and Johannes Lehner and David Kreil and Michael Kopp and Günter Klambauer and Angela Bitto-Nemling and Sepp Hochreiter},\n    year    = {2021},\n    eprint  = {2110.11316},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{yeh2021decoupled,\n    title   = {Decoupled Contrastive Learning},\n    author  = {Chun-Hsiao Yeh and Cheng-Yao Hong and Yen-Chi Hsu and Tyng-Luh Liu and Yubei Chen and Yann LeCun},\n    year    = {2021},\n    eprint  = {2110.06848},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{zhai2021lit,\n    title   = {LiT: Zero-Shot Transfer with Locked-image Text Tuning},\n    author  = {Xiaohua Zhai and Xiao Wang and Basil Mustafa and Andreas Steiner and Daniel Keysers and Alexander Kolesnikov and Lucas Beyer},\n    year    = {2021},\n    eprint  = {2111.07991},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{li2021supervision,\n    title   = {Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm},\n    author  = {Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},\n    year    = {2021},\n    eprint  = {2110.05208},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@Article{mu2021slip,\n    author  = {Norman Mu and Alexander Kirillov and David Wagner and Saining Xie},\n    title   = {SLIP: Self-supervision meets Language-Image Pre-training},\n 
   journal = {arXiv preprint arXiv:2112.12750},\n    year    = {2021},\n}\n```\n\n```bibtex\n@misc{su2021roformer,\n    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},\n    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},\n    year    = {2021},\n    eprint  = {2104.09864},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@inproceedings{anonymous2022normformer,\n    title   = {NormFormer: Improved Transformer Pretraining with Extra Normalization},\n    author  = {Anonymous},\n    booktitle = {Submitted to The Tenth International Conference on Learning Representations },\n    year    = {2022},\n    url     = {https://openreview.net/forum?id=GMYWzWztDx5},\n    note    = {under review}\n}\n```\n\n```bibtex\n@inproceedings{Li2022ScalingLP,\n    title   = {Scaling Language-Image Pre-training via Masking},\n    author  = {Yanghao Li and Haoqi Fan and Ronghang Hu and Christoph Feichtenhofer and Kaiming He},\n    year    = {2022}\n}\n```\n\n```bibtex\n@article{Liu2022PatchDropoutEV,\n    title   = {PatchDropout: Economizing Vision Transformers Using Patch Dropout},\n    author  = {Yue Liu and Christos Matsoukas and Fredrik Strand and Hossein Azizpour and Kevin Smith},\n    journal = {ArXiv},\n    year    = {2022},\n    volume  = {abs/2208.07220}\n}\n```\n\n```bibtex\n@misc{shi2023enhance,\n    title   = {Enhance audio generation controllability through representation similarity regularization}, \n    author  = {Yangyang Shi and Gael Le Lan and Varun Nagaraja and Zhaoheng Ni and Xinhao Mei and Ernie Chang and Forrest Iandola and Yang Liu and Vikas Chandra},\n    year    = {2023},\n    eprint  = {2309.08773},\n    archivePrefix = {arXiv},\n    primaryClass = 
{cs.SD}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fx-clip","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fx-clip","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fx-clip/lists"}