Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yzhuoning/Awesome-CLIP
Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
- Host: GitHub
- URL: https://github.com/yzhuoning/Awesome-CLIP
- Owner: yzhuoning
- Created: 2021-09-05T01:43:48.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-08-05T05:15:13.000Z (over 1 year ago)
- Last Synced: 2024-05-19T20:12:09.214Z (7 months ago)
- Topics: clip, contrastive-learning, pre-training
- Homepage:
- Size: 60.5 KB
- Stars: 1,035
- Watchers: 19
- Forks: 54
- Open Issues: 10
- Metadata Files:
  - Readme: README.md
Awesome Lists containing this project
- awesome-vision-language-pretraining - awesome-clip
- Awesome-Computer-Vision - 8
- ultimate-awesome - Awesome-CLIP - Awesome list for research on CLIP (Contrastive Language-Image Pre-Training). (Other Lists / Monkey C Lists)
README
# Awesome CLIP
This repo collects research resources based on CLIP (Contrastive Language-Image Pre-Training), proposed by OpenAI. If you would like to contribute, please open an issue.

## CLIP
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) [[code](https://github.com/openai/CLIP)]
- [CLIP: Connecting Text and Images](https://openai.com/blog/clip/)
- [Multimodal Neurons in Artificial Neural Networks](https://openai.com/blog/multimodal-neurons/)
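For quick reference, here is a minimal zero-shot classification sketch using the `openai/CLIP` package linked above (the image path and candidate captions are placeholders):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: any image file and a set of candidate captions.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each caption.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # e.g. [[0.99 0.01]] if the image shows a cat
```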

## Training
- OpenCLIP (3rd-party, PyTorch) [[code](https://github.com/mlfoundations/open_clip)]
- Train-CLIP (3rd-party, PyTorch) [[code](https://github.com/Zasder3/train-CLIP)]
- Paddle-CLIP (3rd-party, PaddlePaddle) [[code](https://github.com/AgentMaker/Paddle-CLIP)]
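The training repositories above implement variants of CLIP's symmetric contrastive (InfoNCE) objective. A minimal PyTorch sketch of that loss, assuming image/text embeddings and a temperature (`logit_scale`) are already computed (the function name is mine, not taken from any of these repos):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so the dot product becomes cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities, scaled by the (learned) temperature.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The i-th image is paired with the i-th text in the batch.
    labels = torch.arange(image_features.size(0), device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2
```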

## Applications

### GAN
- VQGAN-CLIP [[code](https://github.com/nerdyrodent/VQGAN-CLIP)]
- [StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery](https://arxiv.org/abs/2103.17249) [[code](https://github.com/orpatashnik/StyleCLIP)]
- CLIP Guided Diffusion [[code](https://github.com/afiaka87/clip-guided-diffusion)]
- [CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions](https://arxiv.org/abs/2112.05219) [[code](https://github.com/RameenAbdal/CLIP2StyleGAN)]
- [TargetCLIP: Image-Based CLIP-Guided Essence Transfer](https://arxiv.org/abs/2110.12427) [[code](https://github.com/hila-chefer/TargetCLIP)]
- [DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation](https://arxiv.org/pdf/2110.02711.pdf) [[code](https://github.com/gwang-kim/DiffusionCLIP)]
- [Clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP](https://arxiv.org/pdf/2210.02347.pdf) [[code](https://github.com/justinpinkney/clip2latent)]

### Object Detection
- Roboflow Zero-shot Object Tracking [[code](https://github.com/roboflow-ai/zero-shot-object-tracking)]
- [Zero-Shot Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921) [[code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild)]
- Crop-CLIP [[code](https://github.com/vijishmadhavan/Crop-CLIP)]
- [Detic: Detecting Twenty-thousand Classes using Image-level Supervision](https://arxiv.org/abs/2201.02605) [[code](https://github.com/facebookresearch/Detic)]
- [CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks](https://arxiv.org/abs/2201.05729)
- [SLIP: Self-supervision meets Language-Image Pre-training](https://arxiv.org/abs/2112.12750) [[code](https://github.com/facebookresearch/SLIP)]
- [ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension](https://arxiv.org/pdf/2204.05991.pdf) [[code](https://github.com/allenai/reclip)]

### Information Retrieval
- Unsplash Image Search [[code](https://github.com/haltakov/natural-language-image-search)]
- [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/abs/2104.08860) [[code](https://github.com/ArrowLuo/CLIP4Clip)]
- [Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling](https://arxiv.org/abs/2102.06183) [[code](https://github.com/jayleicn/ClipBERT)]
- Natural Language YouTube Search [[code](https://github.com/haltakov/natural-language-youtube-search)]
- [CLIP-as-service: Embed images and sentences into fixed-length vectors with CLIP](https://github.com/jina-ai/clip-as-service/tree/main/docs) [[code](https://github.com/jina-ai/clip-as-service)]
- clip-retrieval [[code](https://github.com/rom1504/clip-retrieval)]
- [A CLIP-Hitchhiker’s Guide to Long Video Retrieval](https://arxiv.org/pdf/2205.08508.pdf) [code]
- [CLIP2Video: Mastering Video-Text Retrieval via Image CLIP](https://arxiv.org/pdf/2106.11097.pdf) [[code](https://github.com/CryhanFang/CLIP2Video)]
- [X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval](https://arxiv.org/pdf/2207.07285.pdf) [[code](https://github.com/xuguohai/X-CLIP)]
- [Extending CLIP for Category-to-image Retrieval in E-commerce](https://mariyahendriksen.github.io/files/ecir22.pdf) [[code](https://github.com/mariyahendriksen/ecir2022_category_to_image_retrieval)]
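Most of the retrieval projects above follow the same recipe: embed the gallery and the query with CLIP, then rank by cosine similarity. A minimal sketch with the `openai/CLIP` API (image paths and the query string are placeholders; production systems index millions of embeddings with a vector store, as clip-retrieval does):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Embed a small image gallery once (paths are placeholders).
image_paths = ["photo1.jpg", "photo2.jpg", "photo3.jpg"]
with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    image_emb = model.encode_image(images)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# Embed the text query and rank the gallery by cosine similarity.
with torch.no_grad():
    query = clip.tokenize(["two dogs playing in the snow"]).to(device)
    text_emb = model.encode_text(query)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (text_emb @ image_emb.T).squeeze(0)
best_first = scores.argsort(descending=True).tolist()
print([image_paths[i] for i in best_first])
```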

### Representation Learning
- [Wav2CLIP: Learning Robust Audio Representations From CLIP](https://arxiv.org/pdf/2110.11499.pdf) [[code](https://github.com/descriptinc/lyrebird-Wav2CLIP)]
- [CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotation](https://arxiv.org/abs/2112.07133) [code]
- [RegionCLIP: Region-based Language-Image Pretraining](https://arxiv.org/pdf/2112.09106.pdf) [[code](https://github.com/microsoft/RegionCLIP)]
- [CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification](https://arxiv.org/abs/2112.03562) [code]
- [DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting](https://arxiv.org/pdf/2112.01518.pdf) [[code](https://github.com/raoyongming/DenseCLIP)]
- [CyCLIP: Cyclic Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2205.14459v1.pdf) [[code](https://github.com/goel-shashank/CyCLIP)]
- [CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment](https://arxiv.org/pdf/2209.06430.pdf) [[code](https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP)]
- [DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection](https://arxiv.org/pdf/2209.09407.pdf) [[code](https://github.com/Sense-GVT/DeCLIP)]
- [UniCLIP: Unified Framework for Contrastive Language–Image Pre-training](https://arxiv.org/pdf/2209.13430.pdf) [code]
- [SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model](https://arxiv.org/pdf/2210.00705.pdf) [[code](https://github.com/atosystem/SpeechCLIP)]
- [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/pdf/2211.01335.pdf) [[code](https://github.com/OFA-Sys/Chinese-CLIP)]
- [PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining](https://arxiv.org/pdf/2204.14095v2.pdf) [[code](https://github.com/Yuting-Gao/PyramidCLIP)]
- [Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training ](https://arxiv.org/pdf/2207.12661.pdf) [[code](https://github.com/Hxyou/MSCLIP)]
- [Fine-tuned CLIP Models are Efficient Video Learners](https://arxiv.org/pdf/2212.03640.pdf) [[code](https://github.com/muzairkhattak/ViFi-CLIP)]
- [MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2208.12262.pdf) [code]
- [Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm](https://arxiv.org/abs/2110.05208) [[code](https://github.com/Sense-GVT/DeCLIP)]
- [Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision](https://arxiv.org/pdf/2203.05796v1.pdf) [[code](https://github.com/sense-gvt/declip)]

### Text-to-3D Generation
- [CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation](https://arxiv.org/pdf/2110.02624.pdf) [[code](https://github.com/AutodeskAILab/Clip-Forge)]
- [Text2Mesh: Text-Driven Neural Stylization for Meshes](https://arxiv.org/abs/2112.03221) [[code](https://github.com/threedle/text2mesh)]
- [CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP](https://arxiv.org/pdf/2203.00386.pdf) [[code](https://github.com/HFAiLab/clip-gen)]
- [CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders](https://arxiv.org/pdf/2106.14843.pdf) [[code](https://github.com/kvfrans/clipdraw/)]
- [CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields](https://arxiv.org/pdf/2112.05139.pdf) [[code](https://github.com/cassiePython/CLIPNeRF)]
- [MotionCLIP: Exposing Human Motion Generation to CLIP Space](https://arxiv.org/pdf/2203.08063.pdf) [[code](https://github.com/GuyTevet/MotionCLIP)]
- [AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars](https://arxiv.org/pdf/2205.08535.pdf) [[code](https://github.com/hongfz16/AvatarCLIP)]
- [ClipFace: Text-guided Editing of Textured 3D Morphable Models](https://arxiv.org/pdf/2212.01406.pdf) [[code](https://github.com/sanonymous22/ClipFace)]

### Text-to-Image Generation
- Big Sleep: A simple command line tool for text to image generation [[code](https://github.com/lucidrains/big-sleep)]
- Deep Daze: A simple command line tool for text to image generation [[code](https://github.com/lucidrains/deep-daze)]
- [CLIP-CLOP: CLIP-Guided Collage and Photomontage](https://arxiv.org/pdf/2205.03146v2.pdf) [[code](https://github.com/deepmind/arnheim)]
- [CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP](https://arxiv.org/pdf/2203.00386.pdf) [[code](https://github.com/HFAiLab/clip-gen/blob/main/README_en.md)]

### Prompt Learning
- [Learning to Prompt for Vision-Language Models](https://arxiv.org/abs/2109.01134) [[code](https://github.com/KaiyangZhou/CoOp)]
- [Conditional Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2203.05557) [[code](https://github.com/KaiyangZhou/CoOp)]
- [Prompt-aligned Gradient for Prompt Tuning](https://arxiv.org/abs/2205.14865) [[code](https://github.com/BeierZhu/Prompt-align)]
- [CLIP-Adapter: Better Vision-Language Models with Feature Adapters](https://arxiv.org/abs/2110.04544) [[code](https://github.com/gaopengcuhk/CLIP-Adapter)]
- [Learning to Compose Soft Prompts for Compositional Zero-Shot Learning](https://arxiv.org/abs/2204.03574) [[code](https://github.com/BatsResearch/csp)]
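The prompt-learning papers above replace hand-written prompt templates with learned parameters while keeping CLIP frozen. For context, here is the hand-crafted prompt-ensembling baseline they compare against, sketched with the public `openai/CLIP` API (class names and templates are illustrative, not taken from any of these papers):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

classes = ["cat", "dog", "horse"]  # illustrative label set
templates = ["a photo of a {}.", "a drawing of a {}.", "a photo of a small {}."]

# Build one classifier weight per class by averaging text embeddings over templates.
with torch.no_grad():
    weights = []
    for name in classes:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean_emb = emb.mean(dim=0)
        weights.append(mean_emb / mean_emb.norm())
    zero_shot_classifier = torch.stack(weights)  # (num_classes, embed_dim)

# L2-normalized features from model.encode_image are scored against
# zero_shot_classifier with a dot product; prompt learning swaps the fixed
# templates for context vectors optimized on a few labeled examples per class.
```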

### Video Understanding
- [VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding](https://arxiv.org/pdf/2109.14084.pdf) [[code](https://github.com/pytorch/fairseq/tree/main/examples/MMPT)]
- [FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks](https://arxiv.org/pdf/2203.13371.pdf) [[code](https://github.com/bryant1410/fitclip)]
- [Frozen CLIP Models are Efficient Video Learners](https://arxiv.org/pdf/2208.03550.pdf) [[code](https://github.com/OpenGVLab/efficient-video-recognition)]
- [Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization](https://arxiv.org/pdf/2210.12826.pdf) [[code](https://github.com/pschaldenbrand/Text2Video)]
- [MovieCLIP: Visual Scene Recognition in Movies](https://arxiv.org/pdf/2210.11065v2.pdf) [[code](https://github.com/usc-sail/mica-MovieCLIP)]

### Image Captioning
- CLIP prefix captioning [[code](https://github.com/rmokady/CLIP_prefix_caption)]
- [CLIPScore: A Reference-free Evaluation Metric for Image Captioning](https://arxiv.org/abs/2104.08718) [[code](https://github.com/jmhessel/clipscore)]
- [ClipCap: CLIP Prefix for Image Captioning](https://arxiv.org/pdf/2111.09734v1.pdf) [[code](https://github.com/rmokady/CLIP_prefix_caption)]
- [Text-Only Training for Image Captioning using Noise-Injected CLIP](https://arxiv.org/pdf/2211.00575.pdf) [[code](https://github.com/DavidHuji/CapDec)]
- [Fine-grained Image Captioning with CLIP Reward](https://arxiv.org/pdf/2205.13115.pdf) [[code](https://github.com/j-min/CLIP-Caption-Reward)]
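CLIPScore evaluates a caption without references by rescaling the cosine similarity between CLIP's image and text embeddings (w * max(cos, 0), with w = 2.5 in the paper). A minimal sketch of that computation with `openai/CLIP`; see the official clipscore repo for the reference implementation, and note that the path and caption below are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clipscore(image_path: str, caption: str, w: float = 2.5) -> float:
    """Reference-free caption quality: w * max(cosine(image, caption), 0)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cosine = (img_emb @ txt_emb.T).item()
    return w * max(cosine, 0.0)

# print(clipscore("photo.jpg", "a dog catching a frisbee"))  # placeholders
```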

### Image Editing
- [HairCLIP: Design Your Hair by Text and Reference Image](https://arxiv.org/pdf/2112.05142.pdf) [[code](https://github.com/wty-ustc/HairCLIP)]
- [CLIPstyler: Image Style Transfer with a Single Text Condition](https://arxiv.org/pdf/2112.00374.pdf) [[code](https://github.com/paper11667/CLIPstyler)]
- [CLIPasso: Semantically-Aware Object Sketching](https://clipasso.github.io/clipasso/static/source/paper_CLIPasso_Semantically_Aware_Object_Sketching.pdf) [[code](https://clipasso.github.io/clipasso/)]
- [Image-based CLIP-Guided Essence Transfer](https://arxiv.org/pdf/2110.12427.pdf) [[code](https://github.com/hila-chefer/TargetCLIP)]
- [CLIPDraw: Synthesize drawings to match a text prompt!](https://arxiv.org/pdf/2106.14843.pdf) [[code](https://github.com/kvfrans/clipdraw)]
- [CLIP-CLOP: CLIP-Guided Collage and Photomontage](https://arxiv.org/pdf/2205.03146.pdf) [[code](https://github.com/deepmind/arnheim)]
- [Towards Counterfactual Image Manipulation via CLIP](https://arxiv.org/pdf/2207.02812.pdf) [[code](https://github.com/yingchen001/CF-CLIP)]
- [ClipCrop: Conditioned Cropping Driven by Vision-Language Model](https://arxiv.org/pdf/2211.11492.pdf) [code]
- [CLIPascene: Scene Sketching with Different Types and Levels of Abstraction](https://arxiv.org/pdf/2211.17256.pdf) [[code](https://clipascene.github.io/CLIPascene/)]

### Image Segmentation
- [CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation](https://arxiv.org/pdf/2203.02668.pdf) [[code](https://github.com/CVI-SZU/CLIMS)]
- [Image Segmentation Using Text and Image Prompts](https://arxiv.org/pdf/2112.10003.pdf) [[code](https://github.com/timojl/clipseg)]
- [Extract Free Dense Labels from CLIP](https://arxiv.org/pdf/2112.01071.pdf) [[code](https://github.com/chongzhou96/MaskCLIP)]
- [Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP](https://arxiv.org/pdf/2210.04150.pdf) [[code](https://github.com/facebookresearch/ov-seg)]

### 3D Recognition
- [PointCLIP: Point Cloud Understanding by CLIP](https://arxiv.org/pdf/2112.02413.pdf) [[code](https://github.com/zrrskywalker/pointclip)]
- [CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training](https://arxiv.org/pdf/2210.01055.pdf) [[code](https://github.com/tyhuang0428/CLIP2Point)]
- [MotionCLIP: Exposing Human Motion Generation to CLIP Space](https://arxiv.org/pdf/2203.08063.pdf) [[code](https://github.com/GuyTevet/MotionCLIP)]
- [LidarCLIP or: How I Learned to Talk to Point Clouds](https://arxiv.org/pdf/2212.06858.pdf) [[code](https://github.com/atonderski/lidarclip)]
- [CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP](https://arxiv.org/pdf/2301.04926.pdf) [code]

### Audio
- [AudioCLIP: Extending CLIP to Image, Text and Audio](https://arxiv.org/pdf/2106.13043.pdf) [[code](https://github.com/AndreyGuzhov/AudioCLIP)]
- [Wav2CLIP: Learning Robust Audio Representations from Clip](https://arxiv.org/pdf/2110.11499.pdf) [[code](https://github.com/descriptinc/lyrebird-wav2clip)]
- [AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization](https://arxiv.org/pdf/2210.05060.pdf) [code]

### Language Tasks
- [CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment](https://arxiv.org/pdf/2203.07190v1.pdf) [code]

### Object Navigation
- [CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration](https://arxiv.org/pdf/2203.10421.pdf) [code]

### Localization
- [Adapting CLIP For Phrase Localization Without Further Training](https://arxiv.org/pdf/2204.03647.pdf) [[code](https://github.com/pals-ttic/adapting-CLIP)]

### Others
- Multilingual-CLIP [[code](https://github.com/FreddeFrallan/Multilingual-CLIP)]
- CLIP (With Haiku + Jax!) [[code](https://github.com/kingoflolz/CLIP_JAX)]
- [CLIP-Event: Connecting Text and Images with Event Structures](https://arxiv.org/abs/2201.05078) [[code](https://github.com/limanling/clip-event)]
- [How Much Can CLIP Benefit Vision-and-Language Tasks?](https://openreview.net/forum?id=zf_Ll3HZWgy) [[code](https://github.com/clip-vil/CLIP-ViL)]
- [CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning](https://arxiv.org/pdf/2203.11096.pdf) [[code](https://asgaardlab.github.io/CLIPxGamePhysics/)]
- [CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory](https://arxiv.org/pdf/2210.05663.pdf) [[code](https://github.com/notmahi/clip-fields)]
- [CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet](https://arxiv.org/pdf/2212.06138v1.pdf) [[code](https://github.com/lightdxy/ft-clip)]
- [Task Residual for Tuning Vision-Language Models](https://arxiv.org/pdf/2211.10277.pdf) [[code](https://github.com/geekyutao/TaskRes)]

## Acknowledgment
Inspired by [Awesome Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer).