# Awesome CLIP
This repo collects research resources for CLIP (Contrastive Language-Image Pre-Training), proposed by OpenAI. If you would like to contribute, please open an issue.

## CLIP
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) [[code](https://github.com/openai/CLIP)]
- [CLIP: Connecting Text and Images](https://openai.com/blog/clip/)
- [Multimodal Neurons in Artificial Neural Networks](https://openai.com/blog/multimodal-neurons/)
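For quick reference, below is a minimal zero-shot classification sketch using the official `openai/CLIP` package linked above; the image path and candidate captions are placeholders.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: swap in your own image and candidate captions.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # probability that each caption matches the image
```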

## Training
- OpenCLIP (3rd-party, PyTorch) [[code](https://github.com/mlfoundations/open_clip)]
- Train-CLIP (3rd-party, PyTorch) [[code](https://github.com/Zasder3/train-CLIP)]
- Paddle-CLIP (3rd-party, PaddlePaddle) [[code](https://github.com/AgentMaker/Paddle-CLIP)]
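All three codebases above train with some variant of CLIP's symmetric contrastive objective. The sketch below is a minimal PyTorch version of that loss, following the pseudocode in the CLIP paper; the function name and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # image_features, text_features: (batch, dim) embeddings from the two encoders.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits; matching image/text pairs lie on the diagonal.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    labels = torch.arange(image_features.size(0), device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2
```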

## Applications

### GAN
- VQGAN-CLIP [[code](https://github.com/nerdyrodent/VQGAN-CLIP)]
- [StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery](https://arxiv.org/abs/2103.17249) [[code](https://github.com/orpatashnik/StyleCLIP)]
- CLIP Guided Diffusion [[code](https://github.com/afiaka87/clip-guided-diffusion)]
- [CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions](https://arxiv.org/abs/2112.05219) [[code](https://github.com/RameenAbdal/CLIP2StyleGAN)]
- [TargetCLIP: Image-Based CLIP-Guided Essence Transfer](https://arxiv.org/abs/2110.12427) [[code](https://github.com/hila-chefer/TargetCLIP)]
- [DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation](https://arxiv.org/pdf/2110.02711.pdf) [[code](https://github.com/gwang-kim/DiffusionCLIP)]
- [Clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP](https://arxiv.org/pdf/2210.02347.pdf) [[code](https://github.com/justinpinkney/clip2latent)]
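These projects share one core mechanism: the generator's output is passed through CLIP's image encoder and optimized to match a text prompt. The sketch below shows that guidance loss schematically; it assumes images are already resized and normalized for CLIP, and details such as augmentations and cutouts vary per project.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def clip_guidance_loss(images: torch.Tensor, prompt: str) -> torch.Tensor:
    """1 - cosine similarity between generated images and a text prompt.

    images: (N, 3, 224, 224), already preprocessed for CLIP.
    A generator (VQGAN, StyleGAN, diffusion model, ...) backpropagates
    through this loss to steer its output toward the prompt.
    """
    text = clip.tokenize([prompt]).to(device)
    image_feats = model.encode_image(images)
    text_feats = model.encode_text(text)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return 1.0 - (image_feats @ text_feats.T).mean()
```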

### Object Detection
- Roboflow Zero-shot Object Tracking [[code](https://github.com/roboflow-ai/zero-shot-object-tracking)]
- [Zero-Shot Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921) [[code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild)]
- Crop-CLIP [[code](https://github.com/vijishmadhavan/Crop-CLIP)]
- [Detic: Detecting Twenty-thousand Classes using Image-level Supervision](https://arxiv.org/abs/2201.02605) [[code](https://github.com/facebookresearch/Detic)]
- [CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks](https://arxiv.org/abs/2201.05729)
- [SLIP: Self-supervision meets Language-Image Pre-training](https://arxiv.org/abs/2112.12750) [[code](https://github.com/facebookresearch/SLIP)]
- [ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension](https://arxiv.org/pdf/2204.05991.pdf) [[code](https://github.com/allenai/reclip)]

### Information Retrieval
- Unsplash Image Search [[code](https://github.com/haltakov/natural-language-image-search)]
- [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/abs/2104.08860) [[code](https://github.com/ArrowLuo/CLIP4Clip)]
- [Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling](https://arxiv.org/abs/2102.06183) [[code](https://github.com/jayleicn/ClipBERT)]
- Natural Language YouTube Search [[code](https://github.com/haltakov/natural-language-youtube-search)]
- [CLIP-as-service: Embed images and sentences into fixed-length vectors with CLIP](https://github.com/jina-ai/clip-as-service/tree/main/docs) [[code](https://github.com/jina-ai/clip-as-service)]
- clip-retrieval [[code](https://github.com/rom1504/clip-retrieval)]
- [A CLIP-Hitchhiker’s Guide to Long Video Retrieval](https://arxiv.org/pdf/2205.08508.pdf) [code]
- [CLIP2Video: Mastering Video-Text Retrieval via Image CLIP](https://arxiv.org/pdf/2106.11097.pdf) [[code](https://github.com/CryhanFang/CLIP2Video)]
- [X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval](https://arxiv.org/pdf/2207.07285.pdf) [[code](https://github.com/xuguohai/X-CLIP)]
- [Extending CLIP for Category-to-image Retrieval in E-commerce](https://mariyahendriksen.github.io/files/ecir22.pdf) [[code](https://github.com/mariyahendriksen/ecir2022_category_to_image_retrieval)]
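Most of the retrieval projects above follow the same pattern: pre-compute CLIP embeddings for a gallery, embed the query, and rank by cosine similarity. A minimal sketch of that idea with the `openai/CLIP` package (the gallery paths and query are placeholders; real systems index millions of pre-computed embeddings):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder gallery of images to search over.
paths = ["beach.jpg", "city.jpg", "forest.jpg"]
with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    image_feats = model.encode_image(images)
    image_feats /= image_feats.norm(dim=-1, keepdim=True)

    query = clip.tokenize(["a photo of a sunny beach"]).to(device)
    text_feat = model.encode_text(query)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

# Rank gallery images by cosine similarity to the text query.
scores = (text_feat @ image_feats.T).squeeze(0)
best = scores.argmax().item()
print(paths[best], scores.tolist())
```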

### Representation Learning
- [Wav2CLIP: Learning Robust Audio Representations From CLIP](https://arxiv.org/pdf/2110.11499.pdf) [[code](https://github.com/descriptinc/lyrebird-Wav2CLIP)]
- [CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotation](https://arxiv.org/abs/2112.07133) [code]
- [RegionCLIP: Region-based Language-Image Pretraining](https://arxiv.org/pdf/2112.09106.pdf) [[code](https://github.com/microsoft/RegionCLIP)]
- [CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification](https://arxiv.org/abs/2112.03562) [code]
- [DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting](https://arxiv.org/pdf/2112.01518.pdf) [[code](https://github.com/raoyongming/DenseCLIP)]
- [CyCLIP: Cyclic Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2205.14459v1.pdf) [[code](https://github.com/goel-shashank/CyCLIP)]
- [CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment](https://arxiv.org/pdf/2209.06430.pdf) [[code](https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP)]
- [DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection](https://arxiv.org/pdf/2209.09407.pdf) [[code](https://github.com/Sense-GVT/DeCLIP)]
- [UniCLIP: Unified Framework for Contrastive Language–Image Pre-training](https://arxiv.org/pdf/2209.13430.pdf) [code]
- [SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model](https://arxiv.org/pdf/2210.00705.pdf) [[code](https://github.com/atosystem/SpeechCLIP)]
- [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/pdf/2211.01335.pdf) [[code](https://github.com/OFA-Sys/Chinese-CLIP)]
- [PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining](https://arxiv.org/pdf/2204.14095v2.pdf) [[code](https://github.com/Yuting-Gao/PyramidCLIP)]
- [Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training](https://arxiv.org/pdf/2207.12661.pdf) [[code](https://github.com/Hxyou/MSCLIP)]
- [Fine-tuned CLIP Models are Efficient Video Learners](https://arxiv.org/pdf/2212.03640.pdf) [[code](https://github.com/muzairkhattak/ViFi-CLIP)]
- [MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2208.12262.pdf) [code]
- [Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm](https://arxiv.org/abs/2110.05208) [[code](https://github.com/Sense-GVT/DeCLIP)]
- [Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision](https://arxiv.org/pdf/2203.05796v1.pdf) [[code](https://github.com/sense-gvt/declip)]

### Text-to-3D Generation
- [CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation](https://arxiv.org/pdf/2110.02624.pdf) [[code](https://github.com/AutodeskAILab/Clip-Forge)]
- [Text2Mesh: Text-Driven Neural Stylization for Meshes](https://arxiv.org/abs/2112.03221) [[code](https://github.com/threedle/text2mesh)]
- [CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP](https://arxiv.org/pdf/2203.00386.pdf) [[code](https://github.com/HFAiLab/clip-gen)]
- [CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders](https://arxiv.org/pdf/2106.14843.pdf) [[code](https://github.com/kvfrans/clipdraw/)]
- [CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields](https://arxiv.org/pdf/2112.05139.pdf) [[code](https://github.com/cassiePython/CLIPNeRF)]
- [MotionCLIP: Exposing Human Motion Generation to CLIP Space](https://arxiv.org/pdf/2203.08063.pdf) [[code](https://github.com/GuyTevet/MotionCLIP)]
- [AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars](https://arxiv.org/pdf/2205.08535.pdf) [[code](https://github.com/hongfz16/AvatarCLIP)]
- [ClipFace: Text-guided Editing of Textured 3D Morphable Models](https://arxiv.org/pdf/2212.01406.pdf) [[code](https://github.com/sanonymous22/ClipFace)]

### Text-to-Image Generation
- Big Sleep: A simple command-line tool for text-to-image generation [[code](https://github.com/lucidrains/big-sleep)]
- Deep Daze: A simple command-line tool for text-to-image generation [[code](https://github.com/lucidrains/deep-daze)]
- [CLIP-CLOP: CLIP-Guided Collage and Photomontage](https://arxiv.org/pdf/2205.03146v2.pdf) [[code](https://github.com/deepmind/arnheim)]
- [CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP](https://arxiv.org/pdf/2203.00386.pdf) [[code](https://github.com/HFAiLab/clip-gen/blob/main/README_en.md)]

### Prompt Learning
- [Learning to Prompt for Vision-Language Models](https://arxiv.org/abs/2109.01134) [[code](https://github.com/KaiyangZhou/CoOp)]
- [Conditional Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2203.05557) [[code](https://github.com/KaiyangZhou/CoOp)]
- [Prompt-aligned Gradient for Prompt Tuning](https://arxiv.org/abs/2205.14865) [[code](https://github.com/BeierZhu/Prompt-align)]
- [CLIP-Adapter: Better Vision-Language Models with Feature Adapters](https://arxiv.org/abs/2110.04544) [[code](https://github.com/gaopengcuhk/CLIP-Adapter)]
- [Learning to Compose Soft Prompts for Compositional Zero-Shot Learning](https://arxiv.org/abs/2204.03574) [[code](https://github.com/BatsResearch/csp)]
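The prompt-learning papers above replace hand-written templates such as `"a photo of a {}."` with learned context vectors. As a runnable baseline for comparison, the sketch below builds zero-shot classifier weights by averaging CLIP text embeddings over a few hand-crafted templates; the class names and templates are illustrative.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Illustrative class names and prompt templates; CoOp-style methods learn
# the context tokens instead of hand-crafting them.
classnames = ["airplane", "dog", "strawberry"]
templates = ["a photo of a {}.", "a drawing of a {}.", "a close-up photo of a {}."]

with torch.no_grad():
    weights = []
    for name in classnames:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        feats = model.encode_text(tokens)
        feats /= feats.norm(dim=-1, keepdim=True)
        mean = feats.mean(dim=0)          # ensemble the templates
        weights.append(mean / mean.norm())
    zeroshot_classifier = torch.stack(weights)  # (num_classes, embed_dim)
```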

### Video Understanding
- [VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding](https://arxiv.org/pdf/2109.14084.pdf) [[code](https://github.com/pytorch/fairseq/tree/main/examples/MMPT)]
- [FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks](https://arxiv.org/pdf/2203.13371.pdf) [[code](https://github.com/bryant1410/fitclip)]
- [Frozen CLIP Models are Efficient Video Learners](https://arxiv.org/pdf/2208.03550.pdf) [[code](https://github.com/OpenGVLab/efficient-video-recognition)]
- [Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization](https://arxiv.org/pdf/2210.12826.pdf) [[code](https://github.com/pschaldenbrand/Text2Video)]
- [MovieCLIP: Visual Scene Recognition in Movies](https://arxiv.org/pdf/2210.11065v2.pdf) [[code](https://github.com/usc-sail/mica-MovieCLIP)]

### Image Captioning
- CLIP prefix captioning [[code](https://github.com/rmokady/CLIP_prefix_caption)]
- [CLIPScore: A Reference-free Evaluation Metric for Image Captioning](https://arxiv.org/abs/2104.08718) [[code](https://github.com/jmhessel/clipscore)]
- [ClipCap: CLIP Prefix for Image Captioning](https://arxiv.org/pdf/2111.09734v1.pdf) [[code](https://github.com/rmokady/CLIP_prefix_caption)]
- [Text-Only Training for Image Captioning using Noise-Injected CLIP](https://arxiv.org/pdf/2211.00575.pdf) [[code](https://github.com/DavidHuji/CapDec)]
- [Fine-grained Image Captioning with CLIP Reward](https://arxiv.org/pdf/2205.13115.pdf) [[code](https://github.com/j-min/CLIP-Caption-Reward)]
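CLIPScore, listed above, is a reference-free metric: it scores a caption by its CLIP similarity to the image. A simplified sketch of the formula from the paper, `w * max(cos(image, caption), 0)` with `w = 2.5`, is below; the official implementation is the linked `jmhessel/clipscore` repo, and the image path and caption here are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clipscore(image_path: str, caption: str, w: float = 2.5) -> float:
    """Reference-free caption score: w * max(cosine(image, caption), 0)."""
    with torch.no_grad():
        img = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        txt = clip.tokenize([caption]).to(device)
        img_feat = model.encode_image(img)
        txt_feat = model.encode_text(txt)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        cos = (img_feat @ txt_feat.T).item()
    return w * max(cos, 0.0)

# Placeholder usage
print(clipscore("example.jpg", "a dog playing fetch in a park"))
```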

### Image Editing
- [HairCLIP: Design Your Hair by Text and Reference Image](https://arxiv.org/pdf/2112.05142.pdf) [[code](https://github.com/wty-ustc/HairCLIP)]
- [CLIPstyler: Image Style Transfer with a Single Text Condition](https://arxiv.org/pdf/2112.00374.pdf) [[code](https://github.com/paper11667/CLIPstyler)]
- [CLIPasso: Semantically-Aware Object Sketching](https://clipasso.github.io/clipasso/static/source/paper_CLIPasso_Semantically_Aware_Object_Sketching.pdf) [[code](https://clipasso.github.io/clipasso/)]
- [Image-based CLIP-Guided Essence Transfer](https://arxiv.org/pdf/2110.12427.pdf) [[code](https://github.com/hila-chefer/TargetCLIP)]
- [CLIPDraw: Synthesize drawings to match a text prompt!](https://arxiv.org/pdf/2106.14843.pdf) [[code](https://github.com/kvfrans/clipdraw)]
- [CLIP-CLOP: CLIP-Guided Collage and Photomontage](https://arxiv.org/pdf/2205.03146.pdf) [[code](https://github.com/deepmind/arnheim)]
- [Towards Counterfactual Image Manipulation via CLIP](https://arxiv.org/pdf/2207.02812.pdf) [[code](https://github.com/yingchen001/CF-CLIP)]
- [ClipCrop: Conditioned Cropping Driven by Vision-Language Model](https://arxiv.org/pdf/2211.11492.pdf) [code]
- [CLIPascene: Scene Sketching with Different Types and Levels of Abstraction](https://arxiv.org/pdf/2211.17256.pdf) [[code](https://clipascene.github.io/CLIPascene/)]

### Image Segmentation
- [CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation](https://arxiv.org/pdf/2203.02668.pdf) [[code](https://github.com/CVI-SZU/CLIMS)]
- [Image Segmentation Using Text and Image Prompts](https://arxiv.org/pdf/2112.10003.pdf) [[code](https://github.com/timojl/clipseg)]
- [Extract Free Dense Labels from CLIP](https://arxiv.org/pdf/2112.01071.pdf) [[code](https://github.com/chongzhou96/MaskCLIP)]
- [Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP](https://arxiv.org/pdf/2210.04150.pdf) [[code](https://github.com/facebookresearch/ov-seg)]

### 3D Recognition
- [PointCLIP: Point Cloud Understanding by CLIP](https://arxiv.org/pdf/2112.02413.pdf) [[code](https://github.com/zrrskywalker/pointclip)]
- [CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training](https://arxiv.org/pdf/2210.01055.pdf) [[code](https://github.com/tyhuang0428/CLIP2Point)]
- [MotionCLIP: Exposing Human Motion Generation to CLIP Space](https://arxiv.org/pdf/2203.08063.pdf) [[code](https://github.com/GuyTevet/MotionCLIP)]
- [LidarCLIP or: How I Learned to Talk to Point Clouds](https://arxiv.org/pdf/2212.06858.pdf) [[code](https://github.com/atonderski/lidarclip)]
- [CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP](https://arxiv.org/pdf/2301.04926.pdf) [code]

### Audio
- [AudioCLIP: Extending CLIP to Image, Text and Audio](https://arxiv.org/pdf/2106.13043.pdf) [[code](https://github.com/AndreyGuzhov/AudioCLIP)]
- [Wav2CLIP: Learning Robust Audio Representations from Clip](https://arxiv.org/pdf/2110.11499.pdf) [[code](https://github.com/descriptinc/lyrebird-wav2clip)]
- [AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization](https://arxiv.org/pdf/2210.05060.pdf) [code]

### Language Tasks
- [CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment](https://arxiv.org/pdf/2203.07190v1.pdf) [code]

### Object Navigation
- [CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration](https://arxiv.org/pdf/2203.10421.pdf) [code]

### Localization
- [Adapting CLIP For Phrase Localization Without Further Training](https://arxiv.org/pdf/2204.03647.pdf) [[code](https://github.com/pals-ttic/adapting-CLIP)]

### Others
- Multilingual-CLIP [[code](https://github.com/FreddeFrallan/Multilingual-CLIP)]
- CLIP (With Haiku + Jax!) [[code](https://github.com/kingoflolz/CLIP_JAX)]
- [CLIP-Event: Connecting Text and Images with Event Structures](https://arxiv.org/abs/2201.05078) [[code](https://github.com/limanling/clip-event)]
- [How Much Can CLIP Benefit Vision-and-Language Tasks?](https://openreview.net/forum?id=zf_Ll3HZWgy) [[code](https://github.com/clip-vil/CLIP-ViL)]
- [CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning](https://arxiv.org/pdf/2203.11096.pdf) [[code](https://asgaardlab.github.io/CLIPxGamePhysics/)]
- [CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory](https://arxiv.org/pdf/2210.05663.pdf) [[code](https://github.com/notmahi/clip-fields)]
- [CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet](https://arxiv.org/pdf/2212.06138v1.pdf) [[code](https://github.com/lightdxy/ft-clip)]
- [Task Residual for Tuning Vision-Language Models](https://arxiv.org/pdf/2211.10277.pdf) [[code](https://github.com/geekyutao/TaskRes)]

## Acknowledgment
Inspired by [Awesome Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer).