# Awesome-CLIP
Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
https://github.com/yzhuoning/Awesome-CLIP
## CLIP
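- Learning Transferable Visual Models From Natural Language Supervision [[code](https://github.com/openai/CLIP)]

For orientation, here is a minimal zero-shot classification sketch using the openai/CLIP package. It assumes PyTorch plus `pip install git+https://github.com/openai/CLIP.git`; the file `dog.jpg` and the label set are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and a few candidate labels, then rank the labels by similarity.
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # the matching label should receive the highest probability
```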
## Applications
### GAN
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
- CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions
- TargetCLIP: Image-Based CLIP-Guided Essence Transfer [[code](https://github.com/hila-chefer/TargetCLIP)]
- DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation [[code](https://github.com/gwang-kim/DiffusionCLIP)]
- Clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP
### Object Detection
- Zero-Shot Detection via Vision and Language Knowledge Distillation
- Detic: Detecting Twenty-thousand Classes using Image-level Supervision
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
- SLIP: Self-supervision meets Language-Image Pre-training
- ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
### Information Retrieval
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
- CLIP-as-service: Embed images and sentences into fixed-length vectors with CLIP [[code](https://github.com/jina-ai/clip-as-service)] (a minimal client sketch follows this list)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
- X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval [[code](https://github.com/xuguohai/X-CLIP)]
- Extending CLIP for Category-to-image Retrieval in E-commerce
- A CLIP-Hitchhiker’s Guide to Long Video Retrieval
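The CLIP-as-service entry above exposes CLIP encoding over gRPC/HTTP. A minimal client sketch, assuming `pip install clip-client` and a `clip_server` instance already running locally on its default port (`dog.jpg` is a placeholder path):

```python
from clip_client import Client

# Connect to a locally running clip_server (started with `python -m clip_server`).
c = Client("grpc://0.0.0.0:51000")

# Texts and image paths/URIs can be mixed; each input becomes one fixed-length vector.
vectors = c.encode(["a photo of a corgi on the beach", "dog.jpg"])
print(vectors.shape)  # (2, 512) with the default ViT-B/32 model
```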
### Representation Learning
- CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotation (the contrastive objective shared across this subsection is sketched after the list)
- RegionCLIP: Region-based Language-Image Pretraining
- CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
- CyCLIP: Cyclic Contrastive Language-Image Pretraining [[code](https://github.com/goel-shashank/CyCLIP)]
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [[code](https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP)]
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [[code](https://github.com/Sense-GVT/DeCLIP)]
- SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [[code](https://github.com/OFA-Sys/Chinese-CLIP)]
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [[code](https://github.com/Yuting-Gao/PyramidCLIP)]
- Fine-tuned CLIP Models are Efficient Video Learners [[code](https://github.com/muzairkhattak/ViFi-CLIP)]
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [[code](https://github.com/Sense-GVT/DeCLIP)]
- Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision [[code](https://github.com/Sense-GVT/DeCLIP)]
- UniCLIP: Unified Framework for Contrastive Language–Image Pre-training
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
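The methods in this subsection all build on CLIP's symmetric contrastive objective. For reference, a self-contained sketch of that loss (the random features and temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE: matching image/text pairs sit on the diagonal
    # of the similarity matrix and serve as each other's positives.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```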
### Text-to-3D Generation
- CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation [[code](https://github.com/AutodeskAILab/Clip-Forge)]
- Text2Mesh: Text-Driven Neural Stylization for Meshes
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields
- AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars
- ClipFace: Text-guided Editing of Textured 3D Morphable Models
### Image Editing
- CLIPasso: Semantically-Aware Object Sketching
- HairCLIP: Design Your Hair by Text and Reference Image [[code](https://github.com/wty-ustc/HairCLIP)]
- CLIPstyler: Image Style Transfer with a Single Text Condition
- Image-based CLIP-Guided Essence Transfer [[code](https://github.com/hila-chefer/TargetCLIP)]
- CLIP-CLOP: CLIP-Guided Collage and Photomontage
- Towards Counterfactual Image Manipulation via CLIP [[code](https://github.com/yingchen001/CF-CLIP)]
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model
- CLIPascene: Scene Sketching with Different Types and Levels of Abstraction
- CLIPDraw: Synthesize drawings to match a text prompt!
### 3D Recognition
- LidarCLIP or: How I Learned to Talk to Point Clouds
- PointCLIP: Point Cloud Understanding by CLIP
- CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training
- CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
- MotionCLIP: Exposing Human Motion Generation to CLIP Space
### Text-to-Image Generation
- CLIP-CLOP: CLIP-Guided Collage and Photomontage
- CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [[code](https://github.com/HFAiLab/clip-gen/blob/main/README_en.md)]
### Prompt Learning
- Learning to Prompt for Vision-Language Models (a minimal prompt-tuning sketch follows this list)
- Conditional Prompt Learning for Vision-Language Models
- Prompt-aligned Gradient for Prompt Tuning [[code](https://github.com/BeierZhu/Prompt-align)]
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [[code](https://github.com/gaopengcuhk/CLIP-Adapter)]
- Learning to Compose Soft Prompts for Compositional Zero-Shot Learning
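The prompt-learning papers above (CoOp and its successors) tune a small set of soft prompt vectors while CLIP itself stays frozen. Below is a self-contained toy sketch of that idea in PyTorch; the tiny linear "encoder", the dimensions, and the random features are stand-ins for CLIP's frozen towers, not any paper's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim, feat_dim, n_ctx, n_classes = 32, 64, 4, 3

# Frozen stand-in for CLIP's text encoder: only the context vectors are trained.
frozen_encoder = nn.Linear(embed_dim, feat_dim)
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

class_embeddings = torch.randn(n_classes, embed_dim)       # fixed class-name tokens
ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)   # learnable soft prompt

def text_features() -> torch.Tensor:
    # Prepend the shared learnable context to every class token, then encode.
    prompts = torch.cat(
        [ctx.unsqueeze(0).expand(n_classes, -1, -1),
         class_embeddings.unsqueeze(1)],
        dim=1,
    )
    return F.normalize(frozen_encoder(prompts.mean(dim=1)), dim=-1)

# Random stand-ins for precomputed CLIP image features and their labels.
image_feats = F.normalize(torch.randn(16, feat_dim), dim=-1)
labels = torch.randint(0, n_classes, (16,))

optimizer = torch.optim.Adam([ctx], lr=1e-2)
for step in range(100):
    logits = 100.0 * image_feats @ text_features().t()  # CLIP-style scaled cosine logits
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```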
### Video Understanding
- VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
- FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks
- Frozen CLIP Models are Efficient Video Learners [[code](https://github.com/OpenGVLab/efficient-video-recognition)]
- Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization
- MovieCLIP: Visual Scene Recognition in Movies [[code](https://github.com/usc-sail/mica-MovieCLIP)]
### Image Captioning
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning (see the scoring sketch after this list)
- ClipCap: CLIP Prefix for Image Captioning
- Text-Only Training for Image Captioning using Noise-Injected CLIP
- Fine-grained Image Captioning with CLIP Reward [[code](https://github.com/j-min/CLIP-Caption-Reward)]
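The CLIPScore metric listed above is just a rescaled, clipped cosine similarity between CLIP's image and caption embeddings: CLIPScore = w * max(cos(image, caption), 0), with w = 2.5 in the paper. A minimal sketch with the openai/CLIP package (`dog.jpg` is a placeholder):

```python
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def clip_score(image_path: str, caption: str, w: float = 2.5) -> float:
    # CLIPScore = w * max(cos(image, caption), 0); the paper sets w = 2.5.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    cos = F.cosine_similarity(img_feat.float(), txt_feat.float()).item()
    return w * max(cos, 0.0)

print(clip_score("dog.jpg", "a corgi running on the beach"))
```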
### Image Segmentation
- CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation [[code](https://github.com/CVI-SZU/CLIMS)]
- Image Segmentation Using Text and Image Prompts
- Extract Free Dense Labels from CLIP
- Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP [[code](https://github.com/facebookresearch/ov-seg)]
### Audio
- AudioCLIP: Extending CLIP to Image, Text and Audio
- AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization
- Wav2CLIP: Learning Robust Audio Representations from CLIP [[code](https://github.com/descriptinc/lyrebird-wav2clip)]
### Language Tasks
### Object Navigation
### Localization
- Adapting CLIP For Phrase Localization Without Further Training [[code](https://github.com/pals-ttic/adapting-CLIP)]
### Others
- CLIP-Event: Connecting Text and Images with Event Structures [[code](https://github.com/limanling/clip-event)]
- How Much Can CLIP Benefit Vision-and-Language Tasks? [[code](https://github.com/clip-vil/CLIP-ViL)]
- CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
- CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [[code](https://github.com/notmahi/clip-fields)]
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [[code](https://github.com/LightDXY/FT-CLIP)]
- Task Residual for Tuning Vision-Language Models
## Acknowledgment