# awesome-clip
A curated list of research based on CLIP.
https://github.com/talkuhulk/awesome-clip
## Retrieval
- [[Paper](https://arxiv.org/pdf/2308.11485)]
- [[Paper](https://arxiv.org/pdf/2205.00823)]
- [[Paper](https://arxiv.org/pdf/2303.13440)]
- [[Paper](https://arxiv.org/pdf/2104.08860)]
## Segmentation
- **Side Adapter Network for Open-Vocabulary Semantic Segmentation** <br> Applies a VLM to semantic segmentation. The authors propose an end-to-end framework called the Side Adapter Network (SAN). SAN consists mainly of MLP and Transformer layers and has a two-branch structure that predicts mask proposals and attention biases, respectively. It relies on a pretrained CLIP whose parameters are frozen and which serves only as the classifier. Shallow CLIP features are fused into SAN, while the deeper features are combined with the attention biases for mask recognition. One point worth noting: the original CLIP model can only do image-level recognition through its [CLS] token, and the goal here is accurate per-mask recognition without changing CLIP's parameters. Borrowing from MaskCLIP, the authors introduce a set of shadow [CLS] tokens, named [SLS] tokens; guided by the attention biases, the [SLS] token features gradually evolve to fit mask prediction, and each mask's class is predicted by comparing the similarity between its [SLS] token and the CLIP text features (see the code sketch at the end of this section). <br> <img src="./images/SAN.png" width="640px"/> <br> [[Github](https://github.com/MendelXu/SAN)] [[Paper](https://arxiv.org/pdf/2302.12242)]
- [[Paper](https://arxiv.org/pdf/2212.09506)]
- [[Paper](https://arxiv.org/pdf/2308.02487)]
- [[Paper](https://arxiv.org/pdf/2310.00240)]
- [[Paper](https://arxiv.org/pdf/2212.03588)]
- [[Paper](https://arxiv.org/pdf/2310.01403)]
- **Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation** <br> Proposes CLIP-RC (CLIP with Regional Clues) to address the domain shift (image -> region) that CLIP suffers from in zero-shot semantic segmentation (the paper itself is somewhat hard to follow). CLIP-RC has three main innovations: 1. a Region-Level Bridge (RLB) that connects image-level and pixel-level feature representations (roughly, the image features, deep prompt tokens, and [CLS] token are concatenated and fed into the network so the information can interact); 2. a Region Alignment Module (RAM), which modifies the image encoder to re-align text and image and, building on the RLB, makes full use of the available information; 3. a Recovery Decoder with Recovery Loss (RDL), used during training to mitigate overfitting in zero-shot learning, which can be seen as a regularization term. <br> <img src="./images/CLIP-RC.png" width="640px"/> <br> [[Github](https://github.com/Jittor/JSeg)] [[Paper](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhang_Exploring_Regional_Clues_in_CLIP_for_Zero-Shot_Semantic_Segmentation_CVPR_2024_paper.pdf)]
- [[Paper](https://arxiv.org/pdf/2312.01597)]
- [[Paper](https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Learn_to_Rectify_the_Bias_of_CLIP_for_Unsupervised_Semantic_CVPR_2024_paper.pdf)] [[Paper](https://arxiv.org/pdf/2408.06747)]
- [[Paper](https://arxiv.org/pdf/2312.12359)]
- [[Paper](https://arxiv.org/pdf/2112.01518)]
- [[Paper](https://arxiv.org/pdf/2407.12442)]
- [[Paper](https://arxiv.org/pdf/2107.12518)]
- [[Paper](https://arxiv.org/pdf/2112.01071)]
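The following is a minimal, illustrative sketch (not the official SAN code) of the classification step described in the SAN entry above: each predicted mask embedding (e.g. an evolved [SLS] token) is compared against frozen CLIP text embeddings of the class names, and the per-mask class distribution comes from the cosine similarities. All tensor shapes and names are assumptions.

```python
# Minimal sketch (not the official SAN code): classify N predicted masks by
# comparing mask embeddings (e.g. evolved [SLS] tokens) with frozen CLIP text
# embeddings of the class names. Shapes and names are illustrative only.
import torch
import torch.nn.functional as F

def classify_masks(mask_embeddings: torch.Tensor,   # [N, D] one embedding per mask proposal
                   text_embeddings: torch.Tensor,   # [C, D] frozen CLIP text features, one per class
                   logit_scale: float = 100.0) -> torch.Tensor:
    """Return [N, C] class probabilities for each mask proposal."""
    mask_embeddings = F.normalize(mask_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = logit_scale * mask_embeddings @ text_embeddings.t()  # scaled cosine similarity
    return logits.softmax(dim=-1)

# Example with random tensors: 100 mask proposals, 512-d CLIP space, 20 classes.
probs = classify_masks(torch.randn(100, 512), torch.randn(20, 512))
print(probs.shape)  # torch.Size([100, 20])
```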
## Generation
- [[Paper](https://arxiv.org/pdf/2112.01573)]
- [[Paper](https://arxiv.org/pdf/2106.14843)]
- [[Paper](https://arxiv.org/pdf/2203.00386)]
- [[Paper](https://arxiv.org/pdf/2112.00374)]
- [[Paper](https://arxiv.org/pdf/2204.08583)]
- [[Paper](https://arxiv.org/pdf/2210.02347)]
- [[Paper](https://arxiv.org/pdf/2301.12959)]
- [[Paper](https://arxiv.org/pdf/2306.16805)]
- [[Paper](https://arxiv.org/pdf/2110.12427)]
- **CLIPasso: Semantically-Aware Object Sketching** <br> Renders a sketch from a set of stroke control points, computes a CLIP-based loss, and updates that set of points until convergence (a minimal sketch of this loop follows the list). <br> <img src="./images/CLIPasso.png" width="640px"/> <br> [[Github](https://github.com/yael-vinker/CLIPasso)] [[Paper](https://arxiv.org/pdf/2202.05822)]
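A rough sketch of the CLIPasso-style optimization loop described above. This is not the official implementation: `render_strokes` stands in for a differentiable rasterizer such as diffvg and is purely hypothetical, the CLIP image encoder and target features are assumed given, and the paper's additional geometric losses are omitted.

```python
# Minimal sketch of a CLIPasso-style optimization loop (not the official code).
# `render_strokes` is a hypothetical differentiable renderer (the paper uses diffvg).
import torch

def optimize_sketch(control_points: torch.Tensor,   # [num_strokes, points_per_stroke, 2], requires_grad=True
                    target_features: torch.Tensor,  # CLIP features of the target image, [1, D]
                    clip_image_encoder,             # frozen CLIP image encoder
                    render_strokes,                 # hypothetical differentiable renderer: points -> image tensor
                    steps: int = 2000,
                    lr: float = 1.0) -> torch.Tensor:
    optimizer = torch.optim.Adam([control_points], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        sketch = render_strokes(control_points)           # [1, 3, H, W]
        sketch_features = clip_image_encoder(sketch)       # [1, D]
        # maximize cosine similarity between sketch and target in CLIP space
        loss = 1.0 - torch.cosine_similarity(sketch_features, target_features).mean()
        loss.backward()
        optimizer.step()
    return control_points
```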
## Improvement & Innovation
- [[Paper](https://arxiv.org/pdf/2211.01335)]
- [[Paper](https://arxiv.org/pdf/2304.05653)] [[知乎](https://www.zhihu.com/question/595372017/answer/2982207851)]
- [[Paper](https://arxiv.org/pdf/2211.06679)]
- [[Paper](https://arxiv.org/pdf/2403.15378)]
- [[Paper](https://arxiv.org/pdf/2312.03818)]
- [[Paper](https://arxiv.org/pdf/2402.04252)]
- [[Paper](https://arxiv.org/pdf/2311.17049)]
- [[Paper](https://arxiv.org/pdf/2212.08045)]
- [[Paper](https://arxiv.org/pdf/2303.15389)]
- **RWKV-CLIP** <br> <img src="./images/RWKV-CLIP1.png" width="640px"/> <br> <img src="./images/RWKV-CLIP2.png" width="640px"/> <br> [[Github](https://github.com/deepglint/RWKV-CLIP)] [[Paper](https://arxiv.org/pdf/2406.06973)]
- [[Paper](https://arxiv.org/pdf/2404.19394)]
- [[Paper](https://arxiv.org/pdf/2310.05916)]
- **LoTLIP: Improving Language-Image Pre-training for Long Text Understanding** <br> Aims to improve how well language-image pre-training models understand long text. The authors first re-caption 100 million images with long text descriptions and then train the model. They find that this straightforward approach does improve long-text understanding, but performance on short text degrades. To address the short-text degradation, they add M learnable corner tokens after the [CLS] token in the transformer blocks and use an attention-mask mechanism to restrict the interaction between [CLS] and the corner tokens, ensuring diversity in the collected features (a minimal sketch of such a mask follows the list). Experiments show that on long-text image retrieval LoTLIP beats Long-CLIP by 11.1%. <br> <img src="./images/LoTLIP.png" width="640px"/> <br> [[Github](https://github.com/wuw2019/LoTLIP)] [[Paper](https://arxiv.org/pdf/2410.05249)]
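Below is a minimal sketch of the corner-token attention-mask idea described for LoTLIP. It is an assumption-laden illustration, not the official code: the token layout (position 0 = [CLS], positions 1..M = corner tokens) and the exact masking rule are guesses for demonstration only.

```python
# Minimal sketch (assumption, not the official LoTLIP code): an additive attention
# mask that blocks interaction between the [CLS] token and M learnable corner
# tokens, while all of them can still attend to the remaining sequence tokens.
# Assumed token layout: position 0 = [CLS], positions 1..M = corner tokens, rest = sequence.
import torch

def corner_token_attention_mask(seq_len: int, num_corner: int) -> torch.Tensor:
    """Return a [seq_len, seq_len] additive mask (0 = allowed, -inf = blocked)."""
    mask = torch.zeros(seq_len, seq_len)
    corner = slice(1, 1 + num_corner)
    mask[0, corner] = float("-inf")   # [CLS] does not attend to corner tokens
    mask[corner, 0] = float("-inf")   # corner tokens do not attend to [CLS]
    return mask

# Usage with PyTorch attention (the float mask is added to the attention logits):
num_corner, seq_len = 4, 77 + 1 + 4
attn_mask = corner_token_attention_mask(seq_len, num_corner)
mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, seq_len, 512)
out, _ = mha(x, x, x, attn_mask=attn_mask)
```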
## Zero-Shot & Few-Shot & Classification
- [[Paper](https://arxiv.org/pdf/2203.05557)]
- [[Paper](https://arxiv.org/pdf/2110.04544)]
- [[Paper](https://arxiv.org/pdf/2207.09519)]
- [[Paper](https://arxiv.org/pdf/2209.14169)]
- [[Paper](https://arxiv.org/pdf/2212.06138)]
- [[Paper](https://arxiv.org/pdf/2308.12213v2)]
- [[Paper](https://arxiv.org/pdf/2404.089588)]
- [[Paper](https://arxiv.org/pdf/2109.01134)]
- [[Paper](https://openaccess.thecvf.com/content/CVPR2021W/CVFAD/papers/Conde_CLIP-Art_Contrastive_Pre-Training_for_Fine-Grained_Art_Classification_CVPRW_2021_paper.pdf)]
- [[Paper](https://arxiv.org/pdf/2210.03117)]
- [[Paper](https://arxiv.org/pdf/2303.02982)]
- [[Paper](https://hal.science/hal-04534868v1/document)] [[知乎](https://zhuanlan.zhihu.com/p/681708426)]
- [[Paper](https://arxiv.org/pdf/2312.12828)]
- [[Paper](https://openaccess.thecvf.com/content/CVPR2024/papers/Shao_DeIL_Direct-and-Inverse_CLIP_for_Open-World_Few-Shot_Learning_CVPR_2024_paper.pdf)]
- [[Paper](https://arxiv.org/pdf/2301.06267)]
## Loss
- [[Paper](https://arxiv.org/pdf/2303.15343)]
- [[Paper](https://arxiv.org/pdf/2205.14459)]
## Train
- [[Paper](https://arxiv.org/pdf/2306.15658)]
- [[Paper](https://arxiv.org/pdf/2305.20088)]
- [[Paper](https://arxiv.org/pdf/2311.16445)]
- [[Paper](https://arxiv.org/pdf/2310.16226)]
- [[Paper](https://arxiv.org/pdf/2110.05208)]
- [[Paper](https://arxiv.org/pdf/2212.00794)]
- [[Paper](https://arxiv.org/pdf/2305.06152)]
- [[Paper](https://openreview.net/pdf?id=KRLUvxh8uaX)]
- [[Paper](https://arxiv.org/pdf/2309.16671)]
- [[Paper](https://arxiv.org/pdf/2409.09721)]
## Data
- [[Paper](https://arxiv.org/pdf/2310.07699)]
- [[Paper](https://arxiv.org/pdf/2307.12732)]
## Captioning
- [[Paper](https://arxiv.org/pdf/2111.09734)]
- [[Paper](https://arxiv.org/pdf/2211.00575)]
- [[Paper](https://openreview.net/pdf?id=Lt8bMlhiwx2)]
- [[Paper](https://arxiv.org/pdf/2205.13115)]
## Detection
- [[Paper](https://arxiv.org/pdf/2104.13921)]
- [[Paper](https://arxiv.org/pdf/2112.09106)]
- [[Paper](https://arxiv.org/pdf/2204.05991)]
- [[Paper](https://arxiv.org/pdf/2305.07304)]
- [[Paper](https://arxiv.org/pdf/2302.14338)]
- [[Paper](https://arxiv.org/pdf/2303.13076)]
- **AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection** <br> Uses a VLM for zero-shot anomaly detection (ZSAD); the difficulty is that VLMs focus mostly on foreground objects rather than on whether the image is normal or anomalous. The paper proposes AnomalyCLIP to improve CLIP on ZSAD. The key idea is that learning object-agnostic text prompts can capture the normal and abnormal cues in an image regardless of its foreground object. AnomalyCLIP designs learnable object-agnostic prompt templates (two generic templates, [object] and [damaged][object]) to model the normal and abnormal cases, and uses glocal context optimization to bring both global and fine-grained anomaly semantics into the object-agnostic prompt learning. Finally, textual prompt tuning and DPAM are used to learn prompts in CLIP's text and local visual spaces (a minimal scoring sketch follows this list). <br> <img src="./images/CORA.png" width="640px"/> <br> [[Github](https://github.com/zqhang/AnomalyCLIP)] [[Paper](https://arxiv.org/pdf/2310.18961)]
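A minimal, illustrative scoring sketch in the spirit of AnomalyCLIP, not the official code: the two object-agnostic prompts are written here as plain text, whereas the paper learns the prompt embeddings, and the glocal/local components are omitted. It assumes the openai/CLIP package and an illustrative image path.

```python
# Minimal sketch (not the official AnomalyCLIP code): score an image with two
# object-agnostic text prompts (normal vs. damaged). Requires the openai/CLIP
# package: pip install git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a photo of a flawless object", "a photo of a damaged object"]  # normal / abnormal
text_tokens = clip.tokenize(prompts).to(device)
image = preprocess(Image.open("sample.png")).unsqueeze(0).to(device)  # path is illustrative

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

anomaly_score = probs[0, 1].item()  # probability mass on the "damaged" prompt
print(f"anomaly score: {anomaly_score:.3f}")
```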
## Other
- **CLIP adversarial examples** <br> Adversarial examples: deliberately adding small perturbations, imperceptible to humans, to an input so that the model produces a wrong output with high confidence (a minimal FGSM-style sketch follows this list). <br> <img src="./images/adversarial.png" width="640px"/> <br> [[Github](https://github.com/stanislavfort/OpenAI_CLIP_adversarial_examples)] [[Blog](https://stanislavfort.com/blog/OpenAI_CLIP_adversarial_examples/)]
- [[Paper](https://arxiv.org/pdf/2207.12396)]
- [[Paper](https://www.arxiv.org/pdf/2408.10012)]
- [[Paper](https://arxiv.org/pdf/2408.09647)]
- [[Paper](https://arxiv.org/pdf/2303.11313)]
- [[Paper](https://arxiv.org/pdf/2309.16020v2)]
- [[Paper](https://arxiv.org/pdf/2203.05796)]
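A minimal FGSM-style sketch of a CLIP adversarial example, illustrating the idea from the entry above rather than reproducing the linked blog post exactly. The label set, image path, and epsilon are illustrative assumptions, and the perturbation is applied to the already-preprocessed tensor for simplicity.

```python
# Minimal FGSM-style sketch against CLIP zero-shot classification (illustrative only).
import torch
import clip
from PIL import Image

device = "cpu"  # CPU keeps the weights in fp32, which is convenient for backprop
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

labels = ["a photo of a cat", "a photo of a dog"]
text = clip.tokenize(labels).to(device)
image = preprocess(Image.open("cat.png")).unsqueeze(0).to(device)  # path is illustrative
image.requires_grad_(True)

logits_per_image, _ = model(image, text)
loss = torch.nn.functional.cross_entropy(logits_per_image, torch.tensor([0]))  # true class: "cat"
loss.backward()

# One FGSM step: a small perturbation in the direction that increases the loss.
epsilon = 0.03
adv_image = (image + epsilon * image.grad.sign()).detach()

with torch.no_grad():
    adv_logits, _ = model(adv_image, text)
print("clean prediction:", labels[logits_per_image.argmax().item()])
print("adversarial prediction:", labels[adv_logits.argmax().item()])
```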
## Video
- [[Paper](https://arxiv.org/pdf/2208.03550)]
- [[Paper](https://arxiv.org/pdf/2212.03640)]
- [[Paper](https://arxiv.org/pdf/2408.06158)]