Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/xmu-xiaoma666/fightingcv-paper-reading
⭐⭐⭐FightingCV Paper Reading, which helps you understand the most advanced research work in an easier way 🍀 🍀 🍀
- Host: GitHub
- URL: https://github.com/xmu-xiaoma666/fightingcv-paper-reading
- Owner: xmu-xiaoma666
- Created: 2021-08-03T08:32:50.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-04-20T03:46:07.000Z (over 1 year ago)
- Last Synced: 2024-12-13T16:53:02.589Z (9 days ago)
- Language: Shell
- Homepage:
- Size: 2.61 MB
- Stars: 800
- Watchers: 15
- Forks: 90
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# FightingCV-Paper-Reading
Hello everyone, I'm Xiaoma 🚀🚀🚀
As a graduate student, I know that reading papers is extremely **time-consuming and mentally taxing**. To help you grasp the gist of a paper in about **5 minutes**, I post my notes on the papers I have read here. ⭐⭐⭐ The goal of this project is 🚀**to make no paper hard to read**🚀. Topics include, but are not limited to, detection, classification, segmentation, backbones, multimodality, and more; sources include, but are not limited to, the latest arXiv papers and top venues and journals such as ICCV 2021, CVPR 2021, MM 2021, AAAI 2022, ECCV 2022, and TPAMI 2022. ⭐⭐⭐
(We have also recently been updating the [【code re-implementation project for Attention, MLP, Conv, and Backbone modules】](https://github.com/xmu-xiaoma666/External-Attention-pytorch); everyone is welcome to study it and exchange ideas.)
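For readers who want a feel for the kind of module collected in that repository, here is a minimal, illustrative sketch of the external-attention idea; the class name and default sizes below are ours for illustration, not taken from the repo:

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Minimal sketch: tokens attend to a small learnable external memory
    (two linear layers) instead of to each other, so the cost is linear
    in the number of tokens."""
    def __init__(self, d_model: int = 512, memory_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(d_model, memory_size, bias=False)  # token -> memory attention logits
        self.mv = nn.Linear(memory_size, d_model, bias=False)  # memory -> output features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, d_model)
        attn = torch.softmax(self.mk(x), dim=1)                 # normalize over the token dimension
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)   # double normalization over the memory dimension
        return self.mv(attn)                                    # (batch, n_tokens, d_model)

if __name__ == "__main__":
    out = ExternalAttention()(torch.randn(2, 196, 512))
    print(out.shape)  # torch.Size([2, 196, 512])
```

Because the memory size is fixed, the cost grows linearly with the number of tokens, which is the main appeal of this family of modules.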
## Technical Exchange
Feel free to follow the WeChat official account: **FightingCV**
| FightingCV official account | Assistant's WeChat (please note **company/school + research area + ID**) |
|:-------------------------:|:-------------------------:|
- The official account shares practical write-ups on **papers, algorithms, and code** **every day**.
- **The discussion group shares the latest papers and analyses every day**; everyone is welcome to **learn and exchange ideas** together.
- We strongly recommend following the [**Zhihu**](https://www.zhihu.com/people/jason-14-58-38/posts) account and the [**FightingCV official account**](https://mp.weixin.qq.com/s/m9RiivbbDPdjABsTd6q8FA) for quick access to the latest high-quality resources.
## Summary Articles
- [A complete look at recent MAE progress: how has Kaiming's MAE developed in the half year since it was proposed?](https://mp.weixin.qq.com/s/SoZyuX3NmB_8Tyi9F1Nrfw)
- [Recent progress on multimodal pre-training models, seen through several 2021 top-venue papers](https://zhuanlan.zhihu.com/p/425859974)
- [The development of dynamic neural networks, traced through top-venue papers from 2019 to 2021](https://mp.weixin.qq.com/s?__biz=MzIzNzU4OTAxMQ==&mid=2247484386&idx=1&sn=d3275fe4f51d7d559c855adcbc2b42df&chksm=e8c7049edfb08d88ec7805eebbb5236d165ba797982bbe56fe0fddca660e39b8f7faf06372ff&token=876992619&lang=zh_CN#rd)
- [A summary of re-parameterization mechanisms in deep learning, with code implementations (a small fusion sketch follows this list)](https://zhuanlan.zhihu.com/p/383660483)
- [A summary of attention in deep learning (part 1)](https://zhuanlan.zhihu.com/p/379657870)
- [A summary of attention in deep learning (part 2)](https://zhuanlan.zhihu.com/p/386333201)
- [Thoughts on local and global modeling in NLP and CV](https://zhuanlan.zhihu.com/p/387766129)
- [How can the image-text pre-trained model CLIP be used for video tasks?](https://mp.weixin.qq.com/s/4Wg8tr7hhfRrzG8d4o_iYw)
- [Focus on video-text retrieval: an overview of recent progress on the task](https://mp.weixin.qq.com/s/ZD7JGtBzqo7Vpo-YkBmV2A)
- [A survey of 280+ papers: Gao Huang's team at Tsinghua (CVPR best-paper winners) presents the first comprehensive review of dynamic networks](https://mp.weixin.qq.com/s/GROX2pFGxQYU2BezJQu7uw)
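As a small companion to the re-parameterization summary above (this sketch is generic and not taken from that article), the basic step behind structural re-parameterization is folding a BatchNorm into the convolution that precedes it, so a multi-branch training-time block can collapse into a single conv at inference time:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm that follows a conv into the conv's own weight and bias (inference time only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)          # gamma / sqrt(var + eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

if __name__ == "__main__":
    conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
    bn.eval()                                   # folding uses the BN running statistics
    x = torch.randn(1, 3, 32, 32)
    fused = fuse_conv_bn(conv, bn)
    print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))  # True
```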
## CV Knowledge Points: Summary and Analysis
- [【CV Knowledge Points】Loss functions](https://mp.weixin.qq.com/s/6M0xp5voIevMpqA7RJ7SfQ)
- [【CV Knowledge Points】Activation functions](https://mp.weixin.qq.com/s/Nohv9kJskyJQjCU7aL6QEg)
- [【CV Knowledge Points】Optimizers and learning rates (a tiny generic example follows this list)](https://mp.weixin.qq.com/s/fNJ7rnQOV4rcr30eTVXR1w)
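As a tiny, generic illustration of the optimizer and learning-rate topics listed above (not taken from the linked post), a common AdamW-plus-cosine-schedule setup in PyTorch looks roughly like this:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # anneal over 100 epochs

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))   # dummy batch
for epoch in range(100):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # one scheduler step per epoch
```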
## MM 2022
- [MM 2022 | Data augmentation with StyleGAN really works remarkably well](https://mp.weixin.qq.com/s/gla2Ej0Fd_r5KIDA4frj3Q)
[【Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval】](https://arxiv.org/abs/2207.14428)
- [MM 2022 | A multimodal data augmentation method in feature space](https://mp.weixin.qq.com/s/gC6M_KZfr-2UWHI1yfh56g)
[【A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval】](https://arxiv.org/abs/2208.02080)
[【Code】](https://github.com/aranciokov/FSMMDA_VideoRetrieval)
## NeurIPS 2022
- [NeurIPS 2022 | SegNeXt: rethinking convolutional attention design](https://mp.weixin.qq.com/s/5VJvNRY1TG79x2D3y3itVw)
[【SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation】](https://github.com/Visual-Attention-Network/SegNeXt/blob/main/resources/paper.pdf)
[【Code】](https://github.com/Visual-Attention-Network/SegNeXt)
## ICLR 2022
- [ICLR 2022 | Reintroducing anchor boxes into DETR: interpretable queries and faster convergence](https://mp.weixin.qq.com/s/Q__tIi7ZTZCVNy18JC8DQQ)
[【DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR】](https://readpaper.com/paper/4588454363555438593/abstract)
## TPAMI 2022
- [TPAMI 2022 | Dual ViT! JD (Tao Mei's team) proposes a dual-path ViT that greatly reduces computational cost](https://mp.weixin.qq.com/s/Ap05SyIg6W0aLX50PeAt8Q)
[【Dual Vision Transformer】](https://arxiv.org/abs/2207.04976)
## CBMI 2022
- [CBMI 2022 | Distilling fine-grained alignment scores for efficient image-text matching and retrieval](https://mp.weixin.qq.com/s/PuXW1LrdnAbN0n0kDeymBw)
[【ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval】](https://arxiv.org/abs/2207.14757)
## ECCV 2022
- [ECCV 2022 Oral | Unifying task paradigms: Microsoft's UniTAB unifies multimodal tasks with a Seq2Seq formulation](https://mp.weixin.qq.com/s/8eZUbDc3f02_C0AeNEuNdA)
[【UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling】](https://arxiv.org/abs/2111.12085)
- [ECCV 2022 Oral | MaskCLIP](https://mp.weixin.qq.com/s/zclNaZOR5JTQAutSA1sPAQ)
[【Extract Free Dense Labels from CLIP】](https://arxiv.org/abs/2112.01071)
- [ECCV 2022 | HFUT, SenseTime & ANU propose a new task and dataset for segmenting sounding objects in video](https://mp.weixin.qq.com/s/2UMHCdLVXFjx1rwoTm5D8A)
[【Audio−Visual Segmentation】](https://arxiv.org/pdf/2207.05042.pdf)
- [ECCV 2022 | RUC proposes a lightweight attention-based feature fusion mechanism, effective on multiple public datasets; code released](https://mp.weixin.qq.com/s/DYr5ErhRMpB1ttyQHp5Nxw)
[【Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval】](https://arxiv.org/abs/2112.01832)
[【Code】](https://github.com/ruc-aimc-lab/laff)
- [ECCV 2022 | FashionViL, a multimodal pre-trained model for the fashion domain, SOTA on five downstream tasks; code released](https://mp.weixin.qq.com/s/mX78IvjdIlBRREcZjlXVQg)
[【FashionViL: Fashion-Focused Vision-and-Language Representation Learning】](https://arxiv.org/abs/2207.08150)
[【Code】](https://github.com/brandonhanx/mmf)
- [ECCV 2022 | Outperforming Swin with only 11% of the parameters: Microsoft proposes TinyViT, a fast pre-training distillation method](https://mp.weixin.qq.com/s/ZOqnkk_Fwx5nYnoDFpXR0Q)
[【TinyViT: Fast Pretraining Distillation for Small Vision Transformers】](https://arxiv.org/abs/2207.10666)
- [ECCV 2022 | RU & Google propose zero-shot object detection with CLIP](https://mp.weixin.qq.com/s/gMeOhnWT5CtHnRU9DaM6xg)
[【Exploiting Unlabeled Data with Vision and Language Models for Object Detection】](https://arxiv.org/abs/2207.08954)
## CVPR 2022
- [CVPR 2022 | 10,000x faster than VinVL! RUC proposes COTS, a collaborative two-stream vision-language pre-training model that is both fast and accurate](https://mp.weixin.qq.com/s/haJb4c3_t3DqAFoVEleqxQ)
[【COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval】](https://arxiv.org/abs/2204.07441)
- [CVPR 2022 | USTC & Huawei propose contextual similarity distillation for asymmetric image retrieval](https://mp.weixin.qq.com/s/HqFTOKD5NTS0sbxkP8-8Hg)
[【Contextual Similarity Distillation for Asymmetric Image Retrieval】](https://openaccess.thecvf.com/content/CVPR2022/papers/Wu_Contextual_Similarity_Distillation_for_Asymmetric_Image_Retrieval_CVPR_2022_paper.pdf)
- [CVPR 2022 Oral | Shunted self-attention via multi-scale token aggregation; code released](https://mp.weixin.qq.com/s/9alGDizMO0KVinfM3IOruQ)
[【Shunted Self-Attention via Multi-Scale Token Aggregation】](https://arxiv.org/abs/2111.15193)
## AAAI 2022
- [LGD: a plug-and-play booster! Megvii (Jian Sun and Xiangyu Zhang's team) proposes label-guided self-distillation for object detection](https://mp.weixin.qq.com/s/dPhH6GvOwfx4qErcLsSQpA)
[【LGD: Label-guided Self-distillation for Object Detection】](https://arxiv.org/abs/2109.11496)
## ArXiv 2022
- [An all-rounder! Microsoft proposes BEiT-3, a 1.9-billion-parameter general-purpose model that tops the leaderboards of multiple CV and multimodal tasks](https://mp.weixin.qq.com/s/FyLGIZHYjRm09Tkn-mm_cA)
[【Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks】](https://arxiv.org/abs/2208.10442)
- [BEiT v2, surpassing all MIM models: Microsoft's masked image modeling with vector-quantized visual tokenizers](https://mp.weixin.qq.com/s/eLDp_KCaLv9TM4-yI0WoHw)
[【BEIT V2: Masked Image Modeling with Vector-Quantized Visual Tokenizers】](https://arxiv.org/pdf/2208.06366.pdf)
- [Kaiming He's team explores plain (non-hierarchical) ViT backbones for object detection](https://mp.weixin.qq.com/s/RZVb1mxgIxQi-zz4M4ykKg)
[【Exploring Plain Vision Transformer Backbones for Object Detection】](https://arxiv.org/abs/2203.16527)
- [Annotation too expensive? This method trains text-based ReID models with limited data](https://mp.weixin.qq.com/s/m9RiivbbDPdjABsTd6q8FA)
[【Text-Based Person Search with Limited Data】](https://arxiv.org/abs/2110.10807)
- [NUIST proposes TIPCB, a simple but effective part-based convolutional baseline for text-based person search](https://mp.weixin.qq.com/s/CKXcLFnmcp_Guv83SOYOYQ)
[【TIPCB: A Simple but Effective Part-based Convolutional Baseline for Text-based Person Search】](https://www.sciencedirect.com/science/article/abs/pii/S0925231222004726)
- [Balancing accuracy and diversity: a variational Transformer for image captioning](https://mp.weixin.qq.com/s/yCBn87Ip2j9hqCctziiBhQ)
[【Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning】](https://arxiv.org/abs/2205.14458)
## ICLR 2021
- [Dynamic convolution inefficient? UCSD & Microsoft solve it with matrix decomposition, with even better performance (ICLR 2021)](https://mp.weixin.qq.com/s/LPJXi1VFLKvm3jQ3_bojoA)
[【Revisiting Dynamic Convolution via Matrix Decomposition】](https://arxiv.org/abs/2103.08756)
## NeurIPS 2021
### Transformer
- [NeurIPS 2021 | HRFormer: a sequel to HRNet! UCAS, PKU & MSRA propose a high-resolution Transformer; code released](https://zhuanlan.zhihu.com/p/429936715)
[【HRFormer: High-Resolution Transformer for Dense Prediction】](https://arxiv.org/abs/2110.09408)
- [NeurIPS 2021 | ViT can now do object detection: HUST proposes YOLOS](https://zhuanlan.zhihu.com/p/4262947628)
[【You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection】](https://www.arxiv-vanity.com/papers/2106.00666/)
- [NeurIPS 2021 | A ViT without residual connections gets only 0.15% accuracy! PKU & Huawei propose augmented shortcuts for Vision Transformers, with significant gains](https://zhuanlan.zhihu.com/p/424214038)
[【Augmented Shortcuts for Vision Transformers】](https://arxiv.org/abs/2106.15941)
- [NeurIPS 2021 | Transformers hard to deploy? PKU & Huawei Noah's Ark propose post-training quantization for Vision Transformers](https://zhuanlan.zhihu.com/p/423936004)
[【Post-Training Quantization for Vision Transformer】](https://arxiv.org/abs/2106.14156)
- [A sequel to Multi-Scale DenseNet? Dynamic ViT](https://zhuanlan.zhihu.com/p/386929227)
[【Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length】](https://arxiv.org/abs/2105.15075)
- [Microsoft's Focal Self-Attention: a Transformer with both local and global interactions](https://zhuanlan.zhihu.com/p/387693270)
[【Focal Self-attention for Local-Global Interactions in Vision Transformers】](https://arxiv.org/abs/2107.00641)
- [Significantly improving Transformers on small datasets: the University of Trento & Tencent propose a new loss function (NeurIPS 2021)](https://mp.weixin.qq.com/s/PYx5IH3rYiEztkmZo_EZeA)
[【Efficient Training of Visual Transformers with Small-Size Datasets】](https://arxiv.org/abs/2106.03746)
- [Over 90% ImageNet accuracy! Google Brain open-sources V-MoE, training the largest vision model to date with sparse conditional computation (NeurIPS 2021)](https://mp.weixin.qq.com/s/vDFLKOlqaF06PDNdLTHXQA)
[【Scaling Vision with Sparse Mixture of Experts】](https://arxiv.org/abs/2106.05974)
***
### Multi-Modal
- [NeurIPS 2021 | MBT: how should multimodal data be fused? Google proposes attention bottlenecks: simple, effective, and compute-efficient](https://zhuanlan.zhihu.com/p/427779731)
[【Attention Bottlenecks for Multimodal Fusion】](https://arxiv.org/abs/2107.00135)
- [NeurIPS 2021 | Come climb the leaderboard: Microsoft proposes VALUE, a new video multimodal benchmark covering retrieval, captioning, QA, and more](https://zhuanlan.zhihu.com/p/433827807)
[【VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation】](https://arxiv.org/abs/2106.04632)
- [NeurIPS 2021 | ALBEF: align before fuse. Salesforce Research uses momentum distillation for multimodal representation learning; SOTA on multiple downstream tasks](https://zhuanlan.zhihu.com/p/438014941)
[【Align before Fuse: Vision and Language Representation Learning with Momentum Distillation】](https://arxiv.org/abs/2107.07651)
***
### Dynamic Networks
- [NeurIPS 2021 | What image resolution is best for classification? ZJU, Huawei & UCAS propose the Dynamic Resolution Network, cutting computation while improving performance](https://zhuanlan.zhihu.com/p/428436758)
[【Dynamic Resolution Network】](https://arxiv.org/abs/2106.02898)
### Others
- [MoCo not suited to object detection? MSRA proposes SoCo, detection pre-training via object-level contrastive learning; SOTA (NeurIPS 2021)](https://mp.weixin.qq.com/s/tHLElQSe7YWFb4lgJsxpEA)
[【Aligning Pretraining for Detection via Object-Level Contrastive Learning】](https://arxiv.org/abs/2106.02637)
***
## ACL 2021
- [Drop the object detector for truly end-to-end multimodal pre-training! Alibaba proposes E2E-VLP (ACL 2021)](https://mp.weixin.qq.com/s/aKYRYrP79ArxJ_wJdOWW1A)
[【E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning】](https://arxiv.org/abs/2106.01804)
## ICCV 2021
### Multi-Modal
- [ICCV 2021 | You think this is a colorization model? It is actually a retrieval model!](https://mp.weixin.qq.com/s/ugGuKSnOH_i67Eatu0QJ7Q)
[【LapsCore: Language-guided Person Search via Color Reasoning】](https://ieeexplore.ieee.org/document/9711140/)
- [ICCV 2021 Oral | MDETR: Turing Award winner Yann LeCun's team & Facebook propose an object detector for end-to-end multimodal understanding](https://zhuanlan.zhihu.com/p/394239659)
[【MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding】](https://arxiv.org/abs/2104.12763)
- [ICCV 2021 | NTU boosts referring segmentation with diverse query generation (code released)](https://zhuanlan.zhihu.com/p/404955179)
[【Vision-Language Transformer and Query Generation for Referring Segmentation】](https://arxiv.org/abs/2108.05565)
- [ICCV 2021 | Efficient video grounding: PKU, Adobe & QMUL join forces on weakly supervised CRM; SOTA](https://zhuanlan.zhihu.com/p/406704588)
[【Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation】](https://arxiv.org/abs/2107.11443)
- [ICCV 2021 | TACo: Microsoft & CMU propose token-aware cascade contrastive learning, far ahead of other SOTA methods on video-text alignment](https://zhuanlan.zhihu.com/p/406827017)
[【TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment】](https://arxiv.org/abs/2108.09980)
- [ICCV 2021 Oral | New task, new dataset: Cornell proposes PVG, a task similar to, but distinct from, visual grounding](https://zhuanlan.zhihu.com/p/407102211)
[【Who’s Waldo? Linking People Across Text and Images】](https://arxiv.org/abs/2108.07253)
- [ICCV 2021 | New task: NTU & CUHK propose fine-grained image editing via dialog](https://zhuanlan.zhihu.com/p/418089405)
[【Talk-to-Edit: Fine-Grained Facial Editing via Dialog】](https://arxiv.org/abs/2109.04425)
- [ICCV 2021 | Dense video captioning the DETR way: HKU & SUSTech propose end-to-end PDVC, simplifying the training pipeline](https://zhuanlan.zhihu.com/p/418100751)
[【End-to-End Dense Video Captioning with Parallel Decoding】](https://arxiv.org/abs/2108.07781)
- [ICCV 2021 | PKU, FAIR, CASIA & Kuaishou propose HiT, a hierarchical Transformer with momentum contrast for video-text retrieval; code released](https://zhuanlan.zhihu.com/p/438013433)
[【HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval】](https://arxiv.org/abs/2103.15049)
- [ICCV 2021 | A pure-Transformer approach to video: Google proposes ViViT, SOTA on multiple video classification benchmarks; code released](https://mp.weixin.qq.com/s?__biz=MzIzNzU4OTAxMQ==&mid=2247484373&idx=1&sn=ab686693985a4aaba9a88a62a79f888b&chksm=e8c704a9dfb08dbf218880b39a1ff40cbd34b6e26b7273a4b420bfc5ae318081ce34458aa90e&token=876992619&lang=zh_CN#rd)
[【ViViT: A Video Vision Transformer】](https://arxiv.org/abs/2103.15691)
- [Transformers move toward dynamic routing: XMU & Huawei propose TRAR, SOTA on VQA and REC (ICCV 2021)](https://mp.weixin.qq.com/s/RXWUHTdM66FdnNJ2FmvtTQ)
[【TRAR: Routing the Attention Spans in Transformer for Visual Question Answering】](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhou_TRAR_Routing_the_Attention_Spans_in_Transformer_for_Visual_Question_ICCV_2021_paper.pdf)
### Contrastive Learning
- [ICCV 2021 | DetCo: contrastive learning tailored for object detection, outperforming MoCo v2](https://zhuanlan.zhihu.com/p/393202411)
[【DetCo: Unsupervised Contrastive Learning for Object Detection】](https://arxiv.org/abs/2102.04803)
### Interpretability
- [ICCV 2021 Oral | TAU & Facebook propose generic explainability for attention models](https://zhuanlan.zhihu.com/p/394794493)
[【Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers】](https://arxiv.org/abs/2103.15679)
- [ICCV 2021 | Why does a deep model classify correctly? SCOUTER convinces you with both positive and negative evidence](https://zhuanlan.zhihu.com/p/396783525)
[【SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition】](https://arxiv.org/abs/2009.06138)
### Backbone (CNN, Transformer)
- [ICCV 2021 | iRPE: still hacking Transformer architectures? Microsoft & SYSU propose a strong image positional encoding with notable gains](https://zhuanlan.zhihu.com/p/395766591)
[【Rethinking and Improving Relative Position Encoding for Vision Transformer】](https://arxiv.org/abs/2107.14222)
- [ICCV 2021 | Pooling is not exclusive to CNNs; Vision Transformers can do it too: NJU proposes the Pooling-based Vision Transformer (PiT)](https://zhuanlan.zhihu.com/p/398763751)
[【Rethinking Spatial Dimensions of Vision Transformers】](https://arxiv.org/abs/2103.16302)
- [ICCV 2021 | CNN + Transformer = better: UCAS, Huawei & Peng Cheng Lab propose Conformer, 84.1% top-1 accuracy](https://zhuanlan.zhihu.com/p/400244375)
[【Conformer: Local Features Coupling Global Representations for Visual Recognition】](https://arxiv.org/abs/2105.03889)
- [ICCV 2021 | MicroNet: smaller, faster, better, beating MobileNetV3 on all three major CV tasks](https://zhuanlan.zhihu.com/p/400661708)
[【MicroNet: Improving Image Recognition with Extremely Low FLOPs】](https://arxiv.org/abs/2108.05894)
- [ICCV 2021 | MIT-IBM AI Lab open-sources CrossViT: Transformers go multi-branch and multi-scale (with a comparison of current multi-scale ViTs)](https://zhuanlan.zhihu.com/p/418086070)
[【CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification】](https://arxiv.org/abs/2103.14899)
### Multi-Task
- [ICCV 2021 | MuST: still grinding for points on a single task? Google has moved on to multi-task training](https://zhuanlan.zhihu.com/p/406014791)
[【Multi-Task Self-Training for Learning General Representations】](https://arxiv.org/abs/2108.11353)
- [ICCV 2021 | New progress in multi-task CV: MultiTask CenterNet handles object detection, semantic segmentation, and human pose estimation with a single network](https://zhuanlan.zhihu.com/p/405652732)
[【MultiTask-CenterNet (MCN): Efficient and Diverse Multitask Learning using an Anchor Free Approach】](https://arxiv.org/abs/2108.05060)
### Data Augmentation
- [ICCV 2021 | MixMo: performance for free, a new data augmentation / model fusion method](https://zhuanlan.zhihu.com/p/418098973)
[【MicroNet: Improving Image Recognition with Extremely Low FLOPs】](https://arxiv.org/abs/2108.05894)
- [ICCV 2021 Oral | Simple and effective data augmentation: Huawei proposes a simple adaptation method for robust object detection](https://zhuanlan.zhihu.com/p/396528978)
[【SimROD: A Simple Adaptation Method for Robust Object Detection】](https://arxiv.org/abs/2107.13389)
### Others
- [ICCV 2021 Oral | No hyper-parameter tuning, significant gains: RS Loss, a new loss for detection and segmentation, is open-sourced](https://zhuanlan.zhihu.com/p/397519850)
[【Rank & Sort Loss for Object Detection and Instance Segmentation】](https://arxiv.org/abs/2107.11669)
- [ICCV 2021 | Simplicity wins: just 4 lines of code improve multi-label classification; NJU proposes Residual Attention](https://zhuanlan.zhihu.com/p/397990353)
[【Residual Attention: A Simple but Effective Method for Multi-Label Recognition】](https://arxiv.org/abs/2108.02456)
- [ICCV 2021 Oral | UNO: a unified objective for novel class discovery that simplifies training; code released](https://zhuanlan.zhihu.com/p/407365987)
[【A Unified Objective for Novel Class Discovery】](https://arxiv.org/abs/2108.08536)
- [ICCV 2021 | Stop hacking the network; low accuracy may come from poor resizing. Google proposes a learned, DL-based resizer model](https://zhuanlan.zhihu.com/p/409582813)
[【Learning to Resize Images for Computer Vision Tasks】](https://arxiv.org/abs/2103.09950)
- [ICCV 2021 | GroupFormer: SenseTime & PolyU propose a clustered spatial-temporal Transformer for group activity recognition; SOTA](https://zhuanlan.zhihu.com/p/411674711)
[【GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer】](https://arxiv.org/abs/2108.12630)
- [ICCV 2021 | How does DETR do once redundant tokens are removed? Shuicheng Yan's team at NUS gives the answer](https://zhuanlan.zhihu.com/p/415579801)
[【PnP-DETR: Towards Efficient Visual Analysis with Transformers】](https://arxiv.org/abs/2109.07036)
- [ICCV 2021 | Still brute-force training on massive data? Active learning picks the most valuable samples in your dataset](https://zhuanlan.zhihu.com/p/420756941)
[【Active Learning for Deep Object Detection via Probabilistic Modeling】](https://arxiv.org/abs/2103.16130)
- [ICCV 2021 | A more general contrastive learning paradigm than MoCo: USTC & MSRA propose MaskCo](https://zhuanlan.zhihu.com/p/4209392131)
[【Self-Supervised Visual Representations Learning by Contrastive Mask Prediction】](https://arxiv.org/abs/2108.07954)
***
## ACM MM 2021
### Backbone (CNN, Transformer)
- [ACM MM 2021 | Still splitting images into ViT's 16x16 patches? CASIA proposes a deformable patch-based method with notable gains](https://zhuanlan.zhihu.com/p/399417704)
[【DPT: Deformable Patch-based Transformer for Visual Recognition】](https://arxiv.org/abs/2107.14467)
- [ACM MM 2021 | A multimodal treasure: JD's Tao Mei team open-sources X-modaler, the first multimodal codebase that covers multiple tasks](https://zhuanlan.zhihu.com/p/403688076)
[【X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics】](https://arxiv.org/abs/2108.08217)
- [ACM MM 2021 | SOTA performance: strengthening video captioning with GNNs and GANs](https://zhuanlan.zhihu.com/p/403895573)
[【Discriminative Latent Semantic Graph for Video Captioning】](https://arxiv.org/abs/2108.03662)
### Multi-Modal
- [ACM MM 2021 | From local to global retrieval: Alibaba proposes HANet, hierarchical alignment networks for video-text retrieval; code released](https://zhuanlan.zhihu.com/p/436531598)
[【HANet: Hierarchical Alignment Networks for Video-Text Retrieval】](https://arxiv.org/abs/2107.12059)
- [CLIP can do video captioning too: Tencent & Tsinghua propose CLIP4Caption, runner-up in the ACM MM 2021 challenge](https://mp.weixin.qq.com/s?__biz=MzIzNzU4OTAxMQ==&mid=2247484376&idx=1&sn=f1312c32ee10aa4168d2119d7615256b&chksm=e8c704a4dfb08db21ae5c06cb070a2a4007961667a7ecbe74e356974b39a3856e624e1a9793c&token=876992619&lang=zh_CN#rd)
[【CLIP4Caption: CLIP for Video Caption】](https://arxiv.org/abs/2110.06615)
***
## ICML 2021
### Pre-training
- [ICML 2021 | ALIGN: brute force works wonders; Google trains a model on 1.8 billion image-text pairs](https://zhuanlan.zhihu.com/p/410499923)
[【Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision】](https://arxiv.org/abs/2102.05918)
- [Pursuing extreme speed: ViLT, a minimalist multimodal pre-training model with inference 60x faster than UNITER (ICML 2021)](https://mp.weixin.qq.com/s/gpiqOMpG1sIF1rGDGMUsCQ)
[【ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision】](https://arxiv.org/abs/2102.03334)
***
## CVPR 2021
### Multi-Modal
- [Less is More: a CVPR 2021 best student paper honorable mention](https://zhuanlan.zhihu.com/p/388824565)
[【Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling】](https://arxiv.org/abs/2102.06183)
- [CVPR 2021 | RSTNet: an image captioning model with adaptive attention](https://zhuanlan.zhihu.com/p/394793465)
[【RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words】](https://openaccess.thecvf.com/content/CVPR2021/html/Zhang_RSTNet_Captioning_With_Adaptive_Attention_on_Visual_and_Non-Visual_Words_CVPR_2021_paper.html)
- [CVPR 2021 Oral | Seeing Out of the Box: USTB, SYSU & Microsoft propose end-to-end vision-language representation pre-training](https://zhuanlan.zhihu.com/p/395982625)
[【Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning】](https://openaccess.thecvf.com/content/CVPR2021/html/Huang_Seeing_Out_of_the_Box_End-to-End_Pre-Training_for_Vision-Language_Representation_CVPR_2021_paper.html)
- [CVPR 2021 | Open-book video captioning: CASIA proposes a retrieve-copy-generate network](https://zhuanlan.zhihu.com/p/401333569)
[【Open-book Video Captioning with Retrieve-Copy-Generate Network】](https://arxiv.org/abs/2103.05284)
- [CVPR 2021 | New progress on multimodal tasks: Columbia & Facebook propose VX2TEXT, generating text from video plus other modalities](https://zhuanlan.zhihu.com/p/403340498)
[【VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs】](https://arxiv.org/abs/2101.12059)
- [CVPR 2021 | RUC turns two-stage video paragraph captioning into one stage without losing performance](https://zhuanlan.zhihu.com/p/404419987)
[【Towards Diverse Paragraph Captioning for Untrimmed Videos】](https://arxiv.org/abs/2105.14477)
- [CVPR 2021 | Extract visual features with a better object detector: Microsoft proposes VinVL, reaching stronger multimodal performance](https://zhuanlan.zhihu.com/p/422114283)
[【VinVL: Revisiting Visual Representations in Vision-Language Models】](https://arxiv.org/abs/2104.13682)
- [CVPR 2021 Oral | No post-processing needed: Kakao proposes an end-to-end human-object interaction detection model](https://zhuanlan.zhihu.com/p/426929486)
[【HOTR: End-to-End Human-Object Interaction Detection with Transformers】](https://arxiv.org/abs/2101.00529)
- [CVPR 2021 | T2VLAD: ZJU, Baidu & UTS propose global-local alignment for text-video retrieval, outperforming MMT](https://zhuanlan.zhihu.com/p/435630887)
[【T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval】](https://arxiv.org/abs/2104.10054)
### Backbone (CNN, Transformer)
- [Google's HaloNet: convolution done with self-attention, from the first author of the Transformer paper](https://zhuanlan.zhihu.com/p/388598744)
[【Scaling Local Self-Attention for Parameter Efficient Visual Backbones】](https://zhuanlan.zhihu.com/p/388598744)
- [Involution (with some thoughts on it): HKUST, ByteDance & PKU propose the "involution" operator, with clear gains on the three major CV tasks](https://zhuanlan.zhihu.com/p/395950242)
[【Involution: Inverting the Inherence of Convolution for Visual Recognition】](https://arxiv.org/pdf/2103.06255.pdf)
- [CVPR 2021 | A better backbone than CNNs and Transformers? UC Berkeley & Google Research propose BoTNet, 84.7% accuracy on ImageNet](https://zhuanlan.zhihu.com/p/418096136)
[【Bottleneck Transformers for Visual Recognition】](https://arxiv.org/abs/2101.11605)
### Detection
- [CVPR 2021 Oral | Faster convergence, higher accuracy: SUSTech & Tencent WeChat open-source UP-DETR with unsupervised pre-training](https://zhuanlan.zhihu.com/p/419660108)
[【UP-DETR: Unsupervised Pre-training for Object Detection with Transformers】](https://arxiv.org/abs/2011.09094)
- [CVPR Oral | Google & Stanford (Fei-Fei Li's group) propose TIRG, composing text and image for image retrieval](https://mp.weixin.qq.com/s/_-EJhzkogoNu8kKql7_f7Q)
[【Composing Text and Image for Image Retrieval - An Empirical Odyssey】](https://arxiv.org/abs/1812.07119)
***
## SIGIR 2021
### Multi-Modal
- [SIGIR 2021 Best Student Paper | Dynamic modality interaction modeling for image-text retrieval](https://zhuanlan.zhihu.com/p/402122260)
[【Dynamic Modality Interaction Modeling for Image-Text Retrieval】](https://dl.acm.org/doi/abs/10.1145/3404835.3462829)
- [SimVLM: no bells and whistles. CMU & Google propose a minimalist, weakly supervised VLP model, SOTA on multiple multimodal tasks](https://zhuanlan.zhihu.com/p/406354414)
[【SimVLM: Simple Visual Language Model Pretraining with Weak Supervision】](https://zhuanlan.zhihu.com/p/406354414)
***
## EMNLP 2021
### Multi-Modal
- [Are multimodal Transformers really multimodal? On cross-modal influence in multimodal Transformers](https://zhuanlan.zhihu.com/p/411890653)
[【Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers】](https://arxiv.org/abs/2109.04448)
- [EMNLP 2021 | Another win for "Transformer + pre-training": HKUST open-sources an efficient multimodal abstractive summarization network](https://zhuanlan.zhihu.com/p/418923591)
[【Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization】](https://arxiv.org/abs/2109.02401)
***
## TPAMI
### Compression & Acceleration
- [TPAMI 2021 | Huawei Noah's Ark & Dacheng Tao's team at the University of Sydney propose versatile convolution filters for lightweight networks](https://zhuanlan.zhihu.com/p/423130563)
[【Learning Versatile Convolution Filters for Efficient Visual Recognition】](https://arxiv.org/abs/2109.09310v1)
***
## ArXiv
### Backbone (CNN, Transformer)
- [Outlook Attention: a ViT with local-information awareness](https://zhuanlan.zhihu.com/p/385561050)
[【VOLO: Vision Outlooker for Visual Recognition】](https://arxiv.org/abs/2106.13112)
- [CoAtNet: convolution + attention = ?](https://zhuanlan.zhihu.com/p/385578588)
[【CoAtNet: Marrying Convolution and Attention for All Data Sizes】](https://arxiv.org/abs/2106.04803)
- [CSWin-T: Microsoft & USTC propose the CSWin Transformer with cross-shaped window attention](https://zhuanlan.zhihu.com/p/388370370)
[【CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows】](https://arxiv.org/abs/2107.00652)
- [Circle Kernel: Gao Huang's team at Tsinghua & Cornell propose circular convolution kernels to further improve convolutional architectures](https://zhuanlan.zhihu.com/p/389159556)
[【Integrating Circle Kernels into Convolutional Neural Networks】](https://arxiv.org/abs/2107.02451)
- [ViP: Oxford & ByteDance propose Visual Parser, explicitly modeling high-level semantic information](https://zhuanlan.zhihu.com/p/390765725)
[【Visual Parser: Representing Part-whole Hierarchies with Transformers】](https://arxiv.org/abs/2107.05790)
- [LG-Transformer: a new Transformer for joint local and global modeling](https://zhuanlan.zhihu.com/p/393202842)
[【Local-to-Global Self-Attention in Vision Transformers】](https://arxiv.org/abs/2107.04735)
- [CoTNet: JD AI Research open-sources a new backbone that won the CVPR open-world image recognition challenge](https://zhuanlan.zhihu.com/p/394795481)
[【Contextual Transformer Networks for Visual Recognition】](https://arxiv.org/abs/2107.12292)
- [S²-MLPv2: Baidu proposes the strongest vision MLP so far, surpassing MLP-Mixer, Swin Transformer, and CycleMLP with 83.6% top-1 accuracy](https://zhuanlan.zhihu.com/p/397003638)
[【S²-MLPv2: Improved Spatial-Shift MLP Architecture for Vision】](https://arxiv.org/abs/2108.01072)
- [Deeper or wider Transformers: which is better? An NUS team concludes "Go Wider Instead of Deeper"](https://zhuanlan.zhihu.com/p/398168686)
[【Go Wider Instead of Deeper】](https://arxiv.org/abs/2107.11817)
- [+8.6 AP on object detection: Microsoft's new Mobile-Former](https://zhuanlan.zhihu.com/p/400291282)
[【Mobile-Former: Bridging MobileNet and Transformer】](https://arxiv.org/abs/2108.05895)
- [A simple and effective Transformer variant: Tsinghua & MSRA open-source Fastformer with linear complexity](https://zhuanlan.zhihu.com/p/409050589)
[【Fastformer: Additive Attention Can Be All You Need】](https://arxiv.org/abs/2108.09084)
- [Visformer: a more vision-friendly Transformer, open-sourced by a Beihang team](https://zhuanlan.zhihu.com/p/409784985)
[【Visformer: The Vision-friendly Transformer】](https://arxiv.org/abs/2104.12533v4)
- [CrossFormer: simple and effective. ZJU CAD&CG, Tencent & Columbia open-source a cross-scale Transformer with notable gains on detection, segmentation, and classification](https://zhuanlan.zhihu.com/p/410155334)
[【CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention】](https://arxiv.org/abs/2108.00154)
- [Ever seen an MLP that looks like a CNN? UO & UIUC propose hierarchical convolutional MLPs for vision](https://zhuanlan.zhihu.com/p/418094475)
[【ConvMLP: Hierarchical Convolutional MLPs for Vision】](https://arxiv.org/abs/2109.04454)
- [Is self-attention really necessary? Microsoft & USTC propose Sparse MLP, cutting computation while improving performance](https://zhuanlan.zhihu.com/p/418093199)
[【Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?】](https://arxiv.org/abs/2109.05422)
- [Object detection reinvented again: Turing Award winner Hinton's team proposes Pix2Seq, casting detection as image captioning](https://zhuanlan.zhihu.com/p/418095279)
[【Pix2seq: A Language Modeling Framework for Object Detection】](https://arxiv.org/abs/2109.10852)
- [Here it comes: a lightweight, general-purpose, mobile-friendly Transformer. Apple proposes MobileViT](https://zhuanlan.zhihu.com/p/424669337)
[【MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer】](https://arxiv.org/abs/2110.02178)
- [UFO-ViT: can a Transformer drop softmax? Kakao proposes UFO-ViT, with high performance and low computation](https://zhuanlan.zhihu.com/p/431194075)
[【UFO-ViT: High Performance Linear Vision Transformer without Softmax】](https://arxiv.org/abs/2109.14382)
- [McGill & Microsoft add convolutions to Vision Transformers to capture finer local information; 87.7% ImageNet top-1 with pre-training; code released](https://zhuanlan.zhihu.com/p/434717405)
[【CvT: Introducing Convolutions to Vision Transformers】](https://arxiv.org/abs/2103.15808)
### Segmentation
- [MaskFormer: unifying semantic and instance segmentation; Facebook & UIUC propose MaskFormer](https://zhuanlan.zhihu.com/p/392731360)
[【Per-Pixel Classification is Not All You Need for Semantic Segmentation】](https://arxiv.org/abs/2107.06278)
- [Polarized Self-Attention: a new channel and spatial attention design that tops COCO human pose estimation and Cityscapes semantic segmentation](https://zhuanlan.zhihu.com/p/389770482)
[【Polarized Self-Attention: Towards High-quality Pixel-wise Regression】](https://arxiv.org/pdf/2107.00782.pdf)
- [First place in panoptic segmentation: NJU, HKU & NVIDIA propose Panoptic SegFormer](https://zhuanlan.zhihu.com/p/418088118)
[【Panoptic SegFormer】](https://arxiv.org/abs/2109.03814)
- [CAS, XJTU & Megvii (Jian Sun's team) propose dynamic routing for semantic segmentation, precisely perceiving multi-scale objects; code released](https://zhuanlan.zhihu.com/p/427709226)
[【Learning Dynamic Routing for Semantic Segmentation】](https://arxiv.org/abs/2003.10401)
### Detection
- [Anchor DETR: can anchor points make DETR both fast and accurate? Megvii's Jian Sun team proposes Anchor DETR](https://zhuanlan.zhihu.com/p/411889426)
[【Anchor DETR: Query Design for Transformer-Based Detector】](https://arxiv.org/abs/2109.07107)
- [Can anchor points make DETR both fast and accurate? Megvii's Jian Sun team proposes Anchor DETR](https://zhuanlan.zhihu.com/p/415578473)
[【Anchor DETR: Query Design for Transformer-Based Detector】](https://arxiv.org/abs/2109.07107)
### Incremental Learning
- [Toward lifelong learning: Georgia Tech proposes data-free class-incremental learning](https://zhuanlan.zhihu.com/p/399085992)
[【Always Be Dreaming: A New Approach for Data-Free Class-Incremental Learning】](https://arxiv.org/abs/2106.09701)
### Multi-Modal
- [UCAS proposes a temporal pyramid Transformer with multimodal interaction for VideoQA](https://zhuanlan.zhihu.com/p/419923517)
[【Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering】](https://arxiv.org/abs/2109.04735)
- [1 billion parameters! Beyond GPT: meet BriVL, a deployed Chinese model. RUC & CAS build the first large-scale multimodal Chinese pre-trained model](https://zhuanlan.zhihu.com/p/425672126)
[【WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training】](https://arxiv.org/abs/2103.06561)
- [How much can CLIP benefit vision-and-language tasks? UC Berkeley & UCLA give the answer](https://zhuanlan.zhihu.com/p/429243265)
[【How Much Can CLIP Benefit Vision-and-Language Tasks?】](https://arxiv.org/abs/2107.06383)
- [Removing the language barrier in pre-trained models: Google proposes MURAL, a multimodal, multitask retrieval model across languages](https://zhuanlan.zhihu.com/p/418098303)
[【MURAL: Multimodal, Multitask Retrieval Across Languages】](https://arxiv.org/abs/2109.05125v1)
- [Microsoft proposes VLMO, unified vision-language pre-training with a mixture of modality experts; code coming soon](https://zhuanlan.zhihu.com/p/436074295)
[【VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts】](https://arxiv.org/abs/2111.02358)
### Video
- [Video Swin Transformer: following Swin Transformer, MSRA open-sources Video Swin Transformer, SOTA on video benchmarks](https://zhuanlan.zhihu.com/p/401600421)
[【Video Swin Transformer】](https://arxiv.org/abs/2106.13230)
- [A video Transformer with space-time mixing attention that drastically reduces computational complexity](https://zhuanlan.zhihu.com/p/420280467)
[【Space-time Mixing Attention for Video Transformer】](https://arxiv.org/abs/2106.05968)
- [Is video action recognition a retrieval problem rather than classification? Building on CLIP, ZJU proposes ActionCLIP; SOTA, code released](https://zhuanlan.zhihu.com/p/439011248)
[【ActionCLIP: A New Paradigm for Video Action Recognition】](https://arxiv.org/abs/2109.08472)
### Compression & Acceleration
- [DynamicViT: still training ViTs on all tokens? Tsinghua & UCLA propose dynamic token sparsification to cut inference cost](https://zhuanlan.zhihu.com/p/405326718)
[【DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification】](https://arxiv.org/abs/2106.02034)
- [60%+ higher throughput for DeiT-S: CASIA, SJTU & YouTu propose Evo-ViT, updating tokens in a slow-fast manner](https://zhuanlan.zhihu.com/p/412199816)
[【Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer】](https://arxiv.org/abs/2108.01390v3)
- [What do neural networks forget after compression? Google researchers give the answer](https://zhuanlan.zhihu.com/p/418099910)
[【What Do Compressed Deep Neural Networks Forget?】](https://arxiv.org/abs/1911.05248)
### Dynamic Networks
- [ZJU, Huawei Noah's Ark & Westlake University propose DyFPN, a dynamic feature pyramid for object detection that cuts FLOPs by 40%](https://zhuanlan.zhihu.com/p/428439288)
[【Dynamic Feature Pyramid Networks for Object Detection】](https://arxiv.org/abs/2012.00779)
- [Dynamic Routing: CAS, XJTU & Megvii (Jian Sun's team) propose dynamic routing for semantic segmentation, precisely perceiving multi-scale objects; code released](https://zhuanlan.zhihu.com/p/430452628)
[【Learning Dynamic Routing for Semantic Segmentation】](https://arxiv.org/abs/2003.10401)
- [Princeton, NVIDIA & Facebook propose fully dynamic inference for deep neural networks to power lightweight models](https://zhuanlan.zhihu.com/p/430518300)
[【Fully Dynamic Inference with Deep Neural Networks】](https://arxiv.org/abs/2007.15151)
### Multimodal Retrieval
- [Another win for CLIP: SWJTU & MSRA propose CLIP4Clip for end-to-end video-text retrieval](https://zhuanlan.zhihu.com/p/433063611)
[【CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval】](https://arxiv.org/abs/2104.08860)
- [Tencent PCG proposes CLIP2Video, solving video-text retrieval with CLIP; SOTA, code released](https://zhuanlan.zhihu.com/p/433083355)
[【CLIP2Video: Mastering Video-Text Retrieval via Image CLIP】](https://arxiv.org/abs/2106.11097)
- [The HERO of video pre-training: Microsoft proposes HERO, a video-language omni-representation pre-training model; code released](https://zhuanlan.zhihu.com/p/434323805)
[【HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training】](https://arxiv.org/abs/2005.00200)
- [Retrieval with subtitles, audio, and video together: Inria & Google propose MMT for efficient cross-modal video retrieval; code released](https://zhuanlan.zhihu.com/p/434991340)
[【Multi-modal Transformer for Video Retrieval】](https://arxiv.org/abs/2007.10639)
- [CLIP2TV: video-text retrieval with CLIP and momentum distillation; Tencent's CLIP2TV is SOTA with a 4.1% gain](https://zhuanlan.zhihu.com/p/438016863)
[【CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval】](https://arxiv.org/abs/2111.05610)
### Others
- [No more prompt engineering: NTU proposes CoOp, adaptively learning prompts for different downstream tasks and crushing hand-crafted prompts](https://zhuanlan.zhihu.com/p/408190719)
[【Learning to Prompt for Vision-Language Models】](https://arxiv.org/abs/2109.01134)
- [Deep networks don't actually need to be that deep: Princeton & Intel propose ParNet, a 12-layer network with over 80% accuracy](https://zhuanlan.zhihu.com/p/429732072)
[【Non-deep Networks】](https://arxiv.org/abs/2110.07641)
- [NeurIPS 2021 | HKU, Tencent AI Lab & Oxford propose CARE, letting CNNs and Transformers help each other in contrastive learning](https://zhuanlan.zhihu.com/p/430773996)
[【Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning】](https://arxiv.org/abs/2110.05340)
- [New work from FAIR's Kaiming, Piotr, and Ross: MAE reaches 87.8% top-1 using only ImageNet-1K](https://zhuanlan.zhihu.com/p/432663453)
[【Masked Autoencoders Are Scalable Vision Learners】](https://arxiv.org/abs/2111.06377)
- [Swin Transformer V2: the original MSRA team explores scaling Swin to very large sizes and proposes a 3-billion-parameter version](https://zhuanlan.zhihu.com/p/436072504)
[【Swin Transformer V2: Scaling Up Capacity and Resolution】](https://arxiv.org/abs/2111.09883)
- [BEiT: pre-training via image reconstruction. Microsoft's BEiT reaches 86.3% top-1; code released](https://zhuanlan.zhihu.com/p/438726362)
[【BEIT: BERT Pre-Training of Image Transformers】](https://arxiv.org/abs/2106.08254)
- [RANet: an upgraded MSDNet. Gao Huang's team at Tsinghua proposes resolution-adaptive networks for efficient inference](https://mp.weixin.qq.com/s?__biz=MzIzNzU4OTAxMQ==&mid=2247484390&idx=1&sn=1b5ba35d076ba32827b427e224feecc4&chksm=e8c7049adfb08d8cbc12c4c5418d35cf321a01975b92136d57f8f617c6b8c5bdbd871ea1d9a9&token=876992619&lang=zh_CN#rd)
[【Resolution Adaptive Networks for Efficient Inference】](https://arxiv.org/abs/2003.07326)
- [ByteDance, Johns Hopkins & SJTU propose the iBOT framework, self-supervised training via MIM, reaching 86.3% fine-tuning accuracy on ImageNet-1K](https://mp.weixin.qq.com/s/Hid41s7RQlGT6b6tE60XOg)
[【iBOT: Image BERT Pre-Training with Online Tokenizer】](https://arxiv.org/abs/2111.07832)
- [Tsinghua, MBZUAI, CMU & Oxford propose DenseCLIP, language-guided dense prediction with context-aware prompting](https://mp.weixin.qq.com/s?__biz=MzIzNzU4OTAxMQ==&mid=2247484402&idx=1&sn=9bc28bd7faa9632e6fd1179fbb56ac1a&chksm=e8c7048edfb08d988c7495759671fb2d8041e51ee792de249b7ac92407ead06e2bdc11b3373b&token=1352660427&lang=zh_CN#rd)
[【DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting】](https://arxiv.org/abs/2112.01518)
- [Microsoft proposes SwinBERT, the first end-to-end video captioning method, with significant gains](https://mp.weixin.qq.com/s/u1y_Eid8YHC2zM_LUDv1Ag)
[【SWIN BERT: End-to-End Transformers with Sparse Attention for Video Captioning】](https://arxiv.org/abs/2111.13196)
- [Enhancing video-language understanding with CLIP, SOTA on the VALUE benchmark](https://mp.weixin.qq.com/s/iSfqa6lY8KZJ92YTbd8Z3g)
[【A CLIP-Enhanced Method for Video-Language Understanding】](https://arxiv.org/abs/2110.07137)
- [USTC & Kuaishou propose MMCA, a multi-modality cross attention model for image-text matching](https://mp.weixin.qq.com/s/lnbIIMb42p5xWgE90OvC1A)
[【Multi-Modality Cross Attention Network for Image and Sentence Matching】](https://openaccess.thecvf.com/content_CVPR_2020/papers/Wei_Multi-Modality_Cross_Attention_Network_for_Image_and_Sentence_Matching_CVPR_2020_paper.pdf)
- [AFTrans: a free lunch from ViT. PKU & Alibaba propose an adaptive attention multi-scale fusion Transformer for fine-grained visual recognition](https://mp.weixin.qq.com/s/v7w_BwcaeKf4zebPecF_uA)
[【A free lunch from ViT- Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition】](https://arxiv.org/abs/2110.01240)
- [ELF: plug-and-play for long-tailed classification. GT & UIUC propose an early-exiting framework that improves accuracy and speed](https://mp.weixin.qq.com/s/L7B4m4ON8txqIZ-fgcUSvA)
[【ELF: An Early-Exiting Framework for Long-Tailed Classification】](https://arxiv.org/abs/2006.11979)
- [SemVLP: single-stream or two-stream Transformer? Alibaba takes both, proposing a Transformer with pluggable modules](https://mp.weixin.qq.com/s/zNLETZVcvGc7NPStwzGorw)
[【SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels】](https://arxiv.org/pdf/2103.07829v1.pdf)
- [Classic revisited: FAIR's SlowFast processes video at different sampling rates with an asymmetric two-branch network; code released](https://mp.weixin.qq.com/s/_fDwpwEGmse1xCcF42i-uw)
[【SlowFast Networks for Video Recognition】](https://arxiv.org/abs/1812.03982)
- [An all-purpose AI: handling multimodal, multi-task problems with a general pre-trained perception model. SenseTime, XJTU & CUHK propose Uni-Perceiver](https://mp.weixin.qq.com/s/1hUBIxuKqC6HEbHS577c7A)
[【Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks】](https://arxiv.org/abs/2112.01522)
- [Video training too slow? Try UT Austin & FAIR's multigrid training: 4.5x faster, with accuracy gains](https://mp.weixin.qq.com/s/Iryl3N60j5kpwXRaekJZwg)
[【A Multigrid Method for Efficiently Training Video Models】](https://arxiv.org/abs/1912.00998)
- [One Transformer for both CV and NLP tasks: Google & UCLA propose a unified foundation model](https://mp.weixin.qq.com/s/DXvnT61su9fjUkmLS7VwRw)
[【Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text】](https://arxiv.org/abs/2112.07074)
- [Zero-shot image-text retrieval beyond CLIP: FILIP achieves better pre-training efficiency with fine-grained late interaction](https://mp.weixin.qq.com/s/uRHh8RWzshX8PaYphsxJ5A)
[【FILIP: Fine-grained Interactive Language-Image Pre-Training】](https://arxiv.org/abs/2111.07783)
- [Align and Prompt: Salesforce & ANU propose ALPRO for fine-grained video-text alignment; code released](https://mp.weixin.qq.com/s/8V7VWAsCfsBubKXx14vr6Q)
[【Align and Prompt: Video-and-Language Pre-training with Entity Prompts】](https://arxiv.org/abs/2112.09583)
- [Multimodal pre-training with unpaired image-text data? Baidu proposes the unified-modal pre-training framework UNIMO (ACL 2021)](https://mp.weixin.qq.com/s/b3NB6pZ3b5laomQvqfTvOw)
[【UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning】](https://arxiv.org/abs/2012.15409)
- [CPT: dominating few-shot REC. Zhiyuan Liu's team at Tsinghua proposes cross-modal prompt tuning for pre-trained vision-language models](https://arxiv.org/abs/2109.11797)
[【CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models】](https://arxiv.org/abs/2109.11797)
- [KD-VLP: combining knowledge distillation with pre-training. ShanghaiTech, Intel & MSRA propose an end-to-end multimodal pre-training model based on knowledge distillation](https://mp.weixin.qq.com/s/PG1FKQU64uL0rCtGpge_iw)
[【KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation】](https://arxiv.org/abs/2109.10504)
- [Another task falls to Transformers: NVIDIA open-sources HORST, using a Transformer for early action recognition and action anticipation](https://mp.weixin.qq.com/s/bLcGxUDEJWG3_pVGwlTkZQ)
[【Higher Order Recurrent Space-Time Transformer for Video Action Prediction】](https://arxiv.org/abs/2104.08665)
- [Classic revisited: static structures can't meet deployment needs? Microsoft proposes dynamic convolution, lifting top-1 accuracy by 2.9% (with reproduction code)](https://mp.weixin.qq.com/s/5eX0mpxiwIrfQ0yE8aUZCg)
[【Dynamic Convolution: Attention over Convolution Kernels】](https://arxiv.org/abs/1912.03458)
- [VideoCLIP: Facebook & CMU open-source contrastive pre-training for video-text understanding; SOTA and suited to zero-shot learning](https://mp.weixin.qq.com/s/IU2rkDAzXmYBAbBtuWHjFA)
[【VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding】](https://arxiv.org/pdf/2109.14084.pdf)
- [Classic revisited: not every input needs to share the same kernel. Google proposes conditionally parameterized convolution, CondConv (with PyTorch reproduction code)](https://mp.weixin.qq.com/s/AeSz2TvBFRXkQOI67MG1GQ)
[【CondConv: Conditionally Parameterized Convolutions for Efficient Inference】](https://arxiv.org/abs/1904.04971)
- [ConvMixer: a network written in 7 lines of PyTorch that reaches 80%+ accuracy on ImageNet (see the sketch at the end of this subsection)](https://mp.weixin.qq.com/s/Q04jlsUm-DHUYJxJeC1Tgg)
[【Patches Are All You Need?】](https://openreview.net/forum?id=TVHS5Y4dNvM)
- [Facebook AI & Oxford propose a video Transformer with trajectory attention, SOTA on video action recognition](https://mp.weixin.qq.com/s/16esna6mDGNTSOg-NkmJ-w)
[【Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers】](https://arxiv.org/abs/2106.05392)
- [Score-CAM | Explaining CNN predictions with score-weighted activation maps](https://mp.weixin.qq.com/s/V_ULnnI4vUlpaPysCj6Zfg)
[【Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks】](https://ieeexplore.ieee.org/document/9150840)
- [Contextual non-local alignment over full-scale representations: SUSTech & YouTu propose NAFS for text-based ReID](https://mp.weixin.qq.com/s/SMLmy6Vg8hf9tLnZT0gDfg)
[【Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search】](https://arxiv.org/abs/2101.03036)
- [Image matching with GANs: the University of Houston proposes adversarial representation learning for text-to-image matching, removing the modality gap](https://mp.weixin.qq.com/s/KJ2JJORAimXt2jmtkVJZXQ)
[【Adversarial Representation Learning for Text-to-Image Matching】](https://openaccess.thecvf.com/content_ICCV_2019/papers/Sarafianos_Adversarial_Representation_Learning_for_Text-to-Image_Matching_ICCV_2019_paper.pdf)
- [Turing Award winner LeCun shows that masking strategies also work for ViT-style Siamese networks in self-supervised learning](https://mp.weixin.qq.com/s/tspMoUUxKsBAr-p7kw20Iw)
[【Masked Siamese ConvNets】](https://arxiv.org/abs/2206.07700)
- [ECCV 2018 | DUT (Huchuan Lu's team) proposes deep cross-modal projection learning for image-text matching](https://mp.weixin.qq.com/s/fvOyVDKA6O1llln59nDVqA)
[【Deep Cross-Modal Projection Learning for Image-Text Matching】](https://openaccess.thecvf.com/content_ECCV_2018/papers/Ying_Zhang_Deep_Cross-Modal_Projection_ECCV_2018_paper.pdf)
- [Classic revisited | VSE++, a classic work for retrieval tasks](https://mp.weixin.qq.com/s/elT9bYyeKFiQPfkOQqCyqA)
[【VSE++: Improving Visual-Semantic Embeddings with Hard Negatives】](http://www.bmva.org/bmvc/2018/contents/papers/0344.pdf)
[【Code】](https://github.com/fartashf/vsepp)
- [Prompt tuning can do this too? Applying prompt tuning to fine-grained image retrieval](https://mp.weixin.qq.com/s/TyH9C1t_Z0jiEMKBFdoi9Q)
[【Fine-grained Retrieval Prompt Tuning】](https://arxiv.org/abs/2207.14465)
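For reference, the ConvMixer entry above really does reduce to a handful of PyTorch lines; the sketch below is an expanded, readable version of the compact implementation described in the paper, with illustrative hyper-parameter values:

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a module with a skip connection."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    # patch embedding, then `depth` blocks of depthwise + pointwise convs
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(), nn.BatchNorm2d(dim),
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),  # padding="same" needs PyTorch >= 1.9
                nn.GELU(), nn.BatchNorm2d(dim))),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(), nn.BatchNorm2d(dim))
          for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.Linear(dim, n_classes))

if __name__ == "__main__":
    logits = ConvMixer(dim=256, depth=8)(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 1000])
```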
## Multimodal ReID
- [TIP | Tieniu Tan's team at CASIA proposes GARN, a graph attentive relational network, SOTA on multiple datasets](https://mp.weixin.qq.com/s/O62SaEf7OSm1Vt5s2prqtg)
[【Learning Aligned Image-Text Representations Using Graph Attentive Relational Network】](https://ieeexplore.ieee.org/document/9318563)
## Other Technical Articles
- [CLIP: a revolution in the multimodal field](https://mp.weixin.qq.com/s/MhtS3o0v14qhEjJNyM_QYg)
- [A long read from Tencent: practical lessons on embedding techniques for recommender systems](https://mp.weixin.qq.com/s/lgHKt60oHWveXZ-3HYgBYg)
- [Masked image modeling for self-supervised representation pre-training: CAE and its relation to MAE and BEiT](https://mp.weixin.qq.com/s/vWCz2bU6TKqWPfmFYOHm3w)
- [A vision and preliminary plan for on-demand visual recognition](https://mp.weixin.qq.com/s/pql5Po8tYKQjsr1RPJtMjA)
- [Why are networks with residual connections easier to train?](https://mp.weixin.qq.com/s/tSpvwLEYwZQ1PaartVxCBA)
- [How are machine learning and data analysis related, and which should you learn first?](https://mp.weixin.qq.com/s/-ws_Y86vQdDond9S9foNhg)
- [A PyTorch training code template for deep learning (personal habits; a generic minimal version appears after this list)](https://mp.weixin.qq.com/s/s9B2s5fvk2sz8tqVKIvuZA)
- [Deep learning over the years, seen from the frameworks I developed: TensorFlow, PaddlePaddle, WuLiang](https://mp.weixin.qq.com/s/vnZ22AMMtkj4LmZdYHsaqg)
- [For beginners: a deep dive into the relationship between model size and inference speed](https://mp.weixin.qq.com/s/QBDERm70zbfKPVsMIbcznw)
- [Nine distance measures commonly used in machine learning](https://mp.weixin.qq.com/s/TgS_8sZZRP-XyffWrsjR7Q)
- [BEVFormer cured my inner turmoil](https://mp.weixin.qq.com/s/j8SLZVOQU7GDLwBPQJ_8GQ)
- [How do top researchers write a first-rate paper from scratch?](https://mp.weixin.qq.com/s/e7fGPwJwkmYf5IMTYhrOOA)
- [Understanding diffusion models from a unified perspective](https://mp.weixin.qq.com/s/-Mp0fpiq7CPDX4a0WDYusw)
- [What tricks help you reach SOTA in deep learning?](https://mp.weixin.qq.com/s/zocvJ48qz7y6Q_do08iAYw)
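As a rough companion to the PyTorch training-template post above (this is a generic sketch, not the template from that article), a minimal training loop usually looks like:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(model: nn.Module, loader: DataLoader, epochs: int = 10, lr: float = 1e-3):
    """Minimal supervised training loop: forward, loss, backward, step."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * x.size(0)
        print(f"epoch {epoch}: loss {running_loss / len(loader.dataset):.4f}")

if __name__ == "__main__":
    data = TensorDataset(torch.randn(256, 32), torch.randint(0, 4, (256,)))  # toy dataset
    train(nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4)),
          DataLoader(data, batch_size=32, shuffle=True), epochs=3)
```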
## Research Q&A
- [As a PhD student, how do you cope when the idea you came up with turns out to have been done already?](https://mp.weixin.qq.com/s/1bExX_-_K-pIJIkLkFa3ZA)
- [How to read the related literature quickly and efficiently before writing a survey?](https://mp.weixin.qq.com/s/Ksn8CIDb9LyQwdvmpk0qzQ)
- [The novelty is small but it works: can you still publish the paper?](https://mp.weixin.qq.com/s/bVufIhg_B9q8Ywmfj-kpxw)
- [When publishing as a graduate student, does the idea come before the experiments, or the experiments before the idea?](https://mp.weixin.qq.com/s/gK9q7-5l7muKHz3TzZ73ww)
- [What does top-tier research look like?](https://mp.weixin.qq.com/s/DJI-jTEZH5s_xQ47yNp8JA)
- [Mu Li (Amazon) | My five years as a PhD student](https://mp.weixin.qq.com/s/FRsXlg-DPfPSjr5-Di8AXg)
- [Mu Li (Amazon) | Reflections on five years of work](https://mp.weixin.qq.com/s/YHGFUJAnAOTLg9bC96Mk9g)
- [Why is everyone so pessimistic about doing a PhD?](https://mp.weixin.qq.com/s/RmLkiBoBr0l9DSOmePuBiQ)
- [How did things go after you were recommended for graduate admission?](https://mp.weixin.qq.com/s/xpvXamCNGzV_cvpcMZBBIA)
- [What happened to the people who churned out large numbers of low-value papers during their master's or PhD?](https://mp.weixin.qq.com/s/OYNPvyM--7BaqI0cJrZ30Q)
- [Is a PhD really that hard to get through?](https://mp.weixin.qq.com/s/E3t4IATbORcEh3PHcz8D5w)
- [What is it like to intern at Microsoft Research Asia (MSRA)?](https://mp.weixin.qq.com/s/7SEw7uEWrXdMl-wHQhVmlA)
- [How much does your lab's hardware affect your research?](https://mp.weixin.qq.com/s/jxbQOC04aQ-PCPY5TtMezQ)
- [The 2022 Alibaba Global Mathematics Competition winners are out, and more than half were born after 2000: what to make of this?](https://mp.weixin.qq.com/s/DBIz3ZAY-Lo89vejRHfvYw)
- [Still doing research over the National Day holiday? Is taking all seven days off too much of a luxury?](https://mp.weixin.qq.com/s/HyeYL-GbVtqU1IBHZJZ8sw)
- [Should you publish a weak research idea?](https://mp.weixin.qq.com/s/Hlbq40hUiKS_OM6KVIXusg)
- [Stories from the PhD journey](https://mp.weixin.qq.com/s/5nJlDO0cVx9lUmaj4L-uLw)