Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/Yutong-Zhou-cv/Awesome-Multimodality

A Survey on multimodal learning research.

List: Awesome-Multimodality

Topics: awesome-list, multimodal-deep-learning, multimodality

Last synced: 3 months ago


# Awesome Multimodality 🎶📜



[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
![GitHub stars](https://img.shields.io/github/stars/Yutong-Zhou-cv/Awesome-Multimodality.svg?color=red&style=for-the-badge)
![GitHub forks](https://img.shields.io/github/forks/Yutong-Zhou-cv/Awesome-Multimodality.svg?color=yellow&style=for-the-badge)
![GitHub activity](https://img.shields.io/github/last-commit/Yutong-Zhou-cv/Awesome-Multimodality?style=for-the-badge)
![Visitors](https://visitor-badge.glitch.me/badge?page_id=Yutong-Zhou-cv/Awesome-Multimodality)
[![Star History Chart](https://api.star-history.com/svg?repos=Yutong-Zhou-cv/Awesome-Multimodality&type=Date)](https://star-history.com/#Yutong-Zhou-cv/Awesome-Multimodality&Date)

A collection of resources on multimodal learning research.

## *Content*
* - [ ] [1. Description](#head1)
* - [ ] [2. Topic Order](#head2)
  * - [ ] [Survey](#head-survey)
  * - [ ] [👑 Dataset](#head-dataset)
  * - [ ] [💬 Vision and language Pre-training (VLP)](#head-VLP)
* - [ ] [3. Chronological Order](#head3)
  * - [ ] [2023](#head-2023)
  * - [ ] [2022](#head-2022)
  * - [x] [2021](#head-2021)
  * - [x] [2020](#head-2020)
* - [ ] [4. Courses](#head4)
* [*Contact Me*](#head5)

## *1. Description*

> 🐌 Markdown Format:
>
> * (Conference/Journal Year) **Title**, First Author et al. [[Paper](URL)] [[Code](URL)] [[Project](URL)]
> * (Conference/Journal Year) [💬Topic] **Title**, First Author et al. [[Paper](URL)] [[Code](URL)] [[Project](URL)]
> * (Optional) ```🌱``` or ```📌```
> * (Optional) 🚀 or 👑 or 📚

* ```🌱: Novel idea```
* ```📌: The first...```
* ```🚀: State-of-the-Art```
* ```👑: Novel dataset/model```
* ```📚: Downstream Tasks```
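
For example, the BLIP entry under *2. Topic Order* written in this format:

> * (arXiv preprint 2022) **BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation**, Junnan Li et al. [[Paper](https://arxiv.org/abs/2201.12086)] [[Code](https://github.com/salesforce/BLIP)]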

## *2. Topic Order*

* **[Survey](https://github.com/Yutong-Zhou-cv/Awesome-Survey-Papers)**
* (TPAMI 2023) **Multimodal Image Synthesis and Editing: A Survey and Taxonomy**, Fangneng Zhan et al. [[v1](https://arxiv.org/abs/2112.13592v1)](2021.12.27) ... [[v5](https://arxiv.org/abs/2112.13592v5)](2023.08.05)
* (TPAMI 2023) [💬Transformer] **Multimodal Learning with Transformers: A Survey**, Peng Xu et al. [[v1](https://arxiv.org/abs/2206.06488)](2022.06.13) [[v2](https://ieeexplore.ieee.org/abstract/document/10123038)](2023.05.11)
* (Multimedia Tools and Applications) **A comprehensive survey on generative adversarial networks used for synthesizing multimedia content**, Lalit Kumar & Dushyant Kumar Singh [[v1](https://link.springer.com/article/10.1007/s11042-023-15138-x#Sec47)](2023.03.30)
* ⭐⭐(arXiv preprint 2023) **Multimodal Deep Learning**, Cem Akkus et al. [[v1](https://arxiv.org/abs/2301.04856)](2023.01.12)
* ⭐(arXiv preprint 2022) [💬Knowledge Enhanced] **A survey on knowledge-enhanced multimodal learning**, Maria Lymperaiou et al. [[v1](https://arxiv.org/abs/2211.12328)](2022.11.19)
* ⭐⭐(arXiv preprint 2022) **Vision-Language Pre-training: Basics, Recent Advances, and Future Trends**, Zhe Gan et al. [[v1](https://arxiv.org/abs/2210.09263)](2022.10.17)
* ⭐(arXiv preprint 2022) **Vision+X: A Survey on Multimodal Learning in the Light of Data**, Ye Zhu et al. [[v1](https://arxiv.org/abs/2210.02884)](2022.10.05)
* (arXiv preprint 2022) **Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions**, Paul Pu Liang et al. [[v1](https://arxiv.org/abs/2209.03430)](2022.09.07)
* (arXiv preprint 2022) [💬Cardiac Image Computing] **Multi-Modality Cardiac Image Computing: A Survey**, Lei Li et al. [[v1](https://arxiv.org/pdf/2208.12881.pdf)](2022.08.26)
* (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] **Vision-and-Language Pretraining**, Thong Nguyen et al. [[v1](https://arxiv.org/abs/2207.01772)](2022.07.05)
* (arXiv preprint 2022) [💬Video Saliency Detection] **A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!**, Chenglizhao Chen et al. [[v1](https://arxiv.org/abs/2206.13390)](2022.06.20)
* (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] **Vision-and-Language Pretrained Models: A Survey**, Siqu Long et al. [[v1](https://arxiv.org/abs/2204.07356v1)](2022.04.15)...[[v5](https://arxiv.org/abs/2204.07356v5)](2022.05.03)
* (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] **VLP: A Survey on Vision-Language Pre-training**, Feilong Chen et al. [[v1](https://arxiv.org/abs/2202.09061v1)](2022.02.18) [[v2](https://arxiv.org/abs/2202.09061v2)](2022.02.21)
* (arXiv preprint 2022) [💬Vision and language Pre-training (VLP)] **A Survey of Vision-Language Pre-Trained Models**, Yifan Du et al. [[v1](https://arxiv.org/abs/2202.10936)](2022.02.18)
* (arXiv preprint 2022) [💬Multi-Modal Knowledge Graph] **Multi-Modal Knowledge Graph Construction and Application: A Survey**, Xiangru Zhu et al. [[v1](https://arxiv.org/pdf/2202.05786.pdf)](2022.02.11)
* (arXiv preprint 2022) [💬Auto Driving] **Multi-modal Sensor Fusion for Auto Driving Perception: A Survey**, Keli Huang et al. [[v1](https://arxiv.org/abs/2202.02703v1)](2022.02.06) [[v2](https://arxiv.org/abs/2202.02703)](2022.02.27)
* (arXiv preprint 2021) **A Survey on Multi-modal Summarization**, Anubhav Jangra et al. [[v1](https://arxiv.org/pdf/2109.05199.pdf)](2021.09.11)
* (Information Fusion 2021) [💬Vision and language] **Multimodal research in vision and language: A review of current and emerging trends**, Shagun Uppal et al. [[v1](https://www.sciencedirect.com/science/article/pii/S1566253521001512)](2021.08.01)

* **👑 Dataset**
* (arXiv preprint 2023) **Sticker820K: Empowering Interactive Retrieval with Stickers**, Sijie Zhao et al. [[Paper](https://arxiv.org/abs/2306.06870)] [[Github](https://github.com/sijeh/Sticker820K)]
* (arXiv preprint 2023) **Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration**, Chenyang Lyu et al. [[Paper](https://arxiv.org/abs/2306.09093)] [[Github](https://github.com/lyuchenyang/Macaw-LLM)]
* (arXiv preprint 2022) **Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework**, Jiaxi Gu et al. [[Paper](https://arxiv.org/abs/2202.06767)] [[Download](https://wukong-dataset.github.io/wukong-dataset/download.html)]
* The Noah-Wukong dataset is a large-scale multi-modal Chinese dataset.
* The dataset contains 100 million image-text pairs.
* Images are filtered by size (> 200px on both dimensions) and aspect ratio (between 1/3 and 3).
* Texts are filtered by language, length, and frequency; privacy and sensitive words are also taken into consideration. (See the filtering sketch after this dataset list.)
* (arXiv preprint 2022) **WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models**, Sha Yuan et al. [[Paper](https://arxiv.org/abs/2203.11480)] [[Download](https://github.com/BAAI-WuDao/WuDaoMM/)]
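
The Wukong filtering criteria above translate directly into a small pair filter. Below is a minimal, illustrative Python sketch (not the authors' release code): the 200px and 1/3–3 thresholds come from the list above, while the token-length bounds and sensitive-word list are placeholder assumptions.

```python
# Illustrative sketch only: Wukong-style filtering of (image size, caption) pairs.
# Thresholds (200px, aspect ratio 1/3..3) follow the criteria listed above;
# the token-length bounds and stop-list are placeholder assumptions.

SENSITIVE_WORDS = {"password", "address"}  # hypothetical stop-list


def keep_image(width: int, height: int) -> bool:
    """Keep images larger than 200px on both dimensions with aspect ratio in [1/3, 3]."""
    if width <= 200 or height <= 200:
        return False
    ratio = width / height
    return 1 / 3 <= ratio <= 3


def keep_text(caption: str, min_tokens: int = 2, max_tokens: int = 32) -> bool:
    """Keep captions of reasonable length that contain no sensitive words."""
    tokens = caption.lower().split()
    if not min_tokens <= len(tokens) <= max_tokens:
        return False
    return not any(tok in SENSITIVE_WORDS for tok in tokens)


def filter_pairs(pairs):
    """Yield ((width, height), caption) pairs that pass both filters."""
    for (width, height), caption in pairs:
        if keep_image(width, height) and keep_text(caption):
            yield (width, height), caption


if __name__ == "__main__":
    sample = [((640, 480), "a dog playing in the park"),
              ((150, 900), "too small and too narrow")]
    print(list(filter_pairs(sample)))  # keeps only the first pair
```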

* **💬 Vision and language Pre-training (VLP)**
* (arXiv preprint 2023) **mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video**, Haiyang Xu et al. [[Paper](https://arxiv.org/abs/2302.00402)] [[Code](https://github.com/alibaba/AliceMind/tree/main/mPLUG)]
* 📚 Downstream Tasks:
* [Vision Only] Video Action Recognition, Image Classification, Object Detection and Segmentation
* [Language Only] Natural Language Understanding, Natural Language Generation
* [Video-Text] Text-to-Video Retrieval, Video Question Answering, Video Captioning
* [Image-Text] Image-Text Retrieval, Visual Question Answering, Image Captioning, Visual Grounding
* (EMNLP 2022) **FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning**, Suvir Mirchandani et al. [[Paper](https://arxiv.org/abs/2210.15028)]
* 📚 Downstream Tasks: Image-to-Text Retrieval & Text-to-Image Retrieval, Image Retrieval with Text Feedback, Category Recognition & Subcategory Recognition, Image Captioning, Relative Image Captioning
* (arXiv preprint 2022) **PaLI: A Jointly-Scaled Multilingual Language-Image Model**, Xi Chen et al. [[Paper](https://arxiv.org/abs/2209.06794)]
* 📚 Downstream Tasks: Image Captioning, Visual Question Answering (VQA), Language-understanding Capabilities, Zero-shot Image Classification
* ⭐⭐(arXiv preprint 2022) **Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks**, Wenhui Wang et al. [[Paper](https://arxiv.org/abs/2208.10442)] [[Code](https://github.com/microsoft/unilm/tree/master/beit)]

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/semantic-segmentation-on-ade20k-val)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k-val?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/cross-modal-retrieval-on-coco-2014)](https://paperswithcode.com/sota/cross-modal-retrieval-on-coco-2014?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/zero-shot-cross-modal-retrieval-on-flickr30k)](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-flickr30k?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/cross-modal-retrieval-on-flickr30k)](https://paperswithcode.com/sota/cross-modal-retrieval-on-flickr30k?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/visual-reasoning-on-nlvr2-dev)](https://paperswithcode.com/sota/visual-reasoning-on-nlvr2-dev?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/visual-reasoning-on-nlvr2-test)](https://paperswithcode.com/sota/visual-reasoning-on-nlvr2-test?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/visual-question-answering-on-vqa-v2-test-dev)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-test-dev?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/visual-question-answering-on-vqa-v2-test-std)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-test-std?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/instance-segmentation-on-coco)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=image-as-a-foreign-language-beit-pretraining)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/image-as-a-foreign-language-beit-pretraining/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=image-as-a-foreign-language-beit-pretraining)
* 📚 【Visual-Language】Visual Question Answering (VQA), Visual Reasoning, Image Captioning, Image-Text Retrieval
* 📚 【Visual】Object Detection, Instance Segmentation, Semantic Segmentation, Image Classification
* (ECCV 2022) **Exploiting Unlabeled Data with Vision and Language Models for Object Detection**, Shiyu Zhao et al. [[Paper](https://arxiv.org/abs/2207.08954)] [[Code](https://github.com/xiaofeng94/VL-PLM)]
* 📚 Downstream Tasks: Open-vocabulary object detection, Semi-supervised object detection, Pseudo label generation
* ⭐⭐[**CVPR 2022 Tutorial**] **Recent Advances in Vision-and-Language Pre-training** [[Project](https://vlp-tutorial.github.io/2022/)]
* ⭐⭐(arXiv preprint 2022) [💬Data Augmentation] **MixGen: A New Multi-Modal Data Augmentation**, Xiaoshuai Hao et al. [[Paper](https://arxiv.org/abs/2206.08358)]
* 📚 Downstream Tasks: Image-Text Retrieval, Visual Question Answering (VQA), Visual Grounding, Visual Reasoning, Visual Entailment
* ⭐⭐(ICML 2022) **Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts**, Yan Zeng et al. [[Paper](https://arxiv.org/abs/2111.08276)] [[Code](https://github.com/zengyan-97/X-VLM)]
* 🚀 SOTA(2022/06/16): Cross-Modal Retrieval on COCO 2014 & Flickr30k, Visual Grounding on RefCOCO+ val & RefCOCO+ testA, RefCOCO+ testB
* 📚 Downstream Tasks: Image-Text Retrieval, Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR2), Visual Grounding, Image Captioning
* ⭐⭐(arXiv preprint 2022) **Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts**, Basil Mustafa et al. [[Paper](https://arxiv.org/abs/2206.02770)] [[Blog](https://ai.googleblog.com/2022/06/limoe-learning-multiple-modalities-with.html)]
* 📌 LIMoE: The first large-scale multimodal mixture-of-experts model.
* (CVPR 2022) **Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment**, Mingyang Zhou et al. [[Paper](https://arxiv.org/abs/2203.00242)] [[Code](https://github.com/zmykevin/UVLP)]
* 📚 Downstream Tasks: Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR2), Visual Entailment, Referring Expression (RefCOCO+)
* ⭐(arXiv preprint 2022) **One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code**, Yong Dai et al. [[Paper](https://arxiv.org/abs/2205.06126)]
* 📚 Downstream Tasks: Text Classification, Automatic Speech Recognition, Text-to-Image Retrieval, Text-to-Video Retrieval, Text-to-Code Retrieval
* (arXiv preprint 2022) **Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework**, Chunyu Xie et al. [[Paper](https://arxiv.org/abs/2205.03860)]
* 📚 Downstream Tasks: Image-text Retrieval, Chinese Image-text matching
* (arXiv preprint 2022) **Vision-Language Pre-Training with Triple Contrastive Learning**, Jinyu Yang et al. [[Paper](https://arxiv.org/abs/2202.10401)] [[Code](https://github.com/uta-smile/TCL)]
* 📚 Downstream Tasks: Image-text Retrieval, Visual Question Answering, Visual Entailment, Visual Reasoning
* (arXiv preprint 2022) **MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment**, Zejun Li et al. [[Paper](https://arxiv.org/abs/2201.12596)]
* 📚 Downstream Tasks: Image-text Retrieval, Multi-Modal Classification, Visual Grounding
* (arXiv preprint 2022) **BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation**, Junnan Li et al. [[Paper](https://arxiv.org/abs/2201.12086)] [[Code](https://github.com/salesforce/BLIP)]
* 📚 Downstream Tasks: Image-text Retrieval, Image Captioning, Visual Question Answering, Visual Reasoning, Visual Dialog
* (ICML 2021) **ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision**, Wonjae Kim et al. [[Paper](https://arxiv.org/abs/2102.03334)]
* 📚 Downstream Tasks: Image Text Matching, Masked Language Modeling

## *3. Chronological Order*

* **2023**
* (arXiv preprint 2023) **Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation**, Zhiwei Zhang et al. [[Paper](https://arxiv.org/abs/2303.05983)] [[Project](https://matrix-alpha.github.io/)] [[Code](https://github.com/matrix-alpha/Accountable-Textual-Visual-Chat)]
* (arXiv preprint 2023) **Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**, Gen Luo et al. [[Paper](https://arxiv.org/abs/2305.15023)] [[Project](https://luogen1996.github.io/lavin/)] [[Code](https://github.com/luogen1996/LaVIN)]
* ⭐⭐(arXiv preprint 2023) **Any-to-Any Generation via Composable Diffusion**, Zineng Tang et al. [[Paper](https://arxiv.org/abs/2305.11846)] [[Project](https://codi-gen.github.io/)] [[Code](https://github.com/microsoft/i-Code/tree/main/i-Code-V3)]
* 📚[Single-to-Single Generation] Text → Image, Audio → Image, Image → Video, Image → Audio, Audio → Text, Image → Text
* 📚[Multi-Outputs Joint Generation] Text → Video + Audio, Text → Text + Audio + Image, Text + Image → Text + Image
* 📚[Multiple Conditioning] Text + Audio → Image, Text + Image → Image, Text + Audio + Image → Image, Text + Audio → Video, Text + Image → Video, Video + Audio → Text, Image + Audio → Audio, Text + Image → Audio
* ⭐⭐(arXiv preprint 2023) **mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality**, Qinghao Ye et al. [[Paper](https://arxiv.org/abs/2304.14178)] [[Demo](https://www.modelscope.cn/studios/damo/mPLUG-Owl/summary)] [[Code](https://github.com/X-PLUG/mPLUG-Owl)]
* (arXiv preprint 2023) **Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models**, Zhiqiu Lin et al. [[Paper](https://arxiv.org/abs/2301.06267)] [[Project](https://linzhiqiu.github.io/papers/cross_modal/)] [[Code](https://github.com/linzhiqiu/cross_modal_adaptation)]


* **2022**
* (arXiv preprint 2022) [💬Visual Metaphors] **MetaCLUE: Towards Comprehensive Visual Metaphors Research**, Arjun R. Akula et al. [[Paper](https://arxiv.org/abs/2212.09898)] [[Project](https://metaclue.github.io/)]
* (arXiv preprint 2022) **MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks**, Letitia Parcalabescu et al. [[Paper](https://arxiv.org/abs/2212.08158)] [[Code](https://github.com/Heidelberg-NLP/MM-SHAP)]
* (arXiv preprint 2022) **Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**, Xingqian Xu et al. [[Paper](https://arxiv.org/abs/2211.08332)] [[Code](https://github.com/SHI-Labs/Versatile-Diffusion)] [[Hugging Face](https://huggingface.co/spaces/shi-labs/Versatile-Diffusion)]
* 📚 Downstream Tasks: Text-to-Image, Image-Variation, Image-to-Text, Disentanglement, Text+Image-Guided Generation, Editable I2T2I
* (Machine Intelligence Research) [💬Vision-language transformer] **Masked Vision-Language Transformer in Fashion**, Ge-Peng Ji et al. [[Paper](https://arxiv.org/abs/2210.15110)] [[Code](https://github.com/GewelsJI/MVLT)]
* (arXiv 2022) [💬Multimodal Modeling] **MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning**, Zijia Zhao et al. [[Paper](https://arxiv.org/abs/2210.04183)]
* (arXiv 2022) [💬Navigation] **Iterative Vision-and-Language Navigation**, Jacob Krantz et al. [[Paper](https://arxiv.org/abs/2210.03087)]
* (arXiv 2022) [💬Video Chapter Generation] **Multi-modal Video Chapter Generation**, Xiao Cao et al. [[Paper](https://arxiv.org/abs/2209.12694)]
* (arXiv 2022) [💬Visual Question Answering (VQA)] **TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation**, Jun Wang et al. [[Paper](https://arxiv.org/abs/2208.01813)] [[Code](https://github.com/HenryJunW/TAG)]
* (AI Ethics and Society 2022) [💬Multi-modal & Bias] **American == White in Multimodal Language-and-Image AI**, Robert Wolfe et al. [[Paper](https://arxiv.org/abs/2207.00691)]
* (Interspeech 2022) [💬Audio-Visual Speech Separation] **Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation**, Xiaoyu Wang et al. [[Paper](https://arxiv.org/abs/2207.01197)]
* (arXiv preprint 2022) [💬Multi-modal for Recommendation] **Personalized Showcases: Generating Multi-Modal Explanations for Recommendations**, An Yan et al. [[Paper](https://arxiv.org/abs/2207.00422)]
* (CVPR 2022) [💬Video Synthesis] **Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning**, Ligong Han et al. [[Paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Han_Show_Me_What_and_Tell_Me_How_Video_Synthesis_via_CVPR_2022_paper.pdf)] [[Code](https://github.com/snap-research/MMVID)] [[Project](https://snap-research.github.io/MMVID/)]
* (NAACL 2022) [💬Dialogue State Tracking] **Multimodal Dialogue State Tracking**, Hung Le et al. [[Paper](https://arxiv.org/abs/2206.07898)]
* (arXiv preprint 2022) [💬Multi-modal Multi-task] **MultiMAE: Multi-modal Multi-task Masked Autoencoders**, Roman Bachmann et al. [[Paper](https://arxiv.org/abs/2204.01678)] [[Code](https://github.com/EPFL-VILAB/MultiMAE)] [[Project](https://multimae.epfl.ch/)]
* (CVPR 2022) [💬Text-Video Retrieval] **X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval**, Satya Krishna Gorti et al. [[Paper](https://arxiv.org/abs/2203.15086)] [[Code](https://github.com/layer6ai-labs/xpool)] [[Project](https://layer6ai-labs.github.io/xpool/)]
* (NAACL 2022) [💬Visual Commonsense] **Visual Commonsense in Pretrained Unimodal and Multimodal Models**, Chenyu Zhang et al. [[Paper](https://arxiv.org/abs/2205.01850)] [[Code](https://github.com/ChenyuHeidiZhang/VL-commonsense)]
* (arXiv preprint 2022) [💬Pretraining framework] **i-Code: An Integrative and Composable Multimodal Learning Framework**, Ziyi Yang et al. [[Paper](https://arxiv.org/abs/2205.01818)]
* (CVPR 2022) [💬Food Retrieval] **Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval**, Mustafa Shukor et al. [[Paper](https://arxiv.org/abs/2204.09730)] [[Code](https://github.com/mshukor/TFood)]
* (arXiv preprint 2022) [💬Image+Videos+3D Data Recognition] **Omnivore: A Single Model for Many Visual Modalities**, Rohit Girdhar et al. [[Paper](https://arxiv.org/abs/2201.08377)] [[Code](https://github.com/facebookresearch/omnivore)] [[Project](https://facebookresearch.github.io/omnivore/)]
* (arXiv preprint 2022) [💬Hyper-text Language-image Model] **CM3: A Causal Masked Multimodal Model of the Internet**, Armen Aghajanyan et al. [[Paper](https://arxiv.org/abs/2201.07520)]

* **2021**
* (arXiv preprint 2021) [💬Visual Synthesis] **NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion**, Chenfei Wu et al. [[Paper](https://arxiv.org/abs/2111.12417)] [[Code](https://github.com/microsoft/NUWA)]
![Figure from paper](pic/NUWA.gif)
> *(From: https://github.com/microsoft/NUWA [2021/11/30])*
* (ICCV 2021) [💬Video-Text Alignment] **TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment**, Jianwei Yang et al. [[Paper](https://arxiv.org/abs/2108.09980)]
* (arXiv preprint 2021) [💬Class-agnostic Object Detection] **Multi-modal Transformers Excel at Class-agnostic Object Detection**, Muhammad Maaz et al. [[Paper](https://arxiv.org/abs/2111.11430v1)] [[Code](https://github.com/mmaaz60/mvits_for_class_agnostic_od)]
* (ACMMM 2021) [💬Video-Text Retrieval] **HANet: Hierarchical Alignment Networks for Video-Text Retrieval**, Peng Wu et al. [[Paper](https://dl.acm.org/doi/abs/10.1145/3474085.3475515)] [[Code](https://github.com/Roc-Ng/HANet)]
* (ICCV 2021) [💬Video Recognition] **AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition**, Rameswar Panda et al. [[Paper](https://rpand002.github.io/data/ICCV_2021_adamml.pdf)] [[Project](https://rpand002.github.io/adamml.html)] [[Code](https://github.com/IBM/AdaMML)]
* (ICCV 2021) [💬Video Representation] **CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations**, Mohammadreza Zolfaghari et al. [[Paper](https://arxiv.org/abs/2109.14910)]
* (ICCV 2021 **Oral**) [💬Text-guided Image Manipulation] **StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery**, Or Patashnik et al. [[Paper](https://arxiv.org/abs/2103.17249)] [[Code](https://github.com/orpatashnik/StyleCLIP)] [[Play](https://replicate.ai/orpatashnik/styleclip)]
* (ICCV 2021) [💬Facial Editing] **Talk-to-Edit: Fine-Grained Facial Editing via Dialog**, Yuming Jiang et al. [[Paper](https://arxiv.org/abs/2109.04425)] [[Code](https://github.com/yumingj/Talk-to-Edit)] [[Project](https://www.mmlab-ntu.com/project/talkedit/)] [[Dataset Project](https://mmlab.ie.cuhk.edu.hk/projects/CelebA/CelebA_Dialog.html)] [[Dataset(CelebA-Dialog Dataset)](https://drive.google.com/drive/folders/18nejI_hrwNzWyoF6SW8bL27EYnM4STAs)]
* (arXiv preprint 2021) [💬Video Action Recognition] **ActionCLIP: A New Paradigm for Video Action Recognition**, Mengmeng Wang et al. [[Paper](https://arxiv.org/abs/2109.08472)]

* **2020**
* (EMNLP 2020) [💬Video+Language Pre-training] **HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training**, Linjie Li et al. [[Paper](https://arxiv.org/abs/2005.00200)] [[Code](https://github.com/linjieli222/HERO)]

## *4. Courses*

* [CMU MultiModal Machine Learning](https://cmu-multicomp-lab.github.io/mmml-course/fall2020/)

## *Contact Me*

* [Yutong ZHOU](https://github.com/Yutong-Zhou-cv) in [Interaction Laboratory, Ritsumeikan University.](https://github.com/Rits-Interaction-Laboratory) ଘ(੭*ˊᵕˋ)੭

* If you have any questions, please feel free to contact Yutong ZHOU (E-mail: ).