https://github.com/nyandwi/multimodal-learning-research
A curated list of resources on what's happening in multimodal learning. Features recent papers, books, related lectures, and other relevant resources.
- Host: GitHub
- URL: https://github.com/nyandwi/multimodal-learning-research
- Owner: Nyandwi
- Created: 2023-03-12T08:31:22.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-04-28T08:01:59.000Z (over 2 years ago)
- Last Synced: 2025-01-10T05:30:52.885Z (9 months ago)
- Size: 6.84 KB
- Stars: 14
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Multimodal Learning Research
Multimodal Learning (MML) is a branch of machine learning concerned with designing models that can learn from multiple modalities such as vision, language, and robotic actions. MML is an active area of AI research. AI systems that learn from a single modality have advanced rapidly in recent times: we have seen language models that can understand text, image models that can recognize images, and so on. Although those systems are not perfect yet, they generalize reasonably well on the modalities they were trained on. A key challenge now is how to design AI systems that can jointly learn and generalize across multiple modalities at scale: systems that can understand text, images, speech, robotic actions, and more.
This repository tracks progress in multimodal learning. It features lecture videos, papers, books, and blog posts. Contributions are welcome!
**What's in here:**
* [Courses & Lecture Videos](#courses--videos)
* [Relevant Workshops](#relevant-workshops)
* [Survey Papers](#survey-papers)
* [Books](#books)
* [Papers](#papers-by-categories)
* [Blog posts](#blog-posts)
## Courses & Videos
* Multimodal Machine Learning, Carnegie Mellon University: [Lecture videos](https://www.youtube.com/playlist?list=PL-Fhd_vrvisNM7pbbevXKAbT_Xmub37fA) | [webpage](https://cmu-multicomp-lab.github.io/mmml-course/fall2022/schedule/) | [whitepaper](https://aclanthology.org/P17-5002.pdf)
* Multi-Modal Imaging with Deep Learning and Modeling, Institute for Pure & Applied Mathematics (IPAM): [Lecture videos](https://www.youtube.com/playlist?list=PLHyI3Fbmv0Sdctgfh7uLkabghB2H2yw3b)
* Topics in AI - Multimodal Learning with Vision, Language and Sound - University of British Columbia: [Course webpage and readings](https://www.cs.ubc.ca/~lsigal/teaching22_Term1.html)
* Advanced Topics in MultiModal Machine Learning, Carnegie Mellon University: [webpage](https://cmu-multicomp-lab.github.io/adv-mmml-course/spring2023/schedule/)
* Deep Learning for Multi-Modal Systems | Data Science Summer School 2022: [Lecture video](https://www.youtube.com/watch?v=hLxMf7EdyQs&t=1s) | [Webpage](https://ds3.ai/2022/deep-learning.html)
* Topics in Computer Vision (CSC2539) - Visual Recognition with Text, University of Toronto: [webpage](http://www.cs.utoronto.ca/~fidler/teaching/2017/CSC2539.html)
***********************
## Relevant Workshops
* Multimodal Machine Learning, CVPR 2022 Tutorial: [Videos](https://www.youtube.com/playlist?list=PLki3HkfgNEsKPcpj5Vv2P98SRAT9wxIDa)
* Recent Advances in Vision and Language Pre-training: [Slides and videos](https://vlp-tutorial.github.io)
***********************
## Survey Papers
* Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions: [ArXiv](https://arxiv.org/abs/2209.03430) | 2022
* Vision-Language Pre-training: Basics, Recent Advances, and Future Trends: [ArXiv](https://arxiv.org/abs/2210.09263) | 2022
* VLP: A Survey on Vision-Language Pre-training: [ArXiv](https://arxiv.org/abs/2202.09061) | 2022
* Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods: [ArXiv](https://arxiv.org/pdf/1907.09358.pdf) | 2021
* Multimodal Machine Learning: A Survey and Taxonomy: [ArXiv](https://arxiv.org/abs/1705.09406) | 2017
************
## Books
* Multimodal Deep Learning: [Web](https://slds-lmu.github.io/seminar_multimodal_dl/index.html) | [ArXiv](https://arxiv.org/abs/2301.04856) | 2023
***********************
## Papers by Categories
This section features papers on general multimodal representation learning as well as task-specific papers.
### General MML Representation Learning
* Learning Transferable Visual Models From Natural Language Supervision (CLIP): [ArXiv](https://arxiv.org/abs/2103.00020) | [Code](https://github.com/OpenAI/CLIP) | [Blog](https://openai.com/research/clip) | [Colab](https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb) | [CLIP on HF](https://huggingface.co/docs/transformers/model_doc/clip) | Feb 2021 (see the usage sketch after this list)
* LLaVA - Visual Instruction Tuning: [ArXiv](https://arxiv.org/pdf/2304.08485.pdf) | [Page](https://llava-vl.github.io) | [Code](https://github.com/haotian-liu/LLaVA) | April 2023
* EVA-CLIP - Improved Training Techniques for CLIP at Scale: [ArXiv](https://arxiv.org/abs/2303.15389) | [Code](https://github.com/baaivision/EVA/tree/master/EVA-CLIP) | March 2023
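CLIP (first entry above) is also available through the Hugging Face `transformers` library, linked as "CLIP on HF". Below is a minimal zero-shot classification sketch; the checkpoint name, image path, and label prompts are illustrative placeholders, not part of the original papers.

```python
# Minimal zero-shot image classification sketch with the Hugging Face CLIP
# implementation. Checkpoint, image path, and labels are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled image-text similarity scores (the quantity
# CLIP's contrastive objective is trained on); softmax turns them into
# per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```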
### Task Specific
#### Text-Image Generation
* Video LDM - Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models: [ArXiv](https://arxiv.org/abs/2304.08818) | [Page](https://research.nvidia.com/labs/toronto-ai/VideoLDM/) | April 2023 (see the latent diffusion sketch below)
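The Video LDM entry above builds on latent diffusion. As a hands-on starting point for this category, here is a minimal text-to-image sketch using the `diffusers` library; the `CompVis/stable-diffusion-v1-4` checkpoint and the prompt are assumptions for illustration only, not the model from the paper above.

```python
# Minimal text-to-image sketch with a latent diffusion model via `diffusers`.
# Uses a public Stable Diffusion checkpoint as a stand-in example.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # a GPU is assumed; drop this (and float16) for CPU

prompt = "a photograph of an astronaut riding a horse"  # placeholder prompt
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("astronaut.png")
```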
#### Text-Image Retrieval
#### Image Captioning
#### Visual Question Answering
### Video Learning
### Robotic Learning
* A Picture is Worth a Thousand Words: Language Models Plan from Pixels: [ArXiv](https://arxiv.org/pdf/2303.09031.pdf) | 2023
* PaLM-E: An Embodied Multimodal Language Model: [ArXiv](https://arxiv.org/pdf/2303.03378.pdf) | [Page](https://palm-e.github.io) | [Blog](https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal-language.html) | 2023
### Applications Connecting Multimodal Models
* HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace - [ArXiv](https://arxiv.org/abs/2303.17580) | [Code](https://github.com/microsoft/JARVIS) | March 2023
* Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models - [ArXiv](https://arxiv.org/abs/2303.04671) | [Code](https://github.com/microsoft/visual-chatgpt) | [Colab](https://colab.research.google.com/drive/11BtP3h-w0dZjA-X8JsS9_eo8OeGYvxXB#scrollTo=8nCGkaV0_xBP) | [Spaces](https://huggingface.co/spaces/microsoft/visual_chatgpt)
* ViperGPT: Visual Inference via Python Execution for Reasoning - [Paper](https://arxiv.org/abs/2303.08128) | [Code](https://github.com/cvlab-columbia/viper) | [Page](https://viper.cs.columbia.edu) | March 2023
***********************
## Blog Posts
* Generalized Visual Language Models by Lilian Weng: [blog](https://lilianweng.github.io/posts/2022-06-09-vlm/) | 2022
******************
## Related Repositories
* [Awesome Vision and Language Pretraining](https://github.com/phellonchen/awesome-Vision-and-Language-Pre-training#visual-machine-reading-comprehension)