Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
[TPAMI 2023] Multimodal Image Synthesis and Editing: The Generative AI Era
https://github.com/fnzhan/Generative-AI
- Host: GitHub
- URL: https://github.com/fnzhan/Generative-AI
- Owner: fnzhan
- Created: 2021-12-04T07:28:21.000Z
- Default Branch: main
- Last Pushed: 2023-11-21T09:02:58.000Z
- Last Synced: 2024-07-29T05:34:21.201Z
- Topics: aigc, diffusion-model, gans, multimodality, nerfs
- Language: TeX
- Size: 121 MB
- Stars: 774
- Watchers: 45
- Forks: 61
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
[![arXiv](https://img.shields.io/badge/arXiv-2112.13592-b31b1b.svg)](https://arxiv.org/abs/2112.13592)
[![Survey](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity)
[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com)
[![GitHub license](https://badgen.net/github/license/Naereen/Strapdown.js)](https://github.com/Naereen/StrapDown.js/blob/master/LICENSE)

This project is associated with our survey paper, which comprehensively contextualizes the advances of Multimodal Image Synthesis & Editing (MISE) and visual AIGC by formulating taxonomies according to data modality and model architectures.

**Multimodal Image Synthesis and Editing: The Generative AI Era [[Paper](https://arxiv.org/abs/2112.13592)] [[Project](https://fnzhan.com/Generative-AI/)]**
[Fangneng Zhan](https://fnzhan.com/), [Yingchen Yu](https://yingchen001.github.io/), [Rongliang Wu](https://scholar.google.com.sg/citations?user=SZkh3iAAAAAJ&hl=en), [Jiahui Zhang](https://scholar.google.com/citations?user=DXpYbWkAAAAJ&hl=zh-CN), [Shijian Lu](https://scholar.google.com.sg/citations?user=uYmK-A0AAAAJ&hl=en), [Lingjie Liu](https://lingjie0206.github.io/), [Adam Kortylewski](https://generativevision.mpi-inf.mpg.de/),
[Christian Theobalt](https://people.mpi-inf.mpg.de/~theobalt/), [Eric Xing](http://www.cs.cmu.edu/~epxing/)
*IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023*
[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com)
You are welcome to add relevant papers via pull request.
The process to submit a pull request:
- a. Fork the project into your own repository.
- b. Add the title, authors, conference, paper link, project link, and code link to `README.md` in the format below (a filled-in example follows these steps):
```
**Title**
*Author*
Conference
[[Paper](Paper link)]
[[Project](Project link)]
[[Code](Code link)]
```
- c. Submit the pull request against the `main` branch.
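
For reference, here is what a complete entry looks like, filled in with a paper that already appears in this list:

```
**DreamFusion: Text-to-3D using 2D Diffusion**
*Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall*
arxiv 2022
[[Paper](https://arxiv.org/abs/2209.14988)]
[[Project](https://dreamfusion3d.github.io/)]
```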
## Related Surveys & Projects
**Adversarial Text-to-Image Synthesis: A Review**
*Stanislav Frolov, Tobias Hinz, Federico Raue, Jörn Hees, Andreas Dengel*
Neural Networks 2021
[[Paper](https://arxiv.org/abs/2101.09983)]

**GAN Inversion: A Survey**
*Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, Ming-Hsuan Yang*
TPAMI 2022
[[Paper](https://arxiv.org/abs/2101.05278)]
[[Project](https://github.com/weihaox/awesome-gan-inversion)]

**Deep Image Synthesis from Intuitive User Input: A Review and Perspectives**
*Yuan Xue, Yuan-Chen Guo, Han Zhang, Tao Xu, Song-Hai Zhang, Xiaolei Huang*
Computational Visual Media 2022
[[Paper](https://arxiv.org/abs/2107.04240)]

[Awesome-Text-to-Image](https://github.com/Yutong-Zhou-cv/awesome-Text-to-Image)
## Table of Contents (Work in Progress)
**Methods:**
- [Neural Rendering Methods](#neural-rendering-methods)
- [Diffusion-based Methods](#diffusion-based-methods)
- [Autoregressive Methods](#autoregressive-methods)
- [Image Quantizer](#image-quantizer)
- [GAN-based Methods](#gan-based-methods)
- [GAN Inversion](#gan-inversion-methods)
- [Other Methods](#other-methods)

**Modalities & Datasets:**
- [Text Encoding](#text-encoding)
- [Audio Encoding](#audio-encoding)
- [Datasets](#datasets)

## Neural-Rendering-Methods
**ATT3D: Amortized Text-to-3D Object Synthesis**
*Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, James Lucas*
arxiv 2023
[[Paper](https://arxiv.org/abs/2306.07349)]

**TADA! Text to Animatable Digital Avatars**
*Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxiang Tang, Yangyi Huang, Justus Thies, Michael J. Black*
arxiv 2023
[[Paper](https://arxiv.org/abs/2308.10899)]

**MATLABER: Material-Aware Text-to-3D via LAtent BRDF auto-EncodeR**
*Xudong Xu, Zhaoyang Lyu, Xingang Pan, Bo Dai*
arxiv 2023
[[Paper](https://arxiv.org/abs/2308.09278)]

**IT3D: Improved Text-to-3D Generation with Explicit View Synthesis**
*Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin*
arxiv 2023
[[Paper](https://arxiv.org/abs/2308.11473)]

**AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose**
*Huichao Zhang, Bowen Chen, Hao Yang, Liao Qu, Xu Wang, Li Chen, Chao Long, Feida Zhu, Kang Du, Min Zheng*
arxiv 2023
[[Paper](https://arxiv.org/abs/2308.03610)]
[[Project](https://avatarverse3d.github.io/)]

**Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions**
*Ayaan Haque, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, Angjoo Kanazawa*
ICCV 2023
[[Paper](https://arxiv.org/abs/2303.12789)]
[[Project](https://instruct-nerf2nerf.github.io)]
[[Code](https://github.com/ayaanzhaque/instruct-nerf2nerf)]

**FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields**
*Sungwon Hwang, Junha Hyung, Daejin Kim, Min-Jung Kim, Jaegul Choo*
ICCV 2023
[[Paper](https://arxiv.org/abs/2307.11418v3)]

**Local 3D Editing via 3D Distillation of CLIP Knowledge**
*Junha Hyung, Sungwon Hwang, Daejin Kim, Hyunji Lee, Jaegul Choo*
CVPR 2023
[[Paper](https://arxiv.org/abs/2306.12570)]

**RePaint-NeRF: NeRF Editting via Semantic Masks and Diffusion Models**
*Xingchen Zhou, Ying He, F. Richard Yu, Jianqiang Li, You Li*
IJCAI 2023
[[Paper](https://arxiv.org/abs/2306.05668)]

**DreamTime: An Improved Optimization Strategy for Text-to-3D Content Creation**
*Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, Lei Zhang*
arxiv 2023
[[Paper](https://arxiv.org/abs/2306.12422)]
[[Project](https://itsallagi.com/dreamtime-a-new-way-to-create-3d-content-from-text/)]

**AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars**
*Mohit Mendiratta, Xingang Pan, Mohamed Elgharib, Kartik Teotia, Mallikarjun B R, Ayush Tewari, Vladislav Golyanik, Adam Kortylewski, Christian Theobalt*
arxiv 2023
[[Paper](https://arxiv.org/abs/2306.00547)]
[[Project](https://vcai.mpi-inf.mpg.de/projects/AvatarStudio/)]

**Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields**
*Ori Gordon, Omri Avrahami, Dani Lischinski*
arxiv 2023
[[Paper](https://arxiv.org/abs/2306.12760)]
[[Project](https://www.vision.huji.ac.il/blended-nerf/)]

**OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields**
*Youtan Yin, Zhoujie Fu, Fan Yang, Guosheng Lin*
arxiv 2023
[[Paper](https://arxiv.org/abs/2305.10503)]
[[Project](https://ornerf.github.io/)]
[[Code](https://github.com/cuteyyt/or-nerf)]

**HiFA: High-fidelity Text-to-3D with Advanced Diffusion Guidance**
*Junzhe Zhu, Peiye Zhuang*
arxiv 2023
[[Paper](https://arxiv.org/abs/2305.18766)]
[[Project](https://hifa-team.github.io/HiFA-site/)]

**ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation**
*Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu*
arxiv 2023
[[Paper](https://arxiv.org/abs/2305.16213)]
[[Project](https://ml.cs.tsinghua.edu.cn/prolificdreamer/)]

**Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields**
*Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, Jing Liao*
arxiv 2023
[[Paper](https://arxiv.org/abs/2305.11588)]
[[Project](https://eckertzhang.github.io/Text2NeRF.github.io/)]

**DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models**
*Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong*
arxiv 2023
[[Paper](https://arxiv.org/abs/2304.00916)]
[[Project](https://yukangcao.github.io/DreamAvatar/)]

**DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model**
*Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun*
arxiv 2023
[[Paper](https://arxiv.org/abs/2304.02827)]
[[Project](https://janeyeon.github.io/ditto-nerf/)]
[[Code](https://github.com/janeyeon/ditto-nerf-code)]

**CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout**
*Yiqi Lin, Haotian Bai, Sijia Li, Haonan Lu, Xiaodong Lin, Hui Xiong, Lin Wang*
arxiv 2023
[[Paper](https://arxiv.org/abs/2303.13843)]

**Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes**
*Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes, Daniel Cohen-Or*
arxiv 2023
[[Paper](https://arxiv.org/abs/2303.13450)]
[[Project](https://danacohen95.github.io/Set-the-Scene/)]
[[Code](https://github.com/DanaCohen95/Set-the-Scene)]

**Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation**
*Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, Seungryong Kim*
arxiv 2023
[[Paper](https://arxiv.org/abs/2303.07937)]
[[Project](https://ku-cvlab.github.io/3DFuse/)]
[[Code](https://github.com/KU-CVLAB/3DFuse)]

**Text-To-4D Dynamic Scene Generation**
*Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman*
arxiv 2023
[[Paper](https://arxiv.org/abs/2301.11280)]
[[Project](https://make-a-video3d.github.io/)]

**Magic3D: High-Resolution Text-to-3D Content Creation**
*Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, Tsung-Yi Lin*
CVPR 2023
[[Paper](https://arxiv.org/abs/2211.10440)]
[[Project](https://deepimagination.cc/Magic3D/)]

**DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model**
*Gwanghyun Kim, Se Young Chun*
CVPR 2023
[[Paper](https://arxiv.org/abs/2211.16374)]
[[Code](https://github.com/gwang-kim/DATID-3D)]
[[Project](https://datid-3d.github.io/)]

**Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models**
*Gang Li, Heliang Zheng, Chaoyue Wang, Chang Li, Changwen Zheng, Dacheng Tao*
arxiv 2022
[[Paper](https://arxiv.org/abs/2211.14108)]
[[Project](https://3ddesigner-diffusion.github.io/)]

**DreamFusion: Text-to-3D using 2D Diffusion**
*Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall*
arxiv 2022
[[Paper](https://arxiv.org/abs/2209.14988)]
[[Project](https://dreamfusion3d.github.io/)]

**Zero-Shot Text-Guided Object Generation with Dream Fields**
*Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, Ben Poole*
CVPR 2022
[[Paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Jain_Zero-Shot_Text-Guided_Object_Generation_With_Dream_Fields_CVPR_2022_paper.pdf)]
[[Code](https://github.com/google-research/google-research/tree/master/dreamfields)]
[[Project](https://ajayj.com/dreamfields)]

**IDE-3D: Interactive Disentangled Editing for High-Resolution 3D-aware Portrait Synthesis**
*Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, Yebin Liu*
SIGGRAPH Asia 2022
[[Paper](https://arxiv.org/pdf/2205.15517.pdf)]
[[Code](https://github.com/MrTornado24/IDE-3D)]
[[Project](https://mrtornado24.github.io/IDE-3D/)]

**Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields**
*Yuedong Chen, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai*
arxiv 2022
[[Paper](https://arxiv.org/abs/2203.10821)]
[[Code](https://github.com/donydchen/sem2nerf)]
[[Project](https://donydchen.github.io/sem2nerf/)]

**CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields**
*Can Wang, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao*
CVPR 2022
[[Paper](https://arxiv.org/abs/2112.05139)]
[[Code](https://github.com/cassiePython/CLIPNeRF)]
[[Project](https://cassiepython.github.io/clipnerf/)]

**CG-NeRF: Conditional Generative Neural Radiance Fields**
*Kyungmin Jo, Gyumin Shim, Sanghun Jung, Soyoung Yang, Jaegul Choo*
arxiv 2021
[[Paper](https://arxiv.org/abs/2112.03517)]

**AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis**
*Yudong Guo, Keyu Chen, Sen Liang, Yong-Jin Liu, Hujun Bao, Juyong Zhang*
ICCV 2021
[[Paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Guo_AD-NeRF_Audio_Driven_Neural_Radiance_Fields_for_Talking_Head_Synthesis_ICCV_2021_paper.pdf)]
[[Code](https://github.com/YudongGuo/AD-NeRF)]
[[Project](https://yudongguo.github.io/ADNeRF/)]
[[Video](https://www.youtube.com/watch?v=TQO2EBYXLyU)]
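
All of the NeRF-based methods above share the same volume rendering core: densities and colors sampled along a ray are alpha-composited into a pixel. A minimal NumPy sketch of that compositing step (illustrative only; function and variable names are ours, not from any listed repository):

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Alpha-composite densities and colors sampled along one ray.

    sigmas: (N,) volume densities at N sample points
    colors: (N, 3) RGB colors at those points
    deltas: (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # transmittance T_i
    weights = trans * alphas                                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                   # rendered pixel color

# toy usage: three samples along a ray, the middle one dense and green
rgb = composite_ray(np.array([0.1, 5.0, 0.2]),
                    np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float),
                    np.array([0.5, 0.5, 0.5]))
```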
## Diffusion-based-Methods
**BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing**
*Dongxu Li, Junnan Li, Steven C.H. Hoi*
Arxiv 2023
[[Paper](https://arxiv.org/pdf/2305.14720.pdf)]
[[Project](https://dxli94.github.io/BLIP-Diffusion-website/)]
[[Code](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion)]

**InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions**
*Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka*
Arxiv 2023
[[Paper](https://arxiv.org/pdf/2305.18047.pdf)]
[[Project](https://qianwangx.github.io/InstructEdit/)]
[[Code](https://github.com/qianwangx/instructedit)]

**DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation**
*Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman*
CVPR 2023
[[Paper](https://arxiv.org/pdf/2208.12242.pdf)]
[[Project](https://dreambooth.github.io/)]
[[Code](https://github.com/google/dreambooth)]

**Multi-Concept Customization of Text-to-Image Diffusion**
*Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, Jun-Yan Zhu*
CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Kumari_Multi-Concept_Customization_of_Text-to-Image_Diffusion_CVPR_2023_paper.pdf)]
[[Project](https://www.cs.cmu.edu/~custom-diffusion/)]
[[Code](https://github.com/adobe-research/custom-diffusion)]

**Collaborative Diffusion for Multi-Modal Face Generation and Editing**
*Ziqi Huang, Kelvin C.K. Chan, Yuming Jiang, Ziwei Liu*
CVPR 2023
[[Paper](https://arxiv.org/pdf/2304.10530v1.pdf)]
[[Project](https://ziqihuangg.github.io/projects/collaborative-diffusion.html)]
[[Code](https://github.com/ziqihuangg/Collaborative-Diffusion)]

**Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation**
*Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel*
CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Tumanyan_Plug-and-Play_Diffusion_Features_for_Text-Driven_Image-to-Image_Translation_CVPR_2023_paper.pdf)]
[[Project](https://pnp-diffusion.github.io/)]
[[Code](https://github.com/MichalGeyer/plug-and-play)]

**SINE: SINgle Image Editing with Text-to-Image Diffusion Models**
*Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris Metaxas, Jian Ren*
CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Zhang_SINE_SINgle_Image_Editing_With_Text-to-Image_Diffusion_Models_CVPR_2023_paper.pdf)]
[[Project](https://zhang-zx.github.io/SINE/)]
[[Code](https://github.com/zhang-zx/SINE)]

**NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models**
*Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or*
CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Mokady_NULL-Text_Inversion_for_Editing_Real_Images_Using_Guided_Diffusion_Models_CVPR_2023_paper.pdf)]
[[Project](https://null-text-inversion.github.io/)]
[[Code](https://github.com/google/prompt-to-prompt/#null-text-inversion-for-editing-real-images)]

**Paint by Example: Exemplar-Based Image Editing With Diffusion Models**
*Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen*
CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Yang_Paint_by_Example_Exemplar-Based_Image_Editing_With_Diffusion_Models_CVPR_2023_paper.pdf)]
[[Demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example)]
[[Code](https://github.com/Fantasy-Studio/Paint-by-Example)]

**SpaText: Spatio-Textual Representation for Controllable Image Generation**
*Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, Xi Yin*
CVPR 2023
[[Paper](https://arxiv.org/pdf/2211.14305.pdf)]
[[Project](https://omriavrahami.com/spatext/)]

**Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models**
*Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis*
CVPR 2023
[[Paper](https://arxiv.org/pdf/2304.08818.pdf)]
[[Project](https://research.nvidia.com/labs/toronto-ai/VideoLDM/)]

**InstructPix2Pix: Learning to Follow Image Editing Instructions**
*Tim Brooks, Aleksander Holynski, Alexei A. Efros*
CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Brooks_InstructPix2Pix_Learning_To_Follow_Image_Editing_Instructions_CVPR_2023_paper.pdf)]
[[Project](https://www.timothybrooks.com/instruct-pix2pix/)]
[[Code](https://github.com/timothybrooks/instruct-pix2pix)]

**Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models**
*Nithin Gopalakrishnan Nair, Chaminda Bandara, Vishal M Patel*
CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Nair_Unite_and_Conquer_Plug__Play_Multi-Modal_Synthesis_Using_Diffusion_CVPR_2023_paper.pdf)]
[[Project](https://nithin-gk.github.io/projectpages/Multidiff/index.html)]
[[Code](https://github.com/Nithin-GK/UniteandConquer)]

**DiffEdit: Diffusion-based semantic image editing with mask guidance**
*Guillaume Couairon, Jakob Verbeek, Holger Schwenk, Matthieu Cord*
ICLR 2023
[[Paper](https://arxiv.org/pdf/2210.11427.pdf)]

**eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers**
*Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu*
Arxiv 2022
[[Paper](https://arxiv.org/pdf/2211.01324.pdf)]
[[Project](https://research.nvidia.com/labs/dir/eDiff-I/)]

**Prompt-to-Prompt Image Editing with Cross-Attention Control**
*Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or*
Arxiv 2022
[[Paper](https://prompt-to-prompt.github.io/ptp_files/Prompt-to-Prompt_preprint.pdf)]
[[Project](https://prompt-to-prompt.github.io/)]
[[Code](https://github.com/google/prompt-to-prompt)]

**An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion**
*Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or*
Arxiv 2022
[[Paper](https://arxiv.org/pdf/2208.01618.pdf)]
[[Project](https://textual-inversion.github.io/)]
[[Code](https://github.com/rinongal/textual_inversion)]

**Text2Human: Text-Driven Controllable Human Image Generation**
*Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, Ziwei Liu*
SIGGRAPH 2022
[[Paper](https://arxiv.org/pdf/2205.15996.pdf)]
[[Project](https://yumingj.github.io/projects/Text2Human.html)]
[[Code](https://github.com/yumingj/Text2Human)]

**[DALL-E 2] Hierarchical Text-Conditional Image Generation with CLIP Latents**
*Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen*
[[Paper](https://cdn.openai.com/papers/dall-e-2.pdf)]
[[Code](https://github.com/lucidrains/DALLE2-pytorch)]

**High-Resolution Image Synthesis with Latent Diffusion Models**
*Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer*
CVPR 2022
[[Paper](https://arxiv.org/abs/2112.10752)]
[[Code](https://github.com/CompVis/latent-diffusion)]

**v objective diffusion**
*Katherine Crowson*
[[Code](https://github.com/crowsonkb/v-diffusion-pytorch)]

**GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models**
*Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen*
arxiv 2021
[[Paper](https://arxiv.org/abs/2112.10741)]
[[Code](https://github.com/openai/glide-text2im)]

**Vector Quantized Diffusion Model for Text-to-Image Synthesis**
*Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo*
arxiv 2021
[[Paper](https://arxiv.org/abs/2111.14822)]
[[Code](https://github.com/microsoft/VQ-Diffusion)]

**DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation**
*Gwanghyun Kim, Jong Chul Ye*
arxiv 2021
[[Paper](https://arxiv.org/abs/2110.02711)]

**Blended Diffusion for Text-driven Editing of Natural Images**
*Omri Avrahami, Dani Lischinski, Ohad Fried*
CVPR 2022
[[Paper](https://arxiv.org/abs/2111.14818)]
[[Project](https://omriavrahami.com/blended-diffusion-page/)]
[[Code](https://github.com/omriav/blended-diffusion)]
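
Most text-conditioned diffusion models in this section sample with classifier-free guidance, extrapolating from an unconditional prediction toward a text-conditional one. A minimal sketch of that sampling-time combination (the denoiser here is a toy stand-in, not any listed model's API):

```python
import torch

def guided_noise(model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: push the denoiser's prediction away from the
    unconditional output and toward the text-conditional one."""
    eps_uncond = model(x_t, t, null_emb)  # prediction for an empty prompt
    eps_cond = model(x_t, t, text_emb)    # prediction for the actual prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# toy stand-in denoiser so the sketch runs end to end
fake_denoiser = lambda x, t, c: x * 0.1 + c.mean()
eps = guided_noise(fake_denoiser, torch.randn(1, 3, 8, 8), 10,
                   torch.randn(1, 16), torch.zeros(1, 16))
```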
## Autoregressive-Methods
**MaskGIT: Masked Generative Image Transformer**
*Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman*
arxiv 2022
[[Paper](https://arxiv.org/abs/2202.04200)]

**ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation**
*Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang*
arxiv 2021
[[Paper](https://arxiv.org/abs/2112.15283)]
[[Project](https://wenxin.baidu.com/wenxin/ernie-vilg)]

**NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion**
*Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan*
arxiv 2021
[[Paper](https://arxiv.org/abs/2111.12417)]
[[Code](https://github.com/microsoft/NUWA)]
[[Video](https://youtu.be/C9CTnZJ9ZE0)]

**L-Verse: Bidirectional Generation Between Image and Text**
*Taehoon Kim, Gwangmo Song, Sihaeng Lee, Sangyun Kim, Yewon Seo, Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae*
arxiv 2021
[[Paper](https://arxiv.org/abs/2111.11133)]
[[Code](https://github.com/tgisaturday/L-Verse)]

**M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis**
*Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie Tang, Jingren Zhou, Hongxia Yang*
NeurIPS 2021
[[Paper](https://arxiv.org/abs/2105.14211v3)]

**ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis**
*Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer*
NeurIPS 2021
[[Paper](https://openreview.net/pdf?id=-1AAgrS5FF)]
[[Code](https://github.com/CompVis/imagebart)]
[[Project](https://compvis.github.io/imagebart/)]

**A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation**
*Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu*
ACM MM 2021
[[Paper](https://arxiv.org/abs/2110.09756)]
[[Code](https://github.com/researchmm/generate-it)]

**Unifying Multimodal Transformer for Bi-directional Image and Text Generation**
*Yupan Huang, Hongwei Xue, Bei Liu, Yutong Lu*
ACM MM 2021
[[Paper](https://arxiv.org/abs/2110.09753)]
[[Code](https://github.com/researchmm/generate-it)]

**Taming Transformers for High-Resolution Image Synthesis**
*Patrick Esser, Robin Rombach, Björn Ommer*
CVPR 2021
[[Paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Esser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.pdf)]
[[Code](https://github.com/CompVis/taming-transformers)]
[[Project](https://compvis.github.io/taming-transformers/)]

**RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP**
*Alex Shonenkov and Michael Konstantinov*
arxiv 2022
[[Code](https://github.com/sberbank-ai/ru-dolph)]

**Generate Images from Texts in Russian (ruDALL-E)**
[[Code](https://github.com/sberbank-ai/ru-dalle)]
[[Project](https://rudalle.ru/en/)]

**Zero-Shot Text-to-Image Generation**
*Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever*
arxiv 2021
[[Paper](https://arxiv.org/abs/2102.12092)]
[[Code](https://github.com/openai/DALL-E)]
[[Project](https://openai.com/blog/dall-e/)]

**Compositional Transformers for Scene Generation**
*Drew A. Hudson, C. Lawrence Zitnick*
NeurIPS 2021
[[Paper](https://openreview.net/pdf?id=YQeWoRnwTnE)]
[[Code](https://github.com/dorarad/gansformer)]

**X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers**
*Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi*
EMNLP 2020
[[Paper](https://arxiv.org/abs/2009.11278)]
[[Code](https://github.com/allenai/x-lxmert)]

**One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning**
*Suzhen Wang, Lincheng Li, Yu Ding, Xin Yu*
AAAI 2022
[[Paper](https://arxiv.org/abs/2112.02749)]
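
The autoregressive methods above represent an image as a sequence of discrete codebook tokens and sample them one at a time with a transformer. A minimal sketch of that sampling loop (the logits function is a stub; names are illustrative, not any listed codebase):

```python
import torch

def sample_image_tokens(logits_fn, seq_len=256, temperature=1.0):
    """Sample a grid of image tokens left-to-right, one token at a time."""
    tokens = torch.empty(0, dtype=torch.long)
    for _ in range(seq_len):
        logits = logits_fn(tokens)                       # (vocab,) next-token logits
        probs = torch.softmax(logits / temperature, -1)
        nxt = torch.multinomial(probs, 1)                # sample one codebook index
        tokens = torch.cat([tokens, nxt])
    return tokens  # decode with an image quantizer's decoder (next subsection)

# stub model: uniform logits over a 1024-entry codebook, ignoring context
toy = sample_image_tokens(lambda seq: torch.zeros(1024), seq_len=16)
```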
### Image-Quantizer
**[TE-VQGAN] Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation**
*Woncheol Shin, Gyubok Lee, Jiyoung Lee, Joonseok Lee, Edward Choi*
arxiv 2021
[[Paper](https://arxiv.org/abs/2112.00384)]
[[Code](https://github.com/wcshin-git/TE-VQGAN)]

**[ViT-VQGAN] Vector-quantized Image Modeling with Improved VQGAN**
*Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu*
arxiv 2021
[[Paper](https://arxiv.org/abs/2110.04627)]

**[PeCo] PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers**
*Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu*
arxiv 2021
[[Paper](https://arxiv.org/abs/2111.12710)]

**[VQ-GAN] Taming Transformers for High-Resolution Image Synthesis**
*Patrick Esser, Robin Rombach, Björn Ommer*
CVPR 2021
[[Paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Esser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.pdf)]
[[Code](https://github.com/CompVis/taming-transformers)]

**[Gumbel-VQ] vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations**
*Alexei Baevski, Steffen Schneider, Michael Auli*
ICLR 2020
[[Paper](https://openreview.net/pdf?id=rylwJxrYDS)]
[[Code](https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/README.md)]

**[EM VQ-VAE] Theory and Experiments on Vector Quantized Autoencoders**
*Aurko Roy, Ashish Vaswani, Arvind Neelakantan, Niki Parmar*
arxiv 2018
[[Paper](https://arxiv.org/abs/1805.11063)]
[[Code](https://github.com/jaywalnut310/Vector-Quantized-Autoencoders)]

**[VQ-VAE] Neural Discrete Representation Learning**
*Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu*
NIPS 2017
[[Paper](https://proceedings.neurips.cc/paper/2017/file/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf)]
[[Code](https://github.com/ritheshkumar95/pytorch-vqvae)]

**[VQ-VAE2 or EMA-VQ] Generating Diverse High-Fidelity Images with VQ-VAE-2**
*Ali Razavi, Aaron van den Oord, Oriol Vinyals*
NeurIPS 2019
[[Paper](https://proceedings.neurips.cc/paper/2019/file/5f8e2fa1718d1bbcadf1cd9c7a54fb8c-Paper.pdf)]
[[Code](https://github.com/rosinality/vq-vae-2-pytorch)]

**[Discrete VAE] Discrete Variational Autoencoders**
*Jason Tyler Rolfe*
ICLR 2017
[[Paper](https://arxiv.org/abs/1609.02200)]
[[Code](https://github.com/openai/DALL-E)]

**[DVAE++] DVAE++: Discrete Variational Autoencoders with Overlapping Transformations**
*Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, Evgeny Andriyash*
ICML 2018
[[Paper](https://arxiv.org/abs/1802.04920)]
[[Code](https://github.com/xmax1/dvae)]

**[DVAE#] DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors**
*Arash Vahdat, Evgeny Andriyash, William G. Macready*
NeurIPS 2018
[[Paper](https://arxiv.org/abs/1805.07445)]
[[Code](https://github.com/xmax1/dvae)]
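
The quantizers above all hinge on one operation: snap each encoder feature to its nearest codebook vector and pass gradients straight through. A minimal PyTorch sketch of that step (ours, not taken from any listed implementation):

```python
import torch

def vector_quantize(z_e, codebook):
    """z_e: (B, D) encoder outputs; codebook: (K, D) learned code vectors.
    Returns quantized vectors with a straight-through gradient estimator."""
    dists = torch.cdist(z_e, codebook)   # (B, K) pairwise distances
    idx = dists.argmin(dim=1)            # nearest codebook entry per vector
    z_q = codebook[idx]                  # (B, D) quantized vectors
    # straight-through: forward pass uses z_q, backward copies grads to z_e
    z_q = z_e + (z_q - z_e).detach()
    return z_q, idx

z_q, idx = vector_quantize(torch.randn(8, 64), torch.randn(512, 64))
```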
## GAN-based-Methods
**GauGAN2**
*NVIDIA*
[[Project](http://gaugan.org/gaugan2/)]
[[Video](https://www.youtube.com/watch?v=p9MAvRpT6Cg)]

**Multimodal Conditional Image Synthesis with Product-of-Experts GANs**
*Xun Huang, Arun Mallya, Ting-Chun Wang, Ming-Yu Liu*
arxiv 2021
[[Paper](https://arxiv.org/abs/2112.05130)]

**RiFeGAN2: Rich Feature Generation for Text-to-Image Synthesis from Constrained Prior Knowledge**
*Jun Cheng, Fuxiang Wu, Yanling Tian, Lei Wang, Dapeng Tao*
TCSVT 2021
[[Paper](https://ieeexplore.ieee.org/abstract/document/9656731)]

**TRGAN: Text to Image Generation Through Optimizing Initial Image**
*Liang Zhao, Xinwei Li, Pingda Huang, Zhikui Chen, Yanqi Dai, Tianyu Li*
ICONIP 2021
[[Paper](https://link.springer.com/chapter/10.1007/978-3-030-92307-5_76)]

**Audio-Driven Emotional Video Portraits [Audio2Image]**
*Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu*
CVPR 2021
[[Paper](https://arxiv.org/abs/2104.07452)]
[[Code](https://github.com/jixinya/EVP/)]
[[Project](https://jixinya.github.io/projects/evp/)]

**SketchyCOCO: Image Generation from Freehand Scene Sketches**
*Chengying Gao, Qi Liu, Qi Xu, Limin Wang, Jianzhuang Liu, Changqing Zou*
CVPR 2020
[[Paper](https://arxiv.org/pdf/2003.02683.pdf)]
[[Code](https://github.com/sysu-imsl/SketchyCOCO)]
[[Project](https://mikexuq.github.io/test_building_pages/index.html)]

**Direct Speech-to-Image Translation [Audio2Image]**
*Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao*
JSTSP 2020
[[Paper](https://ieeexplore.ieee.org/document/9067083/authors#authors)]
[[Code](https://github.com/smallflyingpig/speech-to-image-translation-without-text)]
[[Project](https://smallflyingpig.github.io/speech-to-image/main)]

**MirrorGAN: Learning Text-to-image Generation by Redescription [Text2Image]**
*Tingting Qiao, Jing Zhang, Duanqing Xu, Dacheng Tao*
CVPR 2019
[[Paper](https://arxiv.org/abs/1903.05854)]
[[Code](https://github.com/qiaott/MirrorGAN)]

**AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks [Text2Image]**
*Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He*
CVPR 2018
[[Paper](https://openaccess.thecvf.com/content_cvpr_2018/papers/Xu_AttnGAN_Fine-Grained_Text_CVPR_2018_paper.pdf)]
[[Code](https://github.com/taoxugit/AttnGAN)]

**Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space**
*Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, Jason Yosinski*
CVPR 2017
[[Paper](https://openaccess.thecvf.com/content_cvpr_2017/papers/Nguyen_Plug__Play_CVPR_2017_paper.pdf)]
[[Code](https://github.com/Evolving-AI-Lab/ppgn)]

**StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks [Text2Image]**
*Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas*
TPAMI 2018
[[Paper](https://arxiv.org/abs/1710.10916)]
[[Code](https://github.com/hanzhanggit/StackGAN-v2)]

**StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks [Text2Image]**
*Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas*
ICCV 2017
[[Paper](https://arxiv.org/abs/1612.03242)]
[[Code](https://github.com/hanzhanggit/StackGAN)]
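
The text-to-image GANs in this section typically condition the generator by concatenating a text embedding with the noise vector. A toy sketch of that conditioning pattern (dimensions and names are illustrative, not any paper's architecture):

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Toy conditional generator: noise + text embedding -> image."""
    def __init__(self, z_dim=100, txt_dim=256, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + txt_dim, 512), nn.ReLU(),
            nn.Linear(512, img_pixels), nn.Tanh())

    def forward(self, z, txt_emb):
        # condition on text by concatenating it with the noise vector
        return self.net(torch.cat([z, txt_emb], dim=1)).view(-1, 3, 64, 64)

img = CondGenerator()(torch.randn(2, 100), torch.randn(2, 256))
```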
### GAN-Inversion-Methods
**Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold**
*Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, Christian Theobalt*
SIGGRAPH 2023
[[Paper](https://arxiv.org/abs/2305.10973)]
[[Code](https://github.com/XingangPan/DragGAN)]

**HairCLIP: Design Your Hair by Text and Reference Image**
*Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang, Nenghai Yu*
arxiv 2021
[[Paper](https://arxiv.org/abs/2112.05142)]
[[Code](https://github.com/wty-ustc/HairCLIP)]

**FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization**
*Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, Qiang Liu*
arxiv 2021
[[Paper](https://arxiv.org/abs/2112.01573)]
[[Code](https://github.com/gnobitab/FuseDream)]

**StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation**
*Umut Kocasari, Alara Dirik, Mert Tiftikci, Pinar Yanardag*
WACV 2022
[[Paper](https://arxiv.org/abs/2112.08493)]
[[Code](https://github.com/catlab-team/stylemc)]
[[Project](https://catlab-team.github.io/stylemc/)]

**Cycle-Consistent Inverse GAN for Text-to-Image Synthesis**
*Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao*
ACM MM 2021
[[Paper](https://dl.acm.org/doi/10.1145/3474085.3475226)]

**StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery**
*Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski*
ICCV 2021
[[Paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Patashnik_StyleCLIP_Text-Driven_Manipulation_of_StyleGAN_Imagery_ICCV_2021_paper.pdf)]
[[Code](https://github.com/orpatashnik/StyleCLIP)]
[[Video](https://www.youtube.com/watch?v=PhR1gpXDu0w)]

**Talk-to-Edit: Fine-Grained Facial Editing via Dialog**
*Yuming Jiang, Ziqi Huang, Xingang Pan, Chen Change Loy, Ziwei Liu*
ICCV 2021
[[Paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Jiang_Talk-To-Edit_Fine-Grained_Facial_Editing_via_Dialog_ICCV_2021_paper.pdf)]
[[Code](https://github.com/yumingj/Talk-to-Edit)]
[[Project](https://www.mmlab-ntu.com/project/talkedit/)]

**TediGAN: Text-Guided Diverse Face Image Generation and Manipulation**
*Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu*
CVPR 2021
[[Paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Xia_TediGAN_Text-Guided_Diverse_Face_Image_Generation_and_Manipulation_CVPR_2021_paper.pdf)]
[[Code](https://github.com/IIGROUP/TediGAN)]
[[Video](https://www.youtube.com/watch?v=L8Na2f5viAM)]

**Paint by Word**
*David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, Antonio Torralba*
arxiv 2021
[[Paper](https://arxiv.org/abs/2103.10951)]
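
Optimization-based GAN inversion, the common core of this subsection, fits a latent code so the generator reproduces a target image; edits are then made in that latent space. A minimal sketch with a stand-in generator (not any listed repo's API):

```python
import torch

def invert(generator, target, steps=200, lr=0.05):
    """Optimize a latent code w so that generator(w) reconstructs `target`."""
    w = torch.zeros(1, 512, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(generator(w), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()  # edit this code (e.g., move it along a CLIP-guided direction)

toy_gen = torch.nn.Linear(512, 3 * 32 * 32)  # stand-in for a pretrained GAN
w = invert(lambda w: toy_gen(w).view(1, 3, 32, 32), torch.rand(1, 3, 32, 32))
```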
## Other-Methods
**Language-Driven Image Style Transfer**
*Tsu-Jui Fu, Xin Eric Wang, William Yang Wang*
arxiv 2021
[[Paper](https://arxiv.org/abs/2106.00178)]

**CLIPstyler: Image Style Transfer with a Single Text Condition**
*Gihyun Kwon, Jong Chul Ye*
arxiv 2021
[[Paper](https://arxiv.org/abs/2112.00374)]
[[Code](https://github.com/paper11667/CLIPstyler)]

**Wakey-Wakey: Animate Text by Mimicking Characters in a GIF**
*Liwenhan Xie, Zhaoyu Zhou, Kerun Yu, Yun Wang, Huamin Qu, Siming Chen*
UIST 2023
[[Paper](https://arxiv.org/pdf/2308.00224.pdf)]
[[Code](https://github.com/KeriYuu/Wakey-Wakey)]
[[Project](https://shellywhen.github.io/projects/Wakey-Wakey)]
## Text-Encoding
**FLAVA: A Foundational Language And Vision Alignment Model**
*Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela*
arxiv 2021
[[Paper](https://arxiv.org/abs/2112.04482)]

**Learning Transferable Visual Models From Natural Language Supervision (CLIP)**
*Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever*
arxiv 2021
[[Paper](https://arxiv.org/abs/2103.00020)]
[[Code](https://github.com/OpenAI/CLIP)]
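
Most methods in this list rely on CLIP's joint text-image embedding for conditioning or guidance. A short usage sketch with the official openai/clip package (the image path is a placeholder):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)  # image-text similarity scores
    probs = logits_per_image.softmax(dim=-1)  # probability of each caption
```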
## Audio-Encoding
**Wav2CLIP: Learning Robust Audio Representations From CLIP (Wav2CLIP)**
*Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello*
ICASSP 2022
[[Paper](https://arxiv.org/abs/2110.11499)]
[[Code](https://github.com/descriptinc/lyrebird-wav2clip)]

## Datasets
- [Multimodal CelebA-HQ](https://github.com/IIGROUP/MM-CelebA-HQ-Dataset)
- [DeepFashion-MultiModal](https://github.com/yumingj/DeepFashion-MultiModal)
## Citation
If you use this project in your research, please cite our paper:
```bibtex
@article{zhan2023mise,
  title={Multimodal Image Synthesis and Editing: The Generative AI Era},
  author={Zhan, Fangneng and Yu, Yingchen and Wu, Rongliang and Zhang, Jiahui and Lu, Shijian and Liu, Lingjie and Kortylewski, Adam and Theobalt, Christian and Xing, Eric},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2023},
  publisher={IEEE}
}
```