# [TPAMI 2023] Multimodal Image Synthesis and Editing: The Generative AI Era

[![arXiv](https://img.shields.io/badge/arXiv-2112.13592-b31b1b.svg)](https://arxiv.org/abs/2112.13592)
[![Survey](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity)
[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com)
[![GitHub license](https://badgen.net/github/license/Naereen/Strapdown.js)](https://github.com/Naereen/StrapDown.js/blob/master/LICENSE)

This project is associated with our survey paper, which comprehensively contextualizes the advances in Multimodal Image
Synthesis & Editing (MISE) and visual AIGC by formulating taxonomies according to data modalities and model architectures.

**Multimodal Image Synthesis and Editing: The Generative AI Era [[Paper](https://arxiv.org/abs/2112.13592)] [[Project](https://fnzhan.com/Generative-AI/)]**

[Fangneng Zhan](https://fnzhan.com/), [Yingchen Yu](https://yingchen001.github.io/), [Rongliang Wu](https://scholar.google.com.sg/citations?user=SZkh3iAAAAAJ&hl=en), [Jiahui Zhang](https://scholar.google.com/citations?user=DXpYbWkAAAAJ&hl=zh-CN), [Shijian Lu](https://scholar.google.com.sg/citations?user=uYmK-A0AAAAJ&hl=en), [Lingjie Liu](https://lingjie0206.github.io/), [Adam Kortylewski](https://generativevision.mpi-inf.mpg.de/),
[Christian Theobalt](https://people.mpi-inf.mpg.de/~theobalt/), [Eric Xing](http://www.cs.cmu.edu/~epxing/)

*IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023*


[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com)
You are welcome to recommend papers via pull requests.

The process to submit a pull request:
- a. Fork the project into your own repository.
- b. Add the Title, Author, Conference, Paper link, Project link, and Code link to `README.md` in the format below (a completed example follows this list):
```
**Title**

*Author*

Conference
[[Paper](Paper link)]
[[Project](Project link)]
[[Code](Code link)]
```
- c. Submit the pull request to this branch.
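
For instance, a completed entry (taken from the GAN-Inversion section of this list) looks like:
```
**StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery**

*Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski*

ICCV 2021
[[Paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Patashnik_StyleCLIP_Text-Driven_Manipulation_of_StyleGAN_Imagery_ICCV_2021_paper.pdf)]
[[Code](https://github.com/orpatashnik/StyleCLIP)]
```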


## Related Surveys & Projects

**Adversarial Text-to-Image Synthesis: A Review**

*Stanislav Frolov, Tobias Hinz, Federico Raue, Jörn Hees, Andreas Dengel*

Neural Networks 2021
[[Paper](https://arxiv.org/abs/2101.09983)]

**GAN Inversion: A Survey**

*Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, Ming-Hsuan Yang*

TPAMI 2022
[[Paper](https://arxiv.org/abs/2101.05278)]
[[Project](https://github.com/weihaox/awesome-gan-inversion)]

**Deep Image Synthesis from Intuitive User Input: A Review and Perspectives**

*Yuan Xue, Yuan-Chen Guo, Han Zhang, Tao Xu, Song-Hai Zhang, Xiaolei Huang*

Computational Visual Media 2022
[[Paper](https://arxiv.org/abs/2107.04240)]

[Awesome-Text-to-Image](https://github.com/Yutong-Zhou-cv/awesome-Text-to-Image)


## Table of Contents (Work in Progress)

**Methods:**

- [Neural Rendering Methods](#Neural-Rendering-Methods)
- [Diffusion-based Methods](#Diffusion-based-Methods)
- [Autoregressive Methods](#Autoregressive-Methods)
  - [Image Quantizer](#Image-Quantizer)
- [GAN-based Methods](#GAN-based-Methods)
  - [GAN Inversion](#GAN-Inversion-Methods)
- [Other Methods](#Other-Methods)

**Modalities & Datasets:**
- [Text Encoding](#Text-Encoding)
- [Audio Encoding](#Audio-Encoding)
- [Datasets](#Datasets)

## Neural-Rendering-Methods

**ATT3D: Amortized Text-to-3D Object Synthesis**

*Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, James Lucas*

arxiv 2023
[[Paper](https://arxiv.org/abs/2306.07349)]

**TADA! Text to Animatable Digital Avatars**

*Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxiang Tang, Yangyi Huang, Justus Thies, Michael J. Black*

arxiv 2023
[[Paper](https://arxiv.org/abs/2308.10899)]

**MATLABER: Material-Aware Text-to-3D via LAtent BRDF auto-EncodeR**

*Xudong Xu, Zhaoyang Lyu, Xingang Pan, Bo Dai*

arxiv 2023
[[Paper](https://arxiv.org/abs/2308.09278)]

**IT3D: Improved Text-to-3D Generation with Explicit View Synthesis**

*Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin*

arxiv 2023
[[Paper](https://arxiv.org/abs/2308.11473)]

**AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose**

*Huichao Zhang, Bowen Chen, Hao Yang, Liao Qu, Xu Wang, Li Chen, Chao Long, Feida Zhu, Kang Du, Min Zheng*

arxiv 2023
[[Paper](https://arxiv.org/abs/2308.03610)]
[[Project](https://avatarverse3d.github.io/)]

**Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions**

*Ayaan Haque, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, Angjoo Kanazawa*

ICCV 2023
[[Paper](https://arxiv.org/abs/2303.12789)]
[[Project](https://instruct-nerf2nerf.github.io)]
[[Code](https://github.com/ayaanzhaque/instruct-nerf2nerf)]

**FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields**

*Sungwon Hwang, Junha Hyung, Daejin Kim, Min-Jung Kim, Jaegul Choo*

ICCV 2023
[[Paper](https://arxiv.org/abs/2307.11418v3)]

**Local 3D Editing via 3D Distillation of CLIP Knowledge**

*Junha Hyung, Sungwon Hwang, Daejin Kim, Hyunji Lee, Jaegul Choo*

CVPR 2023
[[Paper](https://arxiv.org/abs/2306.12570)]

**RePaint-NeRF: NeRF Editting via Semantic Masks and Diffusion Models**

*Xingchen Zhou, Ying He, F. Richard Yu, Jianqiang Li, You Li*

IJCAI 2023
[[Paper](https://arxiv.org/abs/2306.05668)]

**DreamTime: An Improved Optimization Strategy for Text-to-3D Content Creation**

*Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, Lei Zhang*

arxiv 2023
[[Paper](https://arxiv.org/abs/2306.12422)]

**AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars**

*Mohit Mendiratta, Xingang Pan, Mohamed Elgharib, Kartik Teotia, Mallikarjun B R, Ayush Tewari, Vladislav Golyanik, Adam Kortylewski, Christian Theobalt*

arxiv 2023
[[Paper](https://arxiv.org/abs/2306.00547)]
[[Project](https://vcai.mpi-inf.mpg.de/projects/AvatarStudio/)]

**Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields**

*Ori Gordon, Omri Avrahami, Dani Lischinski*

arxiv 2023
[[Paper](https://arxiv.org/abs/2306.12760)]
[[Project](https://www.vision.huji.ac.il/blended-nerf/)]

**OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields**

*Youtan Yin, Zhoujie Fu, Fan Yang, Guosheng Lin*

arxiv 2023
[[Paper](https://arxiv.org/abs/2305.10503)]
[[Project](https://ornerf.github.io/)]
[[Code](https://github.com/cuteyyt/or-nerf)]

**HiFA: High-fidelity Text-to-3D with Advanced Diffusion Guidance**

*Junzhe Zhu, Peiye Zhuang*

arxiv 2023
[[Paper](https://arxiv.org/abs/2305.18766)]
[[Project](https://hifa-team.github.io/HiFA-site/)]

**ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation**

*Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu*

arxiv 2023
[[Paper](https://arxiv.org/abs/2305.16213)]
[[Project](https://ml.cs.tsinghua.edu.cn/prolificdreamer/)]

**Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields**

*Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, Jing Liao*

arxiv 2023
[[Paper](https://arxiv.org/abs/2305.11588)]
[[Project](https://eckertzhang.github.io/Text2NeRF.github.io/)]

**DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models**

*Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong*

arxiv 2023
[[Paper](https://arxiv.org/abs/2304.00916)]
[[Project](https://yukangcao.github.io/DreamAvatar/)]

**DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model**

*Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun*

arxiv 2023
[[Paper](https://arxiv.org/abs/2304.02827)]
[[Project](https://janeyeon.github.io/ditto-nerf/)]
[[Code](https://github.com/janeyeon/ditto-nerf-code)]

**CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout**

*Yiqi Lin, Haotian Bai, Sijia Li, Haonan Lu, Xiaodong Lin, Hui Xiong, Lin Wang*

arxiv 2023
[[Paper](https://arxiv.org/abs/2303.13843)]

**Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes**

*Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes, Daniel Cohen-Or*

arxiv 2023
[[Paper](https://arxiv.org/abs/2303.13450)]
[[Project](https://danacohen95.github.io/Set-the-Scene/)]
[[Code](https://github.com/DanaCohen95/Set-the-Scene)]

**Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation**

*Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, Seungryong Kim*

arxiv 2023
[[Paper](https://arxiv.org/abs/2303.07937)]
[[Project](https://ku-cvlab.github.io/3DFuse/)]
[[Code](https://github.com/KU-CVLAB/3DFuse)]

**Text-To-4D Dynamic Scene Generation**

*Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman*

arxiv 2023
[[Paper](https://arxiv.org/abs/2301.11280)]
[[Project](https://make-a-video3d.github.io/)]

**Magic3D: High-Resolution Text-to-3D Content Creation**

*Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, Tsung-Yi Lin*

CVPR 2023
[[Paper](https://arxiv.org/abs/2211.10440)]
[[Project](https://deepimagination.cc/Magic3D/)]

**DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model**

*Gwanghyun Kim, Se Young Chun*

CVPR 2023
[[Paper](https://arxiv.org/abs/2211.16374)]
[[Code](https://github.com/gwang-kim/DATID-3D)]
[[Project](https://datid-3d.github.io/)]

**Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models**

*Gang Li, Heliang Zheng, Chaoyue Wang, Chang Li, Changwen Zheng, Dacheng Tao*

arxiv 2022
[[Paper](https://arxiv.org/abs/2211.14108)]
[[Project](https://3ddesigner-diffusion.github.io/)]

**DreamFusion: Text-to-3D using 2D Diffusion**

*Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall*

arxiv 2022
[[Paper](https://arxiv.org/abs/2209.14988)]
[[Project](https://dreamfusion3d.github.io/)]
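
For context, DreamFusion's Score Distillation Sampling (SDS) objective underpins many text-to-3D entries above (e.g., Magic3D, DreamTime, HiFA, ProlificDreamer). Per the paper, a NeRF with parameters $\theta$ renders an image $x = g(\theta)$, and a frozen text-conditioned denoiser $\epsilon_\phi$ scores its plausibility; the gradient deliberately omits the U-Net Jacobian:

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \,\right]$$

where $x_t$ is the noised rendering at timestep $t$, $y$ is the text prompt, and $w(t)$ is a timestep-dependent weight.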

**Zero-Shot Text-Guided Object Generation with Dream Fields**

*Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, Ben Poole*

CVPR 2022
[[Paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Jain_Zero-Shot_Text-Guided_Object_Generation_With_Dream_Fields_CVPR_2022_paper.pdf)]
[[Code](https://github.com/google-research/google-research/tree/master/dreamfields)]
[[Project](https://ajayj.com/dreamfields)]

**IDE-3D: Interactive Disentangled Editing for High-Resolution 3D-aware Portrait Synthesis**

*Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, Yebin Liu*

SIGGRAPH Asia 2022
[[Paper](https://arxiv.org/pdf/2205.15517.pdf)]
[[Code](https://github.com/MrTornado24/IDE-3D)]
[[Project](https://mrtornado24.github.io/IDE-3D/)]

**Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields**

*Yuedong Chen, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai*

arxiv 2022
[[Paper](https://arxiv.org/abs/2203.10821)]
[[Code](https://github.com/donydchen/sem2nerf)]
[[Project](https://donydchen.github.io/sem2nerf/)]

**CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields**

*Can Wang, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao*

CVPR 2022
[[Paper](https://arxiv.org/abs/2112.05139)]
[[Code](https://github.com/cassiePython/CLIPNeRF)]
[[Project](https://cassiepython.github.io/clipnerf/)]

**CG-NeRF: Conditional Generative Neural Radiance Fields**

*Kyungmin Jo, Gyumin Shim, Sanghun Jung, Soyoung Yang, Jaegul Choo*

arxiv 2021
[[Paper](https://arxiv.org/abs/2112.03517)]

**AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis**

*Yudong Guo, Keyu Chen, Sen Liang, Yong-Jin Liu, Hujun Bao, Juyong Zhang*

ICCV 2021
[[Paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Guo_AD-NeRF_Audio_Driven_Neural_Radiance_Fields_for_Talking_Head_Synthesis_ICCV_2021_paper.pdf)]
[[Code](https://github.com/YudongGuo/AD-NeRF)]
[[Project](https://yudongguo.github.io/ADNeRF/)]
[[Video](https://www.youtube.com/watch?v=TQO2EBYXLyU)]


## Diffusion-based-Methods

**BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing**

*Dongxu Li, Junnan Li, Steven C.H. Hoi*

arxiv 2023
[[Paper](https://arxiv.org/pdf/2305.14720.pdf)]
[[Project](https://dxli94.github.io/BLIP-Diffusion-website/)]
[[Code](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion)]

**InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions**

*Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka*

arxiv 2023
[[Paper](https://arxiv.org/pdf/2305.18047.pdf)]
[[Project](https://qianwangx.github.io/InstructEdit/)]
[[Code](https://github.com/qianwangx/instructedit)]

**DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation**

*Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman*

CVPR 2023
[[Paper](https://arxiv.org/pdf/2208.12242.pdf)]
[[Project](https://dreambooth.github.io/)]
[[Code](https://github.com/google/dreambooth)]

**Multi-Concept Customization of Text-to-Image Diffusion**

*Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, Jun-Yan Zhu*

CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Kumari_Multi-Concept_Customization_of_Text-to-Image_Diffusion_CVPR_2023_paper.pdf)]
[[Project](https://www.cs.cmu.edu/~custom-diffusion/)]
[[Code](https://github.com/adobe-research/custom-diffusion)]

**Collaborative Diffusion for Multi-Modal Face Generation and Editing**

*Ziqi Huang, Kelvin C.K. Chan, Yuming Jiang, Ziwei Liu*

CVPR 2023
[[Paper](https://arxiv.org/pdf/2304.10530v1.pdf)]
[[Project](https://ziqihuangg.github.io/projects/collaborative-diffusion.html)]
[[Code](https://github.com/ziqihuangg/Collaborative-Diffusion)]

**Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation**

*Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel*

CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Tumanyan_Plug-and-Play_Diffusion_Features_for_Text-Driven_Image-to-Image_Translation_CVPR_2023_paper.pdf)]
[[Project](https://pnp-diffusion.github.io/)]
[[Code](https://github.com/MichalGeyer/plug-and-play)]

**SINE: SINgle Image Editing with Text-to-Image Diffusion Models**

*Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris Metaxas, Jian Ren*

CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Zhang_SINE_SINgle_Image_Editing_With_Text-to-Image_Diffusion_Models_CVPR_2023_paper.pdf)]
[[Project](https://zhang-zx.github.io/SINE/)]
[[Code](https://github.com/zhang-zx/SINE)]

**NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models**

*Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or*

CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Mokady_NULL-Text_Inversion_for_Editing_Real_Images_Using_Guided_Diffusion_Models_CVPR_2023_paper.pdf)]
[[Project](https://null-text-inversion.github.io/)]
[[Code](https://github.com/google/prompt-to-prompt/#null-text-inversion-for-editing-real-images)]

**Paint by Example: Exemplar-Based Image Editing With Diffusion Models**

*Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen*

CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Yang_Paint_by_Example_Exemplar-Based_Image_Editing_With_Diffusion_Models_CVPR_2023_paper.pdf)]
[[Demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example)]
[[Code](https://github.com/Fantasy-Studio/Paint-by-Example)]

**SpaText: Spatio-Textual Representation for Controllable Image Generation**

*Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, Xi Yin*

CVPR 2023
[[Paper](https://arxiv.org/pdf/2211.14305.pdf)]
[[Project](https://omriavrahami.com/spatext/)]

**Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models**

*Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis*

CVPR 2023
[[Paper](https://arxiv.org/pdf/2304.08818.pdf)]
[[Project](https://research.nvidia.com/labs/toronto-ai/VideoLDM/)]

**InstructPix2Pix: Learning to Follow Image Editing Instructions**

*Tim Brooks, Aleksander Holynski, Alexei A. Efros*

CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Brooks_InstructPix2Pix_Learning_To_Follow_Image_Editing_Instructions_CVPR_2023_paper.pdf)]
[[Project](https://www.timothybrooks.com/instruct-pix2pix/)]
[[Code](https://github.com/timothybrooks/instruct-pix2pix)]

**Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models**

*Nithin Gopalakrishnan Nair, Chaminda Bandara, Vishal M Patel*

CVPR 2023
[[Paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Nair_Unite_and_Conquer_Plug__Play_Multi-Modal_Synthesis_Using_Diffusion_CVPR_2023_paper.pdf)]
[[Project](https://nithin-gk.github.io/projectpages/Multidiff/index.html)]
[[Code](https://github.com/Nithin-GK/UniteandConquer)]

**DiffEdit: Diffusion-based semantic image editing with mask guidance**

*Guillaume Couairon, Jakob Verbeek, Holger Schwenk, Matthieu Cord*

ICLR 2023
[[Paper](https://arxiv.org/pdf/2210.11427.pdf)]

**eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers**

*Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu*

arxiv 2022
[[Paper](https://arxiv.org/pdf/2211.01324.pdf)]
[[Project](https://research.nvidia.com/labs/dir/eDiff-I/)]

**Prompt-to-Prompt Image Editing with Cross-Attention Control**

*Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or*

arxiv 2022
[[Paper](https://prompt-to-prompt.github.io/ptp_files/Prompt-to-Prompt_preprint.pdf)]
[[Project](https://prompt-to-prompt.github.io/)]
[[Code](https://github.com/google/prompt-to-prompt)]

**An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion**

*Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or*

arxiv 2022
[[Paper](https://arxiv.org/pdf/2208.01618.pdf)]
[[Project](https://textual-inversion.github.io/)]
[[Code](https://github.com/rinongal/textual_inversion)]
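
As a usage sketch: in the Hugging Face `diffusers` library (not the authors' original codebase), a learned concept embedding can be loaded and triggered through its placeholder token. The checkpoint and concept repository below are illustrative assumptions:

```python
# Sketch: applying a learned Textual Inversion embedding via `diffusers`.
# The checkpoint and concept repo names are assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Adds the learned embedding and its placeholder token (here "<cat-toy>")
# to the pipeline's text-encoder vocabulary.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The placeholder token in the prompt now stands for the learned concept.
image = pipe("a photo of <cat-toy> on a beach").images[0]
image.save("concept.png")
```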

**Text2Human: Text-Driven Controllable Human Image Generation**

*Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, Ziwei Liu*

SIGGRAPH 2022
[[Paper](https://arxiv.org/pdf/2205.15996.pdf)]
[[Project](https://yumingj.github.io/projects/Text2Human.html)]
[[Code](https://github.com/yumingj/Text2Human)]

**[DALL-E 2] Hierarchical Text-Conditional Image Generation with CLIP Latents**

*Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen*

arxiv 2022
[[Paper](https://cdn.openai.com/papers/dall-e-2.pdf)]
[[Code](https://github.com/lucidrains/DALLE2-pytorch)]

**High-Resolution Image Synthesis with Latent Diffusion Models**

*Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer*

CVPR 2022
[[Paper](https://arxiv.org/abs/2112.10752)]
[[Code](https://github.com/CompVis/latent-diffusion)]
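
As a usage sketch (assuming the Hugging Face `diffusers` port of this model rather than the original CompVis code; the checkpoint name is an example), text-to-image generation takes a few lines:

```python
# Minimal text-to-image sketch with a latent diffusion (Stable Diffusion)
# checkpoint through `diffusers`; model ID and defaults are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any Stable Diffusion checkpoint works
    torch_dtype=torch.float16,
).to("cuda")

# The pipeline encodes the prompt with CLIP, denoises in the VAE latent
# space, then decodes the final latent back to an RGB image.
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("astronaut.png")
```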

**v objective diffusion**

*Katherine Crowson*

[[Code](https://github.com/crowsonkb/v-diffusion-pytorch)]

**GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models**

*Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen*

arxiv 2021
[[Paper](https://arxiv.org/abs/2112.10741)]
[[Code](https://github.com/openai/glide-text2im)]

**Vector Quantized Diffusion Model for Text-to-Image Synthesis**

*Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo*

arxiv 2021
[[Paper](https://arxiv.org/abs/2111.14822)]
[[Code](https://github.com/microsoft/VQ-Diffusion)]

**DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation**

*Gwanghyun Kim, Jong Chul Ye*

arxiv 2021
[[Paper](https://arxiv.org/abs/2110.02711)]

**Blended Diffusion for Text-driven Editing of Natural Images**

*Omri Avrahami, Dani Lischinski, Ohad Fried*

CVPR 2022
[[Paper](https://arxiv.org/abs/2111.14818)]
[[Project](https://omriavrahami.com/blended-diffusion-page/)]
[[Code](https://github.com/omriav/blended-diffusion)]


## Autoregressive-Methods

**MaskGIT: Masked Generative Image Transformer**

*Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman*

arxiv 2022
[[Paper](https://arxiv.org/abs/2202.04200)]

**ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation**

*Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang*

arxiv 2021
[[Paper](https://arxiv.org/abs/2112.15283)]
[[Project](https://wenxin.baidu.com/wenxin/ernie-vilg)]

**NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion**

*Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan*

arxiv 2021
[[Paper](https://arxiv.org/abs/2111.12417)]
[[Code](https://github.com/microsoft/NUWA)]
[[Video](https://youtu.be/C9CTnZJ9ZE0)]

**L-Verse: Bidirectional Generation Between Image and Text**

*Taehoon Kim, Gwangmo Song, Sihaeng Lee, Sangyun Kim, Yewon Seo, Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae*

arxiv 2021
[[Paper](https://arxiv.org/abs/2111.11133)]
[[Code](https://github.com/tgisaturday/L-Verse)]

**M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis**

*Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie Tang, Jingren Zhou, Hongxia Yang*

NeurIPS 2021
[[Paper](https://arxiv.org/abs/2105.14211v3)]

**ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis**

*Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer*

NeurIPS 2021
[[Paper](https://openreview.net/pdf?id=-1AAgrS5FF)]
[[Code](https://github.com/CompVis/imagebart)]
[[Project](https://compvis.github.io/imagebart/)]

**A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation**

*Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu*

ACM MM 2021
[[Paper](https://arxiv.org/abs/2110.09756)]
[[Code](https://github.com/researchmm/generate-it)]

**Unifying Multimodal Transformer for Bi-directional Image and Text Generation**

*Yupan Huang, Hongwei Xue, Bei Liu, Yutong Lu*

ACM MM 2021
[[Paper](https://arxiv.org/abs/2110.09753)]
[[Code](https://github.com/researchmm/generate-it)]

**Taming Transformers for High-Resolution Image Synthesis**

*Patrick Esser, Robin Rombach, Björn Ommer*

CVPR 2021
[[Paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Esser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.pdf)]
[[Code](https://github.com/CompVis/taming-transformers)]
[[Project](https://compvis.github.io/taming-transformers/)]

**RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP**

*Alex Shonenkov, Michael Konstantinov*

arxiv 2022
[[Code](https://github.com/sberbank-ai/ru-dolph)]

**Generate Images from Texts in Russian (ruDALL-E)**

[[Code](https://github.com/sberbank-ai/ru-dalle)]
[[Project](https://rudalle.ru/en/)]

**Zero-Shot Text-to-Image Generation**

*Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever*

arxiv 2021
[[Paper](https://arxiv.org/abs/2102.12092)]
[[Code](https://github.com/openai/DALL-E)]
[[Project](https://openai.com/blog/dall-e/)]

**Compositional Transformers for Scene Generation**

*Drew A. Hudson, C. Lawrence Zitnick*

NeurIPS 2021
[[Paper](https://openreview.net/pdf?id=YQeWoRnwTnE)]
[[Code](https://github.com/dorarad/gansformer)]

**X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers**

*Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi*

EMNLP 2020
[[Paper](https://arxiv.org/abs/2009.11278)]
[[Code](https://github.com/allenai/x-lxmert)]

**One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning**

*Suzhen Wang, Lincheng Li, Yu Ding, Xin Yu*

AAAI 2022
[[Paper](https://arxiv.org/abs/2112.02749)]


### Image-Quantizer

**[TE-VQGAN] Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation**

*Woncheol Shin, Gyubok Lee, Jiyoung Lee, Joonseok Lee, Edward Choi*

arxiv 2021
[[Paper](https://arxiv.org/abs/2112.00384)]
[[Code](https://github.com/wcshin-git/TE-VQGAN)]

**[ViT-VQGAN] Vector-quantized Image Modeling with Improved VQGAN**

*Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu*

arxiv 2021
[[Paper](https://arxiv.org/abs/2110.04627)]

**[PeCo] PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers**

*Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu*

arxiv 2021
[[Paper](https://arxiv.org/abs/2111.12710)]

**[VQ-GAN] Taming Transformers for High-Resolution Image Synthesis**

*Patrick Esser, Robin Rombach, Björn Ommer*

CVPR 2021
[[Paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Esser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.pdf)]
[[Code](https://github.com/CompVis/taming-transformers)]

**[Gumbel-VQ] vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations**

*Alexei Baevski, Steffen Schneider, Michael Auli*

ICLR 2020
[[Paper](https://openreview.net/pdf?id=rylwJxrYDS)]
[[Code](https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/README.md)]

**[EM VQ-VAE] Theory and Experiments on Vector Quantized Autoencoders**

*Aurko Roy, Ashish Vaswani, Arvind Neelakantan, Niki Parmar*

arxiv 2018
[[Paper](https://arxiv.org/abs/1805.11063)]
[[Code](https://github.com/jaywalnut310/Vector-Quantized-Autoencoders)]

**[VQ-VAE] Neural Discrete Representation Learning**

*Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu*

NIPS 2017
[[Paper](https://proceedings.neurips.cc/paper/2017/file/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf)]
[[Code](https://github.com/ritheshkumar95/pytorch-vqvae)]
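
The discrete bottleneck at the heart of VQ-VAE snaps each encoder output to its nearest codebook vector and copies gradients past the non-differentiable lookup with a straight-through estimator. A minimal PyTorch sketch of that step (illustrative, not the authors' reference code):

```python
# Illustrative VQ-VAE quantization step with a straight-through estimator.
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """z_e: (N, D) encoder outputs; codebook: (K, D) embedding vectors."""
    # Nearest codebook entry by Euclidean distance.
    dists = torch.cdist(z_e, codebook)   # (N, K)
    idx = dists.argmin(dim=1)            # (N,)
    z_q = codebook[idx]                  # (N, D)

    # Codebook loss pulls embeddings toward encoder outputs; the
    # commitment loss (weighted by beta) pulls encoder outputs toward
    # their assigned embeddings.
    loss = F.mse_loss(z_q, z_e.detach()) + beta * F.mse_loss(z_e, z_q.detach())

    # Straight-through: forward pass uses z_q, backward pass routes
    # gradients directly to z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, idx, loss
```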

**[VQ-VAE2 or EMA-VQ] Generating Diverse High-Fidelity Images with VQ-VAE-2**

*Ali Razavi, Aaron van den Oord, Oriol Vinyals*

NeurIPS 2019
[[Paper](https://proceedings.neurips.cc/paper/2019/file/5f8e2fa1718d1bbcadf1cd9c7a54fb8c-Paper.pdf)]
[[Code](https://github.com/rosinality/vq-vae-2-pytorch)]

**[Discrete VAE] Discrete Variational Autoencoders**

*Jason Tyler Rolfe*

ICLR 2017
[[Paper](https://arxiv.org/abs/1609.02200)]
[[Code](https://github.com/openai/DALL-E)]

**[DVAE++] DVAE++: Discrete Variational Autoencoders with Overlapping Transformations**

*Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, Evgeny Andriyash*

ICML 2018
[[Paper](https://arxiv.org/abs/1802.04920)]
[[Code](https://github.com/xmax1/dvae)]

**[DVAE#] DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors**

*Arash Vahdat, Evgeny Andriyash, William G. Macready*

NeurIPS 2018
[[Paper](https://arxiv.org/abs/1805.07445)]
[[Code](https://github.com/xmax1/dvae)]


## GAN-based-Methods

**GauGAN2**

*NVIDIA*

[[Project](http://gaugan.org/gaugan2/)]
[[Video](https://www.youtube.com/watch?v=p9MAvRpT6Cg)]

**Multimodal Conditional Image Synthesis with Product-of-Experts GANs**

*Xun Huang, Arun Mallya, Ting-Chun Wang, Ming-Yu Liu*

arxiv 2021
[[Paper](https://arxiv.org/abs/2112.05130)]

**RiFeGAN2: Rich Feature Generation for Text-to-Image Synthesis from Constrained Prior Knowledge**

*Jun Cheng, Fuxiang Wu, Yanling Tian, Lei Wang, Dapeng Tao*

TCSVT 2021
[[Paper](https://ieeexplore.ieee.org/abstract/document/9656731/authors#authors)]

**TRGAN: Text to Image Generation Through Optimizing Initial Image**

*Liang Zhao, Xinwei Li, Pingda Huang, Zhikui Chen, Yanqi Dai, Tianyu Li*

ICONIP 2021
[[Paper](https://link.springer.com/chapter/10.1007/978-3-030-92307-5_76)]

**Audio-Driven Emotional Video Portraits [Audio2Image]**

*Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu*

CVPR 2021
[[Paper](https://arxiv.org/abs/2104.07452)]
[[Code](https://github.com/jixinya/EVP/)]
[[Project](https://jixinya.github.io/projects/evp/)]

**SketchyCOCO: Image Generation from Freehand Scene Sketches**

*Chengying Gao, Qi Liu, Qi Xu, Limin Wang, Jianzhuang Liu, Changqing Zou*

CVPR 2020
[[Paper](https://arxiv.org/pdf/2003.02683.pdf)]
[[Code](https://github.com/sysu-imsl/SketchyCOCO)]
[[Project](https://mikexuq.github.io/test_building_pages/index.html)]

**Direct Speech-to-Image Translation [Audio2Image]**

*Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao*

JSTSP 2020
[[Paper](https://ieeexplore.ieee.org/document/9067083/authors#authors)]
[[Code](https://github.com/smallflyingpig/speech-to-image-translation-without-text)]
[[Project](https://smallflyingpig.github.io/speech-to-image/main)]

**MirrorGAN: Learning Text-to-image Generation by Redescription [Text2Image]**

*Tingting Qiao, Jing Zhang, Duanqing Xu, Dacheng Tao*

CVPR 2019
[[Paper](https://arxiv.org/abs/1903.05854)]
[[Code](https://github.com/qiaott/MirrorGAN)]

**AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks [Text2Image]**

*Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He*

CVPR 2018
[[Paper](https://openaccess.thecvf.com/content_cvpr_2018/papers/Xu_AttnGAN_Fine-Grained_Text_CVPR_2018_paper.pdf)]
[[Code](https://github.com/taoxugit/AttnGAN)]

**Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space**

*Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, Jason Yosinski*

CVPR 2017
[[Paper](https://openaccess.thecvf.com/content_cvpr_2017/papers/Nguyen_Plug__Play_CVPR_2017_paper.pdf)]
[[Code](https://github.com/Evolving-AI-Lab/ppgn)]

**StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks [Text2Image]**

*Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas*

TPAMI 2018
[[Paper](https://arxiv.org/abs/1710.10916)]
[[Code](https://github.com/hanzhanggit/StackGAN-v2)]

**StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks [Text2Image]**

*Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas*

ICCV 2017
[[Paper](https://arxiv.org/abs/1612.03242)]
[[Code](https://github.com/hanzhanggit/StackGAN)]


### GAN-Inversion-Methods

**Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold**

*Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, Christian Theobalt*

SIGGRAPH 2023
[[Paper](https://arxiv.org/abs/2305.10973)]
[[Code](https://github.com/XingangPan/DragGAN)]

**HairCLIP: Design Your Hair by Text and Reference Image**

*Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang, Nenghai Yu*

arxiv 2021
[[Paper](https://arxiv.org/abs/2112.05142)]
[[Code](https://github.com/wty-ustc/HairCLIP)]

**FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization**

*Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, Qiang Liu*

arxiv 2021
[[Paper](https://arxiv.org/abs/2112.01573)]
[[Code](https://github.com/gnobitab/FuseDream)]

**StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation**

*Umut Kocasari, Alara Dirik, Mert Tiftikci, Pinar Yanardag*

WACV 2022
[[Paper](https://arxiv.org/abs/2112.08493)]
[[Code](https://github.com/catlab-team/stylemc)]
[[Project](https://catlab-team.github.io/stylemc/)]

**Cycle-Consistent Inverse GAN for Text-to-Image Synthesis**

*Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao*

ACM MM 2021
[[Paper](https://dl.acm.org/doi/10.1145/3474085.3475226)]

**StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery**

*Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski*

ICCV 2021
[[Paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Patashnik_StyleCLIP_Text-Driven_Manipulation_of_StyleGAN_Imagery_ICCV_2021_paper.pdf)]
[[Code](https://github.com/orpatashnik/StyleCLIP)]
[[Video](https://www.youtube.com/watch?v=PhR1gpXDu0w)]

**Talk-to-Edit: Fine-Grained Facial Editing via Dialog**

*Yuming Jiang, Ziqi Huang, Xingang Pan, Chen Change Loy, Ziwei Liu*

ICCV 2021
[[Paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Jiang_Talk-To-Edit_Fine-Grained_Facial_Editing_via_Dialog_ICCV_2021_paper.pdf)]
[[Code](https://github.com/yumingj/Talk-to-Edit)]
[[Project](https://www.mmlab-ntu.com/project/talkedit/)]

**TediGAN: Text-Guided Diverse Face Image Generation and Manipulation**

*Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu*

CVPR 2021
[[Paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Xia_TediGAN_Text-Guided_Diverse_Face_Image_Generation_and_Manipulation_CVPR_2021_paper.pdf)]
[[Code](https://github.com/IIGROUP/TediGAN)]
[[Video](https://www.youtube.com/watch?v=L8Na2f5viAM)]

**Paint by Word**

*David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, Antonio Torralba*

arxiv 2021
[[Paper](https://arxiv.org/abs/2103.10951)]


## Other-Methods

**Language-Driven Image Style Transfer**

*Tsu-Jui Fu, Xin Eric Wang, William Yang Wang*

arxiv 2021
[[Paper](https://arxiv.org/abs/2106.00178)]

**CLIPstyler: Image Style Transfer with a Single Text Condition**

*Gihyun Kwon, Jong Chul Ye*

arxiv 2021
[[Paper](https://arxiv.org/abs/2112.00374)]
[[Code](https://github.com/paper11667/CLIPstyler)]

**Wakey-Wakey: Animate Text by Mimicking Characters in a GIF**

*Liwenhan Xie, Zhaoyu Zhou, Kerun Yu, Yun Wang, Huamin Qu, Siming Chen*

UIST 2023
[[Paper](https://arxiv.org/pdf/2308.00224.pdf)]
[[Code](https://github.com/KeriYuu/Wakey-Wakey)]
[[Project](https://shellywhen.github.io/projects/Wakey-Wakey)]



## Text-Encoding

**FLAVA: A Foundational Language And Vision Alignment Model**

*Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela*

arxiv 2021
[[Paper](https://arxiv.org/abs/2112.04482)]

**Learning Transferable Visual Models From Natural Language Supervision (CLIP)**

*Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever*

arxiv 2021
[[Paper](https://arxiv.org/abs/2103.00020)]
[[Code](https://github.com/OpenAI/CLIP)]
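
Since CLIP is the text/image encoder behind many methods in this list, a minimal sketch of its joint embedding API may be useful (using the official repository linked above; the checkpoint name and image path are examples):

```python
# Minimal sketch with the official OpenAI CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # released checkpoint

image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a diagram"]).to(device)

with torch.no_grad():
    # Both encoders map into a shared embedding space; the scaled cosine
    # similarities serve as text-image alignment scores.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
print(probs)
```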


## Audio-Encoding

**Wav2CLIP: Learning Robust Audio Representations From CLIP (Wav2CLIP)**

*Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello*

ICASSP 2022
[[Paper](https://arxiv.org/abs/2110.11499)]
[[Code](https://github.com/descriptinc/lyrebird-wav2clip)]

## Datasets

[Multimodal CelebA-HQ](https://github.com/IIGROUP/MM-CelebA-HQ-Dataset)

[DeepFashion-MultiModal](https://github.com/yumingj/DeepFashion-MultiModal)

## Citation
If you find this project useful for your research, please cite our paper.
```bibtex
@article{zhan2023mise,
  title={Multimodal Image Synthesis and Editing: The Generative AI Era},
  author={Zhan, Fangneng and Yu, Yingchen and Wu, Rongliang and Zhang, Jiahui and Lu, Shijian and Liu, Lingjie and Kortylewski, Adam and Theobalt, Christian and Xing, Eric},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2023},
  publisher={IEEE}
}
```