# Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

## Awesome-LLMs-meet-Multimodal-Generation

🔥🔥🔥 A curated list of papers on LLM-based multimodal generation (image, video, 3D and audio).

https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

Last synced: about 19 hours ago (a JSON representation is also available)
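Since ecosyste.ms is an open API service, the list's JSON representation can be fetched programmatically. The sketch below is a minimal example; the `projects/lookup` endpoint path and base URL are assumptions based on ecosyste.ms conventions, not documented guarantees:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL for the ecosyste.ms Awesome API (not verified here).
API_BASE = "https://awesome.ecosyste.ms/api/v1"

def project_json_url(repo_url: str) -> str:
    # Build the (assumed) lookup URL for a project's JSON representation,
    # percent-encoding the repository URL as a query parameter.
    return f"{API_BASE}/projects/lookup?url={quote(repo_url, safe='')}"

def fetch_project(repo_url: str) -> dict:
    # Fetch and decode the JSON document (requires network access).
    with urlopen(project_json_url(repo_url)) as resp:
        return json.load(resp)

url = project_json_url(
    "https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation"
)
```

Calling `fetch_project(...)` would then return the indexed metadata as a Python dict, assuming the endpoint exists in this form.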
## Image Generation

### 🔅 LLM-based
- SEED (code)
- …Denoising Autoregressive Transformer for… (Gu, Wang)
- …A Joint Image-Video Tokenizer for… (Wang, Jiang)
- …Model Beats Diffusion: Llama for… (Sun, Jiang)
- InstantUnify (code)
- …Mixed-Modal Early-Fusion Foundation…
- …Personalized Multimodal Generation with Large… (Shen, Zhang)
- ELLA (code)
- …Tokenizing Strokes for Vector Graphic… (Tang, Wu)
- VL-GPT (code)
- SLD (code)
- Idea2Img (code)
- Mini-DALLE3 (code)
- MiniGPT-5 (code)
- SwitchGPT (code)
- LayoutLLM-T2I (code)
- VPGen (code)
- …groundedDiffusion (code)
- …Anything-Pipeline (code)
- SUR-adapter (code)
- …DiffusionMaster (code)
### Datasets

### Non-LLM-based (CLIP/T5)
- …Free Lunch towards Style-Preserving… (Wang, Spinelli)
- InstantStyle (code)
- …Zero-shot Identity-Preserving Generation… (Wang, Bai)
- InstantID (code)
- PixArt-alpha (code)
- custom-diffusion (code)
## Video Generation

### Non-LLM-based
- …Priors Makes Text-to-Video Synthesis… (Cheng, Peng)
- …Training-free Diffuser Adaptation for… (Yang, Zhang)
- …A High-Performance Long Video Method… (Xu, Zou)
- EasyAnimate (code)
- …VAE: A Compatible Video VAE for Latent Video… (Zhao, Zhang)
- CV-VAE (code)
- …Diffusion Models are Training-free Motion… (Xiao, Zhou)
- …Video: Scaled Spatiotemporal Transformers for… (Menapace, Siarohin)
- …Overcoming Data Limitations for… (Chen, Zhang)
- …Animating Open-domain Images with… (Xing, Xia)
- Animate-A-Story (code)
- Make-Your-Video (code)
- …Time-Controllable Denoising for Image and… (Zhang, Jiang)
- FreeTraj (code)
- cond-image-leakage (code)
- ImageConductor (code)
- VideoCrafter (code)
- OpenDiT (code)
- Pandora (code)
- …Motion-Aware Customized Text-to-Video… (Wu, Li)
- …Tuning-Free Trajectory Control in Video… (Qiu, Chen)
- …and Solving Conditional Image Leakage… (Zhao, Zhu)
- …Conductor: Precision Control for Interactive… (Li, Wang)
- …Building Automatic Metrics to Simulate… (He, Jiang)
- VideoScore (code)
- …Real-World Visuomotor Policy Learning… (Liang, Liu)
### Datasets
- CelebV-Text (code)
- vila-100m (code)
- vtt-it (code)
- videoCC-data (code)
- …1M: A Large-Scale Dataset for Text-to-video… (Tan, Yang)
- …A Video Is Worth Thousands of Words… (Yang, Huang)
- …A Multimodal Trailer Video Dataset with… (Chi, Wang)
### 🔅 LLM-based
- …Generating Minute-level Long Videos with… (Wang, Xiong)
- …2: LLM-Enhanced World Models for Video… (Zhao, Wang)
- …Language-Driven Video Inpainting via Large… (Wu, Li)
- …groundedVideoDiffusion (code)
- VideoDirectorGPT (code)
- DirecT2V (code)
## 3D Generation

### 🔅 LLM-based

### Non-LLM-based (CLIP/T5)
- DreamPolisher (code)
- Consistent3D (code)
- AToM (code)
- LucidDreamer (code)
- SweetDreamer (code)
- Classifier-Score-Distillation (code)
- …nerf (code)
- …text-to-3D (code)
- Perp-Neg-stablediffusion (code)
- sjc (code)
- describe3d (code)
- 3DFuse-threestudio (code)
- Fantasia3D (code)
- X-Mesh (code)
- threestudio (code)
- …Forge (code)
- tango (code)
- …Mesh (code)
### Datasets
## 3D Editing

### 🔅 LLM-based

### Non-LLM-based (CLIP/T5)
- …paintbrush (code)
- ClipFace (code)
## Audio Generation

### 🔅 LLM-based
- songcomposer (code)
- ChatMusician (code)
- LLaSM (code)
### Non-LLM-based
- tango (code)
### Datasets
- jamendo-dataset (code)
## Audio Understanding

### Non-LLM-based (CLIP/T5)
- llark (code)
- MU-LLaMA (code)
- AudioGPT (demo)
## Audio Editing

### 🔅 LLM-based
- …copilot (code)
### Non-LLM-based (CLIP/T5)
- …agent (code)
- AutoGPT (code)
## Generation with Multiple Modalities

### 🔅 LLM-based
- …Conditional Multimodal Content Generation… (Wang, Duan)
- …Tokenize and Embed ALL for Multi-modal Large… (Yang, Zhang)
### Non-LLM-based
- …and Hearing: Open-domain Visual-Audio…with… (Xing, He)
- …Benchmarking Text-to-Audible-Video… (Mao, Shen)
## Image Editing

### 🔅 LLM-based
- …pix2pix (code)
- …Exploring Complex Instruction-Based…with… (Huang, Xie)
- …Revolutionizing Text-based Image Editing for… (Zhang, Kang)
### Non-LLM-based (CLIP/T5)
- DragonDiffusion (code)
- …diffusion (code)
- SINE (code)
- …and-play (code)
- …to-prompt (code)
- DiffEdit-stable-diffusion (code)
- DiffusionCLIP (code)
## Video Editing

### 🔅 LLM-based
- instruct-video-to-video (code)
### Non-LLM-based (CLIP/T5)
- …A-Protagonist (code)
- …Audio-Driven Video Scene Editing… (Shen, Quan)
## Image Understanding

### Non-LLM-based (CLIP/T5)
## Video Understanding

### Non-LLM-based (CLIP/T5)
- …2: Advancing Spatial-Temporal Modeling… (Cheng, Leng)
- …: Parameter-free LLaVA Extension from Images… (Xu, Zhao)
- PLLaVA (code)
- Video-Bench (code)
- Video-LLaVA (code)
- …XL: Extra-Long Vision Language Model for… (Shu, Zhang)
- …XL (code)
- …MLLM: On-Demand Spatial-Temporal Understanding… (Liu, Dong)
- Oryx (code)
## 3D Understanding

### Non-LLM-based (CLIP/T5)
## Attack

### Non-LLM-based (CLIP/T5)
- …ssp (code)
- …Reminder (code)
- …hijacks (code)
- llm-attacks (code)
- …llms (code)
- universal-triggers (code)
- …squad (code)
## Defense and Detect

### Non-LLM-based (CLIP/T5)
- …pretrain-code (code)
- …llm (code)
- …diffusion (code)
- safe-latent-diffusion (code)
- …Training-Data-from-Large-Langauge-Models (code)
## Alignment

### Non-LLM-based (CLIP/T5)
## Datasets

### Non-LLM-based (CLIP/T5)
- …Safety-Collection (code)
- …SafetyBench (code)
- …access-control (code)
- SafetyBench (code)
- Safety-Prompts (code)
## 3D, Video and Audio Safety

### Non-LLM-based (CLIP/T5)
- Adv3D (code)
- …Multi-modal-Multi-scale-Transformers-for-Deepfake-Detection (code)
## LLM

### Non-LLM-based (CLIP/T5)
- …timeline-of-llms (code)
- …llms.github.io (project page)
## Vision

### Non-LLM-based (CLIP/T5)
- …Models-in-Vision-A-Survey (code)
- …Models-in-Vision-Survey (code), TPAMI 2023
## Multiple modalities

### Non-LLM-based (CLIP/T5)
- …A Multimodal Autoregressive Model for… (Piergiovanni, Noble)
## Categories

- Image Generation (106)
- 3D Generation (90)
- Video Generation (82)
- Image Editing (44)
- Audio Generation (34)
- Audio Understanding (26)
- Video Understanding (25)
- Video Editing (23)
- 3D Editing (21)
- Attack (20)
- Datasets (18)
- Defense and Detect (17)
- Image Understanding (13)
- Audio Editing (12)
- 3D Understanding (10)
- 3D, Video and Audio Safety (8)
- Generation with Multiple Modalities (8)
- Vision (6)
- Alignment (6)
- LLM (6)
- Multiple modalities (1)