# Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

## Awesome-LLMs-meet-Multimodal-Generation

🔥🔥🔥 A curated list of papers on LLM-based multimodal generation (image, video, 3D and audio).

https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

Last synced: about 19 hours ago (a JSON representation is also available)
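Since ecosyste.ms is an open API service, the list's JSON representation can be fetched programmatically. The sketch below is a minimal example; the `projects/lookup` endpoint path and base URL are assumptions based on ecosyste.ms conventions, not documented guarantees:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL for the ecosyste.ms Awesome API (not verified here).
API_BASE = "https://awesome.ecosyste.ms/api/v1"

def project_json_url(repo_url: str) -> str:
    # Build the (assumed) lookup URL for a project's JSON representation,
    # percent-encoding the repository URL as a query parameter.
    return f"{API_BASE}/projects/lookup?url={quote(repo_url, safe='')}"

def fetch_project(repo_url: str) -> dict:
    # Fetch and decode the JSON document (requires network access).
    with urlopen(project_json_url(repo_url)) as resp:
        return json.load(resp)

url = project_json_url(
    "https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation"
)
```

Calling `fetch_project(...)` would then return the indexed metadata as a Python dict, assuming the endpoint exists in this form.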
## Image Generation

### 🔅 LLM-based
- SEED (code)
- …Denoising Autoregressive Transformer for… (Gu, Wang)
- …A Joint Image-Video Tokenizer for… (Wang, Jiang)
- …Model Beats Diffusion: Llama for… (Sun, Jiang)
- InstantUnify (code)
- …Mixed-Modal Early-Fusion Foundation…
- …Personalized Multimodal Generation with Large… (Shen, Zhang)
- ELLA (code)
- …Tokenizing Strokes for Vector Graphic… (Tang, Wu)
- VL-GPT (code)
- SLD (code)
- Idea2Img (code)
- Mini-DALLE3 (code)
- MiniGPT-5 (code)
- SwitchGPT (code)
- LayoutLLM-T2I (code)
- VPGen (code)
- …groundedDiffusion (code)
- …Anything-Pipeline (code)
- SUR-adapter (code)
- …DiffusionMaster (code)
### Datasets

### Non-LLM-based (CLIP/T5)
- …Free Lunch towards Style-Preserving… (Wang, Spinelli)
- InstantStyle (code)
- …Zero-shot Identity-Preserving Generation… (Wang, Bai)
- InstantID (code)
- PixArt-alpha (code)
- custom-diffusion (code)
## Video Generation

### Non-LLM-based
- …Priors Makes Text-to-Video Synthesis… (Cheng, Peng)
- …Training-free Diffuser Adaptation for… (Yang, Zhang)
- …A High-Performance Long Video Method… (Xu, Zou)
- EasyAnimate (code)
- …VAE: A Compatible Video VAE for Latent Video… (Zhao, Zhang)
- CV-VAE (code)
- …Diffusion Models are Training-free Motion… (Xiao, Zhou)
- …Video: Scaled Spatiotemporal Transformers for… (Menapace, Siarohin)
- …Overcoming Data Limitations for… (Chen, Zhang)
- …Animating Open-domain Images with… (Xing, Xia)
- Animate-A-Story (code)
- Make-Your-Video (code)
- …Time-Controllable Denoising for Image and… (Zhang, Jiang)
- FreeTraj (code)
- cond-image-leakage (code)
- ImageConductor (code)
- VideoCrafter (code)
- OpenDiT (code)
- Pandora (code)
- …Motion-Aware Customized Text-to-Video… (Wu, Li)
- …Tuning-Free Trajectory Control in Video… (Qiu, Chen)
- …and Solving Conditional Image Leakage… (Zhao, Zhu)
- …Conductor: Precision Control for Interactive… (Li, Wang)
- …Building Automatic Metrics to Simulate… (He, Jiang)
- VideoScore (code)
- …Real-World Visuomotor Policy Learning… (Liang, Liu)
### Datasets
- CelebV-Text (code)
- vila-100m (code)
- vtt-it (code)
- videoCC-data (code)
- …1M: A Large-Scale Dataset for Text-to-video… (Tan, Yang)
- …A Video Is Worth Thousands of Words… (Yang, Huang)
- …A Multimodal Trailer Video Dataset with… (Chi, Wang)
### 🔅 LLM-based
- …Generating Minute-level Long Videos with… (Wang, Xiong)
- …2: LLM-Enhanced World Models for Video… (Zhao, Wang)
- …Language-Driven Video Inpainting via Large… (Wu, Li)
- …groundedVideoDiffusion (code)
- VideoDirectorGPT (code)
- DirecT2V (code)
## 3D Generation

### 🔅 LLM-based

### Non-LLM-based (CLIP/T5)
- DreamPolisher (code)
- Consistent3D (code)
- AToM (code)
- LucidDreamer (code)
- SweetDreamer (code)
- Classifier-Score-Distillation (code)
- …nerf (code)
- …text-to-3D (code)
- Perp-Neg-stablediffusion (code)
- sjc (code)
- describe3d (code)
- 3DFuse-threestudio (code)
- Fantasia3D (code)
- X-Mesh (code)
- threestudio (code)
- …Forge (code)
- tango (code)
- …Mesh (code)
### Datasets
## 3D Editing

### 🔅 LLM-based

### Non-LLM-based (CLIP/T5)
- …paintbrush (code)
- ClipFace (code)
## Audio Generation

### 🔅 LLM-based
- songcomposer (code)
- ChatMusician (code)
- LLaSM (code)
### Non-LLM-based
- tango (code)
### Datasets
- jamendo-dataset (code)
## Audio Understanding

### Non-LLM-based (CLIP/T5)
- llark (code)
- MU-LLaMA (code)
- AudioGPT (demo)
## Audio Editing

### 🔅 LLM-based
- …copilot (code)
### Non-LLM-based (CLIP/T5)
- …agent (code)
- AutoGPT (code)
## Generation with Multiple Modalities

### 🔅 LLM-based
- …Conditional Multimodal Content Generation… (Wang, Duan)
- …Tokenize and Embed ALL for Multi-modal Large… (Yang, Zhang)
### Non-LLM-based
- …and Hearing: Open-domain Visual-Audio…with… (Xing, He)
- …Benchmarking Text-to-Audible-Video… (Mao, Shen)
## Image Editing

### 🔅 LLM-based
- …pix2pix (code)
- …Exploring Complex Instruction-Based…with… (Huang, Xie)
- …Revolutionizing Text-based Image Editing for… (Zhang, Kang)
### Non-LLM-based (CLIP/T5)
- DragonDiffusion (code)
- …diffusion (code)
- SINE (code)
- …and-play (code)
- …to-prompt (code)
- DiffEdit-stable-diffusion (code)
- DiffusionCLIP (code)
## Video Editing

### 🔅 LLM-based
- instruct-video-to-video (code)
### Non-LLM-based (CLIP/T5)
- …A-Protagonist (code)
- …Audio-Driven Video Scene Editing… (Shen, Quan)
## Image Understanding

### Non-LLM-based (CLIP/T5)
## Video Understanding

### Non-LLM-based (CLIP/T5)
- …2: Advancing Spatial-Temporal Modeling… (Cheng, Leng)
- …: Parameter-free LLaVA Extension from Images… (Xu, Zhao)
- PLLaVA (code)
- Video-Bench (code)
- Video-LLaVA (code)
- …XL: Extra-Long Vision Language Model for… (Shu, Zhang)
- …XL (code)
- …MLLM: On-Demand Spatial-Temporal Understanding… (Liu, Dong)
- Oryx (code)
## 3D Understanding

### Non-LLM-based (CLIP/T5)
## Attack

### Non-LLM-based (CLIP/T5)
- …ssp (code)
- …Reminder (code)
- …hijacks (code)
- llm-attacks (code)
- …llms (code)
- universal-triggers (code)
- …squad (code)
## Defense and Detect

### Non-LLM-based (CLIP/T5)
- …pretrain-code (code)
- …llm (code)
- …diffusion (code)
- safe-latent-diffusion (code)
- …Training-Data-from-Large-Langauge-Models (code)
## Alignment

### Non-LLM-based (CLIP/T5)
## Datasets

### Non-LLM-based (CLIP/T5)
- …Safety-Collection (code)
- …SafetyBench (code)
- …access-control (code)
- SafetyBench (code)
- Safety-Prompts (code)
## 3D, Video and Audio Safety

### Non-LLM-based (CLIP/T5)
- Adv3D (code)
- …Multi-modal-Multi-scale-Transformers-for-Deepfake-Detection (code)
## LLM

### Non-LLM-based (CLIP/T5)
- …timeline-of-llms (code)
- …llms.github.io (project page)
## Vision

### Non-LLM-based (CLIP/T5)
- …Models-in-Vision-A-Survey (code)
- …Models-in-Vision-Survey (code), TPAMI 2023
## Multiple modalities

### Non-LLM-based (CLIP/T5)
- …A Multimodal Autoregressive Model for… (Piergiovanni, Noble)
## Categories

- Image Generation (106)
- 3D Generation (90)
- Video Generation (82)
- Image Editing (44)
- Audio Generation (34)
- Audio Understanding (26)
- Video Understanding (25)
- Video Editing (23)
- 3D Editing (21)
- Attack (20)
- Datasets (18)
- Defense and Detect (17)
- Image Understanding (13)
- Audio Editing (12)
- 3D Understanding (10)
- 3D, Video and Audio Safety (8)
- Generation with Multiple Modalities (8)
- Vision (6)
- Alignment (6)
- LLM (6)
- Multiple modalities (1)