
ai-game-devtools

Here we will keep track of the latest AI Game Development Tools, including LLM, World Model, Agent, Code, Image, Texture, Shader, 3D Model, Animation, Video, Audio, Music, Singing Voice and Analytics. 🔥
https://github.com/Yuan-ManX/ai-game-devtools


  • <span id="avatar">Avatar</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • Ready Player Me
      • HeadSculpt
      • Text2Control3D - Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model. |[arXiv](https://arxiv.org/abs/2309.03550) | | Avatar |
      • VLOGGER
      • Wild2Avatar
      • RodinHD - High-Fidelity 3D Avatar Generation with Diffusion Models. |[arXiv](https://arxiv.org/abs/2407.06938) | | Avatar |
      • MotionGPT - A Unified Motion-Language Generation Model using LLMs. |[arXiv](https://arxiv.org/abs/2306.14795) | | Avatar |
      • Duix - Silicon-Based Digital Human SDK 🌐🤖 | | | Avatar |
      • LivePortrait
      • MusePose - A Pose-Driven Image-to-Video Framework for Virtual Human Generation. | | | Avatar |
      • MuseV - Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising. | | | Avatar |
      • AniPortrait - Audio-Driven Synthesis of Photorealistic Portrait Animations. |[arXiv](https://arxiv.org/abs/2403.17694) | | Avatar |
      • GeneFace++ - Generalized and Stable Real-Time 3D Talking Face Generation. | | | Avatar |
      • MuseTalk - Real-Time High Quality Lip Synchronization with Latent Space Inpainting. | | | Avatar |
      • ChatdollKit
      • Hallo - Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation. |[arXiv](https://arxiv.org/abs/2406.08801) | | Avatar |
      • DreamTalk
      • CALM
      • EMOPortraits - Emotion-enhanced Multimodal One-shot Head Avatars. | | | Avatar |
      • E3 Gen
      • GeneAvatar - Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image. |[arXiv](https://arxiv.org/abs/2404.02152) | | Avatar |
      • IntrinsicAvatar
      • Linly-Talker
      • Portrait4D - Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data. |[arXiv](https://arxiv.org/abs/2311.18729) | | Avatar |
      • StyleAvatar3D - Leveraging Image-Text Diffusion Models for High-Fidelity 3D Avatar Generation. |[arXiv](https://arxiv.org/abs/2305.19012) | | Avatar |
      • Topo4D - Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture. |[arXiv](https://arxiv.org/abs/2406.00440) | | Avatar |
      • UnityAIWithChatGPT
      • Vid2Avatar - 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition. |[arXiv](https://arxiv.org/abs/2302.11566) | | Avatar |
      • ExAvatar - Expressive Whole-Body 3D Gaussian Avatar. |[arXiv](https://arxiv.org/abs/2407.21686) | | Avatar |
      • Hallo2 - Long-Duration and High-Resolution Audio-Driven Portrait Image Animation. |[arXiv](https://arxiv.org/abs/2410.07718) | | Avatar |
      • Ditto - Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis. |[arXiv](https://arxiv.org/abs/2411.19509) | | Avatar |
      • EmoVOCA - Speech-Driven Emotional 3D Talking Heads. |[arXiv](https://arxiv.org/abs/2403.12886) | | Avatar |
      • HunyuanVideo-Avatar - High-Fidelity Audio-Driven Human Animation for Multiple Characters. |[arXiv](https://arxiv.org/abs/2505.20156) | | Avatar |
      • HunyuanPortrait
      • StableAvatar - Infinite-Length Audio-Driven Avatar Video Generation. |[arXiv](https://arxiv.org/abs/2508.08248) | | Avatar |
      • EchoMimic - Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions. |[arXiv](https://arxiv.org/abs/2407.08136) | | Avatar |
    • <span id="tool">Tool (AI LLM)</span>

      • ChatAvatar
      • EchoMimic - Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions. |[arXiv](https://arxiv.org/abs/2407.08136) | | Avatar |
  • <span id="model">3D Model</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • Luma AI
      • CSM
      • Instruct-NeRF2NeRF
      • ProlificDreamer - High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. |[arXiv](https://arxiv.org/abs/2305.16213) | | Model |
      • One-2-3-45 - Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization. |[arXiv](https://arxiv.org/abs/2306.16928) | | Model |
      • 3Dpresso
      • Meshy
      • LATTE3D - Large-scale Amortized Text-To-Enhanced3D Synthesis. |[arXiv](https://arxiv.org/abs/2403.15385) | | 3D |
      • Blockade Labs - the ultimate AI-powered solution for generating incredible 360° skybox experiences from text prompts. | | | Model |
      • Dash
      • GenieLabs - Empowering games with AI-UGC. | | | 3D |
      • lumine AI - AI-Powered Creativity. | | | 3D |
      • Sloyd
      • SV3D - Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion. |[arXiv](https://arxiv.org/abs/2403.12008) | | 3D |
      • Tafi
      • 3D-GPT
      • Voxcraft - Generate easy-to-use 3D models with AI. | | | 3D |
      • TripoSR - A state-of-the-art open-source model for fast feedforward 3D reconstruction from a single image. |[arXiv](https://arxiv.org/abs/2403.02151) | | Model |
      • Any2Point - Empowering Any-modality Large Models for Efficient 3D Understanding. |[arXiv](https://arxiv.org/abs/2404.07989) | | 3D |
      • Anything-3D - Segment-Anything + 3D. Let's lift anything to 3D. |[arXiv](https://arxiv.org/abs/2304.10261) | | Model |
      • Point·E
      • Shap-E
      • Wonder3D - Single Image to 3D using Cross-Domain Diffusion. |[arXiv](https://arxiv.org/abs/2310.15008) | | 3D |
      • NVIDIA Instant NeRF
      • HiFA - High-fidelity Text-to-3D with Advanced Diffusion Guidance. | | | Model |
      • Infinigen
      • threestudio
      • BlenderGPT - Use commands in English to control Blender with OpenAI's GPT-4. | | Blender | Model |
      • Stable Dreamfusion - A PyTorch implementation of the text-to-3D model Dreamfusion, powered by the Stable Diffusion text-to-2D model. | | | Model |
      • UnityGaussianSplatting
      • GaussianDreamer
      • Unique3D - High-Quality and Efficient 3D Mesh Generation from a Single Image. |[arXiv](https://arxiv.org/abs/2405.20343) | | 3D |
      • Blender-GPT - An all-in-one Blender assistant powered by GPT3/4 + Whisper integration. | | Blender | Model |
      • CF-3DGS - COLMAP-Free 3D Gaussian Splatting. |[arXiv](https://arxiv.org/abs/2312.07504) | | 3D |
      • CharacterGen - Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization. |[arXiv](https://arxiv.org/abs/2402.17214) | | 3D |
      • chatGPT-maya
      • DreamGaussian4D
      • DUSt3R
      • GALA3D - Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting. |[arXiv](https://arxiv.org/abs/2402.07207) | | 3D |
      • GaussCtrl - Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing. |[arXiv](https://arxiv.org/abs/2403.08733) | | 3D |
      • GaussianCube
      • HoloDreamer
      • Interactive3D
      • Isotropic3D - Image-to-3D Generation Based on a Single CLIP Embedding. | | | 3D |
      • LION
      • Make-It-3D - High-Fidelity 3D Creation from A Single Image with Diffusion Prior. |[arXiv](https://arxiv.org/abs/2303.14184) | | Model |
      • MVDream - Multi-view Diffusion for 3D Generation. |[arXiv](https://arxiv.org/abs/2308.16512) | | 3D |
      • Paint3D - Paint Anything 3D with Lighting-Less Texture Diffusion Models. |[arXiv](https://arxiv.org/abs/2312.13913) | | 3D |
      • PAniC-3D - Stylized Single-view 3D Reconstruction from Portraits of Anime Characters. |[arXiv](https://arxiv.org/abs/2303.14587) | | Model |
      • 3DTopia - Text-to-3D Generation within 5 Minutes. |[arXiv](https://arxiv.org/abs/2403.02234) | | 3D |
      • ViVid-1-to-3
      • Zero-1-to-3 - Zero-shot One Image to 3D Object. |[arXiv](https://arxiv.org/abs/2303.11328) | | Model |
      • SF3D - Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. |[arXiv](https://arxiv.org/abs/2408.00653) | | 3D |
      • 3DTopia-XL - Scaling High-quality 3D Asset Generation via Primitive Diffusion. |[arXiv](https://arxiv.org/abs/2409.12957) | | 3D |
      • Animate3D - Animating Any 3D Model with Multi-view Video Diffusion. |[arXiv](https://arxiv.org/abs/2407.11398) | | 3D |
      • Hunyuan3D-1.0 - A Unified Framework for Text-to-3D and Image-to-3D Generation. |[arXiv](https://arxiv.org/abs/2411.02293) | | 3D |
      • Edify 3D - Scalable High-Quality 3D Asset Generation. |[arXiv](https://arxiv.org/abs/2411.07135) | | 3D |
      • BlenderMCP
      • Direct3D-S2 - Gigascale 3D Generation Made Easy with Spatial Sparse Attention. |[arXiv](https://arxiv.org/abs/2505.17412) | | 3D |
      • Step1X-3D - Towards High-Fidelity and Controllable Generation of Textured 3D Assets. |[arXiv](https://arxiv.org/abs/2505.07747) | | 3D |
      • Hunyuan3D 2.1 - From Images to High-Fidelity 3D Assets with Production-Ready PBR Material. |[arXiv](https://arxiv.org/abs/2506.15442) | | 3D |
      • PhysRig - Differentiable Physics-Based Rigging for Realistic Articulated Object Modeling. |[arXiv](https://arxiv.org/abs/2506.20936) | | Model |
      • CityDreamer
      • DreamCatalyst - Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation. |[arXiv](https://arxiv.org/abs/2407.11394) | | 3D |
      • Hunyuan3D 2.0
      • 3D-LLM
    • <span id="tool">Tool (AI LLM)</span>

  • <span id="image">Image</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • Lexica
      • Photoroom
      • Midjourney
      • Imagen
      • ClipDrop
      • Segment Anything
      • DeepAI
      • Prompt.Art
      • StyleDrop - Text-To-Image Generation in Any Style. |[arXiv](https://arxiv.org/abs/2306.00983) | | Image |
      • PhotoMaker
      • Draw Things - AI-assisted image generation in Your Pocket. | | | Image |
      • Diffuse to Choose - Enriching Image Conditioned Inpainting in Virtual Try-All. |[arXiv](https://arxiv.org/abs/2401.13795) | | Image |
      • DALL·E 2
      • AutoStudio - Crafting Consistent Subjects in Multi-turn Interactive Image Generation. |[arXiv](https://arxiv.org/abs/2406.01388) | | Image |
      • Ideogram
      • SDXL-Lightning
      • Vispunk Visions - Text-to-Image generation platform. | | | Image |
      • Img2Prompt
      • Stable Diffusion - A latent text-to-image diffusion model. | | | Image |
      • sd-webui-controlnet
      • Fooocus
      • Stable Diffusion web UI
      • Stable Diffusion web UI - Web-based UI for Stable Diffusion. | | | Image |
      • StableStudio
      • DragGAN - Interactive Point-based Manipulation on the Generative Image Manifold. |[arXiv](https://arxiv.org/abs/2305.10973) | | Image |
      • Stable Cascade
      • Grounded-Segment-Anything
      • Depth Anything V2
      • StreamDiffusion - A Pipeline-Level Solution for Real-Time Interactive Generation. | | | Image |
      • ComfyUI
      • Kolors - Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis. | | | Image |
      • AnyText
      • EasyPhoto
      • Disco Diffusion
      • ControlNet
      • Omost
      • SEED-Story - Multimodal Long Story Generation with Large Language Model. |[arXiv](https://arxiv.org/abs/2407.08683) | | Image |
      • Outfit Anyone - Ultra-high quality virtual try-on for Any Clothing and Any Person. | | | Image |
      • CatVTON - Concatenation Is All You Need for Virtual Try-On with Diffusion Models. |[arXiv](https://arxiv.org/abs/2407.15886) | | Image |
      • MimicBrush - Zero-shot Image Editing with Reference Imitation. |[arXiv](https://arxiv.org/abs/2406.07547) | | Image |
      • PuLID
      • Rich-Text-to-Image - Expressive Text-to-Image Generation with Rich Text. |[arXiv](https://arxiv.org/abs/2304.06720) | | Image |
      • LaVi-Bridge - Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation. |[arXiv](https://arxiv.org/abs/2403.07860) | | Image |
      • ConceptLab
      • RPG-DiffusionMaster - Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG). | | | Image |
      • MIGC - Multi-Instance Generation Controller for Text-to-Image Synthesis. |[arXiv](https://arxiv.org/abs/2402.05408) | | Image |
      • img2img-turbo - One-Step Image-to-Image with SD-Turbo. | | | Image |
      • stable-diffusion.cpp
      • DeepFloyd IF
      • Stable.art
      • Hua
      • CLIPasso
      • Openpose Editor - Openpose editor for stable-diffusion-webui. | | | Image |
      • PaintsUndo
      • Stable Diffusion WebUI Chinese - Chinese localization for stable-diffusion-webui. | | | Image |
      • SyncDreamer - Generating Multiview-consistent Images from a Single-view Image. |[arXiv](https://arxiv.org/abs/2309.03453) | | Image |
      • Hunyuan-DiT - A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. |[arXiv](https://arxiv.org/abs/2405.08748) | | Image |
      • IC-Light - IC-Light is a project to manipulate the illumination of images. | | | Image |
      • Depth map library and poser - Depth map library and pose editor for stable-diffusion-webui. | | | Image |
      • InternLM-XComposer2 - XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension. |[arXiv](https://arxiv.org/abs/2401.16420) | | Image |
      • AnyDoor - shot Object-level Image Customization. |[arXiv](https://arxiv.org/abs/2307.09481) | | Image |
      • Blender-ControlNet
      • BriVL
      • DWPose - Effective Whole-body Pose Estimation with Two-stages Distillation. |[arXiv](https://arxiv.org/abs/2307.15880) | | Image |
      • Follow-Your-Click - Open-domain Regional Image Animation via Short Prompts. |[arXiv](https://arxiv.org/abs/2403.08268) | | Image |
      • GIFfusion
      • KOALA - Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis. | | | Image |
      • LlamaGen
      • SDXS - Real-Time One-Step Latent Diffusion Models with Image Conditions. | | | Image |
      • UltraEdit - Instruction-based Fine-Grained Image Editing at Scale. |[arXiv](https://arxiv.org/abs/2407.05282) | | Image |
      • UltraPixel - Advancing Ultra-High-Resolution Image Synthesis to New Peaks. |[arXiv](https://arxiv.org/abs/2407.02158) | | Image |
      • Unity ML Stable Diffusion
      • Flux - Text-to-image and image-to-image with the Flux latent rectified flow transformers. | | | Image |
      • Lumina-mGPT - Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining. |[arXiv](https://arxiv.org/abs/2408.02657) | | Image |
      • HivisionIDPhotos
      • CSGO - Content-Style Composition in Text-to-Image Generation. |[arXiv](https://arxiv.org/abs/2408.16766) | | Image |
      • OmniGen
      • StoryMaker - Towards Holistic Consistent Characters in Text-to-image Generation. |[arXiv](https://arxiv.org/abs/2409.12576) | | Image |
      • Stable Diffusion 3.5
      • Infinity - Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis. |[arXiv](https://arxiv.org/abs/2412.04431) | | Image |
      • KREA - AI-powered design tool. | | | Image |
      • Lumina-Image 2.0 - A Unified and Efficient Image Generative Model. | | | Image |
      • MakeAnything - Using Diffusion Transformers for Multi-Domain Procedural Sequence Generation. |[arXiv](https://arxiv.org/abs/2502.01572) | | Image |
      • Stable Diffusion XL Turbo - Real-Time Text-to-Image Generation. | | | Image |
      • Stable Doodle - A sketch-to-image tool that converts a simple drawing into a dynamic image. | | | Image |
      • Komiko - An AI-powered storytelling platform that lets you create original characters, comics, and animations with ease. | | | Comic |
      • BAGEL - Unified Model for Multimodal Understanding and Generation. BAGEL is an open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data. |[arXiv](https://arxiv.org/abs/2505.14683) | | Image |
      • OmniGen2
      • PosterCraft - Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework. |[arXiv](https://arxiv.org/abs/2506.10741) | | Image |
      • SkyworkUniPic - Unified Autoregressive Modeling for Visual Understanding and Generation. | | | Image |
      • Qwen-Image-Edit - The image editing version of Qwen-Image; it extends Qwen-Image's unique text rendering capabilities to image editing tasks, enabling precise text editing. |[arXiv](https://arxiv.org/abs/2508.02324) | | Image |
      • NextStep-1 - Toward Autoregressive Image Generation with Continuous Tokens at Scale. |[arXiv](https://arxiv.org/abs/2508.10711) | | Image |
      • USO - Unified Style and Subject-Driven Generation via Disentangled and Reward Learning. |[arXiv](https://arxiv.org/abs/2508.18966) | | Image |
      • HunyuanImage-2.1 - An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation. | | | Image |
      • IRG - Interleaving Reasoning for Better Text-to-Image Generation. |[arXiv](https://arxiv.org/abs/2509.06945) | | Image |
      • PromptEnhancer - A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting. |[arXiv](https://www.arxiv.org/abs/2509.04545) | | Image |
      • MetaShoot
      • HunyuanImage-3.0 - A Powerful Native Multimodal Model for Image Generation. | | | Image |
      • Segment Anything Model 2 (SAM 2)
    • <span id="tool">Tool (AI LLM)</span>

  • <span id="video">Video</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • MagicVideo
      • Descript
      • Emu Video - Factorizing Text-to-Video Generation by Explicit Image Conditioning. | | | Video |
      • Video LDMs - Align your Latents: High-resolution Video Synthesis with Latent Diffusion Models. |[arXiv](https://arxiv.org/abs/2304.08818) | | Video |
      • Make-A-Video - Make-A-Video is a state-of-the-art AI system that generates videos from text. |[arXiv](https://arxiv.org/abs/2209.14792) | | Video |
      • Imagen Video - Generates high definition videos from a text prompt using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. | | | Video |
      • VideoCrafter1 - Open Diffusion Models for High-Quality Video Generation. |[arXiv](https://arxiv.org/abs/2310.19512) | | Video |
      • MobileVidFactory - Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text. | | | Video |
      • MovieFactory
      • VideoFactory - Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation. | | | Video |
      • Pollinations
      • UniVG - Towards UNIfied-modal Video Generation. | | | Video |
      • Follow Your Pose - Pose-Guided Text-to-Video Generation using Pose-Free Videos. |[arXiv](https://arxiv.org/abs/2304.01186) | | Video |
      • Make-Your-Video
      • VideoComposer
      • VideoCrafter2 - Overcoming Data Limitations for High-Quality Video Diffusion Models. |[arXiv](https://arxiv.org/abs/2401.09047) | | Video |
      • I2VGen-XL - High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models. |[arXiv](https://arxiv.org/abs/2311.04145) | | Video |
      • MotionCtrl
      • HiGen - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation. | | | Video |
      • TF-T2V - A Recipe for Scaling up Text-to-Video Generation with Text-free Videos. |[arXiv](https://arxiv.org/abs/2312.15770) | | Video |
      • VideoPoet - A large language model for zero-shot video generation. |[arXiv](https://arxiv.org/abs/2312.14125) | | Video |
      • GenTron
      • W.A.L.T
      • VideoDrafter - Content-Consistent Multi-Scene Video Generation with LLM. |[arXiv](https://arxiv.org/abs/2401.01256) | | Video |
      • VideoLCM
      • NeverEnds
      • Neural Frames
      • Lumiere - A Space-Time Diffusion Model for Video Generation. |[arXiv](https://arxiv.org/abs/2401.12945) | | Video |
      • InstructVideo
      • Phenaki
      • Boximator
      • Sora
      • Genie
      • AtomoVideo - High Fidelity Image-to-Video Generation. |[arXiv](https://arxiv.org/abs/2403.01800) | | Video |
      • TwelveLabs
      • Anything in Any Scene
      • Assistive
      • CogVideo
      • Decohere
      • DomoAI
      • DynamiCrafter - Animating Open-domain Images with Video Diffusion Priors. |[arXiv](https://arxiv.org/abs/2310.12190) | | Video |
      • Etna
      • Fairy - Fast Parallelized Instruction-Guided Video-to-Video Synthesis. | | | Video |
      • Generative Dynamics
      • Genmo
      • LTX Studio - An AI-driven filmmaking platform for creators, marketers, filmmakers and studios. | | | Video |
      • MagicVideo-V2 - Multi-Stage High-Aesthetic Video Generation. |[arXiv](https://arxiv.org/abs/2401.04468) | | Video |
      • Magic Hour
      • MAGVIT-v2
      • MAGVIT
      • Make Pixels Dance - High-Dynamic Video Generation. |[arXiv](https://arxiv.org/abs/2311.10982) | | Video |
      • Morph Studio - Text-to-Video AI Magic, manifest your creativity through your prompt. | | | Video |
      • TATS - Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer. | | | Video |
      • Vispunk Motion
      • Zeroscope - Text-to-Video. | | | Video |
      • Video-ChatGPT - Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. |[arXiv](https://arxiv.org/abs/2306.05424) | | Video |
      • Stable Video Diffusion - Image-to-Video. | | | Video |
      • Video-LLaVA
      • Track-Anything - Anything is a flexible and interactive tool for video object tracking and segmentation, based on Segment Anything and XMem. |[arXiv](https://arxiv.org/abs/2304.11968) | | Video |
      • Open-Sora - Open-Sora Plan, a project aiming to reproduce OpenAI's Sora. | | | Video |
      • MoneyPrinterTurbo
      • StreamingT2V
      • ShortGPT
      • BackgroundRemover
      • Tune-A-Video - One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. |[arXiv](https://arxiv.org/abs/2212.11565) | | Video |
      • Open-Sora
      • StoryDiffusion - Consistent Self-Attention for Long-Range Image and Video Generation. |[arXiv](https://arxiv.org/abs/2405.01434) | | Video |
      • Text2Video-Zero - Text-to-Image Diffusion Models are Zero-Shot Video Generators. |[arXiv](https://arxiv.org/abs/2303.13439) | | Video |
      • MOFA-Video - Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model. |[arXiv](https://arxiv.org/abs/2405.20222) | | Video |
      • LVDM - Latent Video Diffusion Models for High-Fidelity Long Video Generation. |[arXiv](https://arxiv.org/abs/2211.13221) | | Video |
      • LaVie - High-Quality Video Generation with Cascaded Latent Diffusion Models. |[arXiv](https://arxiv.org/abs/2309.15103) | | Video |
      • Show-1 - Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation. |[arXiv](https://arxiv.org/abs/2309.15818) | | Video |
      • ART•V - Auto-Regressive Text-to-Video Generation with Diffusion Models. |[arXiv](https://arxiv.org/abs/2311.18834) | | Video |
      • StyleCrafter - Enhancing Stylized Text-to-Video Generation with Style Adapter. |[arXiv](https://arxiv.org/abs/2312.00330) | | Video |
      • VGen
      • VideoElevator - Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models. |[arXiv](https://arxiv.org/abs/2403.05438) | | Video |
      • Reuse and Diffuse - Iterative Denoising for Text-to-Video Generation. |[arXiv](https://arxiv.org/abs/2309.03549) | | Video |
      • MicroCinema - A Divide-and-Conquer Approach for Text-to-Video Generation. |[arXiv](https://arxiv.org/abs/2311.18829) | | Video |
      • StableVideo - Text-driven Consistency-aware Diffusion Video Editing. | | | Video |
      • MotionDirector - Motion Customization of Text-to-Video Diffusion Models. |[arXiv](https://arxiv.org/abs/2310.08465) | | Video |
      • CogVideoX - An open-source version of the video generation model homologous to QingYing (清影). | | | Video |
      • CogVLM - An open-source visual language model (VLM). | | | Visual |
      • V-JEPA
      • dolphin
      • Hotshot-XL - XL is an AI text-to-GIF model trained to work alongside Stable Diffusion XL. | | | Video |
      • Mov2mov - Video-to-video extension for stable-diffusion-webui. | | | Video |
      • Diffutoon - High-Resolution Editable Toon Shading via Diffusion Models. |[arXiv](https://arxiv.org/abs/2401.16224) | | Video |
      • 360DVD - Controllable Panorama Video Generation with 360-Degree Video Diffusion Model. |[arXiv](https://arxiv.org/abs/2401.06578) | | Video |
      • CoDeF
      • EDGE - EDGE generates realistic, physically-plausible dances while remaining faithful to arbitrary input music. |[arXiv](https://arxiv.org/abs/2211.10658) | | Video |
      • EMO - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions. |[arXiv](https://arxiv.org/abs/2402.17485) | | Video |
      • Mora
      • Motionshop
      • Snap Video - Scaled Spatiotemporal Transformers for Text-to-Video Synthesis. |[arXiv](https://arxiv.org/abs/2402.14797) | | Video |
      • SoraWebui - An open-source Sora web client, enabling users to easily create videos from text with OpenAI's Sora model. | | | Video |
      • VideoGen - A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation. |[arXiv](https://arxiv.org/abs/2309.00398) | | Video |
      • VideoMamba
      • Video-of-Thought - Step-by-Step Video Reasoning from Perception to Cognition. | | | Video |
      • VisualRWKV - A visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks. | | | Visual |
      • Tora - Trajectory-oriented Diffusion Transformer for Video Generation. |[arXiv](https://arxiv.org/abs/2407.21705) | | Video |
      • FullJourney
      • Follow-Your-Canvas - Higher-Resolution Video Outpainting with Extensive Content Generation. |[arXiv](https://arxiv.org/abs/2409.01055) | | Video |
      • DreamCinema
      • ViewCrafter - Taming Video Diffusion Models for High-fidelity Novel View Synthesis. |[arXiv](https://arxiv.org/abs/2409.02048) | | Video |
      • Vchitect-2.0 - Parallel Transformer for Scaling Up Video Diffusion Models. | | | Video |
      • MIMO
      • LTX-Video - Video is the first DiT-based video generation model that can generate high-quality videos in real-time. It can generate 24 FPS videos at 768x512 resolution, faster than it takes to watch them. | | | Video |
      • Ruyi - An image-to-video model capable of generating cinematic-quality videos at a resolution of 768, with a frame rate of 24 frames per second, totaling 5 seconds and 120 frames. | | | Video |
      • SkyReels-A1 - Expressive Portrait Animation in Video Diffusion Transformers. |[arXiv](https://arxiv.org/abs/2502.10841) | | Video |
      • SkyReels-V1 - Human-Centric Video Foundation Model. | | | Video |
      • Step-Video-T2V - Technical Report: The Practice, Challenges, and Future of Video Foundation Model. |[arXiv](https://arxiv.org/abs/2502.10248) | | Video |
      • Wan2.1 - Open and Advanced Large-Scale Video Generative Models. | | | Video |
      • Pika Labs - Redefining the video-making experience with AI. | | | Video |
      • MoviiGen 1.1 - Towards High-Quality Video Generative Models. A cutting-edge video generation model that excels in cinematic aesthetics and visual quality, fine-tuned from Wan2.1. In evaluations by 11 professional filmmakers and AIGC creators across 60 aesthetic dimensions, it demonstrates superior performance in key cinematic aspects. | | | Video |
      • Wan2.2 - Open and Advanced Large-Scale Video Generative Models. |[arXiv](https://arxiv.org/abs/2503.20314) | | Video |
      • Waver - A next-generation, universal foundation model family for unified image and video generation, built on rectified flow Transformers and engineered for industry-grade performance. |[arXiv](https://arxiv.org/abs/2508.15761) | | Video |
      • InfiniteTalk - Audio-driven Video Generation for Sparse-Frame Video Dubbing. |[arXiv](https://arxiv.org/abs/2508.14033) | | Video |
      • HuMo - Human-Centric Video Generation via Collaborative Multi-Modal Conditioning. |[arXiv](https://arxiv.org/abs/2509.08519) | | Video |
      • LongLive - Real-time Interactive Long Video Generation. |[arXiv](https://arxiv.org/abs/2509.22622) | | Video |
      • Lynx - Towards High-Fidelity Personalized Video Generation. |[arXiv](https://arxiv.org/abs/2509.15496) | | Video |
      • Ovi - Twin Backbone Cross-Modal Fusion for Audio-Video Generation. |[arXiv](https://arxiv.org/abs/2510.01284) | | Video |
      • CoNR - Collaborative Neural Rendering using hand-drawn anime character sheets (ACS). |[arXiv](https://arxiv.org/abs/2207.05378) | | Video |
      • HunyuanVideo
      • Mini-Gemini - Mining the Potential of Multi-modality Vision Language Models. | | | Vision |
      • Mochi 1 - A state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation. | | | Video |
      • MotionClone - Training-Free Motion Cloning for Controllable Video Generation. |[arXiv](https://arxiv.org/abs/2406.05338) | | Video |
    • <span id="tool">Tool (AI LLM)</span>

      • Gen-2 - A multi-modal AI system that can generate novel videos with text, images, or video clips. | | | Video |
      • FullJourney
      • Moonvalley - A text-to-video generative AI model. | | | Video |
      • Pixeling - Hyper-realistic and extremely controllable visual content including images, videos and 3D models. | | | Video |
      • PixVerse - Create breath-taking videos with AI. | | | Video |
  • <span id="music">Music</span>

    • <span id="tool">LLM (LLM & Tool)</span>

    • <span id="tool">Tool (AI LLM)</span>

      • JEN-1 - Text-Guided Universal Music Generation with Omnidirectional Diffusion Models. | | | Music |
  • Project List

    • <span id="tool">LLM (LLM & Tool)</span>

      • Mixtral 8x7B - A high-quality Sparse Mixture-of-Experts. |[arXiv](https://arxiv.org/abs/2401.04088) | | Tool |
      • HuggingChat
      • Pi
      • NovelAI
      • Dora
      • Grok-1 - Open release of the Mixture-of-Experts model, Grok-1. | | | Tool |
      • GPT-4o - GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. | | | Tool |
      • CogVLM - An open-source visual language foundation model. |[arXiv](https://arxiv.org/abs/2311.03079) | | Tool |
      • Moshi
      • Nemotron-4 - Nemotron-4 15B is a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. |[arXiv](https://arxiv.org/abs/2402.16819) | | Tool |
      • ShareGPT4V - Improving Large Multi-Modal Models with Better Captions. | | | Tool |
      • NExT-GPT - Any-to-Any Multimodal Large Language Model. | | | Tool |
      • Stanford Alpaca - An Instruction-following LLaMA Model. | | | LLM |
      • GPT4All
      • ChatYuan
      • StableLM
      • Novel - Notion-style WYSIWYG editor with AI-powered autocompletions. | | | Writer |
      • MLC LLM
      • LLaSM
      • Text generation web UI - A Gradio web UI for Large Language Models like LLaMA, llama.cpp, GPT-J, OPT, and GALACTICA. | | | Tool |
      • MiniGPT-4 - Enhancing Vision-language Understanding with Advanced Large Language Models. |[arXiv](https://arxiv.org/abs/2304.10592) | | Tool |
      • MetaGPT - The Multi-Agent Framework. | | | Tool |
      • BabyAGI - An AI-powered task management system. | | | Tool |
      • AgentGPT
      • CoreNet
      • Open-Assistant - A chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so. | | | Tool |
      • Panda - Chinese open-source LLMs based on LLaMA-7B, -13B, -33B, and -65B for continuous pre-training in the Chinese field. | | | Tool |
      • ChatRWKV
      • Lit-LLaMA - Implementation of the LLaMA language model based on nanoGPT, supporting LLaMA-Adapter fine-tuning and pre-training. | | | Tool |
      • Assistant CLI
      • SearchGPT
      • InternLM - InternLM has open-sourced a 7 billion parameter base model, a chat model tailored for practical scenarios, and the training system. |[arXiv](https://arxiv.org/abs/2403.17297) | | Tool |
      • gemma.cpp
      • llm.c
      • GPTScript
      • WebGPT
      • 👶🤖🖥️ BabyAGI UI
      • LaMini-LM - LaMini-LM is a collection of small-sized, efficient language models distilled from ChatGPT and trained on a large-scale dataset of 2.58M instructions. | | | Tool |
      • Llama 3
      • OneLLM
      • Jan
      • Perplexica - An AI-powered search engine. | | | Tool |
      • LLocalSearch
      • RepoAgent - An Open-Source project driven by Large Language Models (LLMs) that aims to provide an intelligent way to document projects. |[arXiv](https://arxiv.org/abs/2402.16667) | | Tool |
      • Bisheng
      • GLM-4 - GLM-4-9B is the open-source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. | | | Tool |
      • Lumina-T2X - A unified framework for Text to Any Modality Generation. |[arXiv](https://arxiv.org/abs/2405.05945) | | Tool |
      • Large World Model (LWM) - A general-purpose large-context multimodal autoregressive model. |[arXiv](https://arxiv.org/abs/2402.08268) | | Tool |
      • MiniCPM-2B - An end-side LLM that outperforms Llama2-13B. | | | Tool |
      • Lepton AI
      • ImageBind
      • LLM Answer Engine - Build a Perplexity-Inspired Answer Engine Using Next.js, Groq, Mixtral, Langchain, OpenAI, Brave & Serper. | | | Tool |
      • Flowise
      • WordGPT
      • Lamini - An LLM platform for developers to run fine-tuning on their own data. | | | Tool |
      • Chrome-GPT
      • 01 Project - The open-source language model computer. | | | Tool |
      • OLMo
      • Llama 3.1
      • Devika
      • Gemma - A family of lightweight, state-of-the-art open models built from research and technology used to create Google Gemini models. | | | Tool |
      • Yi
      • llama2-webui
      • Baichuan 2
      • Notebook.ai
      • LaVague
      • Devon - An open-source pair programmer. | | | Tool |
      • ToolBench
      • baichuan-7B - A large-scale 7B pretraining language model developed by Baichuan. | | | Tool |
      • DemoGPT - Auto Gen-AI App Generator with the Power of Llama 2. | | | Tool |
      • Web3-GPT
      • AICommand
      • Index-1.9B
      • DBRX
      • MobiLlama
      • mPLUG-Owl🦉
      • Baichuan-13B
      • InteractML-Unity
      • AI-Writer - AI novel writing based on a pre-trained generative model. | | | Writer |
      • Hugging Face API Unity Integration - An easy-to-use integration for the Hugging Face Inference API, allowing developers to access and use Hugging Face AI models within their Unity projects. | | Unity | Tool |
      • AIOS
      • Character-LLM - A Trainable Agent for Role-Playing. |[arXiv](https://arxiv.org/abs/2310.10158) | | Tool |
      • ChatGPT-API-unity
      • ChatGPTForUnity
      • Chinese-LLaMA-Alpaca-3 - Chinese Llama-3 LLMs developed from Meta Llama 3. | | | Tool |
      • DCLM
      • Design2Code - How Far Are We From Automating Front-End Engineering? | | | Tool |
      • Lemur
      • LLMUnity
      • LogicGamesSolver
      • MiniGPT-5 - and-Language Generation via Generative Vokens. |[arXiv](https://arxiv.org/abs/2310.02239) | | Tool |
      • Orion-14B - 14B is a family of models includes a 14B foundation LLM, and a series of models. |[arXiv](https://arxiv.org/abs/2401.12246) | | Tool |
      • Sanity AI Engine
      • Skywork - A series of models pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. | | | Tool |
      • TinyChatEngine - On-Device LLM Inference Library. | | | Tool |
      • Unity ChatGPT
      • Unreal Engine 5 Llama LoRA - A proof-of-concept project that showcases the potential for using small, locally trainable LLMs to create next-generation documentation tools. | | Unreal Engine | Tool |
      • UnrealGPT
      • AI Scientist - Towards Fully Automated Open-Ended Scientific Discovery. |[arXiv](https://arxiv.org/abs/2408.06292) | | Tool |
      • LongWriter
      • Moshi - A speech-text foundation model for real-time dialogue. | | | Tool |
      • Janus
      • DeepSeek-V3 - A strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. |[arXiv](https://arxiv.org/abs/2412.19437) | | LLM |
      • Cosmos
      • MiniMax-01 - Scaling Foundation Models with Lightning Attention. |[arXiv](https://arxiv.org/abs/2501.08313) | | LLM |
      • SkyThought - Sky-T1: Train your own O1 preview model within $450. | | | LLM |
      • DeepSeek-R1 - DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning performance. | | | LLM |
      • s1 - Simple test-time scaling. |[arXiv](https://arxiv.org/abs/2501.19393) | | LLM |
      • Open Deep Research - An AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, web scraping, and large language models. | | | LLM |
      • InteractML-Unreal Engine
      • LangChain
      • OpenDevin
      • Qwen3
      • Gemini
      • SimpleOllamaUnity
      • GLM-4.5 - An open-source large language model designed for intelligent agents by Z.ai. | | | LLM |
      • gpt-oss - gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI. | | | LLM |
      • Kimi K2 - A state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. | | | LLM |
      • Seed-OSS - A series of open-source large language models developed by ByteDance's Seed Team, designed for powerful long-context, reasoning, agent, and general capabilities, with versatile developer-friendly features. | | | LLM |
      • LongCat-Flash - A powerful and efficient language model with 560 billion total parameters, featuring an innovative Mixture-of-Experts (MoE) architecture. A dynamic computation mechanism activates 18.6B-31.3B parameters (averaging ~27B) based on contextual demands, optimizing both computational efficiency and performance. | | | LLM |
      • Hunyuan-MT - Comprises a translation model, Hunyuan-MT-7B, and an ensemble model, Hunyuan-MT-Chimera. The translation model translates source text into the target language, while the ensemble model integrates multiple translation outputs to produce a higher-quality result. | | | LLM |
      • MOSS - An open-source tool-augmented conversational language model from Fudan University. | | | Tool |
      • Auto-GPT - An open-source attempt to make GPT-4 fully autonomous. | | | Tool |
      • Qwen1.5
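Most of the open-weight models and serving tools above expose the same OpenAI-style chat interface. A minimal sketch of that request shape in plain Python; the `render_prompt` helper and its `role: content` template are illustrative assumptions, not any specific tool's API:

```python
# Sketch of the widely used chat "messages" structure (role/content dicts).
# render_prompt is a hypothetical helper for plain-text backends.

def make_messages(system: str, user: str) -> list[dict]:
    """Build a chat request body in the common role/content format."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

def render_prompt(messages: list[dict]) -> str:
    """Flatten messages into a single prompt string for plain-text backends."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

if __name__ == "__main__":
    msgs = make_messages("You are an NPC dialogue writer.", "Greet the player.")
    print(render_prompt(msgs))
```

Local runtimes differ in how they consume this structure (some take the messages list directly, others a flattened prompt), so check each tool's own docs.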
    • <span id="tool">Tool (AI LLM)</span>

  • <span id="speech">Speech</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • Fliki
      • VALL-E X - Cross-Lingual Neural Codec Language Modeling. | [arXiv](https://arxiv.org/abs/2303.03926) | | Speech |
      • LOVO - Go-to AI Voice Generator & Text to Speech platform for thousands of creators. | | | Speech |
      • VALL-E - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. | [arXiv](https://arxiv.org/abs/2301.02111) | | Speech |
      • Audyo
      • CLAPSpeech - Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training. | [arXiv](https://arxiv.org/abs/2305.10763) | | Speech |
      • Narakeet
      • TorToiSe-TTS - A multi-voice TTS system trained with an emphasis on quality. | | | Speech |
      • Bark - Text-Prompted Generative Audio Model. | | | Speech |
      • Whisper - A general-purpose speech recognition model. | | | Speech |
      • XTTS - A generative model for Text-to-Speech. | | | Speech |
      • Voicebox - Text-Guided Multilingual Universal Speech Generation at Scale. | [arXiv](https://arxiv.org/abs/2306.15687) | | Speech |
      • OpenVoice
      • CosyVoice - A multi-lingual large voice generation model, providing full-stack inference, training and deployment capability. | | | Speech |
      • ChatTTS
      • MeloTTS - A high-quality multi-lingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese and Korean. | | | Speech |
      • GPT-SoVITS - Few-shot Voice Conversion and Text-to-Speech WebUI. | | | Speech |
      • EmotiVoice - A Multi-Voice and Prompt-Controlled TTS Engine. | | | Speech |
      • Bert-VITS2
      • VoiceCraft - Zero-Shot Speech Editing and Text-to-Speech in the Wild. | | | Speech |
      • TTS Generation WebUI
      • YourTTS - Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone. | [arXiv](https://arxiv.org/abs/2112.02418) | | Speech |
      • MetaVoice-1B - A foundational model for human-level speech intelligence. | | | Speech |
      • StyleTTS 2 - Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. | [arXiv](https://arxiv.org/abs/2306.07691) | | Speech |
      • SpeechGPT - Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. | [arXiv](https://arxiv.org/abs/2305.11000) | | Speech |
      • One-Shot-Voice-Cloning - One-shot voice cloning based on Unet-TTS. | | | Speech |
      • Applio - A voice conversion tool focused on simplicity, quality and a user-friendly experience. | | | Speech |
      • DEX-TTS - Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability. | [arXiv](https://arxiv.org/abs/2406.19135) | | Speech |
      • Glow-TTS - A Generative Flow for Text-to-Speech via Monotonic Alignment Search. | [arXiv](https://arxiv.org/abs/2005.11129) | | Speech |
      • MahaTTS - An Open-Source Large Speech Generation Model. | | | Speech |
      • Matcha-TTS
      • OverFlow
      • RealtimeTTS - A state-of-the-art text-to-speech (TTS) library designed for real-time applications. | | | Speech |
      • SenseVoice
      • speech-to-text-gpt3-unity
      • StableTTS - Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3. | | | Speech |
      • X-E-Speech - Joint Training Framework of Non-Autoregressive Cross-lingual Emotional Text-to-Speech and Voice Conversion. | | | Speech |
      • ZMM-TTS - Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations. | [arXiv](https://arxiv.org/abs/2312.14398) | | Speech |
      • tortoise.cpp - A ggml implementation of tortoise-tts. | | | Speech |
      • Mini-Omni - Language Models Can Hear, Talk While Thinking in Streaming. Mini-Omni is an open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output conversational capabilities. | [arXiv](https://arxiv.org/abs/2408.16725) | | Speech |
      • GLM-4-Voice - An end-to-end voice model launched by Zhipu AI. It can directly understand and generate Chinese and English speech, engage in real-time voice conversations, and change attributes such as emotion, intonation, speech rate, and dialect based on user instructions. | | | Speech |
      • Step-Audio - Unified Understanding and Generation in Intelligent Speech Interaction. | [arXiv](https://arxiv.org/abs/2502.11946) | | Speech |
      • Chatterbox - Production-grade open-source TTS model. | | | Speech |
      • IndexTTS2 - A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech. | [arXiv](https://arxiv.org/abs/2506.21619) | | Speech |
      • UnityNeuroSpeech
      • Higgs Audio
      • Kitten TTS - An open-source realistic text-to-speech model with just 15 million parameters, designed for lightweight deployment and high-quality voice synthesis. | | | Speech |
      • VibeVoice - Generates long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. | | | Speech |
      • Step-Audio 2 - An end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. | [arXiv](https://arxiv.org/abs/2507.16632) | | Speech |
      • UniAudio 2.0 - A Multi-task Audio Foundation Model with Reasoning-Augmented Audio Tokenization. | | | Speech |
      • FireRedTTS-2 - Towards Long Conversational Speech Generation for Podcast and Chatbot. | [arXiv](https://arxiv.org/abs/2509.02020) | | Speech |
      • VoxCPM - Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning. | | | Speech |
      • Liquid Audio - Speech-to-Speech audio models by Liquid AI. | | | Speech |
      • Stable Speech - A Text-to-Speech model. | | | Speech |
      • WhisperSpeech - An open-source text-to-speech system built by inverting Whisper. | | | Speech |
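Many of the TTS and speech-recognition tools above exchange plain 16-bit PCM WAV files. A stdlib-only sketch that writes a one-second test tone in that format; the filename and the sample rate, frequency, and amplitude are arbitrary example values:

```python
# Writes a mono, 16-bit, 16 kHz WAV test tone using only the standard library.
import math
import struct
import wave

def write_test_tone(path: str, freq: float = 440.0,
                    rate: int = 16000, seconds: float = 1.0) -> None:
    """Generate a sine tone and save it as 16-bit PCM WAV."""
    n = int(rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(frames)

if __name__ == "__main__":
    write_test_tone("tone.wav")
```

A file like this is a convenient smoke-test input for ASR tools, and the same `wave` parameters (channels, sample width, rate) are what you inspect when a TTS tool's output needs resampling.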
    • <span id="tool">Tool (AI LLM)</span>

      • Vocode - An open-source library for building voice-based LLM applications. | | | Speech |
  • <span id="animation">Animation</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • Stable Animation - A text-to-animation tool for developers. | | | Animation |
      • Wonder Studio - An AI tool that automatically animates, lights and composes CG characters into a live-action scene. | | | Animation |
      • FreeInit
      • PIA - Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models. |[arXiv](https://arxiv.org/abs/2312.13964) | | Animation |
      • ID-Animator - Zero-Shot Identity-Preserving Human Video Generation. |[arXiv](https://arxiv.org/abs/2404.15275) | | Animation |
      • Deforum
      • NUWA-XL
      • NUWA-Infinity - A multimodal generative model designed to generate high-quality images and videos from given text, image or video input. | | | Animation |
      • Animate Anyone - Consistent and Controllable Image-to-Video Synthesis for Character Animation. |[arXiv](https://arxiv.org/abs/2311.17117) | | Animation |
      • MagicAnimate
      • DreaMoving
      • AnimateLCM
      • FaceFusion
      • Wav2Lip - Accurately Lip-syncing Videos In The Wild. |[arXiv](https://arxiv.org/abs/2008.10010) | | Animation |
      • GeneFace - Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis. |[arXiv](https://arxiv.org/abs/2301.13430) | | Animation |
      • SadTalker-Video-Lip-Sync
      • AnimateAnything - Fine-Grained Open Domain Image Animation with Motion Guidance. |[arXiv](https://arxiv.org/abs/2311.12886) | | Animation |
      • AnimationGPT
      • DrawingSpinUp
      • Animate-X - Universal Character Image Animation with Enhanced Motion Representation. |[arXiv](https://arxiv.org/abs/2410.10306) | | Animation |
      • ToonCrafter
      • Omni Animation
      • AnimateZero - Video Diffusion Models are Zero-Shot Image Animators. |[arXiv](https://arxiv.org/abs/2312.03793) | | Animation |
      • Index-AniSora - The most powerful open-source animated video generation model. It enables one-click creation of video shots across diverse anime styles including series episodes, Chinese original animations, manga adaptations, VTuber content, anime PVs, mad-style parodies, and more! |[arXiv](https://arxiv.org/abs/2412.10255) | | Animation |
      • ToonComposer - Streamlining Cartoon Production with Generative Post-Keyframing. |[arXiv](https://arxiv.org/abs/2508.10881) | | Animation |
      • AnimateDiff - Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. |[arXiv](https://arxiv.org/abs/2307.04725) | | Animation |
      • SadTalker - Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation. |[arXiv](https://arxiv.org/abs/2211.12194) | | Animation |
      • TaleCrafter
    • <span id="tool">Tool (AI LLM)</span>

  • <span id="audio">Audio</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • Stable Audio - Fast Timing-Conditioned Latent Audio Diffusion. | | | Audio |
      • AudioLDM - Text-to-Audio Generation with Latent Diffusion Models. |[arXiv](https://arxiv.org/abs/2301.12503) | | Audio |
      • OptimizerAI
      • MAGNeT - Masked Audio Generation using a Single Non-Autoregressive Transformer. | | | Audio |
      • Audiobox
      • Make-An-Audio - Text-To-Audio Generation with Prompt-Enhanced Diffusion Models. |[arXiv](https://arxiv.org/abs/2301.12661) | | Audio |
      • SoundStorm
      • Stable Audio Open - Generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. | | | Audio |
      • FoleyCrafter
      • SyncFusion - Multimodal Onset-synchronized Video-to-Audio Foley Synthesis. |[arXiv](https://arxiv.org/abs/2310.15247) | | Audio |
      • AudioGPT
      • Amphion - An Open-Source Audio, Music, and Speech Generation Toolkit. |[arXiv](https://arxiv.org/abs/2312.09911) | | Audio |
      • TANGO - Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model. | | | Audio |
      • ArchiSound
      • AcademiCodec
      • AudioEditing - Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion. |[arXiv](https://arxiv.org/abs/2402.10009) | | Audio |
      • Audiogen Codec
      • AudioLCM - Text-to-Audio Generation with Latent Consistency Models. |[arXiv](https://arxiv.org/abs/2406.00356v1) | | Audio |
      • AudioLDM 2 - Learning Holistic Audio Generation with Self-supervised Pretraining. |[arXiv](https://arxiv.org/abs/2308.05734) | | Audio |
      • Auffusion - Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation. |[arXiv](https://arxiv.org/abs/2401.01044) | | Audio |
      • CTAG - Creative Text-to-Audio Generation via Synthesizer Programming. | | | Audio |
      • Make-An-Audio 3 - Transforming Text into Audio via Flow-based Large Diffusion Transformers. |[arXiv](https://arxiv.org/abs/2305.18474) | | Audio |
      • NeuralSound - Learning-based Modal Sound Synthesis with Acoustic Transfer. |[arXiv](https://arxiv.org/abs/2108.07425) | | Audio |
      • Qwen2-Audio - The Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud. |[arXiv](https://arxiv.org/abs/2407.10759) | | Audio |
      • SEE-2-SOUND - Zero-Shot Spatial Environment-to-Spatial Sound. |[arXiv](https://arxiv.org/abs/2406.06612) | | Audio |
      • VTA-LDM - Video-to-Audio Generation with Hidden Alignment. |[arXiv](https://arxiv.org/abs/2407.07464) | | Audio |
      • WavJourney
      • MMAudio - Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis. |[arXiv](https://arxiv.org/abs/2412.15322) | | Audio |
      • AudioX - Diffusion Transformer for Anything-to-Audio Generation. |[arXiv](https://arxiv.org/abs/2503.10522) | | Audio |
      • ThinkSound - Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing. |[arXiv](https://arxiv.org/abs/2506.21448) | | Audio |
      • MiDashengLM
      • MeanAudio - Fast and Faithful Text-to-Audio Generation with Mean Flows. | | | Audio |
      • HunyuanVideo-Foley - Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation. |[arXiv](https://www.arxiv.org/abs/2508.16930) | | Audio |
  • <span id="code">Code</span>

  • <span id="game">Game (World Model & Agent)</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • Opus
      • OmAgent
      • Astrocade
      • CogAgent - An open-source visual language model improved based on CogVLM. |[arXiv](https://arxiv.org/abs/2312.08914) | | Agent |
      • Digital Life Project
      • Moonlander.ai
      • SIMA
      • StoryGames.ai
      • V-IRL
      • LangChain
      • LlamaIndex
      • Dify - An open-source LLM app development platform. | | | Agent |
      • FastGPT - A knowledge-based platform built on LLMs. | | | Agent |
      • AutoGen - Enabling Next-Gen Large Language Model Applications. |[arXiv](https://arxiv.org/abs/2308.08155) | | Agent |
      • Translation Agent
      • XAgent
      • everything-ai - An AI-powered, local chatbot assistant 🤖. | | | Agent |
      • fabric - An open-source framework for augmenting humans using AI. | | | Agent |
      • fastRAG
      • Generative Agents
      • ChatDev
      • OpenAgents
      • AI Town
      • Ragas
      • Qwen-Agent - Agent is a framework for developing LLM applications based on the instruction following, tool usage, planning, and memory capabilities of Qwen. | | | Agent |
      • Pipecat
      • AutoAgents
      • AgentScope - Start building LLM-empowered multi-agent applications in an easier way. |[arXiv](https://arxiv.org/abs/2402.14034) | | Agent |
      • KwaiAgents - A generalized information-seeking agent system with Large Language Models (LLMs). |[arXiv](https://arxiv.org/abs/2312.04889) | | Agent |
      • Genesis
      • AgentBench
      • behaviac
      • MindSearch - An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT). | | | Agent |
      • Mixture of Agents (MoA) - Mixture-of-Agents Enhances Large Language Model Capabilities. |[arXiv](https://arxiv.org/abs/2406.04692) | | Agent |
      • Agent Group Chat
      • anime.gf
      • Biomes
      • Buffer of Thoughts - Thought-Augmented Reasoning with Large Language Models. |[arXiv](https://arxiv.org/abs/2406.04271) | | Agent |
      • Byzer-Agent
      • Cat Town - An AI-powered simulation with cats. | | | Agent |
      • CharacterGLM
      • Cradle
      • GameAISDK - An image-based game AI automation framework. | | | Framework |
      • gigax - LLM-powered NPCs. | | | Game |
      • HippoRAG - Neurobiologically Inspired Long-Term Memory for Large Language Models. |[arXiv](https://arxiv.org/abs/2405.14831) | | Agent |
      • Interactive LLM Powered NPCs - An open-source project that completely transforms your interaction with non-player characters (NPCs) in any game! | | | Game |
      • IoA - An open-source framework for collaborative AI agents, enabling diverse, distributed agents to team up and tackle complex tasks through internet-like connectivity. | | | Agent |
      • LangGraph Studio
      • LARP - Language-Agent Role Play for open-world games. |[arXiv](https://arxiv.org/abs/2312.17653) | | Agent |
      • MuG Diffusion
      • Video2Game - Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video. |[arXiv](https://arxiv.org/abs/2404.09833) | | Game |
      • WebDesignAgent
      • MMRole - A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents. |[arXiv](https://arxiv.org/abs/2408.04203v1) | | Agent |
      • TaskGen - A Task-based agentic framework building on StrictJSON outputs by LLM agents. | | | Agent |
      • Twitter
      • Agent K - A self-evolving and modular agent. | | | Agent |
      • RPBench-Auto - An automated evaluation benchmark for role-playing. | | | Game |
      • GameNGen - Diffusion Models Are Real-Time Game Engines. |[arXiv](https://arxiv.org/abs/2408.14837) | | Game |
      • GameGen-O - Open-world Video Game Generation. | | | Game |
      • TEN Agent - A real-time multimodal agent integrated with the OpenAI Realtime API and RTC, featuring weather checks, web search, vision, and RAG capabilities. | | | Agent |
      • Unbounded
      • Oasis
      • Agent Laboratory
      • SWE-agent
      • AWorld - An agent runtime for self-improvement. | | | Agent |
      • Jaaz - The world's first open-source multimodal creative assistant: an AI design agent and a local alternative to Lovart (think Canva + Cursor), able to design, edit and generate images, posters, storyboards, etc. | | | Agent |
      • Matrix-Game 2.0 - An Open-Source, Real-Time, and Streaming Interactive World Model. | | | Game |
      • NVIDIA NeMo Agent Toolkit
      • Genie 3
      • HunyuanWorld 1.0
      • ComoRAG - A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning. |[arXiv](https://arxiv.org/abs/2508.10419) | | Agent |
      • Datarus Jupyter Agent - A multi-step reasoning system that executes complex analytical workflows with step-by-step reasoning, automatic error recovery, and comprehensive result synthesis. | | | Agent |
      • Hunyuan-GameCraft - High-dynamic Interactive Game Video Generation with Hybrid History Condition. |[arXiv](https://arxiv.org/abs/2506.17201) | | Game |
      • HunyuanWorld-Voyager - A novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image along a user-defined camera path. Voyager can generate 3D-consistent scene videos for world exploration following custom camera trajectories. | | | Game |
      • Langflow - A UI for LangChain, designed with react-flow to provide an effortless way to experiment and prototype flows. | | | Agent |
      • GenAgent - Build Collaborative AI Systems with Automated Workflow Generation: Case Studies on ComfyUI. |[arXiv](https://arxiv.org/abs/2409.01392) | | Agent |
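The agent frameworks listed above share one core pattern: a loop that routes the model's tool requests to real functions and feeds the results back until the model finishes. A dependency-free sketch with the model stubbed by a scripted policy; the tool names, action tuples, and finish convention are illustrative assumptions, not any framework's API:

```python
# Minimal tool-calling agent loop. scripted_model stands in for an LLM;
# TOOLS maps tool names to plain Python callables.

def scripted_model(history):
    """Stand-in for an LLM policy: request a tool once, then finish."""
    if not any(step[0] == "tool_result" for step in history):
        return ("call_tool", "roll_dice", {"sides": 6})
    return ("finish", "Rolled a die for the player.")

TOOLS = {"roll_dice": lambda sides: 4}  # deterministic stub "tool"

def run_agent(model, tools, max_steps=5):
    """Alternate model decisions and tool executions until 'finish'."""
    history = []
    for _ in range(max_steps):
        action = model(history)
        if action[0] == "finish":
            return action[1], history
        _, name, kwargs = action
        result = tools[name](**kwargs)              # execute the requested tool
        history.append(("tool_result", name, result))
    raise RuntimeError("agent did not finish within max_steps")

if __name__ == "__main__":
    answer, trace = run_agent(scripted_model, TOOLS)
    print(answer)
```

Real frameworks add structured (usually JSON) tool schemas, retries, and memory on top of this loop, but the control flow is essentially the same.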
  • <span id="texture">Texture</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • TexFusion - Synthesizing 3D Textures with Text-Guided Image Diffusion Models. |[arXiv](https://arxiv.org/abs/2310.13772) | | Texture |
      • With Poly
      • Neuralangelo - High-Fidelity Neural Surface Reconstruction. |[arXiv](https://arxiv.org/abs/2306.03092) | | Texture |
      • Dream Textures - Stable Diffusion built-in to Blender. Create textures, concept art, background assets, and more with a simple text prompt. | | Blender | Texture |
      • Text2Tex - Text-driven Texture Synthesis via Diffusion Models. |[arXiv](https://arxiv.org/abs/2303.11396) | | Texture |
      • CRM
      • DreamMat - High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models. |[arXiv](https://arxiv.org/abs/2405.17176) | | Texture |
      • DreamSpace - Dreaming Your Room Space with Text-Driven Panoramic Texture Propagation. | | | Texture |
      • InstructHumans
      • InteX - Interactive Text-to-Texture Synthesis via Unified Depth-aware Inpainting. |[arXiv](https://arxiv.org/abs/2403.11878) | | Texture |
      • MaterialSeg3D
      • MeshAnything
      • X-Mesh - Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance. |[arXiv](https://arxiv.org/abs/2303.15764) | | Texture |
      • LLaMA-Mesh - Unifying 3D Mesh Generation with Language Models. |[arXiv](https://arxiv.org/abs/2411.09595) | | Mesh |
      • Paint-it - Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering. | | | Texture |
    • <span id="tool">Tool (AI LLM)</span>

      • Polycam
  • <span id="speech">Analytics</span>

    • <span id="tool">LLM (LLM & Tool)</span>

  • <span id="visual">VLM (Visual)</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • LGVI - Towards Language-Driven Video Inpainting via Multimodal Large Language Models. | | | Visual |
      • CoTracker
      • FaceHi
      • MaskViT - Masked Visual Pre-Training for Video Prediction. |[arXiv](https://arxiv.org/abs/2206.11894) | | Visual |
      • Qwen-VL - A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. |[arXiv](https://arxiv.org/abs/2308.12966) | | Visual |
      • MoE-LLaVA - Mixture of Experts for Large Vision-Language Models. |[arXiv](https://arxiv.org/abs/2401.15947) | | Visual |
      • LLaVA++ - Extending Visual Capabilities with LLaMA-3 and Phi-3. | | | Visual |
      • PLLaVA - Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning. |[arXiv](https://arxiv.org/abs/2404.16994) | | Visual |
      • Cambrian-1 - A Fully Open, Vision-Centric Exploration of Multimodal LLMs. |[arXiv](https://arxiv.org/abs/2406.16860) | | Multimodal LLMs |
      • CogVLM2 - GPT4V-level open-source multi-modal model based on Llama3-8B. | | | Visual |
      • EVF-SAM - Early Vision-Language Fusion for Text-Prompted Segment Anything Model. |[arXiv](https://arxiv.org/abs/2406.20076) | | Visual |
      • Kangaroo - A Powerful Video-Language Model Supporting Long-context Video Input. | | | Visual |
      • LongVA
      • MotionLLM
      • ShareGPT4V - Improving Large Multi-modal Models with Better Captions. |[arXiv](https://arxiv.org/abs/2311.12793) | | Visual |
      • SOLO - A Single Transformer for Scalable Vision-Language Modeling. |[arXiv](https://arxiv.org/abs/2407.06438) | | Visual |
      • Video-CCAM - Advancing Video-Language Understanding with Causal Cross-Attention Masks. | | | Visual |
      • VideoLLaMA 2 - Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. |[arXiv](https://arxiv.org/abs/2406.07476) | | Visual |
      • Vitron - A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing. | | | Visual |
      • VILA - On Pre-training for Visual Language Models. |[arXiv](https://arxiv.org/abs/2312.07533) | | Visual |
      • LLaVA-OneVision - Easy Visual Task Transfer. |[arXiv](https://arxiv.org/abs/2408.03326) | | Visual |
      • Sapiens
      • VideoLLaMA 3
      • Video-MME - The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. |[arXiv](https://arxiv.org/abs/2405.21075) | | Visual |
      • MiniCPM-Llama3-V 2.5 - A GPT-4V Level MLLM on Your Phone. | | | Visual |
      • dots.vlm1 - The first vision-language model in the dots model family. Built upon a 1.2 billion-parameter vision encoder and the DeepSeek V3 large language model (LLM), dots.vlm1 demonstrates strong multimodal understanding and reasoning capabilities. | | | VLM |
      • GLM-V - GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. |[arXiv](https://arxiv.org/abs/2507.01006) | | VLM |
      • VideoAgent - A Memory-augmented Multimodal Agent for Video Understanding. |[arXiv](https://arxiv.org/abs/2403.11481) | | Agent |
      • Kwai Keye-VL - A cutting-edge multimodal large language model meticulously crafted by the Kwai Keye Team at Kuaishou. |[arXiv](https://arxiv.org/abs/2509.01563) | | VLM |
      • Lumina-DiMOO - An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding. | | | VLM |
      • POINTS-Reader - Distillation-Free Adaptation of Vision-Language Models for Document Conversion. |[arXiv](https://arxiv.org/abs/2509.01215) | | Visual |
  • <span id="voice">Singing Voice</span>

  • <span id="shader">Shader</span>

    • <span id="tool">LLM (LLM & Tool)</span>

      • AI Shader - ChatGPT-powered shader generator for Unity. | | Unity | Shader |
  • <span id="game">Game (Agent)</span>

    • <span id="tool">Tool (AI LLM)</span>

      • AgentSims - An Open-Source Sandbox for Large Language Model Evaluation. | | | Agent |
  • <span id="visual">Visual</span>

    • <span id="tool">Tool (AI LLM)</span>
