Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Awesome-LLM-3D
Awesome-LLM-3D: a curated list of Multi-modal Large Language Model resources in the 3D world.
https://github.com/ActiveVisionLab/Awesome-LLM-3D
Last synced: 5 days ago
JSON representation
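The indexing service also exposes each list as JSON. Below is a minimal sketch of fetching that JSON representation with Python; the endpoint path and response fields are illustrative assumptions (consult the ecosyste.ms API documentation for the actual routes), not a documented contract.

```python
# Minimal sketch: fetch the JSON representation of an indexed awesome list.
# NOTE: the endpoint path and field names below are assumptions for
# illustration only; consult the ecosyste.ms API docs for real routes.
import requests

API_BASE = "https://awesome.ecosyste.ms/api/v1"  # assumed base URL


def fetch_list(name: str) -> dict:
    """Fetch one indexed list (e.g. 'Awesome-LLM-3D') as a JSON dict."""
    resp = requests.get(f"{API_BASE}/lists/{name}", timeout=30)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    data = fetch_list("Awesome-LLM-3D")
    # Field names here are illustrative guesses.
    print(data.get("name"), data.get("url"))
```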
3D Understanding via LLM
- LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
- Situational Awareness Matters in 3D Vision Language Reasoning
- SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
- ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
- LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding
- 3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V
- Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers
- GPT4Point: A Unified Framework for Point-Language Understanding and Generation
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
- JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues
- Zero-Shot 3D Shape Correspondence
- LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent
- Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
- PointLLM: Empowering Large Language Models to Understand Point Clouds
- Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes
- 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
- 3D-LLM: Injecting the 3D World into Large Language Models
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding
- Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding
- Uni3D: Exploring Unified 3D Representation at Scale
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
- More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding
- MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors
- LLaNA: Large Language and NeRF Assistant
- LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
3D Understanding via other Foundation Models
- LERF: Language Embedded Radiance Fields
- Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion
- CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
- PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
- UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
- CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition
- From Language to 3D Worlds: Adapting Language Model for Point Cloud Perception
- OpenNerf: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views
- CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP
- VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations
- CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes
- Weakly Supervised 3D Open-vocabulary Segmentation
- RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding
- OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
- Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
- N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
- ConceptFusion: Open-set Multimodal 3D Mapping
- SAI3D: Segment Any Instance in 3D Scenes
- Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance
- OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data
- OpenMask3D: Open-Vocabulary 3D Instance Segmentation
- Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation
- CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection
- OpenScene: 3D Scene Understanding with Open Vocabularies
- Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
- Language-Grounded Indoor 3D Semantic Segmentation in the Wild
- Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models
3D Reasoning
3D Generation
- 3D-GPT: Procedural 3D Modeling with Large Language Models
- MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers
- ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model
- DreamLLM: Synergistic Multimodal Comprehension and Creation
- LLMR: Real-time Prompting of Interactive Worlds using Large Language Models
- DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance
3D Embodied Agent
- RT-1: Robotics Transformer for Real-World Control at Scale
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning
- Unified Human-Scene Interaction via Prompted Chain-of-Contacts
- LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models
- See and Think: Embodied Agent in Virtual Environment
- On Bringing Robots Home
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
- Diffusion-based Generation, Optimization, and Planning in 3D Scenes
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- An Embodied Generalist Agent in 3D World
- Open-vocabulary Queryable Scene Representations for Real World Planning
- CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory
3D Benchmarks
- ScanQA: 3D Question Answering for Spatial Scene Understanding
- Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
- SQA3D: Situated Question Answering in 3D Scenes
- Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects
- M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
- ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
- 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
- EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
- ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes
- Multi-modal Situated Reasoning in 3D Scenes
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
- Looking at words and points with attention: a benchmark for text-to-shape coherence
Acknowledgement
🔥 News
- 2023-12-16
- 2024-01-06 - Reordered the list to better follow the latest advances.
- 2024-05-16
Star History
- [Star History Chart](https://star-history.com/#ActiveVisionLab/Awesome-LLM-3D&Date)