# Awesome-LLM-3D [![Awesome](https://awesome.re/badge.svg)](https://awesome.re) [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity) [![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com) [![arXiv](https://img.shields.io/badge/arXiv-2405.10255-b31b1b.svg)](https://arxiv.org/abs/2405.10255)

### 📢 Survey paper available on arXiv now: **[[Paper](https://arxiv.org/pdf/2405.10255)]**

## 🏠 About
Here is a curated list of papers on 3D-related tasks empowered by Large Language Models (LLMs).
It covers a variety of tasks, including 3D understanding, reasoning, generation, and embodied agents. We also include works built on other foundation models (e.g., CLIP, SAM) to give a fuller picture of the area.

This is an actively maintained repository; watch it to follow the latest advances. If you find it useful, please star this repo.

## 🔥 News
- [2023-12-16] [Xianzheng Ma](https://xianzhengma.github.io/) and [Yash Bhalgat](https://yashbhalgat.github.io/) curated this list and published the first version.
- [2024-01-06] [Runsen Xu](https://runsenxu.com/) added chronological information, and [Xianzheng Ma](https://xianzhengma.github.io/) reorganized the list in reverse chronological order to make it easier to follow the latest advances.

## Table of Contents

- [Awesome-LLM-3D](#awesome-llm-3d)
- [3D Understanding (LLM)](#3d-understanding-via-llm)
- [3D Understanding (other Foundation Models)](#3d-understanding-via-other-foundation-models)
- [3D Reasoning](#3d-reasoning)
- [3D Generation](#3d-generation)
- [3D Embodied Agent](#3d-embodied-agent)
- [3D Benchmarks](#3d-benchmarks)
- [Contributing](#contributing)

## 3D Understanding via LLM

| Date | Keywords | Institute (first) | Paper | Publication | Others |
| :-----: | :------------------: | :--------------: | :------------------------------------------------------------------------------------- | :---------: | :---------: |
| 2023-12-21 | LiDAR-LLM | PKU | [LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding](https://arxiv.org/pdf/2312.14074.pdf) | Arxiv | [project](https://sites.google.com/view/lidar-llm) |
| 2023-12-15 | 3DAP | Shanghai AI Lab | [3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V](https://arxiv.org/pdf/2312.09738.pdf) | Arxiv | - |
| 2023-12-13 | Chat-3D v2 | ZJU | [Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers](https://arxiv.org/pdf/2312.08168.pdf) | Arxiv | [github](https://github.com/Chat-3D/Chat-3D-v2) |
| 2023-12-5 | GPT4Point | HKU | [GPT4Point: A Unified Framework for Point-Language Understanding and Generation](https://arxiv.org/pdf/2312.02980.pdf) |Arxiv | [github](https://github.com/Pointcept/GPT4Point) |
| 2023-11-30 | LL3DA | Fudan University | [LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning](https://arxiv.org/pdf/2311.18651.pdf) |Arxiv| [github](https://github.com/Open3DA/LL3DA) |
| 2023-11-26 | ZSVG3D | CUHK(SZ) | [Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding](https://arxiv.org/pdf/2311.15383.pdf) | Arxiv | [project](https://curryyuan.github.io/ZSVG3D/) |
| 2023-11-18 | LEO | BIGAI | [An Embodied Generalist Agent in 3D World](https://arxiv.org/pdf/2311.12871.pdf) | Arxiv | [github](https://github.com/embodied-generalist/embodied-generalist) |
| 2023-10-14 | JM3D-LLM | Xiamen University | [JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues](https://arxiv.org/pdf/2310.09503v2.pdf) | ACM MM '23 | [github](https://github.com/mr-neko/jm3d) |
| 2023-10-10 | Uni3D | BAAI | [Uni3D: Exploring Unified 3D Representation at Scale](https://arxiv.org/abs/2310.06773) | ICLR '24 | [project](https://github.com/baaivision/Uni3D) |
| 2023-9-27 | - | KAUST | [Zero-Shot 3D Shape Correspondence](https://arxiv.org/abs/2306.03253) | SIGGRAPH Asia '23 | - |
| 2023-9-21| LLM-Grounder | U-Mich | [LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent](https://arxiv.org/pdf/2309.12311.pdf) | ICRA '24 | [github](https://github.com/sled-group/chat-with-nerf) |
| 2023-9-1 | Point-Bind | CUHK | [Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following](https://arxiv.org/pdf/2309.00615.pdf) | Arxiv | [github](https://github.com/ZiyuGuo99/Point-Bind_Point-LLM) |
| 2023-8-31 | PointLLM | CUHK | [PointLLM: Empowering Large Language Models to Understand Point Clouds](https://arxiv.org/pdf/2308.16911.pdf) | Arxiv | [github](https://github.com/OpenRobotLab/PointLLM) |
| 2023-8-17| Chat-3D | ZJU | [Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes](https://arxiv.org/pdf/2308.08769v1.pdf) | Arxiv | [github](https://github.com/Chat-3D/Chat-3D)|
| 2023-8-8 | 3D-VisTA | BIGAI | [3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment](https://arxiv.org/abs/2308.04352) | ICCV '23 | - |
| 2023-7-24 | 3D-LLM | UCLA | [3D-LLM: Injecting the 3D World into Large Language Models](https://arxiv.org/pdf/2307.12981.pdf) | NeurIPS '23| [github](https://github.com/UMass-Foundation-Model/3D-LLM) |
| 2023-3-29 | ViewRefer | CUHK | [ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding](https://arxiv.org/pdf/2303.16894.pdf) |ICCV '23 |[github](https://github.com/Ivan-Tang-3D/ViewRefer3D) |
| 2023-2-14 | ConceptFusion | MIT | [ConceptFusion: Open-set Multimodal 3D Mapping](https://arxiv.org/pdf/2302.07241.pdf) | RSS '23 | [project](https://concept-fusion.github.io/) |
| 2022-9-12 | - | MIT | [Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding](https://arxiv.org/pdf/2209.05629.pdf) |Arxiv| [github](https://github.com/MIT-SPARK/llm_scene_understanding) |

## 3D Understanding via other Foundation Models
| Date | Keywords | Institute (first) | Paper | Publication | Others |
| :-----: | :------------------: | :--------------: | :------------------------------------------------------------------------------------- | :---------: | :---------: |
| 2024-03-16 | N2F2 | Oxford-VGG | [N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields](https://arxiv.org/pdf/2403.10997.pdf) | Arxiv | - |
| 2023-12-17 | SAI3D | PKU | [SAI3D: Segment Any Instance in 3D Scenes](https://arxiv.org/pdf/2312.11557.pdf) | Arxiv | [project](https://yd-yin.github.io/SAI3D) |
| 2023-12-17 | Open3DIS | VinAI | [Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance](https://arxiv.org/pdf/2312.10671.pdf) | Arxiv | [project](https://open3dis.github.io/) |
| 2023-11-6 | OVIR-3D | Rutgers University | [OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data](https://arxiv.org/pdf/2311.02873.pdf) | CoRL '23 | [github](https://github.com/shiyoung77/OVIR-3D/) |
| 2023-10-29| OpenMask3D | ETH | [OpenMask3D: Open-Vocabulary 3D Instance Segmentation](https://openmask3d.github.io/static/pdf/openmask3d.pdf) | NeurIPS '23 | [project](https://openmask3d.github.io/) |
| 2023-10-5 | Open-Fusion | - | [Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation](https://arxiv.org/pdf/2310.03923.pdf) |Arxiv| [github](https://github.com/UARK-AICV/OpenFusion) |
| 2023-9-22 | OV-3DDet | HKUST | [CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection](https://arxiv.org/pdf/2310.02960.pdf) | NeurIPS '23 | [github](https://github.com/yangcaoai/CoDA_NeurIPS2023) |
| 2023-9-19 | LAMP | - | [From Language to 3D Worlds: Adapting Language Model for Point Cloud Perception](https://openreview.net/forum?id=H49g8rRIiF) | OpenReview | - |
| 2023-9-15 | OpenNerf | - | [OpenNerf: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views](https://openreview.net/pdf?id=SgjAojPKb3) | OpenReview | - |
| 2023-9-1| OpenIns3D | Cambridge | [OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation](https://arxiv.org/pdf/2309.00616.pdf) | Arxiv | [project](https://zheninghuang.github.io/OpenIns3D/) |
| 2023-6-7 | Contrastive Lift | Oxford-VGG | [Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion](https://arxiv.org/pdf/2306.04633.pdf) | NeurIPS '23| [github](https://github.com/yashbhalgat/Contrastive-Lift) |
| 2023-6-4 | Multi-CLIP | ETH | [Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes](https://arxiv.org/pdf/2306.02329.pdf) | Arxiv | - |
| 2023-5-23 | 3D-OVS | NTU | [Weakly Supervised 3D Open-vocabulary Segmentation](https://arxiv.org/pdf/2305.14093.pdf) | NeurIPS '23 | [github](https://github.com/Kunhao-Liu/3D-OVS) |
| 2023-5-21 | VL-Fields | University of Edinburgh | [VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations](https://arxiv.org/pdf/2305.12427.pdf) | ICRA '23 | [project](https://tsagkas.github.io/vl-fields/) |
| 2023-5-8 | CLIP-FO3D | Tsinghua University | [CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP](https://arxiv.org/pdf/2303.04748.pdf) | ICCVW '23 | - |
| 2023-4-12 | 3D-VQA | ETH | [CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes](https://arxiv.org/pdf/2304.06061.pdf) | CVPRW '23 | [github](https://github.com/AlexDelitzas/3D-VQA) |
| 2023-4-3 | RegionPLC | HKU | [RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding](https://arxiv.org/pdf/2304.00962.pdf) | Arxiv | [project](https://jihanyang.github.io/projects/RegionPLC) |
| 2023-3-20 | CG3D | JHU | [CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition](https://arxiv.org/pdf/2303.11313.pdf) |Arxiv| [github](https://github.com/deeptibhegde/CLIP-goes-3D) |
| 2023-3-16 | LERF | UC Berkeley | [LERF: Language Embedded Radiance Fields](https://arxiv.org/pdf/2303.09553.pdf) | ICCV '23 | [github](https://github.com/kerrj/lerf) |
| 2023-1-12 | CLIP2Scene | HKU | [CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP](https://arxiv.org/pdf/2301.04926.pdf) | CVPR '23 | [github](https://github.com/runnanchen/CLIP2Scene) |
| 2022-12-1 | UniT3D | TUM | [UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_UniT3D_A_Unified_Transformer_for_3D_Dense_Captioning_and_Visual_ICCV_2023_paper.pdf) | ICCV '23 | - |
| 2022-11-29 | PLA | HKU | [PLA: Language-Driven Open-Vocabulary 3D Scene Understanding](https://arxiv.org/pdf/2211.16312.pdf) |CVPR '23| [github](https://github.com/CVMI-Lab/PLA) |
| 2022-11-28 | OpenScene | ETHz | [OpenScene: 3D Scene Understanding with Open Vocabularies](https://arxiv.org/pdf/2211.15654.pdf) | CVPR '23 | [github](https://github.com/pengsongyou/openscene) |
| 2022-10-11 | CLIP-Fields | NYU | [CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory](https://arxiv.org/pdf/2210.05663.pdf) | Arxiv | [project](https://mahis.life/clip-fields/) |
| 2022-7-23 | Semantic Abstraction | Columbia | [Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models](https://arxiv.org/pdf/2207.11514.pdf) | CoRL '22 | [project](https://semantic-abstraction.cs.columbia.edu/) |
| 2022-4-26 | ScanNet200 | TUM | [Language-Grounded Indoor 3D Semantic Segmentation in the Wild](https://arxiv.org/pdf/2204.07761.pdf) | ECCV '22 | [project](https://rozdavid.github.io/scannet200) |

## 3D Reasoning
| Date | Keywords | Institute (first) | Paper | Publication | Others |
| :-----: | :------------------: | :--------------: | :------------------------------------------------------------------------------------- | :---------: | :---------: |
| 2023-5-20| 3D-CLR | UCLA | [3D Concept Learning and Reasoning from Multi-View Images](https://arxiv.org/pdf/2303.11327.pdf) | CVPR '23 | [github](https://github.com/evelinehong/3D-CLR-Official) |
| - | Transcribe3D | TTI, Chicago | [Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning](https://openreview.net/pdf?id=7j3sdUZMTF) | CoRL '23 | - |

## 3D Generation
| Date | Keywords | Institute | Paper | Publication | Others |
| :-----: | :------------------: | :--------------: | :------------------------------------------------------------------------------------- | :---------: | :---------: |
| 2023-11-29 | ShapeGPT | Fudan University | [ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model](https://arxiv.org/pdf/2311.17618.pdf) | Arxiv | [github](https://github.com/OpenShapeLab/ShapeGPT) |
| 2023-11-27| MeshGPT | TUM | [MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers](https://arxiv.org/pdf/2311.15475.pdf) |Arxiv | [project](https://nihalsid.github.io/mesh-gpt/) |
| 2023-10-19 | 3D-GPT | ANU | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/pdf/2310.12945.pdf) | Arxiv | - |
| 2023-9-21 | LLMR | MIT | [LLMR: Real-time Prompting of Interactive Worlds using Large Language Models](https://arxiv.org/pdf/2309.12276.pdf) | Arxiv | - |
| 2023-9-20 | DreamLLM | MEGVII | [DreamLLM: Synergistic Multimodal Comprehension and Creation](https://arxiv.org/pdf/2309.11499.pdf) | Arxiv | [github](https://github.com/RunpeiDong/DreamLLM) |
| 2023-4-1 | ChatAvatar | Deemos Tech | [DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance](https://dl.acm.org/doi/abs/10.1145/3592094) | ACM TOG | [website](https://hyperhuman.deemos.com/) |

## 3D Embodied Agent
| Date | Keywords | Institute | Paper | Publication | Others |
| :-----: | :------------------: | :--------------: | :------------------------------------------------------------------------------------- | :---------: | :---------: |
| 2023-11-27 | Dobb-E | NYU | [On Bringing Robots Home](https://arxiv.org/pdf/2311.16098.pdf) | Arxiv | [github](https://github.com/notmahi/dobb-e) |
| 2023-11-26 | STEVE | ZJU | [See and Think: Embodied Agent in Virtual Environment](https://arxiv.org/abs/2311.15209) | Arxiv | [github](https://github.com/rese1f/STEVE) |
| 2023-11-18 | LEO | BIGAI | [An Embodied Generalist Agent in 3D World](https://arxiv.org/pdf/2311.12871.pdf) | Arxiv | [github](https://github.com/embodied-generalist/embodied-generalist) |
| 2023-9-14 | UniHSI | Shanghai AI Lab | [Unified Human-Scene Interaction via Prompted Chain-of-Contacts](https://arxiv.org/pdf/2309.07918.pdf) | Arxiv | [github](https://github.com/OpenRobotLab/UniHSI) |
| 2023-7-28 | RT-2 | Google-DeepMind | [RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control](https://arxiv.org/pdf/2307.15818.pdf) |Arxiv| [github](https://robotics-transformer2.github.io/) |
| 2023-7-12 | SayPlan | QUT Centre for Robotics | [SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning](https://arxiv.org/pdf/2307.06135.pdf) |CoRL '23| [github](https://sayplan.github.io/) |
| 2023-7-12 | VoxPoser | Stanford | [VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models](https://voxposer.github.io/voxposer.pdf) | Arxiv | [github](https://github.com/huangwl18/VoxPoser) |
| 2022-12-13| RT-1 | Google | [RT-1: Robotics Transformer for Real-World Control at Scale](https://robotics-transformer1.github.io/assets/rt1.pdf) |Arxiv| [github](https://robotics-transformer1.github.io/) |
| 2022-12-8 | LLM-Planner | The Ohio State University | [LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models](https://arxiv.org/pdf/2212.04088.pdf) |ICCV '23| [github](https://github.com/OSU-NLP-Group/LLM-Planner/) |
| 2022-10-11 | CLIP-Fields | NYU, Meta | [CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory](https://arxiv.org/pdf/2210.05663.pdf) | RSS '23 | [github](https://github.com/notmahi/clip-fields) |

## 3D Benchmarks
| Date | Keywords | Institute | Paper | Publication | Others |
| :-----: | :------------------: | :--------------: | :------------------------------------------------------------------------------------- | :---------: | :---------: |
| 2024-1-18 | SceneVerse | BIGAI | [SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding](https://arxiv.org/pdf/2401.09340.pdf) | Arxiv | [github](https://github.com/scene-verse/sceneverse) |
| 2023-12-26 | EmbodiedScan | Shanghai AI Lab | [EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI](https://arxiv.org/pdf/2312.16170.pdf) | Arxiv | [github](https://github.com/OpenRobotLab/EmbodiedScan) |
| 2023-12-17 | M3DBench | Fudan University | [M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts](https://arxiv.org/abs/2312.10763) |Arxiv| [github](https://github.com/OpenM3D/M3DBench) |
| 2023-11-29 | - | DeepMind | [Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects](https://arxiv.org/pdf/2311.17851.pdf) | Arxiv | - |
| 2022-10-14 | SQA3D | BIGAI | [SQA3D: Situated Question Answering in 3D Scenes](https://arxiv.org/pdf/2210.07474.pdf) | ICLR '23| [github](https://github.com/SilongYong/SQA3D) |
| 2021-12-20 | ScanQA | RIKEN AIP | [ScanQA: 3D Question Answering for Spatial Scene Understanding](https://arxiv.org/pdf/2112.10482.pdf) | CVPR '22 | [github](https://github.com/ATR-DBI/ScanQA) |
| 2020-12-3 | Scan2Cap | TUM | [Scan2Cap: Context-aware Dense Captioning in RGB-D Scans](https://arxiv.org/pdf/2012.02206.pdf) | CVPR '21| [github](https://github.com/daveredrum/Scan2Cap) |
| 2020-8-23 | ReferIt3D | Stanford | [ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123460409.pdf) | ECCV '20 | [github](https://github.com/referit3d/referit3d) |
| 2019-12-18 | ScanRefer | TUM | [ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language](https://arxiv.org/pdf/1912.08830.pdf) | ECCV '20 | [github](https://daveredrum.github.io/ScanRefer/) |

## Contributing

Your contributions are always welcome!

I will keep some pull requests open if I'm not sure whether they belong in this list of awesome 3D-LLM resources; you can vote for them by adding 👍 to them.

---

If you have any questions about this opinionated list, please get in touch at [email protected] or WeChat ID: mxz1997112.

## Acknowledgement
This repo is inspired by [Awesome-LLM](https://github.com/Hannibal046/Awesome-LLM?tab=readme-ov-file#other-awesome-lists).