https://github.com/Open3DA/LL3DA
[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.
- Host: GitHub
- URL: https://github.com/Open3DA/LL3DA
- Owner: Open3DA
- License: mit
- Created: 2023-11-27T08:58:00.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-17T15:08:55.000Z (about 1 year ago)
- Last Synced: 2024-09-27T15:03:11.163Z (about 1 year ago)
- Topics: 3d, 3d-models, 3d-to-text, cvpr2024, gpt, instruction-tuning, language-model, llm, multi-modal, scene-understanding
- Language: Python
- Homepage: https://ll3da.github.io/
- Size: 72.7 MB
- Stars: 227
- Watchers: 6
- Forks: 9
- Open Issues: 15
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Official repo for LL3DA
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
💻Project Page •
📄arXiv Paper •
🎞YouTube •
🤗HuggingFace Demo (WIP) •
Citation
## 🏃 Intro LL3DA
LL3DA is a Large Language 3D Assistant that can respond to both visual and textual interactions within **complex 3D environments**.
Recent advances in Large Multimodal Models (LMMs) have enabled a wide range of applications in human-machine interaction. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains challenging, especially given the need to understand permutation-invariant point cloud representations of a 3D scene. Existing works rely on multi-view images and project 2D features into 3D space as scene representations, which incurs large computational overhead and degrades performance. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts. This helps LMMs better comprehend human interactions and removes ambiguities within cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.

## 🚩 News
- 2024-03-04. 💥 The code is fully released! Now you can train your customized models!
- 2024-02-27. 🎉 LL3DA is accepted by CVPR 2024! See you in Seattle!
- 2023-11-30. 📣 Upload paper and init project.

**TODO**:
- [x] Upload our paper to arXiv and build project pages.
- [x] Pray for acceptance.
- [x] Upload all the code and training scripts.
- [x] Release pre-trained weights. (see [checkpoint](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/ll3da-opt-1.3b.pth))
- [ ] Add local demo interface.
- [ ] Train on larger 3D VL benchmarks and scale up models.

## ⚡ Quick Start
Environment Setup
**Step 1. Build Dependencies.** Our code is tested with CUDA 11.6 and Python 3.8.16. To run the code, first install the following packages:
```
h5py
scipy
cython
plyfile
'trimesh>=2.35.39,<2.35.40'
'networkx>=2.2,<2.3'
'torch==1.13.1+cu116'
'transformers>=4.37.0'
```
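As a rough guide, the dependencies above can be installed with `pip`. This is a minimal sketch, assuming the CUDA 11.6 build of PyTorch is fetched from the official PyTorch wheel index:

```{bash}
# Sketch: install the listed Python dependencies (adjust to your environment).
pip install h5py scipy cython plyfile \
    'trimesh>=2.35.39,<2.35.40' 'networkx>=2.2,<2.3' 'transformers>=4.37.0'
# The +cu116 build is assumed to be available from the official PyTorch wheel index.
pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
```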
After that, build the `pointnet2` and accelerated `giou` from source:

```{bash}
cd third_party/pointnet2
python setup.py install
```

```{bash}
cd utils
python cython_compile.py build_ext --inplace
```

**Step 2. Download pre-trained embeddings.** Download the pre-processed BERT embedding weights from [huggingface](https://huggingface.co/CH3COOK/bert-base-embedding/tree/main) and store them under the [`./bert-base-embedding`](./bert-base-embedding) folder. The weights are **the same as** those of the official BERT model; we only renamed certain parameters.
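If you prefer fetching these weights from the command line, one possible way (assuming a recent `huggingface_hub` with its CLI installed) is:

```{bash}
# Sketch: download the pre-processed BERT embeddings into ./bert-base-embedding.
huggingface-cli download CH3COOK/bert-base-embedding --local-dir ./bert-base-embedding
```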
Data Preparation
Our repo requires the 3D data from ScanNet, the natural language annotations, and the pre-trained LLM weights.
**Step 1. Download and Prepare the ScanNet 3D Data.**
**Updates 2024-07-01:** You can download the pre-processed data from [here](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/scannet_data.zip).
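If you go with the pre-processed archive, a possible fetch looks like the sketch below; the unzip destination is an assumption, so place the contents wherever `batch_load_scannet_data.py` would otherwise write its output.

```{bash}
# Sketch: fetch and unpack the pre-processed ScanNet data (the destination folder is an assumption).
huggingface-cli download CH3COOK/LL3DA-weight-release scannet_data.zip --local-dir .
unzip scannet_data.zip -d data/scannet/
```

Otherwise, prepare the data manually with the following steps.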
1. Follow the instructions [here](https://github.com/ch3cook-fdu/Vote2Cap-DETR/tree/master/data/scannet) and download the ScanNetV2 dataset.
2. Change the `SCANNET_DIR` to the scans folder in [`data/scannet/batch_load_scannet_data.py`](https://github.com/ch3cook-fdu/Vote2Cap-DETR/blob/master/data/scannet/batch_load_scannet_data.py#L16), and run the following commands.
```{bash}
cd data/scannet/
python batch_load_scannet_data.py
```

**Step 2. Prepare Language Annotations**
To train the model, you are required to prepare language annotations from `ScanRefer`, `Nr3D`, `ScanQA`, and the ScanNet part of `3D-LLM`.
1. `ScanRefer`. Follow the commands [here](https://github.com/daveredrum/ScanRefer) to download the `ScanRefer` dataset.
2. `Nr3D`. Follow the commands [here](https://referit3d.github.io/#dataset) to download the `Nr3D` dataset, and [pre-process](https://github.com/ch3cook-fdu/Vote2Cap-DETR/blob/master/data/parse_nr3d.py) it.
3. `ScanQA`. Follow the commands [here](https://github.com/ATR-DBI/ScanQA/blob/main/docs/dataset.md) to download the `ScanQA` dataset.
4. `3D-LLM`. The data are located [here](./data/3D_LLM), and we have also shared our pre-processing script [here](./data/3D_LLM/pre-process-3D-LLM.py). We will update to the latest released data (V3) from 3D-LLM.
Finally, organize the files into the following folders:
```
./data/
  ScanRefer/
    ScanRefer_filtered_train.json
    ScanRefer_filtered_train.txt
    ScanRefer_filtered_val.json
    ScanRefer_filtered_val.txt
  Nr3D/
    nr3d_train.json
    nr3d_train.txt
    nr3d_val.json
    nr3d_val.txt
  ScanQA/
    ScanQA_v1.0_test_w_obj.json
    ScanQA_v1.0_test_wo_obj.json
    ScanQA_v1.0_train.json
    ScanQA_v1.0_val.json
  3D_LLM/
    3d_llm_embodied_dialogue_filtered_train.json
    3d_llm_embodied_dialogue_filtered_val.json
    3d_llm_embodied_planning_filtered_train.json
    3d_llm_embodied_planning_filtered_val.json
    3d_llm_scene_description_train.json
    3d_llm_scene_description_val.json
```
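To double-check the layout, a hypothetical spot-check (file names taken from the listing above; extend it to the full set as needed) could look like:

```{bash}
# Sketch: verify a few of the expected annotation files exist under ./data.
for f in data/ScanRefer/ScanRefer_filtered_train.json \
         data/Nr3D/nr3d_train.json \
         data/ScanQA/ScanQA_v1.0_train.json \
         data/3D_LLM/3d_llm_scene_description_train.json; do
  [ -f "$f" ] || echo "missing: $f"
done
```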
**Step 3. \[Optional\] Download Pre-trained LLM weights.** If your server has no trouble auto-downloading weights from huggingface 🤗, feel free to skip this step. Download the files of the `opt-1.3b` checkpoint (or any other decoder-only LLM) from [huggingface](https://huggingface.co/facebook/opt-1.3b/tree/main) and store them under the `./facebook/opt-1.3b` directory. Make sure the required files are present:
```
./facebook/opt-1.3b/
  config.json
  merges.txt
  pytorch_model.bin
  special_tokens_map.json
  tokenizer_config.json
  vocab.json
```
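One possible way to pre-fetch these files, assuming `git-lfs` is installed, is to clone the model repository directly:

```{bash}
# Sketch: clone the OPT-1.3B checkpoint from the Hugging Face Hub (requires git-lfs).
git lfs install
git clone https://huggingface.co/facebook/opt-1.3b ./facebook/opt-1.3b
```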
## 💻 Train your own models

**Updates 2024-07-01:** The released version is slightly different from our paper implementation. In the released version, we *standardized the data format* and *dropped duplicated text annotations*. To reproduce our reported results, please use the scripts provided in `scripts-v0` to produce the generalist weights:
```
bash scripts-v0/opt-1.3b/train.generalist.sh
```

Our code should support **any decoder-only LLM** (`facebook/opt-1.3b`, `gpt2-xl`, `meta-llama/Llama-2-7b`, or even the **latest** `Qwen/Qwen1.5-1.8B` and `Qwen/Qwen1.5-4B`). Check out the following table for recommended LLMs at different scales. **By default, the models are trained with eight GPUs.**
| <1B | 1B-4B | ~7B |
|:-------------------------:|:-------------------------:|:--------------------------------:|
| `gpt2`(124m) | `TinyLlama-1.1B`(1.1b) | `facebook/opt-6.7b`(6.7b) |
| `facebook/opt-125m`(125m) | `facebook/opt-1.3b`(1.3b) | `meta-llama/Llama-2-7b-hf`(6.7b) |
| `gpt2-medium`(355m) | `gpt2-xl`(1.6b) | `Qwen/Qwen1.5-7B`(7.7b) |
| `Qwen/Qwen1.5-0.5B`(620m) | `Qwen/Qwen1.5-1.8B`(1.8b) | - |
| `gpt2-large`(774m) | `facebook/opt-2.7b`(2.7b) | - |
| - | `microsoft/phi-2`(2.8b) | - |
| - | `Qwen/Qwen1.5-4B`(3.9b) | - |

We provide training scripts in the `scripts` folder for different LLM backends. Feel free to modify the hyperparameters in those commands.
For other LLM backends, please modify the commands manually by changing `--vocab` to the desired LLM.
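For example, a hypothetical one-liner for switching the generalist script to another backend; it assumes the script spells the flag literally as `--vocab facebook/opt-1.3b`:

```{bash}
# Sketch: swap the LLM backend in the generalist training script (the flag spelling is an assumption).
sed -i 's|--vocab facebook/opt-1.3b|--vocab Qwen/Qwen1.5-1.8B|' scripts/opt-1.3b/train.generalist.sh
```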
Training
To train the model as a 3D generalist (we have also uploaded the pre-trained weights to [huggingface](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/ll3da-opt-1.3b.pth)):
```{bash}
bash scripts/opt-1.3b/train.generalist.sh
```

After the model is trained, you can tune the model on ScanQA for 3D Question Answering:
```{bash}
bash scripts/opt-1.3b/tuning.scanqa.sh
```

And, on ScanRefer / Nr3D for 3D Dense Captioning:
```{bash}
bash scripts/opt-1.3b/tuning.scanrefer.sh
bash scripts/opt-1.3b/tuning.nr3d.sh
```

You can also tune the model to predict bounding boxes for open vocabulary object detection!
```{bash}
bash scripts/opt-1.3b/tuning.ovdet.sh
```

Evaluation
To evaluate the model as a 3D generalist:
```{bash}
bash scripts/opt-1.3b/eval.generalist.sh
```

On ScanQA for 3D Question Answering:
```{bash}
bash scripts/opt-1.3b/eval.scanqa.sh
```

And, on ScanRefer / Nr3D for 3D Dense Captioning:
```{bash}
bash scripts/opt-1.3b/eval.scanrefer.sh
bash scripts/opt-1.3b/eval.nr3d.sh
```

## 📖 Citation
If you find our code or paper helpful, please consider starring ⭐ us and citing:
```{bibtex}
@misc{chen2023ll3da,
  title={LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning},
  author={Sijin Chen and Xin Chen and Chi Zhang and Mingsheng Li and Gang Yu and Hao Fei and Hongyuan Zhu and Jiayuan Fan and Tao Chen},
  year={2023},
  eprint={2311.18651},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

## Acknowledgments
Thanks to [Vote2Cap-DETR](https://github.com/ch3cook-fdu/Vote2Cap-DETR), [3D-LLM](https://github.com/UMass-Foundation-Model/3D-LLM), [Scan2Cap](https://github.com/daveredrum/Scan2Cap), and [3DETR](https://github.com/facebookresearch/3detr). We borrow some of their code and data.
## License
This code is distributed under the [MIT LICENSE](LICENSE). If there are any problems regarding our paper or code, feel free to open an issue!