{"id":15151220,"url":"https://github.com/Open3DA/LL3DA","last_synced_at":"2025-09-29T20:31:36.833Z","repository":{"id":209426953,"uuid":"724031176","full_name":"Open3DA/LL3DA","owner":"Open3DA","description":"[CVPR 2024] \"LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning\"; an interactive Large Language 3D Assistant.","archived":false,"fork":false,"pushed_at":"2024-07-17T15:08:55.000Z","size":76201,"stargazers_count":227,"open_issues_count":15,"forks_count":9,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-09-27T15:03:11.163Z","etag":null,"topics":["3d","3d-models","3d-to-text","cvpr2024","gpt","instruction-tuning","language-model","llm","multi-modal","scene-understanding"],"latest_commit_sha":null,"homepage":"https://ll3da.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Open3DA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-27T08:58:00.000Z","updated_at":"2024-09-25T21:09:14.000Z","dependencies_parsed_at":"2024-03-04T20:08:17.320Z","dependency_job_id":null,"html_url":"https://github.com/Open3DA/LL3DA","commit_stats":null,"previous_names":["open3da/ll3da"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open3DA%2FLL3DA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open3DA%2FLL3DA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open3DA%2FLL3DA/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open3DA%2FLL3DA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Open3DA","download_url":"https://codeload.github.com/Open3DA/LL3DA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234659889,"owners_count":18867635,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d","3d-models","3d-to-text","cvpr2024","gpt","instruction-tuning","language-model","llm","multi-modal","scene-understanding"],"created_at":"2024-09-26T15:01:01.586Z","updated_at":"2025-09-29T20:31:26.863Z","avatar_url":"https://github.com/Open3DA.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv align= \"center\"\u003e\n    \u003ch1\u003e Official repo for LL3DA \u003cimg src=\"./assets/icon.png\" width=\"35px\"\u003e\u003c/h1\u003e\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ch2\u003e \u003ca href=\"https://ll3da.github.io/\"\u003eLL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning\u003c/a\u003e\u003c/h2\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://ll3da.github.io/\"\u003e💻Project Page\u003c/a\u003e •\n  \u003ca href=\"https://arxiv.org/abs/2311.18651\"\u003e📄Arxiv Paper\u003c/a\u003e •\n  \u003ca href=\"https://www.youtube.com/watch?v=224JzkdHjfg\"\u003e🎞YouTube\u003c/a\u003e •\n  🤗HuggingFace Demo (WIP) •\n  \u003ca 
href=\"#-citation\"\u003eCitation\u003c/a\u003e\n\u003c/p\u003e\n\n\u003c/div\u003e\n\n![teaser.gif](assets/teaser-simutaneous.gif)\n\n\n## 🏃 Intro LL3DA\n\nLL3DA is a Large Language 3D Assistant that can respond to both visual and textual interactions within **complex 3D environments**.\n\u003c!-- \n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003eTechnical details\u003c/b\u003e\u003c/summary\u003e --\u003e\n\nRecent advances in Large Multimodal Models (LMMs) have enabled various applications in human-machine interaction. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially considering the demand for understanding permutation-invariant point cloud representations of the 3D scene. Existing works seek help from multi-view images, and project 2D features to 3D space as 3D scene representations. This, however, leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts. This helps LMMs better comprehend human interactions and further helps to remove ambiguities in cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results, and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.\n\n![pipeline.png](assets/pipeline.png)\n\n\n## 🚩 News\n\n- 2024-03-04. 💥 The code is fully released! Now you can train your customized models!\n- 2024-02-27. 🎉 LL3DA is accepted by \u003cfont color=\"#dd0000\"\u003eCVPR 2024\u003c/font\u003e! See you in Seattle!\n- 2023-11-30. 📣 Uploaded the paper and initialized the project.\n\n**TODO**:\n\n- [x] Upload our paper to arXiv and build project pages.\n- [x] Pray for acceptance.\n- [x] Upload all the code and training scripts.\n- [x] Release pre-trained weights. 
(see [checkpoint](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/ll3da-opt-1.3b.pth))\n- [ ] Add local demo interface.\n- [ ] Train on larger 3D VL benchmarks and scale up models.\n\n## ⚡ Quick Start\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eEnvironment Setup\u003c/b\u003e\u003c/summary\u003e\n\n**Step 1. Build Dependencies.** Our code is tested with CUDA 11.6 and Python 3.8.16. To run the code, you should first install the following packages:\n\n```\nh5py\nscipy\ncython\nplyfile\n'trimesh\u003e=2.35.39,\u003c2.35.40'\n'networkx\u003e=2.2,\u003c2.3'\n'torch==1.13.1+cu116'\n'transformers\u003e=4.37.0'\n```\n\nAfter that, build `pointnet2` and the accelerated `giou` from source:\n\n```{bash}\ncd third_party/pointnet2\npython setup.py install\n```\n\n```{bash}\ncd utils\npython cython_compile.py build_ext --inplace\n```\n\n**Step 2. Download pre-trained embeddings.** Download the pre-processed BERT embedding weights from [huggingface](https://huggingface.co/CH3COOK/bert-base-embedding/tree/main) and store them under the [`./bert-base-embedding`](./bert-base-embedding) folder. The weights are **the same** as those of the official BERT model; we only renamed certain parameters.\n\n\u003c/details\u003e\n\n\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eData Preparation\u003c/b\u003e\u003c/summary\u003e\n\nOur repo requires the 3D data from ScanNet, the natural language annotations, and the pre-trained LLM weights.\n\n**Step 1. Download and Prepare the ScanNet 3D Data.**\n\n**\u003cfont color=\"#dd0000\"\u003eUpdates 2024-07-01:\u003c/font\u003e** You can download the pre-processed data from [here](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/scannet_data.zip).\n\n\n1. Follow the instructions [here](https://github.com/ch3cook-fdu/Vote2Cap-DETR/tree/master/data/scannet) and download the ScanNetV2 dataset. \n2. 
Change the `SCANNET_DIR` to the scans folder in [`data/scannet/batch_load_scannet_data.py`](https://github.com/ch3cook-fdu/Vote2Cap-DETR/blob/master/data/scannet/batch_load_scannet_data.py#L16), and run the following commands.\n```{bash}\ncd data/scannet/\npython batch_load_scannet_data.py\n```\n\n**Step 2. Prepare Language Annotations**\n\nTo train the model, you are required to prepare language annotations from `ScanRefer`, `Nr3D`, `ScanQA`, and the ScanNet part of `3D-LLM`.\n\n1. `ScanRefer`. Follow the commands [here](https://github.com/daveredrum/ScanRefer) to download the `ScanRefer` dataset.\n2. `Nr3D`. Follow the commands [here](https://referit3d.github.io/#dataset) to download the `Nr3D` dataset, and [pre-process](https://github.com/ch3cook-fdu/Vote2Cap-DETR/blob/master/data/parse_nr3d.py) it.\n3. `ScanQA`. Follow the commands [here](https://github.com/ATR-DBI/ScanQA/blob/main/docs/dataset.md) to download the `ScanQA` dataset.\n4. `3D-LLM`. The data are located at [here](./data/3D_LLM). We have also shared our pre-processing scripts [here](./data/3D_LLM/pre-process-3D-LLM.py).\n\nWe will update the latest released data (V3) from 3D-LLM.\n\n\nFinally, organize the files into the following folders:\n\n```\n./data/\n  ScanRefer/\n    ScanRefer_filtered_train.json\n    ScanRefer_filtered_train.txt\n    ScanRefer_filtered_val.json\n    ScanRefer_filtered_val.txt\n\n  Nr3D/\n    nr3d_train.json\n    nr3d_train.txt\n    nr3d_val.json\n    nr3d_val.txt\n\n  ScanQA/\n    ScanQA_v1.0_test_w_obj.json\n    ScanQA_v1.0_test_wo_obj.json\n    ScanQA_v1.0_train.json\n    ScanQA_v1.0_val.json\n\n  3D_LLM/\n    3d_llm_embodied_dialogue_filtered_train.json\n    3d_llm_embodied_dialogue_filtered_val.json\n    3d_llm_embodied_planning_filtered_train.json\n    3d_llm_embodied_planning_filtered_val.json\n    3d_llm_scene_description_train.json\n    3d_llm_scene_description_val.json\n```\n\n**Step 3. 
\\[Optional\\] Download Pre-trained LLM weights.** If your server has no trouble auto-downloading weights from huggingface🤗, feel free to skip this step.\n\nDownload files from the `opt-1.3b` checkpoint (or any other decoder-only LLM) at [huggingface](https://huggingface.co/facebook/opt-1.3b/tree/main), and store them under the `./facebook/opt-1.3b` directory. Make sure the required files are downloaded:\n```\n./facebook/opt-1.3b/\n  config.json\n  merges.txt\n  pytorch_model.bin\n  special_tokens_map.json\n  tokenizer_config.json\n  vocab.json\n```\n\n\n\u003c/details\u003e\n\n\n\n\n## 💻 Train your own models\n\n**\u003cfont color=\"#dd0000\"\u003eUpdates 2024-07-01:\u003c/font\u003e** The released version is slightly different from our paper implementation. In our released version, we *standardized the data format* and *dropped duplicated text annotations*. To reproduce our reported results, please use the scripts provided in `scripts-v0` to produce the generalist weights.\n\n```\nbash scripts-v0/opt-1.3b/train.generalist.sh\n```\n\nOur code should support **any decoder-only LLM** (`facebook/opt-1.3b`, `gpt2-xl`, `meta-llama/Llama-2-7b` or even the **\u003cfont color=\"#dd0000\"\u003eLATEST\u003c/font\u003e** `Qwen/Qwen1.5-1.8B` and `Qwen/Qwen1.5-4B`). Check out the following table for recommended LLMs at different scales! 
**By default, the models are trained with eight GPUs.**\n\n|            \u003c1B            |           1B-4B           |                ~7B               |\n|:-------------------------:|:-------------------------:|:--------------------------------:|\n|        `gpt2`(124m)       |   `TinyLlama-1.1B`(1.1b)  |     `facebook/opt-6.7b`(6.7b)    |\n| `facebook/opt-125m`(125m) | `facebook/opt-1.3b`(1.3b) | `meta-llama/Llama-2-7b-hf`(6.7b) |\n|    `gpt2-medium`(355m)    |      `gpt2-xl`(1.6b)      |      `Qwen/Qwen1.5-7B`(7.7b)     |\n| `Qwen/Qwen1.5-0.5B`(620m) | `Qwen/Qwen1.5-1.8B`(1.8b) |                 -                |\n|     `gpt2-large`(774m)    | `facebook/opt-2.7b`(2.7b) |                 -                |\n|             -             |  `microsoft/phi-2`(2.8b)  |                 -                |\n|             -             |  `Qwen/Qwen1.5-4B`(3.9b)  |                 -                |\n\nWe provide training scripts in the `scripts` folder with different LLM backends. Feel free to modify the hyper parameters in those commands.\n\nFor other LLM backends, please modify the commands manually by changing `--vocab` to other LLMs.\n\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eTraining\u003c/b\u003e\u003c/summary\u003e\n\n  To train the model as a 3D generalist: (We have also uploaded the pre-trained weights to [huggingface](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/ll3da-opt-1.3b.pth).)\n\n  ```{bash}\n  bash scripts/opt-1.3b/train.generalist.sh\n  ```\n\n  After the model is trained, you can tune the model on ScanQA for 3D Question Answering:\n\n  ```{bash}\n  bash scripts/opt-1.3b/tuning.scanqa.sh\n  ```\n\n  And, on ScanRefer / Nr3D for 3D Dense Captioning:\n\n  ```{bash}\n  bash scripts/opt-1.3b/tuning.scanrefer.sh\n  bash scripts/opt-1.3b/tuning.nr3d.sh\n  ```\n\n  You can also tune the model to predict bounding boxes for open vocabulary object detection!\n\n  ```{bash}\n  bash scripts/opt-1.3b/tuning.ovdet.sh\n  
```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eEvaluation\u003c/b\u003e\u003c/summary\u003e\n\n  To evaluate the model as a 3D generalist:\n\n  ```{bash}\n  bash scripts/opt-1.3b/eval.generalist.sh\n  ```\n\n  On ScanQA for 3D Question Answering:\n\n  ```{bash}\n  bash scripts/opt-1.3b/eval.scanqa.sh\n  ```\n\n  And, on ScanRefer / Nr3D for 3D Dense Captioning:\n\n  ```{bash}\n  bash scripts/opt-1.3b/eval.scanrefer.sh\n  bash scripts/opt-1.3b/eval.nr3d.sh\n  ```\n\n\u003c/details\u003e\n\n\n## 📖 Citation\n\nIf you find our code or paper helpful, please consider starring ⭐ us and citing:\n\n```{bibtex}\n@misc{chen2023ll3da,\n    title={LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning}, \n    author={Sijin Chen and Xin Chen and Chi Zhang and Mingsheng Li and Gang Yu and Hao Fei and Hongyuan Zhu and Jiayuan Fan and Tao Chen},\n    year={2023},\n    eprint={2311.18651},\n    archivePrefix={arXiv},\n    primaryClass={cs.CV}\n}\n```\n\n## Acknowledgments\n\nThanks to [Vote2Cap-DETR](https://github.com/ch3cook-fdu/Vote2Cap-DETR), [3D-LLM](https://github.com/UMass-Foundation-Model/3D-LLM), [Scan2Cap](https://github.com/daveredrum/Scan2Cap), and [3DETR](https://github.com/facebookresearch/3detr). We borrow some of their code and data.\n\n\n## License\n\nThis code is distributed under an [MIT LICENSE](LICENSE). If you have any problems with our paper or code, feel free to open an issue!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpen3DA%2FLL3DA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpen3DA%2FLL3DA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpen3DA%2FLL3DA/lists"}