{"id":15151218,"url":"https://github.com/qizekun/ShapeLLM","last_synced_at":"2025-09-29T20:31:21.976Z","repository":{"id":224955461,"uuid":"764692836","full_name":"qizekun/ShapeLLM","owner":"qizekun","description":"[ECCV 2024] ShapeLLM: Universal 3D Object Understanding for Embodied Interaction","archived":false,"fork":false,"pushed_at":"2024-07-16T12:32:19.000Z","size":2153,"stargazers_count":119,"open_issues_count":4,"forks_count":8,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-09-27T15:03:11.155Z","etag":null,"topics":["3d-point-clouds","large-language-models","representation-learning"],"latest_commit_sha":null,"homepage":"https://qizekun.github.io/shapellm/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qizekun.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-28T14:44:27.000Z","updated_at":"2024-09-27T07:53:00.000Z","dependencies_parsed_at":"2024-02-28T15:58:33.890Z","dependency_job_id":"32d6bc2b-ca60-4505-bb5d-be63da8183d2","html_url":"https://github.com/qizekun/ShapeLLM","commit_stats":null,"previous_names":["qizekun/shapellm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qizekun%2FShapeLLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qizekun%2FShapeLLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qizekun%2FShapeLLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qizekun%2FShapeLL
M/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/qizekun","download_url":"https://codeload.github.com/qizekun/ShapeLLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234659887,"owners_count":18867634,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d-point-clouds","large-language-models","representation-learning"],"created_at":"2024-09-26T15:01:01.562Z","updated_at":"2025-09-29T20:31:21.372Z","avatar_url":"https://github.com/qizekun.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# ShapeLLM: Universal 3D Object Understanding for Embodied Interaction\n\n*We present ShapeLLM, the first 3D Multimodal Large Language Model designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages.*\n\n[Zekun Qi](https://qizekun.github.io/), [Runpei Dong](https://runpeidong.web.illinois.edu/), [Shaochen Zhang](https://github.com/zsc000722), [Haoran Geng](https://geng-haoran.github.io/), [Chunrui Han](https://scholar.google.com/citations?user=D6tWz44AAAAJ), [Zheng Ge](https://joker316701882.github.io/), [Li Yi](https://ericyi.github.io) and [Kaisheng 
Ma](http://group.iiis.tsinghua.edu.cn/~maks/leader.html)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shapellm-universal-3d-object-understanding/3d-question-answering-3d-qa-on-3d-mm-vet)](https://paperswithcode.com/sota/3d-question-answering-3d-qa-on-3d-mm-vet?p=shapellm-universal-3d-object-understanding)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shapellm-universal-3d-object-understanding/zero-shot-transfer-3d-point-cloud-2)](https://paperswithcode.com/sota/zero-shot-transfer-3d-point-cloud-2?p=shapellm-universal-3d-object-understanding)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shapellm-universal-3d-object-understanding/zero-shot-3d-classification-on-objaverse-lvis)](https://paperswithcode.com/sota/zero-shot-3d-classification-on-objaverse-lvis?p=shapellm-universal-3d-object-understanding)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shapellm-universal-3d-object-understanding/zero-shot-transfer-3d-point-cloud)](https://paperswithcode.com/sota/zero-shot-transfer-3d-point-cloud?p=shapellm-universal-3d-object-understanding)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shapellm-universal-3d-object-understanding/3d-point-cloud-classification-on-scanobjectnn)](https://paperswithcode.com/sota/3d-point-cloud-classification-on-scanobjectnn?p=shapellm-universal-3d-object-understanding)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shapellm-universal-3d-object-understanding/3d-point-cloud-classification-on-modelnet40)](https://paperswithcode.com/sota/3d-point-cloud-classification-on-modelnet40?p=shapellm-universal-3d-object-understanding)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shapellm-universal-3d-object-understanding/few-shot-3d-point-cloud-classification-on-3)](https://paperswithcode.com/sota/few-shot-3d-po
int-cloud-classification-on-3?p=shapellm-universal-3d-object-understanding)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shapellm-universal-3d-object-understanding/3d-point-cloud-linear-classification-on)](https://paperswithcode.com/sota/3d-point-cloud-linear-classification-on?p=shapellm-universal-3d-object-understanding)\n\n[![Project Page](https://img.shields.io/badge/Project-Page-Green.svg)](https://qizekun.github.io/shapellm/)\n[![Paper PDF](https://img.shields.io/badge/Paper-PDF-orange.svg)](https://arxiv.org/abs/2402.17766)\n[![Hugging Face](https://img.shields.io/badge/🤗-Hugging_Face-yellow.svg)](https://huggingface.co/collections/qizekun/shapellm-65e978379c1260a85abe8aee)\n[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)\n[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE)\n\n\u003cdiv style=\"text-align: center;\"\u003e\n    \u003cimg src=\"assets/framework.jpg\" width=100% \u003e\n\u003c/div\u003e\n\n**1.** ShapeLLM is the first 3D Multimodal Large Language Model designed for `embodied interaction`.\n\n**2.** ShapeLLM supports `single-view colored point cloud input`, which can be effortlessly obtained from RGBD cameras.\n \n**3.** We introduce a robust 3D QA benchmark, `3D MM-Vet`, encompassing various variants including single-view, noise jitter, etc.\n\n**4.** We extend the powerful point encoder architecture, `ReCon++`, achieving state-of-the-art performance across a range of representation learning tasks.\n\n## Contents\n- [Install](#install)\n- [Model Zoo](https://github.com/qizekun/ShapeLLM/blob/main/docs/MODEL_ZOO.md)\n- [Dataset](https://github.com/qizekun/ShapeLLM/blob/main/docs/DATA.md)\n- [ShapeLLM](#ShapeLLM)\n  - [Demo](#Demo)\n  - [Training](#Training)\n  - [3D 
MM-Vet](#Zero-shot-Understanding-on-3D-MM-Vet)\n  - [GApartNet](#Visual-Grounding-on-GApartNet)\n- [ReCon++](#ReCon++)\n  - [Pretrain](#Pretrain)\n  - [Classification](#Classification)\n  - [Few-shot Learning](#Few-shot-Learning)\n  - [Zero-shot Learning](#Zero-shot-Learning)\n- [3D MM-Vet](#3D-MM-Vet)\n\n## Install\n\n[//]: # (If you are using Windows, do *NOT* proceed, see instructions [here]\u0026#40;https://github.com/qizekun/LLaVA/blob/main/docs/Windows.md\u0026#41;.)\n\n1. Clone this repository and navigate to ShapeLLM folder\n```Shell\ngit clone https://github.com/qizekun/ShapeLLM.git\ncd ShapeLLM\n```\n2. Install Package\n```Shell\nconda create -n shapellm python=3.10 -y\nconda activate shapellm\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\n```\n3. Install additional packages for training cases\n```Shell\npip install -e \".[train]\"\npip install flash-attn --no-build-isolation\n```\n4. Install PointNet++\n```Shell\npip install \"git+https://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops\u0026subdirectory=pointnet2_ops_lib\"\n```\n\n\n## ShapeLLM\n### model weights\nPlease check out our [Model Zoo](https://github.com/qizekun/ShapeLLM/blob/main/docs/MODEL_ZOO.md) for all public ShapeLLM checkpoints.\n\n### Demo\n#### CLI Inference\nChat about point clouds using CLI interface. It also supports multiple GPUs, 4-bit and 8-bit quantized inference.\nIf you encounter issues accessing Huggingface, please use `export HF_ENDPOINT=https://hf-mirror.com`.\n```Shell\npython -m llava.serve.cli \\\n    --model-path qizekun/ShapeLLM_13B_general_v1.0 \\\n    --pts-file assets/instrument.npy\n```\n\n### Training\nConsistent with LLaVA, we adopt a two-stage training approach. In the first stage, we solely fine-tune the projector for semantic alignment. 
In the second stage, we conduct full fine-tuning using Instruction Following data.\nDownload data following [DATA](https://github.com/qizekun/ShapeLLM/blob/main/docs/DATA.md), organize the data as follows in `./playground/data/shapellm/`,\n```\n│playground/data/shapellm/\n├── cap3d_objaverse_785k.json\n├── cap3d_objaverse_sft_45k.json\n├── gapartnet_sft_27k_openai.json\n├── gapartnet_pcs\n│   ├── Box_100129_0_0.npy\n│   └── ...\n└── cap3d_pcs\n    ├── 00000054c36d44a2a483bdbff31d8edf.pt\n    └── ...\n```\nFurthermore, ShapeLLM utilizes the Large version of [ReCon++](https://github.com/qizekun/ShapeLLM/blob/main/ReConV2/cfgs/pretrain/large/openshape.yaml) as the point encoder.\nYou need to download the [ReCon++ weight](https://huggingface.co/qizekun/ReConV2/blob/main/zeroshot/large/best_lvis.pth) and save it to `./checkpoints/recon/large.pth`.\n```\n│checkpoints/recon/\n└── large.pth\n```\n**1. Feature Alignment Stage**\n```\nsh scripts/pretrain.sh\n```\n**2. Visual Instruction Tuning Stage**\n```\nsh scripts/finetune.sh\n```\nThe training takes around 14 hours for ShapeLLM-13B on 8x A100 (80G). 
It takes around 7 hours for ShapeLLM-7B.\n\n### Zero-shot Understanding on 3D MM-Vet\nTo evaluate 3D MLLMs on integrated capabilities and embodied interaction capabilities, run the script:\n```\nsh scripts/eval/mmvet.sh\n```\nUse GPT-4 to calculate the 3D MM-Vet score:\n```\nsh scripts/eval/eval_mmvet.sh\n```\n\n### Visual Grounding on GApartNet\nTo evaluate the performance of ShapeLLM on the GApartNet dataset, run the script:\n```\nsh scripts/eval/gapartnet_ref.sh\n```\nCalculate the generative 3D visual grounding accuracy:\n```\nsh scripts/eval/eval_gapartnet.sh\n```\n\n## ReCon++\n### ReCon++ model weights\nPlease check out our [Model Zoo](https://github.com/qizekun/ShapeLLM/blob/main/docs/MODEL_ZOO.md) for all public ReCon++ checkpoints.\n\n### Pretrain\nDownload and organize data following [DATA](https://github.com/qizekun/ShapeLLM/blob/main/docs/DATA.md).\nIf you encounter issues accessing Hugging Face, please use `export HF_ENDPOINT=https://hf-mirror.com`.\n\nReCon++ adopts a two-stage pre-training approach, initially conducting generative pre-training in either random or causal form, followed by cross-modal contrastive learning. 
It is worth noting that we employ a gradient stopping strategy for transfer learning tasks, while we do not use gradient stopping for zero-shot tasks.\n```\nsh ReConV2/scripts/pretrain_zeroshot/pretrain_reconstruct.sh \u003cexp_name\u003e\nsh ReConV2/scripts/pretrain_transfer/pretrain_reconstruct.sh \u003cexp_name\u003e\n```\n```\nsh ReConV2/scripts/pretrain_zeroshot/pretrain_contrast.sh \u003cexp_name\u003e \u003cpath/to/stage1-pre-trained/model\u003e\nsh ReConV2/scripts/pretrain_transfer/pretrain_contrast.sh \u003cexp_name\u003e \u003cpath/to/stage1-pre-trained/model\u003e\n```\n\n### Classification\n| Model                                                 | Version | OBJ_BG | OBJ_ONLY | PB_T50_RS | MN-40 1k | MN-40 8k |\n|-------------------------------------------------------|---------|--------|----------|-----------|----------|----------|\n| [ACT](https://github.com/RunpeiDong/ACT)              | Small   | 93.29% | 91.91%   | 88.21%    | 93.7%    | 94.0%    |\n| [ReCon](https://github.com/qizekun/ReCon)             | Small   | 95.35% | 93.80%   | 91.26%    | 94.5%    | 94.7%    |\n| [PointGPT](https://github.com/CGuangyan-BIT/PointGPT) | Base    | 95.8%  | 95.2%    | 91.9%     | 94.4%    | 94.6%    |\n| [ReCon++](https://github.com/qizekun/ShapeLLM)        | Base    | 98.62% | 96.21%   | 93.34%    | 94.6%    | 94.8%    |\n| [ReCon++](https://github.com/qizekun/ShapeLLM)        | Large   | 98.80% | 97.59%   | 95.25%    | 94.8%    | 95.0%    |\n\nFine-tuning with the default configuration, run the script:\n```\nbash ReConV2/scripts/downstream/cls.sh \u003cGPU\u003e \u003cexp_name\u003e \u003cpath/to/pre-trained/model\u003e\n```\nTest\u0026Voting with the default configuration, run the script:\n```\nbash ReConV2/scripts/downstream/test.sh \u003cGPU\u003e \u003cexp_name\u003e \u003cpath/to/best/fine-tuned/model\u003e\n```\n\n### Few-shot-Learning\n| Model                                                 | Version | 5w10s (%)  | 5w20s (%)  | 10w10s (%) | 10w20s (%) 
|\n|-------------------------------------------------------|---------|------------|------------|------------|------------|\n| [ACT](https://github.com/RunpeiDong/ACT)              | Small   | 96.8 ± 2.3 | 98.0 ± 1.4 | 93.3 ± 4.0 | 95.6 ± 2.8 |\n| [ReCon](https://github.com/qizekun/ReCon)             | Small   | 97.3 ± 1.9 | 98.9 ± 1.2 | 93.3 ± 3.9 | 95.8 ± 3.0 |\n| [PointGPT](https://github.com/CGuangyan-BIT/PointGPT) | Large   | 98.0 ± 1.9 | 99.0 ± 1.0 | 94.1 ± 3.3 | 96.1 ± 2.8 |\n| [ReCon++](https://github.com/qizekun/ShapeLLM)        | Large   | 98.0 ± 2.3 | 99.5 ± 0.8 | 94.5 ± 4.1 | 96.5 ± 3.0 |\n\nFew-shot with the default configuration, run the script:\n```\nsh ReConV2/scripts/downstream/fewshot.sh \u003cGPU\u003e \u003cexp_name\u003e \u003cpath/to/pre-trained/model\u003e \u003cway\u003e \u003cshot\u003e \u003cfold\u003e\n```\n\n### Zero-shot-Learning\n| Model                                                  | Version | Objaverse-LVIS | ModelNet40 | ScanObjectNN |\n|--------------------------------------------------------|---------|----------------|------------|--------------|\n| [OpenShape](https://github.com/Colin97/OpenShape_code) | Base    | 46.8%          | 84.4%      | 52.2%        |\n| [Uni3D](https://github.com/baaivision/Uni3D)           | Base    | 51.7%          | 86.3%      | 63.8%        |\n| [Uni3D](https://github.com/baaivision/Uni3D)           | Large   | 53.1%          | 86.3%      | 58.2%        |\n| [ReCon++](https://github.com/qizekun/ShapeLLM)         | Base    | 53.2%          | 86.5%      | 63.6%        |\n| [ReCon++](https://github.com/qizekun/ShapeLLM)         | Large   | 53.7%          | 87.3%      | 65.4%        |\n\nIn the pre-training process, Zero-shot evaluation is enabled by default.\nZero-shot with the default configuration, run the script:\n```\nbash ReConV2/scripts/downstream/zeroshot.sh \u003cGPU\u003e \u003cexp_name\u003e \u003cpath/to/pre-trained/model\u003e\n```\n\n\n## 3D MM-Vet\n\n3D MM-Vet is a carefully crafted 
multi-level 3D QA benchmark that consists of 59 unique 3D models and 232 human-written questions and answers with rich content.\n\nThe test data and scripts have been uploaded to [Hugging Face](https://huggingface.co/datasets/qizekun/3D-MM-Vet). You can also find the evaluation scripts in the [codebase](https://github.com/qizekun/ShapeLLM/blob/main/scripts/eval/eval_mmvet.sh) of ShapeLLM.\n\nFurthermore, we propose 3D MM-Vet-C, which contains three variants: single-view, jitter, and rotation. They correspond to extracting the partial point cloud within the front field of view, adding Gaussian noise to the point cloud xyz, and applying random rotation on the x, y, and z axes, respectively.\n\nHere is a more detailed explanation of each variant:\n\n- **Single-view**: This variant focuses on the model's ability to understand the 3D object from a single viewpoint. To create the single-view variant, we extract the front-view point cloud of each model.\n- **Jitter**: This variant tests the model's robustness to noise. To create the jitter variant, we add Gaussian noise with zero mean and variance of 0.01 to the point cloud xyz.\n- **Rotation**: This variant examines the model's ability to understand the 3D scene from different viewpoints. To create the rotation variant, we apply 30 degrees of random rotation on the x, y, and z axes.\n\nWe believe that 3D MM-Vet and 3D MM-Vet-C are valuable resources for the 3D QA community. 
They can be used to evaluate the performance of existing models and to develop new models that are better at understanding and reasoning about 3D objects.\n\n## Visualization\nWe use [PointVisualizaiton](https://github.com/qizekun/PointVisualizaiton) repo to render beautiful point cloud images, including specified color rendering and attention distribution rendering.\n\n## Acknowledgement\n\nThis codebase is built upon [LLaVA](https://github.com/haotian-liu/LLaVA), [OpenShape](https://github.com/Colin97/OpenShape_code), [ReCon](https://github.com/qizekun/ReCon) and [PointGPT](https://github.com/CGuangyan-BIT/PointGPT).\n\n## Related Works\n\n- [Point-Bind \u0026 Point-LLM](https://arxiv.org/abs/2309.00615)\n- [3D-LLM](https://arxiv.org/abs/2307.12981)\n- [PointLLM](http://arxiv.org/abs/2308.16911)\n\n## Citation\n\nIf you find ShapeLLM or ReCon++ useful for your research and applications, please cite using this BibTeX:\n```bibtex\n@article{qi2024shapellm,\n  author = {Qi, Zekun and Dong, Runpei and Zhang, Shaochen and Geng, Haoran and Han, Chunrui and Ge, Zheng and Yi, Li and Ma, Kaisheng},\n  title = {ShapeLLM: Universal 3D Object Understanding for Embodied Interaction},\n  journal = {arXiv preprint arXiv:2402.17766},\n  year = {2024}\n}\n```\nand closely related work [ReCon](https://github.com/qizekun/ReCon) and [ACT](https://github.com/RunpeiDong/ACT):\n```bibtex\n@inproceedings{qi2023recon,\n  title={Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining},\n  author={Qi, Zekun and Dong, Runpei and Fan, Guofan and Ge, Zheng and Zhang, Xiangyu and Ma, Kaisheng and Yi, Li},\n  booktitle={International Conference on Machine Learning (ICML) },\n  url={https://openreview.net/forum?id=80IfYewOh1},\n  year={2023}\n}\n\n@inproceedings{dong2023act,\n  title={Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?},\n  author={Runpei Dong and Zekun Qi and Linfeng Zhang and 
Junbo Zhang and Jianjian Sun and Zheng Ge and Li Yi and Kaisheng Ma},\n  booktitle={The Eleventh International Conference on Learning Representations (ICLR) },\n  url={https://openreview.net/forum?id=8Oun8ZUVe8N},\n  year={2023}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqizekun%2FShapeLLM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqizekun%2FShapeLLM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqizekun%2FShapeLLM/lists"}