{"id":13442332,"url":"https://github.com/RunpeiDong/ACT","last_synced_at":"2025-03-20T13:33:27.001Z","repository":{"id":65412797,"uuid":"580232368","full_name":"RunpeiDong/ACT","owner":"RunpeiDong","description":"[ICLR 2023] Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?","archived":false,"fork":false,"pushed_at":"2024-07-01T14:55:49.000Z","size":5770,"stargazers_count":98,"open_issues_count":0,"forks_count":5,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-10-28T05:13:07.181Z","etag":null,"topics":["3d-point-clouds","cross-modal-learning","representation-learning","self-supervised-learning"],"latest_commit_sha":null,"homepage":"https://openreview.net/forum?id=8Oun8ZUVe8N","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RunpeiDong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-20T03:27:25.000Z","updated_at":"2024-10-22T16:48:34.000Z","dependencies_parsed_at":"2024-01-16T02:46:28.278Z","dependency_job_id":"f37da3c0-16af-4c3c-91ea-722608048cbe","html_url":"https://github.com/RunpeiDong/ACT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RunpeiDong%2FACT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RunpeiDong%2FACT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RunpeiDong%2FACT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/RunpeiDong%2FACT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RunpeiDong","download_url":"https://codeload.github.com/RunpeiDong/ACT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244619281,"owners_count":20482393,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d-point-clouds","cross-modal-learning","representation-learning","self-supervised-learning"],"created_at":"2024-07-31T03:01:44.449Z","updated_at":"2025-03-20T13:33:26.202Z","avatar_url":"https://github.com/RunpeiDong.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Autoencoders as Cross-Modal Teachers\n\u003e [**Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?**](https://openreview.net/forum?id=8Oun8ZUVe8N) **ICLR 2023** \u003cbr\u003e\n\u003e [Runpei Dong](https://runpeidong.com), [Zekun Qi](https://scholar.google.com.hk/citations?user=ap8yc3oAAAAJ\u0026hl=en), [Linfeng Zhang](http://group.iiis.tsinghua.edu.cn/~maks/linfeng/index.html), [Junbo Zhang](https://scholar.google.com.hk/citations?user=rSP0pGQAAAAJ\u0026hl=en), [Jianjian Sun](https://scholar.google.com.hk/citations?user=MVZrGkYAAAAJ\u0026hl=en\u0026oi=ao), [Zheng Ge](https://scholar.google.com.hk/citations?user=hJ-VrrIAAAAJ\u0026hl=en\u0026oi=ao), [Li Yi](https://ericyi.github.io/), and [Kaisheng Ma](http://group.iiis.tsinghua.edu.cn/~maks/leader.html) 
\u003cbr\u003e\n\n[OpenReview](https://openreview.net/forum?id=8Oun8ZUVe8N) | [arXiv](https://arxiv.org/abs/2212.08320) | [Models](https://drive.google.com/drive/folders/1hZUmqRvAg64abnkaI1HxctfQPTetWpaH?usp=share_link)\n\nThis repository contains the code release of paper **Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?** (ICLR 2023).\n\n## News\n\n- 🍾 July, 2024: [**ShapeLLM (ReCon++)**](https://qizekun.github.io/shapellm/) accepted by ECCV 2024, check out the [code](https://github.com/qizekun/ShapeLLM)\n- 💥 Mar, 2024: Check out our latest work [**ShapeLLM (ReCon++)**](https://qizekun.github.io/shapellm/), which achieves **95.25%** fine-tuned accuracy and **65.4** zero-shot accuracy on ScanObjectNN\n- 🎉 Apr, 2023: [**ReCon**](https://arxiv.org/abs/2302.02318) accepted by ICML 2023, check out the [code](https://github.com/qizekun/ReCon)\n- 📌 Feb, 2023: Check out our latest work [**ReCon**](https://arxiv.org/abs/2302.02318), which achieves **91.26%** accuracy on ScanObjectNN\n- 💥 Jan, 2023: [**ACT**](https://arxiv.org/abs/2212.08320) accepted by ICLR 2023\n\n## ACT:clapper:\n\nThe success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to fetch in 3D compared to 2D images or natural languages. This promotes the potential of utilizing models pretrained with data more than 3D as teachers for cross-modal knowledge transferring. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained with 2D images or natural languages can help self-supervised 3D representation learning through training **A**utoencoders as **C**ross-Modal **T**eachers (**ACT**:clapper:). 
The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used as the target of masked point modeling, wherein the dark knowledge is distilled to the 3D Transformer students as foundational geometry understanding. Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN.\n\n\u003cdiv  align=\"center\"\u003e    \n \u003cimg src=\"./figure/framework.png\" width = \"666\"  align=center /\u003e\n\u003c/div\u003e\n\n\n## Environment\n\nThis codebase was tested with the following environment configurations. It may work with other versions.\n- Ubuntu 18.04\n- CUDA 11.3\n- GCC 7.5.0\n- Python 3.8.8\n- PyTorch 1.10.0\n\n## 1. Installation\nWe recommend using Anaconda for the installation process:\n```shell\n# Make sure `g++-7 --version` is at least 7.4.0\n$ sudo apt install g++-7  # For CUDA 10.2, must use GCC \u003c 8\n\n# Create virtual env and install PyTorch\n$ conda create -n act python=3.8.8\n$ conda activate act\n\n(act) $ conda install openblas-devel -c anaconda\n(act) $ conda install pytorch==1.10.0 torchvision==0.11.0 cudatoolkit=11.3 -c pytorch -c nvidia\n# Or, you can set up Pytorch with pip from official link:\n# (act) $ pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html # recommended\n# For CUDA 10.2, use conda:\n# (act) $ conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 -c pytorch -c nvidia\n# Or pip:\n# (act) $ pip install torch==1.11.0+cu102 torchvision==0.12.0+cu102 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu102\n\n# Install basic required packages\n(act) $ pip install -r requirements.txt\n\n# Chamfer 
Distance\n(act) $ cd ./extensions/chamfer_dist \u0026\u0026 python setup.py install --user\n# PointNet++\n(act) $ pip install \"git+https://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops\u0026subdirectory=pointnet2_ops_lib\"\n# GPU kNN\n(act) $ pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl\n```\n\n## 2. Datasets\n\nWe use ShapeNet, ScanObjectNN, ModelNet40, S3DIS and ShapeNetPart in this work. See [DATASET.md](./DATASET.md) for details.\n\n## 3. Models\n\nThe models and logs have been released on [Google Drive](https://drive.google.com/drive/folders/1hZUmqRvAg64abnkaI1HxctfQPTetWpaH?usp=share_link). See [MODEL_ZOO.md](./model_zoo/MODEL_ZOO.md) for details.\n\n## 4. ACT Pre-training\nTo pretrain ACT on the ShapeNet training set, run the following command. If you want to try different models or masking ratios etc., first create a new config file, and pass its path to --config.\n\nACT pretraining includes two stages:\n\n* Stage I, transferring pretrained Transformer on ShapeNet as 3D autoencoder by running:\n\n  ```shell\n  CUDA_VISIBLE_DEVICES=\u003cGPUs\u003e python main_autoencoder.py \\\n      --config \"cfgs/autoencoder/act_dvae_with_pretrained_transformer.yaml\" \\\n      --exp_name \u003coutput_file_name\u003e\n  ```\n\n  or\n\n  ```shell\n  sh train_autoencoder.sh \u003cGPU\u003e\n  ```\n\n* Stage II, pretrain 3D Transformer student on ShapeNet by running:\n\n  ```shell\n  CUDA_VISIBLE_DEVICES=\u003cGPUs\u003e \\\n      python main.py --config \"cfgs/pretrain/pretrain_act_distill.yaml\" \\\n      --exp_name \u003coutput_file_name\u003e\n  ```\n\n  or\n\n  ```shell\n  sh pretrain.sh \u003cGPU\u003e\n  ```\n\n## 5. 
ACT Fine-tuning\n\nTo fine-tune on ScanObjectNN, run:\n```shell\nCUDA_VISIBLE_DEVICES=\u003cGPUs\u003e python main.py --config cfgs/finetune_classification/full/finetune_scan_hardest.yaml \\\n--finetune_model --exp_name \u003coutput_file_name\u003e --ckpts \u003cpath/to/pre-trained/model\u003e\n```\nTo fine-tune on ModelNet40, run:\n```shell\nCUDA_VISIBLE_DEVICES=\u003cGPUs\u003e python main.py --config cfgs/finetune_classification/full/finetune_modelnet.yaml \\\n--finetune_model --exp_name \u003coutput_file_name\u003e --ckpts \u003cpath/to/pre-trained/model\u003e\n```\nTo test with voting on ModelNet40, run:\n```shell\nCUDA_VISIBLE_DEVICES=\u003cGPUs\u003e python main.py --test --config cfgs/finetune_classification/full/finetune_modelnet.yaml \\\n--exp_name \u003coutput_file_name\u003e --ckpts \u003cpath/to/best/fine-tuned/model\u003e\n```\nFor few-shot learning, run:\n```shell\nCUDA_VISIBLE_DEVICES=\u003cGPUs\u003e python main.py --config cfgs/finetune_classification/few_shot/fewshot_modelnet.yaml --finetune_model \\\n--ckpts \u003cpath/to/pre-trained/model\u003e --exp_name \u003coutput_file_name\u003e --way \u003c5 or 10\u003e --shot \u003c10 or 20\u003e --fold \u003c0-9\u003e\n```\nFor semantic segmentation on S3DIS, run:\n```shell\ncd semantic_segmentation\npython main.py --ckpts \u003cpath/to/pre-trained/model\u003e --root path/to/data --learning_rate 0.0002 --epoch 60\n```\n\n## 6. Visualization\nWe use the [PointVisualizaiton](https://github.com/qizekun/PointVisualizaiton) repo to render point cloud images.\nThe figure below shows reconstruction results for synthetic objects from the ShapeNet test set, comparing our ACT autoencoder with the Point-BERT dVAE model:\n\u003cdiv  align=\"center\"\u003e    \n \u003cimg src=\"./figure/visualization.png\" width = \"666\"  align=center /\u003e\n\u003c/div\u003e\n\n\n## License\nACT is released under the MIT License. See the [LICENSE](./LICENSE) file for more details. 
In addition, the licensing information for the `pointnet2` modules is available [here](https://github.com/erikwijmans/Pointnet2_PyTorch/blob/master/UNLICENSE).\n\n## Acknowledgements\nMany thanks to the following codebases, which helped us a lot in building this one:\n* [Point-BERT](https://github.com/lulutang0608/Point-BERT)\n* [Point-MAE](https://github.com/Pang-Yatian/Point-MAE)\n* [VPT](https://github.com/KMnP/vpt)\n* [Pointnet2_PyTorch](https://github.com/erikwijmans/Pointnet2_PyTorch)\n* [PointVisualizaiton](https://github.com/qizekun/PointVisualizaiton)\n\n## Contact\n\nIf you have any questions related to the code or the paper, feel free to email Runpei (`runpei.dong@gmail.com`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!\n\n## Citation\n\nIf you find our work useful in your research, please consider citing ACT:\n```bibtex\n@inproceedings{dong2023act,\n  title={Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?},\n  author={Runpei Dong and Zekun Qi and Linfeng Zhang and Junbo Zhang and Jianjian Sun and Zheng Ge and Li Yi and Kaisheng Ma},\n  booktitle={The Eleventh International Conference on Learning Representations (ICLR) },\n  year={2023},\n  url={https://openreview.net/forum?id=8Oun8ZUVe8N}\n}\n```\nand the closely related works [ReCon](https://github.com/qizekun/ReCon) and [ShapeLLM](https://github.com/qizekun/ShapeLLM):\n```bibtex\n@inproceedings{qi2023recon,\n  title={Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining},\n  author={Qi, Zekun and Dong, Runpei and Fan, Guofan and Ge, Zheng and Zhang, Xiangyu and Ma, Kaisheng and Yi, Li},\n  booktitle={International Conference on Machine Learning (ICML) },\n  year={2023}\n}\n@inproceedings{qi2024shapellm,\n  author = {Qi, Zekun and Dong, Runpei and Zhang, Shaochen and Geng, 
Haoran and Han, Chunrui and Ge, Zheng and Wang, He and Yi, Li and Ma, Kaisheng},\n  title = {ShapeLLM: Universal 3D Object Understanding for Embodied Interaction},\n  booktitle={European Conference on Computer Vision (ECCV) },\n  year = {2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRunpeiDong%2FACT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRunpeiDong%2FACT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRunpeiDong%2FACT/lists"}