{"id":27064875,"url":"https://github.com/3dlg-hcvc/m3dref-clip","last_synced_at":"2025-08-04T08:37:32.164Z","repository":{"id":173290859,"uuid":"647987006","full_name":"3dlg-hcvc/M3DRef-CLIP","owner":"3dlg-hcvc","description":"[ICCV 2023] Multi3DRefer: Grounding Text Description to Multiple 3D Objects","archived":false,"fork":false,"pushed_at":"2024-01-26T19:13:55.000Z","size":1597,"stargazers_count":51,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-04-20T20:53:31.299Z","etag":null,"topics":["3d","clip","computer-vision","cuda","deep-learning","localization","pytorch","pytorch-lightning","transformer","visual-grounding"],"latest_commit_sha":null,"homepage":"https://3dlg-hcvc.github.io/multi3drefer/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/3dlg-hcvc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-06-01T00:52:21.000Z","updated_at":"2024-04-09T10:59:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"e143afad-e7c7-4e13-ae1f-0001fc8cd2c8","html_url":"https://github.com/3dlg-hcvc/M3DRef-CLIP","commit_stats":null,"previous_names":["3dlg-hcvc/m3dref-clip"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/3dlg-hcvc%2FM3DRef-CLIP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/3dlg-hcvc%2FM3DRef-CLIP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/3dlg-hcvc%2FM3DRef-CLIP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/3dlg-hcvc%2FM3DRef-
CLIP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/3dlg-hcvc","download_url":"https://codeload.github.com/3dlg-hcvc/M3DRef-CLIP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247370233,"owners_count":20927979,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d","clip","computer-vision","cuda","deep-learning","localization","pytorch","pytorch-lightning","transformer","visual-grounding"],"created_at":"2025-04-05T17:19:29.785Z","updated_at":"2025-04-05T17:19:30.274Z","avatar_url":"https://github.com/3dlg-hcvc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# M3DRef-CLIP\n\n\u003ca href=\"https://pytorch.org/\"\u003e\u003cimg alt=\"PyTorch\" src=\"https://img.shields.io/badge/PyTorch-EE4C2C?style=for-the-badge\u0026logo=pytorch\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pytorchlightning.ai/\"\u003e\u003cimg alt=\"Lightning\" src=\"https://img.shields.io/badge/Lightning-792DE4?style=for-the-badge\u0026logo=pytorch-lightning\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003ca href=\"https://wandb.ai/site\"\u003e\u003cimg alt=\"WandB\" src=\"https://img.shields.io/badge/Weights_\u0026_Biases-FFBE00?style=for-the-badge\u0026logo=WeightsAndBiases\u0026logoColor=white\"\u003e\u003c/a\u003e\n\nThis is the official implementation for [Multi3DRefer: Grounding Text Description to Multiple 3D Objects](https://3dlg-hcvc.github.io/multi3drefer/).\n\n![Model Architecture](./docs/img/model_arch.jpg)\n\n## Requirement\nThis repo 
contains a [CUDA](https://developer.nvidia.com/cuda-zone) implementation; please make sure your [GPU compute capability](https://developer.nvidia.com/cuda-gpus) is at least 3.0.\n\nWe report the maximum computing resource usage with batch size 4:\n\n|               | Training | Inference |\n|:--------------|:---------|:----------|\n| GPU mem usage | 15.2 GB  | 11.3 GB   |\n\n\n## Setup\n### Conda (recommended)\nWe recommend the use of [miniconda](https://docs.conda.io/en/latest/miniconda.html) to manage system dependencies.\n\n```shell\n# create and activate the conda environment\nconda create -n m3drefclip python=3.10\nconda activate m3drefclip\n\n# install PyTorch 2.0.1\nconda install pytorch torchvision pytorch-cuda=11.7 -c pytorch -c nvidia\n\n# install PyTorch3D with dependencies\nconda install -c fvcore -c iopath -c conda-forge fvcore iopath\nconda install pytorch3d -c pytorch3d\n\n# install MinkowskiEngine with dependencies\nconda install -c anaconda openblas\npip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \\\n--install-option=\"--blas_include_dirs=${CONDA_PREFIX}/include\" --install-option=\"--blas=openblas\"\n\n# install Python libraries\npip install .\n\n# install CUDA extensions\ncd m3drefclip/common_ops\npip install .\n```\n\n### Pip\nNote: Setting up with pip (no conda) requires [OpenBLAS](https://github.com/xianyi/OpenBLAS) to be pre-installed on your system.\n```shell\n# create and activate the virtual environment\nvirtualenv env\nsource env/bin/activate\n\n# install PyTorch 2.0.1\npip install torch torchvision\n\n# install PyTorch3D\npip install pytorch3d\n\n# install MinkowskiEngine\npip install MinkowskiEngine\n\n# install Python libraries\npip install .\n\n# install CUDA extensions\ncd m3drefclip/common_ops\npip install .\n```\n\n## Data Preparation\nNote: Both [ScanRefer](https://daveredrum.github.io/ScanRefer/) and [Nr3D](https://referit3d.github.io/) datasets require the [ScanNet v2](http://www.scan-net.org/) 
dataset. Please preprocess it first.\n\n### ScanNet v2 dataset\n1. Download the [ScanNet v2 dataset (train/val/test)](http://www.scan-net.org/) (refer to [ScanNet's instructions](dataset/scannetv2/README.md) for more details). The raw dataset files should be organized as follows:\n    ```shell\n    m3drefclip # project root\n    ├── dataset\n    │   ├── scannetv2\n    │   │   ├── scans\n    │   │   │   ├── [scene_id]\n    │   │   │   │   ├── [scene_id]_vh_clean_2.ply\n    │   │   │   │   ├── [scene_id]_vh_clean_2.0.010000.segs.json\n    │   │   │   │   ├── [scene_id].aggregation.json\n    │   │   │   │   ├── [scene_id].txt\n    ```\n\n2. Pre-process the data. This converts the original meshes and annotations to `.pth` files:\n    ```shell\n    python dataset/scannetv2/preprocess_all_data.py data=scannetv2 +workers={cpu_count}\n    ```\n\n3. Pre-process the multiview features from ENet: Please refer to the instructions in [ScanRefer's repo](https://github.com/daveredrum/ScanRefer#data-preparation) with one modification:\n   - comment out lines 51 to 56 in [batch_load_scannet_data.py](https://github.com/daveredrum/ScanRefer/blob/master/data/scannet/batch_load_scannet_data.py#L51-L56), since we follow D3Net's setting, which does not downsample points here.\n\n   Then put the generated `enet_feats_maxpool.hdf5` (116GB) under `m3drefclip/dataset/scannetv2`.\n\n### ScanRefer dataset\n1. Download the [ScanRefer dataset (train/val)](https://daveredrum.github.io/ScanRefer/). Also, download the [test set](http://kaldir.vc.in.tum.de/scanrefer_benchmark_data.zip). The raw dataset files should be organized as follows:\n    ```shell\n    m3drefclip # project root\n    ├── dataset\n    │   ├── scanrefer\n    │   │   ├── metadata\n    │   │   │   ├── ScanRefer_filtered_train.json\n    │   │   │   ├── ScanRefer_filtered_val.json\n    │   │   │   ├── ScanRefer_filtered_test.json\n    ```\n\n2. 
Pre-process the data; \"unique/multiple\" labels will be added to the raw `.json` files for evaluation purposes:\n    ```shell\n    python dataset/scanrefer/add_evaluation_labels.py data=scanrefer\n    ```\n\n### Nr3D dataset\n1. Download the [Nr3D dataset (train/test)](https://referit3d.github.io/benchmarks.html). The raw dataset files should be organized as follows:\n    ```shell\n    m3drefclip # project root\n    ├── dataset\n    │   ├── nr3d\n    │   │   ├── metadata\n    │   │   │   ├── nr3d_train.csv\n    │   │   │   ├── nr3d_test.csv\n    ```\n\n2. Pre-process the data; \"easy/hard/view-dep/view-indep\" labels will be added to the raw `.csv` files for evaluation purposes:\n    ```shell\n    python dataset/nr3d/add_evaluation_labels.py data=nr3d\n    ```\n\n### Multi3DRefer dataset\n1. Download the [Multi3DRefer dataset (train/val)](https://aspis.cmpt.sfu.ca/projects/multi3drefer/data/multi3drefer_train_val.zip). The raw dataset files should be organized as follows:\n    ```shell\n    m3drefclip # project root\n    ├── dataset\n    │   ├── multi3drefer\n    │   │   ├── metadata\n    │   │   │   ├── multi3drefer_train.json\n    │   │   │   ├── multi3drefer_val.json\n    ```\n\n### Pre-trained detector\nWe pre-trained [PointGroup](https://arxiv.org/abs/2004.01658), as implemented in [MINSU3D](https://github.com/3dlg-hcvc/minsu3d/), on [ScanNet v2](http://www.scan-net.org/) and use it as the detector. We use coordinates + colors + multi-view features as inputs.\n1. Download the [pre-trained detector](https://aspis.cmpt.sfu.ca/projects/m3dref-clip/pretrain/PointGroup_ScanNet.ckpt). 
The detector checkpoint file should be organized as follows:\n    ```shell\n    m3drefclip # project root\n    ├── checkpoints\n    │   ├── PointGroup_ScanNet.ckpt\n    ```\n\n## Training, Inference and Evaluation\nNote: Configuration files are managed by [Hydra](https://hydra.cc/); you can add or override any configuration attribute by passing it as a command-line argument.\n```shell\n# log in to WandB\nwandb login\n\n# train a model with the pre-trained detector, using predicted object proposals\npython train.py data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt\n\n# train a model with the pre-trained detector, using GT object proposals\npython train.py data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt model.network.detector.use_gt_proposal=True\n\n# train a model from a checkpoint; this restores all hyperparameters in the .ckpt file\npython train.py data={scanrefer/nr3d/multi3drefer} experiment_name={checkpoint_experiment_name} ckpt_path={ckpt_file_path}\n\n# test a model from a checkpoint and save its predictions\npython test.py data={scanrefer/nr3d/multi3drefer} data.inference.split={train/val/test} ckpt_path={ckpt_file_path} pred_path={predictions_path}\n\n# evaluate predictions\npython evaluate.py data={scanrefer/nr3d/multi3drefer} pred_path={predictions_path} data.evaluation.split={train/val/test}\n```\n## Checkpoints\n### ScanRefer dataset\n[M3DRef-CLIP_ScanRefer.ckpt](https://aspis.cmpt.sfu.ca/projects/m3dref-clip/pretrain/M3DRef-CLIP_ScanRefer.ckpt)\n\nPerformance:\n\n| Split | IoU  | Unique | Multiple | Overall |\n|:------|:-----|:-------|:---------|:--------|\n| Val   | 0.25 | 85.3   | 43.8     | 51.9    |\n| Val   | 0.5  | 77.2   | 36.8     | 44.7    |\n| Test  | 0.25 | 79.8   | 46.9     | 54.3    |\n| Test  | 0.5  | 70.9   | 38.1     | 45.5    |\n\n### Nr3D 
dataset\n[M3DRef-CLIP_Nr3D.ckpt](https://aspis.cmpt.sfu.ca/projects/m3dref-clip/pretrain/M3DRef-CLIP_Nr3D.ckpt)\n\nPerformance:\n\n| Split | Easy | Hard | View-dep | View-indep | Overall |\n|:------|:-----|:-----|:---------|:-----------|:--------|\n| Test  | 55.6 | 43.4 | 42.3     | 52.9       | 49.4    |\n\n### Multi3DRefer dataset\n[M3DRef-CLIP_Multi3DRefer.ckpt](https://aspis.cmpt.sfu.ca/projects/m3dref-clip/pretrain/M3DRef-CLIP_Multi3DRefer.ckpt)\n\nPerformance:\n\n| Split | IoU  | ZT w/ D | ZT w/o D | ST w/ D | ST w/o D | MT   | Overall |\n|:------|:-----|:--------|:---------|:--------|:---------|:-----|:--------|\n| Val   | 0.25 | 39.4    | 81.8     | 34.6    | 53.5     | 43.6 | 42.8    |\n| Val   | 0.5  | 39.4    | 81.8     | 30.6    | 47.8     | 37.9 | 38.4    |\n\n\n## Benchmark\n### ScanRefer\nConvert M3DRef-CLIP predictions to the [ScanRefer benchmark format](https://kaldir.vc.in.tum.de/scanrefer_benchmark/documentation):\n```shell\npython dataset/scanrefer/convert_output_to_benchmark_format.py data=scanrefer pred_path={predictions_path} +output_path={output_file_path}\n```\n### Nr3D\nPlease refer to the [ReferIt3D benchmark](https://referit3d.github.io/benchmarks.html) to report results.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F3dlg-hcvc%2Fm3dref-clip","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F3dlg-hcvc%2Fm3dref-clip","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F3dlg-hcvc%2Fm3dref-clip/lists"}