{"id":29654558,"url":"https://github.com/tum-vision/scenedino","last_synced_at":"2025-07-22T07:34:57.566Z","repository":{"id":303705523,"uuid":"990645682","full_name":"tum-vision/scenedino","owner":"tum-vision","description":"Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion (ICCV 2025)","archived":false,"fork":false,"pushed_at":"2025-07-09T01:09:58.000Z","size":29890,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-09T02:26:46.012Z","etag":null,"topics":["3d-reconstruction","3d-scene-understanding","3d-semantic-segmentation","occupancy-prediction","segmentation","semantic-scene-completion","single-image-reconstruction","unsupervised-learning","unsupervised-scene-understanding","unsupervised-segmentation"],"latest_commit_sha":null,"homepage":"https://visinf.github.io/scenedino","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tum-vision.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-26T12:24:44.000Z","updated_at":"2025-07-09T02:07:07.000Z","dependencies_parsed_at":"2025-07-09T02:27:01.408Z","dependency_job_id":"3efdd63d-2a7d-460d-a4ad-bf506201fdc7","html_url":"https://github.com/tum-vision/scenedino","commit_stats":null,"previous_names":["tum-vision/scenedino"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tum-vision/scenedino","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tum-vision%2Fscenedino","tags_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/tum-vision%2Fscenedino/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tum-vision%2Fscenedino/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tum-vision%2Fscenedino/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tum-vision","download_url":"https://codeload.github.com/tum-vision/scenedino/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tum-vision%2Fscenedino/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266448575,"owners_count":23930244,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d-reconstruction","3d-scene-understanding","3d-semantic-segmentation","occupancy-prediction","segmentation","semantic-scene-completion","single-image-reconstruction","unsupervised-learning","unsupervised-scene-understanding","unsupervised-segmentation"],"created_at":"2025-07-22T07:34:57.108Z","updated_at":"2025-07-22T07:34:57.557Z","avatar_url":"https://github.com/tum-vision.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003eFeed-Forward \u003ci\u003eSceneDINO\u003c/i\u003e for Unsupervised Semantic Scene 
Completion\u003c/h1\u003e\n\n\n[**Aleksandar Jevtić**](https://jev-aleks.github.io/)\u003csup\u003e* 1\u003c/sup\u003e\n[**Christoph Reich**](https://christophreich1996.github.io/)\u003csup\u003e* 1,2,4,5\u003c/sup\u003e\n[**Felix Wimbauer**](https://fwmb.github.io/)\u003csup\u003e1,4\u003c/sup\u003e\n[**Oliver Hahn**](https://olvrhhn.github.io/)\u003csup\u003e2\u003c/sup\u003e\n[**Christian Rupprecht**](https://chrirupp.github.io/)\u003csup\u003e3\u003c/sup\u003e\n[**Stefan Roth**](https://www.visinf.tu-darmstadt.de/visual_inference/people_vi/stefan_roth.en.jsp)\u003csup\u003e2,5,6\u003c/sup\u003e\n[**Daniel Cremers**](https://cvg.cit.tum.de/members/cremers/)\u003csup\u003e1,4,5\u003c/sup\u003e\n\n\n\u003csup\u003e1\u003c/sup\u003eTU Munich   \u003csup\u003e2\u003c/sup\u003eTU Darmstadt   \u003csup\u003e3\u003c/sup\u003eUniversity of Oxford   \u003csup\u003e4\u003c/sup\u003eMCML   \u003csup\u003e5\u003c/sup\u003eELIZA   \u003csup\u003e6\u003c/sup\u003ehessian.AI   *equal contribution\n\u003ch3\u003eICCV 2025\u003c/h3\u003e\n\n\n\u003ca href=\"https://arxiv.org/abs/2507.06230\"\u003e\u003cimg src='https://img.shields.io/badge/ArXiv-grey' alt='Paper PDF'\u003e\u003c/a\u003e\n\u003ca href=\"https://visinf.github.io/scenedino/\"\u003e\u003cimg src='https://img.shields.io/badge/Project Page-grey' alt='Project Page URL'\u003e\u003c/a\u003e\n\u003ca href=\"https://huggingface.co/spaces/jev-aleks/SceneDINO\"\u003e\u003cimg src='https://img.shields.io/badge/🤗 Demo-grey' alt='Project Page URL'\u003e\u003c/a\u003e\n\n\u003ca href=\"https://opensource.org/licenses/Apache-2.0\"\u003e\u003cimg src='https://img.shields.io/badge/License-Apache%202.0-blue.svg' alt='License'\u003e\u003c/a\u003e\n[![Framework](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?\u0026logo=PyTorch\u0026logoColor=white)](https://pytorch.org/)\n\n\n\u003ccenter\u003e\n    \u003cimg src=\"./assets/scenedino.gif\" width=\"512\"\u003e\n\u003c/center\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\n**TL;DR:** 
SceneDINO is unsupervised and infers 3D geometry and features from a single image in a feed-forward manner. Distilling and clustering SceneDINO's 3D feature field results in unsupervised semantic scene completion predictions. SceneDINO is trained using multi-view self-supervision.\n\n## Abstract\n\nSemantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.\n\n## News\n\n- `09/07/2025`: [ArXiv](https://arxiv.org/abs/2507.06230) preprint and code released. 🚀\n\n## Setup (Installation \u0026 Datasets)\n\n### Python Environment\n\nOur Python environment is managed with **Conda**.\n\n```shell\nconda env create -f environment.yml\nconda activate scenedino\n```\n\n### Datasets\n\nWe provide configuration files for the datasets SceneDINO is trained and evaluated on. 
Adjust these files and, most importantly, insert the data paths you use.\n\n```bash\nconfigs/dataset/kitti_360_sscbench.yaml\nconfigs/dataset/cityscapes_seg.yaml\nconfigs/dataset/bdd_seg.yaml\nconfigs/dataset/realestate10k.yaml\n```\n\n#### KITTI-360\n\nTo download KITTI-360, create an account and follow the instructions on the [official website](https://www.cvlibs.net/datasets/kitti-360/index.php). We require the perspective images, fisheye images, raw Velodyne scans, calibrations, and vehicle poses.\n\n### Checkpoints\n\nOur pre-trained checkpoints are stored in the CVG webshare. Download one of the checkpoints using the dedicated script. To replicate our results using ORB-SLAM3, we provide the obtained poses in `datasets/kitti_360/orb_slam_poses`.\n\n```bash\n# Download best model trained on KITTI-360 (SSCBench split)\npython download_checkpoint.py ssc-kitti-360-dino\npython download_checkpoint.py ssc-kitti-360-dino-orb-slam\npython download_checkpoint.py ssc-kitti-360-dinov2\n```\n\n**Table 1. 
SSCBench-KITTI-360 results.** We compare SceneDINO to the STEGO + S4C baseline in unsupervised SSC using the mean intersection over union score (mIoU) in %.\n\u003ctable\u003e\u003cthead\u003e\n  \u003ctr\u003e\n    \u003cth\u003eMethod\u003c/th\u003e\n    \u003cth\u003eCheckpoint\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003emIoU\u003c/th\u003e\n  \u003c/tr\u003e\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cem\u003e12.8m\u003c/em\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cem\u003e25.6m\u003c/em\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cem\u003e51.2m\u003c/em\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eBaseline\u003c/td\u003e\n    \u003ctd\u003e-\u003c/td\u003e\n    \u003ctd\u003e10.53\u003c/td\u003e\n    \u003ctd\u003e9.26\u003c/td\u003e\n    \u003ctd\u003e6.60\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSceneDINO\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://huggingface.co/jev-aleks/SceneDINO/tree/main/seg-best-dino\"\u003essc-kitti-360-dino\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e10.76\u003c/td\u003e\n    \u003ctd\u003e10.01\u003c/td\u003e\n    \u003ctd\u003e8.00\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSceneDINO (ORB-SLAM3 poses)\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://huggingface.co/jev-aleks/SceneDINO/tree/main/seg-best-dino-orb-slam\"\u003essc-kitti-360-dino-orb-slam\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e10.88\u003c/td\u003e\n    \u003ctd\u003e9.86\u003c/td\u003e\n    \u003ctd\u003e7.88\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSceneDINO (DINOv2)\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://huggingface.co/jev-aleks/SceneDINO/tree/main/seg-best-dinov2\"\u003essc-kitti-360-dinov2\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e13.76\u003c/td\u003e\n    
\u003ctd\u003e11.78\u003c/td\u003e\n    \u003ctd\u003e9.08\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n## Inference Demo Script\n\nThis simple demo script demonstrates loading a model and performing inference in 3D and rendered 2D. It can be used as a starting point to experiment with SceneDINO feature fields.\n\n```bash\npython demo_script.py -h\n\n# First image of kitti-360 test set\npython demo_script.py --ckpt \u003cPATH-MODEL-CKPT\u003e\n# Custom image\npython demo_script.py --ckpt \u003cPATH-MODEL-CKPT\u003e --image \u003cPATH-DEMO-IMAGE\u003e\n```\n\n## Training\n\nFor unsupervised SSC, training is performed in two stages. We provide training configurations in ```configs/``` for each of them. \n\n**SceneDINO**\n\nFirst, the 3D feature fields of SceneDINO are trained. \n\n```bash\npython train.py -cn train_scenedino_kitti_360\n```\n\n**Unsupervised SSC**\n\nBased on a SceneDINO checkpoint, we train the unsupervised SSC head.\n\n```bash\npython train.py -cn train_semantic_kitti_360\n```\n\n**Logging**\n\nWe use TensorBoard to keep track of losses, metrics, and qualitative results.\n\n```bash\ntensorboard --port 8000 --logdir out/\n```\n\n## Evaluation\n\nWe further provide configurations to reproduce the evaluation results from the paper.\n\n**Unsupervised 2D Segmentation**\n\n```bash\n# Unsupervised 2D Segmentation\npython eval.py -cn evaluate_semantic_kitti_360\n```\n\n**Unsupervised SSC**\n\n```bash\n# Unsupervised SSC, adapted from S4C (https://github.com/ahayler/s4c)\npython evaluate_model_sscbench.py -ssc \u003cPATH-SSCBENCH\u003e -vgt \u003cPATH-SSCBENCH-LABELS\u003e -cp \u003cPATH-CHECKPOINT\u003e.pt -f -m scenedino -p \u003cRUN-NAME\u003e\n```\n\n## Citation\n\nIf you find our work useful, please consider citing our paper.\n```\n@inproceedings{Jevtic:2025:SceneDINO,\n    author  = {Aleksandar Jevti{\\'c} and\n               Christoph Reich and\n               Felix Wimbauer and\n               Oliver Hahn and\n         
      Christian Rupprecht and\n               Stefan Roth and\n               Daniel Cremers},\n    title   = {Feed-Forward {SceneDINO} for Unsupervised Semantic Scene Completion},\n    booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},\n    year    = {2025},\n}\n```\n\n## Acknowledgements\n\nThis repository is based on the [Behind The Scenes (BTS)](https://github.com/Brummi/BehindTheScenes) code base.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftum-vision%2Fscenedino","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftum-vision%2Fscenedino","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftum-vision%2Fscenedino/lists"}