
https://github.com/tum-vision/scenedino

Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion (ICCV 2025)

3d-reconstruction 3d-scene-understanding 3d-semantic-segmentation occupancy-prediction segmentation semantic-scene-completion single-image-reconstruction unsupervised-learning unsupervised-scene-understanding unsupervised-segmentation



# Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

[**Aleksandar Jevtić**](https://jev-aleks.github.io/)<sup>*1</sup>
[**Christoph Reich**](https://christophreich1996.github.io/)<sup>*1,2,4,5</sup>
[**Felix Wimbauer**](https://fwmb.github.io/)<sup>1,4</sup>
[**Oliver Hahn**](https://olvrhhn.github.io/)<sup>2</sup>
[**Christian Rupprecht**](https://chrirupp.github.io/)<sup>3</sup>
[**Stefan Roth**](https://www.visinf.tu-darmstadt.de/visual_inference/people_vi/stefan_roth.en.jsp)<sup>2,5,6</sup>
[**Daniel Cremers**](https://cvg.cit.tum.de/members/cremers/)<sup>1,4,5</sup>

<sup>1</sup>TU Munich <sup>2</sup>TU Darmstadt <sup>3</sup>University of Oxford <sup>4</sup>MCML <sup>5</sup>ELIZA <sup>6</sup>hessian.AI <sup>*</sup>equal contribution

ICCV 2025

[![Framework](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?&logo=PyTorch&logoColor=white)](https://pytorch.org/)



**TL;DR:** SceneDINO is unsupervised and infers 3D geometry and features from a single image in a feed-forward manner. Distilling and clustering SceneDINO's 3D feature field results in unsupervised semantic scene completion predictions. SceneDINO is trained using multi-view self-supervision.
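
To make the clustering step concrete, here is a minimal sketch of how per-point features can be grouped into pseudo-classes. It uses plain k-means on toy 2D vectors as a stand-in. The function `kmeans`, its deterministic initialization, and the toy data are illustrative assumptions, not the distillation and clustering procedure implemented in this repository.

```python
def kmeans(features, k, iters=20):
    """Plain k-means with squared Euclidean distance.

    Returns (centroids, labels). Illustrative only: assumes len(features) >= k.
    """
    # Deterministic initialization: take k evenly spaced samples as centroids.
    step = len(features) // k
    centroids = [list(features[i * step]) for i in range(k)]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: each feature goes to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(f, centroids[c])),
            )
            for f in features
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Toy "feature field": six 2D feature vectors forming two obvious groups.
features = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
            (1.0, 0.9), (0.9, 1.0), (0.95, 0.95)]
centroids, labels = kmeans(features, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Cluster indices produced this way are arbitrary and carry no class names.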

## Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.

## News

- `09/07/2025`: [ArXiv](https://arxiv.org/abs/2507.06230) preprint and code released. 🚀

## Setup (Installation & Datasets)

### Python Environment

Our Python environment is managed with **Conda**.

```shell
conda env create -f environment.yml
conda activate scenedino
```

### Datasets

We provide configuration files for the datasets SceneDINO is trained and evaluated on. Adjust these files and, most importantly, insert the data paths you use.

```bash
configs/dataset/kitti_360_sscbench.yaml
configs/dataset/cityscapes_seg.yaml
configs/dataset/bdd_seg.yaml
configs/dataset/realestate10k.yaml
```
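
Before editing the data paths, it can help to confirm that the four configuration files listed above are present in your checkout. The following is a small sketch of such a check. The helper `missing_configs` and the assumption that it is run from the repository root are ours, not part of the repository.

```python
from pathlib import Path

# Dataset configuration files named in this README.
DATASET_CONFIGS = [
    "configs/dataset/kitti_360_sscbench.yaml",
    "configs/dataset/cityscapes_seg.yaml",
    "configs/dataset/bdd_seg.yaml",
    "configs/dataset/realestate10k.yaml",
]


def missing_configs(repo_root="."):
    """Return the listed config files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in DATASET_CONFIGS if not (root / rel).is_file()]


# Assumes the current working directory is the repository root.
for rel in missing_configs():
    print(f"missing: {rel}")
```

The check only verifies that the files exist; the data paths inside each file still have to be set by hand.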

#### KITTI-360

To download KITTI-360, create an account and follow the instructions on the [official website](https://www.cvlibs.net/datasets/kitti-360/index.php). We require the perspective images, fisheye images, raw Velodyne scans, calibrations, and vehicle poses.

### Checkpoints

Our pre-trained checkpoints are stored in the CVG webshare. Download one of the checkpoints using the dedicated script. To replicate our results using ORB-SLAM3, we provide the obtained poses in `datasets/kitti_360/orb_slam_poses`.

```bash
# Download best model trained on KITTI-360 (SSCBench split)
python download_checkpoint.py ssc-kitti-360-dino
# Variant trained with ORB-SLAM3 poses
python download_checkpoint.py ssc-kitti-360-dino-orb-slam
# Variant using DINOv2
python download_checkpoint.py ssc-kitti-360-dinov2
```

**Table 1. SSCBench-KITTI-360 results.** We compare SceneDINO to the STEGO + S4C baseline in unsupervised SSC using the mean intersection over union score (mIoU) in %.


| Method | Checkpoint | mIoU (12.8m) | mIoU (25.6m) | mIoU (51.2m) |
|---|---|---|---|---|
| Baseline | - | 10.53 | 9.26 | 6.60 |
| SceneDINO | `ssc-kitti-360-dino` | 10.76 | 10.01 | 8.00 |
| SceneDINO (ORB-SLAM3 poses) | `ssc-kitti-360-dino-orb-slam` | 10.88 | 9.86 | 7.88 |
| SceneDINO (DINOv2) | `ssc-kitti-360-dinov2` | 13.76 | 11.78 | 9.08 |
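
For readers less familiar with the metric, the sketch below shows how a mean intersection over union score in % can be computed from predicted and ground-truth class indices. It is a generic illustration on flat label lists. The function name `mean_iou`, the `ignore_index` value of 255, and the toy labels are assumptions, and the repository's evaluation code may differ in details such as class handling and range cropping.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU in %, averaged over classes that occur in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip unlabeled elements (assumed convention)
        if p == t:
            # True positive: counts toward intersection and union of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # False positive for class p, false negative for class t.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return 100.0 * sum(ious) / len(ious) if ious else 0.0


# Toy example: six voxels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(f"mIoU: {mean_iou(pred, target, num_classes=3):.2f}%")  # mIoU: 50.00%
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, which average to 50%.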

## Inference Demo Script

This simple demo script demonstrates loading a model and performing inference in 3D and rendered 2D. It can be used as a starting point to experiment with SceneDINO feature fields.

```bash
# Show all available options
python demo_script.py -h

# First image of the KITTI-360 test set
python demo_script.py --ckpt <CHECKPOINT>
# Custom image
python demo_script.py --ckpt <CHECKPOINT> --image <IMAGE-PATH>
```

## Training

For unsupervised SSC, training is performed in two stages. We provide training configurations in `configs/` for each stage.

**SceneDINO**

First, the 3D feature fields of SceneDINO are trained.

```bash
python train.py -cn train_scenedino_kitti_360
```

**Unsupervised SSC**

Based on a SceneDINO checkpoint, we train the unsupervised SSC head.

```bash
python train.py -cn train_semantic_kitti_360
```
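
If you prefer to launch both stages from one script, the two commands above can be chained so that the second stage starts only after the first one exits successfully. This is a minimal sketch. The helpers `stage_command` and `run_stages` are hypothetical, and how the second-stage configuration locates the first-stage checkpoint is not handled here.

```python
import subprocess
import sys

# Stage labels are descriptive only; config names are taken from the commands above.
STAGES = [
    ("SceneDINO feature field", "train_scenedino_kitti_360"),
    ("Unsupervised SSC head", "train_semantic_kitti_360"),
]


def stage_command(config_name):
    """Build the training command for one stage."""
    return [sys.executable, "train.py", "-cn", config_name]


def run_stages(runner=subprocess.run):
    """Run both stages in order; check=True aborts on the first failure."""
    for label, config_name in STAGES:
        print(f"Starting stage: {label}")
        runner(stage_command(config_name), check=True)


# Call run_stages() from the repository root to start training.
```

The `runner` argument exists only so that the command construction can be exercised without starting a real training run.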

**Logging**

We use TensorBoard to keep track of losses, metrics, and qualitative results.

```bash
tensorboard --port 8000 --logdir out/
```

## Evaluation

We further provide configurations to reproduce the evaluation results from the paper.

**Unsupervised 2D Segmentation**

```bash
# Unsupervised 2D Segmentation
python eval.py -cn evaluate_semantic_kitti_360
```

**Unsupervised SSC**

```bash
# Unsupervised SSC, adapted from S4C (https://github.com/ahayler/s4c)
python evaluate_model_sscbench.py -ssc -vgt -cp .pt -f -m scenedino -p
```

## Citation

If you find our work useful, please consider citing our paper.
```bibtex
@inproceedings{Jevtic:2025:SceneDINO,
    author    = {Aleksandar Jevti{\'c} and
                 Christoph Reich and
                 Felix Wimbauer and
                 Oliver Hahn and
                 Christian Rupprecht and
                 Stefan Roth and
                 Daniel Cremers},
    title     = {Feed-Forward {SceneDINO} for Unsupervised Semantic Scene Completion},
    booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2025},
}
```

## Acknowledgements

This repository is based on the [Behind The Scenes (BTS)](https://github.com/Brummi/BehindTheScenes) code base.