https://github.com/naver/dune
Code repository for "DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers"
https://github.com/naver/dune
computer-vision foundation-model image-encoder knowledge-distillation vision-transformer
Last synced: 2 months ago
JSON representation
Code repository for "DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers"
- Host: GitHub
- URL: https://github.com/naver/dune
- Owner: naver
- Created: 2025-05-20T06:06:42.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-10-28T10:27:31.000Z (8 months ago)
- Last Synced: 2026-02-15T01:50:01.884Z (4 months ago)
- Topics: computer-vision, foundation-model, image-encoder, knowledge-distillation, vision-transformer
- Language: Python
- Homepage: https://europe.naverlabs.com/dune
- Size: 1.9 MB
- Stars: 78
- Watchers: 5
- Forks: 6
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers
[**Mert Bulent Sariyildiz**](https://mbsariyildiz.github.io/) · [**Philippe Weinzaepfel**](https://europe.naverlabs.com/people_user_naverlabs/philippe-weinzaepfel/) · [**Thomas Lucas**](https://europe.naverlabs.com/people_user_naverlabs/thomas-lucas/) · [**Pau de Jorge**](https://europe.naverlabs.com/people_user_naverlabs/pau-de-jorge/) · [**Diane Larlus**](https://europe.naverlabs.com/people_user_naverlabs/diane-larlus/) · [**Yannis Kalantidis**](https://europe.naverlabs.com/people_user_naverlabs/yannis-kalantidis/)
NAVER LABS Europe
**CVPR 2025**
[[ArXiv](https://arxiv.org/abs/2503.14405)] · [[Citation](#citation)]
# Easy navigation
- [Summary](#summary)
- [Pre-trained models](#pre-trained-models)
- [Installation](#installation)
- [Conda environment](#conda-environment)
- [Teacher models](#teacher-models)
- [Distillation datasets](#distillation-datasets)
- [Table of datasets](#table-of-datasets)
- [Training models](#training-models)
- [Evaluating downstream tasks](#evaluating-downstream-tasks)
- [Fine-tuned MASt3R decoders](#fine-tuned-mast3r-decoders)
- [PCA visualization](#pca-visualization)
- [Citation](#citation)
# Summary
DUNE is a vision encoder trained via multi-teacher distillation. Specifically, DUNE encoders are distilled using three heterogeneous pre-trained vision models as teachers: [DINOv2](https://github.com/facebookresearch/dinov2/), [MASt3R](https://github.com/naver/mast3r) and [Multi-HMR](https://github.com/naver/multi-hmr). We use [19 datasets](https://oss.navercorp.com/bulent-sariyildiz/dune#distillation-datasets) for distillation, covering the visual domains of all three teachers and comprising approximately 20.7 million images in total. The full list of datasets is provided [below](https://oss.navercorp.com/bulent-sariyildiz/dune#distillation-datasets) and also provided in Table 5 of the paper. For all teachers, we used their publicly available ViT-Large checkpoints.
By using DUNE, you can achieve strong performance on a range of 2D and 3D downstream tasks, like monocular depth, semantic segmentation, multi-view depth estimation, multi-human mesh recovery, multi-view pose regression and 3D reconstruction. Notably, MASt3R model with a DUNE encoder achieves [a new state-of-the art performance in Map-free Visual Relocalization](https://research.nianticlabs.com/mapfree-reloc-benchmark/leaderboard?t=single&f=all), improving over the original MASt3R with a much smaller encoder.

# Pre-trained models
| Architecture | Resolution | Checkpoint | Sem.Seg. ADE20K | Sem.Seg. CityScapes | Sem.Seg. NYU | Sem.Seg. Scannet | Depth NYUd | BEDLAM (PA-PVE) | MapFree (AUC) |
|------------------------|------------|------------------------------------------------------------------------------------------------------------------|------------------|----------------------|----------------|-------------------|-------------|----------------|----------------|
| ViT-Base/14 (420MB) | 336 | [dune_vitbase14_336.pth](https://download.europe.naverlabs.com/dune/dune_vitbase14_336.pth) | 45.0 | 69.3 | 66.9 | 64.6 | 0.384 | 64.3 | 94.1
| ViT-Base/14 | 448 | [dune_vitbase14_448.pth](https://download.europe.naverlabs.com/dune/dune_vitbase14_448.pth) | **46.2** | **71.3** | **68.3** | **65.4** | 0.365 | 60.1 | 94.2
| ViT-Base/14* | 448 | [dune_vitbase14_448_paper.pth](https://download.europe.naverlabs.com/dune/dune_vitbase14_448_paper.pth) | 45.6 | 70.6 | 68.2 | 65.2 | **0.358** | **56.0** | **94.7** |
* *Model reported in the paper and trained using an earlier (internal) version of this codebase.
| Architecture | Resolution | Checkpoint | Sem.Seg. ADE20K | Sem.Seg. CityScapes | Sem.Seg. NYU | Sem.Seg. Scannet | Depth NYUd | BEDLAM (PA-PVE) | MapFree (AUC) |
|------------------------|------------|------------------------------------------------------------------------------------------------------------------|------------------|----------------------|----------------|-------------------|-------------|----------------|----------------|
| ViT-Small/14 (110MB) | 336 | [dune_vitsmall14_336.pth](https://download.europe.naverlabs.com/dune/dune_vitsmall14_336.pth) | 39.6 | 61.7 | 63.5 | 60.1 | 0.424 | 74.7 | _WIP_ |
| ViT-Small/14 | 448 | [dune_vitsmall14_448.pth](https://download.europe.naverlabs.com/dune/dune_vitsmall14_448.pth) | 41.4 | 63.7 | 65.5 | 61.2 | 0.404 | 69.0 | 94.5 |
- WIP: Work in progress.
- Semantic segmentation results in the table (Sem.Seg.) are obtained after DINOv2 projectors, following the convention in the paper.
To load a pretrained model, you can either clone this repository and download a pre-trained model:
```Python
from model.dune import load_dune_from_checkpoint
model = load_dune_from_checkpoint("./dune_vitbase14_448_paper.pth")
```
or use `torch.hub` directly (see [hubconf.py](hubconf.py) for all available models):
```Python
# full model with projectors and teacher norms
model = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper")
# just the ViT encoder part of the model
model = torch.hub.load("naver/dune", "dune_vitbase_14_448_paper_encoder")
```
# Installation
## Conda environment
- Create a conda environment with all the necessary packages.
```bash
env_name="dune"
conda create -n ${env_name}
conda activate ${env_name}
conda install python=3.12
pip install -U torch=='2.7.0' torchvision torchfix timm 'huggingface_hub>=0.22' transformers accelerate einops torchmetrics optuna tensorboard matplotlib pandas jaxtyping scikit-learn-intelex omegaconf opencv-python ipython black flake8 pylint rich ipykernel
```
- Set the path of your conda in [scripts/setup_env.sh](./scripts/setup_env.sh), i.e. update the `conda_dir` variable.
Then your environment will be automatically used by the training script.
## Teacher models
- To download the teacher models we used in this work, you can check the bash scripts under the [scripts/teachers](./scripts/teachers) folder.
To download all teachers at once, use [scripts/teachers/prepare_all.sh](./scripts/teachers/prepare_all.sh):
```bash
# BEFORE EXECUTING THIS COMMAND, MAKE SURE TO SEE THE CONTENTS OF THE SCRIPTS!
(cd scripts/teachers && ./prepare_all.sh )
```
- Once teacher checkpoints are downloaded, update the `ckpt_path` keys in the `TEACHER_CFG` dictionary in [teachers/config.py](teachers/config.py) to point to the correct paths.
For MASt3R, the preparation script mentioned above will additionally clone the MASt3R repository.
You also need to set the `code_dir` in `TEACHER_CFG` key to point to the directory where this MASt3R repo is located.
## Distillation datasets
We train DUNE models on the combination of 19 datasets with roughly 20.7M images.
The full list of datasets is available below, and in Table 5 of the paper.
We provide the dataloaders for these datasets in [data/dino2.py](data/dino2.py), [data/mast3r.py](data/mast3r.py) and [data/multihmr.py](data/multihmr.py) for details.
However, we leave the downloading and preprocessing of the datasets to the user.
Once you have the datasets, set their paths in [data/paths.py](data/paths.py)
If downloading the 19 datasets is too cumbersome, it is also possible to train DUNE on ImageNet-1K only.
To do that, set the `IN1K_DIRS` variable in [data/paths.py](data/paths.py) to the path of your ImageNet-1K.
### Table of datasets
| Name | Size | Nature |
|-------------------|------------|------------|
| ImageNet-19K | 13,153,480 | Real |
| Mapillary | 1,205,907 | Real |
| Google Landmarks v2 | 4,132,914 | Real |
| Habitat | 284,968 | Rendered |
| ARKitScenes | 456,108 | Rendered |
| Blended MVS | 98,937 | Rendered |
| MegaDepth | 36,949 | Real |
| ScanNet++ | 60,188 | Rendered |
| CO3D-v2 | 185,100 | Real |
| Map-free | 41,300 | Real |
| WildRgb | 224,400 | Real |
| VirtualKitti | 1,200 | Synthetic |
| Unreal4K | 14,386 | Synthetic |
| TartanAir | 136,225 | Real |
| DL3DV | 208,800 | Rendered |
| BEDLAM | 353,118 | Synthetic |
| AGORA | 14,314 | Synthetic |
| CUFFS | 54,944 | Synthetic |
| UBody | 54,234 | Real |
# Training models
DUNE follows a two-stage training: Initial pre-training at resolution 336 for 100 "epochs", and fine-tuning at resolution 448 for 50 epochs.
We define an epoch by 1281167 images, as the size of the ImageNet-1K dataset.
```bash
# Pre-training at resolution 336
# To distill only on ImageNet-1K, pass --dataset="in1k" to the script
output_dir_pretrain="/path/to/dune/pretrain/dir/"
bash ./scripts/train.sh ${output_dir_pretrain}
# Fine-tuning at resolution 448
# Adjust batch size according to your GPU memory
output_dir_finetune="/path/to/dune/finetune/dir/"
bash ./scripts/train.sh ${output_dir_finetune} \
--pretrained=${output_dir_pretrain}/checkpoint.pth \
--image_size=448 \
--lr=5e-5 \
--epochs=50 \
--batch_size_per_gpu=128
```
# Evaluating downstream tasks
After training the DUNE encoder, we freeze it and fine-tune the decoder (from the corresponding teacher) for each downstream task.
Please see the Appendix B of the paper for the details on fine-tuning.
We do not provide in the codebase the evaluation scripts for the downstream task and kindly ask you to refer to the original repositories:
- https://github.com/facebookresearch/dinov2
- https://github.com/naver/mast3r (see below for the fine-tuned MASt3R decoders)
- https://github.com/naver/multi-hmr
## Fine-tuned MASt3R decoders
We provide the fine-tuned MASt3R decoders for the DUNE encoders with resolution 448:
- [DUNE-MASt3R ViT-Base](https://download.europe.naverlabs.com/dune/dunemast3r_cvpr25_vitbase.pth)
- [DUNE-MASt3R ViT-Small](https://download.europe.naverlabs.com/dune/dunemast3r_cvpr25_vitsmall.pth)
Please see [the relevant section in the MASt3R repository](https://github.com/naver/mast3r?tab=readme-ov-file#usage-dunemast3r) for using these decoders in MASt3R evaluations.
# PCA visualization
We provide an example script [scripts/pca_vis.py](scripts/pca_vis.py), which shows how to load the encoder part of the DUNE model and visualize the PCA output of its patch features.
To execute this script:
```bash
PYTHONPATH=${PWD}:${PYTHONPATH} python scripts/pca_vis.py
```
This will generate a PCA visualization of the patch features of the best DUNE model reported in the paper on a test image.

# Citation
If you find this repository useful, please consider citing us:
```LaTeX
@inproceedings{sariyildiz2025dune,
title={{DUNE}: Distilling a Universal Encoder from Heterogeneous {2D} and {3D} Teachers},
author={Sariyildiz, Mert Bulent and Weinzaepfel, Philippe and Lucas, Thomas and De Jorge, Pau and Larlus, Diane and Kalantidis, Yannis},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025},
}
```