{"id":47916136,"url":"https://github.com/naver/dune","last_synced_at":"2026-04-04T05:37:50.394Z","repository":{"id":299033356,"uuid":"986789420","full_name":"naver/dune","owner":"naver","description":"Code repository for \"DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers\"","archived":false,"fork":false,"pushed_at":"2025-10-28T10:27:31.000Z","size":1993,"stargazers_count":78,"open_issues_count":1,"forks_count":6,"subscribers_count":5,"default_branch":"main","last_synced_at":"2026-02-15T01:50:01.884Z","etag":null,"topics":["computer-vision","foundation-model","image-encoder","knowledge-distillation","vision-transformer"],"latest_commit_sha":null,"homepage":"https://europe.naverlabs.com/dune","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/naver.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-20T06:06:42.000Z","updated_at":"2026-02-14T05:16:42.000Z","dependencies_parsed_at":"2025-06-14T09:32:20.682Z","dependency_job_id":null,"html_url":"https://github.com/naver/dune","commit_stats":null,"previous_names":["naver/dune"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/naver/dune","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naver%2Fdune","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naver%2Fdune/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naver%2Fdune/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naver%2Fdune/manife
sts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/naver","download_url":"https://codeload.github.com/naver/dune/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naver%2Fdune/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31389391,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T04:26:24.776Z","status":"ssl_error","status_checked_at":"2026-04-04T04:23:34.147Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","foundation-model","image-encoder","knowledge-distillation","vision-transformer"],"created_at":"2026-04-04T05:37:49.880Z","updated_at":"2026-04-04T05:37:50.379Z","avatar_url":"https://github.com/naver.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003eDUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers\u003c/h1\u003e\n\n[**Mert Bulent Sariyildiz**](https://mbsariyildiz.github.io/) · [**Philippe Weinzaepfel**](https://europe.naverlabs.com/people_user_naverlabs/philippe-weinzaepfel/) · [**Thomas Lucas**](https://europe.naverlabs.com/people_user_naverlabs/thomas-lucas/) · [**Pau de Jorge**](https://europe.naverlabs.com/people_user_naverlabs/pau-de-jorge/) · [**Diane 
Larlus**](https://europe.naverlabs.com/people_user_naverlabs/diane-larlus/) · [**Yannis Kalantidis**](https://europe.naverlabs.com/people_user_naverlabs/yannis-kalantidis/)\n\nNAVER LABS Europe\n\n**CVPR 2025**\n\n[[ArXiv](https://arxiv.org/abs/2503.14405)] · [[Citation](#citation)]\n\n\u003c/div\u003e\n\n\u003c!-- omit in toc --\u003e\n# Easy navigation\n\n- [Summary](#summary)\n- [Pre-trained models](#pre-trained-models)\n- [Installation](#installation)\n  - [Conda environment](#conda-environment)\n  - [Teacher models](#teacher-models)\n  - [Distillation datasets](#distillation-datasets)\n    - [Table of datasets](#table-of-datasets)\n- [Training models](#training-models)\n- [Evaluating downstream tasks](#evaluating-downstream-tasks)\n  - [Fine-tuned MASt3R decoders](#fine-tuned-mast3r-decoders)\n- [PCA visualization](#pca-visualization)\n- [Citation](#citation)\n\n# Summary\n\nDUNE is a vision encoder trained via multi-teacher distillation. Specifically, DUNE encoders are distilled using three heterogeneous pre-trained vision models as teachers: [DINOv2](https://github.com/facebookresearch/dinov2/), [MASt3R](https://github.com/naver/mast3r) and [Multi-HMR](https://github.com/naver/multi-hmr). We use [19 datasets](https://oss.navercorp.com/bulent-sariyildiz/dune#distillation-datasets) for distillation, covering the visual domains of all three teachers and comprising approximately 20.7 million images in total. The full list of datasets is provided [below](https://oss.navercorp.com/bulent-sariyildiz/dune#distillation-datasets) and in Table 5 of the paper. For all teachers, we use their publicly available ViT-Large checkpoints.\n\nWith DUNE, you can achieve strong performance on a range of 2D and 3D downstream tasks, such as monocular depth estimation, semantic segmentation, multi-view depth estimation, multi-human mesh recovery, multi-view pose regression and 3D reconstruction. 
Notably, a MASt3R model with a DUNE encoder achieves [new state-of-the-art performance in Map-free Visual Relocalization](https://research.nianticlabs.com/mapfree-reloc-benchmark/leaderboard?t=single\u0026f=all), improving over the original MASt3R with a much smaller encoder.\n\n\n![DUNE model overview](./assets/dune.png)\n\n# Pre-trained models\n\n| Architecture           | Resolution | Checkpoint                                                                                                      | Sem.Seg. ADE20K | Sem.Seg. CityScapes | Sem.Seg. NYU | Sem.Seg. Scannet | Depth NYUd | BEDLAM (PA-PVE) | MapFree (AUC) |\n|------------------------|------------|------------------------------------------------------------------------------------------------------------------|------------------|----------------------|----------------|-------------------|-------------|----------------|----------------|\n| ViT-Base/14 (420MB)   | 336        | [dune_vitbase14_336.pth](https://download.europe.naverlabs.com/dune/dune_vitbase14_336.pth)                    | 45.0             | 69.3                 | 66.9           | 64.6              | 0.384       | 64.3                    | 94.1 |\n| ViT-Base/14           | 448        | [dune_vitbase14_448.pth](https://download.europe.naverlabs.com/dune/dune_vitbase14_448.pth)                    | **46.2**             | **71.3**                 | **68.3**          | **65.4**              | 0.365       | 60.1           | 94.2 |\n| ViT-Base/14*           | 448        | [dune_vitbase14_448_paper.pth](https://download.europe.naverlabs.com/dune/dune_vitbase14_448_paper.pth)        | 45.6             | 70.6                 | 68.2           | 65.2              | **0.358**       | **56.0**        | **94.7**           |\n\n- `*`: Model reported in the paper, trained using an earlier (internal) version of this codebase.\n\n| Architecture           | Resolution | Checkpoint                                                                                          
            | Sem.Seg. ADE20K | Sem.Seg. CityScapes | Sem.Seg. NYU | Sem.Seg. Scannet | Depth NYUd | BEDLAM (PA-PVE) | MapFree (AUC) |\n|------------------------|------------|------------------------------------------------------------------------------------------------------------------|------------------|----------------------|----------------|-------------------|-------------|----------------|----------------|\n| ViT-Small/14 (110MB)  | 336        | [dune_vitsmall14_336.pth](https://download.europe.naverlabs.com/dune/dune_vitsmall14_336.pth)                  | 39.6             | 61.7                 | 63.5           | 60.1              | 0.424       | 74.7              | _WIP_               |\n| ViT-Small/14          | 448        | [dune_vitsmall14_448.pth](https://download.europe.naverlabs.com/dune/dune_vitsmall14_448.pth)                  | 41.4             | 63.7                 | 65.5           | 61.2              | 0.404       | 69.0              | 94.5                |\n\n- WIP: Work in progress.\n- Semantic segmentation results in the table (Sem.Seg.) 
are obtained after DINOv2 projectors, following the convention in the paper.\n\nTo load a pre-trained model, you can either clone this repository, download a checkpoint, and load it locally:\n```Python\nfrom model.dune import load_dune_from_checkpoint\nmodel = load_dune_from_checkpoint(\"./dune_vitbase14_448_paper.pth\")\n```\n\nor use `torch.hub` directly (see [hubconf.py](hubconf.py) for all available models):\n```Python\nimport torch\n\n# full model with projectors and teacher norms\nmodel = torch.hub.load(\"naver/dune\", \"dune_vitbase_14_448_paper\")\n# just the ViT encoder part of the model\nmodel = torch.hub.load(\"naver/dune\", \"dune_vitbase_14_448_paper_encoder\")\n```\n\n# Installation\n\n## Conda environment\n\n- Create a conda environment with all the necessary packages.\n\n```bash\nenv_name=\"dune\"\nconda create -n ${env_name}\nconda activate ${env_name}\nconda install python=3.12\npip install -U torch=='2.7.0' torchvision torchfix timm 'huggingface_hub\u003e=0.22' transformers accelerate einops torchmetrics optuna tensorboard matplotlib pandas jaxtyping scikit-learn-intelex omegaconf opencv-python ipython black flake8 pylint rich ipykernel\n```\n\n- Set the path of your conda installation in [scripts/setup_env.sh](./scripts/setup_env.sh), i.e. 
update the `conda_dir` variable.\nThe training script will then use this environment automatically.\n\n\n## Teacher models\n\n- To download the teacher models we used in this work, see the bash scripts under the [scripts/teachers](./scripts/teachers) folder.\nTo download all teachers at once, use [scripts/teachers/prepare_all.sh](./scripts/teachers/prepare_all.sh):\n```bash\n# BEFORE EXECUTING THIS COMMAND, MAKE SURE TO REVIEW THE CONTENTS OF THE SCRIPTS!\n(cd scripts/teachers \u0026\u0026 ./prepare_all.sh \u003cpath_to_download_directory\u003e)\n```\n\n- Once the teacher checkpoints are downloaded, update the `ckpt_path` keys in the `TEACHER_CFG` dictionary in [teachers/config.py](teachers/config.py) to point to the correct paths.\nFor MASt3R, the preparation script mentioned above will additionally clone the MASt3R repository.\nYou also need to set the `code_dir` key in `TEACHER_CFG` to point to the directory where this MASt3R repository is located.\n\n## Distillation datasets\n\nWe train DUNE models on a combination of 19 datasets with roughly 20.7M images.\nThe full list of datasets is available below, and in Table 5 of the paper.\nWe provide the dataloaders for these datasets in [data/dino2.py](data/dino2.py), [data/mast3r.py](data/mast3r.py) and [data/multihmr.py](data/multihmr.py); see these files for details.\nHowever, we leave the downloading and preprocessing of the datasets to the user.\nOnce you have the datasets, set their paths in [data/paths.py](data/paths.py).\n\nIf downloading the 19 datasets is too cumbersome, it is also possible to train DUNE on ImageNet-1K only.\nTo do that, set the `IN1K_DIRS` variable in [data/paths.py](data/paths.py) to the path of your ImageNet-1K copy.\n\n\n### Table of datasets\n\n| Name              | Size       | Nature     |\n|-------------------|------------|------------|\n| ImageNet-19K      | 13,153,480 | Real       |\n| Mapillary         | 1,205,907  | Real       |\n| Google Landmarks v2 | 4,132,914  | Real       |\n| Habitat        
   | 284,968    | Rendered   |\n| ARKitScenes       | 456,108    | Rendered   |\n| Blended MVS       | 98,937     | Rendered   |\n| MegaDepth         | 36,949     | Real       |\n| ScanNet++         | 60,188     | Rendered   |\n| CO3D-v2           | 185,100    | Real       |\n| Map-free          | 41,300     | Real       |\n| WildRgb           | 224,400    | Real       |\n| VirtualKitti      | 1,200      | Synthetic  |\n| Unreal4K          | 14,386     | Synthetic  |\n| TartanAir         | 136,225    | Real       |\n| DL3DV             | 208,800    | Rendered   |\n| BEDLAM            | 353,118    | Synthetic  |\n| AGORA             | 14,314     | Synthetic  |\n| CUFFS             | 54,944     | Synthetic  |\n| UBody             | 54,234     | Real       |\n\n\n# Training models\n\nDUNE training has two stages: pre-training at resolution 336 for 100 \"epochs\", followed by fine-tuning at resolution 448 for 50 epochs.\nWe define an epoch as 1,281,167 images, the size of the ImageNet-1K training set.\n\n```bash\n# Pre-training at resolution 336\n# To distill only on ImageNet-1K, pass --dataset=\"in1k\" to the script\noutput_dir_pretrain=\"/path/to/dune/pretrain/dir/\"\nbash ./scripts/train.sh ${output_dir_pretrain}\n\n# Fine-tuning at resolution 448\n# Adjust batch size according to your GPU memory\noutput_dir_finetune=\"/path/to/dune/finetune/dir/\"\nbash ./scripts/train.sh ${output_dir_finetune} \\\n    --pretrained=${output_dir_pretrain}/checkpoint.pth \\\n    --image_size=448 \\\n    --lr=5e-5 \\\n    --epochs=50 \\\n    --batch_size_per_gpu=128\n```\n\n# Evaluating downstream tasks\n\nAfter training the DUNE encoder, we freeze it and fine-tune the decoder (from the corresponding teacher) for each downstream task.\nPlease see Appendix B of the paper for details on fine-tuning.\n\nThis codebase does not include evaluation scripts for the downstream tasks; please refer to the original repositories:\n- 
https://github.com/facebookresearch/dinov2\n- https://github.com/naver/mast3r (see below for the fine-tuned MASt3R decoders)\n- https://github.com/naver/multi-hmr\n\n## Fine-tuned MASt3R decoders\n\nWe provide the fine-tuned MASt3R decoders for the DUNE encoders with resolution 448:\n\n- [DUNE-MASt3R ViT-Base](https://download.europe.naverlabs.com/dune/dunemast3r_cvpr25_vitbase.pth)\n- [DUNE-MASt3R ViT-Small](https://download.europe.naverlabs.com/dune/dunemast3r_cvpr25_vitsmall.pth)\n\nPlease see [the relevant section in the MASt3R repository](https://github.com/naver/mast3r?tab=readme-ov-file#usage-dunemast3r) for using these decoders in MASt3R evaluations.\n\n# PCA visualization\n\nWe provide an example script [scripts/pca_vis.py](scripts/pca_vis.py), which shows how to load the encoder part of the DUNE model and visualize the PCA output of its patch features.\nTo execute this script:\n```bash\nPYTHONPATH=${PWD}:${PYTHONPATH} python scripts/pca_vis.py\n```\n\nThis will generate a PCA visualization of the patch features of the best DUNE model reported in the paper on a test image.\n![PCA visualization](./assets/test_image_patch_pca_dune_vitbase14_448_paper.png)\n\n# Citation\n\nIf you find this repository useful, please consider citing us:\n\n```LaTeX\n@inproceedings{sariyildiz2025dune,\n    title={{DUNE}: Distilling a Universal Encoder from Heterogeneous {2D} and {3D} Teachers},\n    author={Sariyildiz, Mert Bulent and Weinzaepfel, Philippe and Lucas, Thomas and De Jorge, Pau and Larlus, Diane and Kalantidis, Yannis},\n    booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},\n    year={2025},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnaver%2Fdune","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnaver%2Fdune","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnaver%2Fdune/lists"}