https://github.com/ByteDance-Seed/Depth-Anything-3

Depth Anything 3
Host: GitHub
URL: https://github.com/ByteDance-Seed/Depth-Anything-3
Owner: ByteDance-Seed
License: apache-2.0
Created: 2025-11-12T08:44:03.000Z (7 months ago)
Default Branch: main
Last Pushed: 2026-03-21T07:14:45.000Z (3 months ago)
Last Synced: 2026-03-26T11:51:08.147Z (3 months ago)
Language: Python
Homepage: https://depth-anything-3.github.io/
Size: 22.1 MB
Stars: 4,800
Watchers: 43
Forks: 494
Open Issues: 175
Metadata Files:
- Readme: README.md
- License: LICENSE

          


Depth Anything 3: Recovering the Visual Space from Any Views


[**Haotong Lin**](https://haotongl.github.io/)\* · [**Sili Chen**](https://github.com/SiliChen321)\* · [**Jun Hao Liew**](https://liewjunhao.github.io/)\* · [**Donny Y. Chen**](https://donydchen.github.io)\* · [**Zhenyu Li**](https://zhyever.github.io/) · [**Guang Shi**](https://scholar.google.com/citations?user=MjXxWbUAAAAJ&hl=en) · [**Jiashi Feng**](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en)

[**Bingyi Kang**](https://bingyikang.com/)\*†

† Project lead · \* Equal contribution









This work presents **Depth Anything 3 (DA3)**, a model that predicts spatially consistent geometry from arbitrary visual inputs, with or without known camera poses.

In pursuit of minimal modeling, DA3 yields two key insights:

- 💎 A **single plain transformer** (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization,

- ✨ A single **depth-ray representation** obviates the need for complex multi-task learning.

🏆 DA3 significantly outperforms [DA2](https://github.com/DepthAnything/Depth-Anything-V2) for monocular depth estimation, and [VGGT](https://github.com/facebookresearch/vggt) for multi-view depth estimation and pose estimation.

All models are trained exclusively on **public academic datasets**.



  





  



## 📰 News

- **11-12-2025:** 🚀 New models and [**DA3-Streaming**](da3_streaming/README.md) released! Handle ultra-long video sequence inference with less than 12GB GPU memory via sliding-window streaming inference. Special thanks to [Kai Deng](https://github.com/DengKaiCQ) for his contribution to DA3-Streaming!

- **08-12-2025:** 📊 [Benchmark evaluation pipeline](docs/BENCHMARK.md) released! Evaluate pose estimation & 3D reconstruction on 5 datasets.

- **30-11-2025:** Add [`use_ray_pose`](#use-ray-pose) and [`ref_view_strategy`](docs/funcs/ref_view_strategy.md) (reference view selection for multi-view inputs).   

- **25-11-2025:** Add [Awesome DA3 Projects](#-awesome-da3-projects), a community-driven section featuring DA3-based applications.

- **14-11-2025:** Paper, project page, code and models are all released.

## ✨ Highlights

### 🏆 Model Zoo

We release three series of models, each tailored for specific use cases in visual geometry.

- 🌟 **DA3 Main Series** (`DA3-Giant`, `DA3-Large`, `DA3-Base`, `DA3-Small`): These are our flagship foundation models, trained with a unified depth-ray representation. By varying the input configuration, a single model can perform a wide range of tasks:

  + 🌊 **Monocular Depth Estimation**: Predicts a depth map from a single RGB image.

  + 🌊 **Multi-View Depth Estimation**: Generates consistent depth maps from multiple images for high-quality fusion.

  + 🎯 **Pose-Conditioned Depth Estimation**: Achieves superior depth consistency when camera poses are provided as input.

  + 📷 **Camera Pose Estimation**:  Estimates camera extrinsics and intrinsics from one or more images.

  + 🟡 **3D Gaussian Estimation**: Directly predicts 3D Gaussians, enabling high-fidelity novel view synthesis.

- 📐 **DA3 Metric Series** (`DA3Metric-Large`): A specialized model fine-tuned for metric depth estimation in monocular settings, ideal for applications requiring real-world scale.

- 🔍 **DA3 Monocular Series** (`DA3Mono-Large`): A dedicated model for high-quality relative monocular depth estimation. Unlike disparity-based models (e.g., [Depth Anything 2](https://github.com/DepthAnything/Depth-Anything-V2)), it directly predicts depth, resulting in superior geometric accuracy.

🔗 Leveraging these available models, we developed a **nested series** (`DA3Nested-Giant-Large`). This series combines an any-view giant model with a metric model to reconstruct visual geometry at a real-world metric scale.

### 🛠️ Codebase Features

Our repository is designed to be a powerful and user-friendly toolkit for both practical application and future research.

- 🎨 **Interactive Web UI & Gallery**: Visualize model outputs and compare results with an easy-to-use Gradio-based web interface.

- ⚡ **Flexible Command-Line Interface (CLI)**: Powerful and scriptable CLI for batch processing and integration into custom workflows.

- 💾 **Multiple Export Formats**: Save your results in various formats, including `glb`, `npz`, depth images, `ply`, and 3DGS videos, to connect seamlessly with other tools.

- 🔧 **Extensible and Modular Design**: The codebase is structured to facilitate future research and the integration of new models or functionalities.

## 🚀 Quick Start

### 📦 Installation

```bash
pip install xformers "torch>=2" torchvision
pip install -e . # Basic
pip install --no-build-isolation git+https://github.com/nerfstudio-project/gsplat.git@0b4dddf04cb687367602c01196913cde6a743d70 # for gaussian head
pip install -e ".[app]" # Gradio, python>=3.10
pip install -e ".[all]" # ALL
```

For detailed model information, please refer to the [Model Cards](#-model-cards) section below.

### 💻 Basic Usage

```python
import glob, os, torch

from depth_anything_3.api import DepthAnything3

device = torch.device("cuda")
model = DepthAnything3.from_pretrained("depth-anything/DA3NESTED-GIANT-LARGE")
model = model.to(device=device)

# Collect the example images in a deterministic order.
example_path = "assets/examples/SOH"
images = sorted(glob.glob(os.path.join(example_path, "*.png")))

prediction = model.inference(
    images,
)

# prediction.processed_images : [N, H, W, 3] uint8   array
print(prediction.processed_images.shape)

# prediction.depth            : [N, H, W]    float32 array
print(prediction.depth.shape)

# prediction.conf             : [N, H, W]    float32 array
print(prediction.conf.shape)

# prediction.extrinsics       : [N, 3, 4]    float32 array # opencv w2c or colmap format
print(prediction.extrinsics.shape)

# prediction.intrinsics       : [N, 3, 3]    float32 array
print(prediction.intrinsics.shape)
```
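
The output shapes listed above are enough to lift a depth map into a world-space point cloud. The following is a minimal sketch rather than code from this repository: `unproject_depth` and `conf_thresh` are hypothetical names, and the sketch assumes a pinhole camera, depth measured along the camera z-axis, integer pixel coordinates, and OpenCV-style world-to-camera extrinsics (`x_cam = R @ x_world + t`).

```python
import numpy as np


def unproject_depth(depth, intrinsics, extrinsics, conf=None, conf_thresh=0.0):
    """Lift one depth map [H, W] to world-space points [M, 3].

    Hypothetical helper. Assumes z-depth, a pinhole camera, and
    world-to-camera extrinsics [3, 4] in the OpenCV convention.
    """
    depth = np.asarray(depth, dtype=np.float64)
    K = np.asarray(intrinsics, dtype=np.float64)
    E = np.asarray(extrinsics, dtype=np.float64)
    h, w = depth.shape

    # Homogeneous pixel grid; u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project to camera coordinates and scale each ray by its depth.
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)

    # Invert the world-to-camera transform: x_world = R^T (x_cam - t).
    R, t = E[:, :3], E[:, 3]
    world = (cam - t) @ R

    # Optionally drop low-confidence pixels.
    if conf is not None:
        world = world[np.asarray(conf).reshape(-1) > conf_thresh]
    return world


# Synthetic check: a flat plane 2 units in front of an identity camera.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
E = np.hstack([np.eye(3), np.zeros((3, 1))])
points = unproject_depth(np.full((3, 5), 2.0), K, E)
print(points.shape)
```

With real predictions, the same call would take `prediction.depth[i]`, `prediction.intrinsics[i]`, `prediction.extrinsics[i]`, and optionally `prediction.conf[i]` for view `i`. A suitable confidence threshold depends on the data and is not specified here.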

```bash
export MODEL_DIR=depth-anything/DA3NESTED-GIANT-LARGE
# This can be a Hugging Face repository or a local directory
# If you encounter network issues, consider using the following mirror: export HF_ENDPOINT=https://hf-mirror.com
# Alternatively, you can download the model directly from Hugging Face

export GALLERY_DIR=workspace/gallery
mkdir -p $GALLERY_DIR

# CLI auto mode with backend reuse
da3 backend --model-dir ${MODEL_DIR} --gallery-dir ${GALLERY_DIR} # Cache model to gpu
da3 auto assets/examples/SOH \
    --export-format glb \
    --export-dir ${GALLERY_DIR}/TEST_BACKEND/SOH \
    --use-backend

# CLI video processing with feature visualization
da3 video assets/examples/robot_unitree.mp4 \
    --fps 15 \
    --use-backend \
    --export-dir ${GALLERY_DIR}/TEST_BACKEND/robo \
    --export-format glb-feat_vis \
    --feat-vis-fps 15 \
    --process-res-method lower_bound_resize \
    --export-feat "11,21,31"

# CLI auto mode without backend reuse
da3 auto assets/examples/SOH \
    --export-format glb \
    --export-dir ${GALLERY_DIR}/TEST_CLI/SOH \
    --model-dir ${MODEL_DIR}
```

The model architecture is defined in [`DepthAnything3Net`](src/depth_anything_3/model/da3.py) and configured through a YAML file located in [`src/depth_anything_3/configs`](src/depth_anything_3/configs). Input and output processing is handled by [`DepthAnything3`](src/depth_anything_3/api.py). To customize the model architecture, create a new config file (*e.g.*, `path/to/new/config`) such as:

```yaml
__object__:
  path: depth_anything_3.model.da3
  name: DepthAnything3Net
  args: as_params

net:
  __object__:
    path: depth_anything_3.model.dinov2.dinov2
    name: DinoV2
    args: as_params

  name: vitb
  out_layers: [5, 7, 9, 11]
  alt_start: 4
  qknorm_start: 4
  rope_start: 4
  cat_token: True

head:
  __object__:
    path: depth_anything_3.model.dualdpt
    name: DualDPT
    args: as_params

  dim_in: &head_dim_in 1536
  output_dim: 2
  features: &head_features 128
  out_channels: &head_out_channels [96, 192, 384, 768]
```

Then, the model can be created with the following code snippet.

```python
from depth_anything_3.cfg import create_object, load_config

# Load the config file and instantiate the model it describes.
model = create_object(load_config("path/to/new/config"))
```
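
To make the `__object__` layout easier to read, here is a rough sketch of how a config of this shape could be turned into nested Python objects. It is not the repository's implementation: `build_from_config` is a hypothetical helper, and it assumes that `path` is an importable module, `name` is a class in that module, and `args: as_params` means the sibling keys are passed as keyword arguments. Standard-library classes stand in for the model components so the example runs without DA3 installed.

```python
import importlib


def build_from_config(cfg):
    """Recursively instantiate a dict that follows the `__object__` layout.

    Hypothetical helper for illustration only; the real logic lives in
    `depth_anything_3.cfg` and may differ.
    """
    if not isinstance(cfg, dict):
        return cfg  # plain values (numbers, strings, lists) pass through

    # Build nested entries first so they can be handed to the parent.
    params = {k: build_from_config(v) for k, v in cfg.items() if k != "__object__"}

    spec = cfg.get("__object__")
    if spec is None:
        return params  # an ordinary mapping, not an object description

    cls = getattr(importlib.import_module(spec["path"]), spec["name"])
    # Assumption: `args: as_params` means "pass sibling keys as keyword arguments".
    return cls(**params)


# Stand-in target class from the standard library (not a DA3 class).
namespace = {"path": "types", "name": "SimpleNamespace", "args": "as_params"}
cfg = {
    "__object__": namespace,
    "net": {"__object__": namespace, "name": "vitb", "out_layers": [5, 7, 9, 11]},
    "head": {"__object__": namespace, "output_dim": 2},
}

model = build_from_config(cfg)
print(model.net.name, model.head.output_dim)
```

The design point this illustrates is that each component (`net`, `head`) carries its own class reference, so swapping a backbone or head is a config change rather than a code change.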

## 📚 Useful Documentation

- 🖥️ [Command Line Interface](docs/CLI.md)

- 📑 [Python API](docs/API.md)

- 📊 [Benchmark Evaluation](docs/BENCHMARK.md)

## 🗂️ Model Cards

Generally, you should observe that DA3-LARGE achieves comparable results to VGGT.

The Nested series uses an Any-view model to estimate pose and depth, and a monocular metric depth estimator for scaling. 

⚠️ Models with the `-1.1` suffix were retrained after fixing a training bug; prefer these refreshed checkpoints. The original `DA3NESTED-GIANT-LARGE`, `DA3-GIANT`, and `DA3-LARGE` remain available but are deprecated. You can expect much better performance on street scenes with the `-1.1` models.

| 🗃️ Model Name | 📏 Params | 📊 Rel. Depth | 📷 Pose Est. | 🧭 Pose Cond. | 🎨 GS | 📐 Met. Depth | ☁️ Sky Seg | 📄 License |
|---|---|---|---|---|---|---|---|---|
| **Nested** | | | | | | | | |
| [DA3NESTED-GIANT-LARGE-1.1](https://huggingface.co/depth-anything/DA3NESTED-GIANT-LARGE-1.1) | 1.40B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | CC BY-NC 4.0 |
| [DA3NESTED-GIANT-LARGE](https://huggingface.co/depth-anything/DA3NESTED-GIANT-LARGE) | 1.40B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | CC BY-NC 4.0 |
| **Any-view Model** | | | | | | | | |
| [DA3-GIANT-1.1](https://huggingface.co/depth-anything/DA3-GIANT-1.1) | 1.15B | ✅ | ✅ | ✅ | ✅ | | | CC BY-NC 4.0 |
| [DA3-GIANT](https://huggingface.co/depth-anything/DA3-GIANT) | 1.15B | ✅ | ✅ | ✅ | ✅ | | | CC BY-NC 4.0 |
| [DA3-LARGE-1.1](https://huggingface.co/depth-anything/DA3-LARGE-1.1) | 0.35B | ✅ | ✅ | ✅ | | | | CC BY-NC 4.0 |
| [DA3-LARGE](https://huggingface.co/depth-anything/DA3-LARGE) | 0.35B | ✅ | ✅ | ✅ | | | | CC BY-NC 4.0 |
| [DA3-BASE](https://huggingface.co/depth-anything/DA3-BASE) | 0.12B | ✅ | ✅ | ✅ | | | | Apache 2.0 |
| [DA3-SMALL](https://huggingface.co/depth-anything/DA3-SMALL) | 0.08B | ✅ | ✅ | ✅ | | | | Apache 2.0 |
| **Monocular Metric Depth** | | | | | | | | |
| [DA3METRIC-LARGE](https://huggingface.co/depth-anything/DA3METRIC-LARGE) | 0.35B | ✅ | | | | ✅ | ✅ | Apache 2.0 |
| **Monocular Depth** | | | | | | | | |
| [DA3MONO-LARGE](https://huggingface.co/depth-anything/DA3MONO-LARGE) | 0.35B | ✅ | | | | | ✅ | Apache 2.0 |

## ❓ FAQ

- **Monocular Metric Depth**: To obtain metric depth in meters from `DA3METRIC-LARGE`, use `metric_depth = focal * net_output / 300.`, where `focal` is the focal length in pixels (typically the average of fx and fy from the camera intrinsic matrix K). Note that the output from `DA3NESTED-GIANT-LARGE` is already in meters.

- **Ray Head (`use_ray_pose`)**: Our API and CLI support a `use_ray_pose` argument. When enabled, the model derives the camera pose from the ray head, which is generally slightly slower but more accurate. The default is `False`, which favors inference speed.

  

  AUC3 Results for DA3NESTED-GIANT-LARGE

  

  | Pose head | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
  |-----------|--------|-------|-----|---------|-----------|
  | `ray_head` | 84.4 | 52.6 | 93.9 | 29.5 | 89.4 |
  | `cam_head` | 80.3 | 48.4 | 94.1 | 28.5 | 85.0 |

  

- **Older GPUs without XFormers support**: See [Issue #11](https://github.com/ByteDance-Seed/Depth-Anything-3/issues/11). Thanks to [@S-Mahoney](https://github.com/S-Mahoney) for the solution!
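
The metric-depth conversion from the first FAQ item can be sketched in a few lines. The formula `focal * net_output / 300.` and the averaging of `fx` and `fy` come from the FAQ; the helper name `to_metric_depth` and the sample intrinsics are illustrative assumptions, and no model is run here.

```python
import numpy as np


def to_metric_depth(net_output, intrinsics):
    """Scale raw DA3METRIC-LARGE output to meters: focal * net_output / 300.

    `to_metric_depth` is a hypothetical helper name; `intrinsics` is a
    3x3 camera matrix K with focal lengths in pixels.
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    focal = float(0.5 * (fx + fy))  # average focal length in pixels
    return focal * np.asarray(net_output, dtype=np.float32) / 300.0


# Worked example with made-up numbers: fx = fy = 600 px, raw output 1.5.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
raw = np.full((2, 2), 1.5, dtype=np.float32)
metric = to_metric_depth(raw, K)  # 600 * 1.5 / 300 = 3.0 meters
print(metric)
```

This scaling applies only to `DA3METRIC-LARGE`; as noted above, the output of `DA3NESTED-GIANT-LARGE` is already in meters.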

## 🏢 Awesome DA3 Projects

A community-curated, non-exhaustive list of Depth Anything 3 integrations across 3D tools, creative pipelines, robotics, and web/VR viewers. You are welcome to submit your DA3-based project via PR, and we will review and feature it if applicable.

- [DA3-blender](https://github.com/xy-gao/DA3-blender): Blender addon for DA3-based 3D reconstruction from a set of images. 

- [ComfyUI-DepthAnythingV3](https://github.com/PozzettiAndrea/ComfyUI-DepthAnythingV3): ComfyUI nodes for Depth Anything 3, supporting single/multi-view and video-consistent depth with optional point‑cloud export.

- [DA3-ROS2-Wrapper](https://github.com/GerdsenAI/GerdsenAI-Depth-Anything-3-ROS2-Wrapper): Real-time DA3 depth in ROS2 with multi-camera support. 

- [DA3-ROS2-CPP-TensorRT](https://github.com/ika-rwth-aachen/ros2-depth-anything-v3-trt): A ROS2 C++ node for DA3 depth estimation that uses TensorRT for real-time inference.

- [VideoDepthViewer3D](https://github.com/amariichi/VideoDepthViewer3D): Streaming videos with DA3 metric depth to a Three.js/WebXR 3D viewer for VR/stereo playback.

## 🧑‍💻 Official Codebase Core Contributors and Maintainers

  

    

      

        

      

        


- Bingyi Kang
- Haotong Lin
- Sili Chen
- Jun Hao Liew
- Donny Y. Chen
- Kai Deng

    

  

## 📝 Citations

If you find Depth Anything 3 useful in your research or projects, please cite our work:

```
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:2511.10647},
  year={2025}
}
```