# MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

Yuedong Chen · Chuanxia Zheng · Haofei Xu · Bohan Zhuang · Andrea Vedaldi · Tat-Jen Cham · Jianfei Cai

**NeurIPS 2024**

Paper | Project Page | Pretrained Models

https://github.com/user-attachments/assets/4cfa6654-5bb5-4f72-a264-6941bcf00bed

## Installation

To get started, create a conda virtual environment using Python 3.10+ and install the requirements:

```bash
conda create -n mvsplat360 python=3.10
conda activate mvsplat360
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 xformers==0.0.25.post1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
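
Optionally, you can sanity-check that the pinned PyTorch build sees your GPU before moving on (a minimal check, not part of the official setup):

```bash
# should print something like "2.2.2+cu118 True" on a CUDA machine
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```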

## Acquiring Datasets

This project mainly uses the [DL3DV](https://github.com/DL3DV-10K/Dataset) and [RealEstate10K](https://google.github.io/realestate10k/index.html) datasets.

The dataset structure aligns with that of our previous work, [MVSplat](https://github.com/donydchen/mvsplat?tab=readme-ov-file#acquiring-datasets). Refer to the script [convert_dl3dv.py](src/scripts/convert_dl3dv.py) for converting the DL3DV-10K dataset to the torch chunks used in this project.

You might also want to check out [DepthSplat's DATASETS.md](https://github.com/cvg/depthsplat/blob/main/DATASETS.md), which provides detailed instructions on pre-processing DL3DV and RealEstate10K for use here (both projects share the same codebase, derived from pixelSplat).

A pre-processed tiny subset of DL3DV (containing 5 scenes) is provided [here](https://huggingface.co/donydchen/mvsplat360/blob/main/dl3dv_tiny.zip) for quick reference. To use it, simply download it and unzip it to `datasets/dl3dv_tiny`.
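
For reference, the chunked layout mirrors MVSplat's, so after unzipping the tiny subset you should see something roughly like the sketch below (split folders and chunk file names may differ slightly):

```
datasets/dl3dv_tiny/
├── train/   # *.torch chunk files + index.json
└── test/    # *.torch chunk files + index.json
```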

## Running the Code

### Evaluation

To render novel views:

* download the pre-trained model [dl3dv_480p.ckpt](https://huggingface.co/donydchen/mvsplat360/blob/main/dl3dv_480p.ckpt) and save it to `checkpoints/`

* run the following:

```bash
# dl3dv; requires at least 22G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 \
wandb.name=dl3dv_480P_ctx5_tgt56 \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt
```

* the rendered novel views will be stored under `outputs/test/{wandb.name}`

To compute the quantitative metrics, refer to [compute_dl3dv_metrics.py](src/scripts/compute_dl3dv_metrics.py).

To render videos from a pre-trained model, run the following:

```bash
# dl3dv; requires at least 38G VRAM
python -m src.main +experiment=dl3dv_mvsplat360_video \
wandb.name=dl3dv_480P_ctx5_tgt56_video \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt
```

### Training

* Download the pre-trained encoder weights from [MVSplat](https://github.com/donydchen/mvsplat?tab=readme-ov-file#evaluation) and save the checkpoint to `checkpoints/re10k.ckpt`.
* Download the pre-trained SVD weights from [generative-models](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid/tree/main) and save the file to `checkpoints/svd.safetensors`.
* Run the following:

```bash
# train mvsplat360; requires at least 80G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 dataset.roots=[datasets/dl3dv]
```

* Alternatively, you can fine-tune from our released model by appending `checkpointing.load=checkpoints/dl3dv_480p.ckpt` and `checkpointing.resume=false` to the above command, as shown below.
* You can also set up your wandb account in [config/main.yaml](config/main.yaml) for logging. Have fun.
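
Concretely, the fine-tuning variant of the training command above would look like this:

```bash
# fine-tune from the released 480p checkpoint instead of training from scratch
python -m src.main +experiment=dl3dv_mvsplat360 \
dataset.roots=[datasets/dl3dv] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt \
checkpointing.resume=false
```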

## Camera Conventions

The camera intrinsic matrices are normalized (the first row is divided by the image width, and the second row by the image height). More details can be found in [this comment](https://github.com/donydchen/mvsplat/issues/28#issuecomment-2126416038).

The camera extrinsic matrices are OpenCV-style camera-to-world matrices (+X right, +Y down, and +Z pointing into the screen, i.e., along the camera's viewing direction).
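
For illustration only (the helper below is a hypothetical sketch, not part of this repo), normalizing a pixel-space intrinsic matrix following the convention above looks like this:

```python
import numpy as np

def normalize_intrinsics(K: np.ndarray, width: int, height: int) -> np.ndarray:
    """Hypothetical helper: divide the first row of K by the image width
    and the second row by the image height, per the convention above."""
    K = K.astype(np.float64).copy()
    K[0, :] /= width   # fx, skew, cx
    K[1, :] /= height  # fy, cy
    return K

# e.g., a 640x480 image with fx = fy = 500 and the principal point at the center
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
print(normalize_intrinsics(K, width=640, height=480))
```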

## BibTeX

```bibtex
@inproceedings{chen2024mvsplat360,
    title     = {MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views},
    author    = {Chen, Yuedong and Zheng, Chuanxia and Xu, Haofei and Zhuang, Bohan and Vedaldi, Andrea and Cham, Tat-Jen and Cai, Jianfei},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    year      = {2024},
}
```

## Acknowledgements

The project is based on [MVSplat](https://github.com/donydchen/mvsplat), [pixelSplat](https://github.com/dcharatan/pixelsplat), [UniMatch](https://github.com/autonomousvision/unimatch) and [generative-models](https://github.com/Stability-AI/generative-models). Many thanks to these projects for their excellent contributions!