{"id":27435981,"url":"https://github.com/facebookresearch/vggt","last_synced_at":"2025-04-14T19:03:35.016Z","repository":{"id":282090395,"uuid":"934997412","full_name":"facebookresearch/vggt","owner":"facebookresearch","description":"[CVPR 2025 Oral] VGGT: Visual Geometry Grounded Transformer","archived":false,"fork":false,"pushed_at":"2025-04-10T23:37:47.000Z","size":66011,"stargazers_count":4563,"open_issues_count":36,"forks_count":362,"subscribers_count":195,"default_branch":"main","last_synced_at":"2025-04-11T00:27:40.097Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-18T18:32:14.000Z","updated_at":"2025-04-11T00:26:42.000Z","dependencies_parsed_at":"2025-03-30T04:32:43.339Z","dependency_job_id":null,"html_url":"https://github.com/facebookresearch/vggt","commit_stats":null,"previous_names":["facebookresearch/vggt"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fvggt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fvggt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fvggt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fvggt/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/vggt/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248943443,"owners_count":21186958,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-14T19:02:57.287Z","updated_at":"2025-04-14T19:03:34.989Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["Python","novel","对象检测_分割","🤖 AI \u0026 Machine Learning"],"sub_categories":["资源传输下载"],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003eVGGT: Visual Geometry Grounded Transformer\u003c/h1\u003e\n\n\u003ca href=\"https://jytime.github.io/data/VGGT_CVPR25.pdf\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Paper-VGGT\" alt=\"Paper PDF\"\u003e\n\u003c/a\u003e\n\u003ca href=\"https://arxiv.org/abs/2503.11651\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2503.11651-b31b1b\" alt=\"arXiv\"\u003e\u003c/a\u003e\n\u003ca href=\"https://vgg-t.github.io/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Project_Page-green\" alt=\"Project Page\"\u003e\u003c/a\u003e\n\u003ca href='https://huggingface.co/spaces/facebook/vggt'\u003e\u003cimg src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'\u003e\u003c/a\u003e\n\n\n**[Visual Geometry Group, University of Oxford](https://www.robots.ox.ac.uk/~vgg/)**; **[Meta AI](https://ai.facebook.com/research/)**\n\n\n[Jianyuan Wang](https://jytime.github.io/), 
[Minghao Chen](https://silent-chen.github.io/), [Nikita Karaev](https://nikitakaraevv.github.io/), [Andrea Vedaldi](https://www.robots.ox.ac.uk/~vedaldi/), [Christian Rupprecht](https://chrirupp.github.io/), [David Novotny](https://d-novotny.github.io/)\n\u003c/div\u003e\n\n```bibtex\n@inproceedings{wang2025vggt,\n  title={VGGT: Visual Geometry Grounded Transformer},\n  author={Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  year={2025}\n}\n```\n\n## Updates\n- [Apr 13, 2025] Training code is being gradually cleaned and uploaded to the [training](https://github.com/facebookresearch/vggt/tree/training) branch. It will be merged into the main branch once finalized.\n\n## Overview\n\nVisual Geometry Grounded Transformer (VGGT, CVPR 2025) is a feed-forward neural network that directly infers all key 3D attributes of a scene, including extrinsic and intrinsic camera parameters, point maps, depth maps, and 3D point tracks, **from one, a few, or hundreds of its views, within seconds**.\n\n\n## Quick Start\n\nFirst, clone this repository to your local machine, and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub). 
\n\n```bash\ngit clone git@github.com:facebookresearch/vggt.git \ncd vggt\npip install -r requirements.txt\n```\n\nAlternatively, you can install VGGT as a package (\u003ca href=\"docs/package.md\"\u003eclick here\u003c/a\u003e for details).\n\n\nNow, try the model with just a few lines of code:\n\n```python\nimport torch\nfrom vggt.models.vggt import VGGT\nfrom vggt.utils.load_fn import load_and_preprocess_images\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+).\n# Check CUDA availability first so this line does not raise on CPU-only machines.\ndtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] \u003e= 8 else torch.float16\n\n# Initialize the model and load the pretrained weights.\n# This will automatically download the model weights the first time it's run, which may take a while.\nmodel = VGGT.from_pretrained(\"facebook/VGGT-1B\").to(device)\n\n# Load and preprocess example images (replace with your own image paths)\nimage_names = [\"path/to/imageA.png\", \"path/to/imageB.png\", \"path/to/imageC.png\"]  \nimages = load_and_preprocess_images(image_names).to(device)\n\nwith torch.no_grad():\n    with torch.cuda.amp.autocast(dtype=dtype):\n        # Predict attributes including cameras, depth maps, and point maps.\n        predictions = model(images)\n```\n\nThe model weights will be automatically downloaded from Hugging Face. If you encounter issues such as slow loading, you can manually download them [here](https://huggingface.co/facebook/VGGT-1B/blob/main/model.pt) and load them from the local file, or load them directly from the URL:\n\n```python\nmodel = VGGT()\n_URL = \"https://huggingface.co/facebook/VGGT-1B/resolve/main/model.pt\"\nmodel.load_state_dict(torch.hub.load_state_dict_from_url(_URL))\n```\n\n## Detailed Usage\n\nYou can also choose which attributes (branches) to predict, as shown below. This achieves the same result as the example above. 
This example uses a batch size of 1 (processing a single scene), but it naturally works for multiple scenes.\n\n```python\nfrom vggt.utils.pose_enc import pose_encoding_to_extri_intri\nfrom vggt.utils.geometry import unproject_depth_map_to_point_map\n\nwith torch.no_grad():\n    with torch.cuda.amp.autocast(dtype=dtype):\n        images = images[None]  # add batch dimension\n        aggregated_tokens_list, ps_idx = model.aggregator(images)\n                \n    # Predict Cameras\n    pose_enc = model.camera_head(aggregated_tokens_list)[-1]\n    # Extrinsic and intrinsic matrices, following OpenCV convention (camera from world)\n    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])\n\n    # Predict Depth Maps\n    depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)\n\n    # Predict Point Maps\n    point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)\n        \n    # Construct 3D Points from Depth Maps and Cameras\n    # which usually leads to more accurate 3D points than point map branch\n    point_map_by_unprojection = unproject_depth_map_to_point_map(depth_map.squeeze(0), \n                                                                extrinsic.squeeze(0), \n                                                                intrinsic.squeeze(0))\n\n    # Predict Tracks\n    # choose your own points to track, with shape (N, 2) for one scene\n    query_points = torch.FloatTensor([[100.0, 200.0], \n                                        [60.72, 259.94]]).to(device)\n    track_list, vis_score, conf_score = model.track_head(aggregated_tokens_list, images, ps_idx, query_points=query_points[None])\n```\n\n\nFurthermore, if certain pixels in the input frames are unwanted (e.g., reflective surfaces, sky, or water), you can simply mask them by setting the corresponding pixel values to 0 or 1. 
Precise segmentation masks aren't necessary; simple bounding-box masks work well (see this [issue](https://github.com/facebookresearch/vggt/issues/47) for an example).\n\n\n## Visualization\n\nWe provide multiple ways to visualize your 3D reconstructions and tracking results. Before using these visualization tools, install the required dependencies:\n\n```bash\npip install -r requirements_demo.txt\n```\n\n### Interactive 3D Visualization\n\n**Please note:** VGGT typically reconstructs a scene in less than 1 second. However, visualizing 3D points may take tens of seconds due to third-party rendering, independent of VGGT's processing time. Visualization is especially slow when the number of images is large.\n\n\n#### Gradio Web Interface\n\nOur Gradio-based interface allows you to upload images/videos, run reconstruction, and interactively explore the 3D scene in your browser. You can launch it on your local machine or try it on [Hugging Face](https://huggingface.co/spaces/facebook/vggt).\n\n\n```bash\npython demo_gradio.py\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to preview the Gradio interactive interface\u003c/summary\u003e\n\n![Gradio Web Interface Preview](https://jytime.github.io/data/vggt_hf_demo_screen.png)\n\u003c/details\u003e\n\n\n#### Viser 3D Viewer\n\nUse the following command to run reconstruction and visualize the point clouds in viser. Note that this script requires a path to a folder of images and assumes the folder contains only image files. 
You can set `--use_point_map` to use the point cloud from the point map branch, instead of the depth-based point cloud.\n\n```bash\npython demo_viser.py --image_folder path/to/your/images/folder\n```\n\n\n### Track Visualization\n\nTo visualize point tracks across multiple images:\n\n```python\nfrom vggt.utils.visual_track import visualize_tracks_on_images\ntrack = track_list[-1]\nvisualize_tracks_on_images(images, track, (conf_score\u003e0.2) \u0026 (vis_score\u003e0.2), out_dir=\"track_visuals\")\n```\nThis plots the tracks on the images and saves them to the specified output directory. \n\n\n## Single-view Reconstruction\n\nOur model shows surprisingly good performance on single-view reconstruction, although it was never trained for this task. The model does not need the single-view image duplicated into a pair; it infers the 3D structure directly from the tokens of the single-view image. Feel free to try it with our demos above, which naturally work for single-view reconstruction.\n\n\nWe did not quantitatively test monocular depth estimation performance ourselves, but [@kabouzeid](https://github.com/kabouzeid) generously provided a comparison of VGGT to recent methods [here](https://github.com/facebookresearch/vggt/issues/36). VGGT shows competitive or better results compared to state-of-the-art monocular approaches such as DepthAnything v2 or MoGe, despite never being explicitly trained for single-view tasks. \n\n\n\n## Runtime and GPU Memory\n\nWe benchmark the runtime and GPU memory usage of VGGT's aggregator on a single NVIDIA H100 GPU across various input sizes. 
\n\n| **Input Frames** | 1 | 2 | 4 | 8 | 10 | 20 | 50 | 100 | 200 |\n|:----------------:|:-:|:-:|:-:|:-:|:--:|:--:|:--:|:---:|:---:|\n| **Time (s)**     | 0.04 | 0.05 | 0.07 | 0.11 | 0.14 | 0.31 | 1.04 | 3.12 | 8.75 |\n| **Memory (GB)**  | 1.88 | 2.07 | 2.45 | 3.23 | 3.63 | 5.58 | 11.41 | 21.15 | 40.63 |\n\nNote that these results were obtained using Flash Attention 3, which is faster than the default Flash Attention 2 implementation while maintaining almost the same memory usage. Feel free to compile Flash Attention 3 from source to get better performance.\n\n\n## Research Progression\n\nOur work builds upon a series of previous research projects. If you're interested in understanding how our research evolved, check out our previous works:\n\n\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003e\n      \u003ca href=\"https://github.com/jytime/Deep-SfM-Revisited\"\u003eDeep SfM Revisited\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"white-space: pre;\"\u003e──┐\u003c/td\u003e\n    \u003ctd\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003e\n      \u003ca href=\"https://github.com/facebookresearch/PoseDiffusion\"\u003ePoseDiffusion\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"white-space: pre;\"\u003e─────►\u003c/td\u003e\n    \u003ctd\u003e\n      \u003ca href=\"https://github.com/facebookresearch/vggsfm\"\u003eVGGSfM\u003c/a\u003e ──►\n      \u003ca href=\"https://github.com/facebookresearch/vggt\"\u003eVGGT\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"left\"\u003e\n      \u003ca href=\"https://github.com/facebookresearch/co-tracker\"\u003eCoTracker\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"white-space: pre;\"\u003e──┘\u003c/td\u003e\n    \u003ctd\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n## Acknowledgements\n\nThanks to these great repositories: 
[PoseDiffusion](https://github.com/facebookresearch/PoseDiffusion), [VGGSfM](https://github.com/facebookresearch/vggsfm), [CoTracker](https://github.com/facebookresearch/co-tracker), [DINOv2](https://github.com/facebookresearch/dinov2), [Dust3r](https://github.com/naver/dust3r), [Moge](https://github.com/microsoft/moge), [PyTorch3D](https://github.com/facebookresearch/pytorch3d), [Sky Segmentation](https://github.com/xiongzhu666/Sky-Segmentation-and-Post-processing), [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2), [Metric3D](https://github.com/YvanYin/Metric3D) and many other inspiring works in the community.\n\n## Checklist\n\n- [ ] Release the training code\n- [ ] Release VGGT-500M and VGGT-200M\n\n\n## License\nSee the [LICENSE](./LICENSE.txt) file for details about the license under which this code is made available.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Fvggt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2Fvggt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Fvggt/lists"}