{"id":23943957,"url":"https://github.com/geetu040/depthpro-beyond-depth","last_synced_at":"2025-09-03T13:34:28.150Z","repository":{"id":270449460,"uuid":"910416000","full_name":"geetu040/depthpro-beyond-depth","owner":"geetu040","description":"Depth Estimation model, DepthPro by Apple, trained for Image Segmentation and Image Super Resolution.","archived":false,"fork":false,"pushed_at":"2025-01-02T16:50:38.000Z","size":49558,"stargazers_count":9,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-05-26T17:58:27.686Z","etag":null,"topics":["computer-vision","depth-estimation","image-segmentation","super-resolution","vision-transformer"],"latest_commit_sha":null,"homepage":"https://medium.com/@raoarmaghanshakir040/depth-pro-beyond-depth-9d822fc557ba","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/geetu040.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-31T08:01:20.000Z","updated_at":"2025-05-14T16:46:27.000Z","dependencies_parsed_at":"2024-12-31T09:27:47.206Z","dependency_job_id":null,"html_url":"https://github.com/geetu040/depthpro-beyond-depth","commit_stats":null,"previous_names":["geetu040/depthpro-beyond-depth"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/geetu040/depthpro-beyond-depth","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geetu040%2Fdepthpro-beyond-depth","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geetu040%2Fdepthpro-beyond-depth/tags",
"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geetu040%2Fdepthpro-beyond-depth/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geetu040%2Fdepthpro-beyond-depth/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/geetu040","download_url":"https://codeload.github.com/geetu040/depthpro-beyond-depth/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geetu040%2Fdepthpro-beyond-depth/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273452954,"owners_count":25108469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-03T02:00:09.631Z","response_time":76,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","depth-estimation","image-segmentation","super-resolution","vision-transformer"],"created_at":"2025-01-06T06:15:55.274Z","updated_at":"2025-09-03T13:34:28.127Z","avatar_url":"https://github.com/geetu040.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Output from each model](assets/readme/all-3-models(1).png)\n\n# DepthPro: Sharp Monocular Depth Estimation\n\n**DepthPro** is a foundational model designed for zero-shot monocular depth estimation. 
Leveraging a multi-scale vision transformer (ViT-based, DINOv2), the model optimizes for dense predictions by processing images at multiple scales. Each image is split into patches, encoded using a shared patch encoder across scales, then merged, upsampled, and fused via a DPT decoder.\n\n- **Research Paper**: [Depth Pro: Sharp Monocular Metric Depth in Less Than a Second](https://arxiv.org/pdf/2410.02073)\n- **Authors**: [Aleksei Bochkovskii](https://arxiv.org/search/cs?searchtype=author\u0026query=Bochkovskii,+A), [Amaël Delaunoy](https://arxiv.org/search/cs?searchtype=author\u0026query=Delaunoy,+A), et al.\n- **Official Code**: [apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)\n- **Official Weights**: [apple/DepthPro](https://huggingface.co/apple/DepthPro)\n- **Unofficial Weights**: [geetu040/DepthPro](https://huggingface.co/geetu040/DepthPro)\n- **Web UI Interface**: [spaces/geetu040/DepthPro](https://huggingface.co/spaces/geetu040/DepthPro)\n- **Interface in Transformers (Open PR)**: https://github.com/huggingface/transformers/pull/34583\n\n![Depth Pro Teaser](assets/readme/depth-pro-teaser.jpg)\n\n# DepthPro: Beyond Depth Estimation\n\nIn this repository, we use this architecture and the pretrained weights available for depth estimation to explore its capabilities in further image processing tasks such as **Image Segmentation** and **Image Super Resolution**.\n\n**Quick Links**\n\n| Task                           | Web UI Interface                                                                                  | Code-Based Inference and Weights                                                                      | Training Code on Colab                                                                                                                                  | Training Code on Kaggle                                                                                                                        | Training Logs                             
                        | Validation Outputs                                                          |\n| ------------------------------ | ------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- | --------------------------------------------------------------------------- |\n| **Depth Estimation**           | [DepthPro](https://huggingface.co/spaces/geetu040/DepthPro)                                       | [geetu040 / DepthPro](https://huggingface.co/geetu040/DepthPro)                                       | -                                                                                                                                                       | -                                                                                                                                              | -                                                                 | -                                                                           |\n| **Human Segmentation**         | [DepthPro Segmentation Human](https://huggingface.co/spaces/geetu040/DepthPro_Segmentation_Human) | [geetu040 / DepthPro Segmentation Human](https://huggingface.co/geetu040/DepthPro_Segmentation_Human) | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1IXKoCHqzOwszmRrUiynbbGL_SiCwWKPK) | -                                                                                                                                   
           | [Training Logs](assets/training_logs/Segmentation_Human.png)      | [Validation Outputs](assets/validation_outputs/Segmentation_Human.jpg)      |\n| **Super Resolution (4x 256p)** | [DepthPro SR 4x 256p](https://huggingface.co/spaces/geetu040/DepthPro_SR_4x_256p)                 | [geetu040 / DepthPro SR 4x 256p](https://huggingface.co/geetu040/DepthPro_SR_4x_256p)                 | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1J4UheUjCLS-oqOuay-JfPkIZZQXpnZGZ) | [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/code/sacrum/depthpro-superresolution-4x-256p/) | [Training Logs](assets/training_logs/SuperResolution_4x_256p.png) | [Validation Outputs](assets/validation_outputs/SuperResolution_4x_256p.png) |\n| **Super Resolution (4x 384p)** | [DepthPro SR 4x 384p](https://huggingface.co/spaces/geetu040/DepthPro_SR_4x_384p)                 | [geetu040 / DepthPro SR 4x 384p](https://huggingface.co/geetu040/DepthPro_SR_4x_384p)                 | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fYqfMxhekHCAlTxkBj5be-dsNgs5LQOK) | [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/code/sacrum/depthpro-superresolution-4x-384p/) | [Training Logs](assets/training_logs/SuperResolution_4x_384p.png) | [Validation Outputs](assets/validation_outputs/SuperResolution_4x_384p.png) |\n\n## DepthPro: Image Segmentation (Human)\n\n- For Web UI Interface: [**spaces/geetu040/DepthPro_Segmentation_Human**](https://huggingface.co/spaces/geetu040/DepthPro_Segmentation_Human)\n- For Code-Based Inference and model weights: [**geetu040/DepthPro_Segmentation_Human**](https://huggingface.co/geetu040/DepthPro_Segmentation_Human)\n- For Training, check the notebook on:\n  - [![Open in 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1IXKoCHqzOwszmRrUiynbbGL_SiCwWKPK)\n  - [Segmentation_Human.ipynb](Segmentation_Human.ipynb)\n\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr\u003e\n      \u003cth\u003eInput Image\u003c/th\u003e\n      \u003cth\u003eGround Truth\u003c/th\u003e\n      \u003cth\u003ePrediction\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003ctd colspan=\"3\"\u003e\n        \u003ca href=\"assets/validation_outputs/Segmentation_Human.jpg\"\u003e\n          \u003cimg src=\"assets/validation_outputs_brief/Segmentation_Human.png\" alt=\"Segmentation Human\"\u003e\n        \u003c/a\u003e\n      \u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\nWe modify the [Apple's DepthPro for Monocular Depth Estimation](https://arxiv.org/abs/2410.02073) model for the `Image Segmentation` task.\n1. The pre-trained depth estimation model is used with slight changes in the head layer to make it compatible with the segmentation task.\n2. Hidden feature maps are generated to gain insight into the encoder and fusion stages of the model.\n3. For `training` and `validation`, we use the `Human Segmentation Dataset - Supervise.ly` from Kaggle: [tapakah68/supervisely-filtered-segmentation-person-dataset](https://www.kaggle.com/datasets/tapakah68/supervisely-filtered-segmentation-person-dataset)\n   - It contains 2667 samples, which are randomly split into 80% training and 20% validation.\n   - Each sample contains an image and its corresponding mask.\n4. 
**The model produces exceptional results on the validation set with an `IoU score of 0.964` and `Dice score of 0.982`, beating the [previous state-of-the-art IoU score of 0.95](https://www.kaggle.com/code/saeedghamshadzai/person-segmentation-deeplabv3-pytorch) on this dataset.**\n\n\u003cdetails\u003e\n  \u003csummary\u003eSee the training logs\u003c/summary\u003e\n\n  [![Training Logs](assets/training_logs/Segmentation_Human.png)](assets/training_logs/Segmentation_Human.png)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eSee all Validation Outputs\u003c/summary\u003e\n\n  [![validation_outputs/Segmentation_Human](assets/validation_outputs/Segmentation_Human.jpg)](assets/validation_outputs/Segmentation_Human.jpg)\n\n\u003c/details\u003e\n\n## DepthPro: Image Super Resolution (4x 256px)\n\n- For Web UI Interface: [**spaces/geetu040/DepthPro_SR_4x_256p**](https://huggingface.co/spaces/geetu040/DepthPro_SR_4x_256p)\n- For Code-Based Inference and model weights: [**geetu040/DepthPro_SR_4x_256p**](https://huggingface.co/geetu040/DepthPro_SR_4x_256p)\n- For Training, check the notebook on:\n  - [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1J4UheUjCLS-oqOuay-JfPkIZZQXpnZGZ)\n  - [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/code/sacrum/depthpro-superresolution-4x-256p/)\n  - [SuperResolution_4x_256p.ipynb](SuperResolution_4x_256p.ipynb)\n\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr\u003e\n      \u003cth\u003eLow Resolution 256px (Input Image)\u003c/th\u003e\n      \u003cth\u003eSuper Resolution 1024px (Depth Pro)\u003c/th\u003e\n      \u003cth\u003eHigh Resolution 1024px (Ground Truth)\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003ctd colspan=\"3\"\u003e\n        \u003ca href=\"assets/validation_outputs/SuperResolution_4x_256p.png\"\u003e\n          \u003cimg 
src=\"assets/validation_outputs_brief/SuperResolution_4x_256p.png\" alt=\"Super Resolution 4x\"\u003e\n        \u003c/a\u003e\n      \u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\nWe then modify the [Apple's DepthPro for Monocular Depth Estimation](https://arxiv.org/abs/2410.02073) model for the `Image Super Resolution` task.\n1. The base model architecture is modified for the task of Image Super Resolution from 256px to 1024px (4x upsampling).\n2. For `training` and `validation`, we use the `Div2k` dataset, introduced in [NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study](https://ieeexplore.ieee.org/document/8014884)\n   - It contains high-resolution images in 2k resolution, which have been downsampled to `LR_SIZE=256` and `HR_SIZE=1024` for training and validation.\n   - It contains\n      - 800 training samples\n      - 200 validation samples\n   - The dataset has been downloaded from Kaggle: [soumikrakshit/div2k-high-resolution-images](https://www.kaggle.com/datasets/soumikrakshit/div2k-high-resolution-images)\n3. For `testing`, we use the `Urban100` dataset, introduced in [Single Image Super-Resolution From Transformed Self-Exemplars](https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Huang_Single_Image_Super-Resolution_2015_CVPR_paper.html)\n   - It contains images in 2 resolutions, 256 (low) and 1024 (high).\n   - It contains 100 samples.\n   - The dataset has been downloaded from Kaggle: [harshraone/urban100](https://www.kaggle.com/datasets/harshraone/urban100)\n4. 
Results:\n   - The model achieves a best `PSNR score of 24.80` and `SSIM score of 0.74` on the validation set.\n   - It achieves a `PSNR score of 21.36` and `SSIM score of 0.62` on the test set.\n   - The model is able to restore some of the information missing from the low-resolution images.\n   - Results are better than most of the generative techniques applied on Kaggle, but still have a long way to go to reach state-of-the-art results.\n   - This is because of the nature of Vision Transformers, which are not specifically designed for Super Resolution tasks.\n\n\u003cdetails\u003e\n  \u003csummary\u003eSee the training logs\u003c/summary\u003e\n\n  [![training_logs/SuperResolution_4x_256p](assets/training_logs/SuperResolution_4x_256p.png)](assets/training_logs/SuperResolution_4x_256p.png)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eSee all Validation Outputs\u003c/summary\u003e\n\n  [![validation_outputs/SuperResolution_4x_256p](assets/validation_outputs/SuperResolution_4x_256p.png)](assets/validation_outputs/SuperResolution_4x_256p.png)\n\n\u003c/details\u003e\n\n## DepthPro: Image Super Resolution (4x 384px)\n\n- For Web UI Interface: [**spaces/geetu040/DepthPro_SR_4x_384p**](https://huggingface.co/spaces/geetu040/DepthPro_SR_4x_384p)\n- For Code-Based Inference and model weights: [**geetu040/DepthPro_SR_4x_384p**](https://huggingface.co/geetu040/DepthPro_SR_4x_384p)\n- For Training, check the notebook on:\n  - [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fYqfMxhekHCAlTxkBj5be-dsNgs5LQOK)\n  - [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/code/sacrum/depthpro-superresolution-4x-384p/)\n  - [SuperResolution_4x_384p.ipynb](SuperResolution_4x_384p.ipynb)\n\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr\u003e\n      \u003cth\u003eLow Resolution 384px (Input Image)\u003c/th\u003e\n      \u003cth\u003eSuper Resolution 1536px (Depth 
Pro)\u003c/th\u003e\n      \u003cth\u003eHigh Resolution 1536px (Ground Truth)\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003ctd colspan=\"3\"\u003e\n        \u003ca href=\"assets/validation_outputs/SuperResolution_4x_384p.png\"\u003e\n          \u003cimg src=\"assets/validation_outputs_brief/SuperResolution_4x_384p.png\" alt=\"Super Resolution 4x\"\u003e\n        \u003c/a\u003e\n      \u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\nWe use the modified [Apple's DepthPro for Monocular Depth Estimation](https://arxiv.org/abs/2410.02073) model for the `Image Super Resolution` task.\n1. The base model architecture is modified for the task of Image Super Resolution from 384px to 1536px (4x upsampling).\n2. For `training` and `validation`, we use the `Div2k` dataset, introduced in [NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study](https://ieeexplore.ieee.org/document/8014884)\n   - It contains high-resolution images in 2k resolution, which have been downsampled to `LR_SIZE=384` and `HR_SIZE=1536` for training and validation.\n   - It contains\n      - 800 training samples\n      - 200 validation samples\n   - The dataset has been downloaded from Kaggle: [soumikrakshit/div2k-high-resolution-images](https://www.kaggle.com/datasets/soumikrakshit/div2k-high-resolution-images)\n3. 
Results:\n   - The model achieves a best `PSNR score of 27.19` and `SSIM score of 0.81` on the validation set.\n   - The model is able to restore some of the information missing from the low-resolution images.\n   - Results are better than the generative techniques applied on Kaggle, but fall slightly short of state-of-the-art results.\n   - This is because of the nature of Vision Transformers, which are not specifically designed for Super Resolution tasks.\n\n\u003cdetails\u003e\n  \u003csummary\u003eSee the training logs\u003c/summary\u003e\n\n  [![training_logs/SuperResolution_4x_384p](assets/training_logs/SuperResolution_4x_384p.png)](assets/training_logs/SuperResolution_4x_384p.png)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eSee all Validation Outputs\u003c/summary\u003e\n\n  [![validation_outputs/SuperResolution_4x_384p](assets/validation_outputs/SuperResolution_4x_384p.png)](assets/validation_outputs/SuperResolution_4x_384p.png)\n\n\u003c/details\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeetu040%2Fdepthpro-beyond-depth","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeetu040%2Fdepthpro-beyond-depth","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeetu040%2Fdepthpro-beyond-depth/lists"}