{"id":23533533,"url":"https://github.com/harboryuan/polyphonicformer","last_synced_at":"2025-10-25T02:48:57.417Z","repository":{"id":47963697,"uuid":"434543653","full_name":"HarborYuan/PolyphonicFormer","owner":"HarborYuan","description":"[ECCV 2022] 🎵PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation","archived":false,"fork":false,"pushed_at":"2022-12-22T04:17:17.000Z","size":5655,"stargazers_count":56,"open_issues_count":2,"forks_count":4,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-07-18T22:25:29.649Z","etag":null,"topics":["computer-vision","deep-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HarborYuan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-03T09:47:06.000Z","updated_at":"2025-07-18T12:08:39.000Z","dependencies_parsed_at":"2023-01-30T05:31:45.924Z","dependency_job_id":null,"html_url":"https://github.com/HarborYuan/PolyphonicFormer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/HarborYuan/PolyphonicFormer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HarborYuan%2FPolyphonicFormer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HarborYuan%2FPolyphonicFormer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HarborYuan%2FPolyphonicFormer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HarborYuan%2FPolyphonicFormer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/H
arborYuan","download_url":"https://codeload.github.com/HarborYuan/PolyphonicFormer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HarborYuan%2FPolyphonicFormer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280897704,"owners_count":26409960,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-25T02:00:06.499Z","response_time":81,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","deep-learning"],"created_at":"2024-12-25T23:27:58.404Z","updated_at":"2025-10-25T02:48:57.387Z","avatar_url":"https://github.com/HarborYuan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🎵PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation\n\n**PolyphonicFormer is the winning method of the ICCV-2021 SemKITTI-DVPS [Challenge](https://motchallenge.net/workshops/bmtt2021/).**\n\n**PolyphonicFormer was accepted to [ECCV '22](https://eccv2022.ecva.net/), Tel Aviv, 
Israel.**\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/polyphonicformer-unified-query-learning-for/depth-aware-video-panoptic-segmentation-on)](https://paperswithcode.com/sota/depth-aware-video-panoptic-segmentation-on?p=polyphonicformer-unified-query-learning-for)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/polyphonicformer-unified-query-learning-for/depth-aware-video-panoptic-segmentation-on-1)](https://paperswithcode.com/sota/depth-aware-video-panoptic-segmentation-on-1?p=polyphonicformer-unified-query-learning-for)\n\n[Haobo Yuan](https://yuanhaobo.me)\\*,\n[Xiangtai Li](https://lxtgh.github.io)\\*,\n[Yibo Yang](https://iboing.github.io),\n[Guangliang Cheng](https://scholar.google.com/citations?user=FToOC-wAAAAJ),\n[Jing Zhang](https://scholar.google.com/citations?user=9jH5v74AAAAJ),\n[Yunhai Tong](https://eecs.pku.edu.cn/info/1475/9689.htm),\n[Lefei Zhang](https://sites.google.com/site/lzhangpage/),\n[Dacheng Tao](https://www.sydney.edu.au/engineering/about/our-people/academic-staff/dacheng-tao.html).\n\n[[pdf](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136870574.pdf)] [[supp](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136870574-supp.pdf)] [[arxiv](https://arxiv.org/abs/2112.02582)] [[code](https://github.com/HarborYuan/PolyphonicFormer)] [[poster](https://huggingface.co/HarborYuan/PolyphonicFormer/resolve/main/4533.pdf)]\n\n## Demo\n![Demo1](demo/video_demo_1.gif)\n\n![Demo2](demo/video_demo_2.gif)\n\n## Installation (Optional)\nIf Docker is available on your machine, you do not need to install the environment yourself: a pre-built Docker image is available on [Docker Hub](https://hub.docker.com/r/harbory/polyphonicformer). To build the Docker image yourself, run the following command in `scripts/docker_env`:\n```commandline\ndocker build -t polyphonicformer:release . 
--network=host\n```\nIf you prefer to use conda, refer to the Dockerfile for environment details.\n\n## Dataset Preparation\nYou can download the Cityscapes-DVPS dataset [here](https://huggingface.co/HarborYuan/PolyphonicFormer/resolve/main/cityscapes-dvps.zip) and the SemKITTI-DVPS dataset [here](https://huggingface.co/HarborYuan/PolyphonicFormer/resolve/main/semkitti-dvps.zip). Assuming your dataset path is `DATALOC`, extract the zip files and make sure the folder is organized as follows:\n```\nDATALOC\n├── cityscapes-dvps\n│   ├── video_sequence\n│   │   ├── train\n│   │   │   ├── 000000_000000_munster_000105_000004_leftImg8bit.png\n│   │   │   ├── 000000_000000_munster_000105_000004_gtFine_instanceTrainIds.png\n│   │   │   ├── 000000_000000_munster_000105_000004_depth.png\n│   │   │   ├── ...\n│   │   ├── val\n│   │   │   ├── ...\n├── semkitti-dvps\n│   ├── video_sequence\n│   │   ├── train\n│   │   │   ├── 000000_000000_leftImg8bit.png\n│   │   │   ├── 000000_000000_gtFine_class.png\n│   │   │   ├── 000000_000000_gtFine_instance.png\n│   │   │   ├── 000000_000000_depth_718.8560180664062.png\n│   │   │   ├── ...\n│   │   ├── val\n│   │   │   ├── ...\n```\nNote that the Cityscapes-DVPS and SemKITTI-DVPS datasets were created by the authors of [ViP-DeepLab](https://github.com/joe-siyuan-qiao/ViP-DeepLab).\n\n## Docker Container\nAfter preparing the datasets, create and enter a Docker container:\n```commandline\nDATALOC={/path/to/datafolder} LOGLOC={/path/to/logfolder} bash tools/docker.sh\n```\n`DATALOC` will be linked to `data` in the project folder, and `LOGLOC` will be linked to `/opt/logger`.\n\n## Getting Started\nLet's get the code 🏃‍♀️running.\n### Image training\n```commandline\nbash tools/dist_train.sh configs/polyphonic_image/poly_r50_cityscapes_2x.py 8 --seed 0 --work-dir /opt/logger/exp001\n```\n### Image testing\n```commandline\nbash tools/dist_test.sh configs/polyphonic_image/poly_r50_cityscapes_2x.py 
https://huggingface.co/HarborYuan/PolyphonicFormer/resolve/main/polyphonic_r50_image.pth 8\n```\n\n### Video training\n```commandline\nbash tools/dist_train.sh configs/polyphonic_video/poly_r50_cityscapes_1x.py 8 --seed 0 --work-dir /opt/logger/vid001 --no-validate\n```\n\n### Video testing\n```commandline\nPYTHONPATH=. python tools/test_video.py configs/polyphonic_video/poly_r50_cityscapes_1x.py https://huggingface.co/HarborYuan/PolyphonicFormer/resolve/main/polyphonic_r50_video.pth --eval-video DVPQ --video-dir ./tmp\n```\nTo evaluate your own trained models, replace the online checkpoint with the path to your local checkpoint. For example, for video testing:\n```commandline\nPYTHONPATH=. python tools/test_video.py configs/polyphonic_video/poly_r50_cityscapes_1x.py /path/to/checkpoint.pth --eval-video DVPQ --video-dir ./tmp\n```\n\n## Acknowledgements\nThe image segmentation model is based on [K-Net](https://github.com/ZwwWayne/K-Net). The datasets are extracted from [ViP-DeepLab](https://github.com/joe-siyuan-qiao/ViP-DeepLab).\nPlease cite them if you find them useful.\n```bibtex\n@article{zhang2021k,\n  title={K-Net: Towards Unified Image Segmentation},\n  author={Zhang, Wenwei and Pang, Jiangmiao and Chen, Kai and Loy, Chen Change},\n  journal={NeurIPS},\n  year={2021}\n}\n@inproceedings{qiao2021vip,\n  title={ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation},\n  author={Qiao, Siyuan and Zhu, Yukun and Adam, Hartwig and Yuille, Alan and Chen, Liang-Chieh},\n  booktitle={CVPR},\n  year={2021}\n}\n```\n\n## Citation\nIf you find this code useful in your research, please consider citing PolyphonicFormer:\n```bibtex\n@inproceedings{yuan2022polyphonicformer,\n  title={Polyphonicformer: Unified Query Learning for Depth-aware Video Panoptic Segmentation},\n  author={Yuan, Haobo and Li, Xiangtai and Yang, Yibo and Cheng, Guangliang and Zhang, Jing and Tong, Yunhai and Zhang, Lefei and Tao, 
Dacheng},\n  booktitle={ECCV},\n  year={2022},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharboryuan%2Fpolyphonicformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharboryuan%2Fpolyphonicformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharboryuan%2Fpolyphonicformer/lists"}