{"id":15034363,"url":"https://github.com/hkchengrex/tracking-anything-with-deva","last_synced_at":"2025-04-08T08:15:50.644Z","repository":{"id":192499321,"uuid":"679825852","full_name":"hkchengrex/Tracking-Anything-with-DEVA","owner":"hkchengrex","description":"[ICCV 2023] Tracking Anything with Decoupled Video Segmentation","archived":false,"fork":false,"pushed_at":"2024-08-01T17:29:22.000Z","size":1101,"stargazers_count":1358,"open_issues_count":7,"forks_count":130,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-04-08T08:15:40.051Z","etag":null,"topics":["deep-learning","iccv2023","object-tracking","open-vocabulary-segmentation","open-vocabulary-video-segmentation","open-world-video-segmentation","video-editing","video-object-segmentation","video-segmentation"],"latest_commit_sha":null,"homepage":"https://hkchengrex.com/Tracking-Anything-with-DEVA/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hkchengrex.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-17T17:52:59.000Z","updated_at":"2025-04-07T08:42:52.000Z","dependencies_parsed_at":"2023-09-22T04:02:41.957Z","dependency_job_id":"b869f829-f446-481d-8050-92a56d54bec8","html_url":"https://github.com/hkchengrex/Tracking-Anything-with-DEVA","commit_stats":null,"previous_names":["hkchengrex/tracking-anything-with-deva"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkchengrex%2FTracking-Anything-with-DEVA","tags_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/hkchengrex%2FTracking-Anything-with-DEVA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkchengrex%2FTracking-Anything-with-DEVA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkchengrex%2FTracking-Anything-with-DEVA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hkchengrex","download_url":"https://codeload.github.com/hkchengrex/Tracking-Anything-with-DEVA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247801175,"owners_count":20998339,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","iccv2023","object-tracking","open-vocabulary-segmentation","open-vocabulary-video-segmentation","open-world-video-segmentation","video-editing","video-object-segmentation","video-segmentation"],"created_at":"2024-09-24T20:24:45.127Z","updated_at":"2025-04-08T08:15:50.577Z","avatar_url":"https://github.com/hkchengrex.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DEVA: Tracking Anything with Decoupled Video Segmentation\n\n![titlecard](https://imgur.com/lw15BGH.png)\n\n[Ho Kei Cheng](https://hkchengrex.github.io/), [Seoung Wug Oh](https://sites.google.com/view/seoungwugoh/), [Brian Price](https://www.brianpricephd.com/), [Alexander Schwing](https://www.alexander-schwing.de/), [Joon-Young Lee](https://joonyoung-cv.github.io/)\n\nUniversity of Illinois Urbana-Champaign and Adobe\n\nICCV 
2023\n\n[[arXiv]](https://arxiv.org/abs/2309.03903) [[PDF]](https://arxiv.org/pdf/2309.03903.pdf) [[Project Page]](https://hkchengrex.github.io/Tracking-Anything-with-DEVA/) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OsyNVoV_7ETD1zIE8UWxL3NXxu12m_YZ?usp=sharing)\n\n## Highlights\n1. Provides long-term, open-vocabulary video segmentation with text prompts out of the box.\n2. Fairly easy to **integrate your own image model**! Wouldn't you or your reviewers be interested in seeing examples where your image model also works well on videos :smirk:? No finetuning is needed!\n\n***Note (Mar 6 2024):*** We have fixed a major bug (introduced in the last update) that prevented the deletion of unmatched segments in text/eval_with_detections modes. This should greatly reduce the number of accumulated noisy detections/false positives, especially for long videos. See [#64](https://github.com/hkchengrex/Tracking-Anything-with-DEVA/issues/64).\n\n***Note (Sep 12 2023):*** We have improved automatic video segmentation by not querying the points in segmented regions. We correspondingly increased the number of query points per side to 64 and deprecated the \"engulf\" mode. The old code can be found in the \"legacy_engulf\" branch. The new code should run a lot faster and capture smaller objects. The text-prompted mode is still recommended for better results.\n\n***Note (Sep 11 2023):*** We have removed the \"pluralize\" option because it sometimes behaves unpredictably with GroundingDINO. 
If needed, please pluralize the prompt yourself.\n\n## Abstract\n\nWe develop a decoupled video segmentation approach (**DEVA**), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.\nDue to this design, we only need an image-level model for the target task and a universal temporal propagation model which is trained once and generalizes across tasks.\nTo effectively combine these two modules, we propose a (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation.\nWe show that this decoupled formulation compares favorably to end-to-end approaches in several tasks, most notably in large-vocabulary video panoptic segmentation and open-world video segmentation.\n\n## Demo Videos\n\n### Demo with Grounded Segment Anything (text prompt: \"guinea pigs\" and \"chicken\"):\n\nhttps://github.com/hkchengrex/Tracking-Anything-with-DEVA/assets/7107196/457a9a6a-86c3-4c5a-a3cc-25199427cd11\n\nSource: https://www.youtube.com/watch?v=FM9SemMfknA\n\n### Demo with Grounded Segment Anything (text prompt: \"pigs\"):\n\nhttps://github.com/hkchengrex/Tracking-Anything-with-DEVA/assets/7107196/9a6dbcd1-2c84-45c8-ac0a-4ad31169881f\n\nSource: https://youtu.be/FbK3SL97zf8\n\n### Demo with Grounded Segment Anything (text prompt: \"capybara\"):\n\nhttps://github.com/hkchengrex/Tracking-Anything-with-DEVA/assets/7107196/2ac5acc2-d160-49be-a013-68ad1d4074c5\n\nSource: https://youtu.be/couz1CrlTdQ\n\n### Demo with Segment Anything (automatic points-in-grid prompting); the original video follows the DEVA result overlaid on the video:\n\nhttps://github.com/hkchengrex/Tracking-Anything-with-DEVA/assets/7107196/ac6ab425-2f49-4438-bcd4-16e4ccfb0d98\n\nSource: DAVIS 2017 validation set \"soapbox\"\n\n### Demo with Segment Anything on an out-of-domain example; the original video follows the DEVA result overlaid on the 
video:\n\nhttps://github.com/hkchengrex/Tracking-Anything-with-DEVA/assets/7107196/48542bcd-113c-4454-b512-030df26def08\n\nSource: https://youtu.be/FQQaSyH9hZI\n\n## Installation\n\nTested on Ubuntu only. For installation on Windows WSL2, refer to https://github.com/hkchengrex/Tracking-Anything-with-DEVA/issues/20 (thanks @21pl).\n\n**Prerequisites:**\n- Python 3.9+\n- PyTorch 1.12+ and corresponding torchvision\n\n**Clone our repository:**\n```bash\ngit clone https://github.com/hkchengrex/Tracking-Anything-with-DEVA.git\n```\n\n**Install with pip:**\n```bash\ncd Tracking-Anything-with-DEVA\npip install -e .\n```\n(If you encounter the `File \"setup.py\" not found` error, upgrade your pip with `pip install --upgrade pip`)\n\n**Download the pretrained models:**\n```bash\nbash scripts/download_models.sh\n```\n\n**Required for the text-prompted/automatic demo:**\n\nInstall [our fork of Grounded-Segment-Anything](https://github.com/hkchengrex/Grounded-Segment-Anything). Follow its instructions.\n\nGrounding DINO installation might fail silently.\nTo check, try `python -c \"from groundingdino.util.inference import Model as GroundingDINOModel\"`.\nIf you get a warning about running in CPU-only mode, make sure `CUDA_HOME` is set during Grounding DINO installation.\n\n**(Optional) For fast integer program solving in the semi-online setting:**\n\nGet a [Gurobi](https://www.gurobi.com/) license, which is free for academic use.\nIf a license is not found, we fall back to [PuLP](https://github.com/coin-or/pulp), which is slower and has not been rigorously tested by us. All experiments were conducted with Gurobi.\n\n\n## Quick Start\n\n[DEMO.md](docs/DEMO.md) contains more details on the input arguments and tips on speeding up inference.\nYou can always look at `deva/inference/eval_args.py` and `deva/ext/ext_eval_args.py` for a full list of arguments.\n\n**With Gradio:**\n```bash\npython demo/demo_gradio.py\n```\nThen visit the link printed in the terminal. 
If executing on a remote server, try [port forwarding](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot).\n\nWe have prepared an example in `example/vipseg/images/12_1mWNahzcsAc` (a clip from the VIPSeg dataset).\nThe following two scripts segment the example clip using either Grounded Segment Anything with text prompts or SAM with automatic (points in grid) prompting.\n\n**Script (text-prompted):**\n```bash\npython demo/demo_with_text.py --chunk_size 4 \\\n--img_path ./example/vipseg/images/12_1mWNahzcsAc \\\n--amp --temporal_setting semionline \\\n--size 480 \\\n--output ./example/output --prompt person.hat.horse\n```\n\nWe support different SAM variants in **text-prompted mode**; by default, we use the original SAM. For **higher-quality** mask prediction, specify `--sam_variant sam_hq`. For **more efficient** inference, specify `--sam_variant sam_hq_light` or `--sam_variant mobile`.\n\n**Script (automatic):**\n```bash\npython demo/demo_automatic.py --chunk_size 4 \\\n--img_path ./example/vipseg/images/12_1mWNahzcsAc \\\n--amp --temporal_setting semionline \\\n--size 480 \\\n--output ./example/output\n```\n\n## Training and Evaluation\n\n1. [Running DEVA with your own detection model.](docs/CUSTOM.md)\n2. [Running DEVA with detections to reproduce the benchmark results.](docs/EVALUATION.md)\n3. [Training the DEVA model.](docs/TRAINING.md)\n\n## Limitations\n\n- On closed-set data, DEVA most likely does not work as well as end-to-end approaches. Joint training is (for now) still a better idea when you have enough target data.\n- Positive detections are amplified temporally due to propagation. Having a detector with a lower false positive rate (i.e., a higher threshold) helps.\n- If objects enter and leave the scene all the time (e.g., in driving scenes), we keep a lot of objects in the memory bank, which unfortunately increases the false positive rate. 
Decreasing `max_missed_detection_count` might help since we delete objects from memory more eagerly.\n\n\u003cpicture\u003e\n  \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://imgur.com/aouI1WU.png\"\u003e\n  \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"https://imgur.com/aCbrA9S.png\"\u003e\n  \u003cimg alt=\"separator\" src=\"https://imgur.com/aCbrA9S.png\"\u003e\n\u003c/picture\u003e\n\n\n## Citation\n\n```bibtex\n@inproceedings{cheng2023tracking,\n  title={Tracking Anything with Decoupled Video Segmentation},\n  author={Cheng, Ho Kei and Oh, Seoung Wug and Price, Brian and Schwing, Alexander and Lee, Joon-Young},\n  booktitle={ICCV},\n  year={2023}\n}\n```\n\n## References\n\nThe demo would not be possible without :heart: from the community:\n\nGrounded Segment Anything: https://github.com/IDEA-Research/Grounded-Segment-Anything\n\nSegment Anything: https://github.com/facebookresearch/segment-anything\n\nXMem: https://github.com/hkchengrex/XMem\n\nTitle card generated with OpenPano: https://github.com/ppwwyyxx/OpenPano\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhkchengrex%2Ftracking-anything-with-deva","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhkchengrex%2Ftracking-anything-with-deva","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhkchengrex%2Ftracking-anything-with-deva/lists"}