{"id":15038332,"url":"https://github.com/foundationvision/glee","last_synced_at":"2025-05-15T02:10:37.350Z","repository":{"id":212599352,"uuid":"731826187","full_name":"FoundationVision/GLEE","owner":"FoundationVision","description":"[CVPR2024 Highlight]GLEE: General Object Foundation Model for Images and Videos at Scale","archived":false,"fork":false,"pushed_at":"2024-10-21T06:17:43.000Z","size":23408,"stargazers_count":1115,"open_issues_count":44,"forks_count":69,"subscribers_count":35,"default_branch":"main","last_synced_at":"2025-04-14T01:51:55.956Z","etag":null,"topics":["foundation-model","interactive-segmentation","object-detection","open-vocabulary-detection","open-vocabulary-segmentation","open-vocabulary-video-segmentation","open-world","referring-expression-comprehension","referring-expression-segmentation","referring-video-object-segmentation","segment-anything","tracking","video-instance-segmentation","video-object-segmentation","zero-shot-object-detection"],"latest_commit_sha":null,"homepage":"https://glee-vision.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FoundationVision.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-15T01:12:36.000Z","updated_at":"2025-04-12T04:54:52.000Z","dependencies_parsed_at":"2024-01-01T08:13:34.535Z","dependency_job_id":"8f385115-c4f8-4332-b1d3-04e616f3796c","html_url":"https://github.com/FoundationVision/GLEE","commit_stats":{"total_commits":19,"total_committers":2,"mean_commits":9.5,"dds":"0.42105263157894735","last_synced_commit":"f36a49e88c8f02e19cdc3b3f7563a7dd80fc7ffe"},"previous_names":["foundationvision/glee"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGLEE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGLEE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGLEE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGLEE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FoundationVision","download_url":"https://codeload.github.com/FoundationVision/GLEE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254259387,"owners_count":22040821,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["foundation-model","interactive-segmentation","object-detection","open-vocabulary-detection","open-vocabulary-segmentation","open-vocabulary-video-segmentation","open-world","referring-expression-comprehension","referring-expression-segmentation","referring-video-object-segmentation","segment-anything","tracking","video-instance-segmentation","video-object-segmentation","zero-shot-object-detection"],"created_at":"2024-09-24T20:38:03.409Z","updated_at":"2025-05-15T02:10:32.342Z","avatar_url":"https://github.com/FoundationVision.png","language":"Python","readme":"\n# GLEE: General Object Foundation Model for Images and Videos at Scale\n\n\u003e #### Junfeng Wu\\*, Yi Jiang\\*,  Qihao Liu, Zehuan Yuan, Xiang Bai\u003csup\u003e\u0026dagger;\u003c/sup\u003e,and Song Bai\u003csup\u003e\u0026dagger;\u003c/sup\u003e\n\u003e\n\u003e \\* Equal Contribution, \u003csup\u003e\u0026dagger;\u003c/sup\u003eCorrespondence\n\n\\[[Project Page](https://glee-vision.github.io/)\\]  \\[[Paper](https://arxiv.org/abs/2312.09158)\\]    \\[[HuggingFace Demo](https://huggingface.co/spaces/Junfeng5/GLEE_demo)\\]   \\[[Video Demo](https://youtu.be/PSVhfTPx0GQ)\\]  \n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/long-tail-video-object-segmentation-on-burst-1)](https://paperswithcode.com/sota/long-tail-video-object-segmentation-on-burst-1?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/video-instance-segmentation-on-ovis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-ovis-1?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-video-object-segmentation-on-refer)](https://paperswithcode.com/sota/referring-video-object-segmentation-on-refer?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-segmentation-on-refer-1)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refer-1?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/multi-object-tracking-on-tao)](https://paperswithcode.com/sota/multi-object-tracking-on-tao?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/open-world-instance-segmentation-on-uvo)](https://paperswithcode.com/sota/open-world-instance-segmentation-on-uvo?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-segmentation-on-refcoco)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refcoco?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-segmentation-on-refcocog)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refcocog?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/video-instance-segmentation-on-youtube-vis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-1?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/object-detection-on-lvis-v1-0-val)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-val?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/instance-segmentation-on-lvis-v1-0-val)](https://paperswithcode.com/sota/instance-segmentation-on-lvis-v1-0-val?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-comprehension-on-refcoco)](https://paperswithcode.com/sota/referring-expression-comprehension-on-refcoco?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-segmentation-on-refcoco-3)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refcoco-3?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/instance-segmentation-on-coco-minival)](https://paperswithcode.com/sota/instance-segmentation-on-coco-minival?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-comprehension-on)](https://paperswithcode.com/sota/referring-expression-comprehension-on?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/instance-segmentation-on-coco)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=general-object-foundation-model-for-images)[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-comprehension-on-refcoco-1)](https://paperswithcode.com/sota/referring-expression-comprehension-on-refcoco-1?p=general-object-foundation-model-for-images)\n\n\n\n\n![data_demo](assets/images/glee_func.gif)\n\n## Highlight:\n\n- GLEE is accepted by **CVPR2024** as **Highlight**!\n- GLEE is a general object foundation model jointly trained on over **ten million images** from various benchmarks with diverse levels of supervision.\n- GLEE is capable of addressing **a wide range of object-centric tasks** simultaneously while maintaining **SOTA** performance.\n-  GLEE demonstrates remarkable versatility and robust **zero-shot transferability** across a spectrum of object-level image and video tasks, and able to **serve as a foundational component** for enhancing other architectures or models.\n\n\n\nWe will release the following contents for **GLEE**:exclamation:\n\n- [x] Demo Code\n\n- [x] Model Zoo\n\n- [x] Comprehensive User Guide\n\n- [x] Training Code and Scripts\n\n- [ ] Detailed Evaluation Code and Scripts\n\n- [ ] Tutorial for Zero-shot Testing or Fine-tuning GLEE on New Datasets\n\n  \n\n\n\n## Getting started\n\n1. Installation: Please refer to [INSTALL.md](assets/INSTALL.md) for more details.\n2. Data preparation: Please refer to [DATA.md](assets/DATA.md) for more details.\n3. Training: Please refer to [TRAIN.md](assets/TRAIN.md) for more details.\n4. Testing: Please refer to [TEST.md](assets/TEST.md) for more details. \n5. Model zoo: Please refer to [MODEL_ZOO.md](assets/MODEL_ZOO.md) for more details.\n\n\n\n## Run the demo APP\n\nTry our online demo app on \\[[HuggingFace Demo](https://huggingface.co/spaces/Junfeng5/GLEE_demo)\\] or use it locally:\n\n```bash\ngit clone https://github.com/FoundationVision/GLEE\n# support CPU and GPU running\npython app.py\n```\n\n\n\n# Introduction \n\n\n\nGLEE has been trained on over ten million images from 16 datasets, fully harnessing both existing annotated data and cost-effective automatically labeled data to construct a diverse training set. This extensive training regime endows GLEE with formidable generalization capabilities. \n\n\n\n![data_demo](assets/images/data_demo.png)\n\n\n\nGLEE consists of an image encoder, a text encoder, a visual prompter, and an object decoder, as illustrated in Figure. The text encoder processes arbitrary descriptions related to the task, including **1) object category list 2）object names in any form 3）captions about objects 4）referring expressions**. The visual prompter encodes user inputs such as **1) points 2) bounding boxes 3) scribbles** during interactive segmentation into corresponding visual representations of target objects. Then they are integrated into a detector for extracting objects from images according to textual and visual input.\n\n![pipeline](assets/images/pipeline.png)\n\n\n\nBased on the above designs, GLEE can be used to seamlessly unify a wide range of object perception tasks in images and videos, including object detection, instance segmentation, grounding, multi-target tracking (MOT), video instance segmentation (VIS), video object segmentation (VOS), interactive segmentation and tracking, and supports **open-world/large-vocabulary image and video detection and segmentation** tasks. \n\n\n\n# Results\n\n## Image-level tasks\n\n![imagetask](assets/images/imagetask.png)\n\n![odinw](assets/images/odinw13zero.png)\n\n## Video-level tasks\n\n![videotask](assets/images/videotask.png)\n\n![visvosrvos](assets/images/visvosrvos.png)`\n\n\n\n# Citing GLEE\n\n```\n@misc{wu2023GLEE,\n  author= {Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai},\n  title = {General Object Foundation Model for Images and Videos at Scale},\n  year={2023},\n  eprint={2312.09158},\n  archivePrefix={arXiv}\n}\n```\n\n## Acknowledgments\n\n- Thanks [UNINEXT](https://github.com/MasterBin-IIAU/UNINEXT) for the implementation of multi-dataset training and data processing.\n\n- Thanks [VNext](https://github.com/wjf5203/VNext) for providing experience of Video Instance Segmentation (VIS).\n\n- Thanks [SEEM](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once) for providing the implementation of the visual prompter.\n\n- Thanks [MaskDINO](https://github.com/IDEA-Research/MaskDINO) for providing a powerful detector and segmenter.\n\n  \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoundationvision%2Fglee","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffoundationvision%2Fglee","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoundationvision%2Fglee/lists"}