https://github.com/FoundationVision/GLEE

[CVPR2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale

foundation-model interactive-segmentation object-detection open-vocabulary-detection open-vocabulary-segmentation open-vocabulary-video-segmentation open-world referring-expression-comprehension referring-expression-segmentation referring-video-object-segmentation segment-anything tracking video-instance-segmentation video-object-segmentation zero-shot-object-detection


Host: GitHub
URL: https://github.com/FoundationVision/GLEE
Owner: FoundationVision
License: mit
Created: 2023-12-15T01:12:36.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-10-21T06:17:43.000Z (almost 2 years ago)
Last Synced: 2024-11-04T21:41:14.475Z (over 1 year ago)
Topics: foundation-model, interactive-segmentation, object-detection, open-vocabulary-detection, open-vocabulary-segmentation, open-vocabulary-video-segmentation, open-world, referring-expression-comprehension, referring-expression-segmentation, referring-video-object-segmentation, segment-anything, tracking, video-instance-segmentation, video-object-segmentation, zero-shot-object-detection
Language: Python
Homepage: https://glee-vision.github.io/
Size: 22.3 MB
Stars: 1,074
Watchers: 48
Forks: 83
Open Issues: 36
Metadata Files:
- Readme: README.md
- License: LICENSE

# GLEE: General Object Foundation Model for Images and Videos at Scale

> #### Junfeng Wu\*, Yi Jiang\*, Qihao Liu, Zehuan Yuan, Xiang Bai^†, and Song Bai^†

>

> \* Equal Contribution, ^† Correspondence

\[[Project Page](https://glee-vision.github.io/)\]  \[[Paper](https://arxiv.org/abs/2312.09158)\]    \[[HuggingFace Demo](https://huggingface.co/spaces/Junfeng5/GLEE_demo)\]   \[[Video Demo](https://youtu.be/PSVhfTPx0GQ)\]  

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/long-tail-video-object-segmentation-on-burst-1)](https://paperswithcode.com/sota/long-tail-video-object-segmentation-on-burst-1?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/video-instance-segmentation-on-ovis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-ovis-1?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-video-object-segmentation-on-refer)](https://paperswithcode.com/sota/referring-video-object-segmentation-on-refer?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-segmentation-on-refer-1)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refer-1?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/multi-object-tracking-on-tao)](https://paperswithcode.com/sota/multi-object-tracking-on-tao?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/open-world-instance-segmentation-on-uvo)](https://paperswithcode.com/sota/open-world-instance-segmentation-on-uvo?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-segmentation-on-refcoco)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refcoco?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-segmentation-on-refcocog)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refcocog?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/video-instance-segmentation-on-youtube-vis-1)](https://paperswithcode.com/sota/video-instance-segmentation-on-youtube-vis-1?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/object-detection-on-lvis-v1-0-val)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-val?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/instance-segmentation-on-lvis-v1-0-val)](https://paperswithcode.com/sota/instance-segmentation-on-lvis-v1-0-val?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-comprehension-on-refcoco)](https://paperswithcode.com/sota/referring-expression-comprehension-on-refcoco?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-segmentation-on-refcoco-3)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refcoco-3?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/instance-segmentation-on-coco-minival)](https://paperswithcode.com/sota/instance-segmentation-on-coco-minival?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-comprehension-on)](https://paperswithcode.com/sota/referring-expression-comprehension-on?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/instance-segmentation-on-coco)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=general-object-foundation-model-for-images)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/general-object-foundation-model-for-images/referring-expression-comprehension-on-refcoco-1)](https://paperswithcode.com/sota/referring-expression-comprehension-on-refcoco-1?p=general-object-foundation-model-for-images)

![data_demo](assets/images/glee_func.gif)

## Highlight:

- GLEE has been accepted to **CVPR2024** as a **Highlight**!

- GLEE is a general object foundation model jointly trained on over **ten million images** from various benchmarks with diverse levels of supervision.

- GLEE is capable of addressing **a wide range of object-centric tasks** simultaneously while maintaining **SOTA** performance.

- GLEE demonstrates remarkable versatility and robust **zero-shot transferability** across a spectrum of object-level image and video tasks, and is able to **serve as a foundational component** for enhancing other architectures or models.

We will release the following contents for **GLEE**:exclamation:

- [x] Demo Code

- [x] Model Zoo

- [x] Comprehensive User Guide

- [x] Training Code and Scripts

- [ ] Detailed Evaluation Code and Scripts

- [ ] Tutorial for Zero-shot Testing or Fine-tuning GLEE on New Datasets

  

## Getting started

1. Installation: Please refer to [INSTALL.md](assets/INSTALL.md) for more details.

2. Data preparation: Please refer to [DATA.md](assets/DATA.md) for more details.

3. Training: Please refer to [TRAIN.md](assets/TRAIN.md) for more details.

4. Testing: Please refer to [TEST.md](assets/TEST.md) for more details. 

5. Model zoo: Please refer to [MODEL_ZOO.md](assets/MODEL_ZOO.md) for more details.

## Run the demo APP

Try our online demo app on \[[HuggingFace Demo](https://huggingface.co/spaces/Junfeng5/GLEE_demo)\] or use it locally:

```bash
git clone https://github.com/FoundationVision/GLEE
cd GLEE
# Install the dependencies first (see assets/INSTALL.md).
# The demo runs on either CPU or GPU.
python app.py
```

# Introduction 

GLEE has been trained on over ten million images from 16 datasets, fully harnessing both existing annotated data and cost-effective automatically labeled data to construct a diverse training set. This extensive training regime endows GLEE with formidable generalization capabilities. 

![data_demo](assets/images/data_demo.png)

GLEE consists of an image encoder, a text encoder, a visual prompter, and an object decoder, as illustrated in the figure below. The text encoder processes arbitrary descriptions related to the task, including **1) object category lists 2) object names in any form 3) captions about objects 4) referring expressions**. The visual prompter encodes user inputs given during interactive segmentation, such as **1) points 2) bounding boxes 3) scribbles**, into corresponding visual representations of the target objects. These text and visual representations are then fed into a detector, which extracts objects from images according to the textual and visual input.

![pipeline](assets/images/pipeline.png)
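
The prompt types listed above can be pictured as plain data structures. The following is only an illustrative sketch: `TextPrompt`, `VisualPrompt`, and their fields are hypothetical names invented for this example and are not part of the GLEE codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates

# Hypothetical labels for the four text inputs and three visual inputs
# named in the paragraph above.
TEXT_KINDS = {"categories", "names", "caption", "expression"}
VISUAL_KINDS = {"point", "box", "scribble"}


@dataclass
class TextPrompt:
    """Hypothetical container for input to the text encoder."""
    kind: str          # one of TEXT_KINDS
    texts: List[str]   # e.g. ["person", "dog"] or ["the man in red"]

    def __post_init__(self) -> None:
        if self.kind not in TEXT_KINDS:
            raise ValueError(f"unknown text prompt kind: {self.kind!r}")


@dataclass
class VisualPrompt:
    """Hypothetical container for input to the visual prompter."""
    kind: str             # one of VISUAL_KINDS
    coords: List[Point]   # 1 point, 2 box corners, or a scribble polyline

    def __post_init__(self) -> None:
        if self.kind not in VISUAL_KINDS:
            raise ValueError(f"unknown visual prompt kind: {self.kind!r}")
        if self.kind == "point" and len(self.coords) != 1:
            raise ValueError("a point prompt needs exactly one coordinate")
        if self.kind == "box" and len(self.coords) != 2:
            raise ValueError("a box prompt needs two corner coordinates")
        if self.kind == "scribble" and len(self.coords) < 2:
            raise ValueError("a scribble prompt needs at least two points")


# Example: a category list for detection and a box for interactive segmentation.
detection_prompt = TextPrompt("categories", ["person", "dog", "bicycle"])
box_prompt = VisualPrompt("box", [(10.0, 20.0), (110.0, 220.0)])
```

Keeping text and visual prompts as separate types mirrors the separate text encoder and visual prompter in the description; how the real model batches or encodes these inputs is not covered by this sketch.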

Based on the above designs, GLEE can seamlessly unify a wide range of object perception tasks in images and videos, including object detection, instance segmentation, grounding, multi-object tracking (MOT), video instance segmentation (VIS), video object segmentation (VOS), and interactive segmentation and tracking. It also supports **open-world/large-vocabulary image and video detection and segmentation** tasks.
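
To make the open-vocabulary idea concrete, here is a toy sketch. It assumes that each detected object is labeled by comparing its embedding with text embeddings of candidate category names using cosine similarity. This README does not spell out GLEE's matching rule, so the approach, the function names, the threshold, and the three-dimensional vectors are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_objects(
    object_embs: Sequence[Sequence[float]],
    text_embs: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, chosen only for this toy example
) -> List[str]:
    """Give each object the best-matching category name, or "unknown"."""
    if not text_embs:
        raise ValueError("at least one category embedding is required")
    labels = []
    for emb in object_embs:
        best_name, best_score = max(
            ((name, cosine(emb, t)) for name, t in text_embs.items()),
            key=lambda pair: pair[1],
        )
        labels.append(best_name if best_score >= threshold else "unknown")
    return labels


# Toy vectors standing in for encoder outputs; new categories can be added
# at inference time simply by adding entries to this dictionary.
text_embs = {"dog": [1.0, 0.0, 0.0], "bicycle": [0.0, 1.0, 0.0]}
object_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
print(label_objects(object_embs, text_embs))  # ['dog', 'bicycle', 'unknown']
```

Because the vocabulary lives in the text embeddings rather than in a fixed classification layer, changing the category list does not require retraining in a scheme like this one.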

# Results

## Image-level tasks

![imagetask](assets/images/imagetask.png)

![odinw](assets/images/odinw13zero.png)

## Video-level tasks

![videotask](assets/images/videotask.png)

![visvosrvos](assets/images/visvosrvos.png)

# Citing GLEE

```
@misc{wu2023GLEE,
  author        = {Junfeng Wu and Yi Jiang and Qihao Liu and Zehuan Yuan and Xiang Bai and Song Bai},
  title         = {General Object Foundation Model for Images and Videos at Scale},
  year          = {2023},
  eprint        = {2312.09158},
  archivePrefix = {arXiv}
}
```

## Acknowledgments

- Thanks to [UNINEXT](https://github.com/MasterBin-IIAU/UNINEXT) for the implementation of multi-dataset training and data processing.

- Thanks to [VNext](https://github.com/wjf5203/VNext) for sharing experience with Video Instance Segmentation (VIS).

- Thanks to [SEEM](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once) for providing the implementation of the visual prompter.

- Thanks to [MaskDINO](https://github.com/IDEA-Research/MaskDINO) for providing a powerful detector and segmenter.
