https://github.com/foundationvision/generateu
[CVPR2024] Generative Region-Language Pretraining for Open-Ended Object Detection
mllm multimodality object-detection open-vocabulary open-vocabulary-detection open-world
- Host: GitHub
- URL: https://github.com/foundationvision/generateu
- Owner: FoundationVision
- License: mit
- Created: 2024-03-15T09:01:16.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-29T02:38:54.000Z (about 1 year ago)
- Last Synced: 2025-03-29T03:26:09.657Z (about 1 year ago)
- Topics: mllm, multimodality, object-detection, open-vocabulary, open-vocabulary-detection, open-world
- Language: Python
- Homepage:
- Size: 14.4 MB
- Stars: 165
- Watchers: 7
- Forks: 7
- Open Issues: 15
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Generative Region-Language Pretraining for Open-Ended Object Detection
Monash University
ByteDance Inc.
CVPR 2024
⭐ If GenerateU is helpful to your projects, please help star this repo. Thanks! 🤗
---
## Highlight
- GenerateU has been accepted to **CVPR 2024**.
- We introduce generative **open-ended object detection**, a more general and practical setting in which categorical information is not explicitly defined. This setting is especially meaningful for scenarios where users lack precise knowledge of object categories during inference.
- GenerateU achieves results comparable to the open-vocabulary object detection method GLIP, even though **the category names are not seen by GenerateU during inference**.
## Results
### Zero-shot domain transfer to LVIS

## Visualizations
#### 👨🏻‍🎨 Pseudo-label Examples

#### 🎨 Zero-shot LVIS

## Overview

## Dependencies and Installation
1. Clone Repo
```bash
git clone https://github.com/clin1223/GenerateU.git
# the install commands below are run from the repository root
cd GenerateU
```
2. Create Conda Environment and Install Dependencies
```bash
# create new anaconda env
conda create -n GenerateU python=3.8 -y
conda activate GenerateU
# install python dependencies
pip3 install -e . --user
pip3 install -r requirements.txt
# compile Deformable DETR
cd projects/DDETRS/ddetrs/models/deformable_detr/ops
bash make.sh
```
- CUDA >= 11.3
- PyTorch >= 1.10.0
- Torchvision >= 0.11.1
- Other required packages in `requirements.txt`
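
Before compiling the Deformable DETR ops, it can help to confirm that the active environment meets the minimum versions listed above. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum`, and `report` are hypothetical, and the check only reads `__version__` and `torch.version.cuda`.

```python
import importlib
import re

# Minimum versions taken from the requirement list above.
MINIMUMS = {"torch": "1.10.0", "torchvision": "0.11.1"}
MIN_CUDA = "11.3"


def parse_version(text):
    """Turn '1.13.1+cu117' into (1, 13, 1); local/build suffixes are ignored."""
    core = re.split(r"[+\-]", text, maxsplit=1)[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.10' and '1.10.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report():
    """Print one status line per package; missing packages are reported, not raised."""
    for name, minimum in MINIMUMS.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            print(f"{name}: not installed (need >= {minimum})")
            continue
        status = "ok" if meets_minimum(module.__version__, minimum) else "too old"
        print(f"{name} {module.__version__}: {status} (need >= {minimum})")
        if name == "torch":
            cuda = module.version.cuda  # None on CPU-only builds
            if cuda is None:
                print(f"CUDA: not available in this build (need >= {MIN_CUDA})")
            else:
                cuda_status = "ok" if meets_minimum(cuda, MIN_CUDA) else "too old"
                print(f"CUDA {cuda}: {cuda_status} (need >= {MIN_CUDA})")


if __name__ == "__main__":
    report()
```

The CUDA line reflects the toolkit version PyTorch was built against, which is the one that matters when compiling the custom ops.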
## Get Started
### Prepare pretrained models
Download our pretrained models from [here](https://huggingface.co/clin1223/GenerateU/tree/main) to the `weights` folder. For training, prepare the Swin-Tiny and Swin-Large backbone weights following the instructions in [tools/convert-pretrained-swin-model-to-d2.py](tools/convert-pretrained-swin-model-to-d2.py).
The directory structure should look like this:
```
weights
|- vg_swinT.pth
|- vg_swinL.pth
|- vg_grit5m_swinT.pth
|- vg_grit5m_swinL.pth
|- swin_tiny_patch4_window7_224.pkl
|- swin_large_patch4_window12_384_22k.pkl
```
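
A quick way to confirm the folder matches the layout above is to list which expected files are absent. This is a small sketch using only the standard library; `missing_files` is a hypothetical helper, and the file names are copied from the tree above.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_WEIGHTS = [
    "vg_swinT.pth",
    "vg_swinL.pth",
    "vg_grit5m_swinT.pth",
    "vg_grit5m_swinL.pth",
    "swin_tiny_patch4_window7_224.pkl",
    "swin_large_patch4_window12_384_22k.pkl",
]


def missing_files(root, names):
    """Return the entries of `names` that are not regular files under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("weights", EXPECTED_WEIGHTS)
    if missing:
        print("Missing from weights/:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All expected weight files are present.")
```

The `.pkl` backbone files are described above as a training prerequisite, so a report listing only those as missing may be acceptable for evaluation-only use.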
## Dataset preparation
### VG Dataset
- Download images from [VG official website](https://homes.cs.washington.edu/~ranjay/visualgenome/api.html)
- Download our pre-processed annotations:
[train_from_objects.json](https://huggingface.co/clin1223/GenerateU/tree/main)
### LVIS Dataset
- Download validation images from [COCO official website](https://cocodataset.org/#download)
- Download validation annotations same as [GLIP](https://github.com/microsoft/GLIP/blob/main/DATA.md):
[lvis_v1_minival.json](https://huggingface.co/clin1223/GenerateU/tree/main)
- Download LVIS category [text embedding](https://huggingface.co/clin1223/GenerateU/tree/main) for mapping
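
Because GenerateU produces free-form object names rather than choosing from a fixed list, evaluation on LVIS needs a step that maps each generated name to an LVIS category, which is what the text embeddings above are used for. The sketch below illustrates one common way to do such a mapping, nearest neighbour by cosine similarity. It is an assumption-laden illustration, not the repository's evaluation code: the function names and the 3-dimensional toy vectors are hypothetical, and a real pipeline would embed generated names with the same text encoder that produced the downloaded embedding file.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_to_category(name_embedding, category_embeddings):
    """Return (category, similarity) for the closest category embedding."""
    best = max(
        category_embeddings,
        key=lambda name: cosine(name_embedding, category_embeddings[name]),
    )
    return best, cosine(name_embedding, category_embeddings[best])


# Toy stand-ins for category text embeddings (hypothetical values).
CATEGORIES = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    generated = [0.8, 0.3, 0.1]  # stand-in for the embedding of a generated name
    category, score = map_to_category(generated, CATEGORIES)
    print(f"{category} ({score:.3f})")
```

With real embeddings the same logic is usually written as a single matrix product over normalized vectors, which scales to the full LVIS vocabulary.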
### (Optional) GrIT-20M Dataset
- Download images from [GrIT-20M official website](https://github.com/microsoft/unilm/tree/master/kosmos-2#download-data)
- Run evaluation on the GrIT images to generate pseudo labels.
The dataset structure should look like this:
~~~
datasets
|- vg
|  |- images/
|  |- train_from_objects.json
|- lvis
   |- val2017/
   |- lvis_v1_minival.json
   |- lvis_v1_clip_a+cname_ViT-H.npy
~~~
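
After downloading, a short sanity check on the annotation file can catch truncated or wrong downloads early. The sketch below assumes a COCO-style JSON with top-level `images`, `annotations`, and `categories` lists, which is the layout LVIS v1 annotations follow; the schema of `train_from_objects.json` is not stated here, so treat its use with that file as an assumption. `summarize_annotations` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_annotations(path):
    """Count images, annotations, and categories in a COCO-style annotation file."""
    with open(path, "r", encoding="utf-8") as handle:
        data = json.load(handle)
    annotations = data.get("annotations", [])
    # Number of distinct categories that actually appear in at least one annotation.
    per_category = Counter(ann["category_id"] for ann in annotations)
    return {
        "images": len(data.get("images", [])),
        "annotations": len(annotations),
        "categories": len(data.get("categories", [])),
        "categories_with_boxes": len(per_category),
    }


if __name__ == "__main__":
    summary = summarize_annotations("datasets/lvis/lvis_v1_minival.json")
    for key, value in summary.items():
        print(f"{key}: {value}")
```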
## Training
By default, we train GenerateU using 16 A100 GPUs.
You can also train on a single node, but this might prevent you from reproducing the results presented in the paper.
### Single-Node Training
A single node is enough for pretraining on VG.
On a single node with 8 GPUs, run
```
python3 launch.py --nn 1 --uni 1 \
--config-file projects/DDETRS/configs/vg_swinT.yaml OUTPUT_DIR outputs/${EXP_NAME}
```
### Multiple-Node Training
```bash
# On node 0, run
python3 launch.py --nn 2 --port <PORT> --worker_rank 0 --master_address <MASTER_ADDRESS> \
--uni 1 --config-file /path/to/config/name.yaml OUTPUT_DIR outputs/${EXP_NAME}
# On node 1, run
python3 launch.py --nn 2 --port <PORT> --worker_rank 1 --master_address <MASTER_ADDRESS> \
--uni 1 --config-file /path/to/config/name.yaml OUTPUT_DIR outputs/${EXP_NAME}
```
`<MASTER_ADDRESS>` should be the IP address of node 0. `<PORT>` must be the same on all nodes. If `<PORT>` is not specified, the program generates a random number as `<PORT>`.
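
The per-node commands differ only in the worker rank, so they can be generated rather than typed by hand. The helper below is a sketch under stated assumptions: `build_launch_command` is a hypothetical function, it only assembles the flags that appear in the commands above, and it requires an explicit port for multi-node runs because the port has to match on every node.

```python
import shlex


def build_launch_command(config_file, output_dir, num_nodes=1, worker_rank=0,
                         master_address=None, port=None):
    """Assemble the launch.py argument list for one node."""
    if not 0 <= worker_rank < num_nodes:
        raise ValueError("worker_rank must be in [0, num_nodes)")
    cmd = ["python3", "launch.py", "--nn", str(num_nodes)]
    if num_nodes > 1:
        if master_address is None or port is None:
            raise ValueError("multi-node runs need master_address and port")
        cmd += ["--port", str(port),
                "--worker_rank", str(worker_rank),
                "--master_address", master_address]
    cmd += ["--uni", "1", "--config-file", config_file, "OUTPUT_DIR", output_dir]
    return cmd


if __name__ == "__main__":
    # Example values only: the address, port, and experiment name are placeholders.
    for rank in range(2):
        command = build_launch_command(
            "projects/DDETRS/configs/vg_swinT.yaml", "outputs/my_exp",
            num_nodes=2, worker_rank=rank,
            master_address="10.0.0.1", port=29500,
        )
        print(f"# node {rank}")
        print(shlex.join(command))
```

Returning a list instead of a string keeps the result usable with `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.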
## Evaluation
To evaluate a trained or pretrained model, run
```shell
python3 launch.py --nn 1 --eval-only --uni 1 --config-file /path/to/config/name.yaml \
OUTPUT_DIR outputs/${EXP_NAME} MODEL.WEIGHTS /path/to/weight.pth
```
## Citation
If you find our repo useful for your research, please consider citing our paper:
```bibtex
@inproceedings{lin2024generateu,
title={Generative Region-Language Pretraining for Open-Ended Object Detection},
author={Lin, Chuang and Jiang, Yi and Qu, Lizhen and Yuan, Zehuan and Cai, Jianfei},
booktitle={Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}
```
## Contact
If you have any questions, please feel free to reach out to me at `chuang.lin@monash.edu`.
## Acknowledgement
This code is based on [UNINEXT](https://github.com/MasterBin-IIAU/UNINEXT/tree/master). Some code is adapted from [FlanT5](https://huggingface.co/docs/transformers/model_doc/flan-t5). Thanks for their awesome work.
Special thanks to [Bin Yan](https://github.com/MasterBin-IIAU) and [Junfeng Wu](https://github.com/wjf5203) for their valuable contributions.