{"id":18302991,"url":"https://github.com/foundationvision/generateu","last_synced_at":"2025-04-05T14:31:48.604Z","repository":{"id":227833678,"uuid":"772471743","full_name":"FoundationVision/GenerateU","owner":"FoundationVision","description":"[CVPR2024] Generative Region-Language Pretraining for Open-Ended Object Detection","archived":false,"fork":false,"pushed_at":"2025-03-29T02:38:54.000Z","size":15080,"stargazers_count":165,"open_issues_count":15,"forks_count":7,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-29T03:26:09.657Z","etag":null,"topics":["mllm","multimodality","object-detection","open-vocabulary","open-vocabulary-detection","open-world"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FoundationVision.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-15T09:01:16.000Z","updated_at":"2025-03-29T02:38:58.000Z","dependencies_parsed_at":"2024-03-15T11:44:42.372Z","dependency_job_id":"bfd34f0b-0e4e-4fc6-a14f-167187e9a584","html_url":"https://github.com/FoundationVision/GenerateU","commit_stats":null,"previous_names":["foundationvision/generateu"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGenerateU","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGenerateU/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGenerateU/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FoundationVision%2FGenerateU/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FoundationVision","download_url":"https://codeload.github.com/FoundationVision/GenerateU/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247352652,"owners_count":20925309,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["mllm","multimodality","object-detection","open-vocabulary","open-vocabulary-detection","open-world"],"created_at":"2024-11-05T15:23:33.890Z","updated_at":"2025-04-05T14:31:43.592Z","avatar_url":"https://github.com/FoundationVision.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003cdiv class=\"logo\"\u003e\n   \u003ca href=\"\"\u003e\n      \u003cimg src=\"assets/logo.png\"  width=\"180\"\u003e\n   \u003c/a\u003e\n\u003c/div\u003e\n\n\u003ch1\u003eGenerative Region-Language Pretraining for Open-Ended Object Detection\u003c/h1\u003e\n\n\u003cdiv\u003e\n    \u003ca href='https://clin1223.github.io/' target='_blank'\u003eChuang Lin\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://enjoyyi.github.io/' target='_blank'\u003eYi Jiang\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://research.monash.edu/en/persons/lizhen-qu' target='_blank'\u003eLizhen Qu\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://shallowyuan.github.io/' target='_blank'\u003eZehuan Yuan\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://jianfei-cai.github.io/' 
target='_blank'\u003eJianfei Cai\u003c/a\u003e\n\u003c/div\u003e\n\u003cdiv\u003e\n    Monash University \u0026emsp; \n    ByteDance Inc.\u0026emsp; \n\u003c/div\u003e\n\n\u003cdiv\u003e\n    \u003cstrong\u003eCVPR 2024\u003c/strong\u003e\n\u003c/div\u003e\n\n\u003cdiv\u003e\n    \u003ch4 align=\"center\"\u003e\n        \u003ca href=\"https://arxiv.org/\" target='_blank'\u003e\n        \u003cimg src=\"https://img.shields.io/badge/arXiv-2309.03897-b31b1b.svg\"\u003e\n        \u003c/a\u003e\n        \u003ca href=\"https://clin1223.github.io/\" target='_blank'\u003e\n        \u003cimg src=\"https://img.shields.io/badge/🐳-Project%20Page-blue\"\u003e\n        \u003c/a\u003e\n        \u003cimg src=\"https://visitor-badge.laobi.icu/badge?page_id=FoundationVision/GenerateU\"\u003e\n    \u003c/h4\u003e\n\u003c/div\u003e\n\n⭐ If GenerateU is helpful to your projects, please help star this repo. Thanks! 🤗\n\n\u003c!-- For more visual results, go checkout our \u003ca href=\"https://shangchenzhou.com/projects/ProPainter/\" target=\"_blank\"\u003eproject page\u003c/a\u003e --\u003e\n\n\n---\n\n\u003c/div\u003e\n\n\n\u003c!-- ## Update\n- **2023.11.09**: Integrated to :man_artist: [OpenXLab](https://openxlab.org.cn/apps). Try out online demo! [![OpenXLab](https://img.shields.io/badge/Demo-%F0%9F%91%A8%E2%80%8D%F0%9F%8E%A8%20OpenXLab-blue)](https://openxlab.org.cn/apps/detail/ShangchenZhou/ProPainter)\n- **2023.11.09**: Integrated to :hugs: [Hugging Face](https://huggingface.co/spaces). Try out online demo! [![Hugging Face](https://img.shields.io/badge/Demo-%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/sczhou/ProPainter)\n- **2023.09.24**: We remove the watermark removal demos officially to prevent the misuse of our work for unethical purposes.\n- **2023.09.21**: Add features for memory-efficient inference. Check our [GPU memory](https://github.com/sczhou/ProPainter#-memory-efficient-inference) requirements. 
🚀\n- **2023.09.07**: Our code and model are publicly available. 🐳\n- **2023.09.01**: This repo is created. --\u003e\n\n\n\u003c!-- ### TODO\n- [ ] Make a Colab demo.\n- [x] ~~Make a interactive Gradio demo.~~\n- [x] ~~Update features for memory-efficient inference.~~ --\u003e\n## Highlight\n- GenerateU has been accepted to **CVPR2024**.\n- We introduce generative **open-ended object detection**, which is a more general and practical setting where categorical information is not explicitly defined. Such a setting is especially meaningful for scenarios where users lack precise knowledge of object categories during inference.\n-  Our GenerateU achieves comparable results to the open-vocabulary object detection method GLIP, even though **the category names are not seen by GenerateU during inference**. \n\n\n## Results\n### Zero-shot domain transfer to LVIS\n![pseudo-label_examples ](assets/table1.png)\n\n\n## Visualizations\n\n#### 👨🏻‍🎨 Pseudo-label Examples \n![pseudo-label_examples ](assets/pl.png)\n\u003c!-- \u003ctable\u003e\n\u003ctr\u003e\n   \u003ctd\u003e \n      \u003cimg src=\"assets/pl.png\"\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e \n      \u003cimg src=\"assets/object_removal2.gif\"\u003e\n   \u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e --\u003e\n\n#### 🎨 Zero-shot LVIS\n![pseudo-label_examples ](assets/lvis.png)\n\u003c!-- \u003ctable\u003e\n\u003ctr\u003e\n   \u003ctd\u003e \n      \u003cimg src=\"assets/video_completion1.gif\"\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e \n      \u003cimg src=\"assets/video_completion2.gif\"\u003e\n   \u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n   \u003ctd\u003e \n      \u003cimg src=\"assets/video_completion3.gif\"\u003e\n   \u003c/td\u003e\n   \u003ctd\u003e \n      \u003cimg src=\"assets/video_completion4.gif\"\u003e\n   \u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e --\u003e\n\n\n\n## Overview\n![overall_structure](assets/overview.png)\n\n\n## Dependencies and Installation\n\n1. 
Clone Repo\n\n   ```bash\n   git clone https://github.com/clin1223/GenerateU.git\n   ```\n\n2. Create Conda Environment and Install Dependencies\n\n   ```bash\n   # create new anaconda env\n   conda create -n GenerateU python=3.8 -y\n   conda activate GenerateU\n\n   # install python dependencies\n   pip3 install -e . --user\n   pip3 install -r requirements.txt \n\n   # compile Deformable DETR\n   cd projects/DDETRS/ddetrs/models/deformable_detr/ops\n   bash make.sh\n   ```\n\n   - CUDA \u003e= 11.3\n   - PyTorch \u003e= 1.10.0\n   - Torchvision \u003e= 0.11.1\n   - Other required packages in `requirements.txt`\n\n## Get Started\n### Prepare pretrained models\nDownload our pretrained models from [here](https://huggingface.co/clin1223/GenerateU/tree/main) to the `weights` folder. For training, prepare the Swin-Tiny and Swin-Large backbone weights following the instructions in [tools/convert-pretrained-swin-model-to-d2.py](tools/convert-pretrained-swin-model-to-d2.py).\n\n\nThe directory structure should be arranged as:\n```\nweights\n   |- vg_swinT.pth\n   |- vg_swinL.pth\n   |- vg_grit5m_swinT.pth\n   |- vg_grit5m_swinL.pth\n   |- swin_tiny_patch4_window7_224.pkl\n   |- swin_large_patch4_window12_384_22k.pkl\n```\n\n## Dataset preparation\n\n### VG Dataset\n- Download images from [VG official website](https://homes.cs.washington.edu/~ranjay/visualgenome/api.html)\n- Download our pre-processed annotations: \n  [train_from_objects.json](https://huggingface.co/clin1223/GenerateU/tree/main) \n\n### LVIS Dataset\n- Download validation images from [COCO official website](https://cocodataset.org/#download)\n- Download the same validation annotations as [GLIP](https://github.com/microsoft/GLIP/blob/main/DATA.md):\n  [lvis_v1_minival.json](https://huggingface.co/clin1223/GenerateU/tree/main) \n- Download LVIS category [text embedding](https://huggingface.co/clin1223/GenerateU/tree/main) for mapping\n\n### (Optional) GrIT-20M Dataset\n- Download images from [GrIT-20M official 
website](https://github.com/microsoft/unilm/tree/master/kosmos-2#download-data)\n- Run evaluation on GrIT images to generate pseudo labels.\n\nThe dataset structure should look like:\n  ~~~\n  |-- datasets\n  `-- |-- vg\n      |-- |-- images/\n      |-- |-- train_from_objects.json\n  `-- |-- lvis\n      |-- |-- val2017/\n      |-- |-- lvis_v1_minival.json\n      |-- |-- lvis_v1_clip_a+cname_ViT-H.npy\n  ~~~\n\n## Training\nBy default, we train GenerateU using 16 A100 GPUs.\nYou can also train on a single node, but this might prevent you from reproducing the results presented in the paper.\n\n\n### Single-Node Training\nWhen pretraining with VG, a single node is enough.\nOn a single node with 8 GPUs, run \n```\npython3 launch.py --nn 1 --uni 1 \\\n--config-file projects/DDETRS/configs/vg_swinT.yaml OUTPUT_DIR outputs/${EXP_NAME}\n```\n\n### Multiple-Node Training\n``` bash\n# On node 0, run\npython3 launch.py --nn 2 --port \u003cPORT\u003e --worker_rank 0 --master_address \u003cMASTER_ADDRESS\u003e \\\n--uni 1 --config-file /path/to/config/name.yaml  OUTPUT_DIR outputs/${EXP_NAME}\n# On node 1, run\npython3 launch.py --nn 2 --port \u003cPORT\u003e --worker_rank 1 --master_address \u003cMASTER_ADDRESS\u003e \\\n--uni 1 --config-file /path/to/config/name.yaml OUTPUT_DIR outputs/${EXP_NAME}\n```\n\n`\u003cMASTER_ADDRESS\u003e` should be the IP address of node 0. `\u003cPORT\u003e` should be the same on all nodes. If `\u003cPORT\u003e` is not specified, the program will generate a random number as `\u003cPORT\u003e`.\n\n\n## Evaluation\n\nTo evaluate a trained or pretrained model, run\n\n```shell\npython3 launch.py --nn 1 --eval-only --uni 1 --config-file /path/to/config/name.yaml  \\\nOUTPUT_DIR outputs/${EXP_NAME}  MODEL.WEIGHTS /path/to/weight.pth\n```\n\n\u003c!-- ### 🏂 Quick test\nWe provide some examples in the [`inputs`](./inputs) folder. 
\nRun the following commands to try it out:\n```shell\n# The first example (object removal)\npython inference_propainter.py --video inputs/object_removal/bmx-trees --mask inputs/object_removal/bmx-trees_mask \n# The second example (video completion)\npython inference_propainter.py --video inputs/video_completion/running_car.mp4 --mask inputs/video_completion/mask_square.png --height 240 --width 432\n```\n\nThe results will be saved in the `results` folder.\nTo test your own videos, please prepare the input `mp4 video` (or `split frames`) and `frame-wise mask(s)`.\n\nIf you want to specify the video resolution for processing or avoid running out of memory, you can set the video size of `--width` and `--height`:\n```shell\n# process a 576x320 video; set --fp16 to use fp16 (half precision) during inference.\npython inference_propainter.py --video inputs/video_completion/running_car.mp4 --mask inputs/video_completion/mask_square.png --height 320 --width 576 --fp16\n```\n --\u003e\n\n## Citation\n\n   If you find our repo useful for your research, please consider citing our paper:\n\n   ```bibtex\n   @inproceedings{lin2024generateu,\n      title={Generative Region-Language Pretraining for Open-Ended Object Detection},\n      author={Lin, Chuang and Jiang, Yi and Qu, Lizhen and Yuan, Zehuan and Cai, Jianfei},\n      booktitle={Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},\n      year={2024}\n   }\n   ```\n\n## Contact\nIf you have any questions, please feel free to reach out to me at `chuang.lin@monash.edu`. \n\n## Acknowledgement\n\nThis code is based on [UNINEXT](https://github.com/MasterBin-IIAU/UNINEXT/tree/master). Some code is borrowed from [FlanT5](https://huggingface.co/docs/transformers/model_doc/flan-t5). Thanks for their awesome work. 
\n\nSpecial thanks to [Bin Yan](https://github.com/MasterBin-IIAU) and [Junfeng Wu](https://github.com/wjf5203) for their valuable contributions.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoundationvision%2Fgenerateu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffoundationvision%2Fgenerateu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoundationvision%2Fgenerateu/lists"}