{"id":18470808,"url":"https://github.com/om-ai-lab/groundvlp","last_synced_at":"2025-04-08T11:32:04.933Z","repository":{"id":214512503,"uuid":"732911173","full_name":"om-ai-lab/GroundVLP","owner":"om-ai-lab","description":"GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection","archived":false,"fork":false,"pushed_at":"2024-01-02T14:56:56.000Z","size":35377,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-01-02T15:42:47.227Z","etag":null,"topics":["multimodal","object-detection","vision-and-language","zero-shot-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/om-ai-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-12-18T06:25:09.000Z","updated_at":"2024-01-02T12:06:29.000Z","dependencies_parsed_at":"2023-12-28T17:45:35.627Z","dependency_job_id":null,"html_url":"https://github.com/om-ai-lab/GroundVLP","commit_stats":null,"previous_names":["om-ai-lab/groundvlp"],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/om-ai-lab%2FGroundVLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/om-ai-lab%2FGroundVLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/om-ai-lab%2FGroundVLP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/om-ai-lab%2FGroundVLP/manifests","owner_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/owners/om-ai-lab","download_url":"https://codeload.github.com/om-ai-lab/GroundVLP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223317849,"owners_count":17125605,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["multimodal","object-detection","vision-and-language","zero-shot-learning"],"created_at":"2024-11-06T10:14:55.657Z","updated_at":"2024-11-06T10:14:55.723Z","avatar_url":"https://github.com/om-ai-lab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GroundVLP\n**GroundVLP**: A simple yet effective zero-shot method that harnesses the visual grounding ability of existing models trained on image-text pairs and pure object detection data.\n\u003cp align=\"center\"\u003e \u003cimg src='docs/introduction3.png' align=\"center\" height=\"400px\"\u003e \u003c/p\u003e\n\n\u003e [**GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection**](https://arxiv.org/abs/2312.15043)               \n\u003e Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin              \n\u003e *AAAI 2024 ([arXiv 2312.15043](https://arxiv.org/abs/2312.15043))*  \n\n## Installation\n\n* First, install PyTorch ≥ 1.8 and torchvision together from [pytorch.org](https://pytorch.org), and check that the PyTorch version matches the one required by Detectron2.\n* To use Detic, install Detectron2. 
You can follow the [Detectron2 installation instructions](https://detectron2.readthedocs.io/tutorials/install.html) to install it.\n* Install requirements:\n```bash\npip install -r requirements.txt\n```\n\nExample commands for setting up the environment:\n```bash\n# create a new environment\nconda create --name groundvlp python=3.8\nconda activate groundvlp\n\ngit clone https://github.com/om-ai-lab/GroundVLP.git\ncd GroundVLP\n\n# install pytorch\npip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html\n\n# install detectron2\npython -m pip install detectron2 -f \\\n  https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html\n\n# install requirements\npip install -r requirements.txt\n```\n\n## Download\n### Checkpoints\nDownload the following checkpoints and place them in `checkpoints/`:\n* [ALBEF](https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/ALBEF.pth)\n* [TCL](https://drive.google.com/file/d/1Cb1azBdcdbm0pRMFs-tupKxILTCXlB4O/view)\n* [Detic](https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth)\n### Json files\nDownload the following archive and unzip it into `data/`:\n* [RefCOCO/+/g INFO Json files](https://drive.google.com/file/d/1IPACy7Tb1XAK_uWGSXGDZrY-4txCOhSG/view?usp=sharing)\n### Images\nDownload the COCO images and unzip them into `images/train2014`:\n* [COCO_train2014](http://images.cocodataset.org/zips/train2014.zip)\n\n\nThe resulting folder tree should look like this:\n```\nGroundVLP\n  ├── checkpoints                                  \n  │   └── ALBEF.pth\n  │   └── Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth\n  ├── data\n  │   └── refcoco_val_info.json\n  │   └── ...\n  │   └── refcocog_val_info.json\n  ├── images\n  │   └── train2014\n  │       └── COCO_train2014_xxx.jpg\n ...\n```\n\n## Run\n\n### Results of RefCOCO/+/g\nRun this command to evaluate GroundVLP on REC datasets using the 
ground-truth category:\n```bash\npython eval_rec.py \\\n  --image_folder=\"./images/train2014\" \\\n  --eval_data=\"refcoco_val,refcoco_testA,refcoco_testB,refcoco+_val,refcoco+_testA,refcoco+_testB,refcocog_val,refcocog_test\" \\\n  --model_id=\"ALBEF\" \\\n  --use_gt_category\n```\nThe released code currently supports only the ALBEF and TCL models. We will continue to update it to support more models.\n\nTo get results using the predicted category, first extract the agent of each query and map it to a COCO label:\n```bash\npython utils/map_to_coco_label.py\n```\nThen run this command:\n```bash\npython eval_rec.py \\\n  --image_folder=\"./images/train2014\" \\\n  --eval_data=\"refcoco_val,refcoco_testA,refcoco_testB,refcoco+_val,refcoco+_testA,refcoco+_testB,refcocog_val,refcocog_test\" \\\n  --model_id=\"ALBEF\"\n```\n\n### Demo\nRun this command to evaluate GroundVLP on a single image-query pair:\n```bash\npython demo.py \\\n  --image_path=\"./docs/demo.jpg\" \\\n  --query=\"boy with white hair\"\n```\nIf everything is set up correctly, the output image at `output/demo.jpg` should look like this:\n\u003cp align=\"center\"\u003e \u003cimg src='docs/demo_output.jpg' align=\"center\" width=\"400px\"\u003e \u003c/p\u003e\n\n\n## Citations\nIf you find this project useful for your research, please use the following BibTeX entry.\n```\n@article{shen2023groundvlp,\n  title={GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection},\n  author={Shen, Haozhan and Zhao, Tiancheng and Zhu, Mingwei and Yin, Jianwei},\n  journal={arXiv preprint arXiv:2312.15043},\n  year={2023}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fom-ai-lab%2Fgroundvlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fom-ai-lab%2Fgroundvlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fom-ai-lab%2Fgroundvlp/lists"}