{"id":28393054,"url":"https://github.com/visionxlab/geoground","last_synced_at":"2026-01-29T16:34:54.782Z","repository":{"id":263706944,"uuid":"889290077","full_name":"VisionXLab/GeoGround","owner":"VisionXLab","description":"GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding","archived":false,"fork":false,"pushed_at":"2025-05-10T15:59:01.000Z","size":16789,"stargazers_count":50,"open_issues_count":3,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-06-01T02:55:40.129Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VisionXLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-16T01:58:49.000Z","updated_at":"2025-05-27T14:38:32.000Z","dependencies_parsed_at":"2025-05-31T15:58:25.176Z","dependency_job_id":"6a3c3592-6619-4faf-96ee-283406fd6f76","html_url":"https://github.com/VisionXLab/GeoGround","commit_stats":null,"previous_names":["zytx121/geoground","visionxlab/geoground"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/VisionXLab/GeoGround","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VisionXLab%2FGeoGround","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VisionXLab%2FGeoGround/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VisionXLab%2FGeoGround/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VisionXLab%2FGeoGround/
manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VisionXLab","download_url":"https://codeload.github.com/VisionXLab/GeoGround/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VisionXLab%2FGeoGround/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262022037,"owners_count":23246239,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-31T15:31:25.133Z","updated_at":"2026-01-29T16:34:54.735Z","avatar_url":"https://github.com/VisionXLab.png","language":null,"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003eGeoGround\u003cimg src=\"images/geoground/logo.png\" height=\"50\"\u003e: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding\u003c/h1\u003e\n\n\u003cdiv\u003e\n    \u003ca href='https://zytx121.github.io/' target='_blank'\u003eYue Zhou\u003c/a\u003e\u003csup\u003e1\u003c/sup\u003e\u0026emsp;   \n    \u003ca href='https://mc-lan.github.io/' target='_blank'\u003eMengcheng Lan\u003c/a\u003e\u003csup\u003e1\u003c/sup\u003e\u0026emsp;\n    \u003ca href='https://xiangli.ac.cn/' target='_blank'\u003eXiang Li\u003c/a\u003e\u003csup\u003e2\u003c/sup\u003e\u0026emsp;\n    \u003ca href='https://scholar.google.com.hk/citations?user=PnNAAasAAAAJ\u0026hl=en' target='_blank'\u003eLitong Feng\u003c/a\u003e\u003csup\u003e5\u003c/sup\u003e\u0026emsp;\n    \u003ca href='https://keyiping.wixsite.com/index' target='_blank'\u003eYiping Ke\u003c/a\u003e\u003csup\u003e1\u003c/sup\u003e\u0026emsp;\n    \u003ca 
href='https://ee.sjtu.edu.cn/FacultyDetail.aspx?id=53\u0026infoid=66' target='_blank'\u003eXue Jiang\u003c/a\u003e\u003csup\u003e3\u003c/sup\u003e\u0026emsp;\n    \u003ca href='https://github.com/Li-Qingyun' target='_blank'\u003eQingyun Li\u003c/a\u003e\u003csup\u003e4\u003c/sup\u003e\u0026emsp;\n    \u003ca href='https://yangxue.site/' target='_blank'\u003eXue Yang\u003c/a\u003e\u003csup\u003e3\u003c/sup\u003e\u0026emsp;\n    \u003ca href='https://www.statfe.com/' target='_blank'\u003eWayne Zhang\u003c/a\u003e\u003csup\u003e5\u003c/sup\u003e\u0026emsp;\n\u003c/div\u003e\n\u003cdiv\u003e\n    \u003csup\u003e1\u003c/sup\u003eNanyang Technological University\u0026emsp; \n    \u003csup\u003e2\u003c/sup\u003eUniversity of Reading\u0026emsp; \n    \u003csup\u003e3\u003c/sup\u003eShanghai Jiao Tong University\u0026emsp; \n    \u003csup\u003e4\u003c/sup\u003eHarbin Institute of Technology\u0026emsp; \n    \u003csup\u003e5\u003c/sup\u003eSenseTime Research\u0026emsp;\n\u003c/div\u003e\n\n[![Demo](https://img.shields.io/badge/Online-Demo-red)]()\n[![Website](https://img.shields.io/badge/Project-Website-87CEEB)]()\n[![Paper](https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg)](http://arxiv.org/abs/2411.11904)\n[![Dataset](https://img.shields.io/badge/HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/erenzhou/refGeo)\n[![Model](https://img.shields.io/badge/HuggingFace-Model-blue)](https://huggingface.co/erenzhou/GeoGround)\n\n\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://i.imgur.com/waxVImv.png\" alt=\"Oryx Video-ChatGPT\"\u003e\n\u003c/p\u003e\n\n---\n\n## 📢 Latest Updates\n\n\n- [x] **[2025.05.10]** Released the [Model Weights](https://huggingface.co/erenzhou/GeoGround); they can be run directly with [LLaVA](https://github.com/haotian-liu/LLaVA/tree/main).\n- [x] **[2025.01.25]** Released the [refGeo](https://huggingface.co/datasets/erenzhou/refGeo) dataset.\n\n---\n\n## Abstract\n\n*Remote sensing (RS) visual grounding 
aims to use natural language expressions to locate specific objects (in the form of a bounding box or segmentation mask) in RS images, enhancing human interaction with intelligent RS interpretation systems. Early research in this area was primarily based on horizontal bounding boxes (HBBs), but as more diverse RS datasets have become available, tasks involving oriented bounding boxes (OBBs) and segmentation masks have emerged. In practical applications, different targets require different grounding types: an HBB localizes an object's position, an OBB provides its orientation, and a mask depicts its shape. However, existing specialized methods are typically tailored to a single type of RS visual grounding task and are hard to generalize across tasks. In contrast, large vision-language models (VLMs) exhibit powerful multi-task learning capabilities but struggle to handle dense prediction tasks like segmentation. This paper proposes GeoGround, a novel framework that unifies support for HBB, OBB, and mask RS visual grounding tasks, allowing flexible output selection. Rather than customizing the architecture of the VLM, our work aims to elegantly support pixel-level visual grounding output through the Text-Mask technique. We define prompt-assisted and geometry-guided learning to enhance consistency across different signals. To support model training, we present refGeo, a large-scale RS visual instruction-following dataset containing 161k image-text pairs. 
Experimental results show that GeoGround achieves strong performance across four RS visual grounding tasks, matching or surpassing the performance of specialized methods on multiple benchmarks.*\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/geoground/framework.jpg\" width=100%\u003e\n  \u003cdiv style=\"display: inline-block; color: #999; padding: 2px;\"\u003e\n      GeoGround unifies box-level and pixel-level visual grounding tasks in remote sensing.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n---\n\n## 🏆 Contributions\n\n- **Framework.** We propose GeoGround, a novel VLM framework that unifies box-level and pixel-level RS visual grounding tasks while maintaining the VLM's inherent dialogue and image understanding capabilities.\n\n- **Dataset.** We introduce refGeo, the largest RS visual grounding instruction-following dataset, consisting of 161k image-text pairs and 80k RS images, including a new 3D-aware aerial vehicle visual grounding dataset.\n\n- **Benchmark.** We conduct extensive experiments on various RS visual grounding tasks, providing valuable insights for future RS VLM research and opening new avenues for RS visual grounding.\n\n---\n\n## 💬 Text-Mask \u0026 Hybrid Supervision\n\nWe propose the Text-Mask paradigm, which distills and compresses the information embedded in the mask into a compact text sequence that can be efficiently learned by VLMs. Additionally, we introduce hybrid supervision, which incorporates prompt-assisted learning (PAL) and geometry-guided learning (GGL) to fine-tune the model using three types of signals, ensuring output consistency and enhancing the model’s understanding of the relationships between different grounding types.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/geoground/hybrid_supervision.jpg\" width=70%\u003e\n\u003c/p\u003e\n\n---\n\n## 🔍 refGeo Dataset\n\nWe introduce refGeo, a large-scale RS visual grounding instruction-following dataset. 
It consolidates four existing RS visual grounding datasets and introduces a new aerial vehicle visual grounding dataset (AVVG). AVVG extends traditional 2D visual grounding to a 3D context, enabling VLMs to perceive 3D space from 2D aerial imagery. For each referred object, we provide an HBB, an OBB, and a mask, with the latter automatically generated by SAM.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/geoground/refgeo.jpg\" width=50%\u003e\n\u003c/p\u003e\n\n---\n\n## 🚀 Qualitative and Quantitative Results\n\n### 📷 Referring Expression Comprehension (REC) (HBB)\nGeoGround achieves the best performance across all REC benchmarks, surpassing the specialized model on the DIOR-RSVG test set. Benefiting from the wide range of image resolutions and ground sample distances (GSD) in refGeo, the fine-tuned model shows significant performance improvements on datasets with a high proportion of small objects, such as RSVG and AVVG.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/tables/table_1.png\" width=100% alt=\"Table_1\"\u003e\n\u003c/div\u003e\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/qualitative_results/rec_hbb.png\" width=100%\u003e\n  \u003cdiv style=\"display: inline-block; color: #999; padding: 2px;\"\u003e\n      Performance on RSVG benchmark.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n---\n\n### 📷 Referring Expression Comprehension (REC) (OBB)\nThe results demonstrate GeoGround's dominance in RS visual grounding tasks based on OBB, further validating the effectiveness of our hybrid supervision approach.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/tables/table_2.png\" width=50% alt=\"Table_2\"\u003e\n\u003c/div\u003e\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/qualitative_results/rec_obb.png\" width=100%\u003e\n  \u003cdiv style=\"display: inline-block; color: #999; padding: 2px;\"\u003e\n      Performance on GeoChat benchmark.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n---\n\n### 📷 Referring Expression 
Segmentation (RES)\nUnlike other VLMs, GeoGround does not require an additional mask decoder, as it inherently possesses segmentation capabilities. Moreover, we experiment with using SAM to refine the coarse masks generated by GeoGround, which allows it to match the performance of the best RS referring segmentation model.\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/tables/table_3.png\" width=50% alt=\"Table_3\"\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/qualitative_results/res_1.png\" width=100%\u003e\n  \u003cdiv style=\"display: inline-block; color: #999; padding: 2px;\"\u003e\n      Performance on RRSIS-D benchmark.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/qualitative_results/res_2.png\" width=100%\u003e\n  \u003cdiv style=\"display: inline-block; color: #999; padding: 2px;\"\u003e\n      Performance on RRSIS-D benchmark.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n---\n\n### 📷 Generalized Referring Expression Comprehension (GRES) (Multiple Targets)\nWe present an RS Generalized REC benchmark based on AVVG, which differs from standard REC in that one referring expression may correspond to multiple objects.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/tables/table_4.png\" width=50% alt=\"Table_4\"\u003e\n\u003c/div\u003e\n\n---\n\n### 📷 Image Captioning \u0026 Visual Question Answering (VQA)\nOur approach enhances object-level understanding without compromising the holistic image comprehension capabilities of VLMs.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/tables/table_5.png\" width=50% alt=\"Table_5\"\u003e\n\u003c/div\u003e\n\n\n\n## 📜 Citation\n```bibtex\n@misc{zhou2024geoground,\n      title={GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding}, \n      author={Yue Zhou and Mengcheng Lan and Xiang Li and Litong Feng and Yiping 
Ke and Xue Jiang and Qingyun Li and Xue Yang and Wayne Zhang},\n      year={2024},\n      eprint={2411.11904},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https://arxiv.org/abs/2411.11904}, \n}\n```\n\n---\n## 🙏 Acknowledgement\n\n- [LLaVA](https://github.com/haotian-liu/LLaVA/tree/main)\n- [Text4Seg](https://github.com/mc-lan/Text4Seg)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvisionxlab%2Fgeoground","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvisionxlab%2Fgeoground","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvisionxlab%2Fgeoground/lists"}