awesome-vlgfm
A Survey on Vision-Language Geo-Foundation Models (VLGFMs)
https://github.com/zytx121/awesome-vlgfm
📖 Methods: A Survey
Datasets & Benchmark
- RS5M ([code](https://github.com/om-ai-lab/RS5M), [download](https://huggingface.co/datasets/Zilun/RS5M))
- NAIP-OSM
- SkyScript
- RemoteCount
- RSICap & RSIEval
- GeoChat-Instruct & GeoChat-Bench ([code](https://github.com/mbzuai-oryx/GeoChat), [download](https://huggingface.co/datasets/MBZUAI/GeoChat_Instruct))
- SkyEye-968k ([code](https://github.com/ZhanYang-nwpu/SkyEyeGPT), download N/A)
- MMRS-1M
- LHRS-Align & LHRS-Instruct ([code](https://github.com/NJU-LHRS/LHRS-Bot), download N/A)
- LuoJiaHOG
- FineGrip
- RS-GPT4V
- VRSBench
- RSTeller
- MME-RealWorld ([project](https://mme-realworld.github.io/home_page.html), [download](https://huggingface.co/datasets/yifanzhang114/MME-RealWorld))
- Sydney-Captions & UCM-Captions
- RSICD
- RSVQA-LR & RSVQA-HR
- RSVQAxBEN
- FloodNet ([code](https://github.com/BinaLab/FloodNet-Supervised_v1.0), [download](https://drive.google.com/drive/folders/1leN9eWVQcvWDVYwNb2GCo5ML_wBEycWD))
- RSITMD
- RSIVQA
- NWPU-Captions ([code & download](https://github.com/HaiyanHuang98/NWPU-Captions))
- CRSVQA ([download](https://drive.google.com/file/d/12DQwGzJ5OQK1rU0T5CmpNN9bEs38x_mQ/view))
- LEVIR-CC ([code](https://github.com/Chen-Yang-Liu/RSICC), [download](https://github.com/Chen-Yang-Liu/LEVIR-CC-Dataset))
- CDVQA
- UAV-Captions
- RSVG
- CapERA
- DIOR-RSVG ([code](https://github.com/ZhanYang-nwpu/RSVG-pytorch), [download](https://drive.google.com/drive/folders/1hTqtYsC6B-m4ED2ewx5oKuYZV13EoJp_))
- LAION-EO
- SATIN ([download](https://huggingface.co/datasets/jonathan-roberts1/SATIN))
- ChatEarthNet ([code](https://github.com/zhu-xlab/ChatEarthNet), [download](https://doi.org/10.5281/zenodo.11003436))
- VLEO-Bench
- EarthVQA
- RRSIS
- RRSIS-D
- GeoText
- UrBench
- MMM-RS (download N/A)
- DDFAV
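
Many of these corpora are distributed through the Hugging Face Hub. A minimal loading sketch, assuming the standard `datasets` API and that the Hub entry (RS5M, linked above) exposes files the library can read directly; streaming avoids pulling the full multi-million-sample corpus:

```python
# pip install datasets
from datasets import load_dataset

# Stream instead of downloading: RS5M holds millions of image-text pairs.
# Assumes the Hub repo is readable by load_dataset; some entries in the
# list above ship as raw archives and need manual download instead.
ds = load_dataset("Zilun/RS5M", split="train", streaming=True)

for sample in ds.take(3):
    # Field names vary per dataset; inspect a record to discover them.
    print(sample.keys())
```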
Contrastive VLGFMs
- Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
- GRAFT: Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
- SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing
- RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
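
The models above all inherit CLIP's recipe: an image encoder and a text encoder trained so matching pairs score high under cosine similarity, which gives zero-shot scene classification for free. A minimal sketch of that mechanism via the generic `transformers` CLIP API, with the OpenAI base checkpoint standing in for a remote-sensing-tuned model such as RemoteCLIP (the image path and label set are illustrative):

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["an aerial photo of a runway",
          "an aerial photo of farmland",
          "an aerial photo of a harbor"]
image = Image.open("scene.png")  # any remote sensing tile

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # similarity of the image to each label
print(dict(zip(labels, logits.softmax(dim=-1)[0].tolist())))
```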
Conversational VLGFMs
- RSGPT: A Remote Sensing Vision Language Model and Benchmark
- GeoChat: Grounded Large Vision-Language Model for Remote Sensing ([code](https://github.com/mbzuai-oryx/GeoChat))
- SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model ([code](https://github.com/ZhanYang-nwpu/SkyEyeGPT))
- EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
- LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model ([code](https://github.com/NJU-LHRS/LHRS-Bot))
- Large Language Models for Captioning and Retrieving Remote Sensing Images
- H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model
- RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery ([code](https://github.com/BigData-KSU/RS-LLaVA))
- SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding ([code](https://github.com/Luo-Z13/SkySenseGPT))
- EarthMarker: A Visual Prompt Learning Framework for Region-level and Point-level Remote Sensing Imagery Comprehension
- TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
- Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery
- Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension
- GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding
- LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation ([code](https://github.com/NJU-LHRS/LHRS-Bot))
- GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing
- RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering
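
Several of these models (GeoChat, RS-LLaVA, SkySenseGPT) follow the LLaVA pattern: vision-encoder features are projected into the LLM's token space so the model can answer free-form instructions about an image. A minimal sketch with the generic `llava-hf/llava-1.5-7b-hf` checkpoint as a stand-in for the remote-sensing variants (the image path and question are illustrative placeholders):

```python
# pip install transformers accelerate torch pillow
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # swap in a listed RS model's weights if released
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("airport.png")  # any remote sensing scene
prompt = "USER: <image>\nHow many aircraft are visible on the apron? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```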
🕹️ Application
- A Decoupling Paradigm With Prompt Learning for Remote Sensing Image Change Captioning ([code](https://github.com/Chen-Yang-Liu/PromptCC))
- VLCA: Vision-Language Aligning Model with Cross-Modal Attention for Bilingual Remote Sensing Image Captioning
- CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study
- EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering
- RRSIS: Referring Remote Sensing Image Segmentation
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation
- Multi-Spectral Remote Sensing Image Retrieval Using Geospatial Foundation Models
- Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection
- ChangeCLIP: Remote Sensing Change Detection with Multimodal Vision-Language Representation Learning
- A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection
- Change Detection Between Optical Remote Sensing Imagery and Map Data via Segment Anything Model (SAM)
- Segment Any Change
- RS-CLIP: Zero-Shot Remote Sensing Scene Classification via Contrastive Vision-Language Supervision
- Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Vision Foundation Models
- CPSeg: Finer-Grained Image Semantic Segmentation via Chain-of-Thought Language Prompting
- Prompt-RSVQA: Prompting Visual Context to a Language Model for Remote Sensing Visual Question Answering
- Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization
- CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations ([project](https://gengchenmai.github.io/csp-website/))
- GeoCLIP: CLIP-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization ([code](https://github.com/VicenteVivan/geo-clip))
- SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery
- Stable Diffusion For Aerial Object Detection
- Zooming Out on Zooming In: Advancing Super-Resolution for Remote Sensing ([code](https://github.com/allenai/satlas-super-resolution))
- Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis ([code](https://github.com/Chen-Yang-Liu/Change-Agent))
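
Several entries above (CLIP-RS, the multi-spectral retrieval work) reduce cross-modal retrieval to nearest-neighbour search in a shared embedding space: embed the archive once, embed the query text, rank by cosine similarity. A minimal sketch with the generic CLIP checkpoint as a stand-in and hypothetical tile filenames:

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["tile_001.png", "tile_002.png", "tile_003.png"]  # hypothetical archive
images = [Image.open(p) for p in paths]

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=["flooded residential area"], return_tensors="pt", padding=True)
    )

# Cosine similarity = dot product of L2-normalised embeddings.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)
print(paths[int(scores.argmax())])  # best-matching tile for the query
```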
Generative VLGFMs
- DiffusionSat: A Generative Foundation Model for Satellite Imagery ([code](https://github.com/samar-khanna/DiffusionSat))
- CRS-Diff: Controllable Generative Remote Sensing Foundation Model
- HSIGene: A Foundation Model For Hyperspectral Image Generation
- MetaEarth: A Generative Foundation Model for Global-Scale Remote Sensing Image Generation
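
DiffusionSat and CRS-Diff both start from a latent text-to-image diffusion backbone and add satellite-specific conditioning (location, resolution, timestamp). A minimal text-to-image sketch with the generic `diffusers` pipeline; the CompVis checkpoint is a stand-in, not these models' released weights:

```python
# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

# Generic latent diffusion backbone; the listed models fine-tune this
# family on satellite imagery and extend the conditioning inputs.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe("a satellite image of a river delta with farmland",
             num_inference_steps=30).images[0]
image.save("delta.png")
```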
🔍 Exploration
- An Empirical Study of Remote Sensing Pretraining ([code](https://github.com/ViTAE-Transformer/RSP))
- Autonomous GIS: The Next-Generation AI-Powered GIS
- GPT4GEO: How a Language Model Sees the World's Geography ([code](https://github.com/jonathan-roberts1/GPT4GEO))
- Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs ([code](https://github.com/jonathan-roberts1/charting-new-territories))
- The Potential of Visual ChatGPT For Remote Sensing
👨‍🏫 Survey
- An Agenda for Multimodal Foundation Models for Earth Observation
- Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works
- Large Remote Sensing Model: Progress and Prospects
- Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey
- On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications
- Vision-Language Models in Remote Sensing: Current Progress and Future Trends
- On the Foundations of Earth and Climate Foundation Models