# A curated list of Visual Language Models papers and resources for Earth Observation (VLM4EO) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/geoaigroup/awesome-vision-language-models-for-earth-observation/)

This list is created and maintained by [Ali Koteich](https://github.com/alikoteich) and [Hasan Moughnieh](https://geogroup.ai/author/hasan-moughnieh/) from the GEOspatial Artificial Intelligence ([GEOAI](https://geogroup.ai/)) research group at the National Center for Remote Sensing - CNRS, Lebanon.

We encourage you to contribute to this project according to the following [guidelines](https://github.com/sindresorhus/awesome/blob/main/contributing.md).

---

**If you find this repository useful, please consider giving it a ⭐**

**Table Of Contents**
- [A curated list of Visual Language Models papers and resources for Earth Observation (VLM4EO) ](#a-curated-list-of-visual-language-models-papers-and-resources-for-earth-observation-vlm4eo-)
- [Foundation Models](#foundation-models)
- [Image Captioning](#image-captioning)
- [Text-Image Retrieval](#text-image-retrieval)
- [Visual Grounding](#visual-grounding)
- [Visual Question Answering](#visual-question-answering)
- [Vision-Language Remote Sensing Datasets](#vision-language-remote-sensing-datasets)
- [Related Repositories \& Libraries](#related-repositories--libraries)

## Foundation Models
| Year | Title | Paper | Code | Venue |
|------|-------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|-----------------------------------------------|
| 2024 | EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | [paper](https://arxiv.org/abs/2401.16822) | | |
| 2024 | Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models | [paper](https://arxiv.org/abs/2401.09083) | [code](https://github.com/HaonanGuo/Remote-Sensing-ChatGPT) | |
| 2024 | SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | [paper](https://arxiv.org/abs/2401.09712) | [code](https://github.com/ZhanYang-nwpu/SkyEyeGPT) | |
| 2023 | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | [paper](https://arxiv.org/abs/2311.15826) | [code](https://github.com/mbzuai-oryx/geochat) | |
| 2023 | Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | [paper](https://export.arxiv.org/abs/2312.06960) | | |

## Image Captioning
| Year | Title | Paper | Code | Venue |
|------|-------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|-----------------------------------------------|
| 2023 | Captioning Remote Sensing Images Using Transformer Architecture | [paper](https://ieeexplore.ieee.org/document/10067039/) | | International Conference on Artificial Intelligence in Information and Communication |
| 2023 | Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning | [paper](https://www.mdpi.com/2072-4292/15/3/579) | | MDPI Remote Sensing |
| 2023 | Progressive Scale-aware Network for Remote sensing Image Change Captioning | [paper](https://arxiv.org/abs/2303.00355) | | |
| 2023 | Towards Unsupervised Remote Sensing Image Captioning and Retrieval with Pre-Trained Language Models | [paper](https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/B10-4.pdf) | | Proceedings of the Japanese Association for Natural Language Processing |
| 2022 | A Joint-Training Two-Stage Method for Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9961235) | | IEEE TGRS |
| 2022 | A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning | [paper](https://www.mdpi.com/2072-4292/14/12/2939) | | MDPI Remote Sensing |
| 2022 | Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis | [paper](https://ieeexplore.ieee.org/document/9847254) | | IEEE TGRS |
| 2022 | Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9855519) | [code](https://gitlab.lrz.de/ai4eo/captioningMultilabel) | IEEE GRSL |
| 2022 | Generating the captions for remote sensing images: A spatial-channel attention based memory-guided transformer approach | [paper](https://www.sciencedirect.com/science/article/abs/pii/S0952197622002317) | [code](https://github.com/GauravGajbhiye/SCAMET_RSIC) | Engineering Applications of Artificial Intelligence |
| 2022 | Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image | [paper](https://ieeexplore.ieee.org/document/9632558) | | IEEE TGRS |
| 2022 | High-Resolution Remote Sensing Image Captioning Based on Structured Attention | [paper](https://ieeexplore.ieee.org/document/9400386) | | IEEE TGRS |
| 2022 | Meta captioning: A meta learning based remote sensing image captioning framework | [paper](https://www.sciencedirect.com/science/article/abs/pii/S0924271622000351) | [code](https://github.com/QiaoqiaoYang/MetaCaptioning) | Elsevier ISPRS Journal of Photogrammetry and Remote Sensing |
| 2022 | Multiscale Multiinteraction Network for Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9720234) | | IEEE JSTARS |
| 2022 | NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9866055) | [code](https://github.com/HaiyanHuang98/NWPU-Captions) | IEEE TGRS |
| 2022 | Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9515452) | | IEEE TGRS |
| 2022 | Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset | [paper](https://ieeexplore.ieee.org/document/9934924) | | IEEE TGRS |
| 2022 | Transforming remote sensing images to textual descriptions | [paper](https://www.sciencedirect.com/science/article/pii/S0303243422000678) | | Int J Appl Earth Obs Geoinf |
| 2022 | Using Neural Encoder-Decoder Models with Continuous Outputs for Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9714367) | | IEEE Access |
| 2021 | A Novel SVM-Based Decoder for Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9521989) | | IEEE TGRS |
| 2021 | SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9239371) | [code](https://git.tu-berlin.de/rsim/SD-RSIC) | IEEE TGRS |
| 2021 | Truncation Cross Entropy Loss for Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9153154) | | IEEE TGRS |
| 2021 | Word-Sentence Framework for Remote Sensing Image Captioning | [paper](https://ieeexplore.ieee.org/document/9308980) | | IEEE TGRS |
| 2020 | A multi-level attention model for remote sensing image captions | [paper](https://www.mdpi.com/2072-4292/12/6/939) | | MDPI Remote Sensing |
| 2020 | Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning | [paper](https://www.sciencedirect.com/science/article/abs/pii/S0950705120302586) | | Elsevier Knowledge-Based Systems |
| 2020 | Toward Remote Sensing Image Retrieval Under a Deep Image Captioning Perspective | [paper](https://ieeexplore.ieee.org/document/9154525) | | IEEE JSTARS |
| 2019 | LAM: Remote sensing image captioning with attention-based language model | [paper](https://ieeexplore.ieee.org/document/8930629) | | IEEE TGRS |
| 2019 | Learning to Caption Remote Sensing Images by Geospatial Feature Driven Attention Mechanism | [paper](https://ieeexplore.ieee.org/document/8780492) | | IEEE JSTARS |
| 2019 | Remote Sensing Image Captioning by Deep Reinforcement Learning with Geospatial Features | [paper](https://ieeexplore.ieee.org/document/8820076) | | IEEE TGRS |

## Text-Image Retrieval

| Year | Title | Paper | Code | Venue |
|------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|-----------------------------------------------|
| 2024 | Multi-Spectral Remote Sensing Image Retrieval using Geospatial Foundation Models | [paper](https://arxiv.org/abs/2403.02059) | [code](https://github.com/IBM/remote-sensing-image-retrieval) | |
| 2023 | A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval | [paper](https://dl.acm.org/doi/10.1145/3581783.3612374) | [code](https://github.com/Zjut-MultimediaPlus/PIR-pytorch) | ACM MM 2023 (Oral) |
| 2023 | A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing | [paper](https://www.mdpi.com/2072-4292/15/18/4637) | | MDPI Remote Sensing |
| 2023 | An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval | [paper](https://www.mdpi.com/2227-7390/11/10/2279) | | MDPI Mathematics |
| 2023 | Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval | [paper](https://www.mdpi.com/2076-3417/13/1/282) | | MDPI Applied Sciences |
| 2023 | Hypersphere-Based Remote Sensing Cross-Modal Text–Image Retrieval via Curriculum Learning | [paper](https://ieeexplore.ieee.org/document/10261223) | [code](https://github.com/ZhangWeihang99/HVSA) | IEEE TGRS |
| 2023 | Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval | [paper](https://ieeexplore.ieee.org/document/10231134) | | IEEE TGRS |
| 2023 | Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval | [paper](https://dl.acm.org/doi/abs/10.1145/3591106.3592236) | [code](https://github.com/kinshingpoon/SWAN-pytorch) | ICMR'23 |
| 2023 | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | [paper](https://arxiv.org/abs/2306.11029) | [code](https://github.com/ChenDelong1999/RemoteCLIP) | |
| 2022 | A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing | [paper](https://ieeexplore.ieee.org/document/9594840) | [code](https://github.com/xiaoyuan1996/retrievalSystem) | IEEE TGRS |
| 2022 | An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing | [paper](https://ieeexplore.ieee.org/document/9897500) | [code](https://git.tu-berlin.de/rsim/chnr) | IEEE ICIP |
| 2022 | CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study | [paper](https://vtechworks.lib.vt.edu/handle/10919/110853) | | Virginia Polytechnic Institute and State University |
| 2022 | Knowledge-Aware Cross-Modal Text-Image Retrieval for Remote Sensing Images | [paper](https://ceur-ws.org/Vol-3207/paper4.pdf) | | |
| 2022 | MCRN: A Multi-source Cross-modal Retrieval Network for remote sensing | [paper](https://www.sciencedirect.com/science/article/pii/S156984322200259X) | [code](https://github.com/xiaoyuan1996/MCRN) | Int J Appl Earth Obs Geoinf |
| 2022 | Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval | [paper](https://ieeexplore.ieee.org/document/9925582) | | IEEE JSTARS |
| 2022 | Multisource Data Reconstruction-Based Deep Unsupervised Hashing for Unisource Remote Sensing Image Retrieval | [Paper](https://ieeexplore.ieee.org/abstract/document/10001754) | [code](https://github.com/sunyuxi/MrHash) | IEEE TGRS |
| 2022 | Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information | [paper](https://ieeexplore.ieee.org/document/9745546) | [code](https://github.com/xiaoyuan1996/GaLR) | IEEE TGRS |
| 2022 | Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote Sensing | [paper](https://ieeexplore.ieee.org/document/9746251) | [code](https://git.tu-berlin.de/rsim/duch) | IEEE ICASSP |
| 2021 | Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval | [paper](https://ieeexplore.ieee.org/document/9437331) |[code](https://github.com/xiaoyuan1996/AMFMN) | IEEE TGRS |
| 2020 | Deep unsupervised embedding for remote sensing image retrieval using textual cues | [paper](https://www.mdpi.com/2076-3417/10/24/8931) | | MDPI Applied Sciences |
| 2020 | TextRS: Deep bidirectional triplet network for matching text to remote sensing images | [paper](https://www.mdpi.com/2072-4292/12/3/405) | | MDPI Remote Sensing |
| 2020 | Toward Remote Sensing Image Retrieval under a Deep Image Captioning Perspective | [paper](https://ieeexplore.ieee.org/document/9154525) | | IEEE JSTARS |
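Most methods in this table embed images and captions into a shared space, rank images by similarity to a text query, and report Recall@K. A minimal, model-agnostic sketch of that evaluation loop with toy embeddings (no specific model or library assumed; real systems would plug in encoder outputs here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_images(text_emb, image_embs):
    """Image indices sorted by descending similarity to the text embedding."""
    sims = [(cosine(text_emb, img), i) for i, img in enumerate(image_embs)]
    return [i for _, i in sorted(sims, reverse=True)]

def recall_at_k(ranking, relevant_idx, k):
    """1.0 if the ground-truth image appears in the top-k results, else 0.0."""
    return 1.0 if relevant_idx in ranking[:k] else 0.0

# Toy 3-d embeddings; in practice these come from a vision-language encoder.
query = [0.9, 0.1, 0.0]        # e.g. "an airport with two runways"
images = [
    [0.1, 0.9, 0.0],           # farmland
    [0.8, 0.2, 0.1],           # airport (ground truth)
    [0.0, 0.1, 0.9],           # harbor
]

ranking = rank_images(query, images)
print(ranking[0], recall_at_k(ranking, relevant_idx=1, k=1))  # → 1 1.0
```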

## Visual Grounding
| Year | Title | Paper | Code | Venue |
|------|-----------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|----------|
| 2023 | LaLGA: Multi-Scale Language-Aware Visual Grounding on Remote Sensing Data | [paper](https://www.researchgate.net/publication/373146282_LaLGA_Multi-Scale_LanguageAware_Visual_Grounding_on_Remote_Sensing_Data) | [code](https://github.com/like413/OPT-RSVG) | |
| 2023 | Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models | [paper](https://arxiv.org/abs/2304.10597) | [code](https://github.com/Douglas2Code/Text2Seg) | |
| 2022 | RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | [paper](https://ieeexplore.ieee.org/document/10056343) | [code](https://github.com/ZhanYang-nwpu/RSVG-pytorch) | IEEE TGRS |
| 2022 | Visual Grounding in Remote Sensing Images | [paper](https://dl.acm.org/doi/abs/10.1145/3503161.3548316) | | ACM MM |
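Visual grounding predictions are commonly scored by intersection-over-union (IoU) against the ground-truth box, with Acc@0.5 counting predictions whose IoU exceeds 0.5. A minimal sketch of the metric (boxes as `(x1, y1, x2, y2)` corners; nothing method-specific):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Predicted vs. ground-truth box for one referring expression:
pred, gt = (10, 10, 60, 60), (20, 20, 70, 70)
score = iou(pred, gt)          # 1600 / 3400 ≈ 0.47
print(score, score >= 0.5)     # below the Acc@0.5 threshold
```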

## Visual Question Answering
| Year | Title | Paper | Code | Venue |
|------|-------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|------------------------------------------------------|
| 2023 | A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering | [paper](https://ieeexplore.ieee.org/document/10018408) | | IEEE TGRS |
| 2023 | EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | [paper](https://arxiv.org/pdf/2312.12222.pdf) | [code](https://junjue-wang.github.io/homepage/EarthVQA) | AAAI 2024 |
| 2023 | LIT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing | [paper](https://arxiv.org/abs/2306.00758) | [code](https://git.tu-berlin.de/rsim/lit4rsvqa) | IEEE IGARSS |
| 2023 | Multistep Question-Driven Visual Question Answering for Remote Sensing | [paper](https://ieeexplore.ieee.org/document/10242124) | [code](https://github.com/MeimeiZhang-data/MQVQA) | IEEE TGRS |
| 2023 | RSGPT: A Remote Sensing Vision Language Model and Benchmark | [paper](https://arxiv.org/abs/2307.15266) | [code](https://github.com/Lavender105/RSGPT) | |
| 2023 | RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering | [paper](https://arxiv.org/abs/2310.13120) | [code](https://github.com/Y-D-Wang/RSAdapter) | |
| 2022 | Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery | [paper](https://ieeexplore.ieee.org/document/9832935) | | IEEE TGRS |
| 2022 | Change Detection Meets Visual Question Answering | [paper](https://ieeexplore.ieee.org/abstract/document/9901476) | [code](https://github.com/YZHJessica/CDVQA) | IEEE TGRS |
| 2022 | From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data | [paper](https://ieeexplore.ieee.org/abstract/document/9771224) | [code](https://github.com/YZHJessica/VQA-easy2hard) | IEEE TGRS |
| 2022 | Language Transformers for Remote Sensing Visual Question Answering | [paper](https://ieeexplore.ieee.org/document/9884036) | | IEEE IGARSS |
| 2022 | Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing | [paper](https://arxiv.org/abs/2210.04510) | [code](https://git.tu-berlin.de/rsim/multi-modal-fusion-transformer-for-vqa-in-rs) | SPIE Image and Signal Processing for Remote Sensing |
| 2022 | Mutual Attention Inception Network for Remote Sensing Visual Question Answering | [paper](https://ieeexplore.ieee.org/document/9444570) | [code](https://github.com/spectralpublic/RSIVQA) | IEEE TGRS |
| 2022 | Prompt-RSVQA: Prompting visual context to a language model for Remote Sensing Visual Question Answering | [paper](https://ieeexplore.ieee.org/document/9857471) | | CVPRW |
| 2021 | How to find a good image-text embedding for remote sensing visual question answering? | [paper](https://arxiv.org/abs/2109.11848) | | CEUR Workshop Proceedings |
| 2021 | RSVQA meets BigEarthNet: a new, large-scale, visual question answering dataset for remote sensing | [paper](https://ieeexplore.ieee.org/document/9553307) | [code](https://github.com/syvlo/RSVQAxBEN) | IEEE IGARSS |
| 2020 | RSVQA: Visual Question Answering for Remote Sensing Data | [paper](https://ieeexplore.ieee.org/abstract/document/9088993) | [code](https://github.com/syvlo/RSVQA) | IEEE TGRS |

## Vision-Language Remote Sensing Datasets
| Name | Link | Paper Link | Description |
| --- | --- | --- | --- |
| RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | [Link](https://github.com/om-ai-lab/RS5M) | [Paper Link](https://arxiv.org/abs/2306.11300) | Size: 5 million remote sensing images with English descriptions<br>Resolution: 256 x 256<br>Platforms: 11 publicly available image-text paired datasets |
| Remote Sensing Visual Question Answering Low Resolution Dataset (RSVQA LR) | [Link](https://zenodo.org/record/6344334) | [Paper Link](https://arxiv.org/abs/2003.07333) | Size: 772 images & 77,232 questions and answers<br>Resolution: 256 x 256<br>Platforms: Sentinel-2 and OpenStreetMap<br>Use: Remote Sensing Visual Question Answering |
| Remote Sensing Visual Question Answering High Resolution Dataset (RSVQA HR) | [Link](https://zenodo.org/record/6344367) | [Paper Link](https://arxiv.org/abs/2003.07333) | Size: 10,659 images & 955,664 questions and answers<br>Resolution: 512 x 512<br>Platforms: USGS and OpenStreetMap<br>Use: Remote Sensing Visual Question Answering |
| Remote Sensing Visual Question Answering BigEarthNet Dataset (RSVQA x BEN) | [Link](https://zenodo.org/record/5084904) | [Paper Link](https://rsvqa.sylvainlobry.com/IGARSS21.pdf) | Size: 140,758,150 image/question/answer triplets<br>Platforms: Sentinel-2, BigEarthNet and OpenStreetMap<br>Use: Remote Sensing Visual Question Answering |
| Remote Sensing Image Visual Question Answering (RSIVQA) | [Link](https://github.com/spectralpublic/RSIVQA) | [Paper Link](https://ieeexplore.ieee.org/document/9444570) | Size: 37,264 images and 111,134 image-question-answer triplets<br>A small part of RSIVQA is annotated by humans; the rest is automatically generated from existing scene classification and object detection datasets<br>Use: Remote Sensing Visual Question Answering |
| FloodNet Visual Question Answering Dataset | [Link](https://drive.google.com/drive/folders/1g1r419bWBe4GEF-7si5DqWCjxiC8ErnY?usp=sharing) | [Paper Link](https://arxiv.org/abs/2012.02951) | Size: 11,000 question-image pairs<br>Resolution: 224 x 224<br>Platforms: UAV-DJI Mavic Pro quadcopters, after Hurricane Harvey<br>Use: Remote Sensing Visual Question Answering |
| Change Detection-Based Visual Question Answering Dataset | [Link](https://github.com/YZHJessica/CDVQA) | [Paper Link](https://ieeexplore.ieee.org/abstract/document/9901476) | Size: 2,968 pairs of multitemporal images and more than 122,000 question-answer pairs<br>Number of Classes: 6<br>Resolution: 512 x 512<br>Platforms: based on the semantic change detection dataset (SECOND)<br>Use: Remote Sensing Visual Question Answering |
| LAION-EO | [Link](https://huggingface.co/datasets/mikonvergence/LAION-EO) | [Paper Link](https://arxiv.org/abs/2309.15535) | Size: 24,933 samples with 40.1% English captions, plus other common languages from LAION-5B<br>Resolution: mean height 633.0 pixels (up to 9,999), mean width 843.7 pixels (up to 19,687)<br>Platforms: based on LAION-5B |
| CapERA: Captioning Events in Aerial Videos | [Link](https://www.github.com/yakoubbazi/CapEra) | [Paper Link](https://www.mdpi.com/2072-4292/15/8/2139) | Size: 2,864 videos and 14,320 captions, where each video is paired with five unique captions |
| Remote Sensing Image Captioning Dataset (RSICap) | [Link](https://github.com/Lavender105/RSGPT) | [Paper Link](https://arxiv.org/abs/2307.15266) | Size: 2,585 human-annotated captions with rich and high-quality information<br>Each image has a detailed description covering the scene (e.g., residential area, airport, or farmland) and object information (e.g., color, shape, quantity, absolute position) |
| Remote Sensing Image Captioning Evaluation Dataset (RSIEval) | [Link](https://github.com/Lavender105/RSGPT) | [Paper Link](https://arxiv.org/abs/2307.15266) | Size: 100 human-annotated captions and 936 visual question-answer pairs with rich information and open-ended questions and answers<br>Use: Image Captioning and Visual Question Answering |
| Revised Remote Sensing Image Captioning Dataset (RSICD) | [Link](https://drive.google.com/open?id=0B1jt7lJDEXy3aE90cG9YSl9ScUk) | [Paper Link](https://arxiv.org/pdf/1712.07835) | Size: 10,921 images with five captions per image<br>Number of Classes: 30<br>Resolution: 224 x 224<br>Platforms: Google Earth, Baidu Map, MapABC and Tianditu<br>Use: Remote Sensing Image Captioning |
| Revised University of California Merced dataset (UCM-Captions) | [Link](https://mega.nz/folder/wCpSzSoS#RXzIlrv--TDt3ENZdKN8JA) | [Paper Link](https://ieeexplore.ieee.org/document/7546397) | Size: 2,100 images with five captions per image<br>Number of Classes: 21<br>Resolution: 256 x 256<br>Platforms: USGS National Map Urban Area Imagery collection<br>Use: Remote Sensing Image Captioning |
| Revised Sydney-Captions Dataset | [Link](https://pan.baidu.com/s/1hujEmcG) | [Paper Link](https://ieeexplore.ieee.org/document/7546397) | Size: 613 images with five captions per image<br>Number of Classes: 7<br>Resolution: 500 x 500<br>Platforms: Google Earth<br>Use: Remote Sensing Image Captioning |
| LEVIR-CC dataset | [Link](https://drive.google.com/drive/folders/1cEv-BXISfWjw1RTzL39uBojH7atjLdCG?usp=sharing) | [Paper Link](https://ieeexplore.ieee.org/document/9934924) | Size: 10,077 pairs of RS images and 50,385 corresponding sentences<br>Number of Classes: 10<br>Resolution: 1024 x 1024<br>Platforms: Beihang University<br>Use: Remote Sensing Image Captioning |
| NWPU-Captions dataset | [Images Link](https://pan.baidu.com/s/1hmuWwnfPy2eZxxGxt6XuSg), [Info Link](https://github.com/HaiyanHuang98/NWPU-Captions/blob/main/dataset_nwpu.json) | [Paper Link](https://ieeexplore.ieee.org/document/9866055/) | Size: 31,500 images with 157,500 sentences<br>Number of Classes: 45<br>Resolution: 256 x 256<br>Platforms: based on the NWPU-RESISC45 dataset<br>Use: Remote Sensing Image Captioning |
| Remote Sensing Image-Text Match dataset (RSITMD) | [Link](https://drive.google.com/file/d/1NJY86TAAUd8BVs7hyteImv8I2_Lh95W6/view?usp=sharing) | [Paper Link](https://ieeexplore.ieee.org/document/9437331) | Size: 23,715 captions for 4,743 images<br>Number of Classes: 32<br>Resolution: 500 x 500<br>Platforms: RSICD and Google Earth<br>Use: Remote Sensing Image-Text Retrieval |
| PatternNet | [Link](https://nuisteducn1-my.sharepoint.com/:u:/g/personal/zhouwx_nuist_edu_cn/EYSPYqBztbBBqS27B7uM_mEB3R9maNJze8M1Qg9Q6cnPBQ?e=MSf977) | [Paper Link](https://arxiv.org/abs/1706.03424) | Size: 30,400 images<br>Number of Classes: 38<br>Resolution: 256 x 256<br>Platforms: Google Earth imagery and the Google Maps API<br>Use: Remote Sensing Image Retrieval |
| Dense Labeling Remote Sensing Dataset (DLRSD) | [Link](https://nuisteducn1-my.sharepoint.com/:u:/g/personal/zhouwx_nuist_edu_cn/EVjxkus-aXRGnLFxWA5K440B_k-WNNR5-BT1I6LTojuG7g?e=rgSMHi) | [Paper Link](https://www.mdpi.com/2072-4292/10/6/964) | Size: 2,100 images<br>Number of Classes: 21<br>Resolution: 256 x 256<br>Platforms: extension of the UC Merced dataset<br>Use: Remote Sensing Image Retrieval (RSIR), Classification and Semantic Segmentation |
| DIOR-Remote Sensing Visual Grounding Dataset (RSVGD) | [Link](https://drive.google.com/drive/folders/1hTqtYsC6B-m4ED2ewx5oKuYZV13EoJp_) | [Paper Link](https://ieeexplore.ieee.org/document/10056343) | Size: 38,320 RS image-query pairs and 17,402 RS images<br>Number of Classes: 20<br>Resolution: 800 x 800<br>Platforms: DIOR dataset<br>Use: Remote Sensing Visual Grounding |
| OPT-RSVG Dataset | [Link](https://drive.google.com/drive/folders/1e_wOtkruWAB2JXR7aqaMZMrM75IkjqCA?usp=drive_link) | [Paper Link](https://www.researchgate.net/publication/373146282_LaLGA_Multi-Scale_LanguageAware_Visual_Grounding_on_Remote_Sensing_Data) | Size: 25,452 images and 48,952 expressions in English and Chinese<br>Number of Classes: 14<br>Resolution: 800 x 800 |
| Visual Grounding in Remote Sensing Images | [Link](https://sunyuxi.github.io/publication/GeoVG) | [Paper Link](https://dl.acm.org/doi/abs/10.1145/3503161.3548316) | Size: 4,239 images including 5,994 object instances and 7,933 referring expressions<br>Resolution: 1024 x 1024<br>Platforms: multiple sensors and platforms (e.g., Google Earth) |
| Remote Sensing Image Scene Classification (NWPU-RESISC45) | [Link](https://1drv.ms/u/s!AmgKYzARBl5ca3HNaHIlzp_IXjs) | [Paper Link](https://arxiv.org/pdf/1703.00121v1.pdf) | Size: 31,500 images<br>Number of Classes: 45<br>Resolution: 256 x 256<br>Platforms: Google Earth<br>Use: Remote Sensing Image Scene Classification |
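The VQA datasets above typically distribute annotations as JSON files pairing images with question-answer records. A hedged loading sketch with a hypothetical schema (the field names `img_id`, `question`, and `answer` are illustrative only; inspect the downloaded files and adapt them to each dataset's actual format):

```python
import json

# Hypothetical annotation snippet; real RSVQA-style files define their own schema.
raw = """
[
  {"img_id": 0, "question": "Is there a road?", "answer": "yes"},
  {"img_id": 0, "question": "How many buildings are there?", "answer": "4"},
  {"img_id": 1, "question": "Is a water area present?", "answer": "no"}
]
"""

records = json.loads(raw)

# Group question-answer pairs by image, as most VQA data loaders do.
by_image = {}
for r in records:
    by_image.setdefault(r["img_id"], []).append((r["question"], r["answer"]))

print(len(by_image), len(by_image[0]))  # → 2 2
```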

## Related Repositories & Libraries
- [ConfigILM Library](https://github.com/lhackel-tub/ConfigILM)
- [awesome-RSVLM](https://github.com/om-ai-lab/awesome-RSVLM)
- [awesome-remote-sensing-vision-language-models](https://github.com/lzw-lzw/awesome-remote-sensing-vision-language-models)
- [awesome-remote-image-captioning](https://github.com/iOPENCap/awesome-remote-image-captioning)

---

**Stay tuned for continuous updates and improvements! 🚀**