https://github.com/gurveervirk/image-captioning-for-remote-sensing-data
Finetuned Salesforce BLIP for image captioning on the rsicd dataset
- Host: GitHub
- URL: https://github.com/gurveervirk/image-captioning-for-remote-sensing-data
- Owner: gurveervirk
- Created: 2024-03-08T14:32:23.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-17T13:25:28.000Z (over 1 year ago)
- Last Synced: 2025-02-16T02:24:37.897Z (8 months ago)
- Topics: ai, blip, image-captioning, llm, salesforce, transformers, vision-transformer
- Language: Jupyter Notebook
- Homepage:
- Size: 2.08 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Image Captioning for Remote Sensing Data
## Approach
- Surveyed and compared models on HuggingFace and selected two for finetuning:
- [`nlpconnect/vit-gpt2-image-captioning`](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning)
- [`Salesforce/blip-image-captioning-base`](https://huggingface.co/Salesforce/blip-image-captioning-base)
- Finetuned both on the dataset, using the **train** split for training and the **valid** split for evaluation
- Tested them on the **test** split and further tuned the training parameters
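A minimal, illustrative sketch of loading the two selected base models from the HuggingFace Hub; the repo's notebooks may load and configure them differently:

```python
# Illustrative only: load the two candidate captioning models and their
# preprocessing components from the HuggingFace Hub.
from transformers import (
    AutoTokenizer,
    BlipForConditionalGeneration,
    BlipProcessor,
    ViTImageProcessor,
    VisionEncoderDecoderModel,
)

# Salesforce BLIP captioning model + processor (image and text preprocessing)
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# ViT encoder + GPT-2 decoder captioning model, with its image processor and tokenizer
vit_model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
vit_image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
vit_tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
```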
- Tested BLEU scores for these models on the **valid** split:
- Base models:
- nlpconnect/vit-gpt2-image-captioning: 0.58
- Salesforce/blip-image-captioning-base: 0.51
- Finetuned (best):
- nlpconnect/vit-gpt2-image-captioning: 0.55
- Salesforce/blip-image-captioning-base: 0.56
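A hedged sketch of how such BLEU scores can be computed for generated captions. The README does not show the scoring code, so the use of the HuggingFace `evaluate` library below is an assumption, and the captions are dummy examples:

```python
# Assumption: BLEU computed with the HuggingFace `evaluate` library; the repo
# may use a different BLEU implementation. The captions below are made up.
import evaluate

bleu = evaluate.load("bleu")

predictions = ["many buildings are around a large green square"]  # model outputs
references = [[                                                   # rsicd provides several references per image
    "many buildings surround a green square",
    "a green square is surrounded by many buildings",
]]

print(bleu.compute(predictions=predictions, references=references)["bleu"])
```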
- Code Explained:
- General:
- Used [`rsicd`](https://huggingface.co/datasets/arampacha/rsicd) dataset from HuggingFace
- learning_rate = 5e-7 worked best for this task: it lets the model learn the caption mapping properly, but training is slow (a higher number of epochs is needed for effective finetuning)
- **AdamW** optimizer was used for its decoupled weight decay, which generally makes transformer finetuning more effective
- **ReduceLROnPlateau** scheduler was used to help prevent overfitting (it reduces the learning rate when the validation loss plateaus)
- num_epochs = 5
- Custom Dataset classes were defined to retrieve data during training
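A minimal sketch of the setup described above (a custom Dataset over rsicd, AdamW at 5e-7, ReduceLROnPlateau over 5 epochs). The class name, batch size, and the rsicd column name `captions` are assumptions, not the repo's exact code:

```python
# Illustrative training setup for BLIP; hyperparameters mirror the README,
# but this is not the repository's exact code.
import torch
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from transformers import BlipForConditionalGeneration, BlipProcessor

class RSICDCaptionDataset(Dataset):
    """Custom Dataset that returns processed image + caption pairs."""
    def __init__(self, split, processor):
        self.data = load_dataset("arampacha/rsicd", split=split)
        self.processor = processor

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        enc = self.processor(
            images=item["image"],
            text=item["captions"][0],  # first of the reference captions (column name assumed)
            padding="max_length",
            max_length=64,
            truncation=True,
            return_tensors="pt",
        )
        return {k: v.squeeze(0) for k, v in enc.items()}

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

train_loader = DataLoader(RSICDCaptionDataset("train", processor), batch_size=8, shuffle=True)
valid_loader = DataLoader(RSICDCaptionDataset("valid", processor), batch_size=8)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)  # lr from the README
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=1)

num_epochs = 5
# Training loop (per epoch): forward pass with pixel_values and labels,
# backprop the loss, then call scheduler.step(validation_loss).
```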
- [`trinit-hackathon-blip-training.ipynb`](https://github.com/gurveervirk/TRINIT-Prometheans-ML/blob/main/training/trinit-hackathon-blip-training.ipynb):
- Used [`accelerate`](https://huggingface.co/docs/accelerate/en/index) from HuggingFace to optimize training. This allowed us to use the full memory and compute capability of the available GPUs (distributed / multi-GPU) and dropped training time per epoch from 4 hours to 1 hour
- Used **notebook_launcher** to launch the training function
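A toy illustration of the `accelerate` + `notebook_launcher` pattern described above; the repo's real training function finetunes BLIP on rsicd, while the stand-in linear model below only shows how `accelerate` is wired in:

```python
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator, notebook_launcher

def training_loop():
    accelerator = Accelerator()

    # Stand-ins so the sketch stays self-contained; the notebook prepares the
    # BLIP model, AdamW optimizer and rsicd DataLoaders here instead.
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)
    loader = DataLoader(torch.randn(64, 10), batch_size=8)

    # accelerate places everything on the available device(s) and handles
    # distributed / multi-GPU training transparently
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    model.train()
    for batch in loader:
        loss = model(batch).pow(2).mean()
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# num_processes=1 on a single GPU; increase for multi-GPU runs
notebook_launcher(training_loop, num_processes=1)
```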
- [`trinit-hackathon-vit-training.ipynb`](https://github.com/gurveervirk/TRINIT-Prometheans-ML/blob/main/training/trinit-hackathon-vit-training.ipynb):
- Used the HuggingFace [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) API to optimize training. It handles the training loop very effectively (2.5 hours for 5 epochs, compared to 5 hours for BLIP) while allowing extensive customization
- Used **Seq2SeqTrainer** and related functions from `Trainer` to train the model
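A hedged sketch of Seq2SeqTrainer-based finetuning for the ViT-GPT2 captioner. The epoch count and learning rate mirror the README; the preprocessing, batch size, and rsicd column names (`image`, `captions`) are assumptions rather than the repo's exact configuration:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    ViTImageProcessor,
    VisionEncoderDecoderModel,
    default_data_collator,
)

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def preprocess(example):
    pixel_values = image_processor(example["image"], return_tensors="pt").pixel_values[0]
    # In a full setup, pad positions in the labels would usually be masked with -100
    labels = tokenizer(
        example["captions"][0], padding="max_length", max_length=64, truncation=True
    ).input_ids
    return {"pixel_values": pixel_values, "labels": labels}

raw = load_dataset("arampacha/rsicd")
train_ds = raw["train"].map(preprocess, remove_columns=raw["train"].column_names)
valid_ds = raw["valid"].map(preprocess, remove_columns=raw["valid"].column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="vit-gpt2-rsicd-finetuned",
    num_train_epochs=5,           # from the README
    learning_rate=5e-7,           # from the README
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=default_data_collator,
)
trainer.train()
```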
- Conclusion:
- Based on the BLEU scores, the BLIP model captured context better than ViT, and it also generated short, crisp captions where ViT sometimes did not
- ViT was much faster to train, taking roughly half the time (2.5 hours vs. 5 hours)
- The AdamW optimizer and ReduceLROnPlateau scheduler with learning_rate = 5e-7 trained the models best, albeit slowly

## Video Demos
- [`Combined Video`](https://drive.google.com/file/d/1DHO_ek6zn2u86l4WRYvZ5I0N8--bQNrn/view)
- Individual Videos:
- [`Model Training Approach`](https://drive.google.com/file/d/1db2b6i9j7Wlbq7zl5nATT0_TlTeHl5mb/view?usp=sharing)
- [`Website integrated model`](https://drive.google.com/file/d/1tDcs57KPvCCVgJS1L2XMhpj6UXomFKuv/view?usp=drive_link)

## Miscellaneous
- Model Links:
- [`ViT finetuned`](https://www.kaggle.com/datasets/gurveersinghvirk/clip-gpt2-rsicd-finetuned-5-epochs)
- [`BLIP finetuned`](https://huggingface.co/Gurveer05/blip-image-captioning-base-rscid-finetuned/)
- Kaggle Notebook Links:
- [`Training`](https://www.kaggle.com/code/gurveersinghvirk/trinit-hackathon/)
- [`Testing`](https://www.kaggle.com/code/gurveersinghvirk/trinit-hackathon-testing/)