https://github.com/shaadclt/fine-tune-paligemma-image-captioning
This project demonstrates how to fine-tune PaliGemma model for image captioning. The PaliGemma model, developed by Google Research, is designed to handle images and generate corresponding captions.
https://github.com/shaadclt/fine-tune-paligemma-image-captioning
fine-tuning image-captioning paligemma
Last synced: 6 months ago
JSON representation
This project demonstrates how to fine-tune PaliGemma model for image captioning. The PaliGemma model, developed by Google Research, is designed to handle images and generate corresponding captions.
- Host: GitHub
- URL: https://github.com/shaadclt/fine-tune-paligemma-image-captioning
- Owner: shaadclt
- License: mit
- Created: 2024-06-24T02:05:51.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-18T15:57:40.000Z (11 months ago)
- Last Synced: 2025-04-10T01:10:30.872Z (6 months ago)
- Topics: fine-tuning, image-captioning, paligemma
- Language: Jupyter Notebook
- Homepage:
- Size: 408 KB
- Stars: 6
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# Fine-tuning PaliGemma for Image Captioning
This project demonstrates how to fine-tune PaliGemma model for image captioning. The PaliGemma model, developed by Google Research, is designed to handle images and generate corresponding captions.## Overview
The notebook includes the following key steps:### 1. Setup Environment:
- Environment variables for Kaggle are set up to download model checkpoints.
- The Big Vision repository is cloned, and necessary dependencies are installed.
- JAX is configured, and GPU/TPU devices are made invisible to TensorFlow to avoid conflicts.### 2. Download Required Files:
- Model checkpoint and tokenizer are downloaded.
- The dataset for training and validation is downloaded.### 3. Model Definition:
- The PaliGemma model is defined using configuration settings.
- The tokenizer for text processing is loaded.### 4. Parameter Loading:
- Model parameters are loaded and sharded across devices if multiple GPUs are available.
- The parameters are converted to float32 where necessary.### 5. Preprocessing Functions:
- Functions for preprocessing images and tokens are defined to prepare data for the model.
### 6. Training and Validation Data:
- Iterators for training and validation data are created.
- Images and captions from the dataset are preprocessed for training and evaluation.### 7. Training Loop:
A short training loop is implemented using stochastic gradient descent (SGD) and a cosine learning rate schedule.
Model parameters are updated, and training loss is reported at each step.### 8. Evaluation:
- The model's performance is evaluated on a validation dataset.
- Predicted captions for validation images are generated and displayed.### 9. Streamlit Interface:
- A Streamlit interface is created to upload images and generate captions using the trained PALIGEMMA model.
## How to Use
1. Clone the Repository:
- Clone the Big Vision repository and install dependencies.2. Download Model and Data:
- Download the PALIGEMMA model checkpoint, tokenizer, and dataset.3. Run the Notebook:
- Execute the notebook to preprocess data, load the model, and perform training and evaluation.4. Streamlit Interface:
- Run the Streamlit interface to upload an image and generate a caption.
## Conclusion
This project provides a comprehensive guide to fine-tune the PaliGemma model for image captioning. By following the steps outlined in the notebook, you can train the model on a custom dataset and use it to generate captions for new images.## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE.txt) file for details.