Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/deepmancer/vit-gpt2-image-captioning
Fine-tuning an encoder-decoder transformer (ViT-Base-Patch16-224-In21k and DistilGPT2) for image captioning on the COCO dataset
- Host: GitHub
- URL: https://github.com/deepmancer/vit-gpt2-image-captioning
- Owner: deepmancer
- Created: 2023-05-10T20:02:32.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-16T12:29:44.000Z (4 months ago)
- Last Synced: 2024-10-05T20:01:18.908Z (3 months ago)
- Topics: bert, coco-dataset, distilbert, encoder-decoder, gpt-2, image-captioning, imagenet, pre-trained-language-models, pytorch, torch, transformer, vision-transformer
- Language: Jupyter Notebook
- Homepage:
- Size: 8.41 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
README
# 🖼️ Image Captioning with Fine-Tuned ViT and GPT-2
Welcome to the **Image Captioning** project! This repository implements an advanced image captioning module that leverages state-of-the-art models, including the **ViT-Base-Patch16-224-In21k** (Vision Transformer) as the encoder and **DistilGPT-2** as the decoder. This project aims to generate descriptive captions for images from the COCO dataset, utilizing the powerful capabilities of the Transformers library.
---
## Project Description
This project focuses on creating an image captioning system by integrating the following key components:
- **Encoder**: The project uses the Google **ViT-Base-Patch16-224-In21k** pretrained model to encode image features. ViT (Vision Transformer) is known for its superior performance in image classification and feature extraction tasks.
- **Decoder**: The **DistilGPT-2** model, a distilled version of GPT-2, is employed to decode the image features into natural language captions. GPT-2 excels at generating coherent and contextually relevant text.

### 🎯 Objective
The primary goal is to fine-tune these models on the COCO dataset for the image captioning task. The resulting captions are evaluated using popular NLP metrics like **ROUGE**, **BLEU**, and **BERTScore** to measure their quality and relevance.
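In Hugging Face Transformers, this encoder-decoder pairing corresponds to the `VisionEncoderDecoderModel` class. The sketch below shows one way to wire it up; the checkpoint IDs are the standard Hub names for these models, and the exact setup in the notebook may differ:

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Combine the pretrained ViT encoder and DistilGPT-2 decoder
# into a single encoder-decoder model.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "distilgpt2"
)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# GPT-2 ships without a pad token, so reuse EOS for padding/generation.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```

Reusing the EOS token as the pad token is the usual workaround for GPT-2-family decoders, which have no dedicated padding token.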
---
## Dataset
The project utilizes the **COCO dataset** (Common Objects in Context), which is a rich dataset consisting of:
- **118,000** training images
- **5,000** validation images
- Each image is paired with **5 corresponding captions**, providing diverse descriptions of the visual content.

This dataset is well-suited for training and evaluating image captioning models due to its variety and scale.
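As an illustration, the splits can be loaded with torchvision's `CocoCaptions` wrapper; the directory paths below are placeholders for a local COCO 2017 download, not necessarily the layout used in the notebook:

```python
from torchvision import transforms
from torchvision.datasets import CocoCaptions

# ViT-Base-Patch16-224 expects 224x224 RGB input.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Placeholder paths to a local COCO 2017 download.
train_set = CocoCaptions(
    root="coco/train2017",
    annFile="coco/annotations/captions_train2017.json",
    transform=transform,
)

image, captions = train_set[0]  # `captions`: the reference strings (typically 5)
print(len(train_set), captions)
```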
---
## ⚙️ Implementation Details
### Frameworks & Libraries
- **PyTorch**: The deep learning framework used for model implementation and training.
- **Transformers**: Hugging Face's library is employed to access and fine-tune the ViT and GPT-2 models.

### Model Architecture
- **Vision Transformer (ViT)**: Acts as the encoder, transforming images into feature-rich embeddings.
- **DistilGPT-2**: Serves as the decoder, generating textual descriptions based on the encoded image features.

### Training Process
- **Fine-Tuning**: Both models are fine-tuned on the COCO dataset over **2 epochs**. This process adapts the pretrained models to the specific task of image captioning, optimizing their performance on this task.
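A minimal fine-tuning setup along these lines uses the `Seq2SeqTrainer` API. Apart from the 2 epochs, all hyperparameters below are illustrative guesses, and `train_dataset`/`val_dataset` stand for COCO splits already preprocessed into `pixel_values`/`labels` pairs:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="vit-distilgpt2-coco",
    num_train_epochs=2,              # fine-tuned for 2 epochs, per the text
    per_device_train_batch_size=16,  # illustrative value
    learning_rate=5e-5,              # illustrative value
    predict_with_generate=True,      # generate captions during evaluation
    evaluation_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,                  # the VisionEncoderDecoderModel from above
    args=training_args,
    train_dataset=train_dataset,  # assumed: preprocessed COCO train split
    eval_dataset=val_dataset,     # assumed: preprocessed COCO validation split
    tokenizer=tokenizer,
)
trainer.train()
```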
---
## 🧪 Evaluation Metrics
The generated captions are evaluated using the following metrics:
- **ROUGE**: Measures the overlap between the predicted and reference captions.
- **BLEU**: Evaluates the precision of n-grams in the generated captions compared to reference captions.
- **BERTScore**: Uses BERT embeddings to assess the semantic similarity between generated and reference captions.
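A short sketch of computing all three scores with the Hugging Face `evaluate` library, using a toy prediction/reference pair in place of real model outputs:

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

# Toy example: one generated caption scored against one reference.
predictions = ["a man riding a horse on the beach"]
references = ["a person rides a horse along the shore"]

print(rouge.compute(predictions=predictions, references=references))
# BLEU accepts several references per prediction, hence the nested list.
print(bleu.compute(predictions=predictions, references=[references]))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```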