# 🖼️ Image Captioning with Fine-Tuned ViT and GPT-2


**Built with:** PyTorch · Hugging Face Transformers · scikit-learn · COCO Dataset · Python · Jupyter Notebook

Welcome to the **Image Captioning** project! This repository implements an advanced image captioning module that leverages state-of-the-art models, including the **ViT-Base-Patch16-224-In21k** (Vision Transformer) as the encoder and **DistilGPT-2** as the decoder. This project aims to generate descriptive captions for images from the COCO dataset, utilizing the powerful capabilities of the Transformers library.

---

## ๐Ÿ“ Project Description

This project focuses on creating an image captioning system by integrating the following key components:

- **Encoder**: The project uses the Google **ViT-Base-Patch16-224-In21k** pretrained model to encode image features. ViT (Vision Transformer) is known for its superior performance in image classification and feature extraction tasks.
- **Decoder**: The **DistilGPT-2** model, a distilled version of GPT-2, is employed to decode the image features into natural language captions. GPT-2 excels at generating coherent and contextually relevant text.
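The two pretrained checkpoints can be tied together with Hugging Face's `VisionEncoderDecoderModel`. The sketch below is a minimal illustration, assuming the Hub checkpoint IDs `google/vit-base-patch16-224-in21k` and `distilgpt2` (inferred from the model names above), and is not necessarily the exact code used in this repository:

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Assumed Hugging Face Hub IDs for the encoder and decoder checkpoints.
encoder_ckpt = "google/vit-base-patch16-224-in21k"
decoder_ckpt = "distilgpt2"

# Combine the pretrained ViT encoder and DistilGPT-2 decoder into one seq2seq model.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_ckpt, decoder_ckpt)

# Image processor for the encoder, tokenizer for the decoder.
image_processor = ViTImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a padding token

# The decoder must know which token starts generation and which token is padding.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```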

### 🎯 Objective

The primary goal is to fine-tune these models on the COCO dataset for the image captioning task. The resulting captions are evaluated using popular NLP metrics like **ROUGE**, **BLEU**, and **BERTScore** to measure their quality and relevance.

---

## 📚 Dataset

The project utilizes the **COCO dataset** (Common Objects in Context), which is a rich dataset consisting of:

- **118,000** training images
- **5,000** validation images
- Each image is paired with **5 corresponding captions**, providing diverse descriptions of the visual content.

This dataset is well-suited for training and evaluating image captioning models due to its variety and scale.
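For reference, a local COCO 2017 download can be read with `torchvision`'s `CocoCaptions` wrapper (which requires `pycocotools`). The paths below are placeholders, and this is only a sketch of the data access, not necessarily the repository's loading code:

```python
from torchvision.datasets import CocoCaptions

# Placeholder paths to a local COCO 2017 download (pycocotools must be installed).
train_ds = CocoCaptions(
    root="coco/train2017",
    annFile="coco/annotations/captions_train2017.json",
)
val_ds = CocoCaptions(
    root="coco/val2017",
    annFile="coco/annotations/captions_val2017.json",
)

image, captions = train_ds[0]   # a PIL image and its list of reference captions
print(len(train_ds), len(val_ds), len(captions))
```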

---

## ⚙️ Implementation Details

### Frameworks & Libraries

- **PyTorch**: The deep learning framework used for model implementation and training.
- **Transformers**: Hugging Face's library is employed to access and fine-tune the ViT and GPT-2 models.

### Model Architecture

- **Vision Transformer (ViT)**: Acts as the encoder, transforming images into feature-rich embeddings.
- **DistilGPT-2**: Serves as the decoder, generating textual descriptions based on the encoded image features.
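Concretely, each training example pairs the ViT processor's `pixel_values` (encoder input) with the tokenized caption as `labels` (decoder target). The following is a hedged sketch, reusing the `image_processor` and `tokenizer` from the earlier snippet and a PIL image/caption pair such as the COCO sample above:

```python
def make_example(image, caption, max_length=64):
    # Encoder input: normalized 224x224 pixel values from the ViT image processor.
    # convert("RGB") guards against the occasional grayscale COCO image.
    pixel_values = image_processor(images=image.convert("RGB"), return_tensors="pt").pixel_values

    # Decoder target: tokenized caption, padded to a fixed length. Padding positions
    # are set to -100 so they are ignored by the cross-entropy loss.
    labels = tokenizer(
        caption,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    ).input_ids
    labels[labels == tokenizer.pad_token_id] = -100

    return {"pixel_values": pixel_values.squeeze(0), "labels": labels.squeeze(0)}
```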

### Training Process

- **Fine-Tuning**: Both models are fine-tuned on the COCO dataset for **2 epochs**, adapting the pretrained encoder and decoder to the specific task of image captioning.
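One common way to drive this fine-tuning is Hugging Face's `Seq2SeqTrainer`. The configuration below is only illustrative: the 2 epochs match the description above, but the other hyperparameters, the output path, and the `train_dataset`/`eval_dataset` names (datasets of dicts like the one built earlier) are assumptions, not values taken from this repository:

```python
import torch
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, default_data_collator

training_args = Seq2SeqTrainingArguments(
    output_dir="vit-distilgpt2-coco",     # illustrative output path
    num_train_epochs=2,                   # fine-tune for 2 epochs, as described above
    per_device_train_batch_size=16,       # illustrative batch size
    learning_rate=5e-5,                   # illustrative learning rate
    fp16=torch.cuda.is_available(),       # mixed precision when a GPU is available
    predict_with_generate=True,           # generate captions during evaluation
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,          # yields {"pixel_values", "labels"} dicts
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)
trainer.train()
```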

---

## 🧪 Evaluation Metrics

The generated captions are evaluated using the following metrics:

- **ROUGE**: Measures the overlap between the predicted and reference captions.
- **BLEU**: Evaluates the precision of n-grams in the generated captions compared to reference captions.
- **BERTScore**: Uses BERT embeddings to assess the semantic similarity between generated and reference captions.
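All three metrics are available through the Hugging Face `evaluate` library. The snippet below is a minimal scoring sketch with placeholder caption strings, not output from this project:

```python
import evaluate

predictions = ["a dog runs across a grassy field"]        # placeholder generated caption
references = [["a dog is running through the grass"]]     # placeholder reference captions

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
print(bertscore.compute(
    predictions=predictions,
    references=[refs[0] for refs in references],   # score against one reference each
    lang="en",
))
```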