# 🖼️ Image Captioning with Fine-Tuned ViT and GPT-2


**Built with:** PyTorch · Hugging Face Transformers · scikit-learn · COCO Dataset · Python · Jupyter Notebook

Welcome to the **Image Captioning** project! This repository implements an advanced image captioning module that leverages state-of-the-art models, including the **ViT-Base-Patch16-224-In21k** (Vision Transformer) as the encoder and **DistilGPT-2** as the decoder. This project aims to generate descriptive captions for images from the COCO dataset, utilizing the powerful capabilities of the Transformers library.

---

## ๐Ÿ“ Project Description

This project focuses on creating an image captioning system by integrating the following key components:

- **Encoder**: The project uses the Google **ViT-Base-Patch16-224-In21k** pretrained model to encode image features. ViT (Vision Transformer) is known for its superior performance in image classification and feature extraction tasks.
- **Decoder**: The **DistilGPT-2** model, a distilled version of GPT-2, is employed to decode the image features into natural language captions. GPT-2 excels at generating coherent and contextually relevant text.
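The two pretrained checkpoints can be tied together with Hugging Face's `VisionEncoderDecoderModel`. The sketch below is a minimal illustration, assuming the Hub checkpoint IDs `google/vit-base-patch16-224-in21k` and `distilgpt2` (inferred from the model names above), and is not necessarily the exact code used in this repository:

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Assumed Hugging Face Hub IDs for the encoder and decoder checkpoints.
encoder_ckpt = "google/vit-base-patch16-224-in21k"
decoder_ckpt = "distilgpt2"

# Combine the pretrained ViT encoder and DistilGPT-2 decoder into one seq2seq model.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_ckpt, decoder_ckpt)

# Image processor for the encoder, tokenizer for the decoder.
image_processor = ViTImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a padding token

# The decoder must know which token starts generation and which token is padding.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```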

### 🎯 Objective

The primary goal is to fine-tune these models on the COCO dataset for the image captioning task. The resulting captions are evaluated using popular NLP metrics like **ROUGE**, **BLEU**, and **BERTScore** to measure their quality and relevance.

---

## 📚 Dataset

The project utilizes the **COCO dataset** (Common Objects in Context), which is a rich dataset consisting of:

- **118,000** training images
- **5,000** validation images
- Each image is paired with **5 corresponding captions**, providing diverse descriptions of the visual content.

This dataset is well-suited for training and evaluating image captioning models due to its variety and scale.
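For reference, a local COCO 2017 download can be read with `torchvision`'s `CocoCaptions` wrapper (which requires `pycocotools`). The paths below are placeholders, and this is only a sketch of the data access, not necessarily the repository's loading code:

```python
from torchvision.datasets import CocoCaptions

# Placeholder paths to a local COCO 2017 download (pycocotools must be installed).
train_ds = CocoCaptions(
    root="coco/train2017",
    annFile="coco/annotations/captions_train2017.json",
)
val_ds = CocoCaptions(
    root="coco/val2017",
    annFile="coco/annotations/captions_val2017.json",
)

image, captions = train_ds[0]   # a PIL image and its list of reference captions
print(len(train_ds), len(val_ds), len(captions))
```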

---

## ⚙️ Implementation Details

### Frameworks & Libraries

- **PyTorch**: The deep learning framework used for model implementation and training.
- **Transformers**: Hugging Face's library is employed to access and fine-tune the ViT and GPT-2 models.

### Model Architecture

- **Vision Transformer (ViT)**: Acts as the encoder, transforming images into feature-rich embeddings.
- **DistilGPT-2**: Serves as the decoder, generating textual descriptions based on the encoded image features.
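Concretely, each training example pairs the ViT processor's `pixel_values` (encoder input) with the tokenized caption as `labels` (decoder target). The following is a hedged sketch, reusing the `image_processor` and `tokenizer` from the earlier snippet and a PIL image/caption pair such as the COCO sample above:

```python
def make_example(image, caption, max_length=64):
    # Encoder input: normalized 224x224 pixel values from the ViT image processor.
    # convert("RGB") guards against the occasional grayscale COCO image.
    pixel_values = image_processor(images=image.convert("RGB"), return_tensors="pt").pixel_values

    # Decoder target: tokenized caption, padded to a fixed length. Padding positions
    # are set to -100 so they are ignored by the cross-entropy loss.
    labels = tokenizer(
        caption,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    ).input_ids
    labels[labels == tokenizer.pad_token_id] = -100

    return {"pixel_values": pixel_values.squeeze(0), "labels": labels.squeeze(0)}
```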

### Training Process

- **Fine-Tuning**: Both models are fine-tuned on the COCO dataset for **2 epochs**, adapting the pretrained encoder and decoder to the specific task of image captioning.
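One common way to drive this fine-tuning is Hugging Face's `Seq2SeqTrainer`. The configuration below is only illustrative: the 2 epochs match the description above, but the other hyperparameters, the output path, and the `train_dataset`/`eval_dataset` names (datasets of dicts like the one built earlier) are assumptions, not values taken from this repository:

```python
import torch
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, default_data_collator

training_args = Seq2SeqTrainingArguments(
    output_dir="vit-distilgpt2-coco",     # illustrative output path
    num_train_epochs=2,                   # fine-tune for 2 epochs, as described above
    per_device_train_batch_size=16,       # illustrative batch size
    learning_rate=5e-5,                   # illustrative learning rate
    fp16=torch.cuda.is_available(),       # mixed precision when a GPU is available
    predict_with_generate=True,           # generate captions during evaluation
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,          # yields {"pixel_values", "labels"} dicts
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)
trainer.train()
```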

---

## 🧪 Evaluation Metrics

The generated captions are evaluated using the following metrics:

- **ROUGE**: Measures the overlap between the predicted and reference captions.
- **BLEU**: Evaluates the precision of n-grams in the generated captions compared to reference captions.
- **BERTScore**: Uses BERT embeddings to assess the semantic similarity between generated and reference captions.
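All three metrics are available through the Hugging Face `evaluate` library. The snippet below is a minimal scoring sketch with placeholder caption strings, not output from this project:

```python
import evaluate

predictions = ["a dog runs across a grassy field"]        # placeholder generated caption
references = [["a dog is running through the grass"]]     # placeholder reference captions

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
print(bertscore.compute(
    predictions=predictions,
    references=[refs[0] for refs in references],   # score against one reference each
    lang="en",
))
```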