https://github.com/sreyash1mohanty/image-captioning
Image captioning model using Keras
- Host: GitHub
- URL: https://github.com/sreyash1mohanty/image-captioning
- Owner: sreyash1mohanty
- Created: 2025-02-11T14:17:30.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-11T14:40:03.000Z (8 months ago)
- Last Synced: 2025-02-11T15:27:44.736Z (8 months ago)
- Topics: deep-learning, image-captioning, keras, keras-tensorflow, lstm, neural-network, resnet-50
- Language: Jupyter Notebook
- Homepage:
- Size: 19.8 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Image Captioning using Deep Learning
## 📌 Overview
This project implements an **Image Captioning Model** using **Keras** and **Transfer Learning**. The model generates captions for images by combining a **Convolutional Neural Network (CNN)** for feature extraction with a **Recurrent Neural Network (RNN)** built on **LSTM** for sequence generation. The model is trained on the Flickr8k dataset.

## 🚀 Key Features
- Uses **ResNet50** (pretrained) for image feature extraction (2048 features per image).
- Applies **Global Average Pooling** to modify the ResNet50 output layer.
- Utilizes **GloVe 6B50d embeddings** for word representation.
- Implements an **LSTM-based decoder** for sequential caption generation.
- Uses a **Custom Data Generator** to efficiently preprocess captions and images.
- Trained on a dataset with a vocabulary size of **1848 words**.
- Implements **Dropout Regularization** to prevent overfitting.

## 🏗️ Model Architecture
The model consists of two main parts:

### 1️⃣ Feature Extractor (CNN - ResNet50)
- **Input:** Image
- **Output:** 2048-dimensional feature vector
- **Modifications:** Replaced the final classification layer with **Global Average Pooling** (see the sketch below)
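A minimal sketch of how such an extractor can be built, assuming the standard `tensorflow.keras.applications.ResNet50` with `pooling='avg'` (which applies Global Average Pooling to the final convolutional output); exact details may differ from the notebook:

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# ResNet50 without its classification head; 'avg' pooling turns the
# final 7x7x2048 feature map into a single 2048-dim vector per image.
feature_extractor = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def extract_features(image_path):
    img = load_img(image_path, target_size=(224, 224))  # ResNet50 input size
    x = img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))     # shape (1, 224, 224, 3)
    return feature_extractor.predict(x, verbose=0)[0]   # shape (2048,)
```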
### 2️⃣ Caption Generator (RNN - LSTM)
- **Input:** Tokenized captions
- **Embedding Layer:** Uses pre-trained **GloVe 6B50d** word embeddings
- **LSTM Layer:** Generates sequential words based on input captions and image features
- **Fully Connected Layers:** Dense layers for final word prediction

### 🔹 Neural Network Layers
```python
from tensorflow.keras.layers import Input, Dropout, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Image features generated by ResNet50 are passed in here
input_img_features = Input(shape=(2048,))
inp_img1 = Dropout(0.3)(input_img_features)
inp_img2 = Dense(256, activation='relu')(inp_img1)

# Caption processing (max_len and vocab_size come from preprocessing)
input_captions = Input(shape=(max_len,))
inp_cap1 = Embedding(input_dim=vocab_size, output_dim=50, mask_zero=True)(input_captions)
inp_cap2 = Dropout(0.3)(inp_cap1)
inp_cap3 = LSTM(256)(inp_cap2)

# Decoder: merge the image and caption branches, then predict the next word
decoder1 = add([inp_img2, inp_cap3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

# Model definition
model = Model(inputs=[input_img_features, input_captions], outputs=outputs)
```
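The README does not show the inference loop, but with this architecture captions are typically generated greedily, one word at a time. A minimal sketch, assuming hypothetical `word_to_idx`/`idx_to_word` dictionaries built from the tokenizer and `startseq`/`endseq` boundary tokens (these names are illustrative, not from the notebook):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(photo_features, model, word_to_idx, idx_to_word, max_len):
    """Greedy decoding: feed the caption-so-far back in until 'endseq'."""
    caption = 'startseq'
    for _ in range(max_len):
        seq = [word_to_idx[w] for w in caption.split() if w in word_to_idx]
        seq = pad_sequences([seq], maxlen=max_len)
        preds = model.predict([photo_features.reshape(1, 2048), seq], verbose=0)
        word = idx_to_word[int(preds.argmax())]
        caption += ' ' + word
        if word == 'endseq':
            break
    return caption.replace('startseq', '').replace('endseq', '').strip()
```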
## 📊 Data Preprocessing and Transfer Learning
### 🔹 Captions
- Cleaned captions by removing punctuation and special characters.
- Tokenized captions and built a vocabulary of **1848 unique words**.
- Applied **GloVe 6B50d word embeddings** to map words into vector space (see the sketch below).
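A minimal sketch of building the embedding matrix from the GloVe file, assuming the standard `glove.6B.50d.txt` download and a Keras `Tokenizer` already fitted on the cleaned captions (`tokenizer` and `vocab_size` are assumed to exist):

```python
import numpy as np

embedding_dim = 50
embeddings_index = {}

# Each line of glove.6B.50d.txt is: word v1 v2 ... v50
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Rows of the matrix line up with the tokenizer's word indices;
# words missing from GloVe keep an all-zero row.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None and idx < vocab_size:
        embedding_matrix[idx] = vector
```

This matrix would typically be handed to the `Embedding` layer above via `weights=[embedding_matrix]`, optionally with `trainable=False`.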
### 🔹 Images
- Resized all images to the input size required by **ResNet50** (224×224).
- Extracted **2048-dimensional feature vectors** using the **ResNet50 base**.
- Stored preprocessed image features for efficient training (see the sketch below).
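One common way to precompute and cache the features, reusing the `extract_features` helper sketched earlier; the directory and file names here are illustrative:

```python
import os
import pickle

features = {}
image_dir = 'Flickr8k_Dataset/Images'
for fname in os.listdir(image_dir):
    image_id = os.path.splitext(fname)[0]
    features[image_id] = extract_features(os.path.join(image_dir, fname))

# Cache to disk so training never has to run ResNet50 again
with open('image_features.pkl', 'wb') as f:
    pickle.dump(features, f)
```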
## 🏋️ Training
### 🔹 Loss Function
- The model is trained using **Categorical Cross-Entropy Loss**.

### 🔹 Optimizer
- Used the **Adam Optimizer** with a learning rate of `0.001` (compile sketch below).
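Putting the stated loss and optimizer together, compilation would look roughly like this (the README confirms the loss and learning rate; the rest is a minimal sketch):

```python
from tensorflow.keras.optimizers import Adam

model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001))
```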
### 🔹 Batch Processing
- Used a **Custom Data Generator** to efficiently process large datasets in batches (see the sketch below).
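A minimal sketch of what such a generator might look like: for each caption it yields every (image features, partial caption → next word) training pair, with the next word one-hot encoded. Names like `captions_dict` and `features` are illustrative, not taken from the notebook:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(captions_dict, features, word_to_idx,
                   max_len, vocab_size, batch_size):
    X_img, X_seq, y = [], [], []
    while True:  # Keras generators loop forever
        for image_id, captions in captions_dict.items():
            for caption in captions:
                seq = [word_to_idx[w] for w in caption.split() if w in word_to_idx]
                # One training sample per prefix: predict word i from words < i
                for i in range(1, len(seq)):
                    X_img.append(features[image_id])
                    X_seq.append(pad_sequences([seq[:i]], maxlen=max_len)[0])
                    y.append(to_categorical(seq[i], num_classes=vocab_size))
                    if len(y) == batch_size:
                        yield [np.array(X_img), np.array(X_seq)], np.array(y)
                        X_img, X_seq, y = [], [], []
```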
## ⚡ Future Improvements
- Train on **larger datasets** to improve generalization.

### 🔹 Required Libraries
- `TensorFlow / Keras`
- `NLTK`
- `NumPy`
- `Pandas`
- `Matplotlib`