
## Image Caption Generator

[![Issues](https://img.shields.io/github/issues/dabasajay/Image-Caption-Generator.svg?color=%231155cc)](https://github.com/dabasajay/Image-Caption-Generator/issues)
[![Forks](https://img.shields.io/github/forks/dabasajay/Image-Caption-Generator.svg?color=%231155cc)](https://github.com/dabasajay/Image-Caption-Generator/network)
[![Stars](https://img.shields.io/github/stars/dabasajay/Image-Caption-Generator.svg?color=%231155cc)](https://github.com/dabasajay/Image-Caption-Generator/stargazers)
[![Ajay Dabas](https://img.shields.io/badge/Ajay-Dabas-825ee4.svg)](https://dabasajay.github.io/)

A neural network to generate captions for an image using CNN and RNN with BEAM Search.


**Examples**

*Example of image captioning (image credits: Towards Data Science)*

## Table of Contents

1. [Requirements](#1-requirements)
2. [Training parameters and results](#2-training-parameters-and-results)
3. [Generated Captions on Test Images](#3-generated-captions-on-test-images)
4. [Procedure to Train Model](#4-procedure-to-train-model)
5. [Procedure to Test on new images](#5-procedure-to-test-on-new-images)
6. [Configurations (config.py)](#6-configurations-configpy)
7. [Frequently encountered problems](#7-frequently-encountered-problems)
8. [TODO](#8-todo)
9. [References](#9-references)

## 1. Requirements

Recommended system requirements for training the model:

- A good CPU and a GPU with at least 8GB of memory
- At least 8GB of RAM
- An active internet connection, so that Keras can download the InceptionV3/VGG16 model weights

Required Python libraries, with the versions used while developing and testing this project:

- Python - 3.6.7
- Numpy - 1.16.4
- Tensorflow - 1.13.1
- Keras - 2.2.4
- nltk - 3.2.5
- PIL - 4.3.0
- Matplotlib - 3.0.3
- tqdm - 4.28.1

Flickr8k Dataset: Dataset Request Form

UPDATE (April/2019): The official site seems to have been taken down (although the form still works). Here are some direct download links:

**Important:** After downloading the dataset, put the required files in the `train_val_data` folder.
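A minimal sketch of reading the captions file into a dictionary, assuming the common Flickr8k `Flickr8k.token.txt` layout in which each line is `<image>.jpg#<n>`, a tab, then the caption. The function name `load_captions`, the lowercase/punctuation cleanup, and the `startseq`/`endseq` markers are illustrative assumptions, not necessarily this repository's exact preprocessing.

```python
import string
from collections import defaultdict
from typing import Dict, Iterable, List


def load_captions(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map image id -> list of cleaned captions wrapped in (assumed) start/end markers."""
    captions = defaultdict(list)
    remove_punct = str.maketrans("", "", string.punctuation)  # deletes ASCII punctuation
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        image_tag, caption = line.split("\t", 1)
        # '1000268201_693b08cb0e.jpg#0' -> '1000268201_693b08cb0e'
        image_id = image_tag.split("#")[0].rsplit(".", 1)[0]
        words = caption.lower().translate(remove_punct).split()
        captions[image_id].append("startseq " + " ".join(words) + " endseq")
    return dict(captions)


# Sample lines in the assumed token-file format.
sample = [
    "1000268201_693b08cb0e.jpg#0\tA child in a pink dress is climbing up a set of stairs in an entry way .",
    "1000268201_693b08cb0e.jpg#1\tA girl going into a wooden building .",
]
print(load_captions(sample)["1000268201_693b08cb0e"][1])
# startseq a girl going into a wooden building endseq
```

With a real file, the same function accepts an open file object for the path stored in `captions_path`.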

## 2. Training parameters and results

#### NOTE

- `batch_size=64` used ~14GB of GPU memory for *InceptionV3 + AlternativeRNN* and *VGG16 + AlternativeRNN*
- `batch_size=64` used ~8GB of GPU memory for *InceptionV3 + RNN* and *VGG16 + RNN*
- **If you're low on memory**, use Google Colab or reduce the batch size
- For BEAM Search, `loss` and `val_loss` are the same as for argmax, since both use the same trained model; only the decoding strategy differs

**Crossentropy loss** *(lower is better)*

| Model | Epochs | Batch Size | Optimizer | loss (train_loss) | val_loss |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **InceptionV3 + AlternativeRNN** | 20 | 64 | Adam | 2.4050 | 3.0527 |
| **InceptionV3 + RNN** | 11 | 64 | Adam | 2.5254 | 3.1769 |
| **VGG16 + AlternativeRNN** | 18 | 64 | Adam | 2.2880 | 3.1889 |
| **VGG16 + RNN** | 7 | 64 | Adam | 2.6297 | 3.3486 |

**BLEU Scores on Validation data** *(higher is better)*

| Model | Decoding | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **InceptionV3 + AlternativeRNN** | Argmax | 0.596818 | 0.356009 | 0.252489 | 0.129536 |
| **InceptionV3 + AlternativeRNN** | BEAM Search, k = 3 | 0.606086 | 0.359171 | 0.249124 | 0.126599 |
| **InceptionV3 + RNN** | Argmax | 0.601791 | 0.344289 | 0.230025 | 0.108898 |
| **InceptionV3 + RNN** | BEAM Search, k = 3 | 0.605097 | 0.356094 | 0.251132 | 0.129900 |
| **VGG16 + AlternativeRNN** | Argmax | 0.596655 | 0.342127 | 0.229676 | 0.108707 |
| **VGG16 + AlternativeRNN** | BEAM Search, k = 3 | 0.593876 | 0.348569 | 0.242063 | 0.123221 |
| **VGG16 + RNN** | Argmax | 0.557626 | 0.317652 | 0.216636 | 0.105288 |
| **VGG16 + RNN** | BEAM Search, k = 3 | 0.568993 | 0.326569 | 0.226629 | 0.113102 |
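The BLEU-n figures above are presumably cumulative corpus-level BLEU with uniform weights over 1- to n-grams (for example, nltk's `corpus_bleu` with `weights=(0.5, 0.5, 0, 0)` for BLEU-2); that is an assumption, since the evaluation code is not shown here. A dependency-free sketch of the metric, with clipped n-gram precision, a geometric mean and a brevity penalty but no smoothing, so it may differ from nltk in edge cases such as zero matches:

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references: List[List[List[str]]], hypotheses: List[List[str]],
                max_n: int = 4) -> float:
    """Cumulative BLEU-max_n with uniform weights 1/max_n and no smoothing."""
    clipped = [0] * max_n  # hypothesis n-grams matched, clipped by max count in any reference
    total = [0] * max_n    # all hypothesis n-grams
    hyp_len = ref_len = 0
    for refs, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        # Closest reference length; ties go to the shorter reference.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)  # element-wise max over the references
            clipped[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(clipped) == 0:
        return 0.0  # some n-gram order has no matches and no smoothing is applied
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


refs = [[["a", "man", "rides", "a", "bike"]]]
print(corpus_bleu(refs, [["a", "man", "rides", "a", "bike"]]))  # 1.0
```

Before scoring, captions would typically be split into tokens with any start/end markers removed (also an assumption about the preprocessing).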

## 3. Generated Captions on Test Images

**Model used** - *InceptionV3 + AlternativeRNN*

| Image | Argmax | BEAM Search, k=3 |
| :---: | :--- | :--- |
| Image 1 | A man in a blue shirt is riding a bike on a dirt path. | A man is riding a bicycle on a dirt path. |
| Image 2 | A man in a red kayak is riding down a waterfall. | A man on a surfboard is riding a wave. |
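The *Argmax* captions come from greedy decoding, which appends only the single most probable next word at every step. A minimal sketch, assuming a hypothetical `predict_next(image_features, sequence)` callable that returns a probability for each vocabulary word, plus `startseq`/`endseq` markers; the toy table stands in for the trained CNN + RNN and is purely illustrative.

```python
from typing import Callable, Dict, List

# Hypothetical model interface: (image features, words so far) -> {word: probability}.
NextWordFn = Callable[[object, List[str]], Dict[str, float]]


def greedy_caption(predict_next: NextWordFn, image_features: object, max_length: int,
                   start: str = "startseq", end: str = "endseq") -> str:
    """Argmax decoding: append the single most probable word at every step."""
    sequence = [start]
    for _ in range(max_length):
        probs = predict_next(image_features, sequence)
        word = max(probs, key=probs.get)  # argmax over the vocabulary
        if word == end:
            break
        sequence.append(word)
    return " ".join(sequence[1:])  # drop the start marker


# Toy stand-in for the trained network: probabilities depend only on the last word.
TOY_TABLE = {
    "startseq": {"a": 0.6, "the": 0.4},
    "a": {"man": 0.7, "dog": 0.3},
    "man": {"rides": 0.6, "endseq": 0.4},
    "rides": {"a": 0.2, "endseq": 0.8},
}


def toy_predict_next(image_features: object, sequence: List[str]) -> Dict[str, float]:
    return TOY_TABLE.get(sequence[-1], {"endseq": 1.0})


print(greedy_caption(toy_predict_next, None, max_length=10))  # a man rides
```

Because only one word survives each step, an early high-probability choice can lock the caption into a path that is less likely overall; this is the weakness BEAM search addresses.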

## 4. Procedure to Train Model

1. Clone the repository to preserve the directory structure: `git clone https://github.com/dabasajay/Image-Caption-Generator.git`
2. Put the required dataset files in the `train_val_data` folder (the files are listed in the README there).
3. Review `config.py` for paths and other configurations (explained below).
4. Run `train_val.py`.
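A common way to train CNN + RNN captioning models is to turn every caption into "words so far → next word" examples and feed them to the network in shuffled batches. The sketch below assumes that scheme and a word-to-index vocabulary; `caption_to_pairs` and `batch_generator` are hypothetical names, and `train_val.py` may organise its data generator differently.

```python
import random
from typing import Dict, Iterator, List, Tuple

# (image id, indices of the words seen so far, index of the next word to predict)
Example = Tuple[str, List[int], int]


def caption_to_pairs(image_id: str, caption: str, word_to_index: Dict[str, int]) -> List[Example]:
    """Split one caption into next-word prediction examples.

    'startseq a dog endseq' -> [startseq]->a, [startseq a]->dog, [startseq a dog]->endseq.
    Words missing from the vocabulary are skipped.
    """
    indices = [word_to_index[w] for w in caption.split() if w in word_to_index]
    return [(image_id, indices[:i], indices[i]) for i in range(1, len(indices))]


def batch_generator(captions: Dict[str, List[str]], word_to_index: Dict[str, int],
                    batch_size: int, seed: int = 0) -> Iterator[List[Example]]:
    """Yield shuffled batches forever, reshuffling at the start of every pass."""
    examples = [pair
                for image_id, caps in captions.items()
                for cap in caps
                for pair in caption_to_pairs(image_id, cap, word_to_index)]
    if not examples:
        raise ValueError("no training examples; check the captions and vocabulary")
    rng = random.Random(seed)  # seeded so the shuffling order is reproducible
    while True:  # Keras-style generators never stop; the caller decides when an epoch ends
        rng.shuffle(examples)
        for start in range(0, len(examples), batch_size):
            yield examples[start:start + batch_size]


w2i = {"startseq": 1, "a": 2, "dog": 3, "runs": 4, "endseq": 5}
gen = batch_generator({"img1": ["startseq a dog runs endseq"]}, w2i, batch_size=3)
print(len(next(gen)), len(next(gen)))  # 3 1
```

Before a batch reaches Keras, the input sequences would still need padding to `max_length`, the image ids would be replaced by precomputed CNN features, and the targets would be one-hot encoded (or paired with a sparse loss); those steps are omitted here.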

## 5. Procedure to Test on new images

1. Clone the repository to preserve the directory structure: `git clone https://github.com/dabasajay/Image-Caption-Generator.git`
2. Train the model to generate the required files in the `model_data` folder (steps given above).
3. Put the test images in the `test_data` folder.
4. Review `config.py` for paths and other configurations (explained below).
5. Run `test.py`.

## 6. Configurations (config.py)

**config**

1. **`images_path`** :- Folder path containing the Flickr dataset images
2. `train_data_path` :- .txt file path containing image ids for training
3. `val_data_path` :- .txt file path containing image ids for validation
4. `captions_path` :- .txt file path containing captions
5. `tokenizer_path` :- Path for saving the tokenizer
6. `model_data_path` :- Path for saving files related to the model
7. **`model_load_path`** :- Path for loading the trained model
8. **`num_of_epochs`** :- Number of epochs
9. **`max_length`** :- Maximum caption length. This is set manually after training the model and is required by `test.py`
10. **`batch_size`** :- Batch size for training (a larger batch size consumes more GPU & CPU memory)
11. **`beam_search_k`** :- BEAM search width `k`: the number of candidate captions the algorithm keeps at each decoding step
12. `test_data_path` :- Folder path containing images for testing/inference
13. **`model_type`** :- CNN model to use: `inceptionv3` or `vgg16`
14. **`random_seed`** :- Random seed for reproducibility of results

**rnnConfig**

1. **`embedding_size`** :- Embedding size used in the decoder (RNN) model
2. **`LSTM_units`** :- Number of LSTM units in the decoder (RNN) model
3. **`dense_units`** :- Number of Dense units in the decoder (RNN) model
4. **`dropout`** :- Dropout probability used in the Dropout layer of the decoder (RNN) model
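A sketch of how the two dictionaries might look in `config.py`. The key names come from the lists above; `num_of_epochs = 20`, `batch_size = 64`, `beam_search_k = 3` and `model_type = "inceptionv3"` match values quoted elsewhere in this README, while every path and every other number is a hypothetical placeholder. The `validate` helper is also hypothetical, added only to show which settings are constrained.

```python
# Illustrative config.py layout. Values marked "placeholder" are hypothetical.
config = {
    "images_path": "train_val_data/images/",           # placeholder
    "train_data_path": "train_val_data/train_ids.txt",  # placeholder
    "val_data_path": "train_val_data/val_ids.txt",      # placeholder
    "captions_path": "train_val_data/captions.txt",     # placeholder
    "tokenizer_path": "model_data/tokenizer.pkl",       # placeholder
    "model_data_path": "model_data/",                   # placeholder
    "model_load_path": "model_data/model.hdf5",         # placeholder
    "num_of_epochs": 20,          # as in the InceptionV3 + AlternativeRNN run
    "max_length": 40,             # placeholder; set after training the model
    "batch_size": 64,             # as in the results tables
    "beam_search_k": 3,           # as in the results tables
    "test_data_path": "test_data/",
    "model_type": "inceptionv3",  # or "vgg16"
    "random_seed": 1035,          # placeholder
}

rnnConfig = {
    "embedding_size": 300,  # placeholder
    "LSTM_units": 256,      # placeholder
    "dense_units": 256,     # placeholder
    "dropout": 0.3,         # placeholder
}


def validate(cfg: dict, rnn_cfg: dict) -> None:
    """Hypothetical sanity checks for the constrained settings."""
    if cfg["model_type"] not in ("inceptionv3", "vgg16"):
        raise ValueError("model_type must be 'inceptionv3' or 'vgg16'")
    for key in ("num_of_epochs", "batch_size", "beam_search_k", "max_length"):
        if cfg[key] < 1:
            raise ValueError(key + " must be a positive integer")
    if not 0.0 <= rnn_cfg["dropout"] < 1.0:
        raise ValueError("dropout is a probability and must be in [0, 1)")


validate(config, rnnConfig)  # raises ValueError on an invalid setting
```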

## 7. Frequently encountered problems

- **Out of memory issue**:
  - Try reducing `batch_size`
- **Results differ every time I run the script**:
  - Due to the stochastic nature of these algorithms, results *may* differ slightly every time. Even though I set a random seed to make results reproducible, they *may* still differ slightly.
- **Results aren't much better using BEAM search compared to argmax**:
  - Try a higher `k` in BEAM search using the `beam_search_k` parameter in config. A higher `k` can improve results, but it also increases inference time significantly.
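A minimal sketch of BEAM search over a hypothetical `predict_next(image_features, sequence)` interface that returns a probability per vocabulary word: at each step every kept caption is extended by every candidate word, and only the `k` highest-scoring sequences (by summed log-probability) survive. The interface, the toy model, and the lack of length normalisation are assumptions; the repository's implementation may rank candidates differently.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: (image features, words so far) -> {word: probability}.
NextWordFn = Callable[[object, List[str]], Dict[str, float]]


def beam_search_caption(predict_next: NextWordFn, image_features: object, k: int,
                        max_length: int, start: str = "startseq", end: str = "endseq") -> str:
    """Keep the k best partial captions per step, ranked by summed log-probability."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [start])]
    for _ in range(max_length):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end:  # finished captions compete unchanged
                candidates.append((score, seq))
                continue
            for word, p in predict_next(image_features, seq).items():
                if p > 0:  # log(0) is undefined; skip impossible words
                    candidates.append((score + math.log(p), seq + [word]))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(seq[-1] == end for _, seq in beams):
            break
    best_seq = beams[0][1]  # beams is sorted, so the first entry scores highest
    return " ".join(w for w in best_seq[1:] if w != end)


# Toy stand-in: greedy picks "a" first, but "the dog" (0.4 * 0.9) beats "a man" (0.6 * 0.4).
TOY_TABLE = {
    "startseq": {"a": 0.6, "the": 0.4},
    "a": {"man": 0.4, "dog": 0.35, "cat": 0.25},
    "the": {"dog": 0.9, "cat": 0.1},
}


def toy_predict_next(image_features: object, sequence: List[str]) -> Dict[str, float]:
    return TOY_TABLE.get(sequence[-1], {"endseq": 1.0})


print(beam_search_caption(toy_predict_next, None, k=1, max_length=10))  # a man
print(beam_search_caption(toy_predict_next, None, k=3, max_length=10))  # the dog
```

Each step scores up to `k` times the vocabulary size candidates, which is why a larger `beam_search_k` increases inference time; with `k = 1` the search reduces to argmax decoding.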

## 8. TODO

- [X] Support for VGG16 Model. Uses InceptionV3 Model by default.

- [X] Implement 2 architectures of RNN Model.

- [X] Support for batch processing in data generator with shuffling.

- [X] Implement BEAM Search.

- [X] Calculate BLEU Scores using BEAM Search.

- [ ] Implement Attention and change model architecture.

- [ ] Support for pre-trained word vectors like word2vec, GloVe etc.

## 9. References