# Visual Question Answering in Tensorflow

[![Join the chat at https://gitter.im/neural-vqa-tensorflow/Lobby](https://badges.gitter.im/neural-vqa-tensorflow/Lobby.svg)](https://gitter.im/neural-vqa-tensorflow/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

This is a Tensorflow implementation of the VIS + LSTM visual question answering model from the paper [Exploring Models and Data for Image Question Answering][1]
by Mengye Ren, Ryan Kiros & Richard Zemel. The model architecture varies slightly from the original: the image embedding is fed into the last LSTM step (after the question) instead of the first. The LSTM uses the same hyperparameters as the [Torch implementation of neural-VQA][2].
![Model architecture](http://i.imgur.com/Jvixx2W.jpg)
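
To make the architectural difference concrete, here is a minimal NumPy sketch of how the LSTM input sequence is laid out; the names, shapes, and random values are illustrative assumptions, not code from `vis_lstm.py`:
```python
import numpy as np

embedding_size = 512   # matches the default embedding_size below
max_q_len = 22         # questions are zero-padded to 22 words

word_embeddings = np.random.randn(max_q_len, embedding_size)  # embedded question
fc7 = np.random.randn(4096)                                   # VGG-16 fc7 feature
W_img = np.random.randn(4096, embedding_size)                 # image-to-embedding projection

# The fc7 feature is projected into the word-embedding space and appended as
# the last LSTM input (after the question) rather than the first, which is
# where this implementation deviates from the paper.
image_embedding = np.dot(fc7, W_img)
lstm_inputs = np.vstack([word_embeddings, image_embedding])   # shape (23, 512)
```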

## Requirements
- Python 2.7.6
- [Tensorflow][3]
- [h5py][4]

#### Datasets
- Download the [MSCOCO][5] train+val images and [VQA][6] data using `Data/download_data.sh`. Extract all the downloaded zip files inside the `Data` folder.
- Download the [pretrained VGG-16 tensorflow model][7] and save it in the `Data` folder.

## Usage

- Extract the fc7 image features using:
```
python extract_fc7.py --split=train
python extract_fc7.py --split=val
```
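
The extracted features are written to h5 files under `Data/` (see the `data_dir` option below). As a hedged sketch, they could be inspected with h5py; the file and dataset names here are assumptions, not necessarily those produced by `extract_fc7.py`:
```python
import h5py

# Open the (hypothetically named) features file and read the fc7 matrix.
with h5py.File('Data/fc7_features_train.h5', 'r') as f:
    fc7_features = f['fc7_features'][:]  # expected shape: (num_images, 4096)
print(fc7_features.shape)
```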

- Training
  * Basic usage: `python train.py` (an example invocation with all options follows this list)
  * Options
    - `rnn_size`: Size of the LSTM internal state. Default is 512.
    - `num_lstm_layers`: Number of LSTM layers. Default is 2.
    - `embedding_size`: Size of the word embeddings. Default is 512.
    - `learning_rate`: Learning rate. Default is 0.001.
    - `batch_size`: Batch size. Default is 200.
    - `epochs`: Number of full passes through the training data. Default is 50.
    - `img_dropout`: Dropout for the image embedding; probability of dropping the input. Default is 0.5.
    - `word_emb_dropout`: Dropout for the word embeddings. Default is 0.5.
    - `data_dir`: Directory containing the data h5 files. Default is `Data/`.
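
For example, a run that spells out the defaults listed above might look like this (assuming the usual `--flag=value` syntax used elsewhere in this README):
```
python train.py --rnn_size=512 --num_lstm_layers=2 --embedding_size=512 --learning_rate=0.001 --batch_size=200 --epochs=50 --img_dropout=0.5 --word_emb_dropout=0.5 --data_dir=Data/
```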

- Prediction
  * ```python predict.py --image_path="sample_image.jpg" --question="What is the color of the animal shown?" --model_path="Data/Models/model2.ckpt"```
  * Models are saved to ```Data/Models``` after each complete pass through the training data. Supply the path of a trained model via the ```model_path``` option.

- Evaluation
  * Run `python evaluate.py` with the same options as `train.py`, if you changed them from the defaults (see the example below).
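
A hypothetical example, assuming the model was trained with a non-default LSTM size:
```
python evaluate.py --rnn_size=1024 --data_dir=Data/
```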

## Implementation Details
- fc7 ReLU-layer features from the pretrained VGG-16 model are used as the image embeddings. I did not scale these features, and am not sure whether that would make a difference.
- Questions are zero-padded to a fixed length so that batch training can be used (see the sketch after this list). Questions are represented as word indices into a question-word vocabulary built during preprocessing.
- Answers are mapped to a 1000-word vocabulary, covering 87% of the answers across the training and validation datasets.
- The VIS+LSTM model is defined in `vis_lstm.py`. The input tensors for training are the fc7 features, the questions (word indices, up to 22 words), and the answers (one-hot vectors of size 1000). The model depicted in the figure is implemented with 2 LSTM layers by default (`num_lstm_layers` is configurable).
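
As an illustration of the encoding described above, here is a minimal sketch of the zero padding and one-hot answer encoding; the vocabulary and helper names are hypothetical, not the ones built by the repository's preprocessing scripts:
```python
import numpy as np

max_q_len = 22                                                # fixed question length
word_to_index = {'what': 1, 'animal': 2, 'is': 3, 'this': 4}  # toy vocabulary

def encode_question(words):
    # Map words to vocabulary indices (0 for unknown), truncate to 22 words,
    # and zero-pad the remainder.
    indices = [word_to_index.get(w, 0) for w in words][:max_q_len]
    padded = np.zeros(max_q_len, dtype=np.int32)
    padded[:len(indices)] = indices
    return padded

def encode_answer(answer_index, vocab_size=1000):
    # One-hot vector over the 1000-word answer vocabulary.
    one_hot = np.zeros(vocab_size, dtype=np.float32)
    one_hot[answer_index] = 1.0
    return one_hot

print(encode_question(['what', 'animal', 'is', 'this']))  # [1 2 3 4 0 0 ...]
print(encode_answer(7).argmax())                          # 7
```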

## Results
The model achieved an accuracy of 50.8% on the validation dataset after 12 epochs of training over the entire training dataset.

## Sample Predictions

The fun part! Try it for yourself. Make sure you have Tensorflow installed. Download the data files and the trained model from [this link][9] and save them in the ```Data/``` directory. Also download the [pretrained VGG-16 model][7] and save it as ```Data/vgg16.tfmodel```. You can run it on any sample image using:
```
python predict.py --image_path="Data/sample.jpg" --question="Which animal is this?" --model_path="Data/model2.ckpt"
```
| Image | Question | Top Answers (left to right) |
| ------------- |:-------------:| -----:|
| ![](http://i.imgur.com/j4FiEaS.jpg) | What color is the signal? | red, green, yellow|
| ![](http://i.imgur.com/FUR7k0y.jpg) | What animal is this? | giraffe, cow, horse|
| ![](http://i.imgur.com/VrGUves.jpg) | What animal is this? | cat, dog, giraffe|
| ![](http://i.imgur.com/yk53y1Y.jpg) | What color is the frisbee that is in the dog's mouth? | white, brown, red|
| ![](http://i.imgur.com/yk53y1Y.jpg) | What color is the frisbee that is upside down? | red, white, blue|
| ![](http://i.imgur.com/ifcccpd.jpg) | What are they playing with? | frisbee, soccer ball, soccer|
| ![](http://i.imgur.com/VrjUbgH.jpg) | What is in the standing person's hand? | bat, glove, ball|
| ![](http://i.imgur.com/80foxDZ.jpg) | What are they doing? | surfing, swimming, parasailing|
| ![](http://i.imgur.com/7ZZi2Xp.jpg) | What sport is this? | skateboarding, parasailing, surfing|

## References
- [Exploring Models and Data for Image Question Answering][1]
- [Torch implementation of VQA][2]
- [Neural Caption Generator with Attention][8]

[1]: http://arxiv.org/abs/1505.02074
[2]: https://github.com/abhshkdz/neural-vqa/
[3]: https://github.com/tensorflow/tensorflow
[4]: http://www.h5py.org/
[5]: http://mscoco.org/
[6]: http://visualqa.org/
[7]: https://github.com/ry/tensorflow-vgg16
[8]: https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow
[9]: https://drive.google.com/folderview?id=0B30fmeZ1slbBU1JSRHdiWkF4NUk&usp=sharing