https://github.com/divelab/vqa-text


Host: GitHub
URL: https://github.com/divelab/vqa-text
Owner: divelab
Created: 2017-02-11T05:11:11.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2019-02-06T03:40:27.000Z (over 7 years ago)
Last Synced: 2025-04-05T02:21:44.689Z (about 1 year ago)
Language: Python
Size: 25.4 KB
Stars: 19
Watchers: 3
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Learning Convolutional Text Representations for Visual Question Answering

This is the code for our SDM 2018 paper [Learning Convolutional Text Representations for Visual Question Answering](https://epubs.siam.org/doi/abs/10.1137/1.9781611975321.67). We used it to explore different text representation methods for visual question answering (VQA). It is based on the [vqa-mcb](https://github.com/akirafukui/vqa-mcb) code.

Created by [Zhengyang Wang](http://people.tamu.edu/~zhengyang.wang/) and [Shuiwang Ji](http://people.tamu.edu/~sji/index.html) at Texas A&M University.

## Citation

If you wish to cite our work, you can use the following BibTeX entry:

```
@inproceedings{wang2018learning,
  title={Learning Convolutional Text Representations for Visual Question Answering},
  author={Wang, Zhengyang and Ji, Shuiwang},
  booktitle={Proceedings of the 2018 SIAM International Conference on Data Mining},
  pages={594--602},
  year={2018},
  organization={SIAM}
}
```

## Instructions

To replicate our results, first complete the following prerequisites, as described in [vqa-mcb](https://github.com/akirafukui/vqa-mcb):

- Compile the `feature/20160617_cb_softattention` branch of [this fork of Caffe](https://github.com/akirafukui/caffe/). This branch contains Yang Gao’s Compact Bilinear layers ([dedicated repo](https://github.com/gy20073/compact_bilinear_pooling), [paper](https://arxiv.org/abs/1511.06062)) released under the [BDD license](https://github.com/gy20073/compact_bilinear_pooling/blob/master/caffe-20160312/LICENSE_BDD), and Ronghang Hu’s Soft Attention layers ([paper](https://arxiv.org/abs/1511.03745)) released under BSD 2-clause.

- Download the [pre-trained ResNet-152 model](https://github.com/KaimingHe/deep-residual-networks).

- Download the [VQA tools](https://github.com/VT-vision-lab/VQA).

- Download the [VQA real-image dataset](http://visualqa.org/download.html).

- Run the [data preprocessing](https://github.com/akirafukui/vqa-mcb/tree/master/preprocess).
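Before training, it can help to confirm that the prerequisites above are actually in place. The following is a minimal sanity-check sketch, not part of the repository: the helper name `find_missing_prerequisites` and every path in it are hypothetical placeholders for your own setup. It only checks that each path exists and that the `caffe` Python module (pycaffe from the compiled fork) is importable.

```python
import importlib.util
import os


def find_missing_prerequisites(paths):
    """Return the names of prerequisites that could not be found.

    `paths` maps a descriptive name to a file or directory path.
    Hypothetical helper: the names and paths are placeholders for your setup.
    """
    missing = [name for name, path in paths.items() if not os.path.exists(path)]
    # pycaffe must be importable from the compiled Caffe fork.
    if importlib.util.find_spec("caffe") is None:
        missing.append("pycaffe (caffe module)")
    return missing


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever you put each prerequisite.
    prerequisites = {
        "ResNet-152 weights": "/data/resnet/ResNet-152-model.caffemodel",
        "VQA tools": "/data/VQA/PythonHelperTools",
        "VQA images and annotations": "/data/VQA",
    }
    for name in find_missing_prerequisites(prerequisites):
        print("missing:", name)
```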

**Note:** As explained in our paper, we did not use any additional data, such as GloVe word embeddings or the Visual Genome dataset.

To train and test a model, edit the corresponding `config.py` and `qlstm_solver.prototxt` files.

**Note:** Unlike in [vqa-mcb](https://github.com/akirafukui/vqa-mcb), different methods in our experiments require different data provider layers. For each method, use the `vqa_data_provider_layer.py` and `visualize_tools.py` files located in that method's own folder.
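Caffe's Python layers are imported by module name, so whichever `vqa_data_provider_layer` module Python finds first on `sys.path` is the one the network uses. Running `python train_xxx.py` already puts that script's directory first on `sys.path`. If you drive several methods from one Python session, the hypothetical helper below sketches one way to make sure a given method's folder wins: it moves the folder to the front of `sys.path` and drops any previously imported copy of the module.

```python
import importlib
import os
import sys


def use_method_folder(folder, module_name="vqa_data_provider_layer"):
    """Put `folder` first on sys.path and (re)import `module_name` from it.

    Hypothetical helper: Caffe imports Python layers by module name, so the
    first matching module on sys.path is the one the network will use.
    """
    folder = os.path.abspath(folder)
    if folder in sys.path:
        sys.path.remove(folder)
    sys.path.insert(0, folder)
    # Forget any copy imported earlier (e.g. from another method's folder).
    sys.modules.pop(module_name, None)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```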

In `config.py`, set `GPU_ID` and `VALIDATE_INTERVAL` (the number of training iterations) appropriately.
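The sketch below shows, under stated assumptions, why an evaluation interval can double as the training length. It is a hypothetical outline rather than the repository's training loop: `run_schedule`, `train_step`, `evaluate`, and `max_iter` are illustrative names standing in for the Caffe solver step, the evaluation on the validation set, and the total number of iterations. Evaluation runs every `validate_interval` iterations, so an interval equal to the total gives exactly one train-then-test pass.

```python
def run_schedule(validate_interval, max_iter, train_step, evaluate):
    """Hypothetical outline: train for max_iter iterations, evaluating every
    validate_interval iterations. Returns the iterations where evaluation ran."""
    evaluated_at = []
    for it in range(1, max_iter + 1):
        train_step(it)  # stand-in for one solver step
        if it % validate_interval == 0:
            evaluate(it)  # stand-in for testing on the validation set
            evaluated_at.append(it)
    return evaluated_at


# Interval equal to the training length: a single evaluation at the end.
print(run_schedule(1000, 1000, lambda it: None, lambda it: None))
```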

**Note:** As stated in our paper, we trained only on the training set and tested on the validation set. The code has been modified to train and then test automatically when `VALIDATE_INTERVAL` is set to the number of training iterations. The preset value is the one we used for our reported results. We chose it by splitting the original training set into a new training set and a validation set and applying early stopping; we then used this code to train our model on the full training set.
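The procedure for choosing the number of iterations can be sketched as a patience-based early-stopping rule over scores recorded on the held-out split. This is an illustration under assumptions: the paper's exact stopping criterion, evaluation spacing, and `patience` value are not given here, and `pick_iterations_by_early_stopping` is a hypothetical helper name.

```python
def pick_iterations_by_early_stopping(val_scores, interval, patience=3):
    """Choose a training length from held-out validation scores.

    val_scores[i] is the score (higher is better) after (i + 1) * interval
    iterations. Scanning stops once `patience` consecutive evaluations fail
    to beat the best score; the iteration count of the best score is returned.
    """
    best_score, best_iter, since_best = float("-inf"), 0, 0
    for i, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_iter, since_best = score, (i + 1) * interval, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_iter
```

The returned count is what would then be used as `VALIDATE_INTERVAL` when retraining on all of the training data.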

In `qlstm_solver.prototxt`, set `snapshot` and `snapshot_prefix` appropriately.
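In a Caffe solver file, `snapshot` is the interval (in iterations) between saved snapshots and `snapshot_prefix` is a quoted path prefix for the saved files. If you prefer to set these from a script rather than by hand, the hypothetical helper below rewrites `key: value` lines in the prototxt text, quoting strings and leaving numbers bare, and appends any key that is missing. It is a plain-text sketch; it does not parse the full protobuf text format.

```python
import re


def set_solver_fields(prototxt_text, **fields):
    """Return prototxt_text with `key: value` lines set to the given values.

    Hypothetical helper. Strings are written quoted (as snapshot_prefix
    requires) and numbers unquoted; keys not already present are appended.
    """
    text = prototxt_text
    for key, value in fields.items():
        rendered = '"%s"' % value if isinstance(value, str) else str(value)
        pattern = re.compile(r"^([ \t]*%s[ \t]*:[ \t]*).*$" % re.escape(key), re.MULTILINE)
        if pattern.search(text):
            # Keep the original "key: " prefix and replace only the value.
            text = pattern.sub(lambda m: m.group(1) + rendered, text)
        else:
            text = text.rstrip("\n") + "\n%s: %s\n" % (key, rendered)
    return text
```

For example, read `qlstm_solver.prototxt`, pass its contents through `set_solver_fields(text, snapshot=5000, snapshot_prefix="./snapshots/qlstm")` with your own values, and write the result back.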

Now run `python train_xxx.py`. Training can take a while. Snapshots are saved according to the settings in `qlstm_solver.prototxt`. To stop training, press `Ctrl+C`.
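Caffe names its snapshots `<snapshot_prefix>_iter_<N>.caffemodel` (weights) and `<snapshot_prefix>_iter_<N>.solverstate` (solver state). After stopping a run, the hypothetical helper below finds the most recent snapshot for a given prefix, comparing iteration numbers numerically so that `_iter_10000` sorts after `_iter_9000`. Whether and how to resume from it (for example with pycaffe's `Solver.restore`) is up to you; the training scripts are not assumed to support resuming.

```python
import glob
import os
import re


def latest_snapshot(snapshot_prefix, extension=".solverstate"):
    """Return (iteration, path) of the newest Caffe snapshot, or None.

    Hypothetical helper. Matches files named
    <snapshot_prefix>_iter_<N><extension> and picks the largest N.
    """
    name_re = re.compile(
        re.escape(os.path.basename(snapshot_prefix)) + r"_iter_(\d+)" + re.escape(extension) + "$"
    )
    best = None
    for path in glob.glob(glob.escape(snapshot_prefix) + "_iter_*" + extension):
        match = name_re.search(os.path.basename(path))
        if match:
            iteration = int(match.group(1))  # numeric, not lexicographic
            if best is None or iteration > best[0]:
                best = (iteration, path)
    return best
```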
