{"id":18903907,"url":"https://github.com/abhshkdz/neural-vqa","last_synced_at":"2025-04-06T03:12:24.644Z","repository":{"id":144996033,"uuid":"46271402","full_name":"abhshkdz/neural-vqa","owner":"abhshkdz","description":":grey_question: Visual Question Answering in Torch","archived":false,"fork":false,"pushed_at":"2016-05-03T01:57:35.000Z","size":80,"stargazers_count":487,"open_issues_count":3,"forks_count":90,"subscribers_count":26,"default_branch":"master","last_synced_at":"2025-03-30T02:09:46.359Z","etag":null,"topics":["computer-vision","deep-learning","natural-language-processing","torch"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1505.02074","language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abhshkdz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2015-11-16T11:47:41.000Z","updated_at":"2025-02-19T09:28:21.000Z","dependencies_parsed_at":"2023-07-16T21:00:25.106Z","dependency_job_id":null,"html_url":"https://github.com/abhshkdz/neural-vqa","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhshkdz%2Fneural-vqa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhshkdz%2Fneural-vqa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhshkdz%2Fneural-vqa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhshkdz%2Fneural-vqa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abhshkdz","download_url":"https://codel
oad.github.com/abhshkdz/neural-vqa/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247427012,"owners_count":20937214,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","deep-learning","natural-language-processing","torch"],"created_at":"2024-11-08T09:06:51.257Z","updated_at":"2025-04-06T03:12:24.629Z","avatar_url":"https://github.com/abhshkdz.png","language":"Lua","funding_links":[],"categories":["Deep Learning"],"sub_categories":[],"readme":"# neural-vqa\n\n[![Join the chat at https://gitter.im/abhshkdz/neural-vqa](https://badges.gitter.im/abhshkdz/neural-vqa.svg)](https://gitter.im/abhshkdz/neural-vqa?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n\nThis is an experimental Torch implementation of the\nVIS + LSTM visual question answering model from the paper\n[Exploring Models and Data for Image Question Answering][2]\nby Mengye Ren, Ryan Kiros \u0026 Richard Zemel.\n\n![Model architecture](http://i.imgur.com/UXAPlqe.png)\n\n## Setup\n\nRequirements:\n\n- [Torch][10]\n- [loadcaffe][9]\n\nDownload the [MSCOCO][11] train+val images and [VQA][1] data using `sh data/download_data.sh`. 
Extract all the downloaded zip files inside the `data` folder.\n\n```\nunzip Annotations_Train_mscoco.zip\nunzip Questions_Train_mscoco.zip\nunzip train2014.zip\n\nunzip Annotations_Val_mscoco.zip\nunzip Questions_Val_mscoco.zip\nunzip val2014.zip\n```\n\nIf you had them downloaded already, copy over the `train2014` and `val2014` image folders\nand VQA JSON files to the `data` folder.\n\nDownload the [VGG-19][7] Caffe model and prototxt using `sh models/download_models.sh`.\n\n### Known issues\n\n- To avoid memory issues with LuaJIT, install Torch with Lua 5.1 (`TORCH_LUA_VERSION=LUA51 ./install.sh`).\nMore instructions [here][4].\n- If working with plain Lua, [luaffifb][8] may be needed for [loadcaffe][9],\nunless using pre-extracted fc7 features.\n\n## Usage\n\n### Extract image features\n\n```\nth extract_fc7.lua -split train\nth extract_fc7.lua -split val\n```\n\n#### Options\n\n- `batch_size`: Batch size. Default is 10.\n- `split`: train/val. Default is `train`.\n- `gpuid`: 0-indexed id of GPU to use. Default is -1 = CPU.\n- `proto_file`: Path to the `deploy.prototxt` file for the VGG Caffe model. Default is `models/VGG_ILSVRC_19_layers_deploy.prototxt`.\n- `model_file`: Path to the `.caffemodel` file for the VGG Caffe model. Default is `models/VGG_ILSVRC_19_layers.caffemodel`.\n- `data_dir`: Data directory. Default is `data`.\n- `feat_layer`: Layer to extract features from. Default is `fc7`.\n- `input_image_dir`: Image directory. Default is `data`.\n\n\n### Training\n\n```\nth train.lua\n```\n\n#### Options\n\n- `rnn_size`: Size of LSTM internal state. Default is 512.\n- `num_layers`: Number of layers in LSTM\n- `embedding_size`: Size of word embeddings. Default is 512.\n- `learning_rate`: Learning rate. Default is 4e-4.\n- `learning_rate_decay`: Learning rate decay factor. Default is 0.95.\n- `learning_rate_decay_after`: In number of epochs, when to start decaying the learning rate. Default is 15.\n- `alpha`: Alpha for adam. 
Default is 0.8.\n- `beta`: Beta used for adam. Default is 0.999.\n- `epsilon`: Denominator term for smoothing. Default is 1e-8.\n- `batch_size`: Batch size. Default is 64.\n- `max_epochs`: Number of full passes through the training data. Default is 15.\n- `dropout`: Dropout for regularization. Probability of dropping input. Default is 0.5.\n- `init_from`: Initialize network parameters from checkpoint at this path.\n- `save_every`: Number of iterations after which to checkpoint. Default is 1000.\n- `train_fc7_file`: Path to fc7 features of training set. Default is `data/train_fc7.t7`.\n- `fc7_image_id_file`: Path to fc7 image ids of training set. Default is `data/train_fc7_image_id.t7`.\n- `val_fc7_file`: Path to fc7 features of validation set. Default is `data/val_fc7.t7`.\n- `val_fc7_image_id_file`: Path to fc7 image ids of validation set. Default is `data/val_fc7_image_id.t7`.\n- `data_dir`: Data directory. Default is `data`.\n- `checkpoint_dir`: Checkpoint directory. Default is `checkpoints`.\n- `savefile`: Filename to save checkpoint to. Default is `vqa`.\n- `gpuid`: 0-indexed id of GPU to use. 
Default is -1 = CPU.\n\n### Testing\n\n```\nth predict.lua -checkpoint_file checkpoints/vqa_epoch23.26_0.4610.t7 -input_image_path data/train2014/COCO_train2014_000000405541.jpg -question 'What is the cat on?'\n```\n\n#### Options\n\n- `checkpoint_file`: Path to model checkpoint to initialize network parameters from\n- `input_image_path`: Path to input image\n- `question`: Question string\n\n## Sample predictions\n\nRandomly sampled image-question pairs from the VQA test set,\nand answers predicted by the VIS+LSTM model.\n\n![](http://i.imgur.com/V3nHbo9.jpg)\n\nQ: What animals are those?\nA: Sheep\n\n![](http://i.imgur.com/QRBi6qb.jpg)\n\nQ: What color is the frisbee that's upside down?\nA: Red\n\n![](http://i.imgur.com/tiOqJfH.jpg)\n\nQ: What is flying in the sky?\nA: Kite\n\n![](http://i.imgur.com/4ZmOoUF.jpg)\n\nQ: What color is court?\nA: Blue\n\n![](http://i.imgur.com/1D6NxvD.jpg)\n\nQ: What is in the standing person's hands?\nA: Bat\n\n![](http://i.imgur.com/tY9BT1I.jpg)\n\nQ: Are they riding horses both the same color?\nA: No\n\n![](http://i.imgur.com/hzwj0NS.jpg)\n\nQ: What shape is the plate?\nA: Round\n\n![](http://i.imgur.com/n1Kn1vZ.jpg)\n\nQ: Is the man wearing socks?\nA: Yes\n\n![](http://i.imgur.com/dXhNKP6.jpg)\n\nQ: What is over the woman's left shoulder?\nA: Fork\n\n![](http://i.imgur.com/thzv03r.jpg)\n\nQ: Where are the pink flowers?\nA: On wall\n\n## Implementation Details\n\n- Last hidden layer image features from [VGG-19][6]\n- Zero-padded question sequences for batched implementation\n- Training questions are filtered for `top_n` answers,\n`top_n = 1000` by default (~87% coverage)\n\n## Pretrained model and data files\n\nTo reproduce results shown on this page or try your own\nimage-question pairs, download the following and run\n`predict.lua` with the appropriate paths.\n\n- vqa\\_epoch23.26\\_0.4610.t7 (Serialized using Lua51) [[GPU](https://drive.google.com/file/d/0B8qwt8PA_oxpSWhRQ1NKYkxhYnc/view?usp=sharing)] 
[[CPU](https://drive.google.com/file/d/0B8qwt8PA_oxpbGJQY0EyZ2phYTg/view?usp=sharing)]\n- [answers_vocab.t7](https://drive.google.com/file/d/0B8qwt8PA_oxpNE1RdWlMLWlNcVk/view?usp=sharing)\n- [questions_vocab.t7](https://drive.google.com/file/d/0B8qwt8PA_oxpd2Y4MXIzb0pxSWM/view?usp=sharing)\n- [data.t7](https://drive.google.com/file/d/0B8qwt8PA_oxpejVuTFVsZTJDSUU/view?usp=sharing)\n\n## References\n\n- [Exploring Models and Data for Image Question Answering][2], Ren et al., NIPS15\n- [VQA: Visual Question Answering][3], Antol et al., ICCV15\n\n## License\n\n[MIT][12]\n\n[1]: http://visualqa.org/\n[2]: http://arxiv.org/abs/1505.02074\n[3]: http://arxiv.org/abs/1505.00468\n[4]: https://github.com/torch/distro\n[5]: http://nlp.stanford.edu/projects/glove/\n[6]: http://arxiv.org/abs/1409.1556\n[7]: https://gist.github.com/ksimonyan/3785162f95cd2d5fee77#file-readme-md\n[8]: https://github.com/facebook/luaffifb\n[9]: https://github.com/szagoruyko/loadcaffe\n[10]: http://torch.ch/\n[11]: http://mscoco.org/\n[12]: https://abhshkdz.mit-license.org/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhshkdz%2Fneural-vqa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabhshkdz%2Fneural-vqa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhshkdz%2Fneural-vqa/lists"}