{"id":17339720,"url":"https://github.com/yukezhu/visual7w-qa-models","last_synced_at":"2025-10-29T22:03:59.591Z","repository":{"id":144751781,"uuid":"55817126","full_name":"yukezhu/visual7w-qa-models","owner":"yukezhu","description":"Visual7W visual question answering models","archived":false,"fork":false,"pushed_at":"2019-10-08T06:18:01.000Z","size":131,"stargazers_count":64,"open_issues_count":4,"forks_count":23,"subscribers_count":3,"default_branch":"master","last_synced_at":"2023-10-20T23:58:10.517Z","etag":null,"topics":["deep-learning","recurrent-neural-networks"],"latest_commit_sha":null,"homepage":"http://ai.stanford.edu/~yukez/visual7w/","language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yukezhu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-04-09T00:12:42.000Z","updated_at":"2023-10-20T23:58:10.895Z","dependencies_parsed_at":null,"dependency_job_id":"9a761031-8b57-4e4a-9b6a-85eb6298dc5f","html_url":"https://github.com/yukezhu/visual7w-qa-models","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yukezhu%2Fvisual7w-qa-models","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yukezhu%2Fvisual7w-qa-models/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yukezhu%2Fvisual7w-qa-models/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yukezhu%2Fvisual7w-qa-models/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yukezhu","download_url":"https://cod
eload.github.com/yukezhu/visual7w-qa-models/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219843422,"owners_count":16556507,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","recurrent-neural-networks"],"created_at":"2024-10-15T15:42:52.109Z","updated_at":"2025-10-29T22:03:59.507Z","avatar_url":"https://github.com/yukezhu.png","language":"Lua","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Visual7W QA Models\n\n![Visual7W QA samples](http://ai.stanford.edu/~yukez/images/img/visual7w_examples.png \"Visual7W example QAs\")\n\n## Introduction\n\n[Visual7W](http://ai.stanford.edu/~yukez/visual7w/) is a large-scale visual question answering (QA) dataset, with object-level groundings and multimodal answers.\nEach question starts with one of the seven Ws, *what*, *where*, *when*, *who*, *why*, *how* and *which*. Please check out [our CVPR'16 paper](http://ai.stanford.edu/~yukez/papers/cvpr2016.pdf) for more details. This repository provides a [torch](http://torch.ch/) implementation of the attention-based QA model from our paper. Part of the code is adapted from [neuraltalk2](https://github.com/karpathy/neuraltalk2).\n\n## Dataset Overview\nThe [Visual7W](http://ai.stanford.edu/~yukez/visual7w/) dataset is collected on 47,300 COCO images. In total, it has 327,939 QA pairs, together with 1,311,756 human-generated multiple-choices and 561,459 object groundings from 36,579 categories. 
In addition, we provide complete grounding annotations that link the object mentions in the QA sentences to their bounding boxes in the images and therefore introduce a new QA type with image regions as the visually grounded answers. We refer to questions with textual answers\nas *telling* QA and to those with visual answers as *pointing* QA. The figure above shows some examples in the [Visual7W](http://ai.stanford.edu/~yukez/visual7w/) dataset, where the first row shows *telling* QA examples, and the second row shows *pointing* QA examples.\n\n[Visual7W](http://ai.stanford.edu/~yukez/visual7w/) is part of the [Visual Genome](http://visualgenome.org/) project. Visual Genome contains 1.7 million QA pairs of the 7W question types, making it the largest visual QA collection to date for training models. The QA pairs in [Visual7W](http://ai.stanford.edu/~yukez/visual7w/) are a subset of the 1.7 million QA pairs from Visual Genome. Moreover, [Visual7W](http://ai.stanford.edu/~yukez/visual7w/) includes extra annotations such as object groundings, multiple choices and human experiments, making it a clean and complete benchmark for evaluation and analysis.\n\n## Dependencies\n1.  **Python 2.7**\n    - required: [h5py](http://www.h5py.org/), [numpy](http://www.numpy.org/), [skimage](http://scikit-image.org/)\n2.  
**Lua 5.2**\n    -  required: [torch](http://torch.ch/), [nn](https://github.com/torch/nn), [nngraph](https://github.com/torch/nngraph), [hdf5](https://github.com/deepmind/torch-hdf5), [loadcaffe](https://github.com/szagoruyko/loadcaffe), [cjson](https://github.com/mpx/lua-cjson), [image](https://github.com/torch/image)\n    -  optional: [cutorch](https://github.com/torch/cutorch), [cunn](https://github.com/torch/cunn), [cudnn](https://github.com/soumith/cudnn.torch)  (for GPU support)\n\nTo install these dependencies after installing Python and Lua:\n```Shell\n# Install Python dependencies using pip\npip install numpy h5py scikit-image\n\n# Install most Lua dependencies using luarocks\nluarocks install torch\nluarocks install nn\nluarocks install nngraph\nluarocks install lua-cjson\nluarocks install image # required for demo.lua\n\n# Install torch-hdf5 from git repo\ngit clone https://github.com/deepmind/torch-hdf5\ncd torch-hdf5\nluarocks make hdf5-0-0.rockspec\n\n# (Optional) Install packages for GPU support, which require CUDA 6.5 or higher.\nluarocks install cutorch\nluarocks install cunn\nluarocks install cudnn\n```\n\n## How to Use\nIn this section, we describe the steps to set up the codebase for training new QA models as well as evaluating their performances. You can use similar procedures to develop new models, and test on your own data.\n\n**Step 1**: Get the code base and submodules (using the --recursive flag).\n```bash\ngit clone --recursive https://github.com/yukezhu/visual7w-qa-models.git\n```\n\n**Step 2**: Simply run the downloading script in the root folder. It takes care of downloading everything needed to run the whole pipeline, including the QA data, images and a pretrained CNN model (VGGNet-16).\n```bash\n./download_data.sh\n```\n\n**Step 3**: Process the raw dataset into a single hdf5 file that is easy to parse by torch. By default, it will create `qa_data.h5` and `qa_data.json` in the `data` folder. 
Make sure the QA data and images are in the right place (from Step 2) before running this script.\n```bash\npython prepare_dataset.py\n```\n\n**Step 4**: We are all set. Now let's have fun training and evaluating.\n```bash\n# Training Mode\n# the default parameters work with the default setup\n# we strongly recommend using GPU mode for training\n# use flag -h to see help information\nth train_telling.lua -h\n\n# default command for training the model on GPU #0 without finetuning the CNN\n# it should train a model whose performance is very similar to that reported in our paper\nth train_telling.lua -gpuid 0 -mc_evaluation -verbose -finetune_cnn_after -1\n\n# Evaluation Mode\n# you need to specify which model you want to evaluate\n# use flag -h to see help information\nth eval_telling.lua -model \u003cpath-to-model\u003e -mc_evaluation\n```\n\n## Model Zoo\nTo make it easy, we have released a list of pre-trained QA models for you to play with.\nThese models are trained on the *telling* QA tasks, using the [Visual7W](http://ai.stanford.edu/~yukez/visual7w/) dataset and the larger Visual Genome dataset. You can download both CPU and GPU versions of these models below.\n\nDataset                       | Num. 
QA  | What  | Where | When  | Who  | Why  | How  | Overall |\n----------------------------- |-------------------| ------| ------| ------| -----| -----| -----| --------|\nVisual7W telling ([data](http://ai.stanford.edu/~yukez/papers/resources/dataset_v7w_telling.zip)\\|[gpu](http://vision.stanford.edu/yukezhu/model_visual7w_telling_gpu.t7)\\|[cpu](http://vision.stanford.edu/yukezhu/model_visual7w_telling_cpu.t7)) | 139,868 | 0.529\t| 0.560\t| 0.743\t| 0.602\t| 0.522\t| 0.466\t| 0.541 |\nVisual Genome telling ([data](http://ai.stanford.edu/~yukez/papers/resources/dataset_visualgenome_telling.zip)\\|[gpu](http://vision.stanford.edu/yukezhu/model_visualgenome_telling_gpu.t7)\\|[cpu](http://vision.stanford.edu/yukezhu/model_visualgenome_telling_cpu.t7))    | 1,359,108 | 0.572\t| 0.613\t| 0.760\t| 0.624\t| 0.590\t| 0.531\t| 0.587 |\n\n**Note:**\n- Visual7W QA is a subset of Visual Genome QA, but has additional annotations (such as *multiple choices* and *object groundings*) for evaluation and analysis. The numbers are multiple-choice accuracies reported on the Visual7W test set.\n- You can use the script `gpu_to_cpu.lua` to convert a GPU model to a CPU copy.\n\n## Visual QA Demo\nWe have provided a demo script for you to run a pretrained QA model on your own image and ask your own questions. `demo.lua` has provided a pipeline for answering a list of sample questions (written in `demo.lua`) on a [demo image](https://raw.githubusercontent.com/yukezhu/visual7w-qa-models/master/data/demo.jpg). 
Use the following commands to run the QA demo.\n```bash\n# run demo script in GPU mode\nwget http://vision.stanford.edu/yukezhu/model_visual7w_telling_gpu.t7 -P checkpoints\nth demo.lua -model checkpoints/model_visual7w_telling_gpu.t7 -gpuid 0\n\n# alternatively, run demo script in CPU mode\nwget http://vision.stanford.edu/yukezhu/model_visual7w_telling_cpu.t7 -P checkpoints\nth demo.lua -model checkpoints/model_visual7w_telling_cpu.t7 -gpuid -1\n```\n\nYou will see that the QA model produces reasonable answers on the [demo image](https://raw.githubusercontent.com/yukezhu/visual7w-qa-models/master/data/demo.jpg) below. Feel free to try your own images or ask your own questions :)\n\n![Visual7W QA demo](https://raw.githubusercontent.com/yukezhu/visual7w-qa-models/master/data/demo.jpg \"Visual7W QA demo\")\n\n```\n** QA demo on data/demo.jpg **\n\nQ: how many people are there ?\nA: two .\n\nQ: what animal can be seen in the picture ?\nA: elephant .\n\nQ: who is wearing a red shirt ?\nA: the man on the right .\n\nQ: what color is the elephant ?\nA: gray .\n\nQ: when is the picture taken ?\nA: daytime .\n```\n\n## Reference\nPlease acknowledge our CVPR'16 paper if you are using this code.\n```\n@InProceedings{zhu2016cvpr,\n  title = {{Visual7W: Grounded Question Answering in Images}},\n  author = {Yuke Zhu and Oliver Groth and Michael Bernstein and Li Fei-Fei},\n  booktitle = {{IEEE Conference on Computer Vision and Pattern Recognition}},\n  year = 2016,\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyukezhu%2Fvisual7w-qa-models","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyukezhu%2Fvisual7w-qa-models","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyukezhu%2Fvisual7w-qa-models/lists"}