{"id":13526200,"url":"https://github.com/Cadene/vqa.pytorch","last_synced_at":"2025-04-01T07:31:42.251Z","repository":{"id":37734089,"uuid":"91613727","full_name":"Cadene/vqa.pytorch","owner":"Cadene","description":"Visual Question Answering in Pytorch","archived":false,"fork":false,"pushed_at":"2019-12-11T23:54:10.000Z","size":1817,"stargazers_count":711,"open_issues_count":20,"forks_count":177,"subscribers_count":33,"default_branch":"master","last_synced_at":"2024-08-02T06:19:47.866Z","etag":null,"topics":["clevr","coco","deep-learning","pytorch","resnet","skipthoughts","torch","vgenome","vqa"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Cadene.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-17T19:41:04.000Z","updated_at":"2024-08-01T06:24:11.000Z","dependencies_parsed_at":"2022-08-27T01:01:57.274Z","dependency_job_id":null,"html_url":"https://github.com/Cadene/vqa.pytorch","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cadene%2Fvqa.pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cadene%2Fvqa.pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cadene%2Fvqa.pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cadene%2Fvqa.pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Cadene","download_url":"https://codeload.github.com/Cadene/vqa.pytorch/tar.gz/refs/heads/master","host":{"name":"GitHub
","url":"https://github.com","kind":"github","repositories_count":222709408,"owners_count":17026761,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clevr","coco","deep-learning","pytorch","resnet","skipthoughts","torch","vgenome","vqa"],"created_at":"2024-08-01T06:01:26.316Z","updated_at":"2024-11-02T11:30:19.000Z","avatar_url":"https://github.com/Cadene.png","language":"Python","funding_links":[],"categories":["Image VQA","Python","Paper implementations｜论文实现","Paper implementations","Deep Learning Projects","Paper Implementations"],"sub_categories":["2017","Other libraries｜其他库:","Other libraries:"],"readme":"# Visual Question Answering in pytorch\n\n**/!\\ New version of pytorch for VQA available here:** https://github.com/Cadene/block.bootstrap.pytorch\n\nThis repo was made by [Remi Cadene](http://remicadene.com) (LIP6) and [Hedi Ben-Younes](https://twitter.com/labegne) (LIP6-Heuritech), two PhD Students working on VQA at [UPMC-LIP6](http://lip6.fr) and their professors [Matthieu Cord](http://webia.lip6.fr/~cord) (LIP6) and [Nicolas Thome](http://webia.lip6.fr/~thomen) (LIP6-CNAM). 
We developed this code as part of a research paper called [MUTAN: Multimodal Tucker Fusion for VQA](https://arxiv.org/abs/1705.06676) which is (as far as we know) the current state-of-the-art on the [VQA 1.0 dataset](http://visualqa.org).\n\nThe goal of this repo is twofold:\n- to make it easier to reproduce our results,\n- to provide an efficient and modular code base to the community for further research on other VQA datasets.\n\nIf you have any questions about our code or model, don't hesitate to contact us or to submit an issue. Pull requests are welcome!\n\n#### News:\n\n- 16th January 2018: a pretrained vqa2 model and web demo\n- 18th July 2017: VQA2, VisualGenome, FBResnet152 (for pytorch) added [v2.0 commit msg](https://github.com/Cadene/vqa.pytorch/commit/42391fd4a39c31e539eb6cb73ecd370bac0f010a)\n- 16th July 2017: paper accepted at ICCV2017\n- 30th May 2017: poster accepted at CVPR2017 (VQA Workshop)\n\n#### Summary:\n\n* [Introduction](#introduction)\n    * [What is the task about?](#what-is-the-task-about)\n    * [Quick insight about our method](#quick-insight-about-our-method)\n* [Installation](#installation)\n    * [Requirements](#requirements)\n    * [Submodules](#submodules)\n    * [Data](#data)\n* [Reproducing results on VQA 1.0](#reproducing-results-on-vqa-10)\n    * [Features](#features)\n    * [Pretrained models](#pretrained-models)\n* [Reproducing results on VQA 2.0](#reproducing-results-on-vqa-20)\n    * [Features](#features-20)\n    * [Pretrained models](#pretrained-models-20)\n* [Documentation](#documentation)\n    * [Architecture](#architecture)\n    * [Options](#options)\n    * [Datasets](#datasets)\n    * [Models](#models)\n* [Quick examples](#quick-examples)\n    * [Extract features from COCO](#extract-features-from-coco)\n    * [Extract features from VisualGenome](#extract-features-from-visualgenome)\n    * [Train models on VQA 1.0](#train-models-on-vqa-10)\n    * [Train models on VQA 2.0](#train-models-on-vqa-20)\n    * [Train 
models on VQA + VisualGenome](#train-models-on-vqa-10-or-20--visualgenome)\n    * [Monitor training](#monitor-training)\n    * [Restart training](#restart-training)\n    * [Evaluate models on VQA](#evaluate-models-on-vqa)\n    * [Web demo](#web-demo)\n* [Citation](#citation)\n* [Acknowledgment](#acknowledgment)\n\n## Introduction\n\n### What is the task about?\n\nThe task is about training models in an end-to-end fashion on a multimodal dataset made of triplets:\n\n- an **image** with no other information than the raw pixels,\n- a **question** about visual content(s) in the associated image,\n- a short **answer** to the question (one or a few words). \n\nThe illustration below shows two different triplets from the VQA dataset that share the same image. The models need to learn rich multimodal representations to be able to give the right answers.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/Cadene/vqa.pytorch/master/doc/vqa_task.png\" width=\"600\"/\u003e\n\u003c/p\u003e\n\nThe VQA task is still under active research. 
However, once it is solved, it could be very useful to improve human-to-machine interfaces (especially for blind people).\n\n### Quick insight about our method\n\nThe VQA community developed an approach based on four learnable components:\n\n- a question model which can be a LSTM, GRU, or pretrained Skipthoughts,\n- an image model which can be a pretrained VGG16 or ResNet-152,\n- a fusion scheme which can be an element-wise sum, concatenation, [MCB](https://arxiv.org/abs/1606.01847), [MLB](https://arxiv.org/abs/1610.04325), or [Mutan](https://arxiv.org/abs/1705.06676),\n- optionally, an attention scheme which may have several \"glimpses\".\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/Cadene/vqa.pytorch/master/doc/mutan.png\" width=\"400\"/\u003e\n\u003c/p\u003e\n\nOne of our claims is that the multimodal fusion between the image and the question representations is a critical component. Thus, our proposed model uses a Tucker Decomposition of the correlation Tensor to model richer multimodal interactions in order to provide proper answers. Our best model is based on:\n\n- a pretrained Skipthoughts for the question model,\n- features from a pretrained Resnet-152 (with images of size 3x448x448) for the image model,\n- our proposed Mutan (based on a Tucker Decomposition) for the fusion scheme,\n- an attention scheme with two \"glimpses\".\n\n## Installation\n\n### Requirements\n\nFirst install python 3 (we don't provide support for python 2). 
We advise you to install python 3 and pytorch with Anaconda:\n\n- [python with anaconda](https://www.continuum.io/downloads)\n- [pytorch with CUDA](http://pytorch.org)\n\n```\nconda create --name vqa python=3\nsource activate vqa\nconda install pytorch torchvision cuda80 -c soumith\n```\n\nThen clone the repo (with the `--recursive` flag for submodules) and install the complementary requirements:\n\n```\ncd $HOME\ngit clone --recursive https://github.com/Cadene/vqa.pytorch.git \ncd vqa.pytorch\npip install -r requirements.txt\n```\n\n### Submodules\n\nOur code has three external dependencies:\n\n- [VQA](https://github.com/Cadene/VQA) is used to evaluate results files on the valset with the OpenEnded accuracy,\n- [skip-thoughts.torch](https://github.com/Cadene/skip-thoughts.torch) is used to import pretrained GRUs and embeddings,\n- [pretrained-models.pytorch](https://github.com/Cadene/pretrained-models.pytorch) is used to load pretrained convnets.\n\n### Data\n\nData will be automatically downloaded and preprocessed when needed. Links to data are stored in `vqa/datasets/vqa.py`, `vqa/datasets/coco.py` and `vqa/datasets/vgenome.py`.\n\n\n## Reproducing results on VQA 1.0\n\n### Features\n\nAs we first developed in Lua/Torch7, we used the features of [ResNet-152 pretrained with Torch7](https://github.com/facebook/fb.resnet.torch). We ported the Torch7-pretrained resnet152 to pytorch in the v2.0 release. We will provide all the extracted features soon. 
Meanwhile, you can download the coco features as follows:\n\n```\nmkdir -p data/coco/extract/arch,fbresnet152torch\ncd data/coco/extract/arch,fbresnet152torch\nwget https://data.lip6.fr/coco/trainset.hdf5\nwget https://data.lip6.fr/coco/trainset.txt\nwget https://data.lip6.fr/coco/valset.hdf5\nwget https://data.lip6.fr/coco/valset.txt\nwget https://data.lip6.fr/coco/testset.hdf5\nwget https://data.lip6.fr/coco/testset.txt\n```\n\n/!\\ There are currently 3 versions of ResNet152:\n\n- fbresnet152torch which is the torch7 model,\n- fbresnet152 which is the port of the torch7 model to pytorch,\n- [resnet152](https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py) which is the pretrained model from torchvision (we obtained lower results with it).\n\n### Pretrained VQA models\n\nWe currently provide three models trained with our old Torch7 code and ported to Pytorch:\n\n- MutanNoAtt trained on the VQA 1.0 trainset,\n- MLBAtt trained on the VQA 1.0 trainvalset and VisualGenome,\n- MutanAtt trained on the VQA 1.0 trainvalset and VisualGenome.\n\n```\nmkdir -p logs/vqa\ncd logs/vqa\nwget http://webia.lip6.fr/~cadene/Downloads/vqa.pytorch/logs/vqa/mutan_noatt_train.zip \nwget http://webia.lip6.fr/~cadene/Downloads/vqa.pytorch/logs/vqa/mlb_att_trainval.zip \nwget http://webia.lip6.fr/~cadene/Downloads/vqa.pytorch/logs/vqa/mutan_att_trainval.zip \n```\n\nEven though we provide the results files associated with our pretrained models, you can evaluate each of them again on the valset, testset and testdevset using a single command:\n\n```\npython train.py -e --path_opt options/vqa/mutan_noatt_train.yaml --resume ckpt\npython train.py -e --path_opt options/vqa/mlb_att_trainval.yaml --resume ckpt\npython train.py -e --path_opt options/vqa/mutan_att_trainval.yaml --resume ckpt\n```\n\nTo obtain test and testdev results on VQA 1.0, you will need to zip your results json file (name it `results.zip`) and submit it to the [evaluation 
server](https://competitions.codalab.org/competitions/6961).\n\n\n## Reproducing results on VQA 2.0\n\n### Features 2.0\n\nYou must download the coco dataset (and visual genome if needed) and then extract the features with a convolutional neural network.\n\n### Pretrained VQA models 2.0\n\nWe currently provide two models trained with our current pytorch code on VQA 2.0:\n\n- MutanAtt trained on the trainset with the fbresnet152 features,\n- MutanAtt trained on the trainvalset with the fbresnet152 features.\n\n```\ncd $VQAPYTORCH\nmkdir -p logs/vqa2\ncd logs/vqa2\nwget http://data.lip6.fr/cadene/vqa.pytorch/vqa2/mutan_att_train.zip \nwget http://data.lip6.fr/cadene/vqa.pytorch/vqa2/mutan_att_trainval.zip \n```\n\n## Documentation\n\n### Architecture\n\n```\n.\n├── options        # default options dir containing yaml files\n├── logs           # experiments dir containing directories of logs (one per experiment)\n├── data           # datasets directories\n|   ├── coco       # images and features\n|   ├── vqa        # raw, interim and processed data\n|   ├── vgenome    # raw, interim, processed data + images and features\n|   └── ...\n├── vqa            # vqa package dir\n|   ├── datasets   # datasets classes \u0026 functions dir (vqa, coco, vgenome, images, features, etc.)\n|   ├── external   # submodules dir (VQA, skip-thoughts.torch, pretrained-models.pytorch)\n|   ├── lib        # misc classes \u0026 func dir (engine, logger, dataloader, etc.)\n|   └── models     # models classes \u0026 func dir (att, fusion, noatt, seq2vec, convnets)\n|\n├── train.py       # train \u0026 eval models\n├── eval_res.py    # eval results files with OpenEnded metric\n├── extract.py     # extract features from coco with CNNs\n└── visu.py        # visualize logs and monitor training\n```\n\n### Options\n\nThere are three kinds of options:\n\n- options from the yaml options files stored in the `options` directory which are used as defaults (path to directory, logs, model, features, etc.)\n- 
options from the ArgumentParser in the `train.py` file which are set to None and can override default options (learning rate, batch size, etc.)\n- options from the ArgumentParser in the `train.py` file which are set to default values (print frequency, number of threads, resume model, evaluate model, etc.)\n\nYou can easily add new options in your custom yaml file if needed. Also, if you want to grid search a parameter, you can add an ArgumentParser option and modify the dictionary in `train.py:L80`.\n\n### Datasets\n\nWe currently provide four datasets:\n\n- [COCOImages](http://mscoco.org/) currently used to extract features, it comes with three splits: trainset, valset and testset\n- [VisualGenomeImages]() currently used to extract features, it comes with one split: trainset\n- [VQA 1.0](http://www.visualqa.org/vqa_v1_download.html) comes with four splits: trainset, valset, testset (including test-std and test-dev) and \"trainvalset\" (concatenation of trainset and valset)\n- [VQA 2.0](http://www.visualqa.org) same splits but twice as big (with the same images as VQA 1.0)\n\nWe plan to add:\n\n- [CLEVR](http://cs.stanford.edu/people/jcjohns/clevr/)\n\n### Models\n\nWe currently provide four models:\n\n- MLBNoAtt: a strong baseline (BayesianGRU + Element-wise product)\n- [MLBAtt](https://arxiv.org/abs/1610.04325): the previous state-of-the-art which adds an attention strategy\n- MutanNoAtt: our proof of concept (BayesianGRU + Mutan Fusion)\n- MutanAtt: the current state-of-the-art\n\nWe plan to add several other strategies in the future.\n\n## Quick examples\n\n### Extract features from COCO\n\nThe needed images will be automatically downloaded to `dir_data` and the features will be extracted with a resnet152 by default.\n\nThere are three options for `mode`:\n\n- `att`: features will be of size 2048x14x14,\n- `noatt`: features will be of size 2048,\n- `both`: default option.\n\nBeware, you will need some space on your SSD:\n\n- 32GB for the images,\n- 125GB for the 
train features,\n- 123GB for the test features,\n- 61GB for the val features.\n\n```\npython extract.py -h\npython extract.py --dir_data data/coco --data_split train\npython extract.py --dir_data data/coco --data_split val\npython extract.py --dir_data data/coco --data_split test\n```\n\nNote: By default our code will share computations over all available GPUs. If you want to select only one or a few, use the following prefix:\n\n```\nCUDA_VISIBLE_DEVICES=0 python extract.py\nCUDA_VISIBLE_DEVICES=1,2 python extract.py\n```\n\n### Extract features from VisualGenome\n\nSame here, but only train is available:\n\n```\npython extract.py --dataset vgenome --dir_data data/vgenome --data_split train\n```\n\n\n### Train models on VQA 1.0\n\nDisplay the help message, show the selected options, and run with the defaults. The needed data will be automatically downloaded and processed using the options in `options/vqa/default.yaml`.\n\n```\npython train.py -h\npython train.py --help_opt\npython train.py\n``` \n\nRun a MutanNoAtt model with default options.\n\n```\npython train.py --path_opt options/vqa/mutan_noatt_train.yaml --dir_logs logs/vqa/mutan_noatt_train\n```\n\nRun a MutanAtt model on the trainset and evaluate on the valset after each epoch.\n\n```\npython train.py --vqa_trainsplit train --path_opt options/vqa/mutan_att_trainval.yaml \n``` \n\nRun a MutanAtt model on the trainset and valset (by default) and run through the testset after each epoch (this produces a results file that you can submit to the evaluation server).\n\n```\npython train.py --vqa_trainsplit trainval --path_opt options/vqa/mutan_att_trainval.yaml\n``` \n\n### Train models on VQA 2.0\n\nSee options of [vqa2/mutan_att_trainval](https://github.com/Cadene/vqa.pytorch/blob/master/options/vqa2/mutan_att_trainval.yaml):\n\n```\npython train.py --path_opt options/vqa2/mutan_att_trainval.yaml\n``` \n\n### Train models on VQA (1.0 or 2.0) + VisualGenome\n\nSee options of 
[vqa2/mutan_att_trainval_vg](https://github.com/Cadene/vqa.pytorch/blob/master/options/vqa2/mutan_att_trainval_vg.yaml):\n\n```\npython train.py --path_opt options/vqa2/mutan_att_trainval_vg.yaml\n``` \n\n### Monitor training\n\nCreate a visualization of an experiment using `plotly` to monitor the training, just like the picture below (**click the image to access the html/js file**):\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://rawgit.com/Cadene/vqa.pytorch/master/doc/mutan_noatt.html\"\u003e\n        \u003cimg src=\"https://raw.githubusercontent.com/Cadene/vqa.pytorch/master/doc/mutan_noatt.png\" width=\"600\"/\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\nNote that you have to wait until the first open ended accuracy has finished processing; the html file will then be created and will open in your default browser. The html will be refreshed every 60 seconds. However, you will currently need to press F5 in your browser to see the change.\n\n```\npython visu.py --dir_logs logs/vqa/mutan_noatt\n```\n\nCreate a visualization of multiple experiments to compare them or monitor them like the picture below (**click the image to access the html/js file**):\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://rawgit.com/Cadene/vqa.pytorch/master/doc/mutan_noatt_vs_att.html\"\u003e\n        \u003cimg src=\"https://raw.githubusercontent.com/Cadene/vqa.pytorch/master/doc/mutan_noatt_vs_att.png\" width=\"600\"/\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n```\npython visu.py --dir_logs logs/vqa/mutan_noatt,logs/vqa/mutan_att\n```\n\n\n\n### Restart training\n\nRestart the model from the last checkpoint.\n\n```\npython train.py --path_opt options/vqa/mutan_noatt.yaml --dir_logs logs/vqa/mutan_noatt --resume ckpt\n```\n\nRestart the model from the best checkpoint.\n\n```\npython train.py --path_opt options/vqa/mutan_noatt.yaml --dir_logs logs/vqa/mutan_noatt --resume best\n```\n\n### Evaluate models on VQA\n\nEvaluate the model from the best 
checkpoint. If your model has been trained on the training set only (`vqa_trainsplit=train`), the model will be evaluated on the valset and will run through the testset. If it was trained on the trainset + valset (`vqa_trainsplit=trainval`), it will not be evaluated on the valset.\n\n```\npython train.py --vqa_trainsplit train --path_opt options/vqa/mutan_att.yaml --dir_logs logs/vqa/mutan_att --resume best -e\n```\n\n### Web demo\n\nYou must set your local ip address and port in `demo_server.py` line 169 and your global ip address and port in `demo_web/js/custom.js` line 51.\nThe port associated with the global ip address must be forwarded to your local ip address.\n\nLaunch your API:\n```\nCUDA_VISIBLE_DEVICES=0 python demo_server.py\n```\n\nOpen `demo_web/index.html` in your browser to access the API through a web interface.\n\n## Citation\n\nPlease cite the arXiv paper if you use Mutan in your work:\n\n```\n@article{benyounescadene2017mutan,\n  author = {Hedi Ben-Younes and \n    R{\\'{e}}mi Cad{\\`{e}}ne and\n    Nicolas Thome and\n    Matthieu Cord},\n  title = {MUTAN: Multimodal Tucker Fusion for Visual Question Answering},\n  journal = {ICCV},\n  year = {2017},\n  url = {http://arxiv.org/abs/1705.06676}\n}\n```\n\n## Acknowledgment\n\nSpecial thanks to the authors of [MLB](https://arxiv.org/abs/1610.04325) for providing some [Torch7 code](https://github.com/jnhwkim/MulLowBiVQA), [MCB](https://arxiv.org/abs/1606.01847) for providing some [Caffe code](https://github.com/akirafukui/vqa-mcb), and our professors and friends from LIP6 for the perfect working atmosphere.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCadene%2Fvqa.pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCadene%2Fvqa.pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCadene%2Fvqa.pytorch/lists"}