{"id":13936186,"url":"https://github.com/batra-mlp-lab/visdial-rl","last_synced_at":"2026-02-22T17:38:42.672Z","repository":{"id":84481443,"uuid":"137548743","full_name":"batra-mlp-lab/visdial-rl","owner":"batra-mlp-lab","description":"PyTorch code for Learning Cooperative Visual Dialog Agents using Deep Reinforcement Learning","archived":false,"fork":false,"pushed_at":"2018-10-10T21:28:04.000Z","size":1409,"stargazers_count":169,"open_issues_count":7,"forks_count":39,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-03-16T16:51:41.358Z","etag":null,"topics":["computer-vision","deep-learning","natural-language-processing","pytorch","visual-dialog"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/batra-mlp-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-16T02:59:15.000Z","updated_at":"2025-01-19T12:54:48.000Z","dependencies_parsed_at":"2023-03-12T23:07:22.882Z","dependency_job_id":null,"html_url":"https://github.com/batra-mlp-lab/visdial-rl","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/batra-mlp-lab/visdial-rl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/batra-mlp-lab%2Fvisdial-rl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/batra-mlp-lab%2Fvisdial-rl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/batra-mlp-lab%2Fvisdial-rl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/batra-mlp-lab%2Fvisdial-rl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/batra-mlp-lab","download_url":"https://codeload.github.com/batra-mlp-lab/visdial-rl/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/batra-mlp-lab%2Fvisdial-rl/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266019657,"owners_count":23864916,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","deep-learning","natural-language-processing","pytorch","visual-dialog"],"created_at":"2024-08-07T23:02:27.044Z","updated_at":"2026-02-22T17:38:42.625Z","avatar_url":"https://github.com/batra-mlp-lab.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Visdial-RL-PyTorch\n\nPyTorch implementation of the paper:\n\n**[Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning](https://arxiv.org/abs/1703.06585)**  \nAbhishek Das*, Satwik Kottur*, José Moura, Stefan Lee, and Dhruv Batra  \n\u003chttps://arxiv.org/abs/1703.06585\u003e  \nICCV 2017 (Oral)  \n\nVisual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual\ncontent. 
Given an image, dialog history, and a follow-up question about the image, the AI agent has to answer the question.\n\nThis repository contains code for training the **questioner** and **answerer** bots described in the paper, both in a **supervised** fashion and via **deep reinforcement learning**, on the VisDial 0.5 dataset for the cooperative visual dialog task of _GuessWhich_.\n\n![models](images/model_figure.jpg)\n\nTable of Contents\n=================\n\n   * [Setup and Dependencies](#setup-and-dependencies)\n   * [Usage](#usage)\n      * [Preprocessing VisDial](#preprocessing-visdial)\n      * [Extracting image features](#extracting-image-features)\n      * [Download preprocessed data](#download-preprocessed-data)\n      * [Pre-trained checkpoints](#pre-trained-checkpoints)\n      * [Demo](#demo)\n      * [Training](#training)\n      * [Logging](#logging)\n      * [Evaluation](#evaluation)\n      * [Benchmarks](#benchmarks)\n      * [Visualizing Results](#visualizing-results)\n   * [Reference](#reference)\n   * [Acknowledgements](#acknowledgements)\n   * [License](#license)\n\n## Setup and Dependencies\n\nOur code is implemented in PyTorch (v0.3.1). To set up, do the following:\n\n1. Install [Python 3.6](https://www.python.org/downloads/release/python-365/)\n2. Install [PyTorch](https://pytorch.org/) v0.3.1, preferably with CUDA; running with GPU acceleration is highly recommended for this code. Note that PyTorch 0.4 is not supported.\n3. If you would also like to extract your own image features, install [Torch](http://torch.ch/), [torch-hdf5](https://github.com/deepmind/torch-hdf5), [torch/image/](https://github.com/torch/image), [torch/loadcaffe/](https://github.com/szagoruyko/loadcaffe), and optionally [torch/cutorch/](https://github.com/torch/cutorch), [torch/cudnn/](https://github.com/soumith/cudnn.torch), and [torch/cunn/](https://github.com/torch/cunn) for GPU acceleration. Alternatively, you could directly use the precomputed features provided below.\n4. 
Get the source:\n```\ngit clone https://github.com/batra-mlp-lab/visdial-rl.git visdial-pytorch\n```\n5. Install requirements into the `visdial-rl-pytorch` virtual environment, using [Anaconda](https://anaconda.org/anaconda/python):\n```\nconda env create -f env.yml\n```\n\n## Usage\n\nPreprocess data using the following scripts OR directly download preprocessed data [below](#download-preprocessed-data).\n\n### Preprocessing VisDial\n\nDownload and preprocess VisDial as described in the [visdial](https://github.com/batra-mlp-lab/visdial.git) repo.\nNote: This requires [Torch](http://torch.ch/) to run. Scroll down further if you would like to directly use precomputed features.\n\n```\ncd data/\npython prepro.py -version 0.5 -download 1\n\n# To process VisDial v0.9, run:\n# python prepro.py -version 0.9 -download 1 -input_json_train visdial_0.9_train.json \\\n#                                           -input_json_val visdial_0.9_val.json\n\ncd ..\n\n```\n\nThis will generate the files `data/visdial/chat_processed_data.h5` (containing tokenized captions, questions, answers, and image indices), and `data/visdial/chat_processed_params.json` (containing vocabulary mappings and COCO image IDs).\n\n### Extracting image features\n\nTo extract image features using VGG-19, run the following:\n\n```\nsh data/download_model.sh vgg 19\ncd data\n\nth prepro_img_vgg19.lua -imageRoot /path/to/coco/images -gpuid 0\n\n```\nSimilarly, to extract features using [ResNet](https://github.com/facebook/fb.resnet.torch/tree/master/pretrained), run:\n\n```\nsh data/download_model.sh resnet 200\ncd data\nth prepro_img_resnet.lua -imageRoot /path/to/coco/images -cnnModel /path/to/t7/model -gpuid 0\n```\n\nRunning either of the above will generate `data/visdial/data_img.h5` containing features for COCO `train` and `val` splits.\n\n### Download preprocessed data\n\nDownload preprocessed dataset and extracted features:\n\n```\nsh scripts/download_preprocessed.sh\n```\n\n### Pre-trained 
checkpoints\n\nDownload pre-trained checkpoints:\n\n```\nsh scripts/download_checkpoints.sh\n```\n\n### Demo\n\nA demo of inference for question and answer generation is available in [inference.ipynb](inference.ipynb).\n\n### Training\n\nThe model definitions supported for training are included in the `models/` folder -- we presently support the `hre-ques-lateim-hist` encoder with `generative` decoding.\n\nThe arguments to `train.py` are listed in `options.py`. There are three training modes available: `sl-abot` and `sl-qbot` for supervised learning (pre-training) of A-Bot and Q-Bot respectively, and `rl-full-QAf` for RL fine-tuning of both A-Bot and Q-Bot beginning from specified SL pre-trained checkpoints. `full-QAf` denotes that all components of each agent (dialog component and/or image prediction component) are fine-tuned.\n\nFor supervised pre-training:\n\n```\npython train.py -useGPU -trainMode sl-abot\n```\n\nFor RL fine-tuning:\n\n```\npython train.py -useGPU \\\n   -trainMode rl-full-QAf \\\n   -startFrom checkpoints/abot_sl_ep60.vd \\\n   -qstartFrom checkpoints/qbot_sl_ep60.vd\n```\n\n### Logging\n\nThe code supports logging several metrics via [visdom](https://github.com/facebookresearch/visdom). These include train and val loss curves and VisDial metrics (mean rank, mean reciprocal rank, and recall@1/5/10 for the answerer, and percentile mean rank for the questioner). To enable visdom logging, use the `enableVisdom` option along with other visdom server settings in `options.py`. A standalone visdom server can be started using:\n\n```\npython -m visdom.server -p \u003cport\u003e\n```\n\nNow you can navigate to `localhost:\u003cport\u003e` on your local machine and select the appropriate environment from the drop-down to visualize the plots. 
For example, to start an `sl-abot` job that logs plots by connecting to the above visdom server (hosted at `localhost:\u003cport\u003e`), the following command may be used to create an environment titled `my-abot-job`:\n\n```\npython train.py -useGPU \\\n    -trainMode sl-abot \\\n    -enableVisdom 1 \\\n    -visdomServer http://127.0.0.1 \\\n    -visdomServerPort \u003cport\u003e \\\n    -visdomEnv my-abot-job\n```\n\n\n### Evaluation\n\nThe three types of evaluation, (1) ranking A-Bot's answers, (2) ranking Q-Bot's image predictions and (3) ranking Q-Bot's predictions when interacting with an A-Bot, correspond to the `evalMode` arguments `ABotRank`, `QBotRank` and `QABotsRank` respectively. Any subset of them can be given as a list to `evalMode`. All evaluation outputs are displayed on visdom, so an active visdom server is required to run evaluation.\n\nFor evaluation of Q-Bot on image guessing and A-Bot on answer ranking on human-human dialog (ground truth captions, questions and answers), the following command can be used:\n\n```\npython evaluate.py -useGPU \\\n    -startFrom checkpoints/abot_sl_ep60.vd \\\n    -qstartFrom checkpoints/qbot_sl_ep60.vd \\\n    -enableVisdom 1 \\\n    -visdomServer http://127.0.0.1 \\\n    -visdomServerPort \u003cport\u003e \\\n    -visdomEnv my-eval-job \\\n    -evalMode ABotRank QBotRank\n```\n\nFor evaluation of Q-Bot on image guessing when interacting with an A-Bot, the following command can be used. Since no human-human dialog (ground truth) is shown to the agents at this stage, ground truth captions are not used. Instead, captions need to be read from `chat_processed_data_gencaps.h5`, which contains preprocessed captions generated by [neuraltalk2](https://github.com/karpathy/neuraltalk2). 
This file provides the VisDial 0.5 test split where original ground truth captions are replaced by generated captions.\n\n```\npython evaluate.py -useGPU \\\n    -inputQues data/visdial/chat_processed_data_gencaps.h5 \\\n    -startFrom checkpoints/abot_sl_ep60.vd \\\n    -qstartFrom checkpoints/qbot_sl_ep60.vd \\\n    -enableVisdom 1 \\\n    -visdomServer http://127.0.0.1 \\\n    -visdomServerPort \u003cport\u003e \\\n    -visdomEnv my-eval-job \\\n    -evalMode QABotsRank\n```\n\n\n### Benchmarks\n\nHere are some benchmarked results for both agents on the VisDial 0.5 test split.\n\n**Questioner**\n\nThe plots below show percentile mean rank (PMR) numbers for the questioner in the SL-pretrained and RL-full-QAf settings, evaluated on image retrieval based on generated dialog (the two agents interact with each other and are given a generated caption instead of the ground truth).\n\nWe have also experimented with other hyperparameter settings and found that scaling the cross entropy loss led to a significant improvement in PMR. Namely, setting `CELossCoeff` to `1` and `lrDecayRate` to `0.999962372474343` led to the PMR values shown on the right. The corresponding pre-trained checkpoints are available for download and are denoted by a `_delta` suffix.\n\nNote that RL fine-tuning begins with annealing, i.e. the RL objective is gradually eased in from the last round (round 10) to the first round of dialog. Every epoch after the first one begins by decreasing the number of rounds for which supervised pre-training is used. 
The following plots show the RL-Full-QAf model results at epoch 10 (when annealing ends) as well as epoch 20.\n\n\u003cdiv align=center\u003e\n  \u003cimg src='images/qbot-match-gen.png' alt='qbot-generated' height='250px' /\u003e\n  \u003cimg src='images/qbot-ours-gen.png' alt='qbot-delta-generated' height='250px' /\u003e\n\u003c/div\u003e\n\n**Answerer**\n\nThe table below shows evaluation performance of the trained answerer on the VisDial answering metrics. These metrics measure the answer retrieval performance of the A-Bot given image, human-human (ground truth) dialog and ground truth caption as input. Note that the epoch number is consistent across A-Bot and Q-Bot. The SL-pretraining epoch denotes the checkpoint from which the corresponding RL-finetuning was started.\n\n|  Checkpoint | Epoch |   MR  |  MRR  |   R1  |   R5  |  R10  |\n|:-----------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|\n| SL-Pretrain |   60  | 21.94 | 0.432 | 33.21 | 52.67 | 59.23 |\n| RL-Full-QAf |   10  | 21.63 | 0.434 | 33.29 | 53.10 | 59.70 |\n| RL-Full-QAf |   20  | 21.58 | 0.433 | 33.22 | 53.09 | 59.64 |\n\nAs above, we find better performance for the Δ (Delta) hyperparameter setting (which downscales the cross entropy loss).\n\n|     Checkpoint    | Epoch |   MR  |  MRR  |   R1  |   R5  |  R10  |\n|:-----------------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|\n| SL-Pretrain-Delta |   15  | 21.02 | 0.434 | 33.02 | 53.51 | 60.45 |\n| RL-Full-QAf-Delta |   10  | 21.61 | 0.410 | 30.38 | 51.61 | 59.30 |\n| RL-Full-QAf-Delta |   20  | 22.80 | 0.377 | 26.64 | 49.48 | 57.46 |\n\n### Visualizing Results\n\nTo generate dialog for visualization, run `evaluate.py` with `evalMode` set to `dialog`.\n\n```\npython evaluate.py -useGPU \\\n    -startFrom checkpoints/abot_rl_ep20.vd \\\n    -qstartFrom checkpoints/qbot_rl_ep20.vd \\\n    -inputQues data/visdial/chat_processed_data_gencaps.h5 \\\n    -evalMode dialog \\\n    -cocoDir /path/to/coco/images/ \\\n    
-cocoInfo /path/to/coco.json \\\n    -beamSize 5\n```\n\nThis generates a JSON file `dialog_output/results/results.json`. Now to visualize the generated dialog, run:\n```\ncd dialog_output/\npython -m http.server 8000\n```\n\nNavigate to `localhost:8000` in your browser to see the results! The page should look as follows.\n\n![visualization](images/visualization.jpg)\n\n## Reference\n\nIf you use this code as part of any published research, please cite this repo as well as Das and Kottur et al., Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning.\n\n```\n@misc{modhe2018visdialrlpytorch,\n   author = {Modhe, Nirbhay and Prabhu, Viraj and Cogswell, Michael and Kottur, Satwik and Das, Abhishek and Lee, Stefan and Parikh, Devi and Batra, Dhruv},\n   title = {VisDial-RL-PyTorch},\n   year = {2018},\n   publisher = {GitHub},\n   journal = {GitHub repository},\n   howpublished = {\\url{https://github.com/batra-mlp-lab/visdial-rl.git}}\n}\n\n@inproceedings{das2017visdialrl,\n  title={Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning},\n  author={Abhishek Das and Satwik Kottur and Jos\\'e M.F. Moura and\n    Stefan Lee and Dhruv Batra},\n  booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},\n  year={2017}\n}\n```\n\n## Acknowledgements\n\nWe would like to thank [Ayush Shrivastava](https://github.com/ayshrv) for his help with testing this codebase.\n\n## License\n\nBSD\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbatra-mlp-lab%2Fvisdial-rl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbatra-mlp-lab%2Fvisdial-rl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbatra-mlp-lab%2Fvisdial-rl/lists"}