{"id":13958459,"url":"https://github.com/ShannonAI/OpenViDial","last_synced_at":"2025-07-21T00:30:49.892Z","repository":{"id":43326488,"uuid":"325251799","full_name":"ShannonAI/OpenViDial","owner":"ShannonAI","description":"Code, Models and Datasets for OpenViDial Dataset","archived":false,"fork":false,"pushed_at":"2022-01-22T08:02:48.000Z","size":4041,"stargazers_count":131,"open_issues_count":9,"forks_count":23,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-11-28T02:34:46.785Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ShannonAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-29T10:07:12.000Z","updated_at":"2024-08-12T07:16:50.000Z","dependencies_parsed_at":"2022-09-13T12:41:02.163Z","dependency_job_id":null,"html_url":"https://github.com/ShannonAI/OpenViDial","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ShannonAI/OpenViDial","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FOpenViDial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FOpenViDial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FOpenViDial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FOpenViDial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ShannonAI","download_url":"https://codeload.github.com/ShannonAI/OpenViDial/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/ShannonAI%2FOpenViDial/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221246,"owners_count":23894964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-08T13:01:36.550Z","updated_at":"2025-07-21T00:30:49.535Z","avatar_url":"https://github.com/ShannonAI.png","language":"Python","funding_links":[],"categories":["其他_机器视觉","Datasets"],"sub_categories":["网络服务_其他"],"readme":"# OpenViDial\nThis repo contains download instructions for the two **OpenViDial** datasets introduced in:\n* OpenViDial 1.0: **[OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts](https://arxiv.org/pdf/2012.15015.pdf)**\n* OpenViDial 2.0: **[OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts](https://arxiv.org/pdf/2109.12761.pdf)**\n\nand the code to reproduce the results on the two **OpenViDial** datasets reported in the paper **[Modeling Text-visual Mutual Dependency for Multi-modal dialog Generation](https://arxiv.org/pdf/2105.14445.pdf)**\n\n## Introduction\nWhen humans converse, what a speaker\nsays next depends significantly on what they see. OpenViDial is a large-scale\nmulti-modal dialogue dataset built for this purpose. The dialogue\nturns and visual contexts are extracted\nfrom movies and TV series, where each dialogue\nturn is paired with the corresponding\nvisual context in which it takes place. As of **2022.01.22**, OpenViDial has two versions: **OpenViDial 1.0** and **OpenViDial 2.0**. 
**OpenViDial 1.0** contains a total of 1.1 million\ndialogue turns, and thus 1.1 million visual contexts\nstored in images. **OpenViDial 2.0** is much larger than OpenViDial 1.0, containing a total of 5.6 million\ndialogue turns along with 5.6 million visual contexts\nstored in images.\n\nThe following are two short conversations where visual contexts are crucial.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"demo_data/dataset.png\"/\u003e\n\u003c/div\u003e\n\n## Detailed and Downloading Instructions\nDetailed descriptions and download instructions for the two **OpenViDial** datasets (OpenViDial 1.0, OpenViDial 2.0) can be found [here](datasets/README.md).\n\n#### Note\nIf you'd like to take a look at a sample of the dataset instead of downloading the full dataset, we provide a data sample [here](https://drive.google.com/drive/folders/17XjJ612wMolkrU-ESW5yv6MnbaclrzoM?usp=sharing). This small sample is drawn from the OpenViDial 1.0 dataset and can be used for debugging or other quick experiments.\n\n## Vanilla Visual Dialog Models\nWe proposed three models for this dataset. 
Please refer to the paper for details:\n* **Model #1 - NoVisual**: uses only dialog texts, without visual information\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"demo_data/model1.png\"/\u003e\n\u003c/div\u003e\n\n* **Model #2 - CoarseVisual**: uses texts and a ResNet50 pretrained on ImageNet to compute a 1000-d feature from each picture\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"demo_data/model2.png\"/\u003e\n\u003c/div\u003e\n\n* **Model #3 - FineVisual**: uses texts and a Faster R-CNN pretrained on Visual Genome to compute K 2048-d object features from each picture\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"demo_data/model3.png\"/\u003e\n\u003c/div\u003e\n\n### Requirements\n* python \u003e= 3.6\n* `pip install -r requirements.txt`\n\n### Preprocess directory structure\n`preprocessed_data_dir` is a directory that contains all the preprocessed files (text, image feature mmap, offsets, etc.)\ngenerated from [origin_data_dir](#detailed-and-downloading-instructions), which we use to train the models. 
\nThe directory structure is shown below.\n\n**Note: every `train*` file or directory should have a 'valid' and a 'test' counterpart; we omit them below for simplicity.**\n```\n├──preprocessed_data_dir\n      └── train.features.mmap  // numpy mmap array file of shape [num_sents, 1000], each row is a 1000-d ResNet-50 feature\n      └── train.objects.mmap  // numpy mmap array file of shape [num_sents, 20, 2048], faster-rcnn object feature file, each row contains 20 object features, each 2048-d\n      └── train.objects_mask.mmap  // numpy mmap array file of shape [num_sents, 20], faster-rcnn mask file, each row contains 20 object mask values, 1 for valid, 0 for masked\n      └── train.offsets.npy  // numpy array file of shape [num_episodes], each item is the offset of one episode\n      └── train.sent_num.npy // numpy array file of shape [num_episodes], each item is the number of sentences in one episode\n```\n\n### Preprocess text data\nWe use Moses Tokenizer to tokenize texts and generate offsets arrays:\n`bash ./scripts/preprocess_video_data.sh`\nfollowed by byte-pair encoding and fairseq-preprocess binarization:\n`bash ./scripts/preprocess_text_data.sh`\n\n**Note: You need to change `DATA_DIR, ORIGIN_DIR, OUTPUT_DIR` to your own paths.**\n\n#### Download the pre-computed CNN features and Faster-RCNN features\nCNN-pooling features are used for Model #2 - CoarseVisual and Faster R-CNN features are used for Model #3 - FineVisual. 
You can directly download the pre-computed files for CNN and Faster-RCNN features [here](./datasets/README.md) for either the OpenViDial 1.0 dataset or the OpenViDial 2.0 dataset.\n\n#### (Optional) Extract features on your own\nIf you want to extract features on your own, or you'd like to know the details of extracting visual features,\nsee [video_dialogue_model/extract_features/extract_features.md](video_dialogue_model/extract_features/extract_features.md)\n\n**Note: Extracting features takes a long time.**\n\n### Train and Evaluate Model #1 - NoVisual\n`bash scripts/reproduce_baselines/text_only.sh` will train and evaluate NoVisual.\nRemember to change `MODEL_DIR` and `DATA_DIR` for your setup.\n\n**Note:** `fairseq` may use all GPUs on your machine, and the actual batch size is multiplied by the number of GPUs.\nTherefore, if you use multiple GPUs, the batch size should be divided by the number of GPUs.\n\n### Train and Evaluate Model #2 - CoarseVisual\n`bash scripts/reproduce_baselines/text_and_img_feature.sh` will train and evaluate CoarseVisual.\nRemember to change `MODEL_DIR` and `DATA_DIR` for your setup. Please make sure you use a single GPU to reproduce our results.\n\n### Train and Evaluate Model #3 - FineVisual\n`bash scripts/reproduce_baselines/text_and_img_objects.sh` will train and evaluate FineVisual.\nRemember to change `MODEL_DIR` and `DATA_DIR` for your setup. Please make sure you use a single GPU to reproduce our results.\n\n## MMI\n### Prepare training data\nFor NV, see [./mmi/text/README.md](./mmi/text/README.md). The structure of the training data used for both CV and FV is the same as described above.\n\n### Train and Evaluate Model #4 - MI-NV\n`bash ./mmi/text/train.sh \u0026\u0026 bash ./mmi/text/mmi_generate.sh` will train and evaluate MI-NV. Remember to change all the `MODEL_DIR` and `DATA_DIR` for your setup. 
Please make sure you use a single GPU to reproduce our results.\n\n### Train and Evaluate Model #5 - MI-CV\n`bash ./mmi/feature/scrtpts/train_image.sh \u0026\u0026 bash ./mmi/feature/scrtpts/mmi_feature_generate.sh` will train and evaluate MI-CV. Remember to change all the `MODEL_DIR` and `DATA_DIR` for your setup. Please make sure you use a single GPU to reproduce our results.\n\n### Train and Evaluate Model #6 - MI-FV\n`bash ./mmi/feature/scrtpts/train_object.sh \u0026\u0026 bash ./mmi/feature/scrtpts/mmi_object_generate.sh` will train and evaluate MI-FV. Remember to change all the `MODEL_DIR` and `DATA_DIR` for your setup. Please make sure you use a single GPU to reproduce our results.\n\n### Other Statistics\n* get diversity statistics of system output: `train/stats.py`\n* get rouge statistics of system output: `train/rouge.py`\n\n### Model benchmark\n#### 1. On OpenViDial 1.0 Dataset\n| Model | BLEU-1 | BLEU-2 | BLEU-4 | Dis-1 | Dis-2 | Dis-3 | Dis-4 | ROUGE-1 | ROUGE-2 | ROUGE-4 |\n| - | - | - | - | - | - | - | - | - | - | - |\n| 1-NV | 14.06 | 3.80 | 0.95 | 0.0006 | 0.0019 | 0.0031 | 0.0043 | 0.06787 | 0.01464 | 0.00224 |\n| 2-CV | 14.70 | 4.38 | 1.14 | 0.0023 | 0.0090 | 0.0178 | 0.0272 | 0.08773 | 0.02067 | 0.00347 |\n| 3-FV | 14.85 | 4.61 | 1.19 | 0.0026 | 0.0112 | 0.0246 | 0.0406 | 0.09083 | 0.02085 | 0.00329 |\n| 4-MI-NV | 14.27 | 3.89 | 0.99 | 0.0006 | 0.0022 | 0.0036 | 0.0043 | 0.06918 | 0.01497 | 0.00238 |\n| 5-MI-CV | 14.77 | 4.46 | 1.16 | 0.0023 | 0.0091 | 0.0181 | 0.0272 | 0.08791 | 0.02077 | 0.00350 |\n| 6-MI-FV | 14.95 | 4.67 | 1.22 | 0.0027 | 0.0117 | 0.0261 | 0.0433 | 0.09100 | 0.02090 | 0.00338 |\n\n#### 2. 
On OpenViDial 2.0 Dataset\n| Model | BLEU-4 | Dis-1 | Dis-2 | Dis-3 | Dis-4 |\n| - | - | - | - | - | - |\n| 1-NV | 1.95 | 0.0037 | 0.0302 | 0.0929 | 0.1711 |\n| 2-CV | 1.97 | 0.0041 | 0.0353 | 0.0999 | 0.1726 |\n| 3-FV | 1.99 | 0.0056 | 0.0431 | 0.1250 | 0.2215 |\n| 4-MI-NV | 1.96 | 0.0039 | 0.0311 | 0.0953 | 0.1630 |\n| 5-MI-CV | 1.98 | 0.0047 | 0.0392 | 0.1093 | 0.1774 |\n| 6-MI-FV | 2.00 | 0.0060 | 0.0460 | 0.1321 | 0.2311 |\n\n#### Note\nThe OpenViDial 2.0 dataset is much larger than the OpenViDial 1.0 dataset. To keep the results reproducible, we did not use all of the features for the CoarseVisual and FineVisual models (only 5% in these experiments), since the full features occupy more memory than most researchers are likely to have available.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShannonAI%2FOpenViDial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FShannonAI%2FOpenViDial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShannonAI%2FOpenViDial/lists"}