{"id":13754070,"url":"https://github.com/salesforce/DialogStudio","last_synced_at":"2025-05-09T22:30:44.806Z","repository":{"id":177044916,"uuid":"649979991","full_name":"salesforce/DialogStudio","owner":"salesforce","description":"DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI","archived":false,"fork":false,"pushed_at":"2025-01-27T13:38:31.000Z","size":13660,"stargazers_count":495,"open_issues_count":0,"forks_count":34,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-04-08T00:37:32.254Z","etag":null,"topics":["conversational-ai","dataset","dialog","instruction-tuning","language-model","natural-language-generation","natural-language-understanding","open-domain-dialog","open-source","question-answering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/salesforce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-06T04:20:39.000Z","updated_at":"2025-04-04T02:18:54.000Z","dependencies_parsed_at":"2025-03-10T20:39:52.865Z","dependency_job_id":"d894c0e0-cd2a-4160-907f-035c5549ff2c","html_url":"https://github.com/salesforce/DialogStudio","commit_stats":null,"previous_names":["salesforce/dialogstudio"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FDialogStudio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/salesforce%2FDialogStudio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FDialogStudio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FDialogStudio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/salesforce","download_url":"https://codeload.github.com/salesforce/DialogStudio/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335220,"owners_count":21892629,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["conversational-ai","dataset","dialog","instruction-tuning","language-model","natural-language-generation","natural-language-understanding","open-domain-dialog","open-source","question-answering"],"created_at":"2024-08-03T09:01:38.447Z","updated_at":"2025-05-09T22:30:44.798Z","avatar_url":"https://github.com/salesforce.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Python"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"figures/logo.png\" width=\"510\"/\u003e\n    \u003cbr\u003e\n\u003c!-- \u003cp\u003e\n\u003cdiv align=\"center\"\u003e --\u003e\n\u003ca href=\"https://arxiv.org/pdf/2307.10172.pdf\" style=\"font-size:20px;\"\u003ePaper\u003c/a\u003e,\n\u003ca href=\"https://huggingface.co/datasets/Salesforce/dialogstudio\" style=\"font-size:20px;\"\u003eHuggingface\u003c/a\u003e,\n\u003ca href=\"#model\" style=\"font-size:20px;\"\u003eModel\u003c/a\u003e,\n\u003ca 
href=\"https://twitter.com/JianguoZhang3\" style=\"font-size:20px\"\u003eTwitter\u003c/a\u003e \n\u003c!-- \u003c/div\u003e --\u003e\n\u003cp\u003e\n\n# DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI\n\n\n## News!\n* 🎉 [AI Agent] **March 18, 2024: Update xLAM for AI Agent**. Check [xLAM](https://github.com/SalesforceAIResearch/xLAM) for the latest data and models relevant to AI agents!\n* 🎉 [Dataset Viewer]. **March 17, 2024: Update for dataset viewer issues on HuggingFace:**  Please refer to this repo to view each dataset; we provide 5 converted examples along with 5 original examples under each data folder. For example, [ShareGPT](https://github.com/salesforce/DialogStudio/tree/main/open-domain-dialogues/ShareGPT) contains two files: [converted_example.json](https://github.com/salesforce/DialogStudio/blob/main/open-domain-dialogues/ShareGPT/converted_example.json) and [original_example.json](https://github.com/salesforce/DialogStudio/blob/main/open-domain-dialogues/ShareGPT/original_example.json).\n* [Upload models] **Aug 18, 2023**. We uploaded version 1.0 models ([dialogstudio-t5-base-v1.0](https://huggingface.co/Salesforce/dialogstudio-t5-base-v1.0), [dialogstudio-t5-large-v1.0](https://huggingface.co/Salesforce/dialogstudio-t5-large-v1.0), [dialogstudio-t5-3b-v1.0](https://huggingface.co/Salesforce/dialogstudio-t5-3b-v1.0)) trained on a few selected DialogStudio datasets and more than 1000 general tasks.\n* [Version 1.0.1] **Aug 1, 2023**.  We resolved minor issues in a few dialogues, added prompts for selected knowledge-grounded datasets, removed requirements for HuggingFace login, and made updates to SODA and ShareGPT datasets.\n* [Initial Release] **July 2023**. We're thrilled to announce the initial release of the largest unified Dialog dataset collection. The full list of all available datasets is [here](./Dataset_Stats.csv).  
\n\n\n## Contents\n\n- [Introduction](#introduction)\n- [Loading Data](#loading-data)\n- [Datasets](#datasets)\n- [Model](#model)\n- [License](#license)\n- [Citation](#citation)\n\n## Introduction\n\n\u003c!-- Check [DialogStudio_datasets.csv](https://docs.google.com/spreadsheets/d/10U9I4GoHFTYxl3OlzbbV0gmXerMT9Itn2MZs8t6AIK0/edit#gid=461625820) for all supported datasets. --\u003e\nDialogStudio is a large collection of unified dialog datasets. \nThe figure below summarizes the general statistics of DialogStudio. DialogStudio unifies each dataset while preserving its original information, which supports research on both individual datasets and Large Language Model (LLM) training. The full list of all available datasets is [here](./Dataset_Stats.csv).\n\nThe data are downloadable through HuggingFace as described in [Loading Data](#loading-data). We also provide examples for each dataset in this repo. For more granular and category-specific details, please refer to the individual folders corresponding to each category within the DialogStudio collection, e.g. the [MULTIWOZ2_2](./task-oriented-dialogues/MULTIWOZ2_2/) dataset under the [task-oriented-dialogues](./task-oriented-dialogues/) category. \n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"figures/DialogStudio_Stats.png\" width=\"730\"/\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\nDialogStudio evaluates dialogue quality based on six critical criteria, namely Understanding, Relevance, Correctness, Coherence, Completeness, and Overall Quality. Each criterion is scored on a scale of 1 to 5, with the highest scores reserved for exceptional dialogues.\n\nGiven the vast number of datasets incorporated into DialogStudio, we used 'gpt-3.5-turbo' to assess 33 distinct datasets. The script used for this evaluation is available [here](https://github.com/salesforce/DialogStudio/blob/main/code/openai_dialog_quality_evaluation.py). 
\n\nThe results of our dialogue quality assessment are presented below. We intend to release evaluation scores for individually selected dialogues in the upcoming period.\n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"figures/DialogStudio_Quality_Scores.png\" width=\"700\"/\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\n\n## Loading Data\n\nYou can load any dataset in DialogStudio from the [HuggingFace hub](https://huggingface.co/datasets/Salesforce/dialogstudio) by specifying its `{dataset_name}`, which is exactly the dataset folder name. All available datasets are described in [dataset content](./Dataset_Stats.csv).\n\nBelow is an example that loads the [MULTIWOZ2_2](./task-oriented-dialogues/MULTIWOZ2_2/) dataset under the [task-oriented-dialogues](./task-oriented-dialogues/) category:\n\n\u003c!-- Agree Licenses on the [HuggingFace hub](https://huggingface.co/datasets/Salesforce/dialogstudio). Ensure you're also logged into your HuggingFace account on local. If you haven't logged in yet, you can do so by running the following command in your terminal:\n```python\nhuggingface-cli login\n``` --\u003e\n\nLoad the dataset\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset('Salesforce/dialogstudio', 'MULTIWOZ2_2')\n```\nHere is the output structure of MultiWOZ 2.2\n```python\nDatasetDict({\n    train: Dataset({\n        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],\n        num_rows: 8437\n    })\n    validation: Dataset({\n        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],\n        num_rows: 1000\n    })\n    test: Dataset({\n        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog 
info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],\n        num_rows: 1000\n    })\n})\n```\n \n\n## Datasets\n\nThe datasets are split into several categories in this GitHub repository and on the [HuggingFace hub](https://huggingface.co/datasets/Salesforce/dialogstudio). You can check the [table of datasets](./Dataset_Stats.csv) for more information. You can also open each folder to see a few examples:\n\n- [Knowledge-Grounded-Dialogues](./knowledge-grounded-dialogues/)\n- [Natural-Language-Understanding](./natural-language-understanding/)\n- [Open-Domain-Dialogues](./open-domain-dialogues/)\n- [Task-Oriented-Dialogues](./task-oriented-dialogues/)\n- [Dialogue-Summarization](./dialogue-summarization/)\n- [Conversational-Recommendation-Dialogs](./conversational-recommendation-dialogues/)\n\n\u003c!-- ```\nDatasets/\n├── Knowledge-Grounded-Dialogues\n├── Natural-Language-Understanding\n├── Open-Domain-Dialogues\n├── Task-Oriented-Dialogues\n├── Dialogue-Summarization\n├── Conversational-Recommendation-Dialogs\n``` --\u003e\n\n\n## Model\n\nWe've rolled out version 1.0 of our models ([dialogstudio-t5-base-v1.0](https://huggingface.co/Salesforce/dialogstudio-t5-base-v1.0), [dialogstudio-t5-large-v1.0](https://huggingface.co/Salesforce/dialogstudio-t5-large-v1.0), [dialogstudio-t5-3b-v1.0](https://huggingface.co/Salesforce/dialogstudio-t5-3b-v1.0)) trained on a few selected DialogStudio datasets. Check each [Model Card](https://huggingface.co/Salesforce/dialogstudio-t5-base-v1.0) for more details. \n\nBelow is an example of running the model on CPU:\n\n```python\nfrom transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n\ntokenizer = AutoTokenizer.from_pretrained(\"Salesforce/dialogstudio-t5-base-v1.0\")\nmodel = AutoModelForSeq2SeqLM.from_pretrained(\"Salesforce/dialogstudio-t5-base-v1.0\")\n\ninput_text = \"Answer the following yes/no question by reasoning step-by-step. 
Can you write 200 words in a single tweet?\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").input_ids\n\noutputs = model.generate(input_ids, max_new_tokens=256)\nprint(tokenizer.decode(outputs[0], skip_special_tokens=True))\n```\n\n## License\n\nOur project is licensed as follows:\n\n1. For all the modified datasets in DialogStudio: \n   - A portion of these datasets is under the [Apache License 2.0](LICENSE.txt).\n   - Some retain their original licenses even after modification.\n   - For a few datasets that lacked a license, we have cited the relevant papers.\n2. Original dataset licenses: For reference, we also put the originally available licenses for each dataset into their respective dataset folders.\n3. Code: Our codebase is under the [Apache License 2.0](LICENSE.txt).\n\nFor detailed licensing information, please refer to the specific licenses accompanying the original datasets. It is important to familiarize yourself with these terms as we do not assume responsibility for licensing issues.\n\n## Acknowledgement\nWe sincerely thank all dataset authors who have contributed to the Conversational AI field. Despite careful efforts, inaccuracies in our citations or references may occur. If you spot any errors or omissions, please raise an issue or submit a pull request to help us improve. Thank you!\n\n## Citation\n\nThe data and code in this repository are mostly developed for or derived from the paper below. 
If you utilize datasets from DialogStudio, we kindly request you cite both the original work and our own work (Accepted by EACL 2024 Findings as a long paper).\n\n```\n@article{zhang2023dialogstudio,\n  title={DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI},\n  author={Zhang, Jianguo and Qian, Kun and Liu, Zhiwei and Heinecke, Shelby and Meng, Rui and Liu, Ye and Yu, Zhou and Savarese, Silvio and Xiong, Caiming},\n  journal={arXiv preprint arXiv:2307.10172},\n  year={2023}\n}\n```\n\n## Contribution\n\nWe enthusiastically invite contributions from the community! Join us in our shared mission to propel the field of conversational AI forward!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2FDialogStudio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsalesforce%2FDialogStudio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2FDialogStudio/lists"}