{"id":27483810,"url":"https://github.com/iwangjian/TopDial","last_synced_at":"2025-04-16T15:50:20.111Z","repository":{"id":199672397,"uuid":"661598053","full_name":"iwangjian/TopDial","owner":"iwangjian","description":"Code and data for \"Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation\" (EMNLP 2023)","archived":false,"fork":false,"pushed_at":"2024-04-22T21:02:22.000Z","size":1214,"stargazers_count":26,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-04-22T22:23:48.231Z","etag":null,"topics":["data-curation","dialogue-systems","personalization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iwangjian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-07-03T08:26:04.000Z","updated_at":"2024-04-22T22:23:49.862Z","dependencies_parsed_at":"2024-01-11T14:09:25.114Z","dependency_job_id":"6776700f-5866-46e9-ad53-015c556794f2","html_url":"https://github.com/iwangjian/TopDial","commit_stats":null,"previous_names":["iwangjian/topdial"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iwangjian%2FTopDial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iwangjian%2FTopDial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iwangjian%2FTopDial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iwangjian%2FTopDial/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iwangjian","download_url":"https://codeload.github.com/iwangjian/TopDial/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249256991,"owners_count":21239099,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-curation","dialogue-systems","personalization"],"created_at":"2025-04-16T15:50:11.891Z","updated_at":"2025-04-16T15:50:20.104Z","avatar_url":"https://github.com/iwangjian.png","language":"Python","funding_links":[],"categories":["📚 Datasets and Evaluation"],"sub_categories":["4. Agentic RAG"],"readme":"# TopDial\nThis repository contains code and data for the paper [Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation](http://arxiv.org/abs/2310.07397) accepted by EMNLP 2023.\n\n## Overview\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"75%\" src=\"./imgs/framework.png\" /\u003e\u003c/p\u003e\n\nTarget-oriented dialogue systems, designed to proactively steer conversations toward predefined targets or accomplish specific system-side goals, are an exciting area in conversational AI. In this work, by formulating a \u003cdialogue act, topic\u003e pair as the conversation target, we explore a novel problem of personalized target-oriented dialogue by considering personalization during the target accomplishment process. However, there remains an emergent need for high-quality datasets, and building one from scratch requires tremendous human effort. 
To address this, we propose an automatic dataset curation framework using a role-playing approach. Based on this framework, we construct a large-scale personalized target-oriented dialogue dataset, **TopDial**, which comprises about 18K multi-turn dialogues.\n\n\n## Dataset\nThe curated **TopDial** dataset is hosted on OneDrive. Please download it from this OneDrive [link](https://connectpolyu-my.sharepoint.com/:u:/g/personal/21037774r_connect_polyu_hk/EftqMq3DT99PprYnTMA_NrUBN3BxoxY2-5CLTjkYJS9rmg?e=R73KO4).\n\n\n## Dataset Curation\n\n\n### Requirements\nWe use [Neo4j](https://neo4j.com/) as the graph database tool to process the domain knowledge graph in the seed dataset. Please install it by following the [official guide](https://neo4j.com/docs/operations-manual/current/installation/). The required Python packages are listed in `requirements.txt`. Please install them by running:\n```bash\npip install -r requirements.txt\n```\n\n### Seed Dataset\nWe use the [re-purposed version](https://github.com/iwangjian/Color4Dial) of the DuRecDial 2.0 dataset as the seed dataset. 
For convenience of preprocessing, please download it from this OneDrive [link](https://connectpolyu-my.sharepoint.com/:u:/g/personal/21037774r_connect_polyu_hk/EfbBtbnDmfxMmSfkvVDQ810B_59L7UmdBeo-CMwuq89X6w?e=M8yocS).\n\n\n### Step 1: Preprocessing the seed dataset\n```bash\npython data_preprocess.py --seed_dataset_dir ${seed_dataset_dir} --cache_dir ${cache_dir}\n```\nRunning this script will generate the following files in the specified cache directory:\n`cache_dialogue_{train|dev|test_seen|test_unseen}.jsonl`\n\n\n### Step 2: Dataset curation\n```bash\n# set your OpenAI API key\nexport OPENAI_API_KEY=\"\"\n\npython -u dialog_simulation.py --cached_seed_path ${cached_seed_path} \\\n    --output_dir ${output_dir} \\\n    --max_interaction_step ${max_interaction_step}\n```\nRunning the above script produces console output like the following:\n\u003cp align=\"center\"\u003e\u003cimg width=\"100%\" src=\"./imgs/demo.gif\" /\u003e\u003c/p\u003e\n\nTo hide the instructions and the synthesized conversations in the console, set `--show_description` and `--show_message` to `false`.\n\n\n## Acknowledgement\nOur code is partially based on the implementation of [ChatArena](https://github.com/Farama-Foundation/chatarena). 
We thank the authors for their excellent work.\n\n\n## Citation\nIf you use our data or code in your work, please kindly cite our work as:\n```bibtex\n@inproceedings{wang-etal-2023-target,\n    title = \"Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation\",\n    author = \"Wang, Jian  and\n      Cheng, Yi  and\n      Lin, Dongding  and\n      Leong, Chak Tou and\n      Li, Wenjie\",\n    booktitle = \"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)\",\n    month = dec,\n    year = \"2023\",\n    address = \"Singapore\",\n    publisher = \"Association for Computational Linguistics\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiwangjian%2FTopDial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiwangjian%2FTopDial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiwangjian%2FTopDial/lists"}