https://github.com/iwangjian/TopDial
Code and data for "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation" (EMNLP 2023)
data-curation dialogue-systems personalization
- Host: GitHub
- URL: https://github.com/iwangjian/TopDial
- Owner: iwangjian
- License: apache-2.0
- Created: 2023-07-03T08:26:04.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-22T21:02:22.000Z (over 1 year ago)
- Last Synced: 2024-04-22T22:23:48.231Z (over 1 year ago)
- Topics: data-curation, dialogue-systems, personalization
- Language: Python
- Homepage:
- Size: 1.16 MB
- Stars: 26
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# TopDial
This repository contains code and data for the paper [Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation](http://arxiv.org/abs/2310.07397) accepted by EMNLP 2023.
## Overview

Target-oriented dialogue systems, designed to proactively steer conversations toward predefined targets or accomplish specific system-side goals, are an exciting area in conversational AI. In this work, by formulating a `<dialogue act, topic>` pair as the conversation target, we explore a novel problem of personalized target-oriented dialogue by considering personalization during the target accomplishment process. However, there remains an urgent need for high-quality datasets, and building one from scratch requires tremendous human effort. To address this, we propose an automatic dataset curation framework using a role-playing approach. Based on this framework, we construct a large-scale personalized target-oriented dialogue dataset, **TopDial**, which comprises about 18K multi-turn dialogues.
## Dataset
The curated **TopDial** dataset is hosted on OneDrive. Please download it from this [link](https://connectpolyu-my.sharepoint.com/:u:/g/personal/21037774r_connect_polyu_hk/EftqMq3DT99PprYnTMA_NrUBN3BxoxY2-5CLTjkYJS9rmg?e=R73KO4).
## Dataset Curation
### Requirements
We use [Neo4j](https://neo4j.com/) as the graph database tool to process the domain knowledge graph in the seed dataset. Please install it by following the [official guide](https://neo4j.com/docs/operations-manual/current/installation/). The required Python packages are listed in `requirements.txt`. Install them by running:
```bash
pip install -r requirements.txt
```
### Seed Dataset
We use the [re-purposed version](https://github.com/iwangjian/Color4Dial) of the DuRecDial 2.0 dataset as the seed dataset. For convenience of preprocessing, please download it from this OneDrive [link](https://connectpolyu-my.sharepoint.com/:u:/g/personal/21037774r_connect_polyu_hk/EfbBtbnDmfxMmSfkvVDQ810B_59L7UmdBeo-CMwuq89X6w?e=M8yocS).
### Step 1: Preprocessing the seed dataset
```bash
python data_preprocess.py --seed_dataset_dir ${seed_dataset_dir} --cache_dir ${cache_dir}
```
Running this script will generate the following files in the specified cache dir:
`cache_dialogue_{train|dev|test_seen|test_unseen}.jsonl`
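
As a quick sanity check after preprocessing, the following minimal sketch looks for the four expected cache files and counts the records in each. It assumes only that every non-empty line is a valid JSON value; the record schema is not inspected, and the helper names (`expected_cache_files`, `count_records`, `summarize_cache`) are hypothetical and not part of this repository.

```python
import json
from pathlib import Path

# Split names taken from the file pattern documented above.
SPLITS = ("train", "dev", "test_seen", "test_unseen")


def expected_cache_files(cache_dir):
    """Return the cache file paths that Step 1 is expected to produce."""
    return [Path(cache_dir) / f"cache_dialogue_{split}.jsonl" for split in SPLITS]


def count_records(path):
    """Count non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raises json.JSONDecodeError on a malformed line
            count += 1
    return count


def summarize_cache(cache_dir):
    """Map each expected file name to its record count, or None if the file is missing."""
    summary = {}
    for path in expected_cache_files(cache_dir):
        summary[path.name] = count_records(path) if path.exists() else None
    return summary
```

Calling `summarize_cache(cache_dir)` with the directory passed to `--cache_dir` returns a dictionary in which a value of `None` marks a split that was not generated.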
### Step 2: Dataset curation
```bash
# Set your OpenAI API key
export OPENAI_API_KEY=""
python -u dialog_simulation.py --cached_seed_path ${cached_seed_path} \
    --output_dir ${output_dir} \
    --max_interaction_step ${max_interaction_step}
```
By default, the script prints the instructions and the synthesized conversations to the console. To suppress this output, set `--show_description` and `--show_message` to `false`.
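
For scripted runs, the command can also be assembled from Python. The sketch below is a hypothetical wrapper, not part of this repository: it uses only the flags documented in this section, and it assumes the two display flags take `false` as a separate argument value (`--show_description false`). If the script parses them differently, adjust the list accordingly.

```python
import os
import subprocess
import sys


def build_simulation_command(cached_seed_path, output_dir, max_interaction_step, quiet=False):
    """Build the argument list for dialog_simulation.py from the documented flags."""
    cmd = [
        sys.executable, "-u", "dialog_simulation.py",
        "--cached_seed_path", str(cached_seed_path),
        "--output_dir", str(output_dir),
        "--max_interaction_step", str(max_interaction_step),
    ]
    if quiet:
        # Assumption: the flags accept the literal string "false" as their value.
        cmd += ["--show_description", "false", "--show_message", "false"]
    return cmd


def run_simulation(cached_seed_path, output_dir, max_interaction_step, quiet=False):
    """Run the simulation script, failing early if the API key is not set."""
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is not set")
    cmd = build_simulation_command(cached_seed_path, output_dir, max_interaction_step, quiet)
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    return subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it possible to print and inspect the command before spending API calls.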
## Acknowledgement
Our code is partially based on the implementation of [ChatArena](https://github.com/Farama-Foundation/chatarena). We thank the authors for their excellent work.
## Citation
If you use our data or code in your work, please kindly cite our work as:
```bibtex
@inproceedings{wang-etal-2023-target,
title = "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation",
author = "Wang, Jian and
Cheng, Yi and
Lin, Dongding and
Leong, Chak Tou and
Li, Wenjie",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
}
```