{"id":13754230,"url":"https://github.com/songhaoyu/BoB","last_synced_at":"2025-05-09T22:31:33.443Z","repository":{"id":47131931,"uuid":"376204689","full_name":"songhaoyu/BoB","owner":"songhaoyu","description":"The released codes for ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'","archived":false,"fork":false,"pushed_at":"2021-09-13T00:19:51.000Z","size":1184,"stargazers_count":136,"open_issues_count":8,"forks_count":24,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-16T07:33:13.463Z","etag":null,"topics":["acl2021","bert","dialogue-model","personachat"],"latest_commit_sha":null,"homepage":"https://aclanthology.org/2021.acl-long.14/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/songhaoyu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-12T04:53:18.000Z","updated_at":"2024-08-18T16:11:16.000Z","dependencies_parsed_at":"2022-09-05T16:01:37.038Z","dependency_job_id":null,"html_url":"https://github.com/songhaoyu/BoB","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songhaoyu%2FBoB","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songhaoyu%2FBoB/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songhaoyu%2FBoB/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songhaoyu%2FBoB/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/songhaoyu","download_url":"https://codeload
.github.com/songhaoyu/BoB/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335689,"owners_count":21892713,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acl2021","bert","dialogue-model","personachat"],"created_at":"2024-08-03T09:01:51.043Z","updated_at":"2025-05-09T22:31:31.461Z","avatar_url":"https://github.com/songhaoyu.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["其他_文本生成_文本对话"],"readme":"## BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data\n[\u003cimg src=\"_static/pytorch-logo.png\" width=\"10%\"\u003e](https://github.com/pytorch/pytorch) [\u003cimg src=\"https://www.apache.org/img/ASF20thAnniversary.jpg\" width=\"6%\"\u003e](https://www.apache.org/licenses/LICENSE-2.0)\n\n[\u003cimg align=\"right\" src=\"_static/scir.png\" width=\"20%\"\u003e](http://ir.hit.edu.cn/)\n\nThis repository provides the implementation details for the ACL 2021 main conference paper:\n\n**BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data**. [[paper]](https://aclanthology.org/2021.acl-long.14/)\n\n\n## 1. Data Preparation\nIn this work, we carried out persona-based dialogue generation experiments under a persona-dense scenario (English **PersonaChat**) and a persona-sparse scenario (Chinese **PersonalDialog**), with the assistance of a series of auxiliary inference datasets. 
Here we summarize the key information of these datasets and provide the links to download these datasets if they are directly accessible.\n\n* **For Persona-Dense Experiments**\n\n\t|  Dataset\t  | Type  | Language | Usage | Where to Download |\n\t|  ----  \t\t  | ----  | ----  | ----  | ----  |\n\t|  ConvAI2 PersonaChat | Dialogue Generation  | English   |  Training | [https://www.aclweb.org/anthology/P18-1205.pdf](https://www.aclweb.org/anthology/P18-1205.pdf) train\\_self\\_original\\_no\\_cands \u0026 valid\\_self\\_original\\_no\\_cands (7801 test dialogues)  |\n\t|  MNLI | Non-dialogue Inference  | English  | Training  | [https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip](https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip) entailment \u0026 contradiction  |\n\t|  DNLI | Dialogue Inference  | English  | Evaluation  | [https://www.aclweb.org/anthology/P19-1363.pdf](https://www.aclweb.org/anthology/P19-1363.pdf) |\n\n\n* **For Persona-Sparse Experiments**\n\n\t|  Dataset\t  | Type  | Language | Usage | Where to Download |\n\t|  ----  \t\t  | ----  | ----  | ----  | ----  |\n\t|  ECDT2019 PersonalDialog | Dialogue Generation  | Chinese   |  Training |    [https://arxiv.org/pdf/1901.09672.pdf](https://arxiv.org/pdf/1901.09672.pdf) dialogues\\_train.json \u0026 test\\_data\\_random.json \u0026 test\\_data\\_biased.json |\n\t|  CMNLI | Non-dialogue Inference  | Chinese  | Training  | [https://github.com/CLUEbenchmark/CLUECorpus2020/](https://github.com/CLUEbenchmark/CLUECorpus2020/) entailment \u0026 contradiction |\n\t|  KvPI | Dialogue Inference  | Chinese  | Evaluation  | [https://github.com/songhaoyu/KvPI](https://github.com/songhaoyu/KvPI) |\n\t\n\t\n* **Download Pre-trained BERT**\n\n\tThe BoB model is initialized from public BERT checkpoints:\n \n\t* **English BERT**: [https://huggingface.co/bert-base-uncased/tree/main](https://huggingface.co/bert-base-uncased/tree/main)\n\t* **Chinese BERT**: 
[https://huggingface.co/bert-base-chinese/tree/main](https://huggingface.co/bert-base-chinese/tree/main)\n\n## 2. How to Run\n\nThe `setup.sh` script lists the dependencies needed to run this project. Simply running `./setup.sh` installs them. Here we take the English PersonaChat dataset as an example to illustrate how to run the dialogue generation experiments. There are three steps, namely **tokenization**, **training**, and **inference**:\n\n* **Preprocessing**\n\n\t```\n\tpython preprocess.py --dataset_type convai2 \\\n\t--trainset ./data/ConvAI2/train_self_original_no_cands.txt \\\n\t--testset ./data/ConvAI2/valid_self_original_no_cands.txt \\\n\t--nliset ./data/ConvAI2/ \\\n\t--encoder_model_name_or_path ./pretrained_models/bert/bert-base-uncased/ \\\n\t--max_source_length 64 \\\n\t--max_target_length 32\n\t```\n\tWe have provided some data examples (dozens of lines) in the `./data` directory to show the data format. `preprocess.py` reads the different datasets and tokenizes the raw data into sequences of vocabulary IDs to facilitate model training. The `--dataset_type` option can be either `convai2` (for English PersonaChat) or `ecdt2019` (for Chinese PersonalDialog). Finally, the tokenized data is saved as a series of JSON files.\n\n* **Model Training**\n\n\t```\n\tCUDA_VISIBLE_DEVICES=0 python bertoverbert.py --do_train \\\n\t--encoder_model ./pretrained_models/bert/bert-base-uncased/ \\\n\t--decoder_model ./pretrained_models/bert/bert-base-uncased/ \\\n\t--decoder2_model ./pretrained_models/bert/bert-base-uncased/ \\\n\t--save_model_path checkpoints/ConvAI2/bertoverbert --dataset_type convai2 \\\n\t--dumped_token ./data/ConvAI2/convai2_tokenized/ \\\n\t--learning_rate 7e-6 \\\n\t--batch_size 32\n\t```\n\t\n\tHere we initialize the encoder and both decoders from the same downloaded BERT checkpoint. 
More parameter settings can be found in `bertoverbert.py`.\n\t\n* **Evaluations**\n\n\t```\n\tCUDA_VISIBLE_DEVICES=0 python bertoverbert.py --dumped_token ./data/ConvAI2/convai2_tokenized/ \\\n\t--dataset_type convai2 \\\n\t--encoder_model ./pretrained_models/bert/bert-base-uncased/  \\\n\t--do_evaluation --do_predict \\\n\t--eval_epoch 7\n\t```\n\t\n\tEmpirically, in the PersonaChat experiment with default hyperparameter settings, the best-performing checkpoint should be found between epoch 5 and epoch 9. If training goes well, the output should look like:\n\t\n\t```\n\tPerplexity on test set is 21.037 and 7.813.\n\t```\n\twhere `21.037` is the perplexity of the first decoder and `7.813` is the final perplexity of the second decoder. The generated results are written to `test_result.tsv`. Here is a generated example from the above checkpoint:\n\t\n\t```\n\tpersona:i'm terrified of scorpions. i am employed by the us postal service. i've a german shepherd named barnaby. my father drove a car for nascar.\n\tquery:sorry to hear that. my dad is an army soldier.\n\tgold:i thank him for his service.\n\tresponse_from_d1:that's cool. i'm a train driver.\n\tresponse_from_d2:that's cool. i'm a bit of a canadian who works for america.  \n\t```\n\twhere `d1` and `d2` are the two BERT decoders, respectively.\n\t\n\n* **Computing Infrastructure:**\n\t* The released code was tested on **NVIDIA Tesla V100 32G** and **NVIDIA PCIe A100 40G** GPUs. 
Note that with `batch_size=32`, the BoB model needs at least 20 GB of GPU memory for training.\n\n\n\n## MISC\n* Built upon 🤗 [Transformers](https://github.com/huggingface/transformers).\n\n* BibTeX:\n\n\t\u003cpre\u003e\n\t@inproceedings{song-etal-2021-bob,\n\t    title = \"{B}o{B}: {BERT} Over {BERT} for Training Persona-based Dialogue Models from Limited Personalized Data\",\n\t    author = \"Song, Haoyu  and\n\t      Wang, Yan  and\n\t      Zhang, Kaiyan  and\n\t      Zhang, Wei-Nan  and\n\t      Liu, Ting\",\n\t    booktitle = \"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)\",\n\t    month = aug,\n\t    year = \"2021\",\n\t    address = \"Online\",\n\t    publisher = \"Association for Computational Linguistics\",\n\t    url = \"https://aclanthology.org/2021.acl-long.14\",\n\t    doi = \"10.18653/v1/2021.acl-long.14\",\n\t    pages = \"167--177\",\n\t}\n\t\u003c/pre\u003e\n\n* Email: *hysong@ir.hit.edu.cn*.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsonghaoyu%2FBoB","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsonghaoyu%2FBoB","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsonghaoyu%2FBoB/lists"}