{"id":13535310,"url":"https://github.com/nouhadziri/DialogEntailment","last_synced_at":"2025-04-02T01:30:33.091Z","repository":{"id":51716773,"uuid":"181122698","full_name":"nouhadziri/DialogEntailment","owner":"nouhadziri","description":"The implementation of the paper \"Evaluating Coherence in Dialogue Systems using Entailment\"","archived":false,"fork":false,"pushed_at":"2024-09-21T18:52:13.000Z","size":87,"stargazers_count":74,"open_issues_count":0,"forks_count":5,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-11-02T23:32:47.958Z","etag":null,"topics":["bert","dialogue-evaluation","evaluation-framework","natural-language-inference"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1904.03371","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nouhadziri.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-13T04:51:33.000Z","updated_at":"2024-09-21T18:52:16.000Z","dependencies_parsed_at":"2022-08-22T12:50:46.236Z","dependency_job_id":null,"html_url":"https://github.com/nouhadziri/DialogEntailment","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nouhadziri%2FDialogEntailment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nouhadziri%2FDialogEntailment/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nouhadziri%2FDialogEntailment/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nouhadziri%2FDialogEntailment/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/nouhadziri","download_url":"https://codeload.github.com/nouhadziri/DialogEntailment/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246738383,"owners_count":20825775,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","dialogue-evaluation","evaluation-framework","natural-language-inference"],"created_at":"2024-08-01T08:00:53.204Z","updated_at":"2025-04-02T01:30:32.811Z","avatar_url":"https://github.com/nouhadziri.png","language":"Python","funding_links":[],"categories":["BERT Text Match:"],"sub_categories":[],"readme":"This repository hosts the implementation of the paper \n\"[Evaluating Coherence in Dialogue Systems using Entailment](https://arxiv.org/abs/1904.03371)\",\npublished in NAACL'19.\n\n# DialogEntailment\n\n\u003c!-- [![CircleCI](https://circleci.com/gh/nouhadziri/DialogEntailment.svg?style=svg)](https://circleci.com/gh/nouhadziri/DialogEntailment) --\u003e\n\nDialogEntailment is a microframework to automatically evaluate coherence in dialogue systems. Our implementation includes the following metrics:\n - __Semantic Similarity__, derived from [\\[Dziri et al., 2018\\]](https://arxiv.org/abs/1811.01063), estimates the correspondence \n between the utterances in the conversation history and the generated response. The metric is acquired by computing the cosine\ndistance between the embedding vectors of the test utterances in the dialogue history and the generated response.  
\n - __Word-level metrics__, introduced in [\\[Liu et al., 2016\\]](https://aclweb.org/anthology/D16-1230), incorporate word embeddings to compute three scores: A (average), G (greedy), and E (extrema) (to be added to the repo later)\n - __Consistency by textual entailment__: we cast a generated response as the hypothesis and the conversation history as the\npremise, thereby framing automatic evaluation as a natural language inference (NLI) task.\n\nNote that the paper reports semantic similarity as a distance, so in the code the metric is named [SemanticDistance](dialogentail/semantic_distance.py) (i.e., the lower the better). We also provide [SemanticSimilarity](dialogentail/semantic_similarity.py), which computes the actual similarity.\n\n## Installation\nDialogEntailment is shipped as a Python package and can be installed using `pip`: \n```\ngit clone git@github.com:nouhadziri/DialogEntailment.git\ncd DialogEntailment\npip install -e .\npython -m spacy link en_core_web_lg en\n```\n\n### Dependencies\n- Python \u003e= 3.6\n- SpaCy \u003e= 2.1.0\n- allennlp \u003e= 0.8.3\n- pytorch-pretrained-bert\n- scikit-learn\n- tqdm\n- smart_open\n- pandas\n- seaborn\n\n\n## Dataset\nWe built a synthesized entailment corpus, namely InferConvAI, \nfrom the ConvAI dialogue data [\\[Zhang et al., 2018\\]](https://arxiv.org/abs/1801.07243), described in detail in the paper. The dataset is formatted in both tsv (similar to [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)) and jsonl (following [SNLI](https://nlp.stanford.edu/projects/snli/)). 
To download InferConvAI, please use the following links:\n - [InferConvAI_v1.3_tsv.tar.gz](https://drive.google.com/file/d/16mxLm1fqkguYVjUibU10D99Ns3L5VgKm/view?usp=sharing) (84MB download / 236MB uncompressed)\n - [InferConvAI_v1.3_jsonl.tar.gz](https://drive.google.com/file/d/1yeU7yHzFBs93UkMHtN2uq_rv_nLrD8mF/view?usp=sharing) (74MB download / 274MB uncompressed)\n \nCheck out [convai_to_nli.py](dialogentail/preprocessing/convai_to_nli.py) to see how the synthesized inference data is generated from the utterances. \n \n## Train an Entailment model \nWe adopt two prominent models that have shown promising results in commonsense reasoning: \n\n- The Enhanced Sequential Inference Model (ESIM) [\\[Chen et al., 2016\\]](https://arxiv.org/abs/1609.06038) combined with ELMo [\\[Peters et al., 2018\\]](https://arxiv.org/abs/1802.05365) contextualized word embeddings. The implementation is taken from the [AllenNLP](https://allennlp.org/) library. You can run the following command to train the ESIM model with [this](training/configs/esim_elmo.jsonnet) configuration:\n```bash\ntraining/allennlp.sh -s \u003cMODEL_DIR\u003e [--overwrite] [--config \u003cCONFIG_FILE\u003e]\n```\n- BERT [\\[Devlin et al., 2018\\]](https://arxiv.org/abs/1810.04805): We fine-tuned a pre-trained BERT model using :hugs: [Transformers](https://github.com/huggingface/pytorch-pretrained-BERT) (then called `pytorch-pretrained-BERT`). We modified [run_classifier.py](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_classifier.py) to support the entailment task. 
Here is how to train the model, followed by other arguments that can be passed to the program: \n```bash\npython -m dialogentail.huggingface --do_eval --do_train --output_dir \u003cMODEL_DIR\u003e\n```\n\u003cpre\u003e\n    --train_dataset     default: InferConvAI train data\n    --eval_dataset      default: InferConvAI validation data\n    --model             bert-base-uncased, bert-large-uncased (default: bert-base-uncased)\n    --train_batch_size  default: 32\n    --eval_batch_size   default: 8\n    --num_train_epochs  default: 3\n    --max_seq_length    default: 128\n\u003c/pre\u003e\n\n## Visualization\nYou may run the `dialogentail` module to replicate the plots provided in the paper:\n```\npython -m dialogentail --bert_dir \u003cBERT_DIR\u003e --esim_model \u003cESIM_MODEL\u003e [--plots_dir \u003cDIR\u003e]\n```\nFor the ESIM model, you need to pass `model.tar.gz`, which allennlp generates in the model directory once training is finished.\n\nNote that loading the BERT model and the ESIM model in the same process requires a large amount of memory, so we recommend running the above command for each model separately.\n\n#### Custom Test Data\nThe default test data is 150 dialogues drawn from Reddit (used in [THRED](https://github.com/nouhadziri/THRED) for human evaluation). We also provide a 150-dialogue test set from OpenSubtitles. You can change the test data with the `--response_file` argument. 
To use our OpenSubtitles data, simply pass `--response_file opensubtitles`.\nFor your own test data, each test sample should follow this format (see our [Reddit]() data for more information):\n\u003cpre\u003e\nLine N: TAB-separated utterances in the conversation history\nLine N+1: the ground-truth response\nLine N+2: Response generated by Method_1\nLine N+3: Response generated by Method_2\n...\nLine N+m+1: Response generated by Method_m  \n\u003c/pre\u003e\n\nRun the program with the following arguments:\n\u003cpre\u003e\n    --response_file     Path to your test file\n    --generator_types   The names of 'm' generative models\n\u003c/pre\u003e\n\nBy default, the program evaluates the following `m=4` models:\n - Seq2Seq [\\[Vinyals \u0026 Le, 2015\\]](https://arxiv.org/abs/1506.05869),\n - HRED [\\[Serban et al., 2016\\]](https://arxiv.org/abs/1507.04808),\n - TA-Seq2Seq [\\[Xing et al., 2017\\]](https://arxiv.org/abs/1606.08340),\n - THRED [\\[Dziri et al., 2018\\]](https://arxiv.org/abs/1811.01063).\n\n#### Correlation with Human Judgment\nTo measure the correlation with human judgment, you need to provide a pickle file \ncontaining the mean evaluation ratings of your human judges. More precisely, the pickle file consists of a Python list \ncontaining triples `('Method_i', sample_index, mean_rate)`. \nIf you have `m` generative models and `N` test samples, the list contains `N * m` triples:\n\u003cpre\u003e\n[('Method_1', 1, 2.1), ('Method_2', 1, 3.4), ..., ('Method_m', 1, 2.6), ('Method_1', 2, 0.2), ...]\n\u003c/pre\u003e\n\nTo pass your own human judgment file, use `--human_judgment \u003cPATH_TO_PICKLE_FILE\u003e`. 
For the OpenSubtitles test data, you may simply set the argument to `opensubtitles` to use the provided human judgment.\n\n## Citation\nPlease cite the following paper if you used our work in your research:\n```\n@inproceedings{dziri-etal-2019-evaluating,\n    title = \"Evaluating Coherence in Dialogue Systems using Entailment\",\n    author = \"Dziri, Nouha  and\n      Kamalloo, Ehsan  and\n      Mathewson, Kory  and\n      Zaiane, Osmar\",\n    booktitle = \"Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)\",\n    month = jun,\n    year = \"2019\",\n    address = \"Minneapolis, Minnesota\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/N19-1381\",\n    doi = \"10.18653/v1/N19-1381\",\n    pages = \"3806--3812\",\n}\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnouhadziri%2FDialogEntailment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnouhadziri%2FDialogEntailment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnouhadziri%2FDialogEntailment/lists"}