https://github.com/salesforce/nnd_evaluation
Code for the EMNLP 2022 paper: Near-Negative Distinction
- Host: GitHub
- URL: https://github.com/salesforce/nnd_evaluation
- Owner: salesforce
- License: bsd-3-clause
- Created: 2022-10-28T21:25:07.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2022-11-02T21:12:18.000Z (almost 3 years ago)
- Last Synced: 2025-04-16T03:53:44.698Z (6 months ago)
- Topics: evaluation, nlg, nnd, question-answering, question-generation, summarization, text-generation
- Language: Jupyter Notebook
- Homepage: https://arxiv.org/abs/2205.06871v1
- Size: 34.2 KB
- Stars: 7
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: CODEOWNERS
- Security: SECURITY.md
README
# Near-Negative Distinction
Code repository for the EMNLP 2022 paper: [Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets](https://arxiv.org/abs/2205.06871v1)
## Motivation
In the NND framework, a generation model is evaluated by measuring whether it passes NND tests. To pass an NND test, the generation model must assign a higher likelihood to a high-quality output candidate than to a lower-quality candidate. Candidate quality is determined by pre-existing human evaluation datasets.
For example, for the task of generative Question Answering, the Challenge 300 dataset contains the following annotation.
```
Question: ``How could one divert an asteroid heading directly for the Earth?''
Macaw-11b output: ``launch a space shuttle into orbit around it'' -- Credit: 0
Macaw-answer-11b candidate: ``create a spacecraft to intercept and deflect the asteroid'' -- Credit: 1
```

Imagine you want to evaluate a new QA model. The issue is that, for this input question, the model might generate a novel answer, say: ``By creating a black hole to absorb the asteroid.'' Because this candidate is not in the human evaluation data, it can be challenging to score the candidate (and therefore to evaluate the model).
In NND, rather than require the model to generate *new* candidates, we evaluate the likelihood it places on *known* candidates. More specifically, we check whether the model assigns:
```
P(``create a spacecraft to intercept and deflect the asteroid'') > P(``launch a space shuttle into orbit around it'')
```
If it does, we say the model passes the NND test. By creating sets of hundreds of NND tests, we can precisely evaluate text generators.
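As a concrete illustration, here is a minimal sketch of such a pairwise likelihood check using a generic Hugging Face seq2seq model. It is not the repository's API: the model choice (`t5-small`, standing in for a QA model such as Macaw) and the length-normalized scoring convention are assumptions made for illustration.

```python
# Minimal sketch (not the repository's API): score two known candidates with a
# seq2seq LM and check which one the model assigns higher likelihood to.
# Assumptions: t5-small as a stand-in model, length-normalized log-likelihood.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.eval()

def candidate_log_likelihood(source: str, candidate: str) -> float:
    """Length-normalized log-likelihood of `candidate` given `source`."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        # model(...).loss is the mean token-level cross-entropy over the labels,
        # so its negation is the average per-token log-likelihood.
        loss = model(**inputs, labels=labels).loss
    return -loss.item()

question = "How could one divert an asteroid heading directly for the Earth?"
good = "create a spacecraft to intercept and deflect the asteroid"  # Credit: 1
bad = "launch a space shuttle into orbit around it"                 # Credit: 0

print("NND test passed:",
      candidate_log_likelihood(question, good) > candidate_log_likelihood(question, bad))
```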
## Tasks supported

### - Summarization
There are two NND test sets:
- **[Summ Eval NND]** - 3,613 NND tests, measuring model ability on: Consistency, Coherence, Fluency, and Relevance. Based on: https://arxiv.org/abs/2007.12626.
- **[FRANK NND]** - 824 NND tests, focused on factual consistency, more specifically: Semantic Frame, Discourse, and Verifiability errors. Based on: https://arxiv.org/abs/2104.13346.

See [NND_Summarization.ipynb](https://github.com/MetaMind/nnd/blob/main/NND_Summarization.ipynb) for example use; a rough sketch of the per-category aggregation step is given below.
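The notebook is the authoritative reference; as a rough sketch, one can compute a pass rate per error category given a list of NND tests. The field names below (`source`, `better`, `worse`, `category`) are hypothetical, not the repository's schema.

```python
# Rough sketch: aggregate NND pass rates per error category.
# Field names are hypothetical; the actual test-set format is defined in the repo.
from collections import defaultdict

def nnd_pass_rates(tests, score_fn):
    """tests: iterable of dicts with 'source', 'better', 'worse', 'category' keys.
    score_fn(source, candidate) -> log-likelihood (e.g. the function sketched above)."""
    passed, total = defaultdict(int), defaultdict(int)
    for t in tests:
        total[t["category"]] += 1
        if score_fn(t["source"], t["better"]) > score_fn(t["source"], t["worse"]):
            passed[t["category"]] += 1
    return {cat: passed[cat] / total[cat] for cat in total}
```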
### - Question Generation
There is one NND test set:
- **[Quiz Design NND]** - 2,686 NND tests, for answer-aware QGen models, measuring ability to avoid: Disfluent, Off Target, and Wrong Context errors. Based on: https://arxiv.org/abs/2205.01730v1.

See [NND_QGen.ipynb](https://github.com/MetaMind/nnd/blob/main/NND_QGen.ipynb) for example use.
### - Question Answering
There is one NND test set:
- **[Challenge 300 NND]** - 807 NND tests, for generative QA models, measuring model ability to answer Common Sense, Comprehension, Entity, Creativity, and Science open-ended questions. Based on: https://arxiv.org/abs/2109.02593.

See [NND_QA.ipynb](https://github.com/MetaMind/nnd/blob/main/NND_QA.ipynb) for example use.
### - Machine Translation
There is one NND test set:
- **[WMT 2021 MQM]** - 42,675 NND tests, for EN-DE machine translation models, measuring errors along the MQM error categories (Accuracy, Fluency, Terminology, and Locale Convention) as well as error severity (Minor and Major). Based on: https://arxiv.org/abs/2104.14478.

Note: this dataset was not included in the original paper. NND test sets could be expanded to other language pairs (e.g., ZH-EN) and other years of the MQM competition (e.g., WMT 2020) based on the same annotations. See [NND_MT.ipynb](https://github.com/MetaMind/nnd/blob/main/NND_MT.ipynb) for example use; a sketch of how such annotations could be turned into NND pairs is given below.
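For intuition on how a human evaluation dataset such as the MQM annotations can be given a second life, here is an illustrative sketch of one plausible pairing rule: for each source, any candidate rated strictly better than another yields one NND test. Both the field names and the pairing rule are assumptions for illustration; the released test sets follow the dataset-specific annotation schemes described above.

```python
# Illustrative sketch: turn per-candidate quality annotations into NND test pairs.
# The pairing rule and field names are simplifying assumptions, not the paper's exact procedure.
from itertools import combinations

def build_nnd_tests(annotations):
    """annotations: dicts with 'source', 'candidate', 'quality' (higher is better),
    and 'category' describing the lower-quality candidate's error type."""
    by_source = {}
    for a in annotations:
        by_source.setdefault(a["source"], []).append(a)
    tests = []
    for source, cands in by_source.items():
        for a, b in combinations(cands, 2):
            if a["quality"] == b["quality"]:
                continue  # equal ratings do not produce a test
            better, worse = (a, b) if a["quality"] > b["quality"] else (b, a)
            tests.append({"source": source,
                          "better": better["candidate"],
                          "worse": worse["candidate"],
                          "category": worse.get("category", "unspecified")})
    return tests
```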
## Cite the work
If you make use of the code, models, or pipeline, please cite our paper:
```
@inproceedings{laban2022nnd_evaluation,
title={Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets},
author={Philippe Laban and Chien-Sheng Wu and Wenhao Liu and Caiming Xiong},
booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
volume={1},
year={2022}
}
```

## Contributing
If you'd like to contribute an NND test set for an existing or a new task, feel free to open a Pull Request, or leave an issue with a public resource you think can be transformed into an NND test set.
If you have questions or suggestions, you can contact us at plaban@salesforce.com. All contributions welcome!