https://github.com/salesforce/factualNLG

Code for the arXiv paper: "LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond"

# Factual Consistency in Summarization

Can you tell which edited summaries are consistent with their source document, and which are inconsistent?



## SummEdits Benchmark

Here is the updated benchmark with results for the latest LLMs (Gemini-pro added on 12/14/2023):

| Model Name | Podcast | BillSum | SamSum | News | Sales Call | Sales Email | Shakespeare | SciTLDR | QMSum | ECTSum | Overall |
|:--------------------|----------:|----------:|---------:|--------:|--------------:|--------------:|--------------:|----------:|---------:|---------:|----------:|
| Llama2-7b | 50 | 50 | 50 | 50.6 | 50.9 | 50 | 50 | 50 | 50.7 | 51.4 | 50.4 |
| Dav001 | 53.3 | 50.2 | 51 | 54.4 | 55.5 | 52.5 | 50 | 51 | 50.1 | 50.9 | 51.9 |
| DAE | 54.4 | 55.1 | 58.7 | 60.9 | 50.4 | 53.6 | 53.6 | 54.7 | 52 | 58.3 | 55.2 |
| Cohere-cmd-xl | 51.1 | 52.7 | 51.3 | 52.6 | 60.2 | 59.4 | 50 | 60.5 | 54.5 | 60.5 | 55.3 |
| Vicuna-13b | 52.8 | 52.5 | 51.3 | 63.5 | 57.9 | 51.8 | 55.4 | 59.7 | 54 | 62.4 | 56.1 |
| SummaCConv | 58.1 | 55.2 | 53.1 | 61.9 | 59 | 53.7 | 59.3 | 59.7 | 53.5 | 57.9 | 57.1 |
| Mistral-7b | 50 | 55.5 | 56.7 | 59.8 | 63.4 | 59.7 | 53.5 | 59.6 | 55.9 | 63.7 | 57.8 |
| Llama2-13b | 51.3 | 54.6 | 57.2 | 59.3 | 63.1 | 58.1 | 58.6 | 63.4 | 56.5 | 61.4 | 58.4 |
| Claudev13 | 60.4 | 51.9 | 64.5 | 63.4 | 61.3 | 57 | 58.1 | 57.8 | 56.9 | 68.1 | 59.9 |
| Dav002 | 56.4 | 53.9 | 57.1 | 61.9 | 65.1 | 59.1 | 56.6 | 64.6 | 60.6 | 66.2 | 60.1 |
| Bard | 50 | 58.1 | 61.3 | 71.6 | 73.3 | 70.6 | 58.7 | 66 | 53.9 | 72.7 | 63.6 |
| QAFactEval | 63.7 | 54.2 | 66.2 | 74.4 | 68.4 | 63.6 | 61.6 | 67.5 | 62.4 | 72.6 | 65.5 |
| PaLM-bison | 66 | 62 | 69 | 68.4 | 74.4 | 68.1 | 61.6 | 78.1 | 70.4 | 72.4 | 69 |
| Dav003 | 65.7 | 59.9 | 67.6 | 71 | 78.8 | 69.2 | 69.7 | 74.4 | 72.2 | 77.8 | 70.6 |
| CGPT | 68.4 | 63.6 | 69.1 | 74.4 | 79.4 | 65.5 | 68 | 75.6 | 69.2 | 78.6 | 71.2 |
| Claudev2 | 68.7 | 61.7 | 75.4 | 75.5 | 81 | 67.4 | 74 | 78.1 | 74.8 | 79.2 | 73.6 |
| Claudev21 | 72.6 | 66 | 75.7 | 77.2 | 82 | 68.5 | 73.2 | 78.6 | 72.7 | 77.1 | 74.4 |
| Gemini-pro | 73.7 | 60.2 | 75.7 | 77.6 | 86.9 | 74.2 | 71.9 | 77.6 | 74 | 83.1 | 75.5 |
| GPT4 | 82.7 | 71.1 | 83.1 | 83.3 | 87.9 | 79.5 | 84 | 82.4 | 79.6 | 87 | 82.1 |
| Human Perf. | 90.8 | 87.5 | 89.4 | 90 | 91.8 | 87.4 | 96.9 | 89.3 | 90.7 | 95.4 | 90.9 |
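The scores are balanced accuracy on the binary consistent/inconsistent task, which is why a random or majority-class baseline sits at 50. Below is a minimal sketch of the metric; the label encoding (1 = consistent) is illustrative, not the benchmark's schema.

```python
# Balanced accuracy for a binary consistency classifier:
# the mean of per-class recalls, so a majority-class guesser scores 50.
def balanced_accuracy(y_true, y_pred):
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return 100 * sum(recalls) / len(recalls)

# Illustrative labels: 1 = consistent, 0 = inconsistent.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]
print(f"{balanced_accuracy(y_true, y_pred):.1f}")  # 75.0
```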

## SummEdits Benchmark Release (Sections 6-7)

We release the data for the 10 domains in the SummEdits benchmark in the [data/summedits](https://github.com/salesforce/factualNLG/tree/master/data/summedits) folder.

The [SummEdits_Benchmark.ipynb](https://github.com/salesforce/factualNLG/blob/master/SummEdits_Benchmark.ipynb) notebook shows how to access, open, and visualize the dataset.
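For quick scripted access, a minimal loading sketch is below. The per-domain file layout and field names are assumptions on our part; the notebook above documents the authoritative schema.

```python
import json
from pathlib import Path

# Load every domain file in the SummEdits release folder.
# One JSON file per domain is an assumption; see the notebook
# above for the authoritative layout.
data_dir = Path("data/summedits")
domains = {}
for fp in sorted(data_dir.glob("*.json")):
    with fp.open() as f:
        domains[fp.stem] = json.load(f)

for name, samples in domains.items():
    # Each sample is expected to pair a document with an edited
    # summary and a consistency label.
    print(f"{name}: {len(samples)} samples")
```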

## FactCC Explanation Analysis (Section 3.5)

As part of the paper, we annotated 3.6k model-generated explanations justifying the decision to label a summary as *inconsistent*. The annotations are available in [data/factcc/factcc_explanation_annotation.json](https://github.com/salesforce/factualNLG/blob/master/data/factcc/factcc_explanation_annotation.json).
The notebook [FactCC_Explanation_Annotation.ipynb](https://github.com/salesforce/factualNLG/blob/master/FactCC_Explanation_Annotation.ipynb) shows how to load and view the annotations.
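A short sketch for inspecting the annotation file follows; since the exact structure of each entry is best confirmed against the notebook, the code only peeks at the schema rather than assuming field names.

```python
import json

# Load the 3.6k explanation annotations. Whether the top level is a
# list or a dict is an assumption; print one entry to confirm the
# schema before relying on specific field names.
with open("data/factcc/factcc_explanation_annotation.json") as f:
    annotations = json.load(f)

print(type(annotations).__name__, len(annotations))
sample = annotations[0] if isinstance(annotations, list) \
    else next(iter(annotations.values()))
print(json.dumps(sample, indent=2)[:500])  # peek at one entry
```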

## Prompts

We release all prompts used in the paper's experiments in the [prompts/](https://github.com/salesforce/factualNLG/tree/master/prompts/) folder. Specifically:
- [factcc/](https://github.com/salesforce/factualNLG/blob/master/prompts/factcc/) contains the 26 prompts we experimented with in the initial FactCC experiments (Section 3.1).
- [summedits/step2_consistent.txt](https://github.com/salesforce/factualNLG/blob/master/prompts/summedits/step2_consistent.txt) and [summedits/step2_inconsistent.txt](https://github.com/salesforce/factualNLG/blob/master/prompts/summedits/step2_inconsistent.txt) are the prompts used in Step 2 of the SummEdits protocol to generate edits of seed summaries (Section 5.2).
- [summedits/standard_zs_prompt.txt](https://github.com/salesforce/factualNLG/blob/master/prompts/summedits/standard_zs_prompt.txt) is the zero-shot prompt used to assess all LLM performance on the SummEdits benchmark (Section 6.3); a usage sketch follows this list.
- [summedits/edit_typing_gpt4.txt](https://github.com/salesforce/factualNLG/blob/master/prompts/summedits/edit_typing_gpt4.txt) is a few-shot prompt used to predict the edit types of inconsistent samples in SummEdits (Section 6.4).
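As a usage sketch for the zero-shot prompt: the template can be filled with a (document, summary) pair and sent to any chat LLM. The `{document}`/`{summary}` placeholder names are assumptions about the template, so check the prompt file before substituting.

```python
# Fill the zero-shot SummEdits prompt with one (document, summary) pair.
# The placeholder names below are assumptions about the template; open
# the prompt file to verify the actual format before substituting.
with open("prompts/summedits/standard_zs_prompt.txt") as f:
    template = f.read()

prompt = template.replace("{document}", "The quick brown fox...") \
                 .replace("{summary}", "A fox jumped over a dog.")

# The filled prompt can then be sent to any LLM API; the model's
# answer is parsed into a binary consistent/inconsistent prediction.
print(prompt[:400])
```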