{"id":13429876,"url":"https://github.com/salesforce/factualNLG","last_synced_at":"2025-03-16T04:31:15.009Z","repository":{"id":169452025,"uuid":"641509843","full_name":"salesforce/factualNLG","owner":"salesforce","description":"Code for the arXiv paper: \"LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond\"","archived":false,"fork":false,"pushed_at":"2025-01-27T13:38:52.000Z","size":1681,"stargazers_count":59,"open_issues_count":1,"forks_count":3,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-12T15:01:48.757Z","etag":null,"topics":["factual-consistency","factuality","large-language-models","llm","nlp","summarization"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2305.14540","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/salesforce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-16T16:06:59.000Z","updated_at":"2025-01-27T13:38:56.000Z","dependencies_parsed_at":"2024-11-08T13:32:48.430Z","dependency_job_id":null,"html_url":"https://github.com/salesforce/factualNLG","commit_stats":{"total_commits":17,"total_committers":2,"mean_commits":8.5,"dds":0.4117647058823529,"last_synced_commit":"b7f3e2e7cf84d216c2912b0a560fe2d0afbbdbcf"},"previous_names":["salesforce/factualnlg"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FfactualNLG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/salesforce%2FfactualNLG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FfactualNLG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FfactualNLG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/salesforce","download_url":"https://codeload.github.com/salesforce/factualNLG/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243826788,"owners_count":20354220,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["factual-consistency","factuality","large-language-models","llm","nlp","summarization"],"created_at":"2024-07-31T02:00:47.067Z","updated_at":"2025-03-16T04:31:13.645Z","avatar_url":"https://github.com/salesforce.png","language":"Jupyter Notebook","funding_links":[],"categories":["3 Reasoning Tasks","Jupyter Notebook","Anthropomorphic-Taxonomy"],"sub_categories":["3.1 Commonsense Reasoning","Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks"],"readme":"# Factual Consistency in Summarization \n\nCan you tell which edits of summaries are consistent, and which are inconsistent?\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"650\" src=\"images/summedits_examples.png\"\u003e\n\u003c/p\u003e\n\n## SummEdits Benchmark\n\nHere is the updated benchmark, with the latest LLMs (Gemini-pro added on 12/14/2023)\n\n| Model Name          |   Podcast |   Bill Sum |   Sam Sum |   News  |   Sales Call |   Sales Email |   Shake speare  |   Sci TLDR |   QMSumm |   ECT Sum |   
Overall |\n|:--------------------|----------:|----------:|---------:|--------:|--------------:|--------------:|--------------:|----------:|---------:|---------:|----------:|\n| Llama2-7b      |      50   |      50   |     50   |    50.6 |          50.9 |          50   |          50   |      50   |     50.7 |     51.4 |      50.4 |\n| Dav001              |      53.3 |      50.2 |     51   |    54.4 |          55.5 |          52.5 |          50   |      51   |     50.1 |     50.9 |      51.9 |\n| DAE                 |      54.4 |      55.1 |     58.7 |    60.9 |          50.4 |          53.6 |          53.6 |      54.7 |     52   |     58.3 |      55.2 |\n| Cohere-cmd-xl       |      51.1 |      52.7 |     51.3 |    52.6 |          60.2 |          59.4 |          50   |      60.5 |     54.5 |     60.5 |      55.3 |\n| Vicuna-13b          |      52.8 |      52.5 |     51.3 |    63.5 |          57.9 |          51.8 |          55.4 |      59.7 |     54   |     62.4 |      56.1 |\n| SummaCConv          |      58.1 |      55.2 |     53.1 |    61.9 |          59   |          53.7 |          59.3 |      59.7 |     53.5 |     57.9 |      57.1 |\n| Mistral-7b          |      50   |      55.5 |     56.7 |    59.8 |          63.4 |          59.7 |          53.5 |      59.6 |     55.9 |     63.7 |      57.8 |\n| Llama2-13b          |      51.3 |      54.6 |     57.2 |    59.3 |          63.1 |          58.1 |          58.6 |      63.4 |     56.5 |     61.4 |      58.4 |\n| Claudev13           |      60.4 |      51.9 |     64.5 |    63.4 |          61.3 |          57   |          58.1 |      57.8 |     56.9 |     68.1 |      59.9 |\n| Dav002              |      56.4 |      53.9 |     57.1 |    61.9 |          65.1 |          59.1 |          56.6 |      64.6 |     60.6 |     66.2 |      60.1 |\n| Bard                |      50   |      58.1 |     61.3 |    71.6 |          73.3 |          70.6 |          58.7 |      66   |     53.9 |     72.7 |      63.6 |\n| QAFactEval          |   
   63.7 |      54.2 |     66.2 |    74.4 |          68.4 |          63.6 |          61.6 |      67.5 |     62.4 |     72.6 |      65.5 |\n| PaLM-bison          |      66   |      62   |     69   |    68.4 |          74.4 |          68.1 |          61.6 |      78.1 |     70.4 |     72.4 |      69   |\n| Dav003              |      65.7 |      59.9 |     67.6 |    71   |          78.8 |          69.2 |          69.7 |      74.4 |     72.2 |     77.8 |      70.6 |\n| CGPT                |      68.4 |      63.6 |     69.1 |    74.4 |          79.4 |          65.5 |          68   |      75.6 |     69.2 |     78.6 |      71.2 |\n| Claudev2            |      68.7 |      61.7 |     75.4 |    75.5 |          81   |          67.4 |          74   |      78.1 |     74.8 |     79.2 |      73.6 |\n| Claudev21           |      72.6 |      66   |     75.7 |    77.2 |          82   |          68.5 |          73.2 |      78.6 |     72.7 |     77.1 |      74.4 |\n| Gemini-pro          |      73.7 |      60.2 |     75.7 |    77.6 |          86.9 |          74.2 |          71.9 |      77.6 |     74   |     83.1 |      75.5 |\n| GPT4                |      82.7 |      71.1 |     83.1 |    83.3 |          87.9 |          79.5 |          84   |      82.4 |     79.6 |     87   |      82.1 |\n| Human Perf.         
|      90.8 |      87.5 |     89.4 |    90   |          91.8 |          87.4 |          96.9 |      89.3 |     90.7 |     95.4 |      90.9 |\n\n\n## SummEdits Benchmark Release (Sections 6-7)\n\nWe release the data for the 10 domains in the SummEdits benchmark in the [data/summedits](https://github.com/salesforce/factualNLG/tree/master/data/summedits) folder.\n\nThe [SummEdits_Benchmark.ipynb](https://github.com/salesforce/factualNLG/blob/master/SummEdits_Benchmark.ipynb) notebook provides information on how to access, open, and visualize the dataset.\n\n## FactCC Explanation Analysis (Section 3.5)\n\nAs part of the paper, we annotated 3.6k explanations generated by models justifying their choice to identify a summary as *inconsistent*. The annotations are available in [data/factcc/factcc_explanation_annotation.json](https://github.com/salesforce/factualNLG/blob/master/data/factcc/factcc_explanation_annotation.json).\nThe notebook [FactCC_Explanation_Annotation.ipynb](https://github.com/salesforce/factualNLG/blob/master/FactCC_Explanation_Annotation.ipynb) shows how to load/view the annotations.\n\n## Prompts\n\nWe release all prompts that were used in experiments in the paper in the [prompts/](https://github.com/salesforce/factualNLG/tree/master/prompts/) folder. More specifically:\n- [factcc](https://github.com/salesforce/factualNLG/blob/master/prompts/factcc/) is a folder that contains the 26 prompts that we experimented with in initial FactCC experiments (Section 3.1)\n- [summedits/step2_consistent.txt](https://github.com/salesforce/factualNLG/blob/master/prompts/summedits/step2_consistent.txt) and [summedits/step2_inconsistent.txt](https://github.com/salesforce/factualNLG/blob/master/prompts/summedits/step2_inconsistent.txt) were the prompts used in Step 2 of the SummEdits protocol to generate edits of seed summaries. 
(Section 5.2)\n- [summedits/standard_zs_prompt.txt](https://github.com/salesforce/factualNLG/blob/master/prompts/summedits/standard_zs_prompt.txt) is the zero-shot prompt that was used to assess the performance of all LLMs on the SummEdits benchmark. (Section 6.3)\n- [summedits/edit_typing_gpt4.txt](https://github.com/salesforce/factualNLG/blob/master/prompts/summedits/edit_typing_gpt4.txt) is a few-shot prompt used to predict the types of edits for inconsistent samples in SummEdits. (Section 6.4)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2FfactualNLG","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsalesforce%2FfactualNLG","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2FfactualNLG/lists"}