{"id":13652599,"url":"https://github.com/hkust-nlp/felm","last_synced_at":"2025-07-24T13:05:31.252Z","repository":{"id":175981877,"uuid":"650746226","full_name":"hkust-nlp/felm","owner":"hkust-nlp","description":"Github repository for \"FELM: Benchmarking Factuality Evaluation of Large Language Models\" (NeurIPS 2023)","archived":false,"fork":false,"pushed_at":"2023-12-25T15:20:49.000Z","size":2428,"stargazers_count":59,"open_issues_count":3,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-05-14T08:03:46.463Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hkust-nlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-07T18:07:21.000Z","updated_at":"2025-04-29T02:22:39.000Z","dependencies_parsed_at":"2023-11-23T15:28:45.088Z","dependency_job_id":"98357553-0d47-453d-bd58-d7f7f32aecbf","html_url":"https://github.com/hkust-nlp/felm","commit_stats":null,"previous_names":["sjtu-lit/felm","hkust-nlp/felm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hkust-nlp/felm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkust-nlp%2Ffelm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkust-nlp%2Ffelm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkust-nlp%2Ffelm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkust-nlp%2Ffelm/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/hkust-nlp","download_url":"https://codeload.github.com/hkust-nlp/felm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkust-nlp%2Ffelm/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259763073,"owners_count":22907406,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T02:01:00.814Z","updated_at":"2025-07-10T14:30:42.046Z","avatar_url":"https://github.com/hkust-nlp.png","language":"Python","funding_links":[],"categories":["Anthropomorphic-Taxonomy"],"sub_categories":["Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks"],"readme":"# FELM\n![](image/title.png)\n\u003cp align=\"center\"\u003e\n   🌐 \u003ca href=\"https://hkust-nlp.github.io/felm/\" target=\"_blank\"\u003eWebsite\u003c/a\u003e • 🤗 \u003ca href=\"https://huggingface.co/datasets/hkust-nlp/felm\" target=\"_blank\"\u003eHugging Face Dataset\u003c/a\u003e •   📃 \u003ca href=\"http://arxiv.org/abs/2310.00741\" target=\"_blank\"\u003ePaper\u003c/a\u003e \n\u003c/p\u003e\n\n FELM is a meta benchmark to *evaluate factuality evaluation* for large language models.\n The benchmark comprises 847 questions that span five distinct domains: world knowledge, science/technology, writing/recommendation, reasoning, and math. 
We gather prompts for each domain from various sources, including standard datasets like TruthfulQA, online platforms like GitHub repositories, ChatGPT generation, and prompts drafted by the authors.\n\n We then obtain responses from ChatGPT for these prompts. For each response, we employ fine-grained annotation at the segment level, which includes reference links, identified error types, and the reasons behind these errors as provided by our annotators.\n\n![](image/felm_examples.png)\n\n## Download\n\n- Method 1: Download the whole dataset by:\n  ```\n  wget https://huggingface.co/datasets/hkust-nlp/felm/resolve/main/all.jsonl\n  ```\n- Method 2: Load the dataset using [Hugging Face datasets](https://huggingface.co/datasets/hkust-nlp/felm):\n\n  ```python\n  from datasets import load_dataset\n  dataset = load_dataset(\"hkust-nlp/felm\", \"wk\")\n  print(dataset['test'][0])\n  \n  ```\n\n\n## Data Description\n#### Dataset Snapshot\n\nCategory | Data\n--- | ---\nNumber of Instances | 847\nNumber of Fields | 5\nLabeled Classes | 2\nNumber of Labels | 4427\n\n#### Descriptive Statistics\n\n\nStatistic | All | world_knowledge | Reasoning | Math | Science/tech | Writing/Recommendation \n--- | --- | --- | --- | --- | --- | ---\nSegments | 4427 | 532  | 1025 | 599 | 683 |  1588\nPositive segments | 3642 | 385  | 877 | 477 | 582 |1321 \nNegative segments |785 | 147  | 148 | 122 | 101 | 267  \n\n#### Data Fields\n\n| Field Name  | Field Value | Description                                 |\n| ----------- | ----------- | ------------------------------------------- |\n| index         | Integer     | the order number of the data point          |\n| source   | string      | the prompt source   |\n| prompt           | string      | the prompt for generating response                   |\n| response           | string      | ChatGPT's response to the prompt                  |\n| segmented_response           | list      | segments of the response                   |\n| labels          | list  
    | factuality labels for segmented_response                  |\n| comment      | list      | error reasons for segments with factual errors  |\n| type | list      |  error types for segments with factual errors        |\n| ref | list      |  reference links       |\n\n\n#### Typical Data Point\n\n\n```\n{\"index\": \"0\", \n \"source\": \"quora\", \n \"prompt\": \"Which country or city has the maximum number of nuclear power plants?\", \n \"response\": \"The United States has the highest number of nuclear power plants in the world, with 94 operating reactors. Other countries with a significant number of nuclear power plants include France, China, Russia, and South Korea.\",\n \"segmented_response\": [\"The United States has the highest number of nuclear power plants in the world, with 94 operating reactors.\", \"Other countries with a significant number of nuclear power plants include France, China, Russia, and South Korea.\"], \n \"labels\": [false, true],\n \"comment\": [\"As of December 2022, there were 92 operable nuclear power reactors in the United States.\", \"\"], \n \"type\": [\"knowledge_error\", null], \n \"ref\": [\"https://www.eia.gov/tools/faqs/faq.php?id=207\u0026t=3\"]}\n\n```\n#### Evaluation on FELM\nEnvironment requirements:\n```\ntransformers 4.32.0\nopenai 0.27.8\ntenacity 8.2.2\ntokenizer 3.4.2\npandas 2.0.3\n```\nTo reproduce our results:\n```\ncd eval\n# Put \"all.jsonl\" here (downloaded by Method 1)\nbash eval.sh\n# You can choose \"vicuna_30B\", \"gpt-3.5-turbo\" and \"gpt-4\" for the parameter \"model\".\n# You can choose \"raw\", \"cot\", \"link\", \"content\" and \"cot-cons\" (cot-cons means the cot self-consistency method) for the parameter \"method\".\n# Replace 'Your OPENAI KEY' with your OpenAI API key if using GPT-3.5 or GPT-4\n```\n\n\n#### Leaderboard (segment level)\n\n| Model | F1 score | Balanced accuracy                                |\n| ----------- | ----------- | ------------------------------------------- |\n| GPT-4         | 
48.3     |   67.1        |\n| Vicuna-33B   | 32.5      | 56.5   |\n| ChatGPT           | 25.5      | 55.9                   |\n\nWe only report the highest scores in this table.\n\n\n## Licenses\n\n[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://lbesson.mit-license.org/)\n\nThis work is licensed under an [MIT License](https://lbesson.mit-license.org/).\n\n[![CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](http://creativecommons.org/licenses/by-nc-sa/4.0/)\n\nThe FELM dataset is licensed under a\n[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).\n\n## Citation\n\nPlease cite our paper if you use our dataset:\n```bibtex\n@inproceedings{\nchen2023felm,\ntitle={FELM: Benchmarking Factuality Evaluation of Large Language Models},\nauthor={Chen, Shiqi and Zhao, Yiran and Zhang, Jinghan and Chern, I-Chun and Gao, Siyang and Liu, Pengfei and He, Junxian},\nbooktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},\nyear={2023},\nurl={http://arxiv.org/abs/2310.00741}\n}\n\n```\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhkust-nlp%2Ffelm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhkust-nlp%2Ffelm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhkust-nlp%2Ffelm/lists"}