{"id":19243832,"url":"https://github.com/openmoss/halluqa","last_synced_at":"2025-07-07T00:09:23.714Z","repository":{"id":203115046,"uuid":"700142838","full_name":"OpenMOSS/HalluQA","owner":"OpenMOSS","description":"Dataset and evaluation script for \"Evaluating Hallucinations in Chinese Large Language Models\"","archived":false,"fork":false,"pushed_at":"2024-06-05T13:32:51.000Z","size":6354,"stargazers_count":126,"open_issues_count":0,"forks_count":7,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-05-12T09:18:42.163Z","etag":null,"topics":["hallucinations","large-language-models","question-answering"],"latest_commit_sha":null,"homepage":"https://arxiv.org/pdf/2310.03368.pdf","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenMOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-10-04T03:01:40.000Z","updated_at":"2025-05-08T13:22:37.000Z","dependencies_parsed_at":null,"dependency_job_id":"0c92d9eb-7dfa-44e1-b40a-55c062c693c8","html_url":"https://github.com/OpenMOSS/HalluQA","commit_stats":null,"previous_names":["xiami2019/halluqa","openmoss/halluqa"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FHalluQA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FHalluQA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FHalluQA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/OpenMOSS%2FHalluQA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenMOSS","download_url":"https://codeload.github.com/OpenMOSS/HalluQA/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FHalluQA/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259038045,"owners_count":22796551,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hallucinations","large-language-models","question-answering"],"created_at":"2024-11-09T17:20:29.747Z","updated_at":"2025-06-10T08:33:14.610Z","avatar_url":"https://github.com/OpenMOSS.png","language":"Python","readme":"# Evaluating Hallucinations in Chinese Large Language Models\n\nThis repository contains the data and evaluation scripts for the HalluQA (Chinese Hallucination Question-Answering) benchmark.\nThe full data of HalluQA is in **HalluQA.json**.\nThe paper introducing HalluQA, with detailed experimental results for many Chinese large language models, is [here](https://arxiv.org/pdf/2310.03368.pdf).\n\n## Update\n**2024.2.28**: We added a multiple-choice task for HalluQA. \nThe test data for the multiple-choice task is in HalluQA_mc.json.\nThe multiple-choice QA prompt is in prompts/Chinese_QA_prompt_mc.txt.\n\n## Data Collection Pipeline\n![](imgs/pipeline.png)\nHalluQA contains 450 meticulously designed adversarial questions spanning multiple domains, taking into account Chinese historical culture, customs, and social phenomena. The pipeline of data collection is shown above. 
At step 1, we write questions that we think may induce model hallucinations. At step 2, we use ChatGPT-3.5/Puyu/GLM-130B to generate answers and collect adversarial questions. At step 3, we write multiple correct and wrong answers for each adversarial question and add supporting evidence. At step 4, we check all annotated question-answer pairs and remove low-quality samples.\n\n## Data Examples\nWe show some data examples of HalluQA here.\n![](imgs/examples.png)\n\n## Metric \u0026 Evaluation Method\nWe use the non-hallucination rate as the metric of HalluQA, which represents the percentage of all model-generated answers that do not exhibit hallucinations.  \nFor automated evaluation, we use GPT-4 as the evaluator. GPT-4 judges whether a generated answer exhibits hallucinations based on the given criteria and the reference correct answers.  \nThe prompt for GPT-4-based evaluation is in **calculate_metrics.py**.\n\n### Run evaluation for your models\n1. Install requirements:\n```\npip install openai\n```\n2. Run evaluation using our script, replacing gpt-4-0613_responses.json with your own responses file:\n```bash\npython calculate_metrics.py --response_file_name gpt-4-0613_responses.json --api_key \"your openai api key\" --organization \"organization of your openai account\"\n```\n3. The results and the metric will be saved in **results.json** and **non_hallucination_rate.txt**, respectively.\n\n### Multiple-choice task\nWe also provide a multiple-choice task for HalluQA. 
\nYou first need to generate an answer for each question with the model to be tested, using our [multiple-choice prompt](./prompts/Chinese_QA_prompt_mc.txt), and then calculate the accuracy of the multiple-choice task with the following script:\n```bash\npython calculate_metrics_mc.py --response_file_name \u003cyour_results_file_name\u003e\n```\n\n## Results\n### Leaderboard\n**Non-hallucination rate of each model for different types of questions**:\n| **Model**                | **Misleading** | **Misleading-hard** | **Knowledge** | **Total** |\n|--------------------------|----------------|--------------------|---------------|-----------|\n| ***Retrieval-Augmented Chat Model*** | | | | |\n| ERNIE-Bot                | 70.86          | 46.38              | 75.73        | 69.33    |\n| Baichuan2-53B            | 59.43          | 43.48              | 83.98        | 68.22    |\n| ChatGLM-Pro              | 64.00          | 34.78              | 67.96        | 61.33    |\n| SparkDesk                | 59.43          | 27.54              | 71.36        | 60.00    |\n| ***Chat Model***                     | | | | |\n| abab5.5-chat             | 60.57          | 39.13              | 57.77        | 56.00    |\n| gpt-4-0613               | 76.00          | 57.97              | 32.04        | 53.11    |\n| Qwen-14B-chat            | 75.43          | 23.19              | 30.58        | 46.89    |\n| Baichuan2-13B-chat       | 61.71          | 24.64              | 32.04        | 42.44    |\n| Baichuan2-7B-chat        | 54.86          | 28.99              | 32.52        | 40.67    |\n| gpt-3.5-turbo-0613       | 66.29          | 30.43              | 19.42        | 39.33    |\n| Xverse-13B-chat          | 65.14          | 23.19              | 22.33        | 39.11    |\n| Xverse-7B-chat           | 64.00          | 13.04              | 21.84        | 36.89    |\n| ChatGLM2-6B              | 55.43          | 23.19              | 21.36        | 34.89    |\n| Qwen-7B-chat             
| 55.43          | 14.49              | 17.48        | 31.78    |\n| Baichuan-13B-chat        | 49.71          | 8.70               | 23.30        | 31.33    |\n| ChatGLM-6B               | 52.57          | 20.29              | 15.05        | 30.44    |\n| ***Pre-Trained Model***              | | | | |\n| Qwen-14B                 | 54.86          | 23.19              | 24.76        | 36.22    |\n| Baichuan2-13B-base       | 23.43          | 24.64              | 45.63        | 33.78    |\n| Qwen-7B                  | 48.57          | 20.29              | 16.99        | 29.78    |\n| Xverse-13B               | 18.86          | 24.64              | 32.52        | 27.33    |\n| Baichuan-13B-base        | 9.71           | 18.84              | 40.78        | 25.33    |\n| Baichuan2-7B-base        | 8.00           | 21.74              | 41.26        | 25.33    |\n| Baichuan-7B-base         | 6.86           | 15.94              | 37.38        | 22.22    |\n| Xverse-7B                | 12.00          | 13.04              | 29.61        | 20.22    |\n\n### Detailed results\nEach model's generated answers and the corresponding judgement of GPT-4 are in **Chinese_LLMs_outputs/**.\n\n### Multiple-choice task results\nHere we report the accuracy of the multiple-choice task for seven representative models.\n![](./imgs/mc_acc.png)\n\n## Acknowledgements\n- We sincerely thank the annotators and staff from Shanghai AI Lab who were involved in this work.\n- I especially thank Tianxiang Sun, Xiangyang Liu and Wenwei Zhang for their guidance and help.\n- I am also grateful to Xinyang Pu for her help and patience.\n\n## Citation\n```bibtex\n@article{DBLP:journals/corr/abs-2310-03368,\n  author       = {Qinyuan Cheng and\n                  Tianxiang Sun and\n                  Wenwei Zhang and\n                  Siyin Wang and\n                  Xiangyang Liu and\n                  Mozhi Zhang and\n                  Junliang He and\n                  Mianqiu Huang and\n                  Zhangyue Yin 
and\n                  Kai Chen and\n                  Xipeng Qiu},\n  title        = {Evaluating Hallucinations in Chinese Large Language Models},\n  journal      = {CoRR},\n  volume       = {abs/2310.03368},\n  year         = {2023},\n  url          = {https://doi.org/10.48550/arXiv.2310.03368},\n  doi          = {10.48550/arXiv.2310.03368},\n  eprinttype    = {arXiv},\n  eprint       = {2310.03368},\n  timestamp    = {Thu, 19 Oct 2023 13:12:52 +0200},\n  biburl       = {https://dblp.org/rec/journals/corr/abs-2310-03368.bib},\n  bibsource    = {dblp computer science bibliography, https://dblp.org}\n}\n```","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenmoss%2Fhalluqa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenmoss%2Fhalluqa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenmoss%2Fhalluqa/lists"}