{"id":24116345,"url":"https://github.com/zhuohaoyu/kieval","last_synced_at":"2025-09-18T04:32:11.019Z","repository":{"id":224473789,"uuid":"762043982","full_name":"zhuohaoyu/KIEval","owner":"zhuohaoyu","description":"[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models","archived":false,"fork":false,"pushed_at":"2024-07-19T06:02:49.000Z","size":11103,"stargazers_count":29,"open_issues_count":2,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-07-19T14:12:08.524Z","etag":null,"topics":["acl2024","explainable-ai","llm","llm-evaluation","llm-evaluation-framework","llm-evaluation-metrics","llm-evaluation-toolkit","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zhuohaoyu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-23T01:09:23.000Z","updated_at":"2024-07-19T06:02:52.000Z","dependencies_parsed_at":"2024-07-19T11:53:32.998Z","dependency_job_id":null,"html_url":"https://github.com/zhuohaoyu/KIEval","commit_stats":null,"previous_names":["zhuohaoyu/kieval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhuohaoyu%2FKIEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhuohaoyu%2FKIEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhuohaoyu%2FKIEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhuohao
yu%2FKIEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zhuohaoyu","download_url":"https://codeload.github.com/zhuohaoyu/KIEval/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233446335,"owners_count":18677488,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acl2024","explainable-ai","llm","llm-evaluation","llm-evaluation-framework","llm-evaluation-metrics","llm-evaluation-toolkit","machine-learning"],"created_at":"2025-01-11T06:16:18.849Z","updated_at":"2025-09-18T04:32:03.107Z","avatar_url":"https://github.com/zhuohaoyu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models\n\n\u003cdiv align=\"center\"\u003e\n  \u003ca\u003eZhuohao Yu\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n  \u003ca\u003eChang Gao\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n  \u003ca\u003eWenjin Yao\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n  \u003ca\u003eYidong Wang\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp; \u003cbr\u003e\n  \u003ca\u003eWei Ye\u003csup\u003e†1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n  \u003ca\u003eJindong Wang\u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n  \u003ca\u003eXing Xie\u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n  \u003ca\u003eYue Zhang\u003csup\u003e3\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n  \u003ca\u003eShikun 
Zhang\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n  \u003cp\u003e \u003csup\u003e1\u003c/sup\u003e Peking University, \u003csup\u003e2\u003c/sup\u003e Microsoft Research, \u003csup\u003e3\u003c/sup\u003e Westlake University.\u003c/p\u003e\n\u003c/div\u003e\n\n\n\n## Overview\n\u003cdiv align=\"center\"\u003e\n\n\u003ca href=\"https://github.com/zhuohaoyu/KIEval/\"\u003e\n    \u003cimg src=\"figures/pipeline.png\" alt=\"KIEval Pipeline\" width=\"600\" class=\"center\"\u003e\n\u003c/a\u003e\n\u003c/div\u003e\n\nThis is the official repository for [KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models](https://arxiv.org/abs/2402.15043), accepted to the main conference of the 62nd Annual Meeting of the Association for Computational Linguistics (**ACL 2024**).\n\nAutomatic evaluation methods for large language models (LLMs) are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status instead of accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which incorporates an LLM-powered \"interactor\" role for the first time to accomplish a dynamic contamination-resilient evaluation. Starting with a question in a conventional LLM benchmark involving domain-specific knowledge, KIEval utilizes dynamically generated, multi-round, and knowledge-focused dialogues to determine whether a model's response is merely a recall of benchmark answers or demonstrates a deep comprehension to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. 
We also reveal that data contamination brings no contribution or even negative effect to models' real-world applicability and understanding, and existing contamination detection methods for LLMs can only identify contamination in pre-training but not during supervised fine-tuning.\n\n\n## Quick Start\n\nTo get started, first clone the repository and set up the environment:\n\n```bash\ngit clone https://github.com/zhuohaoyu/KIEval.git\ncd KIEval\npip install -r requirements.txt\n```\n\nWe provide a modular implementation of our method. We currently support evaluating local models with Hugging Face Transformers and remote models through text-generation-inference or other APIs.\n\nTo reproduce results in our paper or evaluate new models with KIEval, we recommend starting a [text-generation-inference](https://huggingface.co/docs/text-generation-inference/en/index) instance with your model:\n\n```bash\nmodel=meta-llama/Llama-2-7b-chat-hf\nvolume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run\n\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model\n```\n\nThen, generate an evaluation config file with our script:\n\n```bash\n# --template: a template config file we provide\n# --dataset: dataset name, please refer to datasets/ for all supported datasets\n# --base_url: replace with your host URL; if you start text-generation-inference locally, use http://localhost:8080\n# --model_name: any name you like\n# --model_path: Hugging Face model ID or local model path\n# --openai_api_base: OpenAI API base URL; replace with a proxy URL if needed\n# --openai_key: replace with your OpenAI API key\n# --output_path: output path for evaluation results\n# --generate_path: output path for the generated config file\npython scripts/generate-basic.py \\\n    --template ./config/template-basic.json \\\n    --dataset arc_challenge \\\n    --base_url http://your-host-url:8080 \\\n    --model_name llama-2-7b-chat-hf \\\n    --model_path meta-llama/Llama-2-7b-chat-hf \\\n    --openai_api_base https://api.openai.com/v1/ \\\n    --openai_key your_openai_key \\\n    --openai_model 
gpt-4-1106-preview \\\n    --output_path ./result \\\n    --generate_path ./config/generated.json\n```\n\nFinally, run the evaluation process with the generated config file and wait for the results :)\n\n```bash\npython run.py -c ./config/generated.json\n```\n\n\nThis repository provides all the settings researchers need to reproduce the results of KIEval. It also facilitates the reproduction of all metrics from previous works discussed in our paper. Please refer to `config/templates` for all supported evaluation methods.\n\n\n## Citation\n✨ If you find our work helpful, please consider citing with:\n\n\n```bibtex\n@article{yu2024kieval,\n  title={KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models}, \n  author={Zhuohao Yu and Chang Gao and Wenjin Yao and Yidong Wang and Wei Ye and Jindong Wang and Xing Xie and Yue Zhang and Shikun Zhang},\n  journal={ArXiv},\n  year={2024},\n  volume={abs/2402.15043},\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhuohaoyu%2Fkieval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhuohaoyu%2Fkieval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhuohaoyu%2Fkieval/lists"}