{"id":19053579,"url":"https://github.com/open-compass/cibench","last_synced_at":"2025-10-06T10:32:16.120Z","repository":{"id":246860986,"uuid":"740789324","full_name":"open-compass/CIBench","owner":"open-compass","description":"Official Repo of \"CIBench: Evaluation of LLMs as Code Interpreter \"","archived":false,"fork":false,"pushed_at":"2024-07-19T04:51:48.000Z","size":2213,"stargazers_count":10,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-16T20:34:22.864Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/open-compass.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-09T04:17:44.000Z","updated_at":"2024-12-16T19:28:03.000Z","dependencies_parsed_at":"2024-07-05T15:20:48.215Z","dependency_job_id":"8c9403b4-789a-4724-ae2d-be74da771c3a","html_url":"https://github.com/open-compass/CIBench","commit_stats":null,"previous_names":["open-compass/cibench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FCIBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FCIBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FCIBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FCIBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/owners/open-compass","download_url":"https://codeload.github.com/open-compass/CIBench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235519884,"owners_count":19003201,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T23:32:25.281Z","updated_at":"2025-10-06T10:32:10.656Z","avatar_url":"https://github.com/open-compass.png","language":"Python","readme":"# CIBench: Evaluating Your LLMs with a Code Interpreter Plugin\n\n\u003c!-- [![arXiv](https://img.shields.io/badge/arXiv-2312.14033-b31b1b.svg)](https://arxiv.org/abs/2312.14033) --\u003e\n[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)\n\n## ✨ Introduction  \n\nThis is an evaluation harness for the benchmark described in CIBench: Evaluating Your LLMs with a Code Interpreter Plugin.\n\n\u003c!-- [CIBench: Evaluating Your LLMs with a Code Interpreter Plugin](https://arxiv.org/abs/2312.14033).  --\u003e\n\n[[Paper](https://www.arxiv.org/abs/2407.10499)]\n[[Project Page](https://open-compass.github.io/CIBench/)]\n[[LeaderBoard](https://open-compass.github.io/CIBench/leaderboard.html)]\n\u003c!-- [[HuggingFace](https://huggingface.co/datasets/lovesnowbest/CIBench)] --\u003e\n\n\u003e While LLM-Based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their ability is challenging, thereby hindering a clear understanding of their limitations. 
In this paper, we propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks. Our evaluation framework includes an evaluation dataset and two evaluation modes. The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions. The two evaluation modes assess LLMs' ability with and without human assistance. We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.\n\n\u003c!-- \u003cdiv\u003e\n\u003ccenter\u003e\n\u003cimg src=\"figs/proper.jpg\" width=\"700\" height=\"270\"\u003e\n\u003c/div\u003e --\u003e\n\n\u003cdiv\u003e\n\u003ccenter\u003e\n\u003cimg src=\"figs/teaser.jpg\" width=\"800\" height=\"270\"\u003e\n\u003c/div\u003e\n\n\n## 🛠️ Preparations\nCIBench is evaluated based on [OpenCompass](https://github.com/open-compass/opencompass). Please install OpenCompass first. \n\n```bash\nconda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y\nconda activate opencompass\ngit clone https://github.com/open-compass/opencompass opencompass\ncd opencompass\npip install -e .\npip install -r requirements/agent.txt\n```\n\nThen clone CIBench: \n\n```bash\ncd ..\ngit clone https://github.com/open-compass/CIBench.git\ncd CIBench\n```\n\nand move *cibench_eval* into *opencompass/config*.\n\u003c!-- ##  🛫️ Get Started\n\nWe support both API-based models and HuggingFace models via [Lagent](https://github.com/InternLM/lagent). --\u003e\n\n### 💾 Test Data\n\nYou can download the CIBench dataset from [here](https://github.com/open-compass/opencompass/releases/download/0.2.4.rc1/cibench_dataset.zip). \n\nThen unzip the dataset and place it in *OpenCompass/data*. 
The data path should be like *OpenCompass/data/cibench_dataset/cibench_{generation or template}*. \n\nFinally, use the following script to download the necessary data.\n\n```bash\ncd OpenCompass/data/cibench_dataset\nsh collect_datasources.sh\n```\n\n\u003c!-- The data file structure is:\n```bash\nopencompass\n--- data\n--- --- cibench_dataset\n--- --- --- cibench_generation\n--- --- --- cibench_template\n--- --- --- cibench_template_chinese\n--- --- --- collect_datasources.sh\n``` --\u003e\n\n\n\n\u003c!-- ### 🤖 API Models\n\n1. Set your OPENAI key in your environment.\n```bash\nexport OPENAI_API_KEY=xxxxxxxxx\n```\n2. Run the model with the following scripts\n```bash\n# test all data at once\nsh test_all_en.sh api gpt-4-1106-preview gpt4\n# test ZH dataset\nsh test_all_zh.sh api gpt-4-1106-preview gpt4\n# test for Instruct only\npython test.py --model_type api --model_path gpt-4-1106-preview --resume --out_name instruct_gpt4.json --out_dir work_dirs/gpt4/ --dataset_path data/instruct_v2.json --eval instruct --prompt_type json --\u003e\n\u003c!-- ``` --\u003e\n\n### 🤗 HuggingFace Models\n\n1. Download the HuggingFace model to your local path.\n\u003c!-- 2. Uncomment or comment the model in *opencompass/config/cibench_eval/model_collections/chat_models.py*. --\u003e\n2. Run the model with the following script in the opencompass directory:\n```bash\npython run.py config/cibench_eval/eval_cibench_hf.py\n```\nNote that the current accelerator config (`-a lmdeploy`) does not support the CodeAgent model. If you want to use lmdeploy to accelerate the evaluation, please refer to [lmdeploy_internlm2_chat_7b](https://github.com/open-compass/opencompass/blob/main/configs/models/hf_internlm/lmdeploy_internlm2_chat_7b.py) and write the model config yourself.\n\u003c!-- Note: You can install [lmdeploy](https://github.com/InternLM/lmdeploy) and add '-a lmdeploy' to accelerate the evaluation. 
--\u003e\n\n### 💫 Final Results\n\nOnce evaluation of all test samples has finished, you can check the results in *outputs/cibench*. \n\nNote that the output images will be saved in *output_images*.\n\n## 📊 Benchmark Results\n\nMore detailed and comprehensive benchmark results are available on the 🏆 [CIBench official leaderboard](https://open-compass.github.io/CIBench/leaderboard.html)!\n\n\u003cdiv\u003e\n\u003ccenter\u003e\n\u003cimg src=\"figs/cibench.png\"\u003e\n\u003c/div\u003e\n\n\u003c!-- ### ✉️ Submit Your Results\n\nYou can submit your inference results (via running test.py) to this [email](lovesnow@mail.ustc.edu.cn). We will run your predictions and update the results in our leaderboard. Please also provide the scale of your tested model. A sample structure of your submission should be like:\n```\n$model_display_name/\n    instruct_$model_display_name/\n        query_0_1_0.json\n        query_0_1_1.json\n        ...\n    plan_json_$model_display_name/\n    plan_str_$model_display_name/\n    ...\n``` --\u003e\n\n## ❤️ Acknowledgements\n\nCIBench is built with [Lagent](https://github.com/InternLM/lagent) and [OpenCompass](https://github.com/open-compass/opencompass). 
Thanks for their awesome work!\n\n\u003c!-- ## 🖊️ Citation\n\nIf you find this project useful in your research, please consider citing:\n```\n@article{chen2023t,\n  title={CIBench: Evaluating Your LLMs with a Code Interpreter Plugin},\n  author={Chuyu Zhang*, Yingfan Hu*, Songyang Zhang, Kuikun Liu, Zerun Ma, Fengzhe Zhou, Wenwei Zhang, Xuming He, Dahua Lin, Kai Chen},\n  journal={arXiv preprint arXiv:2312.14033},\n  year={2023}\n}\n``` --\u003e\n\n## 💳 License\n\nThis project is released under the Apache 2.0 [license](./LICENSE).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-compass%2Fcibench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopen-compass%2Fcibench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-compass%2Fcibench/lists"}