{"id":13753920,"url":"https://github.com/LaVi-Lab/CLEVA","last_synced_at":"2025-05-09T21:36:08.579Z","repository":{"id":200432452,"uuid":"705455621","full_name":"LaVi-Lab/CLEVA","owner":"LaVi-Lab","description":"[EMNLP 2023 Demo] CLEVA: Chinese Language Models EVAluation Platform","archived":false,"fork":false,"pushed_at":"2025-03-31T12:30:05.000Z","size":398,"stargazers_count":62,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-31T13:49:14.573Z","etag":null,"topics":["chinese","evaluation","nlp"],"latest_commit_sha":null,"homepage":"http://www.lavicleva.com/","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LaVi-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-16T03:22:24.000Z","updated_at":"2025-03-31T12:30:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"aae69066-9c58-48b6-a8d8-1f73c7b3fe79","html_url":"https://github.com/LaVi-Lab/CLEVA","commit_stats":{"total_commits":6,"total_committers":3,"mean_commits":2.0,"dds":0.5,"last_synced_commit":"9dc51d47c885258c598e8b6e8dd02ffbc2ddb47c"},"previous_names":["lavi-lab/cleva"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaVi-Lab%2FCLEVA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaVi-Lab%2FCLEVA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaVi-Lab%2FCLEVA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaVi-Lab%2FCLEVA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LaVi-Lab","download_url":"https://codeload.github.com/LaVi-Lab/CLEVA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253329048,"owners_count":21891568,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese","evaluation","nlp"],"created_at":"2024-08-03T09:01:34.868Z","updated_at":"2025-05-09T21:36:08.563Z","avatar_url":"https://github.com/LaVi-Lab.png","language":"Shell","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# \u003ch1 align=\"center\"\u003eCLEVA: Chinese Language Models EVAluation Platform\u003c/h1\u003e\n\u003cdiv align=\"center\"\u003e\n\n[![GitHub Repo stars](https://img.shields.io/github/stars/Lavi-Lab/CLEVA)](https://github.com/Lavi-Lab/CLEVA/stargazers)\n[![License: CC BY-NC-ND 4.0](https://img.shields.io/badge/License-CC_BY--NC--ND_4.0-blue.svg)](https://creativecommons.org/licenses/by-nc-nd/4.0/)\n[![Ask me anything](https://img.shields.io/badge/Ask%20me-anything-blue.svg)](https://github.com/LaVi-Lab/CLEVA/issues/new)\n\n[🌐Website](http://www.lavicleva.com/)\n•[📜Paper \\[EMNLP 2023 Demo\\]](https://arxiv.org/abs/2308.04813)\n•[📌Instructions](#instructions)\n•✉️\u003ca href=\"mailto:clevaplat@gmail.com\"\u003eEmail\u003c/a\u003e\n\nEnglish | [简体中文](README_zh-CN.md)\n\n\u003c/div\u003e\n\n## 🎯 Introduction\n\n[CLEVA](https://arxiv.org/abs/2308.04813) is a Chinese Language Models EVAluation Platform developed by CUHK [LaVi Lab](https://lwwangcse.github.io/). CLEVA would like to thank Shanghai AI Lab for the great collaboration in the process. The main features of CLEVA include:\n- A comprehensive **Chinese Benchmark**, featuring 31 tasks (11 application assessments + 20 ability evaluation tasks), with a total of 370K Chinese test samples (33.98% are newly collected, mitigating *data contamination* issues);\n- A standardized **Prompt-Based Evaluation Methodology**, incorporating unified pre-processing for all data and using *a consistent set* of Chinese prompt templates for evaluation.\n- A trustworthy **Leaderboard**, as CLEVA uses a significant amount of new data to minimize data contamination and regularly organizes evaluations.\n\nThe leaderboard is evaluated and maintained by CLEVA using new test data. Past leaderboard data (processed test samples, annotated prompt templates, etc.) are made available to users for local evaluation runs.\n\n![Overview](overview.png)\n\n## 🔥 News\n\n- **\\[2024.12.06\\]** We are thrilled to announce \u003ca href=\"https://arxiv.org/abs/2412.04947\"\u003eC\u003csup\u003e2\u003c/sup\u003eLEVA\u003c/a\u003e, an effort toward building a comprehensive bilingual benchmark with systematic contamination prevention. 🔥🔥🔥\n- **\\[2023.11.02\\]** Thanks for the support of Stanford CRFM HELM team! CLEVA has now been integrated into the [latest release](https://github.com/stanford-crfm/helm/releases/tag/v0.3.0) of HELM. You can use CLEVA to evaluate your own models locally via HELM.\n- **\\[2023.09.30\\]** CLEVA has been accepted to [EMNLP 2023 System Demonstrations](https://2023.emnlp.org/calls/demos/)!\n- **\\[2023.08.09\\]** Our [paper](https://arxiv.org/abs/2308.04813) for CLEVA is out!\n\n\u003ca id=\"instructions\"\u003e\u003c/a\u003e\n## 📌 Instructions\n\n[CLEVA](https://arxiv.org/abs/2308.04813) has been integrated into [HELM](https://github.com/stanford-crfm/helm). CLEVA would like to thank Stanford CRFM HELM team for the support. Users can employ CLEVA's datasets, prompt templates, perturbations, and Chinese automatic metrics for local evaluations via HELM.\n\n\u003e [!NOTE]\n\u003e If you want to evaluate your models on CLEVA online, please contact us via \u003cclevaplat@gmail.com\u003e for authentication and check out [📘Documentation](http://www.lavicleva.com/#/homepage/createautotask) for API development.\n\n### 🛠️ Installation\n\nUsers can refer to the [installation guide](https://crfm-helm.readthedocs.io/en/latest/installation/) of HELM for setting up the Python environment and dependencies (`Python\u003e=3.8`).\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eInstallation Using Anaconda\u003c/b\u003e\u003c/summary\u003e\n\nHere is an example for installation using [Anaconda](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html):\n\nCreate the environment first:\n```sh\n# Create virtual environment\n# Only need to run once\nconda create -n cleva python=3.8 pip\n\n# Activate the virtual environment\nconda activate cleva\n```\n\nThen install the dependencies:\n\n```sh\npip install crfm-helm\n```\n\u003c/details\u003e\n\n### ⚖️ Evaluation\n\nExample command to evaluate `gpt-3.5-turbo-0613` on CLEVA's Chinese-to-English translation task using HELM:\n\n```sh\nhelm-run \\\n-r \"cleva:model=openai/gpt-3.5-turbo-0613,task=translation,subtask=zh2en,prompt_id=0,version=v1,data_augmentation=cleva\" \\\n--num-train-trials \u003cnum_trials\u003e \\\n--max-eval-instances \u003cmax_eval_instances\u003e \\\n--suite \u003csuite_id\u003e\n```\n\nExplanation of parameters in `-r` (run configuration):\n\n- `task` represents one of the 31 tasks included in CLEVA;\n- `subtask` specifies the subcategory under each CLEVA task;\n- `prompt_id` is the index of CLEVA's annotated prompt templates (starting from 0);\n- `version` is the version number of the CLEVA dataset (currently only the `v1` dataset used in the paper is provided);\n- `data_augmentation` specifies the data augmentation strategy, where values like `cleva_robustness`, `cleva_fairness`, and `cleva` are unique to CLEVA for evaluating Chinese language robustness, fairness and both respectively.\n\nFor other parameters, please refer to HELM's [tutorial](https://crfm-helm.readthedocs.io/en/latest/tutorial/).\n\nThe full list of available `task`, `subtask`, and `prompt_id` of CLEVA (`version=v1`) can be found in HELM's [.conf](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs_cleva_v1.conf) file. Users can run the entire CLEVA evaluation suite using the following command (the running time for reproducing CLEVA results can be found in the [paper](https://arxiv.org/abs/2308.04813)):\n\n```sh\nhelm-run \\\n-c src/helm/benchmark/presentation/run_specs_cleva_v1.conf \\\n--num-train-trials \u003cnum_trials\u003e \\\n--max-eval-instances \u003cmax_eval_instances\u003e \\\n--suite \u003csuite_id\u003e\n```\nGenerally, setting `--max-eval-instances` to over 5000 ensures all CLEVA task data are used for evaluation.\n\n### 📊 Reference Results\n\nComparison between the results obtained using HELM for evaluating `gpt-3.5-turbo-0613` on selected CLEVA tasks (`version=v1`) and those from the CLEVA platform:\n\n| Scenario | Metric | Reproduced in HELM | Evaluated by CLEVA |\n| ---- | ----------------- | ---------------- | ----------- |\n| task=summarization,subtask=dialogue_summarization | ROUGE-2 | 0.3045 | 0.3065 |\n| task=translation,subtask=en2zh | SacreBLEU | 60.48 | 59.23 |\n| task=fact_checking | Exact Match | 0.4595 | 0.4528 |\n| task=bias,subtask=dialogue_region_bias | Micro F1 | 0.5656 | 0.5589 |\n\n\u003e [!NOTE]\n\u003e The difference is mainly due to different random seeds resulting in different in-context demonstrations, and the ChatGPT versions used by CLEVA and HELM are not completely aligned.\n\n## ⏬ Download Data\n\nIf you want to use CLEVA data for evaluation with your own code, you can download the data by:\n```sh\nbash download_data.sh\n```\nAfter a successful run, a folder named with the data version will be generated in the current directory, which contains the data of each task of CLEVA. You can specify the data version by passing arguments to `download_data.sh`. It is `v1` by default.\n\n## 🛂 License\n\nCLEVA is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.\n\nYou should have received a copy of the license along with this work. If not, see \u003chttps://creativecommons.org/licenses/by-nc-nd/4.0/\u003e.\n\n## 🖊️ Citation\n\nPlease cite our paper if you use CLEVA in your work:\n```bib\n@misc{li2024c2leva,\n      title={C$^2$LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation}, \n      author={Yanyang Li and Tin Long Wong and Cheung To Hung and Jianqiao Zhao and Duo Zheng and Ka Wai Liu and Michael R. Lyu and Liwei Wang},\n      year={2024},\n      eprint={2412.04947},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2412.04947}, \n}\n\n@misc{li2023cleva,\n      title={CLEVA: Chinese Language Models EVAluation Platform}, \n      author={Yanyang Li and Jianqiao Zhao and Duo Zheng and Zi-Yuan Hu and Zhi Chen and Xiaohui Su and Yongfeng Huang and Shijia Huang and Dahua Lin and Michael R. Lyu and Liwei Wang},\n      year={2023},\n      eprint={2308.04813},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLaVi-Lab%2FCLEVA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLaVi-Lab%2FCLEVA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLaVi-Lab%2FCLEVA/lists"}