{"id":19085981,"url":"https://github.com/openlmlab/leval","last_synced_at":"2025-04-05T11:12:35.529Z","repository":{"id":180688538,"uuid":"665530183","full_name":"OpenLMLab/LEval","owner":"OpenLMLab","description":"[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark","archived":false,"fork":false,"pushed_at":"2024-07-09T08:32:43.000Z","size":14625,"stargazers_count":373,"open_issues_count":3,"forks_count":14,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-29T10:09:00.335Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenLMLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"citation.bib","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-12T12:08:35.000Z","updated_at":"2025-03-28T14:48:21.000Z","dependencies_parsed_at":"2023-09-26T16:41:05.737Z","dependency_job_id":"9ab42299-5027-4ab6-9b75-8b1dd872cfb4","html_url":"https://github.com/OpenLMLab/LEval","commit_stats":null,"previous_names":["openlmlab/leval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenLMLab%2FLEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenLMLab%2FLEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenLMLab%2FLEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenLMLab%2FLEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/OpenLMLab","download_url":"https://codeload.github.com/OpenLMLab/LEval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247325693,"owners_count":20920714,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T02:58:00.632Z","updated_at":"2025-04-05T11:12:35.502Z","avatar_url":"https://github.com/OpenLMLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figs/logo.png\" border=\"0\" width=450px/\u003e\n\u003c/div\u003e\n\n------\n### *L-Eval: Instituting Standardized Evaluation for Long Context Language Models*\n\n**Data Collection:**   L-Eval ([preview on 🤗 HuggingFace Datasets](https://huggingface.co/datasets/L4NLP/LEval) • [check our 📃 paper](https://arxiv.org/abs/2307.11088) ) is a comprehensive Long Context Language Models (LCLMs) evaluation suite with 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input length (3k～200k tokens).\nL-Eval has 2 groups: *closed-ended* tasks and *open-ended* tasks. The closed-ended group primarily tests the reasoning and understanding ability regarding a longer context, and the open-ended group consists of more summarization tasks that require aggregation of long document information ([download the data](#use)).\n\n**Long Context LLMs Evaluation:**  Closed-ended tasks typically do not present issues with evaluation fairness. 
However, in real-world long-context tasks, open-ended tasks tend to be more common. We have found that *n-gram* metrics such as ROUGE and F1 cannot accurately reflect the abilities of LCLMs. As such, L-Eval does not rely solely on metrics used in previous text generation benchmarks. Instead, L-Eval primarily utilizes Length-Instruction-Enhanced (LIE) evaluation and LLM judges (battling with Turbo-16k or Llama2). Please refer to [open-ended tasks evaluation](#eval).\n\nWe hope L-Eval can help researchers and developers track the progress of long-context language models (LCLMs) and understand the strengths/shortcomings of different methods. We will also keep up with the **latest releases** of instruction-following LCLMs.\n\n#### Other features of this repo:\n- 🧭️ [Handle CUDA OOM with memory-efficient inference](#inference)\n- 🖇️ [Build a retrieval-based baseline with Langchain](#tool)  \n- ✏️ [Flask web client for editing local jsonl files](#tool)\n- 🔖 [View the Leaderboard](https://l-eval.github.io) \n- 📨 [How to submit your results](#submit)  \n- [Previous long sequence datasets used in L-Eval](#ack)  \n\n#### Long context abilities of LLMs on closed/open-ended tasks:\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figs/overall.png\" border=\"0\" width=850px/\u003e\n\u003c/div\u003e\n\n## 🔥 Updates of L-Eval \n- **[2024-4-25]** We added the results for Llama3 8b/70b.\n\n| Model | TOEFL | QuALITY | Coursera | SFiction | GSM | CodeU |\n|--------|------|------|-------|-------|-------|-------|\n| [Llama3-8b-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)  | 82.89 | 64.85 |  53.77 | 69.53 |  79.00 | 2.22|\n| [Llama3-70b-Instruct](https://huggingface.co/meta-llama/Meta-Llama-70B-Instruct) | 84.75 |80.19 | 75.87 | 72.65 |  90.00 |  6.67 | \n| GPT4-32k (2023) | 84.38 |82.17 | 75.58 | 74.99 |  96.00 |  25.55 | \n\n- **[2023-10-7]** The final version of our paper can be found [here](https://arxiv.org/abs/2307.11088).\n- **[2023-8-30]** We have 
annotated two new closed-ended tasks:  (i) A [scientific fiction](https://github.com/OpenLMLab/LEval/blob/main/LEval-data/Closed-ended-tasks/sci_fi.jsonl) dataset to test the loyalty to input and (ii) a [code understanding](https://github.com/OpenLMLab/LEval/blob/main/LEval-data/Closed-ended-tasks/codeU.jsonl) dataset. 📢 **L-Eval** has been supported by [OpenCompass](https://github.com/internLM/OpenCompass/). You can  test L-Eval together with other benchmarks for foundation models here.\n\n## Folders\nThe repository is structured as follows:\n\n```bash\n├── Baselines/ # scripts to generate the prediction files with baseline models\n├── Baselines-light/ # scripts to generate the prediction files with 24G gpus\n├── Evaluation/ # evaluation scripts\n├── LEval-data/ # test samples\n│   ├── Closed-ended-tasks/ # exact match tasks (like multiple-choice)\n│   │   ├── test_file.jsonl \n│   │   └── ...\n│   ├── Open-ended-tasks/ # generation tasks\n│   │   ├── test_file.jsonl\n│   │   └── ...\n├── Predictions/ # output of models\n│   ├── exam_eval/turbo-16k-0613\n│   │              ├── \u003ctask_name\u003e.pred.jsonl\n│   │              └── ... \n│   ├── llm_gpt4_eval  \n│   │             ├──\u003cmodel_name\u003e.pred.jsonl\n│   ├── ngram_eval  \n│   │             ├──model_name\n│   │                     └──task_name.pred.jsonl\n│   ├── ...\n└── Tools/ # useful scripts\n\n```\n\u003ca name=\"use\"\u003e\u003c/a\u003e\n## Quick use\n#### Step 1. 
Download the data \nEach of the 20 test sets can be loaded in one line with Hugging Face `datasets`. An example script:\n```python\nfrom datasets import load_dataset, disable_caching\n\ndatasets = [\"coursera\", \"gsm100\", \"quality\", \"topic_retrieval_longchat\", \"tpo\", \"codeU\", \"sci_fi\" ,\"financial_qa\", \"gov_report_summ\", \"legal_contract_qa\", \"meeting_summ\", \"multidoc_qa\", \"narrative_qa\", \"natural_question\", \"news_summ\", \"paper_assistant\", \"patent_summ\", \"review_summ\", \"scientific_qa\", \"tv_show_summ\"]\n# The corresponding names in the paper:\n# [\"coursera\", \"GSM(16-shot)\", \"QuALITY\", \"TopicRet\", \"TOEFL\", \"codeU\", \"SFiction\", \"LongFQA\", \"GovReport\", \"CUAD\", \"QMSum\", \"MultiDoc2Dial\", \"NarrativeQA\", \"NQ\", \"Multi-news\", \"Openreview\", \"BigPatent\", \"SPACE\", \"Qasper\", \"SummScreen\"]\n\nfor testset in datasets:\n    # disable_caching()  uncomment this if you cannot download codeU and sci_fi \n    data = load_dataset('L4NLP/LEval', testset, split='test')\n    # evaluate your model\n```\n\nYou can also directly clone this repo:\n```\ngit clone https://github.com/OpenLMLab/LEval.git\n```\nThe test data is in [LEval-data](https://github.com/OpenLMLab/LEval/tree/main/LEval-data).\n\nEach long document has multiple queries and corresponding responses. The format of each sample is as follows:\n```json\n{\n    \"instructions\": [\"What is the main goal of data science?\\nA. Analyze and predict future trends\\nB. Generate massive amounts of data\\nC. Answer questions using data\\nD. 
Increase the use of technology\", \"...\"], // a list of instructions (questions the LLM must answer)\n    \"outputs\": [\"C\",\"A\", \"...\"], // the ground truth or reference of corresponding instructions\n    \"input\": \"A very long document\", // LLMs need to respond to instructions based on this long document.\n    \"source\": \"domain the document belongs to\", // meeting, narrative_qa, etc.\n    \"evaluation\": \"Metrics used for evaluation\" // e.g., exam, human, LLM, ROUGE, F1, etc.\n}\n```\n\n#### Step 2. Generate your prediction results (Closed-ended tasks)\n**Examples of closed-ended tasks**\n  - Multiple Choice Question (single correct option). Example predicted answer: `A, BCD`\n  - Math Word Problems. Example predicted answer: `3`\n\nWe tested all the baselines on a single 80G A800 GPU. If you run out of memory (OOM), please refer to [multiple GPUs inference](#inference). To generate the output files, you need to add a new file to the `Baselines` folder and then replace the model name with your own model. An example of testing `gpt3.5-turbo-16k` on closed-ended tasks:\n```\npython Baselines/turbo16k-test.py  --metric exam_eval (for closed-ended group)  --task_name quality [Optional, if you only want to test one task]\n```\nThe script will save the prediction results to a local file. You need to press enter to confirm the path. Details about open-ended tasks can be found in the [next section](#eval).\n\n#### Step 3. Evaluate the prediction file\nGiven the prediction file generated in Step 2, please run the following command to calculate the metric:\n```\npython Evaluation/auto_eval.py --pred_file Predictions/exam_eval/turbo-16k-0613/quality.pred.jsonl \n```\n\n\n## Evaluating LCLMs on open-ended tasks\nThis section describes how to evaluate LCLMs on open-ended tasks.\n\n\u003ca name=\"eval\"\u003e\u003c/a\u003e\n#### Examples of open-ended tasks \n- Summarization. 
Example predicted answer: `This paper proposes a new method for ...`\n- Abstractive question answering. Example predicted answer: `The main goal of data science is to answer questions using data.`\n\nGenerate prediction results on open-ended tasks:\n```\nCMD: python Baselines/turbo16k-test.py --metric ngram_eval (for open-ended group)  --task_name narrative_qa [Optional, if you only want to test one task]\n```\nGenerate prediction results on the **96-question** subset (GPT-4 evaluation subset):\n```\nCMD: python Baselines/turbo16k-test.py --metric llm_gpt4_eval\n```\nGenerate prediction results on the **85-question** subset (human evaluation subset):\n```\nCMD: python Baselines/turbo16k-test.py --metric human_eval \n```\nGenerate prediction results on the 2 subsets (181 questions):\n```\nCMD: python Baselines/turbo16k-test.py --metric llm_turbo_eval \n```\n\n\n#### Automatic Metrics\nWe use the following automatic metrics to evaluate the performance of generation tasks:\n- **GPT-4/3.5** Evaluation. We suggest using GPT-4 as a judge and battling with `turbo-16k-0613`. We report the win-rate in our paper. Turbo-16k serves as a strong baseline, and you could also opt for `Llama2-4k` to directly demonstrate the extent of your improvements.\n```\npython Evaluation/llm_eval.py --pred_file Predictions/ngram_eval/vicuna-13b-16k/narrative_qa.pred.jsonl --judge_model gpt-4 (or gpt-3.5-turbo) --battle_with Predictions/ngram_eval/turbo-16k-0613 (or llama2-13b-chat)/narrative_qa.pred.jsonl\n```\nPlease add the following judgment prompt in long-context settings:\n\u003e Additional details or information that are not mentioned in the reference answer cannot be considered as advantages and do not let them sway your judgment.\n\n- **N-gram Match** Evaluation (biased). Traditional automatic metrics such as F1 and ROUGE are cheap and efficient to calculate. However, they are biased towards the length of the predicted answer. 
\n```\npython Evaluation/auto_eval.py --pred_file Predictions/ngram_eval/vicuna-13b-16k/narrative_qa.pred.jsonl\n```\n#### ❗ Length-Instruction-Enhanced Evaluation\nFor open-ended tasks, models are informed of the ground-truth length via a length instruction, e.g., *We need a 20 words summary*, where 20 is the length of the reference answer, to reduce the length bias in automatic metrics. The figure below shows the improvement in Kendall-Tau correlation with human assessment brought by length-instruction-enhanced evaluation.\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figs/kt_cor.png\" border=\"0\" width=450px/\u003e\n\u003c/div\u003e\n\n#### Human evaluation\nWe provide an easy-to-use Flask web app running on `localhost 127.0.0.1:5000`. You need to copy your prediction file `\u003cmodel_name\u003e.pred.jsonl` (samples with `evaluation: human`) to the `Predictions/human_eval` folder and then run:\n```\npython Evaluation/web_human_eval.py  --mode begin (or continue)\n```\nwhere `--mode` denotes whether you are starting a new evaluation or continuing your previous annotation. Feel free to close the browser and set `--mode continue` to continue from your last annotation. After starting the script, provide the annotator name; your annotation results will be saved to `Predictions/human_eval/annotation_from_\u003cname\u003e.jsonl`.\nSee the running screenshot [here](#human_demo). We have provided the prediction files from 5 popular models as baselines for human evaluation. 
If you want to add outputs from other baselines, you can also move the corresponding prediction file to the `Predictions/human_eval` folder.\n\n### Statistics of the data:\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figs/data.png\" border=\"0\" width=650px/\u003e\n\u003c/div\u003e\n\n\n\u003ca name=\"submit\"\u003e\u003c/a\u003e\n## How to Submit\nThe [leaderboard](https://l-eval.github.io) contains 5 parts: `Exact Match, GPT-4 evaluator, GPT-3.5 Evaluator, F1, ROUGE`.\n\nTo submit your results on our leaderboard, you can send an email to `levalbenchmark@gmail.com`. \n#### Your submission should include 4 things:\n\n* Metadata: Model name, number of parameters, and links to your paper/blog/GitHub/demo.\n* Output files: Please submit 1 folder named with your model (e.g., `Predictions/turbo-16k-0613` ) for ngram matching evaluation and a jsonl file, e.g., `Predictions/LLM_Eval/claude100k.pred.jsonl` (the file naming format is `model_name.pred.jsonl`) for LLM evaluation, as described in [Evaluation scripts section](#eval).\n* Results: Please submit the results produced by our evaluation scripts. Results should contain all keys in the  [leaderboard](https://l-eval.github.io).\n* Judgements from turbo3.5 and gpt4 (The output file produced by `llm_eval.py`)\n\nWe will randomly verify some results with the submitted output files.\n\n#### Explanation of keys in the leaderboard\n\n1. Keys in [Exact Match](https://l-eval.github.io)\n   - `Avg`: the average performance score over 4 datasets.\n   - `Max-Ctx`: the maximum context length of your model.\n   - `Tokens`: the number of input tokens in experiments.\n   - `Ret.`: whether retrieval is used.\n   - `PE`: whether prompt engineering is used (e.g., modifying the original prompt to improve the performance, providing in-context examples).\n   - `IDD`: whether in-domain data (e.g., data from the qmsum or narrative_qa training sets) is used for further finetuning. **Please don't hack this evaluation set**. 
But considering most of the sources are open, if your dataset potentially contains some in-domain data, you don't need to remove them. In that case, please set this value to 'yes'. If the construction of the IFT data is not transparent, you can leave it blank.\n2. Keys in [F1_and ROUGE](https://l-eval.github.io) \n   - `F1 avg`: the average over each dataset’s overall F1 score on QA-style tasks.\n   - `ROUGE avg`: the average over each dataset’s overall ROUGE-L score on Summarization-style tasks.\n   - `Length`: the average length of the generated outputs.\n3. Keys in [GPT-4/3.5 Evaluator](https://l-eval.github.io)\n    - `n_wins`: number of wins including results of swapping the position of two answers.\n    - `n_draws`: number of draws including results of swapping the position of two answers.\n    - `win % vs turbo16k`: the win rate of your model in the battle with `turbo-16k-0613`.\n    - `Length`: the average length of the generated outputs.\n\n\u003ca name=\"inference\"\u003e\u003c/a\u003e\n## Memory-efficient inference and multiple GPUs inference\n### Using Flash Attention during inference 🚀\nPlease first try [Flash Attention](https://github.com/Dao-AILab/flash-attention) if you have an **80G** GPU. Based on our experiments, it works well when the sequence length is less than 32k (Flash-attn v2). If you still encounter OOM, please refer to the next section.\nIf you are using LLaMA, we support FlashAttention in inference, which can save GPU memory; please add the param `--flash`. The code is similar for other models.\nFlash attention for Chatglm is implemented with torch2.0. Please ensure that you have successfully installed it.\n\nIf you encounter installation issues, it is likely due to a mismatch between the CUDA and Torch versions. 
Here is my running environment:\n```\npython\u003e=3.8\ntorch==1.13.1+cu117\nCUDA Driver Version: 525.105.17   CUDA Toolkit: 11.7\ngit clone https://github.com/Dao-AILab/flash-attention.git\ncd flash-attention/\n[if flashAttn-v1] git checkout tags/v1.0.0 \npython setup.py install\n```\n\n```\npython Baselines/longchat-test.py --task_path LEval-data/Open-ended-tasks/narrative.jsonl --max_length 16k --gpu 0 --metric ngram_eval --flash \n```\n\n### Memory-efficient inference with [LightLLM](https://github.com/ModelTC/lightllm) 🚂\n\nLightLLM makes it possible to run inference on one or more 24G GPUs by optimizing the storage of the KV cache, at the cost of inference speed.\n\n#### Installation\n1. Download L-Eval and the [data](https://github.com/OpenLMLab/LEval#step-1-download-the-data).\n2. Install LightLLM according to the [official instructions](https://github.com/ModelTC/lightllm#get-started).\n\n#### Examples of running L-Eval with LightLLM\n\u003e You must first download the model you would like to evaluate. LightLLM does not support automatic downloads yet.\n\n\u003e Code for running L-Eval with LightLLM is located in the `Baselines-light` directory.\n\nThe following command evaluates vicuna-7b-v1.5-16k on 4 RTX 3090 GPUs.\n```bash\npython Baselines-lightllm/vicuna-test.py --metric exam_eval --max_length 16k --model_path /.../.../vicuna-7b-v1.5-16k/ --lightllm_extra_args \"--tp 4 --max_total_token_num 130000 --trust_remote_code --max_req_total_len 16384 --max_req_input_len 15900\"\n```\n\n\u003e You don't actually need 4 GPUs to run this example. But performance will improve with more GPUs.\n\n`--lightllm_extra_args` are extra arguments passed to the LightLLM server. View the [LightLLM documentation](https://github.com/ModelTC/lightllm/blob/main/docs/ApiServerArgs.md) for more information on how to set these arguments. 
`model_dir` is automatically passed and does not need to be specified again.\n\nThe script assumes the LightLLM server is listening on port `8000`.\n\n#### Known Issues\nThe LightLLM server process might not properly terminate after the evaluation script stops. If you don't have other Python processes running, you can run `killall -HUP python` to terminate the LightLLM server.\n\n\n## Other Tools\n\u003ca name=\"tool\"\u003e\u003c/a\u003e\n### Using Langchain to build retrieval-based baselines\nYou can use the script `turbo4k-retrieve-test.py` in `Baselines` to enhance a regular LLM with a sparse or dense retriever. An example is as follows:\n```\npython Baselines/turbo4k-retrieve-test.py --metric exam_eval (or ngram_eval, human_eval, llm_turbo_eval, llm_gpt4_eval) --retriever BM25 (or AdaEmbedding)\n```\nThe retrieval-based method is implemented with [langchain](https://github.com/hwchase17/langchain). If you want to use the BM25 retriever, please first install [Elasticsearch](https://github.com/elastic/elasticsearch). If you want to try ada embedding (cheap but effective), please fill in your API key.\n \n\n### A flask-based annotation website for jsonl files\nWe have also released an easy-to-use annotation website for L-Eval. Make sure you have installed Flask.\nFirst, preprocess your files into a jsonl format that contains 3 keys: `input:str`, `instructions:list`, and `outputs:list` (see the examples in the `LEval-data` folder).\nTo annotate new instruction-output pairs, run the script below to view and annotate the local jsonl file. It starts the website on `127.0.0.1:5000`:\n```\npython Tools/web_annotate_jsonl.py --path LEval-data/Generation/meeting_summ.jsonl --mode begin --new_pairs_num 2\n```\nwhere `--new_pairs_num` means the number of new QA pairs you want to add and `--mode` (begin or continue) means whether you want to continue from previous annotation results. 
\nThe input file denoted by `--path` should be a `jsonl` file like the examples in the `LEval-data` folder. In this case, we annotate two new QA pairs based on the long input. After clicking `submit`, the results will be saved to the disk.\n\n#### Example of our annotation website\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figs/annotation.png\" border=\"0\" width=660px/\u003e\n\u003c/div\u003e\n\n\u003ca name=\"human_demo\"\u003e\u003c/a\u003e\n#### Example of the human evaluation website\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figs/human_eval.png\" border=\"0\" width=660px/\u003e\n\u003c/div\u003e\nYou can score the outputs from different models via the website. After completing the annotation, the result page looks like:\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figs/res_page.png\" border=\"0\"/\u003e\n\u003c/div\u003e\n\n## Acknowledgement\n\u003ca name=\"ack\"\u003e\u003c/a\u003e\nThis work was done by Fudan University and The University of Hong Kong.\nPrimary contributors: Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu.\n\nWe would like to express our gratitude towards Siyu Ren, Zhiyong Wu, Qinyuan Cheng, Bo Wang, and Yukang Chen for their valuable suggestions and insights!\n\n**We sincerely appreciate the assistance provided by the following works for L-Eval**:\n- We download the videos to form the long documents from [Coursera website](https://www.coursera.org/)\n- We extract 100 math problems from [GSM8k](https://github.com/openai/grade-school-math) and we use 8 long examples from [chain-of-thought-hub](https://github.com/FranxYao/chain-of-thought-hub/blob/main/gsm8k/lib_prompt/prompt_hardest.txt)\n- Topic retrieval data is collected from [LongChat](https://github.com/DachengLi1/LongChat)\n- QuALITY is from [their official github](https://github.com/nyu-mll/quality)\n- TOEFL Practice Online data comes from [TOEFL-QA](https://github.com/iamyuanchung/TOEFL-QA/tree/master) \nOther 
open-sourced datasets are collected from: [gov_report](https://gov-report-data.github.io/), [cuad](https://github.com/TheAtticusProject/cuad), [qmsum](https://github.com/Yale-LILY/QMSum), [Multidoc2dial](https://doc2dial.github.io/multidoc2dial),\n [narrativeQA](https://github.com/deepmind/narrativeqa), [Natural Questions](https://github.com/google-research-datasets/natural-questions), [review advisor](https://github.com/neulab/ReviewAdvisor), [multi-news](https://github.com/Alex-Fabbri/Multi-News),\n[bigpatent](https://evasharma.github.io/bigpatent/), [SPACE](https://github.com/stangelid/qt), [Qasper](https://github.com/allenai/qasper-led-baseline), [SummScreen](https://github.com/mingdachen/SummScreen).\n\nPlease kindly cite the [original papers](https://github.com/OpenLMLab/LEval/blob/main/citation.bib) when using L-Eval.\nThanks again for their effort!!  \n\nWe are happy to answer any questions about L-Eval: `cxan20@fudan.edu.cn`\n\n## Citation\n```\n@misc{an2023leval,\n      title={L-Eval: Instituting Standardized Evaluation for Long Context Language Models}, \n      author={Chenxin An and Shansan Gong and Ming Zhong and Mukai Li and Jun Zhang and Lingpeng Kong and Xipeng Qiu},\n      year={2023},\n      eprint={2307.11088},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenlmlab%2Fleval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenlmlab%2Fleval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenlmlab%2Fleval/lists"}