{"id":18109160,"url":"https://declare-lab.github.io/instruct-eval/","last_synced_at":"2025-03-29T15:30:39.420Z","repository":{"id":156051936,"uuid":"620479896","full_name":"declare-lab/instruct-eval","owner":"declare-lab","description":"This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. ","archived":false,"fork":false,"pushed_at":"2024-03-10T05:00:00.000Z","size":3719,"stargazers_count":544,"open_issues_count":24,"forks_count":45,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-03-28T19:08:11.588Z","etag":null,"topics":["instruct-tuning","llm"],"latest_commit_sha":null,"homepage":"https://declare-lab.github.io/instruct-eval/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/declare-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-28T19:06:56.000Z","updated_at":"2025-03-22T21:27:57.000Z","dependencies_parsed_at":"2024-01-14T13:39:48.940Z","dependency_job_id":"deebdbfa-4af6-4799-bf3b-c7ff859da413","html_url":"https://github.com/declare-lab/instruct-eval","commit_stats":{"total_commits":147,"total_committers":4,"mean_commits":36.75,"dds":0.5374149659863945,"last_synced_commit":"49bbd2ce7e3a5bdc4024ba62a5c9c68c107ee966"},"previous_names":["declare-lab/instruct-eval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2Finstruct-eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/decla
re-lab%2Finstruct-eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2Finstruct-eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2Finstruct-eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/declare-lab","download_url":"https://codeload.github.com/declare-lab/instruct-eval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246204422,"owners_count":20740307,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["instruct-tuning","llm"],"created_at":"2024-11-01T00:01:35.728Z","updated_at":"2025-03-29T15:30:39.394Z","avatar_url":"https://github.com/declare-lab.png","language":"Python","readme":"## :camel: 🍮 📚 InstructEval: Towards Holistic Evaluation of Instruction-Tuned Large Language Models\n\n[Paper](https://arxiv.org/abs/2306.04757) | [Model](https://huggingface.co/declare-lab/flan-alpaca-gpt4-xl) | [Leaderboard](https://declare-lab.github.io/instruct-eval/)\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/declare-lab/instruct-eval/main/docs/logo.png\" alt=\"\" width=\"300\" height=\"300\"\u003e\n\u003c/p\u003e\n\n\u003e 🔥 If you are interested in IQ testing LLMs, check out our new work: [AlgoPuzzleVQA](https://github.com/declare-lab/puzzle-reasoning)\n\n\u003e 📣 Introducing Resta: **Safety Re-alignment of Language Models**. 
[**Paper**](https://arxiv.org/abs/2402.11746) [**GitHub**](https://github.com/declare-lab/resta)\n\n\u003e 📣 **Red-Eval**, the benchmark for **Safety** Evaluation of LLMs, has been added: [Red-Eval](https://github.com/declare-lab/instruct-eval/tree/main/red-eval)\n\n\u003e 📣 Introducing **Red-Eval** to evaluate the safety of LLMs using several jailbreaking prompts. With **Red-Eval**, one can jailbreak/red-team GPT-4 with a 65.1% attack success rate, and ChatGPT can be jailbroken 73% of the time, as measured on the DangerousQA and HarmfulQA benchmarks. More details are here: [Code](https://github.com/declare-lab/red-instruct) and [Paper](https://arxiv.org/abs/2308.09662).\n\n\u003e 📣 We developed Flacuna by fine-tuning Vicuna-13B on the Flan collection. Flacuna is better than Vicuna at problem-solving. Access the model here: [https://huggingface.co/declare-lab/flacuna-13b-v1.0](https://huggingface.co/declare-lab/flacuna-13b-v1.0).\n\n\u003e 📣 The [**InstructEval**](https://declare-lab.net/instruct-eval/) benchmark and leaderboard have been released.\n\n\u003e 📣 The paper evaluating instruction-tuned LLMs on the **InstructEval** benchmark suite has been released on arXiv. Read it here: [https://arxiv.org/pdf/2306.04757.pdf](https://arxiv.org/pdf/2306.04757.pdf)\n\n\u003e 📣 We are releasing **IMPACT**, a dataset for evaluating the writing capability of LLMs in four aspects: Informative, Professional, Argumentative, and Creative. Download it from Hugging Face: [https://huggingface.co/datasets/declare-lab/InstructEvalImpact](https://huggingface.co/datasets/declare-lab/InstructEvalImpact).\n\n\u003e 📣 **FLAN-T5** is also useful in text-to-audio generation.\n
Find our work\nat [https://github.com/declare-lab/tango](https://github.com/declare-lab/tango) if you are interested.\n\nThis repository contains code to evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out\ntasks.\nWe aim to facilitate simple and convenient benchmarking across multiple tasks and models.\n\n### Why?\n\nInstruction-tuned models such as [Flan-T5](https://arxiv.org/abs/2210.11416)\nand [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) represent an exciting direction to approximate the\nperformance of large language models (LLMs) like ChatGPT at lower cost.\nHowever, it is challenging to compare the performance of different models qualitatively.\nTo evaluate how well the models generalize across a wide range of unseen and challenging tasks, we can use academic\nbenchmarks such as [MMLU](https://arxiv.org/abs/2009.03300) and [BBH](https://arxiv.org/abs/2210.09261).\nCompared to existing libraries such as [evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)\nand [HELM](https://github.com/stanford-crfm/helm), this repo enables simple and convenient evaluation for multiple\nmodels.\nNotably, we support most models from HuggingFace Transformers 🤗 (check [here](./docs/models.md) for a list of models we support):\n\n- [AutoModelForCausalLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) (\n  eg [GPT-2](https://huggingface.co/gpt2-xl), [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b)\n  , [OPT-IML](https://huggingface.co/facebook/opt-iml-max-1.3b), [BLOOMZ](https://huggingface.co/bigscience/bloomz-7b1))\n- [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSeq2SeqLM) (\n  eg [Flan-T5](https://huggingface.co/google/flan-t5-xl), [Flan-UL2](https://huggingface.co/google/flan-ul2)\n  , [TK-Instruct](https://huggingface.co/allenai/tk-instruct-3b-def))\n- 
[LlamaForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/llama#transformers.LlamaForCausalLM) (\n  eg [LLaMA](https://huggingface.co/decapoda-research/llama-7b-hf)\n  , [Alpaca](https://huggingface.co/chavinlo/alpaca-native), [Vicuna](https://huggingface.co/chavinlo/vicuna))\n- [ChatGLM](https://huggingface.co/THUDM/chatglm-6b)\n\n### Results\n\nFor detailed results, please go to our [leaderboard](https://declare-lab.net/instruct-eval/)\n\n| Model Name | Model Path                                                                                                              | Paper                                                                                                         | Size | MMLU | BBH  | DROP | HumanEval |\n|------------|-------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|------|------|------|------|-----------|\n|            | [GPT-4](https://openai.com/product/gpt-4)                                                                               | [Link](https://arxiv.org/abs/2303.08774)                                                                      | ?    | 86.4 |      | 80.9 | 67.0      |\n|            | [ChatGPT](https://openai.com/blog/chatgpt)                                                                              | [Link](https://arxiv.org/abs/2303.08774)                                                                      | ?    
| 70.0 |      | 64.1 | 48.1      |\n| seq_to_seq | [google/flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl)                                                         | [Link](https://arxiv.org/abs/2210.11416)                                                                      | 11B  | 54.5 | 43.9 |      |           |\n| seq_to_seq | [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl)                                                           | [Link](https://arxiv.org/abs/2210.11416)                                                                      | 3B   | 49.2 | 40.2 | 56.3 |           |\n| llama      | [eachadea/vicuna-13b](https://huggingface.co/eachadea/vicuna-13b)                                                       | [Link](https://vicuna.lmsys.org/)                                                                             | 13B  | 49.7 | 37.1 | 32.9 | 15.2      |\n| llama      | [decapoda-research/llama-13b-hf](https://huggingface.co/decapoda-research/llama-13b-hf)                                 | [Link](https://arxiv.org/abs/2302.13971)                                                                      | 13B  | 46.2 | 37.1 | 35.3 | 13.4      |\n| seq_to_seq | [declare-lab/flan-alpaca-gpt4-xl](https://huggingface.co/declare-lab/flan-alpaca-gpt4-xl)                               | [Link](https://github.com/declare-lab/flan-alpaca)                                                            | 3B   | 45.6 | 34.8 |      |           |\n| llama      | [TheBloke/koala-13B-HF](https://huggingface.co/TheBloke/koala-13B-HF)                                                   | [Link](https://bair.berkeley.edu/blog/2023/04/03/koala/)                                                      | 13B  | 44.6 | 34.6 | 28.3 | 11.0      |\n| llama      | [chavinlo/alpaca-native](https://huggingface.co/chavinlo/alpaca-native)                                                 | [Link](https://crfm.stanford.edu/2023/03/13/alpaca.html)                               
                       | 7B   | 41.6 | 33.3 | 26.3 | 10.3      |\n| llama      | [TheBloke/wizardLM-7B-HF](https://huggingface.co/TheBloke/wizardLM-7B-HF)                                               | [Link](https://arxiv.org/abs/2304.12244)                                                                      | 7B   | 36.4 | 32.9 |      | 15.2      |\n| chatglm    | [THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)                                                             | [Link](https://arxiv.org/abs/2210.02414)                                                                      | 6B   | 36.1 | 31.3 | 44.2 | 3.1       |\n| llama      | [decapoda-research/llama-7b-hf](https://huggingface.co/decapoda-research/llama-7b-hf)                                   | [Link](https://arxiv.org/abs/2302.13971)                                                                      | 7B   | 35.2 | 30.9 | 27.6 | 10.3      |\n| llama      | [wombat-7b-gpt4-delta](https://huggingface.co/GanjinZero/wombat-7b-gpt4-delta)                                          | [Link](https://arxiv.org/abs/2304.05302)                                                                      | 7B   | 33.0 | 32.4 |      | 7.9       |\n| seq_to_seq | [bigscience/mt0-xl](https://huggingface.co/bigscience/mt0-xl)                                                           | [Link](https://arxiv.org/abs/2210.11416)                                                                      | 3B   | 30.4 |      |      |           |\n| causal     | [facebook/opt-iml-max-1.3b](https://huggingface.co/facebook/opt-iml-max-1.3b)                                           | [Link](https://arxiv.org/abs/2212.12017)                                                                      | 1B   | 27.5 |      |      | 1.8       |\n| causal     | [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5) | [Link](https://github.com/LAION-AI/Open-Assistant)       
                                                     | 12B  | 27.0 | 30.0 |      | 9.1       |\n| causal     | [stabilityai/stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b)                         | [Link](https://github.com/Stability-AI/StableLM)                                                              | 7B   | 26.2 |      |      | 1.8       |\n| causal     | [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)                                               | [Link](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) | 12B  | 25.7 |      |      | 7.9       |\n| causal     | [Salesforce/codegen-6B-mono](https://huggingface.co/Salesforce/codegen-6B-mono)                                         | [Link](https://arxiv.org/abs/2203.13474)                                                                      | 6B   |      |      |      | 27.4      |\n  | causal     | [togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1](https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1) | [Link](https://github.com/togethercomputer/RedPajama-Data)                                                             | 7B   | 38.1 | 31.3 | 24.7 | 5.5      |\n\n### Example Usage\n\nEvaluate on [Massive Multitask Language Understanding](https://huggingface.co/datasets/lukaemon/mmlu) (MMLU) which\nincludes exam questions from 57 tasks such as mathematics, history, law, and medicine.\nWe use 5-shot direct prompting and measure the exact-match score.\n\n```\npython main.py mmlu --model_name llama --model_path chavinlo/alpaca-native\n# 0.4163936761145136\n\npython main.py mmlu --model_name seq_to_seq --model_path google/flan-t5-xl \n# 0.49252243270189433\n```\n\nEvaluate on [Big Bench Hard](https://huggingface.co/datasets/lukaemon/bbh) (BBH) which includes 23 challenging tasks for\nwhich PaLM (540B) performs below an average human rater.\nWe use 3-shot direct prompting and 
measure the exact-match score.\n\n```\npython main.py bbh --model_name llama --model_path TheBloke/koala-13B-HF --load_8bit\n# 0.3468942926723247\n```\n\nEvaluate on [DROP](https://huggingface.co/datasets/drop), a reading-comprehension benchmark that requires discrete numerical reasoning over paragraphs.\nWe use 3-shot direct prompting and measure the exact-match score.\n\n```\npython main.py drop --model_name seq_to_seq --model_path google/flan-t5-xl\n# 0.5632458233890215\n```\n\nEvaluate on [HumanEval](https://huggingface.co/datasets/openai_humaneval), which includes 164 Python coding problems.\nWe use 0-shot direct prompting and measure the pass@1 score.\n\n```\npython main.py humaneval --model_name llama --model_path eachadea/vicuna-13b --n_sample 1 --load_8bit\n# {'pass@1': 0.1524390243902439}\n```\n\n### Setup\n\nInstall dependencies and download data.\n\n```\nconda create -n instruct-eval python=3.8 -y\nconda activate instruct-eval\npip install -r requirements.txt\nmkdir -p data\nwget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar\ntar -xf data/mmlu.tar -C data \u0026\u0026 mv data/data data/mmlu\n```\n\n","funding_links":[],"categories":["Model Ranking","LLMs and ChatGPT"],"sub_categories":["Text"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/declare-lab.github.io%2Finstruct-eval%2F","html_url":"https://awesome.ecosyste.ms/projects/declare-lab.github.io%2Finstruct-eval%2F","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/declare-lab.github.io%2Finstruct-eval%2F/lists"}