https://github.com/FranxYao/chain-of-thought-hub

Benchmarking large language models' complex reasoning ability with chain-of-thought prompting

# Chain-of-Thought Hub: Measuring LLMs' Reasoning Performance

![Title](resources/title.png)
"A fantasy graph illustrating a chain of stars in a dark night with blue sky, digital art, super resolution". Midjourney V5

----

By [Yao Fu](https://franxyao.github.io/), [Litu Ou](https://github.com/Leonard907), [Mingyu Chen](https://github.com/Spehhhhh), [Yuhao Wan](https://github.com/Yuhao-Wan), [Hao Peng](https://haopeng-nlp.github.io/), [Tushar Khot](https://allenai.org/team/tushark), [Wenhu Chen](https://wenhuchen.github.io/)

From University of Edinburgh, University of Washington, Allen Institute for AI, University of Waterloo

[[paper](https://arxiv.org/abs/2305.17306)] [[blog](https://yaofu.notion.site/Towards-Complex-Reasoning-the-Polaris-of-Large-Language-Models-c2b4a51355b44764975f88e6a42d4e75)] [[twitter](https://twitter.com/Francis_YAO_/status/1663472109299937280)]

Recently, there has been a lot of progress in LLMs. Many claim that a small model with fewer than 10B parameters can achieve performance comparable to GPT-3.5. Really?

> In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when **\*the complexity of the task reaches a sufficient threshold\*** -- GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5. -- *GPT-4 release blog*

The key differentiator is whether a model can do **complex tasks**, like the old saying: "chit-chat is cheap, show me the reasoning." This is why we compile a list of complex reasoning tasks including math (GSM8K), science (MATH, TheoremQA), symbolic (BBH), knowledge (MMLU, C-Eval), coding (HumanEval), factual (SummEdits), and long-context (RepoBench, Qspr, QALT, BkSS) to measure the models' performance on challenging tasks.

More importantly, we envisage large language models becoming the next-generation computational platform and fostering an ecosystem of new LLM-based applications.
When this happens, chain-of-thought prompt engineering will be the next-generation system calls and shell scripts.

The credibility of Chain-of-Thought Hub comes from its carefully and meticulously selected datasets and models, which can clearly help the development of LLMs. The results and scripts from Chain-of-Thought Hub are used and referenced by leading industrial and academic organizations in the space of large language models. We divide the tasks into three categories: main, experimental, and long-context.
* Main: datasets that are stable and consistently referenced by places where LLMs are built.
* Experimental: datasets that have the potential to test future LLM capabilities.
* Long-context: datasets that require reasoning over very long context, an important direction for future LLMs.

[List of datasets we consider]

| Section | Dataset | Description |
| ------- | ------- | ----------- |
| Main | GSM8K | Grade-school-level math word problems |
| Main | MATH | Competition-level math and science problems |
| Main | MMLU | Multi-discipline knowledge |
| Main | BBH | Challenging language and symbolic reasoning |
| Main | HumanEval | Python coding |
| Main | C-Eval | Chinese multi-discipline knowledge |
| Experimental | TheoremQA | Theorem proving |
| Experimental | SummEdits | Factual reasoning |
| Long Ctx | Qspr | Question answering over research papers |
| Long Ctx | QALT | Multiple-choice questions over long articles and stories |
| Long Ctx | BkSS | Reordering of summaries of parts of novels |

**[Call for contribution]**: we would love to invite community members to:
* Send a PR to fill in a missing number in the table
* Raise an issue to suggest / brainstorm a new task / benchmark that measures **reasoning over very long context**
* Raise an issue to suggest / brainstorm a new task / benchmark that measures **complex API calls and tool usage**
* Raise an issue to suggest other good tasks / benchmarks that can clearly differentiate models' performance
* Raise an issue to suggest a new model that can be added to the table

**[UPDATE 20231210]**:
* Add [Gemini](https://deepmind.google/technologies/gemini/#introduction), [Yi-34B](https://github.com/01-ai/Yi), [DeepSeek 67B](https://github.com/deepseek-ai/DeepSeek-LLM)
* Update long-context -- we will have more updates on this section
* Preview of Mistral 7Bx8E MoE model results

Mistral 7Bx8E looks approximately comparable with Yi-34B / LLaMA2 70B / DeepSeek 67B

| Benchmark | Mistral 7B Dense | Mistral 7Bx8E=50B | Yi-34B | DeepSeek-67B | LLaMA2 70B |
|------------|------------------|-------------------|--------|--------------|------------|
| Arc-c | 59.98 | 66.38 | 64.59 | 65.44 | - |
| HellaSwag | 83.31 | 86.61 | 85.69 | 87.10 | - |
| MMLU | 64.16 | 71.73 | 76.35 | 71.78 | 68.9 |
| TruthfulQA | 42.15 | 48.55 | 56.23 | 51.08 | 50.18 |
| Winogrande | 78.37 | 82.40 | 83.03 | 84.14 | - |
| GSM8K | 37.83 | 57.09 | 50.64 | 56.71 | 56.8 |

**[UPDATE 20230620]**:
* Separate main (datasets that are stable and consistently referenced by places where LLMs are built) and experimental (datasets that have the potential to test future LLM capabilities) leaderboards.
* Add long-context section (experimental)

[Previous updates]

**[UPDATE 20230609]**: Add [evaluation scripts](MMLU/readme.md) on MMLU for LLaMA and Falcon

**[UPDATE 20230601]**: Add SummEdits

**[UPDATE 20230527]**: Add TheoremQA, add Vicuna, Alpaca, InstructCodeT5.

## Leaderboard - Main

| Model | Param. | Type | GSM8K | MATH | MMLU | BBH | HumanEval | C-Eval |
| ---- | --------- | ---- | ----- | ---- | ---- | --- | --------- | ----- |
| Gemini Ultra | ? | Base | - | 53.2 | 83.7 | 83.6 | 74.4 | - |
| gpt-4 | ? | RLHF | 92.0 | 42.5 | 86.4 | - | 67.0 | 68.7* |
| claude-2 | ? | RLHF | 88 | - | 78.5 | - | 71.2 | - |
| Gemini Pro | ? | Base | - | 32.6 | 71.8 | 75.0 | 67.7 | - |
| claude-v1.3 | ? | RLHF | 81.8* | - | 75.6*| 67.3* | - | 54.2* |
| PaLM-2-Unicorn | ? | Base | 80.7 | 34.3 | 78.3 | 78.1 | - | - |
| Mistral MoE | 7Bx8E=46B | Base | 57.9 | - | 71.3 | - | - | - |
| DeepSeek | 67B | Base | 56.7 | 18.7 | 71.7 | 68.7 | 42.7 | 66.1 |
| Yi | 34B | Base | 50.6 | - | 76.3 | 54.3 | - | 81.4 |
| gpt-3.5-turbo | ? | RLHF | 74.9* | - | 67.3*| 70.1* | 48.1 | 54.4* |
| claude-instant | ? | RLHF | 70.8* | - | 61.3*| 66.9* | - | 45.9* |
| text-davinci-003 | ? | RLHF | - | - | 64.6 | 70.7 | - | - |
| code-davinci-002 | ? | Base | 66.6 | 19.1 | 64.5 | 73.7 | 47.0 | - |
| text-davinci-002 | ? | SIFT | 55.4 | - | 60.0 | 67.2 | - | - |
| Minerva | 540B | SIFT | 58.8 | 33.6 | - | - | - | - |
| Flan-PaLM | 540B | SIFT | - | - | 70.9 | 66.3 | - | - |
| Flan-U-PaLM | 540B | SIFT | - | - | 69.8 | 64.9 | - | - |
| PaLM | 540B | Base | 56.9 | 8.8 | 62.9 | 62.0 | 26.2 | - |
| LLaMA-2 | 70B | Base | 56.8 | - | 68.9 | 51.2 | 29.9 | - |
| LLaMA | 65B | Base | 50.9 | 10.6 | 63.4 | - | 23.7 | 38.8* |
| PaLM | 64B | Base | 52.4 | 4.4 | 49.0 | 42.3 | - | - |
| Falcon | 40B | Base | - | - | 49.0*| - | - | - |
| Vicuna | 33B | SIFT | - | - | 59.2 | - | - | - |
| LLaMA | 33B | Base | 35.6 | 7.1 | 57.8 | - | 21.7 | - |
| InstructCodeT5+ | 16B | SIFT | - | - | - | - | 35.0 | - |
| StarCoder | 15B | Base | 8.4 | 15.1 | 33.9 | - | 33.6 | - |
| Vicuna | 13B | SIFT | - | - | - | 52.1 | - | - |
| LLaMA | 13B | Base | 17.8 | 3.9 | 46.9 | - | 15.8 | - |
| Flan-T5 | 11B | SIFT | 16.1* | - | 48.6 | 41.4 | - | - |
| Alpaca | 7B | SIFT | - | - | - | - | - | - |
| LLaMA | 7B | Base | 11.0 | 2.9 | 35.1 | - | 10.5 | - |
| Flan-T5 | 3B | SIFT | 13.5* | - | 45.5 | 35.2 | - | - |

We call these datasets "main" because they are stable and widely used wherever LLMs are developed. Base means the pretrained checkpoint. SIFT means the checkpoint after supervised instruction finetuning. RLHF means the checkpoint after Reinforcement Learning from Human Feedback. Numbers marked with an asterisk * are from our own runs; the others come from multiple sources, which we explain below. All results are reported as accuracy; higher is better.
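To work with the leaderboard programmatically, the cell conventions described above (a trailing `*` for our own runs, `-` for a missing number) can be parsed as in the following minimal sketch. The names `Score`, `parse_cell`, and `parse_row` are hypothetical helpers for illustration and are not part of this repository.

```python
from typing import List, NamedTuple, Optional


class Score(NamedTuple):
    value: Optional[float]  # None when the table shows "-" or "?"
    own_run: bool           # True when the cell carries a trailing "*"


def parse_cell(cell: str) -> Score:
    """Parse one leaderboard cell such as "92.0", "81.8*", or "-"."""
    text = cell.strip()
    if text in ("-", "?", ""):
        return Score(None, False)
    own_run = text.endswith("*")
    return Score(float(text.rstrip("*")), own_run)


def parse_row(line: str) -> List[str]:
    """Split a markdown table row into stripped cell strings."""
    return [c.strip() for c in line.strip().strip("|").split("|")]


# One row copied from the main leaderboard: model, params, type, then six scores.
row = parse_row("| claude-v1.3 | ? | RLHF | 81.8* | - | 75.6*| 67.3* | - | 54.2* |")
scores = [parse_cell(c) for c in row[3:]]
print(row[0], scores)
```

Keeping the own-run flag separate from the numeric value makes it easy to filter the table down to numbers that were reproduced with the scripts in this repository.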

## Leaderboard - Experimental: Long Context

| Model | Param. | Ctx. | Type | Qspr | QALT | BkSS |
| ---- | ------ | ---- | ---- | --------- | ---- | ---- |
| Human | ? | ? | ? | 67.7 | 93.5 | ? |
| gpt-4 | ? | 8K | RLHF | 50.7 | 89.2 | 60.5 |
| claude-v1.3 | ? | 8K | RLHF | 52.3 | 84.8 | 47.4 |
| claude-v1.3 | ? | 4K | RLHF | 47.7 | 76.8 | 37.6 |
| PaLM-2-Unicorn | ? | - | Base | - | - | - |
| PaLM-2-bison | ? | - | RLHF | - | - | - |
| gpt-3.5-turbo | ? | 4K | RLHF | 49.3 | 66.6 | 49.8 |
| claude-instant | ? | - | RLHF | - | - | - |
| text-davinci-003 | ? | 4K | RLHF | 52.7 | 69.0 | 49.5 |
| text-davinci-002 | ? | - | SIFT | - | - | - |
| LLaMA | 65B | - | Base | - | - | - |
| Falcon | 40B | - | Base | - | - | - |
| Flan-UL2 | 20B | 8K | SIFT | 56.9 | 75.6 | 14.0 |
| LLaMA | 33B | - | Base | - | - | - |
| Vicuna | 13B | - | SIFT | - | - | - |
| LLaMA | 13B | - | Base | - | - | - |
| Flan-T5 | 11B | 8K | SIFT | 48.3 | 75.2 | 15.1 |
| Flan-T5 | 11B | 4K | SIFT | 46.5 | 70.8 | 16.4 |
| T0pp | 11B | 8K | SIFT | 25.0 | 21.4 | 0.0 |
| Alpaca | 7B | - | SIFT | - | - | - |
| LLaMA | 7B | - | Base | - | - | - |
| Flan-T5 | 3B | 8K | SIFT | 46.6 | 69.6 | 2.2 |

* TODO: [RepoBench](https://github.com/Leolty/repobench): benchmarking repository-level code auto-completion systems
* Qspr, QALT and BkSS numbers are from zero-scrolls
* Why do we pick these datasets? See [detailed documentation](resources/long_context.md)

## What's different from other important evaluations?
* [HELM](https://crfm.stanford.edu/helm/latest/) uses answer-only prompting, while we use chain-of-thought prompting.
* HELM evaluates everything. We focus only on complex reasoning, the key differentiator of LLMs' capability.
* [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) evaluates open-source language models. We consider most leading models.
  * Currently, the MMLU performance of LLaMA 65B on Open LLM Leaderboard is just 48.8, which is significantly lower than the 63.4 reported in the paper. This casts [doubt](https://twitter.com/karpathy/status/1662209158748442625) on the comparison between LLaMA and Falcon.
  * In our [reproduction](MMLU/readme.md), we got 61.4 using the official MMLU prompt + greedy decoding + fp16. Our results favor the original LLaMA number and cast doubt on the results of Open LLM Leaderboard.
  * Our [evaluation script](MMLU/run_mmlu_llama.py) is rather straightforward: most parameters are default, with no fancy prompt engineering. We encourage the community to try out our scripts and reproduce our results.
  * According to [Nathan Lambert](https://twitter.com/natolambert/status/1667249342456160257?s=20), HuggingFace is currently redoing the backend of Open LLM Leaderboard, and the results may change (Jun 10 2023).
* [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/) evaluates chatbot models, which is more user-oriented at deployment. Our evaluation is more developer-oriented, and we consider not only chatbots but also base models.
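For readers unfamiliar with the "official MMLU prompt + greedy decoding" setup mentioned above, here is a minimal sketch of how such a few-shot prompt is typically assembled. It assumes the layout used by the original MMLU evaluation code (a subject header, lettered options, and an `Answer:` line); the actual script in `MMLU/` may differ in details. The helper names and the toy questions are hypothetical, and no model is called.

```python
CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Format one question; leave the answer blank for the test question."""
    lines = [question]
    lines += [f"{label}. {opt}" for label, opt in zip(CHOICES, options)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, dev_examples, test_question, test_options):
    """Few-shot prompt: header, solved dev examples, then the open test question."""
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    shots = "".join(format_example(q, o, a) + "\n\n" for q, o, a in dev_examples)
    return header + shots + format_example(test_question, test_options)


def greedy_choice(option_scores):
    """Greedy decoding restricted to the answer letters: take the top score."""
    return max(option_scores, key=option_scores.get)


# Toy data for illustration only, not taken from MMLU.
dev = [("What is 1 + 1?", ["1", "2", "3", "4"], "B")]
prompt = build_prompt("toy arithmetic", dev, "What is 2 + 3?", ["4", "5", "6", "7"])
print(prompt)
print(greedy_choice({"A": 0.1, "B": 0.7, "C": 0.1, "D": 0.1}))
```

In a real run, the scores passed to `greedy_choice` would come from the model's next-token distribution after the final `Answer:`; here they are placeholder values.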

## How the models are ranked
* If we know the model scale, we rank the model by scale.
* If we do not know the model scale, we rank the model by GSM8K, the classic benchmark measuring chain-of-thought math reasoning performance.
  * This is definitely not the only metric, but a good interpretation is "how well the model can do math while maintaining other generic abilities" -- which is also very hard.
  * GPT-4 was already pretrained on the GSM8K training split; others may not have been. So GPT-4's performance on GSM8K is in-distribution generalization, while for the others it is out-of-distribution (OOD) generalization. Yet even in distribution there are differences: FlanT5 is also trained on GSM8K and still shows a performance gap.
  * Generally it is very hard to rigorously compare model performance due to multiple factors (whether the model was trained on the corresponding training split, whether it was trained on code, whether the prompt was optimized, etc.). View our results as an approximate reference.
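The two ranking rules above can be expressed as a single sort key. The sketch below is illustrative only: it assumes that models of unknown scale are listed first (as the closed models are in the leaderboard) and that a missing GSM8K number sorts last within that group. The `Entry` type and the model labels are hypothetical; the scores are copied from the main leaderboard.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    model: str
    params_b: Optional[float]  # parameter count in billions; None if unknown
    gsm8k: Optional[float]     # GSM8K accuracy; None if not reported


def sort_key(e: Entry):
    if e.params_b is None:
        # Unknown scale: group 0, higher GSM8K first, missing scores last.
        return (0, e.gsm8k is None, -(e.gsm8k or 0.0))
    # Known scale: group 1, larger models first.
    return (1, False, -e.params_b)


def rank(entries: List[Entry]) -> List[str]:
    return [e.model for e in sorted(entries, key=sort_key)]


entries = [
    Entry("LLaMA-7B", 7, 11.0),
    Entry("gpt-3.5-turbo", None, 74.9),
    Entry("PaLM-540B", 540, 56.9),
    Entry("text-davinci-003", None, None),
    Entry("gpt-4", None, 92.0),
    Entry("LLaMA-65B", 65, 50.9),
]
print(rank(entries))
```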

## Source of numbers
* GPT-4 from its [website](https://openai.com/research/gpt-4) and [Bubeck et al Mar 2023](https://arxiv.org/abs/2303.12712). Note that the version that Bubeck uses is GPT-4 Early, which is supposedly more powerful than GPT-4 Launch (OpenAI paid a lot of alignment tax to make GPT-4 safer).
* \*-davinci-00\* and \*PaLM are from the [Flan-PaLM](https://arxiv.org/abs/2210.11416) paper appendix.
* code-davinci-002 is the base model of GPT-3.5 family but unfortunately it can no longer be accessed.
* LLaMA from [LLaMA](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/) paper. ~~Note that the prompt of LLaMA used in these tasks are not released so reproduction may have varied numbers, see [this twitter thread](https://twitter.com/karpathy/status/1662209158748442625) for more discussions.~~
* ~~We are doing our own implementation of LLaMA on MMLU and BBH. Stay tuned.~~
* We have reproduced LLaMA on MMLU using the official MMLU prompts and the default HuggingFace Transformers `generate()` function, and our results match the official numbers very well. See [here](MMLU/readme.md) for more details.
* Falcon on MMLU is from our own script [here](MMLU/readme.md).
* PaLM-2 from [their tech report](https://ai.google/static/documents/palm2techreport.pdf).
* Claude is from our own test script; see below for how to run it.
* The HumanEval results for LLaMA models, PaLM and StarCoder are from the [HuggingFace report](https://huggingface.co/blog/starcoder). Code-davinci-002's performance on HumanEval is from the [CodeT5+ paper](https://arxiv.org/pdf/2305.07922.pdf)
* C-Eval is from their [website](https://cevalbenchmark.com/static/leaderboard.html)
* TheoremQA is from their [github](https://github.com/wenhuchen/TheoremQA)
* SummEdits is from their [github](https://github.com/salesforce/factualNLG/tree/master) and [paper](https://arxiv.org/abs/2305.14540)
* Long-context results are from the [zero-scrolls paper](https://arxiv.org/abs/2305.14196) and [leaderboard](https://www.zero.scrolls-benchmark.com/leaderboard)
* Vicuna performance on MMLU is from [Chatbot Arena](https://lmsys.org/blog/2023-06-22-leaderboard/)

## Current results
* GPT-4 clearly outperforms all other models on GSM8K and MMLU.
* **The 65B LLaMA is very close to text/code-davinci-002, which means that if SFT and RLHF are done correctly on top of it, it is very likely that we could reproduce ChatGPT based on the 65B LLaMA.**
* Claude is the only model family that is comparable to the GPT family.
* On GSM8K, gpt-3.5-turbo improves over text-davinci-003. This confirms OpenAI's Jan 30 2023 release notes: "improved mathematical capabilities."
* On MMLU, gpt-3.5-turbo is slightly better than text-davinci-003, but this margin is NOT SIGNIFICANT.
* Also remember that gpt-3.5-turbo is 10 times cheaper than text-davinci-003.
* Also note that GPT-4/3.5's performance on GSM8K is not true few-shot: the [GPT-4 report](https://cdn.openai.com/papers/gpt-4.pdf) states that a portion of the GSM8K training set was mixed into the training data.
* LLaMA performance on MMLU is from their paper and is probably answer-only (AO) rather than chain-of-thought (CoT). Generally on MMLU, AO is better than CoT, but only slightly. So the LLaMA numbers on MMLU might be slightly overestimated.
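The practical difference between AO and CoT scoring is that a CoT completion contains intermediate reasoning, so the final answer has to be parsed out before computing accuracy. The sketch below shows one common way to do this for GSM8K-style numeric answers. It assumes completions end with a phrase like "The answer is 18." (a frequent convention in CoT prompts); the extraction logic in this repository's scripts may differ, and the function names are hypothetical.

```python
import re

_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
_MARKER = "The answer is"


def extract_answer(completion: str):
    """Return the predicted number as a float, or None if none is found."""
    if _MARKER in completion:
        # Prefer the first number that follows the last answer marker.
        matches = _NUMBER.findall(completion.split(_MARKER)[-1])
        chosen = matches[0] if matches else None
    else:
        # Fall back to the last number anywhere in the completion.
        matches = _NUMBER.findall(completion)
        chosen = matches[-1] if matches else None
    return None if chosen is None else float(chosen.replace(",", ""))


def accuracy(completions, golds):
    """Fraction of completions whose extracted answer equals the gold number."""
    correct = sum(extract_answer(c) == g for c, g in zip(completions, golds))
    return correct / len(golds)


cot = "She has 16 - 3 - 4 = 9 eggs left. 9 * 2 = 18. The answer is 18."
print(extract_answer(cot))
```

Under AO prompting the completion is just the answer itself, so the fallback branch is usually enough; under CoT the marker-based branch avoids picking up intermediate numbers from the reasoning.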

## Visualization
![Title](resources/ranking.png)
* There is a clear gap between open-source and closed models.
* Most top models are after RLHF.
* LLaMA 65B is very close to code-davinci-002.
* Existing results strongly suggest that if RLHF is done right on LLaMA, it may be close to ChatGPT-3.5.

## More about the tasks
* [GSM8K](https://arxiv.org/abs/2201.11903): 8k elementary school math problems. -- Performance improvements on this dataset directly translate to daily math abilities when interacting with LLMs
* [MMLU](https://arxiv.org/abs/2210.11416): 15k problems under 57 subjects, high school and college knowledge
* [MATH](https://arxiv.org/abs/2206.14858) (Hard!): 12k problems within 7 categories, very hard math and natural science. All current models struggle.
* [BBH](https://arxiv.org/abs/2210.09261): 6.5k problems within 23 subsets, symbolic and text reasoning
* [HumanEval](https://github.com/openai/human-eval): a classical handwritten dataset of 164 Python problems for evaluating coding capability.
* [C-Eval](https://cevalbenchmark.com/): a collection of 13k multi-choice questions spanning 52 disciplines of knowledge test in Chinese.
* [TheoremQA](https://github.com/wenhuchen/TheoremQA) (Hard!): 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance.
* [SummEdits](https://github.com/salesforce/factualNLG): 6.3k factual consistency reasoning problems within 10 domains.

## Run

### MMLU
```bash
cd MMLU
mkdir outputs
API_KEY=
# GPT-3.5-Turbo
python run_mmlu_gpt_3.5_turbo.py --api_key=${API_KEY}
# Claude-v1.3
python run_mmlu_claude.py --api_key=${API_KEY} --engine=claude-v1.3

# LLaMA
LLAMA_CKPT_DIR=
PARAM_SIZE=65 # 7, 13, 33, 65
MODEL_TYPE=llama # ["llama", "falcon"]
python run_mmlu_open_source.py --ckpt_dir ${LLAMA_CKPT_DIR} --param_size ${PARAM_SIZE} --model_type ${MODEL_TYPE}
```

### GSM8k
```bash
cd gsm8k
mkdir outputs

# run gpt-3.5
# codex_gsm8k_complex.ipynb -- code-davinci-002 + complex prompt
# gpt3.5turbo_gsm8k_complex.ipynb -- gpt-3.5-turbo + complex prompt

# run claude
python run_gsm8k_claude.py \
  --anthropic_key=${API_KEY} \
  --prompt_file=lib_prompt/prompt_original.txt \
  --engine=claude-v1.3 \
  --output_file=outputs/gsm8k_claude_v1.3_original_test.txt

# run FlanT5
# flan_t5_11b_gsm8k.ipynb
```

### BBH
```bash
cd BBH
mkdir outputs
# then run jupyter notebook to see an example penguins dataset
cd penguins
# gpt3.5trubo_penguins_original.ipynb

# Or run the script for all datasets
API_KEY=
TASK=
python run_bbh_gpt_3.5_turbo.py --api_key=${API_KEY} --task=${TASK} # task=all by default
python run_bbh_claude_v1.3.py --api_key=${API_KEY} --model_index=claude-v1.3 --task=${TASK} # task=all by default
```

## FAQ
* The sensitivity of model performance to prompts is very high.
  * Unfortunately, this is the nature of LLMs. We are currently working to standardize the prompts (see initial progress [here](spl/markdown.md)) and will post more updates.
* What are the prompts used in the _complexity-based prompting_ paper?
  * See `research/complexity_based_prompting/`
* I want to try some open-source models.
  * See `gsm8k/flan_t5_11b_gsm8k.ipynb` for a place to start
* Some prompts contain wrong answers.
  * Yes, but we keep them as they are because they are used in the original papers.
  * Generally the model is robust to prompt perturbation: even if there are sometimes errors in the prompt, as long as the format of the prompt matches the corresponding task, the model tends to look only at the format, ignore the prompt error, and make its own prediction.
  * See https://arxiv.org/abs/2202.12837 and https://arxiv.org/abs/2212.10001 for more analysis of how the model can ignore errors in the prompt

## I want to know more about building LLMs for reasoning tasks
A detailed roadmap is discussed in [our previous blog post](https://yaofu.notion.site/Towards-Complex-Reasoning-the-Polaris-of-Large-Language-Models-c2b4a51355b44764975f88e6a42d4e75).

Generally, the recipe for building models with strong reasoning ability is the same as for generic LLMs: pretraining, finetuning, reinforcement learning. Here we list some very important papers that should be considered:

### Pretraining / Continued Training

* Lewkowycz et al. 2022. Minerva: [Solving Quantitative Reasoning Problems with Language Models](https://arxiv.org/abs/2206.14858)
* Taylor et al. 2022. [Galactica: A Large Language Model for Science](https://arxiv.org/abs/2211.09085)

### Finetuning
* Chung et al. 2022. [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)
* Li et al. 2022. [Competition-Level Code Generation with AlphaCode](https://arxiv.org/abs/2203.07814)
* Fu et al. 2023. [Specializing Smaller Language Models towards Multi-Step Reasoning](https://arxiv.org/abs/2301.12726)

### Reinforcement Learning
* Uesato et al. 2022. [Solving math word problems with process- and outcome-based feedback](https://arxiv.org/abs/2211.14275)
* Le et al. 2022. [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/abs/2207.01780)
* Lightman et al. 2023. [Let's Verify Step by Step](https://openai.com/research/improving-mathematical-reasoning-with-process-supervision)

## Under Development

* [CotHub Standard Prompt Library](spl/readme.md)
* [TODOs](resources/todo.md)
* [Literature](resources/literature.md)
* [Detailed Results](resources/detailed_results.md)
* Experimental section and long context