Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://tiger-ai-lab.github.io/MAmmoTH2/
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
- Host: GitHub
- URL: https://tiger-ai-lab.github.io/MAmmoTH2/
- Owner: TIGER-AI-Lab
- License: mit
- Created: 2024-05-04T01:23:39.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-10-27T03:18:55.000Z (11 days ago)
- Last Synced: 2024-10-27T04:25:04.452Z (11 days ago)
- Topics: language, math, reasoning
- Language: Python
- Homepage: https://tiger-ai-lab.github.io/MAmmoTH2/
- Size: 19.8 MB
- Stars: 122
- Watchers: 3
- Forks: 9
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-llm4math - WebInstruct(Sub)
README
# MAmmoTH2
This repo contains the code, data, and models for the NeurIPS 2024 paper "[MAmmoTH2: Scaling Instructions from the Web](https://arxiv.org/abs/2405.03548)". Our paper proposes a new paradigm for scaling up high-quality instruction data from the web.
🔥 🔥 🔥 Check out our [Project Page](https://tiger-ai-lab.github.io/MAmmoTH2/) for more results and analysis! Also, our Demo is online!
## WebInstruct
We propose discovering instruction data from the web. We argue that vast amounts of high-quality instruction data exist in the web corpus, spanning various domains such as math and science. Our three-step pipeline involves recalling documents from Common Crawl, extracting Q-A pairs, and refining them for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing datasets. We name our curated dataset WebInstruct.
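The extraction step of the pipeline can be pictured with a toy sketch. The actual WebInstruct pipeline uses trained models for document recall and LLM-based extraction and refinement; the regex-based extractor below (all names and the `Q:`/`A:` format are illustrative, not from the paper's code) only conveys the document-to-pair mining idea on cleanly formatted text:

```python
import re

def extract_qa_pairs(document: str) -> list[tuple[str, str]]:
    """Toy stand-in for the extraction stage: find 'Q: ... A: ...' spans.

    The real pipeline extracts and refines pairs with LLMs; this version
    only illustrates the document -> (question, answer) mapping.
    """
    # Lazily match a question up to 'A:', then an answer up to the next
    # 'Q:' or end of document.
    pattern = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=\nQ:|\Z)", re.DOTALL)
    return [(q.strip(), a.strip()) for q, a in pattern.findall(document)]

doc = "Q: What is 2 + 3?\nA: 5\nQ: Integrate x dx.\nA: x^2/2 + C"
pairs = extract_qa_pairs(doc)
```

A refinement stage would then rewrite each extracted answer for correctness and style before the pair enters the training set.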
Part of our WebInstruct dataset has been released at [🤗 TIGER-Lab/WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub) and [🤗 TIGER-Lab/WebInstructFull](https://huggingface.co/datasets/TIGER-Lab/WebInstructFull).
## Model Downloads
| **Model** | **Dataset** | **Init Model** | **Download** |
| :------------: | :------------: | :------------: | :------------: |
| MAmmoTH2-8x7B | WebInstruct | Mixtral-8x7B | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-8x7B) |
| MAmmoTH2-7B | WebInstruct | Mistral-7B-v0.2| [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-7B) |
| MAmmoTH2-8B | WebInstruct | Llama-3-base | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-8B) |
| MAmmoTH2-8x7B-Plus | WebInstruct + OpenHermes2.5 + CodeFeedback + Math-Plus | MAmmoTH2-8x7B | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-8x7B-Plus) |
| MAmmoTH2-7B-Plus | WebInstruct + OpenHermes2.5 + CodeFeedback + Math-Plus | MAmmoTH2-7B | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-7B-Plus) |
| MAmmoTH2-8B-Plus | WebInstruct + OpenHermes2.5 + CodeFeedback + Math-Plus | MAmmoTH2-8B | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-8-Plus) |

## Evaluation Results
Please refer to https://tiger-ai-lab.github.io/MAmmoTH2/ for more details.
## Evaluation Command
Please refer to https://github.com/TIGER-AI-Lab/MAmmoTH2/tree/main/math_eval.

## Cite our paper
Please cite our paper if you use our data, models, or code. Please also kindly cite the original dataset papers.
```bibtex
@article{yue2024mammoth2,
title={MAmmoTH2: Scaling Instructions from the Web},
author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu},
journal={Advances in Neural Information Processing Systems},
year={2024}
}
```