{"id":28463839,"url":"https://github.com/gair-nlp/octothinker","last_synced_at":"2025-06-30T17:32:03.138Z","repository":{"id":289539728,"uuid":"967908227","full_name":"GAIR-NLP/OctoThinker","owner":"GAIR-NLP","description":"Revisiting Mid-training in the Era of RL Scaling","archived":false,"fork":false,"pushed_at":"2025-06-26T11:53:57.000Z","size":17058,"stargazers_count":68,"open_issues_count":1,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-26T12:43:30.497Z","etag":null,"topics":["llama","llm","mid-training","post-training","pre-training","qwen","reasoning","rl","verl"],"latest_commit_sha":null,"homepage":"http://arxiv.org/abs/2506.20512","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GAIR-NLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-17T07:30:51.000Z","updated_at":"2025-06-26T11:54:00.000Z","dependencies_parsed_at":"2025-06-26T12:38:25.055Z","dependency_job_id":null,"html_url":"https://github.com/GAIR-NLP/OctoThinker","commit_stats":null,"previous_names":["gair-nlp/octothinker"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/GAIR-NLP/OctoThinker","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2FOctoThinker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2FOctoThinker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2FOctoThinker/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/GAIR-NLP%2FOctoThinker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GAIR-NLP","download_url":"https://codeload.github.com/GAIR-NLP/OctoThinker/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2FOctoThinker/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262819016,"owners_count":23369398,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llama","llm","mid-training","post-training","pre-training","qwen","reasoning","rl","verl"],"created_at":"2025-06-07T05:01:19.432Z","updated_at":"2025-06-30T17:32:03.128Z","avatar_url":"https://github.com/GAIR-NLP.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003e🐙 OctoThinker\u003cbr\u003e\nMid-training Incentivizes Reinforcement Learning Scaling\n\u003c/h1\u003e\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n\n[![arXiv](https://img.shields.io/badge/arXiv-2506.20512-red?style=for-the-badge\u0026.svg)](https://arxiv.org/abs/2506.20512)\n[![Notion](https://img.shields.io/badge/Notion_Blog-4d8cd8?style=for-the-badge\u0026logo=notion\u0026logoColor=white)](https://tinyurl.com/OctoThinker)\n[![HF Org (Model \u0026 Data)](https://img.shields.io/badge/HF_Org_(Data_\u0026_Model)-5f16a8?style=for-the-badge\u0026logo=huggingface\u0026logoColor=white)](https://huggingface.co/OctoThinker)\n\u003c/div\u003e\n\n\u003e *Revisiting Mid-training in the Era of RL Scaling*\n\n## 🔥 News\n- 
**[2025-06-26]** 🎉🎉🎉 We release our detailed technical report on [**arXiv**](https://arxiv.org/abs/2506.20512) \nand the MegaMath-Web-Pro-Max corpus on [**HuggingFace**](https://huggingface.co/datasets/OctoThinker/MegaMath-Web-Pro-Max).\n- **[2025-04-24]** 🎉🎉🎉 We release our first progress blog on [**Notion**](https://tinyurl.com/OctoThinker), together with the first version of our base and RL models on [**HuggingFace**](https://huggingface.co/collections/GAIR/octothinker-68035e416813f9833a8060f3), which are trained on the Llama-3 series.\n\n## 📖 Introduction\n\n\n![](./assets/octothinker_banner.png)\n\n\u003e **Note:** We are still exploring more possibilities and expanding to different model families, but we are eager to share some findings from our empirical results with the community in an open-source manner!\n\nWe explore how different early pre(mid)-training strategies affect post-training stages, especially Reinforcement Learning (RL). We hope to reshape the pre-training stage of LLMs in the era of RL scaling. **🐙 OctoThinker** is our initial attempt to explore this direction. 
\n**We run a complete pipeline of pre-training, RL, and evaluation to gain deeper insights.**\n\n### What does 🐙 OctoThinker mean?\n\"Octo\" comes from the word \"octopus\", representing our base model families, which are branched and trained via different strategies.\n\"Thinker\" means the model is finally trained to think and reason at the RL stage, where it is expected to show frequent self-reflection behaviors and strong reasoning abilities.\n\n## Usage\nCurrently, our repo contains three main parts:\n- Pre-training code based on [Nanotron](https://github.com/huggingface/nanotron)\n- RL code based on [verl](https://github.com/volcengine/verl)\n- Evaluation code refined from [DeepSeekMath](https://github.com/deepseek-ai/deepseek-math) and [MegaMath](https://github.com/LLM360/MegaMath)\n\n### Pre-training\n\n\u003csummary\u003e\u003cb\u003ePre-training Environment Setup\u003c/b\u003e\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\nconda create -n nanotron python=3.10\nconda activate nanotron\ncd nanotron\npip install -r requirements.txt\n```\n\u003c/p\u003e\n\n\u003csummary\u003e\u003cb\u003eTo Submit Pre-training Jobs\u003c/b\u003e\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\n#TODO: add pre-training scripts\n```\n\u003c/p\u003e\n\n### RL\n\n\u003csummary\u003e\u003cb\u003eRL Environment Setup\u003c/b\u003e\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\n#TODO: add RL scripts\n```\n\u003c/p\u003e\n\n\u003csummary\u003e\u003cb\u003eTo Submit RL Jobs\u003c/b\u003e\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\n#TODO: add RL scripts\n```\n\u003c/p\u003e\n\n### Evaluation\n\n\u003csummary\u003e\u003cb\u003eEvaluation Environment Setup\u003c/b\u003e\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\nconda create -n matheval python=3.10\nconda activate matheval\ncd eval\npip install -r requirements.txt\n```\n\u003c/p\u003e\n\n\u003csummary\u003e\u003cb\u003eTo Submit Evaluation Jobs\u003c/b\u003e\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\ncd eval\nbash 
scripts/en_math_cot_eval_last4dir.sh \u003cmodel_root_dir\u003e\n```\n\n\u003c/p\u003e\n\n\n### Visualization\nWe also provide the visualization code for the pre-training and RL processes. All visualization code is in the [plot](./plot/) directory to support reproducibility.\n\n\n## Acknowledgements\n\nFor the training framework and inference engine, we use [**verl**](https://github.com/volcengine/verl) and [**vLLM**](https://github.com/vllm-project/vllm). We thank the Hugging Face **[open-r1 team](https://huggingface.co/open-r1)**, [**a-m-team**](https://huggingface.co/a-m-team), and the [**SimpleRL**](https://github.com/hkust-nlp/simpleRL-reason) project for open-sourcing their datasets and training recipes. We are also deeply grateful to the entire open-source community for their tireless efforts in making our exploration possible.\n\nIf you find this work useful, please cite:\n\n```\n@article{wang2025octothinker,\n  title={OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling},\n  author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei},\n  journal={arXiv preprint arXiv:2506.20512},\n  year={2025},\n  note={Preprint}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgair-nlp%2Foctothinker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgair-nlp%2Foctothinker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgair-nlp%2Foctothinker/lists"}