Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jackfsuia/open-m1
A roadmap to reproduce OpenAI o1, Combining PRM and RLHF to train the model to learn infinite CoT (Chain of Thought).
https://github.com/jackfsuia/open-m1
chatgpt llm openai-o1 reinforcement-learning strawberry
Last synced: 18 days ago
JSON representation
A roadmap to reproduce OpenAI o1, Combining PRM and RLHF to train the model to learn infinite CoT (Chain of Thought).
- Host: GitHub
- URL: https://github.com/jackfsuia/open-m1
- Owner: jackfsuia
- License: mit
- Created: 2024-10-12T03:32:51.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-10-27T16:32:02.000Z (21 days ago)
- Last Synced: 2024-10-27T19:33:49.019Z (20 days ago)
- Topics: chatgpt, llm, openai-o1, reinforcement-learning, strawberry
- Homepage:
- Size: 8.79 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# open-m1
A roadmap to reproduce OpenAI o1.
# A roadmap
**Step 1: A good base model.** Find a good math base model M0. For example, deepseek-math-base. The larger it is, the more likely for its emergence of self-backtrack and self-refine abilitis.**Step 2: Format training.** Finetune M0 to produce solutions in a newline delimited step-by-step format [1] for PRM training. The result is model M1.
**Step 3: PRM reinforcement learning.** Now model M1 can produce formatted CoT. Train a PRM like [1]. Use it to do the reinforcement learning for M1, of which object is maximizing the expectation of step rewards. Then this model is used to train PRM again... run as many iterations as possible. The result is model M2.
**Step 4: Inference.** During inference of model M2, if time is out, the program adds a "\n Therefore the answer is " to force it to produce the final answer [2]. Then one another model will produce a summary for this whole output.
# Reference
[1] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever. Let’s Verify Step by Step.[2] Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike. LLM Critics Help Catch LLM Bugs.