Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/srush/awesome-o1
List: awesome-o1
Last synced: 14 days ago
- Host: GitHub
- URL: https://github.com/srush/awesome-o1
- Owner: srush
- Created: 2024-10-16T16:43:11.000Z (27 days ago)
- Default Branch: main
- Last Pushed: 2024-10-27T21:17:20.000Z (16 days ago)
- Last Synced: 2024-10-28T01:27:22.360Z (16 days ago)
- Language: TeX
- Size: 109 KB
- Stars: 424
- Watchers: 11
- Forks: 15
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
# awesome-o1
This is a bibliography of papers that are presumed to be related to
OpenAI’s [o1](https://openai.com/index/learning-to-reason-with-llms/).

------------------------------------------------------------------------

> Our large-scale reinforcement learning algorithm teaches the model how
> to think productively using its chain of thought in a highly
> data-efficient training process. We have found that the performance of
> o1 consistently improves with more reinforcement learning (train-time
> compute) and with more time spent thinking (test-time compute). The
> constraints on scaling this approach differ substantially from those
> of LLM pretraining, and we are continuing to investigate them. …
> Similar to how a human may think for a long time before responding to
> a difficult question, o1 uses a chain of thought when attempting to
> solve a problem. Through reinforcement learning, o1 learns to hone its
> chain of thought and refine the strategies it uses. It learns to
> recognize and correct its mistakes. It learns to break down tricky
> steps into simpler ones. It learns to try a different approach when
> the current one isn’t working. This process dramatically improves the
> model’s ability to reason. To illustrate this leap forward, we
> showcase the chain of thought from o1-preview on several difficult
> problems below.

------------------------------------------------------------------------

## What would we actually like to work?
- **Self-Consistency** ([X. Wang et al. 2022](#ref-Wang2022-px))
  Majority voting over sampled LLM outputs improves accuracy a bit; see
  the sketch after this list.
- **Scratchpad** ([Nye et al. 2021](#ref-Nye2021-bx)) /
**Chain-of-Thought** ([Wei et al. 2022](#ref-Wei2022-uj)) Wouldn’t
it be cool if an LLM could talk to itself and get better?
- **Tree-of-Thought** ([Yao et al. 2023](#ref-Yao2023-nw)) Wouldn’t it
  be cool if you could scale this as a tree?

## Why might this be possible?

- **AlphaGo** ([Silver et al. 2016](#ref-Silver2016-ag)) Quantifies the
  value of self-play training vs. test-time search
- **AlphaZero** ([Silver et al. 2017](#ref-Silver2017-bn)) Shows that
  training on search-guided self-play trajectories can be generalized
  and scaled
- **Libratus** ([N. Brown and Sandholm 2017](#ref-Brown2017-of)) Poker
bot built by scaling search
- **Scaling Laws for Board Games** ([Jones 2021](#ref-Jones2021-di))
Clean experiments that compare train / test FLOPs in a controlled
setting
- **Noam Lecture** ([Paul G. Allen School
  2024](#ref-Paul-G-Allen-School2024-da)) Talk from Noam Brown about
  the power of search

## Can reasoning be a verifiable game?

- **WebGPT** ([Nakano et al. 2021](#ref-Nakano2021-iz)) Shows that
  test-time rejection sampling against a reward model yields a very
  strong model; see the sketch after this list.
- **GSM8K** ([Cobbe et al. 2021](#ref-Cobbe2021-gt)) Considers why
  math reasoning is challenging and introduces outcome reward models
  (ORMs) for verification
- **Process Reward** ([Uesato et al. 2022](#ref-Uesato2022-aw))
  Introduces the distinction between process reward and outcome reward
  models, and uses expert-iteration RL.
- **Let’s Verify** ([Lightman et al. 2023](#ref-Lightman2023-cr))
  Demonstrates that PRMs can substantially improve the efficacy of
  rejection sampling
- **Math-Shepherd** ([P. Wang et al. 2023](#ref-Wang2023-ur))
  Experiments with automatic value-function learning from rollouts

## Can a verifier make a better LLM?

- **Expert Iteration** ([Anthony, Tian, and Barber
  2017](#ref-Anthony2017-dm)) Search, collect, train: a method for
  self-improvement in RL.
- **Self-Training** ([Yarowsky 1995](#ref-Yarowsky1995-tm)) Classic
unsupervised method: generate, prune, retrain
- **STaR** ([Zelikman et al. 2022](#ref-Zelikman2022-id)) Formulates
  LLM improvement as retraining on rationales that lead to correct
  answers. Justified as an approximate policy gradient; a minimal
  sketch of this loop follows the list.
- **ReST** ([Gulcehre et al. 2023](#ref-Gulcehre2023-vk)) Models
  improvement as offline RL: sample trajectories, grow the corpus,
  retrain.
- **ReST-EM** ([Singh et al. 2023](#ref-Singh2023-eb)) Formalizes
  similar methods as EM for RL. Applies to reasoning.

## Can LLMs learn to plan?

(This part is the most speculative.)
- **Stream of Search** ([Gandhi et al. 2024](#ref-Gandhi2024-vs))
Training on linearized, non-optimal search trajectories induces
better search.
- **DualFormer** ([Su et al. 2024](#ref-Su2024-us)) Training on
optimal reasoning traces with masked steps improves reasoning
ability.
- **AlphaZero-like** ([Feng et al. 2023](#ref-Feng2023-sz)) /
**MCTS-DPO** ([Xie et al. 2024](#ref-Xie2024-lp)) / **Agent Q**
([Putta et al. 2024](#ref-Putta2024-yy)) Sketches out MCTS-style
expert iteration for LLM planning.
- **PAVs** ([Setlur et al. 2024](#ref-Setlur2024-ax)) Argues for an
  advantage function (PAV) over a value function (PRM) for learning to
  search. Shows an increase in search efficacy; see the sketch after
  this list.
- **SCoRE (Self-Correct)** ([Kumar et al. 2024](#ref-Kumar2024-fj))
  Trains LLMs to self-correct via reinforcement learning.
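
A rough sketch of the process-advantage idea: score each reasoning step by how much it changes the estimated probability of eventually reaching a correct answer, rather than by the absolute value of the prefix (as a PRM would). The `value_estimate` function is an assumption standing in for, e.g., a Monte-Carlo rollout estimate under a prover policy; this is an illustration of the idea, not the authors' implementation.

```python
def step_advantages(value_estimate, prefix_states):
    """Step-level advantages from prefix values.

    `prefix_states[i]` is the partial solution after step i
    (prefix_states[0] is just the problem statement);
    `value_estimate(state)` estimates the probability of eventual success.
    """
    values = [value_estimate(s) for s in prefix_states]
    # Advantage of step i = V(after step i) - V(before step i).
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]
```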

## Does this lead to test time scaling?

- **Optimal test scaling** ([Snell et al. 2024](#ref-Snell2024-dx))
- **Large Language Monkeys** ([B. Brown et al.
  2024](#ref-Brown2024-bs)) Repeated-sampling coverage; see the pass@k
  sketch after this list.
- **Inference Scaling** ([Y. Wu et al. 2024](#ref-Wu2024-mt))
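
These repeated-sampling and inference-scaling papers report coverage as pass@k. Below is the standard unbiased estimator commonly used for that metric (given n samples of which c are correct); it illustrates the metric itself, not any one paper's code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    randomly chosen samples (out of n drawn, c of them correct) is correct,
    i.e. 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # every k-subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 17 correct, budget of 10 picks.
# print(pass_at_k(200, 17, 10))
```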

------------------------------------------------------------------------

## Full Bibliography
Anthony, Thomas, Zheng Tian, and David Barber. 2017. “Thinking Fast and Slow with Deep Learning and Tree Search.” *arXiv \[Cs.AI\]*.

Brown, Bradley, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. 2024. “Large Language Monkeys: Scaling Inference Compute with Repeated Sampling.” *arXiv \[Cs.LG\]*.

Brown, Noam, and Tuomas Sandholm. 2017. “Libratus: The Superhuman AI for No-Limit Poker.” In *Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence*. California: International Joint Conferences on Artificial Intelligence Organization.

Chen, Ziru, Michael White, Raymond Mooney, Ali Payani, Yu Su, and Huan Sun. 2024. “When Is Tree Search Useful for LLM Planning? It Depends on the Discriminator.” *arXiv \[Cs.CL\]*.

Cobbe, Karl, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, et al. 2021. “Training Verifiers to Solve Math Word Problems.” *arXiv \[Cs.LG\]*.

Feng, Xidong, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2023. “Alphazero-Like Tree-Search Can Guide Large Language Model Decoding and Training.” *arXiv \[Cs.LG\]*.

Gandhi, Kanishk, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. 2024. “Stream of Search (SoS): Learning to Search in Language.” *arXiv \[Cs.LG\]*.

Gulcehre, Caglar, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, et al. 2023. “Reinforced Self-Training (ReST) for Language Modeling.” *arXiv \[Cs.CL\]*.

Jones, Andy L. 2021. “Scaling Scaling Laws with Board Games.” *arXiv \[Cs.LG\]*.

Kazemnejad, Amirhossein, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. 2024. “VinePPO: Unlocking RL Potential for LLM Reasoning Through Refined Credit Assignment.” *arXiv \[Cs.LG\]*.

Kirchner, Jan Hendrik, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. 2024. “Prover-Verifier Games Improve Legibility of LLM Outputs.” *arXiv \[Cs.CL\]*.

Kumar, Aviral, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, et al. 2024. “Training Language Models to Self-Correct via Reinforcement Learning.” *arXiv \[Cs.LG\]*.

Lightman, Hunter, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. “Let’s Verify Step by Step.” *arXiv \[Cs.LG\]*.

Nakano, Reiichiro, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, et al. 2021. “WebGPT: Browser-Assisted Question-Answering with Human Feedback.” *arXiv \[Cs.CL\]*.

Nye, Maxwell, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, et al. 2021. “Show Your Work: Scratchpads for Intermediate Computation with Language Models.” *arXiv \[Cs.LG\]*.

Paul G. Allen School. 2024. “Parables on the Power of Planning in AI: From Poker to Diplomacy: Noam Brown (OpenAI).” YouTube.

Putta, Pranav, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. 2024. “Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents.” *arXiv \[Cs.AI\]*.

Setlur, Amrith, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. 2024. “Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning.” *arXiv \[Cs.LG\]*.

Silver, David, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, et al. 2016. “Mastering the Game of Go with Deep Neural Networks and Tree Search.” *Nature* 529 (7587): 484–89.

Silver, David, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, et al. 2017. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.” *arXiv \[Cs.AI\]*.

Singh, Avi, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, et al. 2023. “Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models.” *arXiv \[Cs.LG\]*.

Snell, Charlie, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. “Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.” *arXiv \[Cs.LG\]*.

Su, Dijia, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, and Qinqing Zheng. 2024. “Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces.” *arXiv \[Cs.AI\]*.

Uesato, Jonathan, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. “Solving Math Word Problems with Process- and Outcome-Based Feedback.” *arXiv \[Cs.LG\]*.

Wang, Junlin, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. 2024. “Mixture-of-Agents Enhances Large Language Model Capabilities.” *arXiv \[Cs.CL\]*.

Wang, Peiyi, Lei Li, Zhihong Shao, R X Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. 2023. “Math-Shepherd: Verify and Reinforce LLMs Step-by-Step Without Human Annotations.” *arXiv \[Cs.AI\]*.

Wang, Xuezhi, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” *arXiv \[Cs.CL\]*.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Edited by S Koyejo, S Mohamed, A Agarwal, D Belgrave, K Cho, and A Oh. *arXiv \[Cs.CL\]*, 24824–37.

Welleck, Sean, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui. 2024. “From Decoding to Meta-Generation: Inference-Time Algorithms for Large Language Models.” *arXiv \[Cs.CL\]*.

Wu, Tianhao, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. 2024. “Thinking LLMs: General Instruction Following with Thought Generation.” *arXiv \[Cs.CL\]*.

Wu, Yangzhen, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. 2024. “Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models.” *arXiv \[Cs.AI\]*.

Xie, Yuxi, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. 2024. “Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning.” *arXiv \[Cs.AI\]*.

Xie, Yuxi, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. 2023. “Self-Evaluation Guided Beam Search for Reasoning.” *arXiv \[Cs.CL\]*.

Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” *arXiv \[Cs.CL\]*.

Yarowsky, David. 1995. “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods.” In *Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics*. Morristown, NJ, USA: Association for Computational Linguistics.

Yoshida, Davis, Kartik Goyal, and Kevin Gimpel. 2024. “MAP’s Not Dead Yet: Uncovering True Language Model Modes by Conditioning Away Degeneracy.” In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, 16164–215. Stroudsburg, PA, USA: Association for Computational Linguistics.

Zelikman, Eric, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. 2024. “Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking.” *arXiv \[Cs.CL\]*.

Zelikman, Eric, Yuhuai Wu, Jesse Mu, and Noah D Goodman. 2022. “STaR: Bootstrapping Reasoning with Reasoning.” *arXiv \[Cs.LG\]*.

Zhao, Stephen, Rob Brekelmans, Alireza Makhzani, and Roger Grosse. 2024. “Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo.” *arXiv \[Cs.LG\]*.