https://github.com/burnycoder/diverse-group-relative-policy-optimization
Reinforcement learning algorithm to make LLMs reason more creatively. While GRPO normalizes rewards within groups of responses, DGRPO incorporates solution diversity into the advantage calculation through two novel approaches: (1) upweighting less likely but correct tokens to incentivize rare solutions, and (2) cosine similarity of embeddings.
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/burnycoder/diverse-group-relative-policy-optimization
- Owner: BurnyCoder
- Created: 2025-03-23T16:13:49.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-24T01:20:58.000Z (over 1 year ago)
- Last Synced: 2025-03-24T01:23:09.953Z (over 1 year ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 339 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Diverse Group Relative Policy Optimization (DGRPO): Upweighting Rare, Accurate Solutions for STEM and Art
Language models trained with reinforcement learning often converge to common solution patterns, limiting their creative problem-solving capabilities. This paper introduces Diverse Group Relative Policy Optimization (DGRPO), an extension of Group Relative Policy Optimization (GRPO) that specifically addresses this limitation. While GRPO normalizes rewards within groups of responses to promote accuracy, DGRPO incorporates solution diversity into the advantage calculation through two novel approaches: (1) upweighting less likely but correct tokens to incentivize rare solutions, and (2) quantifying solution uniqueness using cosine similarity of neural embeddings. By introducing a configurable diversity weight parameter, DGRPO allows practitioners to balance accuracy with exploration of diverse solution strategies. My approach encourages language models to discover multiple valid approaches to problems, a critical capability for applications in scientific discovery, mathematical problem-solving, creative coding, and art. DGRPO demonstrates how reinforcement learning can be adapted to reward not just correctness but also the novelty and diversity of solutions in generative AI systems.
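Approach (1) can be illustrated with a minimal sketch. This is not the notebook's implementation; the exact formula is in the paper. The function name `rarity_weighted_token_advantages`, the weighting rule `1 + diversity_weight * (1 - p)`, and the use of a positive advantage as a stand-in for "correct" are all assumptions made for illustration.

```python
import math


def rarity_weighted_token_advantages(advantage, token_logprobs, diversity_weight=0.5):
    """Spread one response's advantage over its tokens, boosting rare tokens.

    Hypothetical rule (an assumption, not the paper's formula): when the
    response was rewarded (advantage > 0), a token sampled with probability p
    gets its advantage scaled by 1 + diversity_weight * (1 - p). Tokens of
    penalized responses are left unchanged.
    """
    weighted = []
    for logp in token_logprobs:
        p = math.exp(logp)  # probability the policy assigned to this token
        if advantage > 0:
            weight = 1.0 + diversity_weight * (1.0 - p)
        else:
            weight = 1.0
        weighted.append(advantage * weight)
    return weighted


# One rewarded response with a likely token (p = 0.9) and a rare token (p = 0.1).
example = rarity_weighted_token_advantages(
    advantage=1.0,
    token_logprobs=[math.log(0.9), math.log(0.1)],
    diversity_weight=0.5,
)
```

Setting `diversity_weight=0` in this sketch leaves every token with the unmodified advantage, which matches the idea of a configurable trade-off between accuracy and exploration.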
Read the paper, `Diverse_Group_Relative_Policy_Optimization_(DGRPO).pdf`, for more details.
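As a rough illustration of approach (2), the sketch below standardizes rewards within a group, as GRPO does, and adds a uniqueness bonus based on cosine similarity of embeddings. The to-do list notes that the cosine similarity method is not yet implemented in the notebook, so this is only one possible reading. The function names, the rule "uniqueness = 1 minus mean similarity to the other responses", the restriction of the bonus to responses with positive reward, and the use of population standard deviation are assumptions. Embeddings are assumed to be non-zero vectors, with at least two responses per group.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def uniqueness_scores(embeddings):
    """Hypothetical uniqueness: 1 - mean cosine similarity to the other responses."""
    n = len(embeddings)
    scores = []
    for i in range(n):
        sims = [cosine_similarity(embeddings[i], embeddings[j])
                for j in range(n) if j != i]
        scores.append(1.0 - mean(sims))
    return scores


def diverse_advantages(rewards, embeddings, diversity_weight=0.5):
    """Add a diversity bonus to rewarded responses, then standardize per group."""
    uniq = uniqueness_scores(embeddings)
    shaped = [
        r + (diversity_weight * u if r > 0 else 0.0)  # no bonus for wrong answers
        for r, u in zip(rewards, uniq)
    ]
    return group_relative_advantages(shaped)


# Toy group: two near-identical correct answers, one wrong answer,
# and one correct answer that takes a different route.
rewards = [1.0, 1.0, 0.0, 1.0]
embeddings = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
adv = diverse_advantages(rewards, embeddings, diversity_weight=0.5)
```

In this toy group the fourth response is correct and unlike the others, so it ends up with the largest advantage, while the wrong answer keeps a negative one. With `diversity_weight=0` the result reduces to the plain group-normalized advantages.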
## Usage
Run `Diverse_Group_Relative_Policy_Optimization_(DGRPO).ipynb` locally or in Google Colab.
## Disclaimer
Work in progress.
## Summary
The project includes:
- Implementations of the GRPO and DGRPO training algorithms, including reward functions
- Training examples on math reasoning tasks (GSM8K)
- Comparative model evaluation tools
## To do
- Fix saving to Google Drive
- Double check code for errors
- Fix broken plotting
- Fix broken testing
- Dedupe and refactor code
- Benchmark on GSM8K and other datasets and compare original model, model trained using GRPO, and model trained using DGRPO
- Implement cosine similarity method in code and benchmark it
- Train bigger models
- Fix broken formatting in paper
- Add citation information