https://github.com/burnycoder/diverse-group-relative-policy-optimization

Reinforcement learning algorithm to make LLMs reason more creatively. While GRPO normalizes rewards within groups of responses, DGRPO incorporates solution diversity into the advantage calculation through two novel approaches: (1) upweighting less likely but correct tokens to incentivize rare solutions, and (2) quantifying solution uniqueness using cosine similarity of embeddings.

Host: GitHub
URL: https://github.com/burnycoder/diverse-group-relative-policy-optimization
Owner: BurnyCoder
Created: 2025-03-23T16:13:49.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-24T01:20:58.000Z (over 1 year ago)
Last Synced: 2025-03-24T01:23:09.953Z (over 1 year ago)
Language: Jupyter Notebook
Homepage:
Size: 339 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Diverse Group Relative Policy Optimization (DGRPO): Upweighting Rare, Accurate Solutions for STEM and Art

Language models trained with reinforcement learning often converge to common solution patterns, limiting their creative problem-solving capabilities. This paper introduces Diverse Group Relative Policy Optimization (DGRPO), an extension of Group Relative Policy Optimization (GRPO) that specifically addresses this limitation. While GRPO normalizes rewards within groups of responses to promote accuracy, DGRPO incorporates solution diversity into the advantage calculation through two novel approaches: (1) upweighting less likely but correct tokens to incentivize rare solutions, and (2) quantifying solution uniqueness using cosine similarity of neural embeddings. By introducing a configurable diversity weight parameter, DGRPO allows practitioners to balance accuracy with exploration of diverse solution strategies. My approach encourages language models to discover multiple valid approaches to problems, a critical capability for applications in scientific discovery, mathematical problem-solving, creative coding, and art. DGRPO demonstrates how reinforcement learning can be adapted to reward not just correctness but also the novelty and diversity of solutions in generative AI systems.
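The idea described above can be illustrated with a minimal, standard-library sketch. It is not the notebook's code and does not reproduce the paper's exact formulas. The function names, the choice of population standard deviation, the rule of adding the bonus only to responses with positive reward, and the default `diversity_weight` are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def uniqueness(embeddings):
    """Approach (2), sketched: 1 minus the mean cosine similarity to the other responses."""
    scores = []
    for i, e in enumerate(embeddings):
        others = [cosine(e, o) for j, o in enumerate(embeddings) if j != i]
        scores.append(1.0 - mean(others) if others else 0.0)
    return scores


def rarity(token_logprobs):
    """Approach (1), sketched: mean negative log-probability of a response's tokens.

    Higher values mean the policy considered the response less likely.
    """
    return -mean(token_logprobs)


def diverse_advantages(rewards, diversity, diversity_weight=0.5):
    """Add a weighted diversity bonus to correct responses, then normalize per group.

    Assumption: only responses with positive reward receive the bonus, so that
    rare but wrong answers are not encouraged.
    """
    shaped = [
        r + diversity_weight * d if r > 0 else r
        for r, d in zip(rewards, diversity)
    ]
    return group_advantages(shaped)


# Toy group of four responses: three share an embedding, one differs.
rewards = [1.0, 1.0, 0.0, 1.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
adv = diverse_advantages(rewards, uniqueness(embeddings))
```

In this toy group the fourth response is both correct and dissimilar from the rest, so it ends up with the largest advantage, while the incorrect third response gets the smallest. Setting `diversity_weight=0` reduces the computation to plain group-normalized advantages. Passing `[rarity(lp) for lp in per_response_logprobs]` as `diversity` (where `per_response_logprobs` is a hypothetical list of per-token log-probabilities) would give a sequence-level stand-in for approach (1).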

See the paper, `Diverse_Group_Relative_Policy_Optimization_(DGRPO).pdf`, for more details.

## Usage

Run `Diverse_Group_Relative_Policy_Optimization_(DGRPO).ipynb` locally or in Google Colab.

## Disclaimer

Work in progress.

## Summary

The project includes:

- Implementations of the GRPO and DGRPO training algorithms, with reward functions
- Training examples on math reasoning tasks (GSM8K)
- Comparative model evaluation tools
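As a rough illustration of what a correctness reward for GSM8K could look like, the sketch below relies on the fact that GSM8K reference solutions end with `#### <final answer>`. The function names and the "last number in the completion" extraction rule are hypothetical and are not taken from the notebook, which may parse model output differently.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def gold_answer(solution: str) -> str:
    """Extract the final answer that follows '####' in a GSM8K reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str):
    """Hypothetical convention: treat the last number in the completion as the answer."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def correctness_reward(completion: str, solution: str) -> float:
    """Return 1.0 if the predicted number equals the reference answer, else 0.0."""
    pred = predicted_answer(completion)
    if pred is None:
        return 0.0
    try:
        return 1.0 if float(pred) == float(gold_answer(solution)) else 0.0
    except ValueError:
        return 0.0
```

A binary reward like this is what the group normalization then operates on: within each group of sampled completions, the correct ones receive above-average reward and the incorrect ones below-average.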

## To do

- Fix saving to Google Drive
- Double check code for errors
- Fix broken plotting
- Fix broken testing
- Dedupe and refactor code
- Benchmark on GSM8K and other datasets and compare original model, model trained using GRPO, and model trained using DGRPO
- Implement cosine similarity method in code and benchmark it
- Train bigger models
- Fix broken formatting in paper
- Add citation information
