{"id":24986872,"url":"https://github.com/superlinear-ai/microgrpo","last_synced_at":"2025-07-17T16:04:10.846Z","repository":{"id":275662676,"uuid":"926437543","full_name":"superlinear-ai/microGRPO","owner":"superlinear-ai","description":"🐭 A tiny single-file implementation of Group Relative Policy Optimization (GRPO) as introduced by the DeepSeekMath paper","archived":false,"fork":false,"pushed_at":"2025-06-28T12:18:18.000Z","size":28,"stargazers_count":30,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-28T13:19:23.583Z","etag":null,"topics":["autograd","drgrpo","grpo","loop","magistral","numpy","reinforcement-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/superlinear-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-03T08:56:01.000Z","updated_at":"2025-06-28T12:18:19.000Z","dependencies_parsed_at":"2025-02-03T22:31:35.801Z","dependency_job_id":"7a307b47-3d8a-46ff-bfde-fc67b9288f67","html_url":"https://github.com/superlinear-ai/microGRPO","commit_stats":null,"previous_names":["superlinear-ai/microgrpo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/superlinear-ai/microGRPO","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinear-ai%2FmicroGRPO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinear-ai%2FmicroGRPO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinear-ai%2FmicroGRPO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinear-ai%2FmicroGRPO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/superlinear-ai","download_url":"https://codeload.github.com/superlinear-ai/microGRPO/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinear-ai%2FmicroGRPO/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265625570,"owners_count":23800624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["autograd","drgrpo","grpo","loop","magistral","numpy","reinforcement-learning"],"created_at":"2025-02-04T11:32:54.771Z","updated_at":"2025-07-17T16:04:10.841Z","avatar_url":"https://github.com/superlinear-ai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# microGRPO\n\nA tiny single-file implementation of Group Relative Policy Optimization (GRPO) as introduced by the DeepSeekMath paper[^1][^2][^3].\n\n🆕 microGRPO now implements the GRPO improvements introduced by Dr. GRPO[^4], Apple's LOOP[^5], and Mistral's Magistral[^6]:\n1. 💥 Remove per-group advantage normalization[^4]\n2. ⛳️ Leave-one-out advantage[^5] (LOOP only)\n3. 🔥 Eliminate KL divergence[^5]\n4. 🎢 Normalize loss[^5]\n5. 🏆 Add per-batch advantage normalization[^6] (Magistral only)\n6. 🚦 Relax trust region bounds[^5]\n7. 🌈 Eliminate non-diverse groups[^5]\n\n[^1]: [The DeepSeekMath paper](https://arxiv.org/abs/2402.03300)\n[^2]: [Yuge (Jimmy) Shi's blog post](https://yugeten.github.io/posts/2025/01/ppogrpo/)\n[^3]: [Nathan Lambert's RLHF book](https://rlhfbook.com/c/11-policy-gradients.html)\n[^4]: [The Dr. GRPO paper](https://arxiv.org/pdf/2503.20783)\n[^5]: [Apple's LOOP paper](https://arxiv.org/pdf/2502.01600)\n[^6]: [Mistral's Magistral paper](https://arxiv.org/pdf/2506.10910)\n\n## Features\n\n1. 🐭 Only ~300 lines of code\n2. 📦 In pure NumPy, with [autograd](https://github.com/HIPS/autograd) to compute the gradient\n3. ✅ Type annotated and linted\n4. ✂️ Easily swap out the default game and train on any other game or environment\n\n## Getting started\n\n\u003e [!NOTE]\n\u003e You'll need to [install uv](https://docs.astral.sh/uv/getting-started/installation/) to run the commands below.\n\nTo start teaching a policy to play a simplified version of [Battleship](https://en.wikipedia.org/wiki/Battleship_(game)), run:\n```sh\nuv run microgrpo.py\n```\n\nYou should see that the policy learns to improve its average score from around 15% to about 50% over 2000 iterations:\n\n![Battleship policy trained with GRPO](https://github.com/user-attachments/assets/de464264-2d1c-43f2-9bc3-dcd9eea48c45)\n\n## Background\n\n#### File structure\n\nThe file is structured into five sections:\n\n1. 🕹️ Game (~50 lines): An implementation of the Battleship board game\n2. 🌍 Environment (~60 lines): The API with which an agent can interact with the game\n3. 🧠 Policy (~30 lines): A model that produces action probabilities given the observed environment state\n4. 🎯 GRPO (~80 lines): The GRPO objective function and training data generator\n5. ⚡ Train (~50 lines): The loop that collects training data and optimizes the GRPO objective with AdamW\n\n#### GRPO config\n\nStarting a training run only requires defining a `GRPOConfig` with your choice of environment (here, `BattleshipEnv`) and a function that evaluates the policy model given its parameters (here, `neural_battleship_policy`):\n\n```python\n# Define the environment and the policy model to optimize.\ngrpo_config = GRPOConfig(environment=BattleshipEnv, policy=neural_battleship_policy)\n\n# Train the policy model by maximizing the GRPO objective with AdamW.\nθ_star, rewards_val = train_grpo(θ_init := neural_battleship_policy_init(), grpo_config)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperlinear-ai%2Fmicrogrpo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsuperlinear-ai%2Fmicrogrpo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperlinear-ai%2Fmicrogrpo/lists"}