https://github.com/akshint0407/nano-r1
This project demonstrates the process of fine-tuning the Qwen2.5-3B-Instruct model using GRPO (Generalized Reward Policy Optimization) on the GSM8K dataset.
https://github.com/akshint0407/nano-r1
adapters grpo huggingface python qwen2-5 safetensors text-generation-inference transformer trl unsloth
Last synced: about 1 month ago
JSON representation
This project demonstrates the process of fine-tuning the Qwen2.5-3B-Instruct model using GRPO (Generalized Reward Policy Optimization) on the GSM8K dataset.
- Host: GitHub
- URL: https://github.com/akshint0407/nano-r1
- Owner: Akshint0407
- License: apache-2.0
- Created: 2025-04-04T06:00:58.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-07T07:23:25.000Z (about 1 year ago)
- Last Synced: 2025-06-04T18:54:16.273Z (about 1 year ago)
- Topics: adapters, grpo, huggingface, python, qwen2-5, safetensors, text-generation-inference, transformer, trl, unsloth
- Language: Jupyter Notebook
- Homepage: https://huggingface.co/Akshint47/Nano_R1_Model
- Size: 769 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Nano-R1
# Fine-Tuning Qwen2.5-3B-Instruct with GRPO for Mathematical Reasoning



This repository contains code for fine-tuning the **Qwen2.5-3B-Instruct** model using **GRPO (Generalized Reward Policy Optimization)** on the **GSM8K** dataset. The goal is to improve the model's ability to solve mathematical reasoning problems through reinforcement learning with custom reward functions.
## 🚀 Deployment
The fine-tuned model is deployed on Hugging Face and can be accessed here:
🔗 **[Hugging Face Model Hub](https://huggingface.co/Akshint47/Nano_R1_Model)**
You can interact with the model directly or integrate it into your projects using the Hugging Face `transformers` library.
## ✨ Features
- **Efficient Fine-Tuning**: Uses Unsloth and LoRA for faster training with reduced GPU memory.
- **Custom Reward Engineering**:
- Correctness (answer accuracy)
- Format adherence (XML-structured reasoning)
- Integer validation
- XML completeness scoring
- **vLLM Integration**: Accelerates inference during training.
- **GSM8K Focus**: Optimized for mathematical word problems.
## 📋 Requirements
```bash
# Core packages
pip install unsloth vllm trl datasets
```
## Additional dependencies
```bash
pip install torch transformers sentence piece accelerate
```
## Hardware Recommendations:
GPU with ≥16GB VRAM (e.g., NVIDIA T4, A10G, or better)
Recommended: CUDA 12.x and cuDNN 8.6+
## 🛠️ Setup & Usage
Install dependencies:
```bash
git clone https://github.com/your-username/your-repo.git
cd your-repo
pip install -r requirements.txt
```
Run the notebook:
```bash
jupyter notebook nano_r1_train_v2.ipynb
```
Key Configuration (in notebook):
```python```
```
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "Qwen/Qwen2.5-3B-Instruct",
max_seq_length = 1024,
load_in_4bit = True,
max_lora_rank = 64
)
```
## 📊 Training Process
The GRPO trainer optimizes for:
- Reward Maximization: Combined score from all reward functions
- KL Regularization: Maintains policy stability
- Efficiency: Processes 8 generations per batch
- Training Progress (replace with actual metrics screenshot)
## 📜 License
This project is licensed under the **Apache License 2.0** - see the [LICENSE](LICENSE) file for full terms.
## 🙏 Acknowledgments
- Unsloth for optimization tools
- Hugging Face for models and datasets
- vLLM for fast inference
- OpenAI for the GSM8K dataset
## 🤝 Contributing
- Contributions are welcome! Please open an issue or PR for:
- Bug fixes
- Additional reward functions
- Performance improvements