https://github.com/Raj-08/Q-Flow
Complete Reinforcement Learning Toolkit for Large Language Models!
- Host: GitHub
- URL: https://github.com/Raj-08/Q-Flow
- Owner: Raj-08
- License: mit
- Created: 2025-02-22T21:44:54.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-03-14T11:00:32.000Z (about 2 months ago)
- Last Synced: 2025-03-14T12:21:29.635Z (about 2 months ago)
- Language: Python
- Homepage:
- Size: 1.05 MB
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - Raj-08/Reinforce-Lite - Reinforce-Lite is a reinforcement learning toolkit designed for large language models. It aims to simplify and accelerate the reinforcement learning process so that developers can train and optimize LLMs more easily. The toolkit provides a set of predefined modules and utilities, such as environment interaction, reward functions, and policy optimization algorithms. The core strengths of Reinforce-Lite are that it is lightweight and easy to use, so even newcomers to reinforcement learning can get started quickly. It supports multiple reinforcement learning algorithms and lets users customize environments and reward mechanisms. The project's goal is to build a flexible and efficient platform that helps researchers and developers explore the potential of LLMs in tasks such as text generation, dialogue systems, and intelligent agents. With Reinforce-Lite, users can apply reinforcement learning more effectively to improve the performance and adaptability of LLMs. The project encourages community contributions to advance reinforcement learning for LLMs. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# Q-FLOW
Welcome to **Q-Flow**! We focus on advancing open-source development of Reinforcement Learning (RL) for LLMs. At **QFlow**, we provide a complete toolbox that specifically addresses the reinforcement learning needs of large language models: reasoning, alignment, and more!
### Highlights - Aha Moment with Limited Compute
Our Reinforce-Lite algorithm was able to achieve an "aha moment" on the Grade School Math dataset:
https://medium.com/@rjusnba/overnight-end-to-end-rl-training-a-3b-model-on-a-grade-school-math-dataset-leads-to-reasoning-df61410c04c6

To get started with **QFlow**, simply clone the repository and install the dependencies:
```bash
git clone https://github.com/Raj-08/Q-Flow.git
cd Q-Flow
pip install -r requirements.txt
```

## Features
- **Dedicated Toolbox**: A set of tools designed to handle reinforcement learning challenges specific to LLMs.
- **Creative Solutions**: Breakthrough techniques and methodologies that make training language models faster, smarter, and more efficient.
- **Scalable Performance**: Optimize LLMs with algorithms like **PPO**, **DPO**, and **GRPO**—designed for the unique needs of the LLM world.
- **Hyperparameter Search**: We use evolutionary algorithms to find the right configuration of hyperparameters and make training more effective (a minimal sketch of the idea follows this list).
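QFlow's evolutionary search itself is not documented in this README, so the snippet below is only a minimal sketch of the general idea under assumed names: hold a small population of hyperparameter configurations, score each one with some `score_fn` (for example, average reward after a short training run), keep the best half, and refill the population with mutated copies. The `mutate`/`evolve` helpers and the starting values are illustrative, not QFlow's actual interface.

```python
# Minimal evolutionary hyperparameter search sketch (illustrative only).
import random

def mutate(config: dict) -> dict:
    """Return a copy of `config` with small random perturbations."""
    child = dict(config)
    child["learning_rate"] *= random.choice([0.5, 1.0, 2.0])
    child["entropy_coef"] *= random.choice([0.5, 1.0, 2.0])
    return child

def evolve(score_fn, generations: int = 5, population_size: int = 8) -> dict:
    """Evolve hyperparameters; `score_fn(config)` should return a fitness value."""
    base = {"learning_rate": 1e-6, "entropy_coef": 0.001}
    population = [mutate(base) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score_fn, reverse=True)
        parents = ranked[: population_size // 2]            # selection
        children = [mutate(random.choice(parents))          # variation
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=score_fn)
```

A real run would plug in a `score_fn` that launches a short training job with the candidate configuration and reports a metric such as average reward.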
## Available RL Algorithms
QFlow supports several powerful RL algorithms that can be used to fine-tune your large language models. Choose the one that fits your training requirements:
- [x] **Reinforce-Lite** (Displays Emergence while being computationally affordable)
- [x] **Monte-Carlo** (simple Monte Carlo RL: expectation over a sample of returns)
- [x] **Group Relative Policy Optimization (GRPO)** (DeepSeek's RL algorithm; see the advantage sketch after this list)
- [ ] **Proximal Policy Optimization (PPO)**
- [ ] **Direct Preference Optimization (DPO)**
- [ ] **Actor Critic (A2C)**
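As context for the GRPO entry above: GRPO scores each completion against the other completions sampled for the same prompt instead of against a learned value baseline, normalizing rewards within the group. The snippet below restates that group-relative advantage generically; it is not taken from QFlow's source, and the example rewards are made up.

```python
# Group-relative advantages as used in GRPO-style training (generic sketch).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """`rewards` holds one scalar reward per completion sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 10 completions, matching the default --group_size below.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # positive above the group mean, negative below
```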
## Available Datasets
QFlow has out-of-the-box support for reasoning datasets. We will expand further into process-reward datasets.
- [x] **GSM8K** (Grade School Math; an answer-matching sketch follows this list)
- [ ] **Math500**
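GSM8K reference solutions end with a `#### <number>` line, so a common choice of reward for reasoning runs is an exact match on the final number. The helpers below are an illustrative sketch of that convention, not QFlow's reward code.

```python
# Sketch of a GSM8K-style exact-match reward (illustrative, not QFlow's implementation).
import re

def extract_final_number(text: str):
    """Return the last number mentioned in `text` (commas stripped), or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match_reward(model_output: str, reference_answer: str) -> float:
    """1.0 if the model's final number equals the reference's '#### <number>' answer."""
    return float(extract_final_number(model_output) == extract_final_number(reference_answer))

print(exact_match_reward("So the answer is 72.", "She sold 72 clips in total. #### 72"))  # 1.0
```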
QFlow provides a simple command-line interface to train your models using different RL algorithms. Here are some examples:
### Training with Different Algorithms
```bash
# Train using Reinforce-Lite
python main.py --algorithm reinforce-lite \
--model_name "microsoft/Phi-3.5-mini-instruct" \
--dataset_name "gsm8k" \
--batch_size 1 \
--num_steps 5000 \
--learning_rate 1e-6

# Train using GRPO
python main.py --algorithm grpo \
--model_name "microsoft/Phi-3.5-mini-instruct" \
--dataset_name "gsm8k" \
--batch_size 1 \
--group_size 10 \
--num_steps 5000 \
--learning_rate 1e-6

# Train using Monte-Carlo
python main.py --algorithm monte-carlo \
--model_name "microsoft/Phi-3.5-mini-instruct" \
--dataset_name "gsm8k" \
--batch_size 1 \
--num_steps 5000 \
--learning_rate 1e-6
```

### Common Command Line Arguments
- `--algorithm`: Choose the RL algorithm (`reinforce-lite`, `grpo`, `monte-carlo`)
- `--model_name`: Name or path of the pretrained model to fine-tune
- `--dataset_name`: Name of the dataset to use for training
- `--batch_size`: Number of samples per training batch
- `--num_steps`: Total number of training steps
- `--learning_rate`: Learning rate for optimization
- `--entropy_coef`: Entropy coefficient for exploration (default: 0.001)
- `--group_size`: Group size for the GRPO algorithm (default: 10)

### Monitoring Training with TensorBoard
QFlow automatically logs training metrics to TensorBoard. To view the training progress:

1. Start TensorBoard server:
```bash
tensorboard --logdir runs/
```

2. Open your browser and navigate to:
```
http://localhost:6006
```

The TensorBoard interface shows:
- Training loss curves
- Policy and entropy loss
- Average rewards and success rates
- Response lengths
- Sample model outputs
- Training hyperparameters

You can compare different runs by selecting them in the TensorBoard interface. Each run is tagged with the algorithm name and timestamp for easy identification.
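The exact tag names QFlow writes are not listed here, but the metrics above imply logging roughly along these lines with the standard PyTorch `SummaryWriter` into the same `runs/` directory; the tags and the `log_step` helper below are assumptions for illustration.

```python
# Illustrative TensorBoard logging for the metrics listed above (assumed tag names).
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/grpo_example")  # hypothetical run name

def log_step(step, loss, policy_loss, entropy_loss, avg_reward, success_rate, response_len, sample):
    writer.add_scalar("loss/total", loss, step)
    writer.add_scalar("loss/policy", policy_loss, step)
    writer.add_scalar("loss/entropy", entropy_loss, step)
    writer.add_scalar("reward/average", avg_reward, step)
    writer.add_scalar("reward/success_rate", success_rate, step)
    writer.add_scalar("response/length", response_len, step)
    writer.add_text("samples/output", sample, step)
```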
### Checkpoints and Model Saving
Models are automatically saved during training:
- Regular checkpoints every 100 steps
- Final model after training completion
- Training state and hyperparameters

Checkpoints are saved in:
```
checkpoints/{algorithm_name}_{timestamp}/
```

To load a saved model for inference or continued training, use the checkpoint path as the `model_name` argument.
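If the checkpoints are written in the standard Hugging Face format (a plausible assumption given that a checkpoint path can stand in for `model_name`, though this README does not state it), they can also be loaded directly for inference; the run directory below is hypothetical.

```python
# Loading a saved checkpoint for inference (assumes Hugging Face `save_pretrained` format).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_path = "checkpoints/grpo_20250314-110000"  # hypothetical {algorithm_name}_{timestamp} dir

tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path)

prompt = "Natalia sold clips to 48 of her friends in April. How many clips did she sell in total?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```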