https://github.com/Raj-08/Q-Flow
Complete Reinforcement Learning Toolkit for Large Language Models!
- Host: GitHub
- URL: https://github.com/Raj-08/Q-Flow
- Owner: Raj-08
- License: mit
- Created: 2025-02-22T21:44:54.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-03-14T11:00:32.000Z (about 2 months ago)
- Last Synced: 2025-03-14T12:21:29.635Z (about 2 months ago)
- Language: Python
- Homepage:
- Size: 1.05 MB
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - Raj-08/Reinforce-Lite - Reinforce-Lite is a reinforcement learning toolkit designed for large language models. It aims to simplify and accelerate the reinforcement learning process so that developers can train and optimize LLMs more easily. The toolkit provides a set of predefined modules and utilities, such as environment interaction, reward functions, and policy optimization algorithms. The core strengths of Reinforce-Lite are that it is lightweight and easy to use, so even newcomers to reinforcement learning can get started quickly. It supports multiple reinforcement learning algorithms and lets users customize environments and reward mechanisms. The project's goal is to build a flexible and efficient platform that helps researchers and developers explore the potential of LLMs in tasks such as text generation, dialogue systems, and intelligent agents. With Reinforce-Lite, users can apply reinforcement learning more effectively to improve the performance and adaptability of LLMs. The project encourages community contributions to advance reinforcement learning for LLMs. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# Q-FLOW
Welcome to **Q-Flow**! We focus on advancing open-source development of Reinforcement Learning (RL) for LLMs. At **QFlow**, we provide a complete toolbox that specifically addresses the reinforcement learning needs of large language models: reasoning, alignment, and more!
### Highlights - Aha Moment with Limited Compute
Our Reinforce-Lite algorithm was able to achieve an "aha moment" on the Grade School Math dataset:
https://medium.com/@rjusnba/overnight-end-to-end-rl-training-a-3b-model-on-a-grade-school-math-dataset-leads-to-reasoning-df61410c04c6

To get started with **QFlow**, simply clone the repository and install the dependencies:
```bash
git clone https://github.com/Raj-08/Q-Flow.git
cd Q-Flow
pip install -r requirements.txt
```

## Features
- **Dedicated Toolbox**: A set of tools designed to handle reinforcement learning challenges specific to LLMs.
- **Creative Solutions**: Breakthrough techniques and methodologies that make training language models faster, smarter, and more efficient.
- **Scalable Performance**: Optimize LLMs with algorithms like **PPO**, **DPO**, and **GRPO**—designed for the unique needs of the LLM world.
- **Hyperparameter Search**: We use evolutionary algorithms to find the right configuration of hyperparameters and make training more effective (a minimal sketch of the idea follows this list).
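QFlow's evolutionary search itself is not documented in this README, so the snippet below is only a minimal sketch of the general idea under assumed names: hold a small population of hyperparameter configurations, score each one with some `score_fn` (for example, average reward after a short training run), keep the best half, and refill the population with mutated copies. The `mutate`/`evolve` helpers and the starting values are illustrative, not QFlow's actual interface.

```python
# Minimal evolutionary hyperparameter search sketch (illustrative only).
import random

def mutate(config: dict) -> dict:
    """Return a copy of `config` with small random perturbations."""
    child = dict(config)
    child["learning_rate"] *= random.choice([0.5, 1.0, 2.0])
    child["entropy_coef"] *= random.choice([0.5, 1.0, 2.0])
    return child

def evolve(score_fn, generations: int = 5, population_size: int = 8) -> dict:
    """Evolve hyperparameters; `score_fn(config)` should return a fitness value."""
    base = {"learning_rate": 1e-6, "entropy_coef": 0.001}
    population = [mutate(base) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score_fn, reverse=True)
        parents = ranked[: population_size // 2]            # selection
        children = [mutate(random.choice(parents))          # variation
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=score_fn)
```

A real run would plug in a `score_fn` that launches a short training job with the candidate configuration and reports a metric such as average reward.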
## Available RL Algorithms
QFlow supports several powerful RL algorithms that can be used to fine-tune your large language models. Choose the one that fits your training requirements:
- [x] **Reinforce-Lite** (Displays Emergence while being computationally affordable)
- [x] **Monte-Carlo** (simple Monte Carlo RL: expectation over a sample of returns)
- [x] **Group Relative Policy Optimization (GRPO)** (DeepSeek's RL algorithm; see the advantage sketch after this list)
- [ ] **Proximal Policy Optimization (PPO)**
- [ ] **Direct Preference Optimization (DPO)**
- [ ] **Actor Critic (A2C)**
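As context for the GRPO entry above: GRPO scores each completion against the other completions sampled for the same prompt instead of against a learned value baseline, normalizing rewards within the group. The snippet below restates that group-relative advantage generically; it is not taken from QFlow's source, and the example rewards are made up.

```python
# Group-relative advantages as used in GRPO-style training (generic sketch).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """`rewards` holds one scalar reward per completion sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 10 completions, matching the default --group_size below.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # positive above the group mean, negative below
```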
## Available Datasets
QFlow has out-of-the-box support for reasoning datasets. We will expand further into process-reward datasets.
- [x] **GSM8K** (Grade School Math; an answer-matching sketch follows this list)
- [ ] **Math500**
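GSM8K reference solutions end with a `#### <number>` line, so a common choice of reward for reasoning runs is an exact match on the final number. The helpers below are an illustrative sketch of that convention, not QFlow's reward code.

```python
# Sketch of a GSM8K-style exact-match reward (illustrative, not QFlow's implementation).
import re

def extract_final_number(text: str):
    """Return the last number mentioned in `text` (commas stripped), or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match_reward(model_output: str, reference_answer: str) -> float:
    """1.0 if the model's final number equals the reference's '#### <number>' answer."""
    return float(extract_final_number(model_output) == extract_final_number(reference_answer))

print(exact_match_reward("So the answer is 72.", "She sold 72 clips in total. #### 72"))  # 1.0
```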
QFlow provides a simple command-line interface to train your models using different RL algorithms. Here are some examples:
### Training with Different Algorithms
```bash
# Train using Reinforce-Lite
python main.py --algorithm reinforce-lite \
--model_name "microsoft/Phi-3.5-mini-instruct" \
--dataset_name "gsm8k" \
--batch_size 1 \
--num_steps 5000 \
--learning_rate 1e-6

# Train using GRPO
python main.py --algorithm grpo \
--model_name "microsoft/Phi-3.5-mini-instruct" \
--dataset_name "gsm8k" \
--batch_size 1 \
--group_size 10 \
--num_steps 5000 \
--learning_rate 1e-6

# Train using Monte-Carlo
python main.py --algorithm monte-carlo \
--model_name "microsoft/Phi-3.5-mini-instruct" \
--dataset_name "gsm8k" \
--batch_size 1 \
--num_steps 5000 \
--learning_rate 1e-6
```

### Common Command Line Arguments
- `--algorithm`: Choose the RL algorithm (`reinforce-lite`, `grpo`, `monte-carlo`)
- `--model_name`: Name or path of the pretrained model to fine-tune
- `--dataset_name`: Name of the dataset to use for training
- `--batch_size`: Number of samples per training batch
- `--num_steps`: Total number of training steps
- `--learning_rate`: Learning rate for optimization
- `--entropy_coef`: Entropy coefficient for exploration (default: 0.001)
- `--group_size`: Group size for the GRPO algorithm (default: 10)

### Monitoring Training with TensorBoard
QFlow automatically logs training metrics to TensorBoard. To view the training progress:

1. Start TensorBoard server:
```bash
tensorboard --logdir runs/
```

2. Open your browser and navigate to:
```
http://localhost:6006
```

The TensorBoard interface shows:
- Training loss curves
- Policy and entropy loss
- Average rewards and success rates
- Response lengths
- Sample model outputs
- Training hyperparameters

You can compare different runs by selecting them in the TensorBoard interface. Each run is tagged with the algorithm name and timestamp for easy identification.
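The exact tag names QFlow writes are not listed here, but the metrics above imply logging roughly along these lines with the standard PyTorch `SummaryWriter` into the same `runs/` directory; the tags and the `log_step` helper below are assumptions for illustration.

```python
# Illustrative TensorBoard logging for the metrics listed above (assumed tag names).
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/grpo_example")  # hypothetical run name

def log_step(step, loss, policy_loss, entropy_loss, avg_reward, success_rate, response_len, sample):
    writer.add_scalar("loss/total", loss, step)
    writer.add_scalar("loss/policy", policy_loss, step)
    writer.add_scalar("loss/entropy", entropy_loss, step)
    writer.add_scalar("reward/average", avg_reward, step)
    writer.add_scalar("reward/success_rate", success_rate, step)
    writer.add_scalar("response/length", response_len, step)
    writer.add_text("samples/output", sample, step)
```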
### Checkpoints and Model Saving
Models are automatically saved during training:
- Regular checkpoints every 100 steps
- Final model after training completion
- Training state and hyperparameters

Checkpoints are saved in:
```
checkpoints/{algorithm_name}_{timestamp}/
```

To load a saved model for inference or continued training, use the checkpoint path as the `model_name` argument.
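If the checkpoints are written in the standard Hugging Face format (a plausible assumption given that a checkpoint path can stand in for `model_name`, though this README does not state it), they can also be loaded directly for inference; the run directory below is hypothetical.

```python
# Loading a saved checkpoint for inference (assumes Hugging Face `save_pretrained` format).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_path = "checkpoints/grpo_20250314-110000"  # hypothetical {algorithm_name}_{timestamp} dir

tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path)

prompt = "Natalia sold clips to 48 of her friends in April. How many clips did she sell in total?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```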