KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
- Host: GitHub
- URL: https://github.com/zjunlp/knowrl
- Owner: zjunlp
- License: mit
- Created: 2025-03-10T01:56:50.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-06-25T05:45:23.000Z (3 months ago)
- Last Synced: 2025-06-25T06:38:06.887Z (3 months ago)
- Topics: factuality, hallucination, knowledge-augmentation, knowrl, question-answering, reinforcement-learning
- Language: Python
- Homepage:
- Size: 4.44 MB
- Stars: 6
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
📄arXiv • 🤗HuggingFace • 📖Datasets
## Table of Contents
- [🌻Acknowledgement](#acknowledgement)
- [🌟Overview](#overview)
- [🔧Installation](#installation)
- [📚Knowledge Base Construction](#knowledge-base-construction)
- [📉Training](#training)
- [🧐Evaluation](#evaluation)
- [🚩Citation](#citation)

---
## 🌻Acknowledgement
Our Cold-Start SFT stage is implemented based on the excellent [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework. Our reinforcement learning training code is based on [TRL](https://github.com/huggingface/trl) and [Unsloth](https://github.com/unslothai/unsloth). We thank all authors for their great contributions!
## 🌟Overview
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucinations due to an inability to accurately recognize their knowledge boundaries. To address this, we propose **KnowRL**, a novel framework that integrates external knowledge into the reinforcement learning process. KnowRL guides models to perform fact-based slow thinking by incorporating a factuality reward directly into the RL training loop. KnowRL can be seen as leveraging a form of test-time scaling to reduce hallucinations. This helps models learn their knowledge boundaries and fosters a more reliable, fact-based reasoning process, effectively mitigating hallucinations while maintaining or enhancing strong reasoning capabilities.

## 🔧Installation
We recommend creating a new conda environment to run our project.

```bash
conda create -n knowrl python=3.12
conda activate knowrl

git clone https://github.com/zjunlp/KnowRL.git
cd KnowRL

pip install -r requirements.txt
```

## 📚Knowledge Base Construction
KnowRL's factuality reward relies on an external knowledge base. You can either download our pre-built version or build it from your own corpus.
#### Option 1: Download Pre-built Knowledge Base (Recommended)
This is the easiest way to get started. We have hosted the pre-built `knowledge_base.db` file on [Google Drive](https://drive.google.com/uc?id=1EVFkzuFvqE8AOEcdfSSm03vvvbVDa7bI).
```bash
# The target directory for the knowledge base
cd train/reward_function/FActScore/build_knowledge/

# Download the file from Google Drive and name it knowledge_base.db
gdown https://drive.google.com/uc?id=1EVFkzuFvqE8AOEcdfSSm03vvvbVDa7bI -O knowledge_base.db
```
This command will download the database directly into the required folder.
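You can optionally sanity-check the downloaded file with the standard `sqlite3` command-line tool. The commands below only assume the file is a valid SQLite database; the actual table layout is defined by the build script.

```bash
# Optional sanity check: confirm the file is a valid SQLite database and list its tables.
sqlite3 knowledge_base.db ".tables"
sqlite3 knowledge_base.db "PRAGMA integrity_check;"
```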
#### Option 2: Build from Scratch
If you wish to build the knowledge base from your own data source (e.g., a specific Wikipedia dump), follow these steps:
1. Place your source data file (e.g., `wikipedia.jsonl`) in a directory; an assumed record format is sketched after the commands below.
2. Edit the `build_db.sh` script to point `DATA_PATH` to your data file.
3. Run the script from the `build_knowledge` directory to create the SQLite database.

```bash
cd train/reward_function/FActScore/build_knowledge/
# Edit DATA_PATH in build_db.sh to point to your source file
bash build_db.sh
```

This will create the `knowledge_base.db` file required for the `fact_reward` function during training.
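The exact schema expected by `build_db.sh` depends on the script itself, but FActScore-style knowledge bases typically store one article per line with a title and its text. Purely as an illustration (the field names here are an assumption; check `build_db.sh` for the ones it actually reads), a tiny `wikipedia.jsonl` could be created like this:

```bash
# Assumed JSONL layout: one article per line with "title" and "text" fields.
# Verify the expected field names in build_db.sh before building a real corpus.
cat > wikipedia.jsonl << 'EOF'
{"title": "Alan Turing", "text": "Alan Mathison Turing was an English mathematician and computer scientist."}
{"title": "Ada Lovelace", "text": "Augusta Ada King, Countess of Lovelace, was an English mathematician and writer."}
EOF
```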
## 📉Training
Our training process consists of two main stages: a Cold-Start Supervised Fine-Tuning (SFT) phase to align the model with factual thinking patterns, followed by a Knowledgeable Reinforcement Learning (RL) phase to enhance factuality. Our datasets and models are available on [Hugging Face](https://huggingface.co/collections/zjunlp/knowrl-68485613feca77696d252a1d).

### Stage 1: Cold-Start SFT
This initial stage fine-tunes the base model on a high-quality dataset of fact-based question-answering pairs. This pre-aligns the model, making the subsequent RL training more stable and effective. We use the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework for this stage.

**Example LLaMA-Factory SFT Command:**
Below is an example YAML configuration for running the SFT. You can adapt the parameters for your specific setup.

```yaml
# llama_factory_sft.yaml
### model
model_name_or_path: /path/to/your/base_model
deepspeed: /path/to/your/ds_z3_config.json

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj
lora_rank: 256
lora_alpha: 512

### dataset
dataset: your_coldstart_dataset # e.g., knowrl_coldstart
template: qwen
cutoff_len: 3072
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /path/to/your/output_adapter # e.g., /adapter-saves/MyModel-SFT
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_strategy: 'no'

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 4.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000
```
To run the SFT, you would use a command like:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train llama_factory_sft.yaml
```
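Because the configuration above fine-tunes with LoRA, the adapter usually needs to be merged into the base weights so that the RL stage can load the SFT-tuned model directly. One way to do this, assuming LLaMA-Factory's `export` command and the field names from its merge examples (both may differ slightly across versions, so treat this as a sketch), is:

```yaml
# merge_lora.yaml -- hedged sketch for merging the SFT LoRA adapter into the base model.
model_name_or_path: /path/to/your/base_model
adapter_name_or_path: /path/to/your/output_adapter
template: qwen
finetuning_type: lora
export_dir: /path/to/your/sft_tuned_model
```

Then run `llamafactory-cli export merge_lora.yaml`, and point the RL configuration's `model_name_or_path` at the exported directory.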
### Stage 2: Knowledgeable Reinforcement Learning (RL)
This stage uses the SFT-tuned model and further trains it with our knowledge-enhanced reward signal. The process is orchestrated by `train/train.sh`, which launches `main.py` using the configuration defined in `script/grpo.yaml`. We train two 7B models, `DeepSeek-R1-Distill-Qwen-7B` and `Skywork-OR1-7B-Preview`, on a single A800 GPU.

**a. Environment Variables in `train/train.sh`:**
This script sets up all necessary environment variables and executes the training.
- Set your API keys for services like OpenAI (`OPENAI_API_KEY_FACTSCORE`, `OPENAI_API_KEY_JUDGE`).
- Set your `WANDB_API_KEY` for experiment tracking.
- Ensure `FACTSCORE_DB_PATH` points to the `knowledge_base.db` file you created.

**b. Training Parameters in `script/grpo.yaml`**
This file contains all hyperparameters for the RL stage; a hedged sketch of the key fields follows the list below.
- `model_name_or_path`: Path to the base model for RL training (this should be your SFT-tuned model).
- `dataset_id_or_path`: Path to your RL training data.
- `output_dir`: Directory to save the final trained model.
- `wandb_project`, `wandb_entity`, `run_name`: WandB configuration.
- `per_device_train_batch_size`, `learning_rate`, `max_steps`: Standard training hyperparameters.
- `beta`, `num_generations`: GRPO-specific algorithm parameters.
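As a reference, the snippet below sketches what these fields could look like in `script/grpo.yaml`. The field names are the ones listed above; every value is a placeholder, and the file shipped with the repository may contain additional keys.

```yaml
# Hedged sketch of script/grpo.yaml -- placeholder values, not the shipped configuration.
model_name_or_path: /path/to/your/sft_tuned_model   # base model for RL (the SFT-tuned model)
dataset_id_or_path: /path/to/your/rl_training_data  # RL training data
output_dir: /path/to/save/final_model

# WandB configuration
wandb_project: knowrl
wandb_entity: your_wandb_entity
run_name: knowrl-grpo-run1

# Standard training hyperparameters
per_device_train_batch_size: 1
learning_rate: 1.0e-6
max_steps: 500

# GRPO-specific algorithm parameters
beta: 0.04
num_generations: 8
```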
**c. Launch RL Training**

Once configured, launch the training from the `train` directory:

```bash
cd KnowRL/train/
bash train.sh
```
The script will set `CUDA_VISIBLE_DEVICES`, print the configuration, and start the training process. The full `train.sh` is shown below:
```bash
#!/bin/bash
# ============================================================================
# API Configuration - Replace with your actual credentials
# ============================================================================
export OPENAI_API_KEY_FACTSCORE="your_openai_api_key_here"
export OPENAI_BASE_URL_FACTSCORE="https://api.openai.com/v1"

export OPENAI_API_KEY_JUDGE="your_openai_api_key_here"
export OPENAI_API_BASE_JUDGE="https://api.openai.com/v1"

export WANDB_API_KEY="your_wandb_api_key_here"
export WANDB_MODE="offline" ## Optional: set to "online" to sync
# ============================================================================
# Configuration
# ============================================================================
export FACTSCORE_DB_PATH="./FActScore/build_knowledge/knowledge_base.db"
export USE_API_MANAGER_FOR_LLM_EVAL=True
export USE_API_MANAGER_FOR_FACTSCORE=True

# Set GPU device
export CUDA_VISIBLE_DEVICES=0

# Configuration file
CONFIG_FILE="./script/grpo.yaml"

# ============================================================================
# Run Training
# ============================================================================
echo "Starting GRPO training..."
echo "Config: $CONFIG_FILE"
echo "GPU: $CUDA_VISIBLE_DEVICES"python main.py --config "$CONFIG_FILE"
if [ $? -eq 0 ]; then
    echo "✅ Training completed successfully!"
else
    echo "❌ Training failed!"
    exit 1
fi
```

## 🧐Evaluation
All our models are evaluated on the excellent [OpenCompass](https://github.com/open-compass/opencompass) platform. We thank its authors for their great contribution to the community!

Please refer to our paper for the detailed results. For the specific benchmarks, our settings are as follows:
- **TruthfulQA**: we use the BLEU score to measure correctness in a 0-shot setting.
- **SimpleQA** and **ChineseSimpleQA**: we use `gpt-4o-mini` to judge the correctness of the answers. For the English SimpleQA, we append the prompt "Let's think step by step" to elicit a reasoning process, while the Chinese version is kept 0-shot.
- **GPQA**: we focus exclusively on the diamond subset and determine correctness by extracting the answer from a pre-defined output format, also using a 0-shot prompt.
- **AIME 2025**: also judged by `gpt-4o-mini` in a 0-shot setting.
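For illustration only, an LLM-as-judge call to `gpt-4o-mini` through the OpenAI chat completions API could look like the snippet below. The judge prompt here is a placeholder invented for the example, not the exact prompt used in our evaluation.

```bash
# Hedged illustration of judging one answer with gpt-4o-mini.
# Replace <question>, <gold>, and <prediction> with real strings; the prompt wording is illustrative only.
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY_JUDGE" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-4o-mini",
        "messages": [{
          "role": "user",
          "content": "Question: <question>\nGold answer: <gold>\nModel answer: <prediction>\nReply with CORRECT or INCORRECT."
        }]
      }'
```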
## 🚩Citation
If you find this work useful in your research, please consider citing our paper:
```bibtex
@article{ren2025knowrl,
  title={{KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality}},
  author={Ren, Baochang and Qiao, Shuofei and Yu, Wenhao and Chen, Huajun and Zhang, Ningyu},
  journal={arXiv preprint arXiv:2506.19807},
  year={2025}
}
```