{"id":29166071,"url":"https://github.com/zjunlp/knowrl","last_synced_at":"2025-07-01T08:09:28.081Z","repository":{"id":301099686,"uuid":"945705699","full_name":"zjunlp/KnowRL","owner":"zjunlp","description":"KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality","archived":false,"fork":false,"pushed_at":"2025-06-25T05:45:23.000Z","size":4655,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-06-25T06:38:06.887Z","etag":null,"topics":["factuality","hallucination","knowledge-augmentation","knowrl","question-answering","reinforcement-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zjunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-10T01:56:50.000Z","updated_at":"2025-06-25T06:31:46.000Z","dependencies_parsed_at":"2025-06-25T06:48:39.121Z","dependency_job_id":null,"html_url":"https://github.com/zjunlp/KnowRL","commit_stats":null,"previous_names":["zjunlp/knowrl"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zjunlp/KnowRL","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FKnowRL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FKnowRL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FKnowRL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FKnowRL/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zjunlp","download_url":"https://codeload.github.com/zjunlp/KnowRL/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FKnowRL/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262925005,"owners_count":23385463,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["factuality","hallucination","knowledge-augmentation","knowrl","question-answering","reinforcement-learning"],"created_at":"2025-07-01T08:09:27.388Z","updated_at":"2025-07-01T08:09:28.051Z","avatar_url":"https://github.com/zjunlp.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1 align=\"center\"\u003e KnowRL \u003c/h1\u003e\n\u003ch3 align=\"center\"\u003e Exploring Knowledgeable Reinforcement Learning for Factuality \u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2506.19807\"\u003e📄arXiv\u003c/a\u003e •\n  \u003ca href=\"https://huggingface.co/collections/zjunlp/knowrl-68485613feca77696d252a1d\"\u003e🤗HuggingFace\u003c/a\u003e •\n  \u003ca href=\"https://huggingface.co/datasets/zjunlp/KnowRL-Train-Data\"\u003e📖Datasets\u003c/a\u003e\n\u003c/p\u003e\n\n[![Awesome](https://awesome.re/badge.svg)](https://github.com/zjunlp/KnowRL)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n![](https://img.shields.io/github/last-commit/zjunlp/KnowRL?color=green)\n\n\u003c/div\u003e\n\n## Table of Contents\n- [🌻Acknowledgement](#acknowledgement)\n- 
[🌟Overview](#overview)\n- [🔧Installation](#installation)\n- [📚Knowledge Base Construction](#knowledge-base-construction)\n- [📉Training](#training)\n- [🧐Evaluation](#evaluation)\n- [🚩Citation](#citation)\n\n---\n\n## 🌻Acknowledgement\nOur Cold-Start SFT stage is built on the excellent [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework. Our reinforcement learning training code is based on [TRL](https://github.com/huggingface/trl) and [Unsloth](https://github.com/unslothai/unsloth). We thank all authors for their great contributions!\n![KnowRL method overview](./assets/method.jpg)\n\n## 🌟Overview\nLarge Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucinations due to an inability to accurately recognize their knowledge boundaries. To address this, we propose **KnowRL**, a novel framework that integrates external knowledge into the reinforcement learning process. KnowRL guides models to perform fact-based slow thinking by incorporating a factuality reward directly into the RL training loop. KnowRL can be seen as leveraging a form of test-time scaling law to reduce hallucinations. This helps models learn their knowledge boundaries and fosters a more reliable, fact-based reasoning process, effectively mitigating hallucinations while maintaining or enhancing strong reasoning capabilities.\n\n## 🔧Installation\nWe recommend creating a new conda environment to run our project.\n\n```bash\nconda create -n knowrl python=3.12\nconda activate knowrl\n\ngit clone https://github.com/zjunlp/KnowRL.git\ncd KnowRL\n\npip install -r requirements.txt\n```\n\n## 📚Knowledge Base Construction\n\nKnowRL's factuality reward relies on an external knowledge base. You can either download our pre-built version or build it from your own corpus.\n\n#### Option 1: Download Pre-built Knowledge Base (Recommended)\n\nThis is the easiest way to get started. 
We have hosted the pre-built `knowledge_base.db` file on [Google Drive](https://drive.google.com/uc?id=1EVFkzuFvqE8AOEcdfSSm03vvvbVDa7bI).\n\n```bash\n# The target directory for the knowledge base\ncd train/reward_function/FActScore/build_knowledge/\n\n# Download the file from Google Drive and name it knowledge_base.db\ngdown https://drive.google.com/uc?id=1EVFkzuFvqE8AOEcdfSSm03vvvbVDa7bI\n```\nThis command will download the database directly into the required folder.\n\n#### Option 2: Build from Scratch\n\nIf you wish to build the knowledge base from your own data source (e.g., a specific Wikipedia dump), follow these steps:\n\n1.  Place your source data file (e.g., `wikipedia.jsonl`) in a directory.\n2.  Edit the `build_db.sh` script to point `DATA_PATH` to your data file.\n3.  Run the script from the `build_knowledge` directory to create the SQLite database.\n\n    ```bash\n    cd train/reward_function/FActScore/build_knowledge/\n    \n    # Edit DATA_PATH in build_db.sh to point to your source file\n    bash build_db.sh\n    ```\n\nThis will create the `knowledge_base.db` file required for the `fact_reward` function during training.\n\n\n## 📉Training\nOur training process consists of two main stages: a Cold-Start Supervised Fine-Tuning (SFT) phase to align the model with factual thinking patterns, followed by the Knowledgeable Reinforcement Learning (RL) phase to enhance factuality. Our datasets and models are also available on [Hugging Face](https://huggingface.co/collections/zjunlp/knowrl-68485613feca77696d252a1d).\n\n### Stage 1: Cold-Start SFT\nThis initial stage fine-tunes the base model on a high-quality dataset of fact-based question-answering pairs. This pre-aligns the model, making the subsequent RL training more stable and effective. We use the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework for this stage.\n\n**Example LLaMA-Factory SFT Command:**\nBelow is an example YAML configuration for running the SFT. 
You can adapt the parameters for your specific setup.\n\n```yaml\n# llama_factory_sft.yaml\n### model\nmodel_name_or_path: /path/to/your/base_model \ndeepspeed: /path/to/your/ds_z3_config.json\n\n### method\nstage: sft\ndo_train: true\nfinetuning_type: lora\nlora_target: q_proj,v_proj\nlora_rank: 256\nlora_alpha: 512\n\n### dataset\ndataset: your_coldstart_dataset # e.g., knowrl_coldstart\ntemplate: qwen\ncutoff_len: 3072\noverwrite_cache: true\npreprocessing_num_workers: 16\n\n### output\noutput_dir: /path/to/your/output_adapter # e.g., /adapter-saves/MyModel-SFT\nlogging_steps: 10\nsave_steps: 500\nplot_loss: true\noverwrite_output_dir: true\nsave_strategy: 'no'\n\n### train\nper_device_train_batch_size: 2\ngradient_accumulation_steps: 1\nlearning_rate: 1.0e-4\nnum_train_epochs: 4.0\nlr_scheduler_type: cosine\nwarmup_ratio: 0.1\nfp16: true\nddp_timeout: 180000000\n```\nTo run the SFT, you would use a command like:\n```bash\nCUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train llama_factory_sft.yaml\n```\n\n### Stage 2: Knowledgeable Reinforcement Learning (RL)\nThis stage uses the SFT-tuned model and further trains it with our knowledge-enhanced reward signal. The process is orchestrated by `train/train.sh`, which launches `main.py` using the configuration defined in `script/grpo.yaml`. We train two 7B models, `DeepSeek-R1-Distill-Qwen-7B` and `Skywork-OR1-7B-Preview`, each on a single A800 GPU.\n\n**a. Environment Variables in `train/train.sh`:**\nThis script sets up all necessary environment variables and executes the training.\n   - Set your API keys for services like OpenAI (`OPENAI_API_KEY_FACTSCORE`, `OPENAI_API_KEY_JUDGE`).\n   - Set your `WANDB_API_KEY` for experiment tracking.\n   - Ensure `FACTSCORE_DB_PATH` points to the `knowledge_base.db` file you created.\n\n**b. 
Training Parameters in `script/grpo.yaml`**\nThis file contains all hyperparameters for the RL stage.\n   - `model_name_or_path`: Path to the base model for RL training (this should be your SFT-tuned model).\n   - `dataset_id_or_path`: Path to your RL training data.\n   - `output_dir`: Directory to save the final trained model.\n   - `wandb_project`, `wandb_entity`, `run_name`: WandB configuration.\n   - `per_device_train_batch_size`, `learning_rate`, `max_steps`: Standard training hyperparameters.\n   - `beta`, `num_generations`: GRPO-specific algorithm parameters.\n\n**c. Launch RL Training**\nOnce configured, launch the training from the `train` directory:\n\n```bash\ncd KnowRL/train/\nbash train.sh\n```\nThe script will set `CUDA_VISIBLE_DEVICES`, print the configuration, and start the training process.\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to view train.sh\u003c/summary\u003e\n\n```bash\n#!/bin/bash\n# ============================================================================\n# API Configuration - Replace with your actual credentials\n# ============================================================================\nexport OPENAI_API_KEY_FACTSCORE=\"your_openai_api_key_here\"\nexport OPENAI_BASE_URL_FACTSCORE=\"https://api.openai.com/v1\"\n\nexport OPENAI_API_KEY_JUDGE=\"your_openai_api_key_here\"\nexport OPENAI_API_BASE_JUDGE=\"https://api.openai.com/v1\"\n\nexport WANDB_API_KEY=\"your_wandb_api_key_here\"\nexport WANDB_MODE=\"offline\" ## Optional: set to \"online\" to sync\n# ============================================================================\n# Configuration\n# ============================================================================\nexport FACTSCORE_DB_PATH=\"./FActScore/build_knowledge/knowledge_base.db\"\nexport USE_API_MANAGER_FOR_LLM_EVAL=True\nexport USE_API_MANAGER_FOR_FACTSCORE=True\n\n# Set GPU device\nexport CUDA_VISIBLE_DEVICES=0\n\n# Configuration 
file\nCONFIG_FILE=\"./script/grpo.yaml\"\n\n# ============================================================================\n# Run Training\n# ============================================================================\necho \"Starting GRPO training...\"\necho \"Config: $CONFIG_FILE\"\necho \"GPU: $CUDA_VISIBLE_DEVICES\"\n\npython main.py --config \"$CONFIG_FILE\"\n\nif [ $? -eq 0 ]; then\n    echo \"✅ Training completed successfully!\"\nelse\n    echo \"❌ Training failed!\"\n    exit 1\nfi\n```\n\u003c/details\u003e\n\n\n## 🧐Evaluation\nAll our models are evaluated on the excellent [OpenCompass](https://github.com/open-compass/opencompass) platform. We thank its authors for their great contribution to the community!\n\nPlease refer to our paper for the detailed results. For the specific benchmarks, our settings are as follows. On **TruthfulQA**, we use the BLEU score to measure correctness in a 0-shot setting. For both **SimpleQA** and **ChineseSimpleQA**, we use `gpt-4o-mini` to judge the correctness of the answers; specifically for the English SimpleQA, we append the prompt \"Let's think step by step\" to elicit a reasoning process, while the Chinese version is kept as 0-shot. When evaluating on **GPQA**, we focus exclusively on the diamond subset and determine correctness by extracting the answer from a pre-defined output format, also using a 0-shot prompt. 
Lastly, the **AIME 2025** benchmark is also judged by `gpt-4o-mini` in a 0-shot setting.\n\n\n## 🚩Citation\nIf you find this work useful in your research, please consider citing our paper:\n```bibtex\n@article{ren2025knowrl,\n  title={{KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality}}, \n  author={Ren, Baochang and Qiao, Shuofei and Yu, Wenhao and Chen, Huajun and Zhang, Ningyu},\n  journal={arXiv preprint arXiv:2506.19807},\n  year={2025}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Fknowrl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzjunlp%2Fknowrl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Fknowrl/lists"}