https://github.com/open-sciencelab/GraphGen
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
- Host: GitHub
- URL: https://github.com/open-sciencelab/GraphGen
- Owner: open-sciencelab
- License: apache-2.0
- Created: 2025-01-08T06:49:17.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-11-26T11:48:33.000Z (about 2 months ago)
- Last Synced: 2025-11-27T23:50:31.301Z (about 2 months ago)
- Topics: ai4science, data-generation, data-synthesis, graphgen, knowledge-graph, llama-factory, llm, llm-training, pretrain, pretraining, qa, question-answering, qwen, sft, sft-data, xtuner
- Language: Python
- Homepage: https://chenzihong.gitbook.io/graphgen-cookbook/
- Size: 15.8 MB
- Stars: 573
- Watchers: 7
- Forks: 45
- Open Issues: 7
Metadata Files:
- Readme: README.md
- Contributing: .github/contributing.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
Awesome Lists containing this project
- awesome-ai-for-science - GraphGen - Knowledge graph-guided synthetic data generation for LLM fine-tuning, achieving strong performance on scientific QA (GPQA-Diamond) and math reasoning (AIME) (πΈοΈ Knowledge Extraction & Scholarly KGs / Knowledge Graph Construction)
README
[Repository](https://github.com/open-sciencelab/GraphGen) | [Issues](https://github.com/open-sciencelab/GraphGen/issues) | [Documentation](https://chenzihong.gitbook.io/graphgen-cookbook/) | [PyPI](https://pypi.org/project/graphg/) | [WeChat Group](https://cdn.vansin.top/internlm/dou.jpg) | [arXiv](https://arxiv.org/abs/2505.20416) | [HF Paper](https://huggingface.co/papers/2505.20416) | [HF Demo](https://huggingface.co/spaces/chenzihong/GraphGen) | [ModelScope Demo](https://modelscope.cn/studios/chenzihong/GraphGen)
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
[English](README.md) | [δΈζ](README_zh.md)
π Table of Contents
- π [What is GraphGen?](#-what-is-graphgen)
- π [Latest Updates](#-latest-updates)
- βοΈ [Support List](#-support-list)
- π [Quick Start](#-quick-start)
- ποΈ [System Architecture](#-system-architecture)
- π [Acknowledgements](#-acknowledgements)
- π [Citation](#-citation)
- π [License](#-license)
- π
 [Star History](#-star-history)
[//]: # (- π [Key Features](#-key-features))
[//]: # (- π° [Cost Analysis](#-cost-analysis))
[//]: # (- βοΈ [Configurations](#-configurations))
## π What is GraphGen?
GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the [**paper**](https://arxiv.org/abs/2505.20416) and [best practice](https://github.com/open-sciencelab/GraphGen/issues/17).
Below are post-training results where **over 50% of the SFT data** comes from GraphGen and our data-cleaning pipeline.
| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|:---------:|:---------------------------------------------------------:|:--------:|:------------------------------:|
| Plant | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
| Common | CMMLU | 73.6 | **75.8** |
| Knowledge | GPQA-Diamond | **40.0** | 33.3 |
| Math | AIME24 | **20.6** | 16.7 |
| | AIME25 | **22.7** | 7.2 |
It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.
Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
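For intuition, the expected calibration error (ECE) mentioned above can be sketched in a few lines. This is a generic, minimal version for illustration only, not GraphGen's exact implementation: it bins predictions by confidence and takes the weighted average of the per-bin gap between mean confidence and accuracy.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE sketch: bucket predictions by confidence, then average
    |mean confidence - accuracy| per bucket, weighted by bucket size."""
    buckets = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        buckets[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in buckets:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A high ECE on a region of the knowledge graph signals that the trainee model is confidently wrong there, which is exactly the long-tail knowledge worth turning into QA pairs.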
After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [xtuner](https://github.com/InternLM/xtuner) to finetune your LLMs.
## π Latest Updates
- **2025.10.30**: We support several new LLM clients and inference backends including [Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
- **2025.10.23**: We now support VQA (Visual Question Answering) data generation. Run: `bash scripts/generate/generate_vqa.sh`.
- **2025.10.21**: We now support PDF as an input format for data generation via [MinerU](https://github.com/opendatalab/MinerU).
History
- **2025.09.29**: We auto-update gradio demo on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
- **2025.08.14**: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
- **2025.07.31**: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
- **2025.04.21**: We have released the initial version of GraphGen.
## βοΈ Support List
We support various LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types.
Users can flexibly combine these options to suit their synthetic-data needs.
| Inference Server | API Server | Inference Client | Input File Format | Data Modality | Data Format | Data Type |
|------------------|------------|------------------|-------------------|---------------|-------------|-----------|
| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | CSV<br>JSON<br>JSONL<br>PDF<br>TXT | TEXT<br>IMAGE | Alpaca<br>ChatML<br>ShareGPT | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
[hf]: https://huggingface.co/docs/transformers/index
[sg]: https://docs.sglang.ai
[sif]: https://siliconflow.cn
[oai]: https://openai.com
[az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/
[ol]: https://ollama.com
[hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co
[sg-icon]: https://www.google.com/s2/favicons?domain=https://docs.sglang.ai
[sif-icon]: https://www.google.com/s2/favicons?domain=siliconflow.com
[oai-icon]: https://www.google.com/s2/favicons?domain=https://openai.com
[az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com
[ol-icon]: https://www.google.com/s2/favicons?domain=https://ollama.com
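The output data formats above differ mainly in record layout. As a rough illustration (field names follow the common Alpaca and ShareGPT conventions, not necessarily GraphGen's exact output schema), converting between them is a small transform:

```python
def alpaca_to_sharegpt(rec):
    """Convert one Alpaca-style record ({"instruction", "input", "output"})
    to the ShareGPT conversations layout. Field names are the conventional
    ones; verify against your actual generated files."""
    prompt = rec["instruction"]
    if rec.get("input"):
        prompt += "\n" + rec["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": rec["output"]},
        ]
    }
```

Fine-tuning frameworks such as LLaMA-Factory accept both layouts, so pick whichever matches your training config.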
## π Quick Start
Try the GraphGen demo on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) or [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
For questions, check the [FAQ](https://github.com/open-sciencelab/GraphGen/issues/10), open a new [issue](https://github.com/open-sciencelab/GraphGen/issues), or join our [WeChat group](https://cdn.vansin.top/internlm/dou.jpg) and ask.
### Preparation
1. Install [uv](https://docs.astral.sh/uv/reference/installer/)
```bash
# If you hit network issues, try pipx or pip to install uv instead; see the uv docs for details
curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. Clone the repository
```bash
git clone --depth=1 https://github.com/open-sciencelab/GraphGen
cd GraphGen
```
3. Create a new uv environment
```bash
uv venv --python 3.10
```
4. Configure the dependencies
```bash
uv pip install -r requirements.txt
```
### Run Gradio Demo
```bash
python -m webui.app
```
For hot-reload during development, run
```bash
PYTHONPATH=. gradio webui/app.py
```

### Run from PyPI
1. Install GraphGen
```bash
uv pip install graphg
```
2. Run in CLI
```bash
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
```
### Run from Source
1. Configure the environment
- Create an `.env` file in the root directory
```bash
cp .env.example .env
```
- Set the following environment variables:
```bash
# Synthesizer is the model used to construct KG and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model used to train with the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
```
2. (Optional) Customize generation parameters in the `graphgen/configs/` folder.
Edit the corresponding YAML file, e.g.:
```yaml
# configs/cot_config.yaml
input_file: resources/input_examples/jsonl_demo.jsonl
output_data_type: cot
tokenizer: cl100k_base
# additional settings...
```
3. Generate data
Pick the desired format and run the matching script:
| Format | Script to run | Notes |
|--------------|------------------------------------------------|-------------------------------------------------------------------|
| `cot` | `bash scripts/generate/generate_cot.sh` | Chain-of-Thought Q\&A pairs |
| `atomic` | `bash scripts/generate/generate_atomic.sh` | Atomic Q\&A pairs covering basic knowledge |
| `aggregated` | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q\&A pairs incorporating complex, integrated knowledge |
| `multi-hop` | `bash scripts/generate/generate_multihop.sh` | Multi-hop reasoning Q\&A pairs |
4. Get the generated data
```bash
ls cache/data/graphgen
```
### Run with Docker
1. Build the Docker image
```bash
docker build -t graphgen .
```
2. Run the Docker container
```bash
docker run -p 7860:7860 graphgen
```
## ποΈ System Architecture
See the [analysis](https://deepwiki.com/open-sciencelab/GraphGen) by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.
### Workflow

## π Acknowledgements
- [SiliconFlow](https://siliconflow.cn): abundant LLM APIs; some models are free
- [LightRAG](https://github.com/HKUDS/LightRAG): a simple and efficient graph retrieval solution
- [ROGRAG](https://github.com/tpoisonooo/ROGRAG): a robustly optimized GraphRAG framework
- [DB-GPT](https://github.com/eosphoros-ai/DB-GPT): an AI-native data app development framework
## π Citation
If you find this repository useful, please consider citing our work:
```bibtex
@misc{chen2025graphgenenhancingsupervisedfinetuning,
title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
year={2025},
eprint={2505.20416},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.20416},
}
```
## π License
This project is licensed under the [Apache License 2.0](LICENSE).
## π
 Star History
[Star History Chart](https://www.star-history.com/#open-sciencelab/GraphGen&Date)