{"id":33462601,"url":"https://github.com/open-sciencelab/GraphGen","last_synced_at":"2025-11-29T22:01:52.744Z","repository":{"id":288232894,"uuid":"913684633","full_name":"open-sciencelab/GraphGen","owner":"open-sciencelab","description":"GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation","archived":false,"fork":false,"pushed_at":"2025-11-26T11:48:33.000Z","size":16535,"stargazers_count":573,"open_issues_count":7,"forks_count":45,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-11-27T23:50:31.301Z","etag":null,"topics":["ai4science","data-generation","data-synthesis","graphgen","knowledge-graph","llama-factory","llm","llm-training","pretrain","pretraining","qa","question-answering","qwen","sft","sft-data","xtuner"],"latest_commit_sha":null,"homepage":"https://chenzihong.gitbook.io/graphgen-cookbook/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/open-sciencelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-08T06:49:17.000Z","updated_at":"2025-11-27T10:03:21.000Z","dependencies_parsed_at":"2025-05-28T04:30:21.845Z","dependency_job_id":"9a1c2498-588d-4f22-b297-6ba3f66ded35","html_url":"https://github.com/open-sciencelab/GraphGen","commit_stats":null,"previous_names":["open-sciencelab/graphgen"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/open-science
lab/GraphGen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-sciencelab%2FGraphGen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-sciencelab%2FGraphGen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-sciencelab%2FGraphGen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-sciencelab%2FGraphGen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/open-sciencelab","download_url":"https://codeload.github.com/open-sciencelab/GraphGen/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-sciencelab%2FGraphGen/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27366311,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-29T02:00:06.589Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai4science","data-generation","data-synthesis","graphgen","knowledge-graph","llama-factory","llm","llm-training","pretrain","pretraining","qa","question-answering","qwen","sft","sft-data","xtuner"],"created_at":"2025-11-25T02:00:18.668Z","updated_at":"2025-11-29T22:01:52.736Z","avatar_url":"https://github.com/open-sciencelab.png","language":"Python","funding_links":[],"categories":["🕸️ Knowledge Extraction \u0026 
Scholarly KGs"],"sub_categories":["Knowledge Graph Construction"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"resources/images/logo.png\"/\u003e\n\u003c/p\u003e\n\n\u003c!-- icon --\u003e\n\n[![stars](https://img.shields.io/github/stars/open-sciencelab/GraphGen.svg)](https://github.com/open-sciencelab/GraphGen)\n[![forks](https://img.shields.io/github/forks/open-sciencelab/GraphGen.svg)](https://github.com/open-sciencelab/GraphGen)\n[![open issues](https://img.shields.io/github/issues-raw/open-sciencelab/GraphGen)](https://github.com/open-sciencelab/GraphGen/issues)\n[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-sciencelab/GraphGen)](https://github.com/open-sciencelab/GraphGen/issues)\n[![documentation](https://img.shields.io/badge/docs-latest-blue)](https://chenzihong.gitbook.io/graphgen-cookbook/)\n[![pypi](https://img.shields.io/pypi/v/graphg.svg?style=flat\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/graphg/)\n[![wechat](https://img.shields.io/badge/wechat-brightgreen?logo=wechat\u0026logoColor=white)](https://cdn.vansin.top/internlm/dou.jpg)\n[![arXiv](https://img.shields.io/badge/Paper-arXiv-white)](https://arxiv.org/abs/2505.20416)\n[![Hugging Face](https://img.shields.io/badge/Paper-on%20HF-white?logo=huggingface\u0026logoColor=yellow)](https://huggingface.co/papers/2505.20416)\n\n[![Hugging Face](https://img.shields.io/badge/Demo-on%20HF-blue?logo=huggingface\u0026logoColor=yellow)](https://huggingface.co/spaces/chenzihong/GraphGen)\n[![Model Scope](https://img.shields.io/badge/%F0%9F%A4%96%20Demo-on%20MS-green)](https://modelscope.cn/studios/chenzihong/GraphGen)\n\n\nGraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation\n\n[English](README.md) | [中文](README_zh.md)\n\n\u003cdetails close\u003e\n\u003csummary\u003e\u003cb\u003e📚 Table of Contents\u003c/b\u003e\u003c/summary\u003e\n\n- 📝 [What is GraphGen?](#-what-is-graphgen)\n- 📌 [Latest 
Updates](#-latest-updates)\n- ⚙️ [Support List](#-support-list)\n- 🚀 [Quick Start](#-quick-start)\n- 🏗️ [System Architecture](#-system-architecture)\n- 🍀 [Acknowledgements](#-acknowledgements)\n- 📚 [Citation](#-citation)\n- 📜 [License](#-license)\n- 📅 [Star History](#-star-history)\n\n[//]: # (- 🌟 [Key Features]\u0026#40;#-key-features\u0026#41;)\n[//]: # (- 💰 [Cost Analysis]\u0026#40;#-cost-analysis\u0026#41;)\n[//]: # (- ⚙️ [Configurations]\u0026#40;#-configurations\u0026#41;)\n\n\u003c/details\u003e\n\n## 📝 What is GraphGen?\n\nGraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the [**paper**](https://arxiv.org/abs/2505.20416) and [best practice](https://github.com/open-sciencelab/GraphGen/issues/17).\n\nHere are post-training results in which **over 50% of the SFT data** comes from GraphGen and our data-cleaning pipeline.\n\n|  Domain   |                          Dataset                          |   Ours   | Qwen2.5-7B-Instruct (baseline) |\n|:---------:|:---------------------------------------------------------:|:--------:|:------------------------------:|\n|   Plant   | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** |              51.5              |\n|  Common   |                           CMMLU                           |   73.6   |            **75.8**            |\n| Knowledge |                       GPQA-Diamond                        | **40.0** |              33.3              |\n|   Math    |                          AIME24                           | **20.6** |              16.7              |\n|           |                          AIME25                           | **22.7** |              7.2               |\n\nGraphGen begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.\nFurthermore, GraphGen incorporates multi-hop 
neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.\n\nAfter data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [xtuner](https://github.com/InternLM/xtuner) to fine-tune your LLMs.\n\n## 📌 Latest Updates\n\n- **2025.10.30**: We support several new LLM clients and inference backends, including [Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).\n- **2025.10.23**: We now support VQA (Visual Question Answering) data generation. Run the script: `bash scripts/generate/generate_vqa.sh`.\n- **2025.10.21**: We now support PDF as an input format for data generation via [MinerU](https://github.com/opendatalab/MinerU).\n\n\u003cdetails\u003e\n\u003csummary\u003eHistory\u003c/summary\u003e\n\n- **2025.09.29**: We now auto-update the Gradio demo on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).\n- **2025.08.14**: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.\n- **2025.07.31**: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.\n- **2025.04.21**: We have released the initial version of GraphGen.\n\n\u003c/details\u003e\n\n\n## ⚙️ Support List\n\nWe support various LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types.\nUsers can flexibly configure according to the needs of synthetic 
data.\n\n| Inference Server                             | Api Server                                                                     | Inference Client                                           | Input File Format                  | Data Modal    | Data Format                  | Data Type                                       |\n|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|------------------------------------|---------------|------------------------------|-------------------------------------------------|\n| [![hf-icon]HF][hf]\u003cbr\u003e[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]\u003cbr\u003e[![oai-icon]OpenAI][oai]\u003cbr\u003e[![az-icon]Azure][az] | HTTP\u003cbr\u003e[![ol-icon]Ollama][ol]\u003cbr\u003e[![oai-icon]OpenAI][oai] | CSV\u003cbr\u003eJSON\u003cbr\u003eJSONL\u003cbr\u003ePDF\u003cbr\u003eTXT | TEXT\u003cbr\u003eIMAGE | Alpaca\u003cbr\u003eChatML\u003cbr\u003eSharegpt | Aggregated\u003cbr\u003eAtomic\u003cbr\u003eCoT\u003cbr\u003eMulti-hop\u003cbr\u003eVQA |\n\n\u003c!-- links --\u003e\n[hf]: https://huggingface.co/docs/transformers/index\n[sg]: https://docs.sglang.ai\n[sif]: https://siliconflow.cn\n[oai]: https://openai.com\n[az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/\n[ol]: https://ollama.com\n\n\u003c!-- icons --\u003e\n[hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co\n[sg-icon]: https://www.google.com/s2/favicons?domain=https://docs.sglang.ai\n[sif-icon]: https://www.google.com/s2/favicons?domain=siliconflow.com\n[oai-icon]: https://www.google.com/s2/favicons?domain=https://openai.com\n[az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com\n[ol-icon]: https://www.google.com/s2/favicons?domain=https://ollama.com\n\n\n\n## 🚀 Quick Start\n\nExperience GraphGen Demo through 
[Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) or [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).\n\nFor any questions, please check the [FAQ](https://github.com/open-sciencelab/GraphGen/issues/10), open a new [issue](https://github.com/open-sciencelab/GraphGen/issues), or join our [WeChat group](https://cdn.vansin.top/internlm/dou.jpg) and ask.\n\n### Preparation\n\n1. Install [uv](https://docs.astral.sh/uv/reference/installer/)\n\n    ```bash\n    # If you run into network issues, try installing uv with pipx or pip; see the uv docs for details\n    curl -LsSf https://astral.sh/uv/install.sh | sh\n    ```\n2. Clone the repository\n\n    ```bash\n    git clone --depth=1 https://github.com/open-sciencelab/GraphGen\n    cd GraphGen\n    ```\n\n3. Create a new uv environment\n\n    ```bash\n    uv venv --python 3.10\n    ```\n   \n4. Configure the dependencies\n\n    ```bash\n    uv pip install -r requirements.txt\n    ```\n\n### Run Gradio Demo\n\n   ```bash\n   python -m webui.app\n   ```\n   \n   For hot-reload during development, run\n   ```bash\n   PYTHONPATH=. gradio webui/app.py\n   ```\n\n\n![ui](https://github.com/user-attachments/assets/3024e9bc-5d45-45f8-a4e6-b57bd2350d84)\n\n### Run from PyPI\n\n1. Install GraphGen\n   ```bash\n   uv pip install graphg\n   ```\n\n2. Run in CLI\n   ```bash\n   SYNTHESIZER_MODEL=your_synthesizer_model_name \\\n   SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \\\n   SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \\\n   TRAINEE_MODEL=your_trainee_model_name \\\n   TRAINEE_BASE_URL=your_base_url_for_trainee_model \\\n   TRAINEE_API_KEY=your_api_key_for_trainee_model \\\n   graphg --output_dir cache\n   ```\n\n### Run from Source\n\n1. 
Configure the environment\n   - Create a `.env` file in the root directory\n     ```bash\n     cp .env.example .env\n     ```\n   - Set the following environment variables:\n     ```bash\n     # Synthesizer is the model used to construct the KG and generate data\n     SYNTHESIZER_MODEL=your_synthesizer_model_name\n     SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model\n     SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model\n     # Trainee is the model to be trained on the generated data\n     TRAINEE_MODEL=your_trainee_model_name\n     TRAINEE_BASE_URL=your_base_url_for_trainee_model\n     TRAINEE_API_KEY=your_api_key_for_trainee_model\n     ```\n2. (Optional) Customize generation parameters in the `graphgen/configs/` folder.\n\n   Edit the corresponding YAML file, e.g.:\n\n    ```yaml\n      # configs/cot_config.yaml\n      input_file: resources/input_examples/jsonl_demo.jsonl\n      output_data_type: cot\n      tokenizer: cl100k_base\n      # additional settings...\n    ```\n\n3. Generate data\n\n   Pick the desired format and run the matching script:\n   \n   | Format       | Script to run                                  | Notes                                                             |\n   |--------------|------------------------------------------------|-------------------------------------------------------------------|\n   | `cot`        | `bash scripts/generate/generate_cot.sh`        | Chain-of-Thought Q\\\u0026A pairs                                       |\n   | `atomic`     | `bash scripts/generate/generate_atomic.sh`     | Atomic Q\\\u0026A pairs covering basic knowledge                        |\n   | `aggregated` | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q\\\u0026A pairs incorporating complex, integrated knowledge |\n   | `multi-hop`  | `bash scripts/generate/generate_multihop.sh`   | Multi-hop reasoning Q\\\u0026A pairs                                    |\n\n\n4. 
Get the generated data\n   ```bash\n   ls cache/data/graphgen\n   ```\n\n### Run with Docker\n1. Build the Docker image\n   ```bash\n   docker build -t graphgen .\n   ```\n2. Run the Docker container\n   ```bash\n    docker run -p 7860:7860 graphgen\n    ```\n\n\n## 🏗️ System Architecture\n\nSee [analysis](https://deepwiki.com/open-sciencelab/GraphGen) by deepwiki for a technical overview of the GraphGen system, its architecture, and core functionalities. \n\n\n### Workflow\n![workflow](resources/images/flow.png)\n\n\n## 🍀 Acknowledgements\n- [SiliconFlow](https://siliconflow.cn) Abundant LLM API, some models are free\n- [LightRAG](https://github.com/HKUDS/LightRAG) Simple and efficient graph retrieval solution\n- [ROGRAG](https://github.com/tpoisonooo/ROGRAG) A robustly optimized GraphRAG framework\n- [DB-GPT](https://github.com/eosphoros-ai/DB-GPT) An AI native data app development framework\n\n\n## 📚 Citation\nIf you find this repository useful, please consider citing our work:\n```bibtex\n@misc{chen2025graphgenenhancingsupervisedfinetuning,\n      title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation}, \n      author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},\n      year={2025},\n      eprint={2505.20416},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2505.20416}, \n}\n```\n\n## 📜 License\nThis project is licensed under the [Apache License 2.0](LICENSE).\n\n## 📅 Star History\n\n[![Star History 
Chart](https://api.star-history.com/svg?repos=open-sciencelab/GraphGen\u0026type=Date)](https://www.star-history.com/#open-sciencelab/GraphGen\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-sciencelab%2FGraphGen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopen-sciencelab%2FGraphGen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-sciencelab%2FGraphGen/lists"}