https://github.com/tpoisonooo/huixiangdou2
HuixiangDou2: A Robustly Optimized GraphRAG Approach
knowledge-base knowledge-graph knownledge-augmented-generation llm precision retrieval-augmented-generation
- Host: GitHub
- URL: https://github.com/tpoisonooo/huixiangdou2
- Owner: tpoisonooo
- License: bsd-3-clause
- Created: 2024-12-18T08:48:58.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-03-18T08:07:51.000Z (3 months ago)
- Last Synced: 2025-03-18T08:29:06.132Z (3 months ago)
- Topics: knowledge-base, knowledge-graph, knownledge-augmented-generation, llm, precision, retrieval-augmented-generation
- Language: Python
- Homepage:
- Size: 1.52 MB
- Stars: 92
- Watchers: 1
- Forks: 7
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
English | [Simplified Chinese](./README_zh_cn.md)
# HuixiangDou2: A Robustly Optimized GraphRAG Approach
## 🔥 Introduction
GraphRAG has many tunable components, which makes it hard to tell whether a performance gain comes from parameter adjustments or from pipeline optimizations. Moreover, RAG test data is often already embedded in LLM training sets, and LLM input tokens affect generation probabilities (background: the phi-4 technical report), so it is unclear whether a precision improvement comes from key-token search or from retrieval.
Thus, HuixiangDou2 integrated multiple open-source projects (HuixiangDou, KAG, LightRAG, and DB-GPT, totaling 18k lines of code) and conducted comparative experiments on a test set where Qwen2.5-7B-Instruct underperformed. The score rose from 60 to 74.5. The result is a GraphRAG implementation whose performance is recognized by human domain experts. [Here is the report](https://arxiv.org/abs/2503.06474).
> **Note**: The impact of open source varies across fields and industries. Due to licensing restrictions, we can **only provide the code and test conclusions; the test data cannot be released**.
## 📖 Documentation
- [1. Run from Docker (CMD / Swagger Server API / Gradio)](docs/en/doc_how_to_run_from_docker.md)
- [2. Run from Source](docs/en/doc_how_to_run.md)
- [3. Directory Structure and Function](docs/en/doc_architecture.md)
- [**FAQ** about environment and error](https://github.com/tpoisonooo/HuixiangDou2/issues/8)

If it is useful to you, please star it ⭐
## 🔆 Version Description
Compared to [HuixiangDou1](https://github.com/internlm/huixiangdou), this repo improves accuracy:
1. **Graph Schema**. Dense retrieval is used only to query similar entities and relationships.
2. Ported/merged multiple open-source implementations, with code differences of nearly 18k lines:
   - **Data**. Organized a set of real domain knowledge that LLMs have not fully seen, for testing (GPT accuracy < 0.6)
   - **Ablation**. Confirmed the impact of different stages and parameters on accuracy
   - **Improvement**. As shown below.

   ![]()
3. The API remains compatible, which means the WeChat/Lark/Web front ends from v1 remain accessible.
```text
# v1 API https://github.com/InternLM/HuixiangDou/blob/main/huixiangdou/service/parallel_pipeline.py#L290
async def generate(self,
                   query: Union[Query, str],
                   history: List[Tuple[str]]=[],
                   language: str='zh',
                   enable_web_search: bool=True,
                   enable_code_search: bool=True):

# v2 API https://github.com/tpoisonooo/HuixiangDou2/blob/main/huixiangdou/pipeline/parallel.py#L135
async def generate(self,
                   query: Union[Query, str],
                   history: List[Pair] = [],
                   request_id: str = 'default',
                   language: str = 'zh_cn'):
```
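The two signatures share `query`, `history`, and `language`, but differ in the remaining keyword arguments. For a caller that passes the v1-only arguments explicitly, a thin adapter is one way to bridge the gap. The following is a minimal sketch, not code from either repository: `StubPipelineV2`, `generate_v1_style`, the `Pair` alias, and the `'zh'` to `'zh_cn'` mapping are all assumptions made for illustration, and the stub returns a plain string rather than whatever the real `generate` produces.

```python
import asyncio
from typing import List, Optional, Tuple

# Assumption: a history entry is a (user_text, assistant_text) pair.
Pair = Tuple[str, str]

# Assumption: v1 language codes map onto v2 codes like this. Only the two
# defaults ('zh' and 'zh_cn') appear in the signatures above.
V1_TO_V2_LANGUAGE = {"zh": "zh_cn"}


class StubPipelineV2:
    """Hypothetical stand-in that only mirrors the v2 `generate` signature."""

    async def generate(self,
                       query: str,
                       history: Optional[List[Pair]] = None,
                       request_id: str = "default",
                       language: str = "zh_cn") -> str:
        turns = len(history or [])
        return f"[{request_id}|{language}|{turns} turns] {query}"


async def generate_v1_style(pipeline: StubPipelineV2,
                            query: str,
                            history: Optional[List[Pair]] = None,
                            language: str = "zh",
                            enable_web_search: bool = True,
                            enable_code_search: bool = True) -> str:
    """Accept v1-style arguments and forward them to a v2-style pipeline."""
    # The v2 signature has no search switches, so this sketch drops them.
    del enable_web_search, enable_code_search
    return await pipeline.generate(
        query=query,
        history=list(history or []),
        language=V1_TO_V2_LANGUAGE.get(language, language),
    )


def demo() -> str:
    """Run one v1-style call against the stub and return its reply."""
    history = [("What is GraphRAG?", "RAG over a knowledge graph.")]
    return asyncio.run(
        generate_v1_style(StubPipelineV2(), "How do I run it?", history))
```

The adapter uses `None` instead of a mutable `[]` default for `history`, which avoids sharing one list across calls. Language codes without an entry in the mapping are passed through unchanged.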
## 🍀 Acknowledgements
- [SiliconCloud](https://siliconflow.cn) Abundant LLM API, some models are free
- [KAG](https://github.com/OpenSPG/KAG) Graph retrieval based on reasoning
- [DB-GPT](https://github.com/eosphoros-ai/DB-GPT) LLM tool collection
- [LightRAG](https://github.com/HKUDS/LightRAG) Simple and efficient graph retrieval solution

## 📝 Citation
```text
@misc{kong2024huixiangdou,
title={HuiXiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance},
author={Huanjun Kong and Songyang Zhang and Jiaying Li and Min Xiao and Jun Xu and Kai Chen},
year={2024},
eprint={2401.08772},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{kong2024labelingsupervisedfinetuningdata,
title={Labeling supervised fine-tuning data with the scaling law},
author={Huanjun Kong},
year={2024},
eprint={2405.02817},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2405.02817},
}

@misc{kong2025huixiangdou2robustlyoptimizedgraphrag,
title={HuixiangDou2: A Robustly Optimized GraphRAG Approach},
author={Huanjun Kong and Zhefan Wang and Chenyang Wang and Zhe Ma and Nanqing Dong},
year={2025},
eprint={2503.06474},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2503.06474},
}
```