https://github.com/OpenDCAI/DataFlow
Easy Data Preparation with latest LLMs-based Operators and Pipelines.
- Host: GitHub
- URL: https://github.com/OpenDCAI/DataFlow
- Owner: OpenDCAI
- License: apache-2.0
- Created: 2024-10-13T14:45:45.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-07-05T17:59:25.000Z (8 months ago)
- Last Synced: 2025-07-05T19:06:15.217Z (8 months ago)
- Language: Python
- Homepage: https://OpenDCAI.github.io/DataFlow-Doc/
- Size: 67.3 MB
- Stars: 500
- Watchers: 10
- Forks: 31
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - OpenDCAI/DataFlow
- stars - OpenDCAI/DataFlow - based Operators and Pipelines. (HarmonyOS / Windows Manager)
- awesome-LLM-resources - DataFlow - based Operators and Pipelines. (数据 Data)
README
# DataFlow
[简体中文](./README-zh.md) | English
**[🚀 Features](#Features) • [⚡ Quick Start](#Quick_Start) • [📖 Documentation](https://OpenDCAI.github.io/DataFlow-Doc/) • [🧪 Experiments](#Experiments)**
https://github.com/user-attachments/assets/05e047a5-99bb-4043-bc71-2b5ccdab2126
## 📰 1. News
🎉 [2025-06-28] We’re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.
## 🔍 2. Overview

DataFlow is a data preparation and training system designed to **parse, generate, process, and evaluate** high-quality data from noisy sources (PDF, plain text, low-quality QA). The resulting data improves the performance of large language models (LLMs) in specific domains, either through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG with a cleaned knowledge base. **DataFlow has been empirically validated to improve the performance of domain-oriented LLMs in fields such as healthcare, finance, and law.**
Specifically, we construct diverse `operators` that leverage rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct `pipelines`, which together form the `DataFlow system`. Additionally, we develop an intelligent `DataFlow-agent` that can dynamically assemble new `pipelines` by recombining existing `operators` on demand.
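
To make the operator/pipeline relationship concrete, here is a minimal sketch in plain Python. It is not DataFlow's actual API: the `Operator` type, the `Pipeline` class, and the two rule-based operators are hypothetical names chosen for illustration. It only shows the general idea of chaining small operators into a pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable

Record = dict[str, str]
# Hypothetical operator type: takes a list of records, returns a new list.
Operator = Callable[[list[Record]], list[Record]]


def drop_short(records: list[Record], min_chars: int = 20) -> list[Record]:
    """Rule-based filter: keep records whose text has at least min_chars characters."""
    return [r for r in records if len(r["text"]) >= min_chars]


def normalize_whitespace(records: list[Record]) -> list[Record]:
    """Rule-based refiner: collapse runs of whitespace into single spaces."""
    return [{**r, "text": " ".join(r["text"].split())} for r in records]


@dataclass
class Pipeline:
    """An ordered list of operators applied one after another (illustrative only)."""
    operators: list[Operator] = field(default_factory=list)

    def run(self, records: list[Record]) -> list[Record]:
        for op in self.operators:
            records = op(records)
        return records


pipeline = Pipeline([normalize_whitespace, drop_short])
raw = [
    {"text": "  too   short "},
    {"text": "A  longer   passage that survives   the length filter."},
]
cleaned = pipeline.run(raw)
print(cleaned)
```

In this sketch an LLM-backed operator would simply be another callable with the same signature, which is what makes recombination straightforward.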
## 🛠️ 3. Pipelines Functionality
### 🔧 3.1 Ready-to-Use Pipelines
The pipelines currently available in DataFlow are:
- 📝 **Text Pipeline**: Mine question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.
- [[HuggingFace🤗 demo input & output for **Text Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text)
- 🧠 **Reasoning Pipeline**: Enhances existing question–answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.
- [[HuggingFace🤗 demo input & output for **Reasoning Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Reasonning)
- 🗃️ **Text2SQL Pipeline**: Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.
- [[HuggingFace🤗 demo input & output for **Text2SQL Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text2SQL)
- 📚 **Knowledge Base Cleaning Pipeline**: Extract and structure knowledge from unorganized sources like tables, PDFs, and Word documents into usable entries for downstream RAG or QA pair generation.
- 🤖 **Agentic RAG Pipeline**: Identify and extract QA pairs from existing QA datasets or knowledge bases that require external knowledge to answer, for use in downstream training of Agentic RAG tasks.
### ⚙️ 3.2 Flexible Operator Pipelines
Operators in this framework fall into categories such as Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, which together support data processing and evaluation. Please refer to the [documentation](https://OpenDCAI.github.io/DataFlow-Doc/) for details.
### 🤖 3.3 Agent Guided Pipelines
- **DataFlow Agent**: Can arrange existing `operators` and automatically construct new pipelines based on task requirements.
- [[HuggingFace🤗 demo input & output for **DataFlow Agent**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Agent)
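
The idea of arranging existing operators according to task requirements can be sketched as a registry lookup followed by chaining. Everything below is an assumption made for illustration: the `REGISTRY` contents, the tag names, and the keyword-matching `plan` function (a stand-in for the LLM-driven agent) are hypothetical and do not reflect how the DataFlow Agent is implemented.

```python
from typing import Callable

Record = dict[str, str]
Operator = Callable[[list[Record]], list[Record]]

# Hypothetical registry: operator name -> (capability tags, implementation).
REGISTRY: dict[str, tuple[set[str], Operator]] = {
    "dedup": (
        {"deduplicate", "clean"},
        lambda rs: list({r["text"]: r for r in rs}.values()),
    ),
    "lowercase": (
        {"normalize", "clean"},
        lambda rs: [{**r, "text": r["text"].lower()} for r in rs],
    ),
    "drop_empty": (
        {"filter", "clean"},
        lambda rs: [r for r in rs if r["text"].strip()],
    ),
}


def plan(task_keywords: set[str]) -> list[str]:
    """Stand-in for the agent: select operators whose tags overlap the task keywords."""
    return [name for name, (tags, _) in REGISTRY.items() if tags & task_keywords]


def assemble(names: list[str]) -> Operator:
    """Chain the selected operators, in order, into a single callable pipeline."""
    def run(records: list[Record]) -> list[Record]:
        for name in names:
            records = REGISTRY[name][1](records)
        return records
    return run


selected = plan({"normalize", "filter"})
pipeline = assemble(selected)
result = pipeline([{"text": "Hello"}, {"text": "   "}, {"text": "WORLD"}])
print(selected, result)
```

A real agent would replace `plan` with a model call that reads operator descriptions, but the assembly step can stay this simple as long as operators share one signature.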
## ⚡ 4. Quick Start
For environment setup and installation, please use the following commands👇
```shell
conda create -n dataflow python=3.10
conda activate dataflow
pip install open-dataflow
```
If you want to run inference locally on your own GPU, please use:
```shell
pip install open-dataflow[vllm]
```
> DataFlow supports Python>=3.10.

You can use the following command to check that the installation succeeded:
```shell
dataflow -v
```
You should see output similar to the following:
```log
open-dataflow codebase version: 1.0.0
Checking for updates...
Local version: 1.0.0
PyPI newest version: 1.0.0
You are using the latest version: 1.0.0.
```
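
If you prefer to check the environment from Python, a small script along these lines should work. It is a sketch that relies only on the standard library; the `check_environment` helper is a hypothetical name, and the only facts taken from this README are the distribution name `open-dataflow` and the Python 3.10 minimum.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def check_environment(package: str = "open-dataflow") -> str:
    """Return a short status line for the interpreter and the installed package."""
    if sys.version_info < (3, 10):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        return f"Python {found} is too old; 3.10+ is required"
    try:
        # importlib.metadata looks up the installed distribution by its PyPI name.
        return f"{package} {version(package)} is installed"
    except PackageNotFoundError:
        return f"{package} is not installed in this environment"


status = check_environment()
print(status)
```

Unlike `dataflow -v`, this does not contact PyPI, so it reports only the local version.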
For **Quick-Start** and **Guide**, please visit our [Documentation](https://OpenDCAI.github.io/DataFlow-Doc/).
## 🧪 5. Experimental Results
For detailed experimental settings, please visit our documentation.
### 📝 5.1 Text Pipeline
#### 5.1.1 Pre-training data filter pipeline
The `pre-training data processing pipeline` was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using `QuratingScorer` are shown in the figure. As can be seen, the filtered pre-training data significantly outperforms the original data across four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of the DataFlow pre-training data processing pipeline.
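
As an illustration of how a retention rate like the one above is computed, the sketch below runs a chain of filters over a toy sample and reports the cumulative fraction of documents kept after each stage. The filters, thresholds, and sample documents are all hypothetical and far simpler than the actual pipeline; only the definition (documents kept divided by documents in the original sample) carries over.

```python
from typing import Callable

Filter = Callable[[str], bool]


def run_filters(
    docs: list[str], filters: list[tuple[str, Filter]]
) -> tuple[list[str], list[tuple[str, float]]]:
    """Apply filters in order; report cumulative retention after each stage."""
    total = len(docs)
    if total == 0:
        raise ValueError("need at least one document")
    report = []
    for name, keep in filters:
        docs = [d for d in docs if keep(d)]
        report.append((name, len(docs) / total))
    return docs, report


docs = [
    "Short.",
    "The mitochondrion is the organelle that produces most of a cell's ATP.",
    "click here click here click here click here",
    "Gradient descent updates parameters in the direction of steepest loss decrease.",
]
filters = [
    # Hypothetical rules: a minimum length and a cap on repeated words.
    ("min_words", lambda d: len(d.split()) >= 5),
    ("low_repetition", lambda d: len(set(d.split())) / len(d.split()) > 0.5),
]
kept, report = run_filters(docs, filters)
for name, rate in report:
    print(f"{name}: {rate:.2%} retained")
```

Reporting retention per stage, rather than only the final figure, makes it easier to see which filter removes the most data.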
#### 5.1.2 SFT data filter pipeline
We filtered 3k records from the `alpaca` dataset and compared them with 3k randomly selected records from the same dataset by fine-tuning Qwen2.5-7B on each subset. Results are:
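
The comparison design (a filtered subset versus a random subset of the same size) can be sketched as follows. This is an assumption-laden illustration: the README does not say how the 3k records were selected, so the top-k-by-score rule, the `score` field, and the synthetic records here are hypothetical. The point is only that both subsets have equal size and the random baseline is seeded for reproducibility.

```python
import random


def select_top_k(records: list[dict], k: int, key: str = "score") -> list[dict]:
    """Filtered subset: the k records with the highest quality score."""
    return sorted(records, key=lambda r: r[key], reverse=True)[:k]


def select_random_k(records: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Baseline subset: k records sampled uniformly without replacement."""
    rng = random.Random(seed)  # fixed seed so the baseline can be reproduced
    return rng.sample(records, k)


def mean_score(records: list[dict]) -> float:
    return sum(r["score"] for r in records) / len(records)


# Synthetic records with made-up scores, standing in for a scored dataset.
records = [{"id": i, "score": ((i * 37) % 101) / 100} for i in range(1000)]
filtered = select_top_k(records, 30)
baseline = select_random_k(records, 30)
print(f"filtered mean score: {mean_score(filtered):.3f}")
print(f"baseline mean score: {mean_score(baseline):.3f}")
```

Both subsets would then be used to fine-tune the same base model under identical settings, so that any difference in results can be attributed to data selection.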
### 🧠 5.2 Reasoning Pipeline
We verify our Reasoning Pipeline by running SFT on Qwen2.5-32B-Instruct with data synthesized by the pipeline. We generated 1k and 5k SFT data pairs. Results are:
### 🗃️ 5.3 Text2SQL Pipeline
We fine-tuned the Qwen2.5-Coder-14B model on the Bird dataset using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:
## 🤝 6. Community & Support
Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!
• 📮 [GitHub Issues](../../issues): Report bugs or suggest features
• 🔧 [GitHub Pull Requests](../../pulls): Contribute code improvements
• 💬 Join our community groups to connect with us and other contributors!
## 📜 7. Citation
If you use DataFlow in your research, please cite it as follows:
```bibtex
@misc{dataflow2025,
author = {DataFlow Develop Team},
title = {DataFlow: A Unified Framework for Data-Centric AI},
year = {2025},
howpublished = {\url{https://github.com/OpenDCAI/DataFlow}},
note = {Accessed: 2025-07-08}
}
```
## 📊 8. Statistics
---
Developed and maintained by the PKU-DCAI Research Team ❤️
Connect with us on Xiaohongshu: 26133106768