{"id":29359379,"url":"https://github.com/OpenDCAI/DataFlow","last_synced_at":"2025-07-09T07:03:21.635Z","repository":{"id":258919103,"uuid":"872005985","full_name":"OpenDCAI/DataFlow","owner":"OpenDCAI","description":"Easy Data Preparation with latest LLMs-based Operators and Pipelines.","archived":false,"fork":false,"pushed_at":"2025-07-05T17:59:25.000Z","size":70580,"stargazers_count":500,"open_issues_count":8,"forks_count":31,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-07-05T19:06:15.217Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://OpenDCAI.github.io/DataFlow-Doc/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenDCAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-13T14:45:45.000Z","updated_at":"2025-07-05T17:54:38.000Z","dependencies_parsed_at":"2025-06-28T09:49:47.346Z","dependency_job_id":null,"html_url":"https://github.com/OpenDCAI/DataFlow","commit_stats":null,"previous_names":["open-dataflow/open-dataflow-eval","open-dataflow/dataflow-eval-process","open-dataflow/dataflow","opendcai/dataflow"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/OpenDCAI/DataFlow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenDCAI%2FDataFlow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenDCAI%2FDataFlow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenDCAI%2FDataFlow/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/OpenDCAI%2FDataFlow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenDCAI","download_url":"https://codeload.github.com/OpenDCAI/DataFlow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenDCAI%2FDataFlow/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264411124,"owners_count":23603799,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-09T07:01:53.250Z","updated_at":"2025-07-09T07:03:21.626Z","avatar_url":"https://github.com/OpenDCAI.png","language":"Python","funding_links":[],"categories":["Python","A01_文本生成_文本对话","HarmonyOS","数据 Data"],"sub_categories":["大语言对话模型及数据","Windows Manager"],"readme":"# DataFlow\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg 
src=\"./static/images/Face.jpg\"\u003e\n\n\n[![Documents](https://img.shields.io/badge/Documents-Click_here-brightgreen?logo=read-the-docs)](https://OpenDCAI.github.io/DataFlow-Doc/)\n[![](https://img.shields.io/github/license/OpenDCAI/DataFlow)](https://github.com/OpenDCAI/DataFlow/blob/main/LICENSE)\n[![](https://img.shields.io/github/stars/OpenDCAI/DataFlow?style=social)](https://github.com/OpenDCAI/DataFlow)\n[![](https://img.shields.io/github/issues-raw/OpenDCAI/DataFlow)](https://github.com/OpenDCAI/DataFlow/issues)\n[![](https://img.shields.io/github/contributors/OpenDCAI/DataFlow)](https://github.com/OpenDCAI/DataFlow/graphs/contributors)\n[![](https://img.shields.io/github/repo-size/OpenDCAI/DataFlow?color=green)](https://github.com/OpenDCAI/DataFlow)\n\n\u003c!-- [![](https://img.shields.io/github/last-commit/OpenDCAI/DataFlow)](https://github.com/OpenDCAI/DataFlow/commits/main/) --\u003e\n\n[简体中文](./README-zh.md) | English\n\n\n**[🚀 Features](#Features) • [⚡ Quick Start](#Quick_Start) • [📖 Documentation](https://OpenDCAI.github.io/DataFlow-Doc/) • [🧪 Experiments](#Experiments)**\n\n\u003c/div\u003e\n\nhttps://github.com/user-attachments/assets/05e047a5-99bb-4043-bc71-2b5ccdab2126\n\n## 📰 1. News\n🎉 [2025-06-28] We’re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.\n\n## 🔍 2. Overview\n\n  \u003cimg src=\"./static/images/dataflow_framework.jpg\"\u003e\n\nDataFlow is a data preparation and training system designed to **parse, generate, process, and evaluate** high-quality data from noisy sources (PDF, plain-text, low-quality QA), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (Pre-training, Supervised Fine-tuning, RL training) or RAG using knowledge base cleaning. 
**DataFlow has been empirically validated to improve the performance of domain-oriented LLMs in fields such as healthcare, finance, and law.**\n\nSpecifically, we construct diverse `operators` leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct `pipelines`, collectively forming the comprehensive `DataFlow system`. Additionally, we develop an intelligent `DataFlow-agent` capable of dynamically assembling new `pipelines` by recombining existing `operators` on demand.\n\n\n\n\u003c!-- Text: input is low-quality data; an LLM outputs QA pairs (mainly for reinforcement learning)\nNL2SQL: construct SQL QA pairs in reverse\nReasoning: questions are short; build long chain-of-thought, and determine category and difficulty (via an LLM)\nAgentic RAG: input is QA, output is QA that cannot be solved without additional information, which must be brought in\nKnowledge Base Cleaning: input is PDFs, tables, and doc text; output is a high-quality knowledge base\nDataflow-agent: use an agent to automatically synthesize pipelines by orchestrating existing operators. --\u003e\n\n## 🛠️ 3. Pipelines Functionality\n### 🔧 3.1 Ready-to-Use Pipelines\nThe current pipelines in DataFlow are as follows:\n- 📝 **Text Pipeline**: Mines question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.\n  - ![](./static/images/dataflow_text_pipeline.jpg)\n  - [[HuggingFace🤗 demo input \u0026 output for **Text Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text)\n- 🧠 **Reasoning Pipeline**: Enhances existing question–answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.\n  - ![](./static/images/dataflow_reasoning_pipeline.jpg)\n  - [[HuggingFace🤗 demo input \u0026 output for **Reasoning Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Reasonning)\n- 🗃️ **Text2SQL Pipeline**: Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.\n  - ![](./static/images/dataflow_text2sql_pipeline.jpg)\n  - [[HuggingFace🤗 demo input \u0026 output for **Text2SQL 
Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text2SQL)\n- 📚 **Knowledge Base Cleaning Pipeline**: Extracts and structures knowledge from unorganized sources like tables, PDFs, and Word documents into usable entries for downstream RAG or QA pair generation.\n  - ![](./static/images/dataflow_KnowledgeBaseClean_pipeline.jpg)\n- 🤖 **Agentic RAG Pipeline**: Identifies and extracts, from existing QA datasets or knowledge bases, the QA pairs that require external knowledge to answer, for use in downstream training of Agentic RAG tasks.\n  - ![](./static/images/dataflow_agenticRAG_pipeline.jpg)\n### ⚙️ 3.2 Flexible Operator Pipelines\nIn this framework, operators are categorized into Fundamental Operators, Generic Operators, Domain-Specific Operators, Evaluation Operators, and others, which together support data processing and evaluation. Please refer to the [documentation](https://OpenDCAI.github.io/DataFlow-Doc/) for details.\n\n### 🤖 3.3 Agent Guided Pipelines\n\u003c!-- Building on top of this, we also provide the --\u003e\n- **DataFlow Agent**: Arranges existing `operators` and automatically constructs new pipelines based on task requirements.\n  - ![](./static/images/dataflow_agent_pipeline.jpg)\n  - [[HuggingFace🤗 demo input \u0026 output for **DataFlow Agent**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Agent)\n\n\u003c!-- ### 3.1 Text Pipeline\n![](./static/images/demo_reasoning.png) --\u003e\n\n## ⚡ 4. 
Quick Start\nFor environment setup and installation, please use the following commands👇\n\n```shell\nconda create -n dataflow python=3.10 \nconda activate dataflow\n\npip install open-dataflow\n```\nIf you want to run inference locally on your own GPU, please use:\n```shell\npip install open-dataflow[vllm]\n```\n\u003e DataFlow supports Python\u003e=3.10\n\nYou can use the following command to check whether the installation succeeded:\n```shell\ndataflow -v\n```\n\nYou should see output like the following:\n```log\nopen-dataflow codebase version: 1.0.0\n        Checking for updates...\n        Local version:  1.0.0\n        PyPI newest version:  1.0.0\nYou are using the latest version: 1.0.0.\n```\n\nFor **Quick-Start** and **Guide**, please visit our [Documentation](https://OpenDCAI.github.io/DataFlow-Doc/). \n\n[![Documents](https://img.shields.io/badge/Documents-Click_here-brightgreen?logo=read-the-docs)](https://OpenDCAI.github.io/DataFlow-Doc/)\n\n\n## 🧪 5. Experimental Results\nFor detailed experimental settings, please visit our documentation.\n\n\n### 📝 5.1 Text Pipeline\n\n#### 5.1.1 Pre-training data filter pipeline\nThe `pre-training data processing pipeline` was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using `QuratingScorer` are shown in the figure. As can be seen, the filtered pretraining data significantly outperforms the original data across four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of the DataFlow pretraining data processing.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./static/images/text-pretrain.png\" width=\"60%\"\u003e\n\u003c/div\u003e\n\n#### 5.1.2 SFT data filter pipeline\nWe filtered 3k records from the `alpaca` dataset and compared them with 3k randomly selected records from the same dataset by training Qwen2.5-7B on each. 
Results are:\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./static/images/text-sft.png\" width=\"60%\"\u003e\n\u003c/div\u003e\n\n### 🧠 5.2 Reasoning Pipeline\n\nWe verify our reasoning pipeline by running SFT on Qwen2.5-32B-Instruct with data synthesized by the Reasoning Pipeline. We generated 1k and 5k SFT data pairs. Results are: \n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./static/images/reasoning_performance.png\" width=\"60%\"\u003e\n\u003c/div\u003e\n\n### 🗃️ 5.3 Text2SQL Pipeline\nWe fine-tuned the Qwen2.5-Coder-14B model on the Bird dataset using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./static/images/text2sql.png\" width=\"60%\"\u003e\n\u003c/div\u003e\n\n## 🤝 6. Community \u0026 Support\nJoin the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!\n\n•\t📮 [GitHub Issues](../../issues): Report bugs or suggest features\n \n•\t🔧 [GitHub Pull Requests](../../pulls): Contribute code improvements\n\n•\t💬 Join our community groups to connect with us and other contributors!\n \n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./static/images/community_en.jpg\" width=\"60%\"\u003e\n\u003c/div\u003e\n\n## 📜 7. Citation\nIf you use DataFlow in your research, please consider citing us.\n```bibtex\n@misc{dataflow2025,\n  author       = {DataFlow Develop Team},\n  title        = {DataFlow: A Unified Framework for Data-Centric AI},\n  year         = {2025},\n  howpublished = {\\url{https://github.com/OpenDCAI/DataFlow}},\n  note         = {Accessed: 2025-07-08}\n}\n```\n\n## 📊 8. 
Statistics\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://star-history.com/#OpenDCAI/DataFlow\u0026Date\"\u003e\n    \u003cpicture\u003e\n      \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://api.star-history.com/svg?repos=OpenDCAI/DataFlow\u0026type=Date\u0026theme=dark\" /\u003e\n      \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"https://api.star-history.com/svg?repos=OpenDCAI/DataFlow\u0026type=Date\" /\u003e\n      \u003cimg alt=\"Star History Chart\" src=\"https://api.star-history.com/svg?repos=OpenDCAI/DataFlow\u0026type=Date\" style=\"width:50%;\" /\u003e\n    \u003c/picture\u003e\n  \u003c/a\u003e\n  \n\u003c/div\u003e\n\n---\n\u003cdiv align=\"center\"\u003e\n  \u003csub\u003e\n    Developed and maintained by the \n    \u003ca href=\"https://zwt233.github.io/\" target=\"_blank\"\u003e\u003cstrong\u003ePKU-DCAI Research Team\u003c/strong\u003e\u003c/a\u003e ❤️ \u003cbr\u003e\n    Connect with us on Xiaohongshu: \u003cstrong\u003e26133106768\u003c/strong\u003e\n  \u003c/sub\u003e\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenDCAI%2FDataFlow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenDCAI%2FDataFlow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenDCAI%2FDataFlow/lists"}