{"id":29031826,"url":"https://github.com/opencsgs/csghub-dataflow","last_synced_at":"2026-02-03T14:05:44.819Z","repository":{"id":301276973,"uuid":"1008735145","full_name":"OpenCSGs/csghub-dataflow","owner":"OpenCSGs","description":"OpenCSG dataflow is a one-stop data processing platform designed to leverage large model technology and advanced algorithms to optimize the entire data processing lifecycle, enhancing efficiency and precision, while addressing enterprise challenges in data management such as inefficiency, adaptability gaps, and security and compliance issues.","archived":false,"fork":false,"pushed_at":"2026-01-13T10:15:08.000Z","size":26097,"stargazers_count":6,"open_issues_count":18,"forks_count":7,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-26T18:47:55.354Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenCSGs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-26T02:45:00.000Z","updated_at":"2026-01-13T10:12:48.000Z","dependencies_parsed_at":"2025-10-27T05:25:40.733Z","dependency_job_id":"8187aacb-b0f5-4eb3-be61-b1372e2636e6","html_url":"https://github.com/OpenCSGs/csghub-dataflow","commit_stats":null,"previous_names":["opencsgs/csghub-dataflow"],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/OpenCSGs/csghub-dataflow","repository_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/OpenCSGs%2Fcsghub-dataflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenCSGs%2Fcsghub-dataflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenCSGs%2Fcsghub-dataflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenCSGs%2Fcsghub-dataflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenCSGs","download_url":"https://codeload.github.com/OpenCSGs/csghub-dataflow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenCSGs%2Fcsghub-dataflow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29047100,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-03T10:09:22.136Z","status":"ssl_error","status_checked_at":"2026-02-03T10:09:16.814Z","response_time":96,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-26T10:05:39.942Z","updated_at":"2026-02-03T14:05:41.586Z","avatar_url":"https://github.com/OpenCSGs.png","language":"Python","readme":"# csghub-dataflow\nOpenCSG dataflow is a one-stop data processing platform designed to leverage large model technology and advanced algorithms to optimize the entire data processing lifecycle, enhancing efficiency and 
precision, while addressing enterprise challenges in data management such as inefficiency, adaptability gaps, and security and compliance issues.\n\n**DataFlow** is an open-source platform engineered to streamline end-to-end data processing within the AI/ML lifecycle. By unifying data workflows and model optimization, it transforms fragmented pipelines into a cohesive, automated system, ideal for enterprises tackling data complexity at scale.  \n\n**🔑 Key Features**\n1. **Full Lifecycle Management**  \n   - Unified handling of data ingestion, transformation, modeling, and evaluation.  \n2. **Seamless CSGHub Integration**  \n   - Directly ingest datasets from CSGHub and push refined data back for model retraining, creating a continuous feedback loop.  \n3. **Modular \u0026 Extensible Design**  \n   - Plug-and-play operators for custom pipelines (e.g., NLP, image, audio processing).  \n4. **Distributed Computing**  \n   - Scale workloads across clusters via Kubernetes integration.  \n5. **Multi-Agent Task Orchestration**  \n   - Dynamically allocate complex tasks (e.g., data validation, anomaly detection) to collaborative agents.  \n6. **MinerU Engine**  \n   - Convert PDFs to structured Markdown/JSON for LLM-friendly datasets.  \n7. **Growing Operator Library**  \n   - Expandable support for multimodal data (text, image, video) and domain-specific transformations.  \n\n## 🔗 Acknowledgements  \n\nThis project is built upon **[Data Juicer](https://github.com/modelscope/data-juicer)**. We sincerely thank the Data Juicer team for their impactful work in data engineering.  \n\n### 📜 License  \nThis project inherits the [Apache License 2.0](LICENSE) from Data Juicer.  \n\n# 🚀 Quick Start\n\n## Building data-flow from Source\n\n```\ndocker build -t dataflow . -f Dockerfile\n\ndocker buildx build --provenance false --platform linux/amd64 -t dataflow . -f Dockerfile\n\ndocker buildx build --provenance false --platform linux/arm64 -t dataflow . 
-f Dockerfile\n```\n\n## Prerequisites\n\nLaunch the PostgreSQL container\n\n```bash\ndocker run -d --name dataflow-pg \\\n   -p 5433:5432 \\\n   -v /tmp/data_flow/pgdata:/var/lib/postgresql/data \\\n   -e POSTGRES_DB=data_flow \\\n   -e POSTGRES_USER=postgres \\\n   -e POSTGRES_PASSWORD=postgres \\\n   opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/csghub/postgres:15.10\n```\n\nLaunch the MongoDB container\n\n```bash\ndocker run -d --name dataflow-mongo \\\n   -p 27017:27017 \\\n   -v /tmp/data_flow/mongodata:/data/db \\\n   -e MONGO_INITDB_ROOT_USERNAME=root \\\n   -e MONGO_INITDB_ROOT_PASSWORD=example \\\n   opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/mongo:8.0.12\n```\n\nLaunch the Redis container\n\n```bash\ndocker run -d --name dataflow-redis \\\n   -p 16379:6379 \\\n   -v /tmp/data_flow/redisdata:/data \\\n   opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/redis:7.2.5\n```\n\n## Installing data-flow\n\n```bash\n\ndocker run -d --name dataflow-api -p 8000:8000 \\\n   -v /tmp/data_flow/apidata:/data/dataflow_data \\\n   -c \"uvicorn data_server.main:app --host 0.0.0.0 --port 8000\" \\\n   -e DATA_DIR=/data/dataflow_data \\\n   -e CSGHUB_ENDPOINT=https://hub.opencsg.com \\\n   -e MAX_WORKERS=99 \\\n   -e RAY_ADDRESS=auto \\\n   -e RAY_ENABLE=False \\\n   -e RAY_LOG_DIR=/data/ray_output \\\n   -e API_SERVER=0.0.0.0 \\\n   -e API_PORT=8000 \\\n   -e ENABLE_OPENTELEMETRY=False \\\n   -e DATABASE_DB=data_flow \\\n   -e DATABASE_USERNAME=postgres \\\n   -e DATABASE_PASSWORD=postgres \\\n   -e DATABASE_HOSTNAME=127.0.0.1 \\\n   -e DATABASE_PORT=5433 \\\n   -e STUDIO_JUMP_URL=https://data-label.opencsg.com \\\n   -e REDIS_HOST_URL=redis://127.0.0.1:16379 \\\n   -e MONG_HOST_URL=mongodb://root:example@127.0.0.1:27017 \\\n   dataflow\n\n```\n\n## Installing data-flow-celery\n\n```bash\n\ndocker run -d --name celery-work -p 8001:8001 \\\n   -v /tmp/data_flow/celery-data:/data/dataflow_celery \\\n   -c \"celery -A data_celery.main:celery_app worker 
--loglevel=info --pool=gevent\" \\\n   -e DATA_DIR=/data/dataflow_celery \\\n   -e CSGHUB_ENDPOINT=https://hub.opencsg.com \\\n   -e MAX_WORKERS=99 \\\n   -e RAY_ADDRESS=auto \\\n   -e RAY_ENABLE=False \\\n   -e RAY_LOG_DIR=/data/ray_output \\\n   -e API_SERVER=0.0.0.0 \\\n   -e API_PORT=8001 \\\n   -e ENABLE_OPENTELEMETRY=False \\\n   -e DATABASE_DB=data_flow \\\n   -e DATABASE_USERNAME=postgres \\\n   -e DATABASE_PASSWORD=postgres \\\n   -e DATABASE_HOSTNAME=127.0.0.1 \\\n   -e DATABASE_PORT=5433 \\\n   -e REDIS_HOST_URL=redis://127.0.0.1:16379 \\\n   -e MONG_HOST_URL=mongodb://root:example@127.0.0.1:27017 \\\n   dataflow-celery\n\n```\n\n## Run data-flow server in development mode locally\n\n### Create a Virtual Environment\n\n```bash\nuv venv --python 3.10\n\nsource .venv/bin/activate\n\n# or\n\nconda create -n dataflow python=3.10\n```\n\n```bash\n\n# Install dependencies\n#pip install '.[dist]' -i https://pypi.tuna.tsinghua.edu.cn/simple/\n#pip install '.[tools]' -i https://pypi.tuna.tsinghua.edu.cn/simple/\n#pip install '.[sci]' -i https://pypi.tuna.tsinghua.edu.cn/simple/\n#pip install -r docker/requirements.txt\n\nuv pip install -r docker/dataflow_requirements.txt -i https://mirrors.aliyun.com/pypi/simple/\n\n# Run the server locally\nuvicorn data_server.main:app --reload\n```\n\n## Run data-flow-celery server in development mode locally\n\n```bash\n\n# Run the celery server locally\ncelery -A data_celery.main:celery_app worker --loglevel=info --pool=gevent\n```\n\nNotes:\n- `kenlm`, `simhash-pybind`, `opencc==1.1.8`, and `imagededup` in `environments/science_requires.txt` support only the x86 platform. Remove them if you are on an ARM platform.\n- The `REDIS_HOST_URL` and `MONG_HOST_URL` settings must be identical in `data-flow` and `data-flow-celery`.\n- If you want to use the data annotation service, please install and enable the **[Label Studio](https://github.com/OpenCSGs/label-studio)** service. 
Additionally, you need to set the `STUDIO_JUMP_URL` variable of the `data-flow` service to the address of the `Label Studio` service.\n\n## 🛣️ Roadmap\nUpcoming:  \n- Enhanced real-time data streaming  \n- AutoML integration for automated model tuning  \n- Cross-cloud synchronization\n- Support more data sources\n\n## 🤝 Contributing\nWe welcome contributions! \n\n## 📞 Contact\nFor support or queries:  \n- Email: [community@opencsg.com](mailto:community@opencsg.com)  \n- GitHub: [OpenCSG/DataFlow](https://github.com/OpenCSGs)  \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopencsgs%2Fcsghub-dataflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopencsgs%2Fcsghub-dataflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopencsgs%2Fcsghub-dataflow/lists"}