https://github.com/iannil/one-data-studio

one-data-studio integrates a data governance and development platform, a cloud-native MLOps platform, and a large model application development platform. It connects the entire value chain from raw data governance to model training and deployment, and further to the construction of generative AI applications.
https://github.com/iannil/one-data-studio

data llm model platform

Last synced: about 1 month ago
JSON representation

Host: GitHub
URL: https://github.com/iannil/one-data-studio
Owner: iannil
License: apache-2.0
Created: 2026-01-24T09:17:55.000Z (6 months ago)
Default Branch: master
Last Pushed: 2026-02-04T14:16:26.000Z (5 months ago)
Last Synced: 2026-02-05T01:35:52.426Z (5 months ago)
Topics: data, llm, model, platform
Language: Python
Homepage: https://zhurongshuo.com/products/one-data-studio/
Size: 7.56 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: docs/SECURITY.md

Awesome Lists containing this project

README

# ONE-DATA-STUDIO

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/Python-3.10%2B-green.svg)](https://www.python.org/)
[![React](https://img.shields.io/badge/React-18.3-blue.svg)](https://reactjs.org/)
[![TypeScript](https://img.shields.io/badge/TypeScript-5.4-blue.svg)](https://www.typescriptlang.org/)
[![Kubernetes](https://img.shields.io/badge/Kubernetes-1.27%2B-326ce5.svg)](https://kubernetes.io/)
[![Docker](https://img.shields.io/badge/Docker-20.10%2B-2496ED.svg)](https://www.docker.com/)

Enterprise-Grade DataOps + MLOps + LLMOps Converged Platform

*From Raw Data to Intelligent Applications — All in One Platform*

---

## What is ONE-DATA-STUDIO?

ONE-DATA-STUDIO is an open-source enterprise platform that uniquely converges three critical AI infrastructure layers into a unified system:

| Layer | Name | Description |
| ------- | ------ | ------------- |
| Data | DataOps Platform | Data integration, ETL, governance, feature store, and vector storage |
| Model | MLOps Platform | Jupyter notebooks, distributed training, model registry, and serving |
| Agent | LLMOps Platform | RAG pipelines, agent orchestration, workflow builder, and prompt management |

Unlike traditional platforms that treat these as separate silos, ONE-DATA-STUDIO creates seamless integration points between layers, enabling enterprises to build end-to-end AI solutions from raw data to production applications.

### Key Value Propositions

1. Complete Value Chain: Raw data → Governed datasets → Trained models → Deployed applications
2. Unified Governance: Single pane of glass for data lineage, model lineage, and application logs
3. Private & Secure: Deploy entirely on-premises with your own data, compute, and models
4. Production-Ready: Battle-tested with enterprise-grade security, monitoring, and scalability

---

## Features

### Data Layer (DataOps)

| Feature | Description | Implementation |
| --------- | ------------- | ---------------- |
| Data Integration | Connect to 50+ data sources (databases, APIs, files) | Flask-based connectors with async I/O |
| ETL Pipelines | Visual pipeline builder with Flink/Spark execution | Declarative DAG definitions |
| Metadata Management | Automatic schema discovery and cataloging | OpenMetadata integration |
| Data Quality | Rule-based validation and anomaly detection | Custom quality engine |
| Data Lineage | Track data flow from source to consumption | Column-level lineage tracking |
| Feature Store | Unified feature management for ML models | MinIO + versioned datasets |
| Vector Storage | High-performance vector database for RAG | Milvus 2.3 integration |

### Model Layer (MLOps)

| Feature | Description | Implementation |
| --------- | ------------- | ---------------- |
| Notebook Environment | JupyterHub with GPU support | K8s-native deployment |
| Distributed Training | Multi-GPU, multi-node training | Ray integration |
| Model Registry | Version control for models | MLflow-compatible API |
| Model Serving | High-throughput inference | vLLM with OpenAI-compatible API |
| Experiment Tracking | Log metrics, parameters, artifacts | Built-in tracking system |
| A/B Deployment | Gradual rollout with traffic splitting | Istio service mesh |

### Agent Layer (LLMOps)

| Feature | Description | Implementation |
| --------- | ------------- | ---------------- |
| RAG Pipeline | End-to-end retrieval-augmented generation | LangChain + Milvus |
| Agent Orchestration | Multi-agent systems with tool use | Custom agent framework |
| Visual Workflow | Drag-and-drop workflow builder | ReactFlow canvas |
| Prompt Management | Template library with versioning | A/B testing support |
| Knowledge Base | Document ingestion and chunking | PDF, DOCX, Markdown support |
| Text-to-SQL | Natural language database queries | Metadata-enhanced prompts |
| Token Tracking | Usage monitoring and cost control | Per-request token counting |

### Platform Administration

| Feature | Description | Implementation |
| --------- | ------------- | ---------------- |
| Identity Management | SSO with OIDC/SAML support | Keycloak 23.0 |
| Access Control | Fine-grained RBAC | Role-based permissions |
| Multi-tenancy | Isolated workspaces | Namespace-level isolation |
| Audit Logging | Comprehensive activity tracking | Searchable audit trail |
| Observability | Metrics, traces, logs | Prometheus + Grafana + Jaeger |

---

## Architecture

### Four-Layer Architecture

```
┌───────────────────────────────────────────────────────────────────────────┐
│ L4 APPLICATION LAYER (Agent) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ RAG Pipeline│ │Agent System │ │ Workflow │ │ Text-to-SQL │ │
│ │ • Embedding │ │ • Planning │ │ • ReactFlow │ │ • Schema │ │
│ │ • Retrieval │ │ • Tool Use │ │ • Nodes │ │ • Query Gen │ │
│ │ • Generation│ │ • Memory │ │ • Execution │ │ • Results │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└───────────────────────────────────────────────────────────────────────────┘
↕ OpenAI-Compatible API / Metadata Injection
┌───────────────────────────────────────────────────────────────────────────┐
│ L3 ALGORITHM ENGINE LAYER (Model) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Notebook │ │ Distributed │ │ Model │ │ Inference │ │
│ │ • Jupyter │ │ Training │ │ Registry │ │ • vLLM │ │
│ │ • GPU │ │ • Ray │ │ • Versions │ │ • Batching │ │
│ │ • Kernels │ │ • Multi-GPU │ │ • Artifacts │ │ • Scaling │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└───────────────────────────────────────────────────────────────────────────┘
↕ Dataset Mounting / Feature Retrieval
┌───────────────────────────────────────────────────────────────────────────┐
│ L2 DATA FOUNDATION LAYER (Data) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Integration │ │ ETL Engine │ │ Governance │ │ Storage │ │
│ │ • Connectors│ │ • Flink │ │ • Metadata │ │ • MinIO │ │
│ │ • CDC │ │ • Spark │ │ • Quality │ │ • Milvus │ │
│ │ • Streaming │ │ • Transform │ │ • Lineage │ │ • Redis │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└───────────────────────────────────────────────────────────────────────────┘
↕ Storage Protocol / Resource Scheduling
┌───────────────────────────────────────────────────────────────────────────┐
│ L1 INFRASTRUCTURE LAYER (Kubernetes) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Compute │ │ Storage │ │ Network │ │ Observability│ │
│ │ • CPU Pool │ │ • PVC │ │ • Istio │ │ • Prometheus│ │
│ │ • GPU Pool │ │ • MinIO │ │ • Ingress │ │ • Grafana │ │
│ │ • Auto-scale│ │ • HDFS │ │ • DNS │ │ • Jaeger │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└───────────────────────────────────────────────────────────────────────────┘
```

### Core Services

| Service | Port | Framework | Description |
| --------- | ------ | ----------- | ------------- |
| web | 3000 | React + Vite | Main application frontend |
| agent-api | 8000 | Flask | LLMOps orchestration service |
| data-api | 8001 | Flask | Data governance service |
| model-api | 8002 | FastAPI | MLOps management service |
| openai-proxy | 8003 | FastAPI | OpenAI-compatible proxy |
| admin-api | 8004 | Flask | Platform administration |
| ocr-service | 8005 | FastAPI | Document recognition |
| behavior-service | 8006 | Flask | User analytics |

### Integration Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│ Integration Points │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ Data → Model (90%) ┌──────────┐ │
│ │ Data │ ─────────────────────────▶ │ Model │ │
│ │ Layer │ • Unified storage (MinIO) │ Layer │ │
│ │ │ • Dataset versioning │ │ │
│ │ │ • Auto dataset registry │ │ │
│ └──────────┘ └──────────┘ │
│ │ │ │
│ │ │ │
│ │ Data → Agent (75%) Model → Agent (85%) │
│ │ • Metadata injection • OpenAI API │
│ │ • Text-to-SQL • vLLM serving │
│ │ • Schema context • Model routing │
│ ▼ ▼ │
│ ┌──────────┐ │
│ │ Agent │ │
│ │ Layer │ │
│ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Quick Start

### Prerequisites

| Requirement | Version | Notes |
| ------------ | --------- | ------- |
| Docker | 20.10+ | Required for all deployment options |
| Docker Compose | 2.0+ | For local development |
| Node.js | 18+ | For frontend development |
| Python | 3.10+ | For backend development |
| kubectl | 1.25+ | For Kubernetes deployment |
| Helm | 3.x | For Helm deployment |

### Option 1: Docker Compose (Development)

```bash
# Clone the repository
git clone https://github.com/iannil/one-data-studio.git
cd one-data-studio

# Configure environment
cp .env.example .env
# Edit .env to set passwords: MYSQL_PASSWORD, REDIS_PASSWORD, MINIO_SECRET_KEY, etc.

# Start all services
docker-compose -f deploy/local/docker-compose.yml up -d

# Check status
docker-compose -f deploy/local/docker-compose.yml ps

# View logs
docker-compose -f deploy/local/docker-compose.yml logs -f
```

Using Makefile:

```bash
make dev # Start development environment
make dev-status # Check service status
make dev-logs # View service logs
make dev-stop # Stop all services
make dev-clean # Clean up volumes
```

### Option 2: Kubernetes (Production)

```bash
# Create a local Kind cluster (for testing)
make kind-cluster

# Install with Kustomize
kubectl apply -k deploy/kubernetes/overlays/production

# Or install with Helm
helm install one-data deploy/helm/charts/one-data \
--namespace one-data \
--create-namespace \
--values deploy/helm/charts/one-data/values-production.yaml

# Check status
kubectl get pods -n one-data

# Forward ports for local access
make forward
```

### Access the Platform

| Service | URL | Credentials |
| --------- | ----- | ------------- |
| Web UI | | - |
| Agent API | | - |
| Data API | | - |
| Model API | | - |
| OpenAI Proxy | | API Key |
| Keycloak | | admin/admin |
| MinIO | | minioadmin/minioadmin |
| Grafana | | admin/admin |
| Prometheus | | - |

---

## Use Cases

### 1. Enterprise Knowledge Center

Scenario: Enterprises have scattered documents across departments — policies, procedures, technical docs, FAQs. Employees struggle to find information quickly.