An open API service indexing awesome lists of open source software.

https://github.com/devsuthar-ai/mlops-pipeline-framework

πŸ”§ Production ML pipeline framework for model training, deployment, and monitoring. Features: MLflow, Airflow, FastAPI serving, Prometheus monitoring, Kubernetes deployment.
https://github.com/devsuthar-ai/mlops-pipeline-framework

airflow ci-cd data-science docker fastapi kubernetes machine-learning ml-pipeline mlflow mlops model-deployment python

Last synced: 7 months ago
JSON representation

πŸ”§ Production ML pipeline framework for model training, deployment, and monitoring. Features: MLflow, Airflow, FastAPI serving, Prometheus monitoring, Kubernetes deployment.

Awesome Lists containing this project

README

          

# πŸ”§ MLOps Pipeline Framework

### *Production Machine Learning Operations Platform*

[![Python](https://img.shields.io/badge/Python-3.11+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://www.python.org/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.104+-009688?style=for-the-badge&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
[![MLflow](https://img.shields.io/badge/MLflow-2.8+-0194E2?style=for-the-badge&logo=mlflow&logoColor=white)](https://mlflow.org/)
[![Kubernetes](https://img.shields.io/badge/Kubernetes-326CE5?style=for-the-badge&logo=kubernetes&logoColor=white)](https://kubernetes.io/)
[![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=for-the-badge&logo=docker&logoColor=white)](https://www.docker.com/)
[![License](https://img.shields.io/badge/License-MIT-yellow?style=for-the-badge)](LICENSE)

**[Live Demo](#) β€’ [Documentation](#) β€’ [API Docs](#) β€’ [Report Bug](../../issues) β€’ [Request Feature](../../issues)**

---

### 🎯 *End-to-end ML pipeline orchestration for production deployments*

Built with ❀️ by [Dev Suthar](https://github.com/devsuthar-ai) | ⭐ **Star us on GitHub!**

---

## πŸ“‹ Table of Contents

- [✨ Features](#-features)
- [🎬 Demo](#-demo)
- [πŸ—οΈ Architecture](#️-architecture)
- [πŸš€ Quick Start](#-quick-start)
- [πŸ“– Documentation](#-documentation)
- [πŸ”§ Pipeline Components](#-pipeline-components)
- [πŸ§ͺ Model Training](#-model-training)
- [🚒 Model Deployment](#-model-deployment)
- [πŸ“Š Monitoring](#-monitoring)
- [πŸ› οΈ Tech Stack](#️-tech-stack)
- [🀝 Contributing](#-contributing)
- [πŸ“„ License](#-license)

---

## ✨ Features

### 🎯 **Core Capabilities**
- πŸ“Š **Data Pipeline** - Automated ingestion & validation
- πŸ€– **Model Training** - Distributed training orchestration
- 🎯 **Hyperparameter Tuning** - Automated optimization
- πŸ“ˆ **Experiment Tracking** - MLflow integration
- πŸš€ **Model Deployment** - One-click deployment
- πŸ“‰ **Model Monitoring** - Real-time performance tracking

### πŸ› οΈ **Technical Excellence**
- ☸️ **Kubernetes Native** - Auto-scaling & orchestration
- πŸ”„ **CI/CD Integration** - Automated ML workflows
- πŸ“Š **Observability** - Prometheus + Grafana
- πŸ§ͺ **A/B Testing** - Model comparison
- πŸ”„ **Auto-Retraining** - Scheduled model updates
- πŸ“¦ **Model Registry** - Version management

---

## 🎬 Demo

### πŸ–ΌοΈ **Platform Screenshots**

**πŸ“Š Pipeline Dashboard**
![Pipeline](https://via.placeholder.com/500x300/667eea/ffffff?text=Pipeline+Orchestration)

**πŸ“ˆ Model Tracking**
![Training](https://via.placeholder.com/500x300/764ba2/ffffff?text=Model+Training)

**πŸš€ Deployment Manager**
![Deploy](https://via.placeholder.com/500x300/f093fb/ffffff?text=Model+Deployment)

**πŸ“‰ Performance Monitoring**
![Monitor](https://via.placeholder.com/500x300/4facfe/ffffff?text=Real-time+Monitoring)

---

## πŸ—οΈ Architecture

```mermaid
graph TB
subgraph "Data Layer"
A[Data Sources] --> B[Data Ingestion Service]
B --> C[Data Validation]
C --> D[(Feature Store)]
end

subgraph "Training Layer"
D --> E[Training Pipeline]
E --> F[Hyperparameter Tuning]
F --> G[Model Evaluation]
G --> H[(Model Registry)]
end

subgraph "Serving Layer"
H --> I[Model Deployment]
I --> J[A/B Testing]
J --> K[Prediction Service]
K --> L[Load Balancer]
end

subgraph "Monitoring Layer"
K --> M[Performance Monitor]
M --> N[Drift Detection]
N --> O[Alert System]
O --> E
end

subgraph "Orchestration"
P[Airflow] -.-> E
P -.-> I
P -.-> M
end

subgraph "Observability"
Q[Prometheus] --> R[Grafana]
K --> Q
M --> Q
end

style E fill:#667eea
style I fill:#764ba2
style M fill:#f093fb
style H fill:#4facfe
```

### πŸ“ **System Components**

| Component | Technology | Purpose |
|-----------|-----------|---------|
| **Orchestration** | Airflow | Workflow management |
| **Experiment Tracking** | MLflow | Model versioning |
| **Model Serving** | FastAPI | High-performance API |
| **Feature Store** | Feast | Feature management |
| **Model Registry** | MLflow | Model storage |
| **Container Runtime** | Docker | Containerization |
| **Orchestration** | Kubernetes | Container orchestration |
| **Monitoring** | Prometheus + Grafana | Metrics & visualization |
| **Streaming** | Kafka | Real-time data |
| **Storage** | MinIO | Model artifacts |

---

## πŸš€ Quick Start

### Prerequisites

```bash
# Required
- Python 3.11+
- Docker & Docker Compose
- Kubernetes cluster (optional)

# Optional for production
- MLflow server
- Airflow instance
- Prometheus + Grafana
```

### ⚑ One-Command Setup

```bash
# Clone repository
git clone https://github.com/devsuthar-ai/mlops-pipeline-framework.git
cd mlops-pipeline-framework

# Start all services
docker-compose up -d

# πŸŽ‰ Done! Access services:
# API: http://localhost:8001
# MLflow UI: http://localhost:5000
# Airflow UI: http://localhost:8080
# Grafana: http://localhost:3000
```

### 🐍 Local Development

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Start API server
python src/main.py

# In new terminal, start MLflow
mlflow server --host 0.0.0.0 --port 5000

# In new terminal, start Airflow
airflow standalone
```

---

## πŸ“– Documentation

### πŸ“š **Complete Guides**

- [πŸ—οΈ Architecture Overview](docs/ARCHITECTURE.md)
- [πŸ“Š Data Pipeline Guide](docs/DATA_PIPELINE.md)
- [πŸ€– Model Training](docs/TRAINING.md)
- [πŸš€ Deployment Guide](docs/DEPLOYMENT.md)
- [πŸ“ˆ Monitoring & Alerts](docs/MONITORING.md)
- [πŸ”§ Configuration](docs/CONFIGURATION.md)

---

## πŸ”§ Pipeline Components

### πŸ“Š **1. Data Ingestion Pipeline**

```python
from src.data.ingestion import DataPipeline

# Initialize pipeline
pipeline = DataPipeline(
source="s3://my-bucket/data",
destination="feature_store",
validation_rules={"schema": "v1.0"}
)

# Run ingestion
result = pipeline.ingest()
print(f"Ingested {result['rows']} rows")
```

**Features:**
- βœ… Multi-source support (S3, GCS, local, databases)
- βœ… Schema validation
- βœ… Data quality checks
- βœ… Incremental loading
- βœ… Error handling & retry logic

---

### πŸ€– **2. Model Training Pipeline**

```python
from src.models.training import TrainingPipeline

# Configure training
config = {
"model_type": "random_forest",
"hyperparameters": {
"n_estimators": 100,
"max_depth": 10,
"min_samples_split": 5
},
"training_data": "feature_store://train_v1",
"validation_split": 0.2
}

# Initialize and run
pipeline = TrainingPipeline(config)
model = pipeline.train()

# Track with MLflow
pipeline.log_metrics({
"accuracy": 0.95,
"f1_score": 0.93,
"training_time": 120.5
})
```

**Features:**
- βœ… Distributed training (PyTorch, TensorFlow)
- βœ… Automated hyperparameter tuning
- βœ… Cross-validation
- βœ… Early stopping
- βœ… Checkpointing
- βœ… MLflow integration

---

### πŸ“ˆ **3. Model Evaluation**

```python
from src.models.evaluation import ModelEvaluator

# Evaluate model
evaluator = ModelEvaluator(model)
metrics = evaluator.evaluate(test_data)

print(f"""
Evaluation Results:
- Accuracy: {metrics['accuracy']:.3f}
- Precision: {metrics['precision']:.3f}
- Recall: {metrics['recall']:.3f}
- F1 Score: {metrics['f1']:.3f}
- AUC-ROC: {metrics['auc']:.3f}
""")

# Generate reports
evaluator.generate_report(output_path="reports/")
```

**Metrics Tracked:**
- Accuracy, Precision, Recall, F1
- ROC-AUC, PR-AUC
- Confusion Matrix
- Feature Importance
- Prediction Distribution

---

### πŸš€ **4. Model Deployment**

```python
from src.serving.deployment import ModelDeployer

# Deploy model
deployer = ModelDeployer(
model_uri="models:/production/RandomForest/v3",
environment="production",
replicas=3,
resources={
"cpu": "2",
"memory": "4Gi"
}
)

deployment = deployer.deploy()
print(f"Deployed at: {deployment['endpoint']}")

# Test endpoint
response = requests.post(
deployment['endpoint'],
json={"features": [1.2, 3.4, 5.6]}
)
print(f"Prediction: {response.json()['prediction']}")
```

**Deployment Features:**
- βœ… Rolling updates (zero downtime)
- βœ… Canary deployments
- βœ… A/B testing
- βœ… Auto-scaling
- βœ… Health checks
- βœ… Load balancing

---

### πŸ“Š **5. Monitoring & Alerting**

```python
from src.monitoring.monitor import ModelMonitor

# Setup monitoring
monitor = ModelMonitor(
model_name="RandomForest",
metrics=["accuracy", "latency", "throughput"],
alert_thresholds={
"accuracy_drop": 0.05,
"latency_p95": 500 # ms
}
)

# Start monitoring
monitor.start()

# View dashboard
monitor.show_dashboard()
```

**Monitored Metrics:**
- Model performance (accuracy, F1, etc.)
- Prediction latency (p50, p95, p99)
- Throughput (predictions/sec)
- Resource usage (CPU, memory)
- Data drift detection
- Concept drift detection

---

## πŸ§ͺ Model Training

### Training Script Example

```python
# train.py
import mlflow
from sklearn.ensemble import RandomForestClassifier
from src.data import load_data
from src.models import train_model, evaluate_model

# Load data
X_train, y_train, X_test, y_test = load_data()

# Start MLflow run
with mlflow.start_run():
# Train model
model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42
)
model.fit(X_train, y_train)

# Evaluate
metrics = evaluate_model(model, X_test, y_test)

# Log to MLflow
mlflow.log_params({
"n_estimators": 100,
"max_depth": 10
})
mlflow.log_metrics(metrics)
mlflow.sklearn.log_model(model, "model")

print(f"Model trained! Accuracy: {metrics['accuracy']:.3f}")
```

### Run Training

```bash
# Local training
python train.py

# Distributed training
python -m torch.distributed.launch train_distributed.py

# With Airflow
airflow dags trigger training_pipeline
```

---

## 🚒 Model Deployment

### Deployment Configuration

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ml-model-serving
spec:
replicas: 3
selector:
matchLabels:
app: ml-model
template:
metadata:
labels:
app: ml-model
spec:
containers:
- name: model-server
image: ml-model:v1.0
ports:
- containerPort: 8001
resources:
requests:
memory: "2Gi"
cpu: "1"
limits:
memory: "4Gi"
cpu: "2"
env:
- name: MODEL_URI
value: "models:/production/latest"
```

### Deploy to Kubernetes

```bash
# Apply deployment
kubectl apply -f deployment.yaml

# Check status
kubectl get pods -l app=ml-model

# Expose service
kubectl expose deployment ml-model-serving --type=LoadBalancer --port=80 --target-port=8001

# Get endpoint
kubectl get svc ml-model-serving
```

---

## πŸ“Š Monitoring

### Prometheus Metrics

```python
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
prediction_counter = Counter(
'model_predictions_total',
'Total predictions made'
)

prediction_latency = Histogram(
'model_prediction_latency_seconds',
'Prediction latency'
)

model_accuracy = Gauge(
'model_accuracy',
'Current model accuracy'
)

# Use in code
@prediction_latency.time()
def predict(features):
prediction_counter.inc()
result = model.predict(features)
return result
```

### Grafana Dashboards

Access dashboards at `http://localhost:3000`

**Available Dashboards:**
1. **Model Performance**
- Accuracy over time
- Precision/Recall trends
- Confusion matrix heatmap

2. **System Metrics**
- CPU/Memory usage
- Request rate
- Error rate

3. **Prediction Analytics**
- Latency distribution
- Throughput
- Feature distribution

4. **Data Drift**
- Feature drift detection
- Concept drift alerts
- Distribution changes

---

## πŸ› οΈ Tech Stack

### **ML & Data**
![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)
![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)
![TensorFlow](https://img.shields.io/badge/TensorFlow-FF6F00?style=for-the-badge&logo=tensorflow&logoColor=white)
![scikit-learn](https://img.shields.io/badge/scikit--learn-F7931E?style=for-the-badge&logo=scikit-learn&logoColor=white)

### **MLOps Tools**
![MLflow](https://img.shields.io/badge/MLflow-0194E2?style=for-the-badge&logo=mlflow&logoColor=white)
![Airflow](https://img.shields.io/badge/Airflow-017CEE?style=for-the-badge&logo=apache-airflow&logoColor=white)
![Feast](https://img.shields.io/badge/Feast-6E40C9?style=for-the-badge)

### **Infrastructure**
![Docker](https://img.shields.io/badge/Docker-2496ED?style=for-the-badge&logo=docker&logoColor=white)
![Kubernetes](https://img.shields.io/badge/Kubernetes-326CE5?style=for-the-badge&logo=kubernetes&logoColor=white)
![Prometheus](https://img.shields.io/badge/Prometheus-E6522C?style=for-the-badge&logo=prometheus&logoColor=white)
![Grafana](https://img.shields.io/badge/Grafana-F46800?style=for-the-badge&logo=grafana&logoColor=white)

### **API & Serving**
![FastAPI](https://img.shields.io/badge/FastAPI-009688?style=for-the-badge&logo=fastapi&logoColor=white)
![Kafka](https://img.shields.io/badge/Kafka-231F20?style=for-the-badge&logo=apache-kafka&logoColor=white)

---

## πŸ“ Project Structure

```
mlops-pipeline-framework/
β”œβ”€β”€ src/
β”‚ β”œβ”€β”€ data/ # Data pipelines
β”‚ β”‚ β”œβ”€β”€ ingestion/ # Data ingestion
β”‚ β”‚ β”œβ”€β”€ preprocessing/ # Data preprocessing
β”‚ β”‚ └── validation/ # Data validation
β”‚ β”œβ”€β”€ models/ # Model code
β”‚ β”‚ β”œβ”€β”€ training/ # Training logic
β”‚ β”‚ β”œβ”€β”€ evaluation/ # Evaluation
β”‚ β”‚ └── registry/ # Model registry
β”‚ β”œβ”€β”€ serving/ # Model serving
β”‚ β”‚ β”œβ”€β”€ api/ # FastAPI endpoints
β”‚ β”‚ β”œβ”€β”€ batch/ # Batch inference
β”‚ β”‚ └── streaming/ # Stream processing
β”‚ β”œβ”€β”€ monitoring/ # Monitoring
β”‚ β”‚ β”œβ”€β”€ metrics.py # Metrics collection
β”‚ β”‚ └── alerts.py # Alert rules
β”‚ └── orchestration/ # Workflow orchestration
β”‚ └── dags/ # Airflow DAGs
β”œβ”€β”€ pipelines/ # Pipeline definitions
β”‚ β”œβ”€β”€ training_pipeline.py # Training workflow
β”‚ β”œβ”€β”€ inference_pipeline.py # Inference workflow
β”‚ └── retraining_pipeline.py # Auto-retraining
β”œβ”€β”€ tests/ # Tests
β”‚ β”œβ”€β”€ unit/ # Unit tests
β”‚ β”œβ”€β”€ integration/ # Integration tests
β”‚ └── e2e/ # End-to-end tests
β”œβ”€β”€ configs/ # Configuration files
β”‚ β”œβ”€β”€ model_config.yaml # Model configs
β”‚ β”œβ”€β”€ pipeline_config.yaml # Pipeline configs
β”‚ └── deployment_config.yaml # Deployment configs
β”œβ”€β”€ deployments/ # Deployment manifests
β”‚ β”œβ”€β”€ kubernetes/ # K8s manifests
β”‚ └── docker/ # Docker configs
β”œβ”€β”€ docs/ # Documentation
β”œβ”€β”€ monitoring/ # Monitoring configs
β”‚ β”œβ”€β”€ prometheus/ # Prometheus setup
β”‚ └── grafana/ # Grafana dashboards
β”œβ”€β”€ scripts/ # Utility scripts
β”œβ”€β”€ main.py # Application entry
β”œβ”€β”€ requirements.txt # Dependencies
└── README.md # This file
```

---

## πŸš€ Usage Examples

### Complete ML Pipeline

```python
from src.pipeline import MLPipeline

# Initialize pipeline
pipeline = MLPipeline(
name="fraud_detection",
config_path="configs/fraud_model.yaml"
)

# Run full pipeline
results = pipeline.run(
data_source="s3://data/transactions.csv",
experiment_name="fraud_detection_v2"
)

print(f"""
Pipeline Results:
- Model: {results['model_uri']}
- Accuracy: {results['metrics']['accuracy']:.3f}
- Deployment: {results['deployment']['endpoint']}
""")
```

### Batch Inference

```python
from src.serving.batch import BatchPredictor

# Initialize predictor
predictor = BatchPredictor(
model_uri="models:/production/fraud_model/latest"
)

# Run batch predictions
predictions = predictor.predict_batch(
input_path="s3://data/new_transactions.csv",
output_path="s3://predictions/results.csv"
)

print(f"Processed {len(predictions)} predictions")
```

### Real-time Serving

```python
from fastapi import FastAPI
from src.serving import ModelServer

app = FastAPI()
model_server = ModelServer("models:/production/latest")

@app.post("/predict")
async def predict(features: dict):
prediction = model_server.predict(features)
return {
"prediction": prediction,
"model_version": model_server.version,
"latency_ms": model_server.last_latency
}
```

---

## πŸ“Š Performance

### Benchmarks

| Metric | Value | Target |
|--------|-------|--------|
| **Training Time** | 15 min | < 20 min |
| **Inference Latency (p50)** | 25ms | < 50ms |
| **Inference Latency (p95)** | 45ms | < 100ms |
| **Throughput** | 2000 pred/sec | > 1000 pred/sec |
| **Model Accuracy** | 96.5% | > 95% |
| **Deployment Time** | 2 min | < 5 min |

---

## 🀝 Contributing

Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

---

## πŸ“„ License

MIT License - see [LICENSE](LICENSE) file.

---

## πŸ“ž Contact

[![GitHub](https://img.shields.io/badge/GitHub-devsuthar--ai-181717?style=for-the-badge&logo=github)](https://github.com/devsuthar-ai)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Dev_Suthar-0A66C2?style=for-the-badge&logo=linkedin)](https://linkedin.com/in/devsuthar)
[![Email](https://img.shields.io/badge/Email-Contact-D14836?style=for-the-badge&logo=gmail&logoColor=white)](mailto:dev.suthar@example.com)

---

**Made with ❀️ by Dev Suthar**

*Building production ML systems at scale*

⭐ **Star this repo if you find it helpful!**