https://github.com/royxlead/production-drift-detection
Production ML monitoring library - KL, PSI, MMD, and ADWIN drift detectors with empirical benchmarks, confidence tracking, and a 6-page FastAPI dashboard.
https://github.com/royxlead/production-drift-detection
data-drift drift-detection fastapi kl-divergence mlops mmd model-monitoring production-ml psi pytorch scikit-learn uncertainty-quantification
Last synced: 1 day ago
JSON representation
Production ML monitoring library - KL, PSI, MMD, and ADWIN drift detectors with empirical benchmarks, confidence tracking, and a 6-page FastAPI dashboard.
- Host: GitHub
- URL: https://github.com/royxlead/production-drift-detection
- Owner: royxlead
- License: mit
- Created: 2026-06-06T05:39:36.000Z (19 days ago)
- Default Branch: main
- Last Pushed: 2026-06-14T15:50:47.000Z (10 days ago)
- Last Synced: 2026-06-14T17:23:33.244Z (10 days ago)
- Topics: data-drift, drift-detection, fastapi, kl-divergence, mlops, mmd, model-monitoring, production-ml, psi, pytorch, scikit-learn, uncertainty-quantification
- Language: Python
- Homepage:
- Size: 208 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Production Drift Detection
### Real-Time Data Drift Detection for Production ML Systems
> A lightweight, pip-installable Python library for monitoring data drift and model confidence in production ML systems. Detects when real-world data begins diverging from the training distribution - before model performance visibly degrades.
---
## Table of Contents
- [The Problem](#the-problem)
- [What This Does](#what-this-does)
- [Benchmark Results](#benchmark-results)
- [Research: Confidence-Drift Correlation](#research-confidence-drift-correlation)
- [Drift Detection Methods](#drift-detection-methods)
- [Architecture](#architecture)
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Confidence Monitoring](#confidence-monitoring)
- [Model Integrations](#model-integrations)
- [Synthetic Drift Generation](#synthetic-drift-generation)
- [Dashboard](#dashboard)
- [Evaluation](#evaluation)
- [Roadmap](#roadmap)
- [Related Work](#related-work)
- [Citation](#citation)
---
## The Problem
Ground truth labels arrive late in almost every production ML system:
- Credit risk: defaults take months to confirm
- Medical diagnosis: pathology results take days to weeks
- Fraud detection: chargebacks take weeks to materialize
- Recommendations: satisfaction is measured indirectly
By the time accuracy degradation is confirmed, the model may have been producing poor predictions for thousands of inferences. **Unsupervised monitoring - detecting degradation without labels - is essential for production ML reliability.**
---
## What This Does
Production Drift Detection provides:
1. **Four drift detectors** (KL Divergence, PSI, MMD, ADWIN) with a unified `fit/score/detect/summary` API
2. **Stream monitoring** with batch ingestion and rolling statistics
3. **Confidence monitoring** tracking entropy, margin, and trend signals
4. **Confidence-Drift Correlation module** - empirically testing whether confidence degradation precedes drift detection
5. **Alerting system** with four severity levels: Healthy, Watch, Warning, Critical
6. **Interactive 6-page dashboard** (FastAPI + Chart.js)
7. **Synthetic drift generator** with 8 drift types for reproducible experiments
8. **Model integrations** for Scikit-learn, PyTorch, and HuggingFace
---
## Benchmark Results
Rigorous benchmark across **20 random seeds**, 5 features, 2000 reference samples, drift magnitude 2.0 (10-batch ramp + 40 full-strength batches). All metrics reported as mean ± 2×SEM (~95% CI).
### Core Metrics
| Detector | FPR (clean) | Detection Rate | ROC AUC | Cohen's d (strong) | Composite Rank |
|---|---|---|---|---|---|
| **MMD** | **0.0%** | 99.9% ± 0.2% | **1.0000** | 6.38 ± 0.43 | **#1 (0.9225)** |
| PSI | 39.9% ± 3.0% | **100.0%** | **1.0000** | 4.62 ± 0.26 | #2 (0.8537) |
| KL Divergence | **0.0%** | 95.2% ± 1.6% | 0.9995 ± 0.0008 | 1.18 ± 0.11 | #3 (0.7831) |
| ADWIN | 46.5% ± 6.2% | 97.7% ± 1.0% | 0.9751 ± 0.0079 | 2.37 ± 0.13 | #4 (0.7187) |
### Detection Latency
All four detectors achieve **median latency of 0 batches** - drift is flagged in the same batch it begins. KL Divergence averages 0.20 batches (max 1 batch across all seeds), making all methods effectively immediate.
| Detector | Median Latency | Mean Latency | % Detected |
|---|---|---|---|
| KL Divergence | 0 batches | 0.20 batches | 100% |
| PSI | 0 batches | 0.00 batches | 100% |
| MMD | 0 batches | 0.00 batches | 100% |
| ADWIN | 0 batches | 0.00 batches | 100% |
### Threshold Calibration (Youden's F1-Max)
Default thresholds are well-calibrated for KL and MMD. PSI and ADWIN benefit from recalibration:
| Detector | Default Threshold | Optimal Threshold | F1 (default) | F1 (optimal) | Gain |
|---|---|---|---|---|---|
| KL Divergence | 0.1000 | 0.1000 | 0.9708 | 0.9708 | +0.0000 |
| PSI | 0.1000 | 0.1898 | 0.7567 | 0.9499 | **+0.1932** |
| MMD | 0.0500 | 0.0108 | 0.9994 | **1.0000** | +0.0006 |
| ADWIN | 0.1000 | 0.3778 | 0.7700 | 0.9573 | **+0.1872** |
**Recommendation:** Use MMD as the primary detector (zero FPR, perfect AUC, immediate detection). Pair with KL Divergence as a secondary signal. Recalibrate PSI and ADWIN thresholds before production deployment.
---
## Research: Confidence-Drift Correlation
### Hypothesis
> **H1:** Under gradual covariate shift, changes in model confidence precede changes in drift scores by k time steps.
>
> **H2:** The lead-lag relationship is asymmetric - confidence is more likely to lead drift than vice versa.
### Empirical Validation
Tested across **30 trials** (3 experiment types x 10 seeds), two model classes (LogisticRegression, PyTorch NN), two drift types (covariate shift, perturbation).
| Experiment | H1 Support (KL) | H1 Support (PSI) | H1 Support (MMD) | Avg EWS | Conf Drop | Acc Drop |
|---|---|---|---|---|---|---|
| sklearn - covariate shift | 4/10 (40%) | 3/10 (30%) | 2/10 (20%) | 42/100 | -2.69% | -1.45% |
| PyTorch NN - covariate shift | 3/10 (30%) | 3/10 (30%) | 3/10 (30%) | 42/100 | -1.42% | +6.01% |
| sklearn - perturbation | 2/10 (20%) | 5/10 (50%) | 1/10 (10%) | 44/100 | -2.35% | -1.45% |
### Conclusion
**H1 is not supported.** Confidence precedes drift in only 20-40% of trials depending on detector and model class - not reliably enough to serve as a universal early warning signal. Across all 30 trials, roughly 27-30% of detector-seed pairs show confidence leading drift.
This is an honest empirical result. The confidence-drift lead-lag relationship is **detector-dependent and non-deterministic** under gradual covariate shift. The Confidence-Drift Correlation module remains useful as a diagnostic and monitoring signal, but should not be relied upon as a consistent early tripwire without further investigation into conditions where leading behaviour emerges.
Cross-correlation analysis does show meaningful **co-movement** between confidence and drift signals (max correlations 0.38-0.77 across detectors), confirming that confidence tracks drift even when it does not reliably precede it.
```python
from production-drift-detection.correlation.confidence_drift import ConfidenceDriftCorrelation
correlator = ConfidenceDriftCorrelation(max_lag=10)
result = correlator.analyze(confidence_history, drift_history)
print(f"Lead-lag: {result['lead_lag_steps']} steps")
print(f"Early warning score: {result['early_warning_score']}/100")
```
---
## Drift Detection Methods
### KL Divergence
Measures divergence between reference and current probability distributions. Best for categorical features and probability outputs. Laplace smoothing for numerical stability. **Zero FPR in benchmarks, ROC AUC 0.9995.** Default threshold (0.1) is optimally calibrated - no recalibration needed.
### Population Stability Index (PSI)
Measures feature distribution stability over time using binned proportions. Standard convention: PSI < 0.1 (stable), 0.1-0.25 (moderate shift), > 0.25 (significant shift). **Perfect detection rate (100%) but high FPR (39.9%)** - recalibrate threshold to 0.1898 for production use.
### Maximum Mean Discrepancy (MMD)
Kernel-based two-sample test using RBF kernels. Supports multivariate distributions. Median heuristic for automatic bandwidth selection. **Top-ranked detector: zero FPR, perfect AUC (1.0000), Cohen's d of 6.38 under strong drift.** Optimal threshold is 0.0108, not the default 0.05.
### ADWIN-Style Adaptive Windowing
Online drift detection for streaming data. Adaptive window resizing based on Hoeffding-bound change detection. No need to store full historical data. **High FPR (46.5%) and weaker perturbation AUC (0.65)** - requires threshold recalibration to 0.3778. Best suited for scenarios requiring online, memory-efficient detection.
---
## Architecture
```
Production Data Stream
|
v
+---------------------+
| Stream Monitor | Batch ingestion + rolling statistics
| | Coordinates all detectors
+----------+----------+
|
+------+------+
v v
+---------+ +------------------+
| Drift | | Confidence |
|Detectors| | Monitor |
|KL/PSI/ | | Entropy, Margin, |
|MMD/ADWIN| | Trend Analysis |
+----+----+ +--------+---------+
| |
+-------+--------+
|
v
+---------------------+
| Confidence-Drift | Lead-lag correlation analysis
| Correlation Module | Early warning scoring
+----------+----------+
|
v
+---------------------+
| Alert Engine | Threshold + rolling window rules
| | Healthy / Watch / Warning / Critical
+----------+----------+
|
v
Dashboard + Alerts
```
---
## Repository Structure
```
production-drift-detection/
|
+-- src/ # Core library source
| +-- detectors/ # KL, PSI, MMD, ADWIN
| +-- monitors/ # StreamMonitor, ConfidenceMonitor
| +-- alerts/ # Alert schemas, rules, engine
| +-- correlation/ # Confidence-drift correlation module
| +-- data/ # Synthetic drift generator, loaders
| +-- integrations/ # Sklearn, PyTorch, HuggingFace adapters
| +-- evaluation/ # Metrics, benchmarks
| +-- dashboard/ # FastAPI backend + Chart.js frontend
| +-- utils/ # Validation, logging, statistics
|
+-- notebooks/ # Experiment notebooks
+-- tests/ # Test suite
+-- benchmarks.py # Standard benchmark runner
+-- benchmark_rigorous.py # 20-seed rigorous benchmark suite
+-- empirical_validation.py # Confidence-drift correlation validation
+-- demo.py # Demo + dashboard launcher
+-- pyproject.toml
+-- LICENSE
```
---
## Installation
```bash
# From source
git clone https://github.com/royxlead/production-drift-detection.git
cd production-drift-detection
pip install -e .
# With optional dependencies
pip install -e ".[pytorch]" # PyTorch support
pip install -e ".[transformers]" # HuggingFace support
pip install -e ".[dev]" # Testing + notebooks
pip install -e ".[all]" # Everything
```
**Requirements:** Python 3.9+ · NumPy 2.4+ · Pandas 3.0+
---
## Quick Start
```python
import numpy as np
from production-drift-detection.detectors.mmd import MMDDetector # Recommended
from production-drift-detection.detectors.kl import KLDivergenceDetector
from production-drift-detection.detectors.psi import PSIDetector
# Reference distribution (training data)
reference = np.random.normal(0, 1, (2000, 5))
# Production data (shifted)
production = np.random.normal(2, 1, (500, 5))
# Unified API across all detectors
detector = MMDDetector(threshold=0.0108) # Use calibrated threshold
detector.fit(reference)
result = detector.detect(production)
print(f"Drift detected: {result['drift_detected']}")
print(f"Score: {result['score']:.4f}")
```
**Stream monitoring:**
```python
from production-drift-detection.monitors.stream_monitor import StreamMonitor
monitor = StreamMonitor()
monitor.fit(reference)
for batch_idx in range(10):
batch = np.random.normal(batch_idx * 0.2, 1, (100, 5))
result = monitor.process_batch(batch)
print(f"Batch {batch_idx}: Status={result['status']}")
```
---
## Confidence Monitoring
```python
from production-drift-detection.monitors.confidence_monitor import ConfidenceMonitor
monitor = ConfidenceMonitor()
for batch_predictions in prediction_stream:
status = monitor.update(batch_predictions)
print(f"Mean confidence: {status['mean_confidence']:.3f}")
print(f"Entropy trend: {status['entropy_trend']}")
print(f"Degradation detected: {status['degradation_detected']}")
```
**Tracked metrics:** Mean confidence, predictive entropy, margin (top-2 gap), trend direction, ECE approximation, over/underconfidence ratios.
---
## Model Integrations
**Scikit-learn:**
```python
from production-drift-detection.integrations.sklearn_adapter import SklearnAdapter
adapter = SklearnAdapter(sklearn_model)
result = adapter.predict(X_test, y_test)
print(f"Drift: {result['drift']['drift_detected']}")
print(f"Confidence: {result['confidence']['mean_confidence']:.3f}")
```
**PyTorch:**
```python
from production-drift-detection.integrations.pytorch_adapter import PyTorchAdapter
adapter = PyTorchAdapter(torch_model)
result = adapter.predict(X_tensor)
```
**HuggingFace:**
```python
from production-drift-detection.integrations.hf_adapter import HFAdapter
adapter = HFAdapter() # DistilBERT SST-2 by default
result = adapter.predict_text(["Great product!", "Terrible experience."])
```
---
## Synthetic Drift Generation
8 drift types for reproducible research:
| Drift Type | Description |
|---|---|
| Covariate Shift | Shift feature means |
| Prior Shift | Change class balance |
| Gradual Drift | Slowly increasing shift over time |
| Sudden Drift | Abrupt point shift |
| Missingness Drift | Inject NaN values |
| Feature Perturbation | Add noise to a subset of features |
| Gaussian Noise | Add noise to all features |
| Feature Corruption | Set features to constant values |
```python
from production-drift-detection.data.synthetic_drift import DriftGenerator
generator = DriftGenerator(n_features=5, random_state=42)
reference = generator.generate_reference(n_samples=2000)
shifted = generator.gradual_drift(n_samples=500, drift_magnitude=2.0)
```
---
## Dashboard
```bash
python -m production-drift-detection.dashboard.server
# or
python demo.py --dashboard
```
**6 pages:**
1. Overview - current status, active alerts, recent scores
2. Drift Monitoring - PSI, MMD, KL, ADWIN score trends
3. Feature Analysis - per-feature drift heatmap
4. Confidence Monitoring - confidence, entropy, margin history
5. Confidence-Drift Correlation - lead-lag plots, cross-correlation
6. Alerts - filterable log with severity indicators
---
## Evaluation
Run the rigorous benchmark (20 seeds, ~45 seconds):
```bash
python benchmark_rigorous.py
```
Run the confidence-drift empirical validation:
```bash
python empirical_validation.py
```
Run standard benchmarks programmatically:
```python
from production-drift-detection.evaluation.benchmarks import BenchmarkFramework
from production-drift-detection.detectors.kl import KLDivergenceDetector
from production-drift-detection.detectors.mmd import MMDDetector
framework = BenchmarkFramework(detectors=[
KLDivergenceDetector(),
MMDDetector(),
])
results = framework.run_benchmark(drift_magnitude=2.0)
sensitivity = framework.run_sensitivity_analysis(
magnitudes=[0.0, 0.5, 1.0, 2.0, 3.0]
)
```
**Testing:**
```bash
pytest # All tests
pytest --cov=production-drift-detection # With coverage
pytest tests/test_detectors.py # Specific module
pytest -m "integration" # Integration tests only
```
---
## Research Background
Production Drift Detection connects to established literature on distribution shift and uncertainty estimation:
- **Confidence-accuracy gap under shift** - Guo et al. (2017) showed modern networks are systematically overconfident. Under distribution shift, ECE increases before accuracy degrades.
- **Entropy as uncertainty signal** - High entropy predictions signal OOD inputs. Production Drift Detection tracks entropy trends as a population-level monitoring signal.
- **Self-diagnosing models** - Leibig et al. (2017) demonstrated neural networks can estimate their own uncertainty. Production Drift Detection operationalizes this at the monitoring layer.
- **Bayesian uncertainty** - MC Dropout (Gal and Ghahramani, 2016) approximates epistemic uncertainty. Production Drift Detection's ConfidenceMonitor is designed to be compatible with MC Dropout outputs.
**Why confidence tracks drift even without leading it:**
Confidence is continuous - it responds to small perturbations. Accuracy is discontinuous - a prediction is right or wrong. Both signals are informative; their relationship is model-dependent and drift-type-dependent rather than universally ordered.
---
## Roadmap
- [ ] MC Dropout integration for improved epistemic uncertainty estimates
- [ ] Deep kernel learning for MMD-based drift detection
- [ ] Conformal prediction interval monitoring
- [ ] Adaptive threshold calibration (auto-recalibration from validation data)
- [ ] REST API for production deployment
- [ ] Kubernetes-native deployment support
- [ ] Real-time alerting (Slack, PagerDuty, email)
- [ ] SQL/NoSQL storage for drift history
- [ ] Multi-model orchestration
- [ ] A/B test monitoring support
---
## Related Work
- [Unsupervised Confidence Estimation](https://github.com/royxlead/unsupervised-confidence-estimation) - Unsupervised Confidence Estimation
- [CURA](https://github.com/royxlead/cura-python) - RAG reliability and hallucination mitigation
- [AutoLLM Forge](https://github.com/royxlead/autollmforge-python) - Efficient LLM fine-tuning
---
## Citation
```bibtex
@software{roy2026production-drift-detection,
author = {Roy, Sourav},
title = {Production Drift Detection: Real-Time Data Drift Detection for Production ML Systems},
year = {2026},
url = {https://github.com/royxlead/production-drift-detection}
}
```
---
Built by Sourav Roy · Founding AI/ML Engineer · Yuga AI