https://github.com/IFCA-Advanced-Computing/frouros

Frouros: an open-source Python library for drift detection in machine learning systems.

Frouros is a Python library for drift detection in machine learning systems that provides a combination of classical and more recent algorithms for both concept and data drift detection.

"Everything changes and nothing stands still"

"You could not step twice into the same river"

Heraclitus of Ephesus (535-475 BCE)


----

## ⚡️ Quickstart

### 🔄 Concept drift

As a quick example, we can take the breast cancer dataset, induce concept drift in it, and show how a concept drift detector such as DDM (Drift Detection Method) is used. The example also shows how concept drift affects performance in terms of accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from frouros.detectors.concept_drift import DDM, DDMConfig
from frouros.metrics import PrequentialError

np.random.seed(seed=31)

# Load breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Define and fit model
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipeline.fit(X=X_train, y=y_train)

# Detector configuration and instantiation
config = DDMConfig(
    warning_level=2.0,
    drift_level=3.0,
    min_num_instances=25,  # minimum number of instances before checking for concept drift
)
detector = DDM(config=config)

# Metric to compute accuracy
metric = PrequentialError(alpha=1.0) # alpha=1.0 is equivalent to normal accuracy

def stream_test(X_test, y_test, y, metric, detector):
    """Simulate data stream over X_test and y_test. y is the true label."""
    # Note: the `y` argument is shadowed by the loop variable below
    drift_flag = False
    for i, (X, y) in enumerate(zip(X_test, y_test)):
        y_pred = pipeline.predict(X.reshape(1, -1))
        error = 1 - (y_pred.item() == y.item())
        metric_error = metric(error_value=error)
        _ = detector.update(value=error)
        status = detector.status
        # Report only the first drift detection
        if status["drift"] and not drift_flag:
            drift_flag = True
            print(f"Concept drift detected at step {i}. Accuracy: {1 - metric_error:.4f}")
    if not drift_flag:
        print("No concept drift detected")
    print(f"Final accuracy: {1 - metric_error:.4f}\n")

# Simulate data stream (assuming test label available after each prediction)
# No concept drift is expected to occur
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> No concept drift detected
# >> Final accuracy: 0.9766

# IMPORTANT: Induce/simulate concept drift in the last part (20%)
# of y_test by modifying some labels (approx. 50%), therefore changing P(y|X)
drift_size = int(y_test.shape[0] * 0.2)
y_test_drift = y_test[-drift_size:]
modify_idx = np.random.rand(*y_test_drift.shape) <= 0.5
y_test_drift[modify_idx] = (y_test_drift[modify_idx] + 1) % len(np.unique(y_test))
y_test[-drift_size:] = y_test_drift

# Reset detector and metric
detector.reset()
metric.reset()

# Simulate data stream (assuming test label available after each prediction)
# Concept drift is expected to occur because of the label modification
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> Concept drift detected at step 142. Accuracy: 0.9510
# >> Final accuracy: 0.8480
```

More concept drift examples can be found [here](https://frouros.readthedocs.io/en/latest/examples/concept_drift.html).
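
To give some intuition for the `warning_level`, `drift_level` and `min_num_instances` parameters used above, the following is a minimal, self-contained sketch of the DDM rule described by Gama et al. (2004). It is not the Frouros implementation: the `SimpleDDM` class and its attributes are hypothetical names chosen for illustration, and the strict inequality is a simplification. The idea is to track the running error rate `p` and its standard deviation `s`, remember the lowest `p + s` seen so far, and raise a flag when the current `p + s` exceeds that minimum by a multiple of `s_min`.

```python
import math


class SimpleDDM:
    """Illustrative DDM-style detector (hypothetical, not the Frouros class)."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_num_instances=25):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_num_instances = min_num_instances
        self.reset()

    def reset(self):
        self.n = 0
        self.error_rate = 0.0
        self.p_min = math.inf
        self.s_min = math.inf
        self.status = {"warning": False, "drift": False}

    def update(self, error):
        """Consume one 0/1 prediction error and refresh the status flags."""
        self.n += 1
        # Incremental mean of the error stream
        self.error_rate += (error - self.error_rate) / self.n
        std = math.sqrt(self.error_rate * (1.0 - self.error_rate) / self.n)
        if self.n < self.min_num_instances:
            return self.status
        # Remember the lowest p + s observed so far
        if self.error_rate + std < self.p_min + self.s_min:
            self.p_min, self.s_min = self.error_rate, std
        level = self.error_rate + std
        self.status = {
            "warning": level > self.p_min + self.warning_level * self.s_min,
            "drift": level > self.p_min + self.drift_level * self.s_min,
        }
        return self.status


# Synthetic error stream: about 10% errors for 200 steps, then about 50% errors
stream = [1 if i % 10 == 0 else 0 for i in range(200)]
stream += [1 if i % 2 == 0 else 0 for i in range(100)]

detector = SimpleDDM()
first_drift = None
for step, error in enumerate(stream):
    if detector.update(error)["drift"] and first_drift is None:
        first_drift = step
print(f"First drift flagged at step {first_drift}")
```

With this synthetic stream, the flag is raised shortly after the error rate jumps, which mirrors what the quickstart shows once labels are modified in the last part of `y_test`.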

### 📊 Data drift

As a quick example, we can take the iris dataset, induce data drift in it, and show how a data drift detector such as the Kolmogorov-Smirnov test is used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from frouros.detectors.data_drift import KSTest

np.random.seed(seed=31)

# Load iris dataset
X, y = load_iris(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Set the feature index to which detector is applied
feature_idx = 0

# IMPORTANT: Induce/simulate data drift in the selected feature of X_test by
# applying some Gaussian noise, therefore changing P(X)
X_test[:, feature_idx] += np.random.normal(
    loc=0.0,
    scale=3.0,
    size=X_test.shape[0],
)

# Define and fit model
model = DecisionTreeClassifier(random_state=31)
model.fit(X=X_train, y=y_train)

# Set significance level for hypothesis testing
alpha = 0.001
# Define and fit detector
detector = KSTest()
_ = detector.fit(X=X_train[:, feature_idx])

# Apply detector to the selected feature of X_test
result, _ = detector.compare(X=X_test[:, feature_idx])

# Check if drift is taking place
if result.p_value <= alpha:
    print(f"Data drift detected at feature {feature_idx}")
else:
    print(f"No data drift detected at feature {feature_idx}")
# >> Data drift detected at feature 0
# Therefore, we can reject H0 (both samples come from the same distribution).
```

More data drift examples can be found [here](https://frouros.readthedocs.io/en/latest/examples/data_drift.html).
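
For readers unfamiliar with the test used above, the following sketch shows what a two-sample Kolmogorov-Smirnov comparison measures: the largest gap between the empirical CDFs of the reference and test samples. It is a pure-Python illustration under stated assumptions, not how Frouros computes its result. The helper names `ks_statistic` and `ks_rejects_h0` are hypothetical, and instead of a p-value the sketch uses the standard large-sample critical value `c(alpha) * sqrt((n + m) / (n * m))` with `c(alpha) = sqrt(-ln(alpha / 2) / 2)`.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d_max = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance past every observation equal to x in both samples
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d_max = max(d_max, abs(i / n - j / m))
    return d_max


def ks_rejects_h0(sample_a, sample_b, alpha=0.001):
    """Large-sample rule: reject H0 when D exceeds the critical value."""
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    critical_value = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical_value


reference = list(range(100))           # stand-in for a training feature
shifted = [x + 50 for x in reference]  # same shape, shifted location

print(ks_rejects_h0(reference, reference))  # identical samples: H0 is kept
print(ks_rejects_h0(reference, shifted))    # shifted sample: H0 is rejected
```

The critical-value form is only an approximation for large samples; comparing a p-value against `alpha`, as in the quickstart, expresses the same decision.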

## 🛠 Installation

Frouros can be installed via pip:

```bash
pip install frouros
```
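
As a quick sanity check after installing, the snippet below looks up the installed version using only the standard library. The helper name `installed_version` is a hypothetical choice for this sketch; it assumes the distribution is registered under the name `frouros`, as in the `pip` command above.

```python
from importlib import metadata


def installed_version(package="frouros"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"frouros {version}" if version else "frouros is not installed")
```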

## 🕵🏻‍♂️️ Drift detection methods

The currently implemented detectors are listed in the following table.



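| Drift detector | Type | Family | Univariate (U) / Multivariate (M) | Numerical (N) / Categorical (C) | Method | Reference |
|---|---|---|---|---|---|---|
| Concept drift | Streaming | Change detection | U | N | BOCD | Adams and MacKay (2007) |
| Concept drift | Streaming | Change detection | U | N | CUSUM | Page (1954) |
| Concept drift | Streaming | Change detection | U | N | Geometric moving average | Roberts (1959) |
| Concept drift | Streaming | Change detection | U | N | Page Hinkley | Page (1954) |
| Concept drift | Streaming | Statistical process control | U | N | DDM | Gama et al. (2004) |
| Concept drift | Streaming | Statistical process control | U | N | ECDD-WT | Ross et al. (2012) |
| Concept drift | Streaming | Statistical process control | U | N | EDDM | Baena-García et al. (2006) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-A | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-W | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | RDDM | Barros et al. (2017) |
| Concept drift | Streaming | Window based | U | N | ADWIN | Bifet and Gavalda (2007) |
| Concept drift | Streaming | Window based | U | N | KSWIN | Raab et al. (2020) |
| Concept drift | Streaming | Window based | U | N | STEPD | Nishida and Yamauchi (2007) |
| Data drift | Batch | Distance based | U | N | Bhattacharyya distance | Bhattacharyya (1946) |
| Data drift | Batch | Distance based | U | N | Earth Mover's distance | Rubner et al. (2000) |
| Data drift | Batch | Distance based | U | N | Energy distance | Székely et al. (2013) |
| Data drift | Batch | Distance based | U | N | Hellinger distance | Hellinger (1909) |
| Data drift | Batch | Distance based | U | N | Histogram intersection normalized complement | Swain and Ballard (1991) |
| Data drift | Batch | Distance based | U | N | Jensen-Shannon distance | Lin (1991) |
| Data drift | Batch | Distance based | U | N | Kullback-Leibler divergence | Kullback and Leibler (1951) |
| Data drift | Batch | Distance based | M | N | Maximum Mean Discrepancy | Gretton et al. (2012) |
| Data drift | Batch | Distance based | U | N | Population Stability Index | Wu and Olson (2010) |
| Data drift | Batch | Statistical test | U | N | Anderson-Darling test | Scholz and Stephens (1987) |
| Data drift | Batch | Statistical test | U | N | Baumgartner-Weiss-Schindler test | Baumgartner et al. (1998) |
| Data drift | Batch | Statistical test | U | C | Chi-square test | Pearson (1900) |
| Data drift | Batch | Statistical test | U | N | Cramér-von Mises test | Cramér (1902) |
| Data drift | Batch | Statistical test | U | N | Kolmogorov-Smirnov test | Massey Jr (1951) |
| Data drift | Batch | Statistical test | U | N | Kuiper's test | Kuiper (1960) |
| Data drift | Batch | Statistical test | U | N | Mann-Whitney U test | Mann and Whitney (1947) |
| Data drift | Batch | Statistical test | U | N | Welch's t-test | Welch (1947) |
| Data drift | Streaming | Distance based | M | N | Maximum Mean Discrepancy | Gretton et al. (2012) |
| Data drift | Streaming | Statistical test | U | N | Incremental Kolmogorov-Smirnov test | dos Reis et al. (2016) |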
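
The "Distance based" family differs from the "Statistical test" family in that it yields a distance between the reference and test distributions rather than a p-value. As a rough illustration, the sketch below computes the Hellinger distance between two binned samples in pure Python. The function names, the bin edges and the sample values are all hypothetical choices for this example and do not reflect the Frouros API.

```python
import math


def histogram(sample, bin_edges):
    """Normalized histogram over half-open bins; the last bin includes its right edge.

    Assumes at least one value of `sample` falls inside the bin range.
    """
    counts = [0] * (len(bin_edges) - 1)
    for x in sample:
        for k in range(len(counts)):
            is_last = k == len(counts) - 1
            if bin_edges[k] <= x < bin_edges[k + 1] or (is_last and x == bin_edges[-1]):
                counts[k] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, bounded in [0, 1]."""
    squared_gaps = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_gaps / 2.0)


edges = [0, 1, 2, 3, 4]
reference = [0.2, 0.7, 1.1, 1.9, 2.5, 3.3]
drifted = [2.8, 3.1, 3.2, 3.5, 3.7, 3.9]

p_ref = histogram(reference, edges)
d_same = hellinger_distance(p_ref, histogram(reference, edges))
d_drift = hellinger_distance(p_ref, histogram(drifted, edges))
print(d_same, d_drift)
```

A distance of 0 means the binned distributions coincide and 1 means they share no bins; deciding how large a distance counts as drift is left to the user, typically through a threshold chosen for the application.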

## ❗ What is and what is not Frouros?

Unlike other libraries that, in addition to providing drift detection algorithms, include other functionality such as anomaly/outlier detection, adversarial detection, and imbalance learning, Frouros has and will **ONLY** have one purpose: **drift detection**.

We firmly believe that machine learning libraries and frameworks should not follow the [Jack of all trades, master of none](https://en.wikipedia.org/wiki/Jack_of_all_trades,_master_of_none) principle. Instead, they should focus on a single task and do it well.

## ✅ Who is using Frouros?

Frouros is actively being used by the following projects to implement drift
detection in machine learning pipelines:

* [AI4EOSC](https://ai4eosc.eu).
* [iMagine](https://imagine-ai.eu).

If you want your project listed here, do not hesitate to send us a pull request.

## 👍 Contributing

Check out the [contribution](https://github.com/IFCA/frouros/blob/main/CONTRIBUTING.md) section.

## 💬 Citation

If you want to cite Frouros you can use the [SoftwareX publication](https://doi.org/10.1016/j.softx.2024.101733).

```bibtex
@article{CESPEDESSISNIEGA2024101733,
  title = {Frouros: An open-source Python library for drift detection in machine learning systems},
  journal = {SoftwareX},
  volume = {26},
  pages = {101733},
  year = {2024},
  issn = {2352-7110},
  doi = {https://doi.org/10.1016/j.softx.2024.101733},
  url = {https://www.sciencedirect.com/science/article/pii/S2352711024001043},
  author = {Jaime {Céspedes Sisniega} and Álvaro {López García}},
  keywords = {Machine learning, Drift detection, Concept drift, Data drift, Python},
  abstract = {Frouros is an open-source Python library capable of detecting drift in machine learning systems. It provides a combination of classical and more recent algorithms for drift detection, covering both concept and data drift. We have designed it to be compatible with any machine learning framework and easily adaptable to real-world use cases. The library is developed following best development and continuous integration practices to ensure ease of maintenance and extensibility.}
}
```

## 📝 License

Frouros is open-source software licensed under the [BSD-3-Clause license](https://github.com/IFCA/frouros/blob/main/LICENSE).

## 🙏 Acknowledgements

Frouros has received funding from the Agencia Estatal de Investigación, Unidad de Excelencia María de Maeztu, ref. MDM-2017-0765.