https://github.com/IFCA-Advanced-Computing/frouros

Frouros: an open-source Python library for drift detection in machine learning systems.

Host: GitHub
URL: https://github.com/IFCA-Advanced-Computing/frouros
Owner: IFCA-Advanced-Computing
License: bsd-3-clause
Created: 2022-03-16T09:21:26.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2024-04-13T07:47:17.000Z (about 1 year ago)
Last Synced: 2024-04-14T12:09:37.614Z (about 1 year ago)
Topics: change-detection, concept-drift, data-drift, dataset-drift, dataset-shift, distribution-shift, drift-detection, machine-learning, machine-learning-engineering, machine-learning-operations, mle, mlops, python, statistics
Language: Python
Homepage: https://frouros.readthedocs.io
Size: 22.9 MB
Stars: 158
Watchers: 4
Forks: 10
Open Issues: 10
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
- Codeowners: CODEOWNERS

Awesome Lists containing this project

awesome-mlops - Frouros - An open source Python library for drift detection in machine learning systems. (Drift Detection)

README

Frouros is a Python library for drift detection in machine learning systems that provides a combination of classical and more recent algorithms for both concept and data drift detection.

> "Everything changes and nothing stands still"
>
> "You could not step twice into the same river"
>
> Heraclitus of Ephesus (535-475 BCE)

----

## ⚡️ Quickstart

### 🔄 Concept drift

As a quick example, we can take the breast cancer dataset, induce concept drift in it, and show how a concept drift detector such as DDM (Drift Detection Method) is used. The example also shows how concept drift degrades the model's accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from frouros.detectors.concept_drift import DDM, DDMConfig
from frouros.metrics import PrequentialError

np.random.seed(seed=31)

# Load breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Define and fit model
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipeline.fit(X=X_train, y=y_train)

# Detector configuration and instantiation
config = DDMConfig(
    warning_level=2.0,
    drift_level=3.0,
    min_num_instances=25,  # minimum number of instances before checking for concept drift
)
detector = DDM(config=config)

# Metric to compute accuracy
metric = PrequentialError(alpha=1.0)  # alpha=1.0 (no fading) makes 1 - error the standard accuracy


def stream_test(X_test, y_test, y, metric, detector):
    """Simulate data stream over X_test and y_test. y is the true label."""
    drift_flag = False
    for i, (X, y) in enumerate(zip(X_test, y_test)):
        y_pred = pipeline.predict(X.reshape(1, -1))
        error = 1 - (y_pred.item() == y.item())
        metric_error = metric(error_value=error)
        _ = detector.update(value=error)
        status = detector.status
        if status["drift"] and not drift_flag:
            drift_flag = True
            print(f"Concept drift detected at step {i}. Accuracy: {1 - metric_error:.4f}")
    if not drift_flag:
        print("No concept drift detected")
    print(f"Final accuracy: {1 - metric_error:.4f}\n")


# Simulate data stream (assuming test label available after each prediction)
# No concept drift is expected to occur
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> No concept drift detected
# >> Final accuracy: 0.9766

# IMPORTANT: Induce/simulate concept drift in the last part (20%)
# of y_test by modifying some labels (approx. 50%), thereby changing P(y|X)
drift_size = int(y_test.shape[0] * 0.2)
y_test_drift = y_test[-drift_size:]
modify_idx = np.random.rand(*y_test_drift.shape) <= 0.5
y_test_drift[modify_idx] = (y_test_drift[modify_idx] + 1) % len(np.unique(y_test))
y_test[-drift_size:] = y_test_drift

# Reset detector and metric
detector.reset()
metric.reset()

# Simulate data stream (assuming test label available after each prediction)
# Concept drift is expected to occur because of the label modification
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> Concept drift detected at step 142. Accuracy: 0.9510
# >> Final accuracy: 0.8480
```

More concept drift examples can be found [here](https://frouros.readthedocs.io/en/latest/examples/concept_drift.html).
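
To make the role of `warning_level`, `drift_level`, and `min_num_instances` more concrete, here is a minimal pure-Python sketch of the DDM rule as described by Gama et al. (2004). It is not the Frouros implementation: the class `SimpleDDM` and the helper `first_drift_index` are hypothetical names introduced only for illustration, and the sketch assumes the stream consists of 0/1 error values.

```python
import math


class SimpleDDM:
    """Illustrative DDM-style detector (hypothetical, not the Frouros class)."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_num_instances=25):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_num_instances = min_num_instances
        self.num_instances = 0
        self.num_errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """Consume one 0/1 error value; return 'stable', 'warning' or 'drift'."""
        self.num_instances += 1
        self.num_errors += error
        # Running error rate and its binomial standard deviation
        p = self.num_errors / self.num_instances
        s = math.sqrt(p * (1 - p) / self.num_instances)
        if self.num_instances < self.min_num_instances:
            return "stable"
        # Remember the lowest p + s observed so far
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        # Compare the current level against the recorded minimum
        if p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


def first_drift_index(errors, detector):
    """Return the index of the first 'drift' signal, or None if there is none."""
    for i, error in enumerate(errors):
        if detector.update(error) == "drift":
            return i
    return None
```

Under these assumptions, a stream with a steady 10% error rate should stay stable, while appending a run of consecutive errors should push `p + s` above the drift threshold shortly after the change.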

### 📊 Data drift

As a quick example, we can take the iris dataset, induce data drift in one of its features, and show how a data drift detector such as the Kolmogorov-Smirnov test is used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from frouros.detectors.data_drift import KSTest

np.random.seed(seed=31)

# Load iris dataset
X, y = load_iris(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Set the feature index to which detector is applied
feature_idx = 0

# IMPORTANT: Induce/simulate data drift in the selected feature of X_test by
# applying some Gaussian noise, thereby changing P(X)
X_test[:, feature_idx] += np.random.normal(
    loc=0.0,
    scale=3.0,
    size=X_test.shape[0],
)

# Define and fit model
model = DecisionTreeClassifier(random_state=31)
model.fit(X=X_train, y=y_train)

# Set significance level for hypothesis testing
alpha = 0.001

# Define and fit detector
detector = KSTest()
_ = detector.fit(X=X_train[:, feature_idx])

# Apply detector to the selected feature of X_test
result, _ = detector.compare(X=X_test[:, feature_idx])

# Check if drift is taking place
if result.p_value <= alpha:
    print(f"Data drift detected at feature {feature_idx}")
else:
    print(f"No data drift detected at feature {feature_idx}")
# >> Data drift detected at feature 0
# Therefore, we can reject H0 (both samples come from the same distribution).
```

More data drift examples can be found [here](https://frouros.readthedocs.io/en/latest/examples/data_drift.html).
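
For readers who want to see what the test measures, the following is a rough pure-Python sketch of a two-sample Kolmogorov-Smirnov comparison. The function names `ks_statistic` and `ks_asymptotic_p_value` are hypothetical and not part of Frouros, and the p-value uses the asymptotic Kolmogorov series, so it should be read as an approximation that may differ from a library result for small samples.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, test):
    """Largest absolute gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    test_sorted = sorted(test)
    n, m = len(ref_sorted), len(test_sorted)
    d = 0.0
    # The gap can only change at observed values, so check each of them
    for x in ref_sorted + test_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / n
        cdf_test = bisect_right(test_sorted, x) / m
        d = max(d, abs(cdf_ref - cdf_test))
    return d


def ks_asymptotic_p_value(d, n, m, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The survival function is indistinguishable from 1 in this range
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))
```

As in the example above, drift would be flagged when the p-value falls at or below the chosen significance level `alpha`.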

## 🛠 Installation

Frouros can be installed via pip:

```bash
pip install frouros
```

## 🕵🏻‍♂️️ Drift detection methods

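The currently implemented detectors are listed in the following table.

| Drift detector | Type | Family | Univariate (U) / Multivariate (M) | Numerical (N) / Categorical (C) | Method | Reference |
|---|---|---|---|---|---|---|
| Concept drift | Streaming | Change detection | U | N | BOCD | Adams and MacKay (2007) |
| Concept drift | Streaming | Change detection | U | N | CUSUM | Page (1954) |
| Concept drift | Streaming | Change detection | U | N | Geometric moving average | Roberts (1959) |
| Concept drift | Streaming | Change detection | U | N | Page Hinkley | Page (1954) |
| Concept drift | Streaming | Statistical process control | U | N | DDM | Gama et al. (2004) |
| Concept drift | Streaming | Statistical process control | U | N | ECDD-WT | Ross et al. (2012) |
| Concept drift | Streaming | Statistical process control | U | N | EDDM | Baena-García et al. (2006) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-A | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-W | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | RDDM | Barros et al. (2017) |
| Concept drift | Streaming | Window based | U | N | ADWIN | Bifet and Gavalda (2007) |
| Concept drift | Streaming | Window based | U | N | KSWIN | Raab et al. (2020) |
| Concept drift | Streaming | Window based | U | N | STEPD | Nishida and Yamauchi (2007) |
| Data drift | Batch | Distance based | U | N | Bhattacharyya distance | Bhattacharyya (1946) |
| Data drift | Batch | Distance based | U | N | Earth Mover's distance | Rubner et al. (2000) |
| Data drift | Batch | Distance based | U | N | Energy distance | Székely et al. (2013) |
| Data drift | Batch | Distance based | U | N | Hellinger distance | Hellinger (1909) |
| Data drift | Batch | Distance based | U | N | Histogram intersection normalized complement | Swain and Ballard (1991) |
| Data drift | Batch | Distance based | U | N | Jensen-Shannon distance | Lin (1991) |
| Data drift | Batch | Distance based | U | N | Kullback-Leibler divergence | Kullback and Leibler (1951) |
| Data drift | Batch | Distance based | M | N | Maximum Mean Discrepancy | Gretton et al. (2012) |
| Data drift | Batch | Distance based | U | N | Population Stability Index | Wu and Olson (2010) |
| Data drift | Batch | Statistical test | U | N | Anderson-Darling test | Scholz and Stephens (1987) |
| Data drift | Batch | Statistical test | U | N | Baumgartner-Weiss-Schindler test | Baumgartner et al. (1998) |
| Data drift | Batch | Statistical test | U | C | Chi-square test | Pearson (1900) |
| Data drift | Batch | Statistical test | U | N | Cramér-von Mises test | Cramér (1902) |
| Data drift | Batch | Statistical test | U | N | Kolmogorov-Smirnov test | Massey Jr (1951) |
| Data drift | Batch | Statistical test | U | N | Kuiper's test | Kuiper (1960) |
| Data drift | Batch | Statistical test | U | N | Mann-Whitney U test | Mann and Whitney (1947) |
| Data drift | Batch | Statistical test | U | N | Welch's t-test | Welch (1947) |
| Data drift | Streaming | Distance based | M | N | Maximum Mean Discrepancy | Gretton et al. (2012) |
| Data drift | Streaming | Statistical test | U | N | Incremental Kolmogorov-Smirnov test | dos Reis et al. (2016) |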

## ❗ What is and what is not Frouros?

Unlike other libraries that, in addition to providing drift detection algorithms, include other functionality such as anomaly/outlier detection, adversarial detection, or imbalance learning, Frouros has and will **ONLY** have one purpose: **drift detection**.

We firmly believe that machine learning libraries and frameworks should not follow the [Jack of all trades, master of none](https://en.wikipedia.org/wiki/Jack_of_all_trades,_master_of_none) principle. Instead, they should focus on a single task and do it well.

## ✅ Who is using Frouros?

Frouros is actively being used by the following projects to implement drift detection in machine learning pipelines:

* [AI4EOSC](https://ai4eosc.eu)
* [iMagine](https://imagine-ai.eu)

If you want your project listed here, do not hesitate to send us a pull request.

## 👍 Contributing

Check out the [contribution](https://github.com/IFCA/frouros/blob/main/CONTRIBUTING.md) section.

## 💬 Citation

If you want to cite Frouros you can use the [SoftwareX publication](https://doi.org/10.1016/j.softx.2024.101733).

```bibtex
@article{CESPEDESSISNIEGA2024101733,
  title = {Frouros: An open-source Python library for drift detection in machine learning systems},
  journal = {SoftwareX},
  volume = {26},
  pages = {101733},
  year = {2024},
  issn = {2352-7110},
  doi = {https://doi.org/10.1016/j.softx.2024.101733},
  url = {https://www.sciencedirect.com/science/article/pii/S2352711024001043},
  author = {Jaime {Céspedes Sisniega} and Álvaro {López García}},
  keywords = {Machine learning, Drift detection, Concept drift, Data drift, Python},
  abstract = {Frouros is an open-source Python library capable of detecting drift in machine learning systems. It provides a combination of classical and more recent algorithms for drift detection, covering both concept and data drift. We have designed it to be compatible with any machine learning framework and easily adaptable to real-world use cases. The library is developed following best development and continuous integration practices to ensure ease of maintenance and extensibility.}
}
```

## 📝 License

Frouros is open-source software licensed under the [BSD-3-Clause license](https://github.com/IFCA/frouros/blob/main/LICENSE).

## 🙏 Acknowledgements

Frouros has received funding from the Agencia Estatal de Investigación, Unidad de Excelencia María de Maeztu, ref. MDM-2017-0765.
