https://github.com/uncledecart/data_drift
Data Drift detection using auto encoders
autoencoder data-drift data-science pytorch-lightning variational-autoencoder
- Host: GitHub
- URL: https://github.com/uncledecart/data_drift
- Owner: uncleDecart
- Created: 2022-04-14T12:11:40.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-04-27T11:34:00.000Z (about 4 years ago)
- Last Synced: 2025-09-10T02:29:09.450Z (10 months ago)
- Topics: autoencoder, data-drift, data-science, pytorch-lightning, variational-autoencoder
- Language: Jupyter Notebook
- Homepage:
- Size: 8.88 MB
- Stars: 3
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data drift detection using Autoencoders
*Disclaimer for dark mode users: some of the graphics' titles in this README cannot be read properly in GitHub Dark Mode*
## Definition
In this project, data drift means that a new, previously unseen class is introduced while the model is running.
It can be an anomaly (a changed object of the same class the model was trained on) or a completely new object of a different class.
## Approach

Low-dimensional representations (embeddings) produced by Autoencoders (AE) are clustered, and the Silhouette coefficient is used to detect new classes.
The reconstruction error is also used to detect an "anomaly" within one class. The Autoencoder is trained without labels on one (good) class.
When a new class is detected, a new AE for the average representation of that class can be introduced.
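
The reconstruction-error part of this approach can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the function names, the mean squared error metric, and the 0.99 quantile threshold are assumptions, and the "reconstructions" are simulated with noise instead of coming from a trained AE.

```python
import numpy as np

def reconstruction_errors(originals, reconstructions):
    """Per-sample mean squared error between inputs and their reconstructions."""
    originals = np.asarray(originals, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    # Average over every axis except the first (batch) axis.
    axes = tuple(range(1, originals.ndim))
    return ((originals - reconstructions) ** 2).mean(axis=axes)

def fit_threshold(good_errors, quantile=0.99):
    """Assumed rule: use a high quantile of the errors seen on the good class."""
    return float(np.quantile(good_errors, quantile))

def is_anomaly(errors, threshold):
    """Flag samples whose reconstruction error exceeds the threshold."""
    return np.asarray(errors) > threshold

# Simulated data: good samples reconstruct well, changed samples do not.
rng = np.random.default_rng(0)
good = rng.random((200, 3, 8, 8))
good_recon = good + rng.normal(0.0, 0.01, good.shape)
good_errors = reconstruction_errors(good, good_recon)
threshold = fit_threshold(good_errors)

changed = rng.random((5, 3, 8, 8))
changed_recon = changed + rng.normal(0.0, 0.2, changed.shape)
flags = is_anomaly(reconstruction_errors(changed, changed_recon), threshold)
```

With a real model, `good_recon` and `changed_recon` would be the AE outputs for the corresponding batches.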
**Following research**
*The researched problem can be treated as a small one*
It is unclear whether this approach will work on bigger problems (e.g. higher-resolution images). Tiling may help it scale.
*Tiling big images to create attention maps of a kind to localize drifts*
Tiling images and training an AE for each segment of the camera frame can be used to localize drifts. Moreover, this approach can be used in
a federated way to provide better accuracy.
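
A rough sketch of the tiling idea, assuming non-overlapping tiles and one error threshold per tile. `tile_image` and `drift_map` are hypothetical helpers, not functions from this repository.

```python
import numpy as np

def tile_image(image, tile_h, tile_w):
    """Split an (H, W, C) image into non-overlapping tiles.

    Returns an array of shape (rows, cols, tile_h, tile_w, C).
    Border pixels that do not fill a whole tile are cropped away.
    """
    image = np.asarray(image)
    h, w, c = image.shape
    rows, cols = h // tile_h, w // tile_w
    cropped = image[: rows * tile_h, : cols * tile_w]
    # Split both spatial axes, then bring the two grid axes to the front.
    return cropped.reshape(rows, tile_h, cols, tile_w, c).swapaxes(1, 2)

def drift_map(tile_errors, tile_thresholds):
    """Boolean (rows, cols) grid marking tiles whose error exceeds its threshold."""
    return np.asarray(tile_errors) > np.asarray(tile_thresholds)

image = np.arange(6 * 8).reshape(6, 8, 1)
tiles = tile_image(image, tile_h=3, tile_w=4)  # a 2 x 2 grid of 3 x 4 tiles
```

Each `tiles[r, c]` could then be fed to the AE responsible for that segment, and the per-tile reconstruction errors passed to `drift_map` to see where in the frame the drift occurs.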
*Camera "problems" should be treated as "style"*
Of course, we can say that there is data drift (an anomaly, if we are talking about one image) when the camera is moved or the lighting conditions are different,
but the question can be put differently. We can actually list what can go wrong with a camera: lighting conditions, focus, movement, scratches.
All of these can be modelled (e.g. algorithmically applied to images to simulate such conditions), which gives us two advantages:
1) We can generate data and model drift for such occasions.
2) More interestingly, given that we know such things can happen,
can we train a VAEGAN to disentangle style (e.g. camera problems) from features (e.g. objects), and maybe even interpolate the car from
the information given? For example: yes, the image is badly lit, but we can restore the bad parts of it, we know with some certainty that this is the specified object,
and we can say it is not an anomalous one. That way we would be more sensitive to real anomalies.
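
As an illustration of applying camera problems to images algorithmically, here is a minimal NumPy sketch of three of the conditions listed above (lighting, focus, movement). The function names and parameters are assumptions for illustration. Images are assumed to be float arrays of shape (H, W, C) with values in [0, 1], and the box blur is only a crude stand-in for real defocus.

```python
import numpy as np

def change_lighting(image, gain=1.0, bias=0.0):
    """Scale and shift brightness, keeping values inside [0, 1]."""
    return np.clip(image * gain + bias, 0.0, 1.0)

def defocus(image, k=3):
    """Blur with a k x k box filter (k is assumed to be odd)."""
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros_like(image, dtype=float)
    # Sum the k * k shifted copies of the image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def move_camera(image, dx=0, dy=0):
    """Shift the content by (dx, dy) pixels, repeating edge pixels at the border."""
    h, w = image.shape[:2]
    ax, ay = abs(dx), abs(dy)
    padded = np.pad(image, ((ay, ay), (ax, ax), (0, 0)), mode="edge")
    return padded[ay - dy: ay - dy + h, ax - dx: ax - dx + w]

rng = np.random.default_rng(0)
img = rng.random((16, 16, 3))
dark = change_lighting(img, gain=0.5)
blurred = defocus(img, k=3)
moved = move_camera(img, dx=3, dy=2)
```

Applying such transforms to good images with known parameters yields labelled drift examples, which is one way to test how sensitive the detector is to each camera problem.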
**Remaining questions:**
- What is the maximum capacity of one encoder to distinguish classes?
## Data:
The dataset used for this project was mainly [MVTec AD](https://www.mvtec.com/company/research/datasets/mvtec-ad); you can find its dataloader in the `src` folder.
The dataset needs to be downloaded separately from the link provided; no registration or additional fee is needed.
## Autoencoders:
Autoencoders are stored in the `model` folder. They are written in [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning).
Supported architectures:
- Variational Autoencoder
- Vanilla Autoencoder
- VAEGAN (has not been tested)
## Results
This is a compilation of results from the `notebooks` folder; check the `.ipynb` files for more details.
### Autoencoders:
- Two Autoencoder models were trained: "big" and "small". The Small AE model has an embedding size of 8. The Big AE model has an embedding size of 32.
- A Variational Autoencoder with an embedding size of 32 was trained as well and showed results similar to the Big AE model.

The Big AE generalizes better, which can be seen from the PCA projections of the same MVTec data:
PCA on embeddings for Big AE

PCA on embeddings for Small AE

Original bottle images (AE was trained only on good bottles)

Bottle reconstructions

Original transistor images

Reconstructed transistor images

### Clusters:
- Even with 32 dimensions, clustering does not work well. DBSCAN, which was expected to cope with the curse of dimensionality, did not perform as well as expected.
- A combination of 2-component PCA (explaining around 80% of the variance) and K-means clustering, using the Silhouette coefficient to determine cluster cardinality, worked fine.
- One can use a GMM instead of K-means clustering to quantify uncertainty in class cardinality.
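
The PCA plus K-means combination can be sketched with scikit-learn as below. This is an illustration on synthetic 32-dimensional "embeddings", not the notebook code: `estimate_class_count`, `max_k`, and the blob data are assumptions. Note that the Silhouette coefficient is only defined for two or more clusters, so this procedure alone cannot confirm that a single class is present.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def estimate_class_count(embeddings, max_k=6, random_state=0):
    """Project to 2 PCA components, then pick the K-means k with the best Silhouette score."""
    projected = PCA(n_components=2, random_state=random_state).fit_transform(embeddings)
    scores = {}
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(projected)
        scores[k] = silhouette_score(projected, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for AE embeddings: three well-separated classes in 32 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(0.0, 5.0, size=(3, 32))
embeddings = np.vstack([c + rng.normal(0.0, 0.5, size=(100, 32)) for c in centers])

best_k, scores = estimate_class_count(embeddings)
```

For the GMM variant, `sklearn.mixture.GaussianMixture` could replace `KMeans`; its `predict_proba` method returns soft cluster memberships, which is one way to express the uncertainty mentioned above.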

Dynamic results on MVTec data:

## Setup
Before cloning the repository, download the [MVTec AD data](https://www.mvtec.com/company/research/datasets/mvtec-ad).
```bash
git clone https://github.com/uncleDecart/data_drift
cd data_drift
pip install -r requirements.txt
jupyter-notebook
```