https://github.com/mayer79/outforest
Outlier detection based on random forest models
- Host: GitHub
- URL: https://github.com/mayer79/outforest
- Owner: mayer79
- License: gpl-2.0
- Created: 2019-12-19T14:14:12.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2024-07-29T17:10:49.000Z (5 months ago)
- Last Synced: 2024-10-04T12:56:52.559Z (3 months ago)
- Topics: machine-learning, outlier, outlier-analysis, outlier-detection, random-forest, rstats
- Language: R
- Homepage: https://mayer79.github.io/outForest/
- Size: 2.58 MB
- Stars: 14
- Watchers: 4
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md
- License: LICENSE.md
[![R-CMD-check](https://github.com/mayer79/outForest/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mayer79/outForest/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/mayer79/outForest/graph/badge.svg)](https://app.codecov.io/gh/mayer79/outForest?branch=main)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/outForest)](https://cran.r-project.org/package=outForest)[![](https://cranlogs.r-pkg.org/badges/outForest)](https://cran.r-project.org/package=outForest)
[![](https://cranlogs.r-pkg.org/badges/grand-total/outForest?color=orange)](https://cran.r-project.org/package=outForest)

## Overview
{outForest} is a multivariate anomaly detection method. Each numeric variable is regressed onto all other variables using a random forest. If the scaled absolute difference between the observed value and its out-of-bag prediction is larger than a prespecified threshold, the value is considered an outlier. Once identified, outliers can be replaced, e.g., by predictive mean matching from the non-outliers.
The method can be viewed as a multivariate extension of a basic univariate outlier detection method, in which a value is considered an outlier if it deviates from the mean by more than, say, three times the standard deviation. In the multivariate case, a value is compared with its *conditional mean* given the other variables rather than with the *overall mean*. {outForest} estimates this conditional mean by a random forest.
Once the method is trained on a reference data set, it can be applied to new data.
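The difference between the univariate rule and its conditional counterpart can be illustrated with a small base R sketch. It is not the package's implementation: `lm()` stands in for the random forest, in-sample fitted values stand in for out-of-bag predictions, and the toy data and all variable names are hypothetical.

```r
set.seed(1)

# Hypothetical toy data: y depends strongly on x
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)

# Plant one value that is ordinary on its own but inconsistent with x
x[1] <- 2
y[1] <- -2

# Univariate rule: distance from the overall mean in standard deviations
z <- (y - mean(y)) / sd(y)

# Conditional rule: distance from the conditional mean, scaled by the RMSE.
# ASSUMPTION: lm() is only a stand-in; outForest uses out-of-bag
# random forest predictions instead of in-sample linear fits.
fit <- lm(y ~ x)
res <- y - fitted(fit)
score <- res / sqrt(mean(res^2))

threshold <- 3
abs(z[1]) > threshold      # univariate rule applied to the planted value
abs(score[1]) > threshold  # conditional rule applied to the planted value
```

The planted value lies well inside the marginal range of `y`, so only a rule that conditions on `x` has a chance of flagging it.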
## Installation
```r
# From CRAN
install.packages("outForest")

# Development version
devtools::install_github("mayer79/outForest")
```

## Usage
We first generate a data set with about 2% outlier values in each numeric column. Then, we try to identify them.
``` r
library(outForest)
set.seed(3)

# Generate data with outliers in numeric columns
head(irisWithOutliers <- generateOutliers(iris, p = 0.02))

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 5.1 3.500000 1.4 0.2 setosa
# 4.9 3.000000 1.4 0.2 setosa
# 4.7 3.200000 1.3 0.2 setosa
# 4.6 3.100000 1.5 0.2 setosa
# 5.0 -3.744405 1.4 0.2 setosa
# 5.4 3.900000 1.7 0.4 setosa

# Find outliers by random forest regressions and replace them by predictive mean matching
(out <- outForest(irisWithOutliers, allow_predictions = TRUE))

# Plot the number of outliers per numeric variable
plot(out)

# Information on outliers
head(outliers(out))

# row col observed predicted rmse score threshold replacement
# 5 Sepal.Width -3.744405 3.298493 0.7810172 -9.017596 3 2.8
# 20 Sepal.Length 10.164017 5.141093 0.6750468 7.440852 3 5.4
# 138 Petal.Width 4.721186 2.113464 0.3712539 7.024092 3 2.1
# 68 Petal.Width -1.188913 1.305339 0.3712539 -6.718452 3 1.2
# 137 Sepal.Width 8.054524 2.861445 0.7810172 6.649122 3 2.9
# 15 Petal.Length 6.885277 1.875646 0.7767877 6.449163 3 1.3

# Resulting data set with replaced outliers
head(Data(out))

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 5.1 3.5 1.4 0.2 setosa
# 4.9 3.0 1.4 0.2 setosa
# 4.7 3.2 1.3 0.2 setosa
# 4.6 3.1 1.5 0.2 setosa
# 5.0 2.8 1.4 0.2 setosa
# 5.4 3.9 1.7 0.4 setosa

# Out-of-sample application
iris1 <- iris[1, ]
iris1$Sepal.Length <- -1
pred <- predict(out, newdata = iris1)

# Did we find the outlier?
outliers(pred)

# row col observed predicted rmse score threshold replacement
# 1 Sepal.Length -1 4.960069 0.6750468 -8.82912 3 6.4

# Fixed data
Data(pred)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 6.4 3.5 1.4 0.2 setosa
```
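The columns printed by `outliers()` suggest that `score` is the difference between `observed` and `predicted` divided by `rmse`, and that a value is flagged when the absolute score exceeds `threshold`. This relation is inferred from the output above rather than taken from the package documentation, so treat the following base R check as a sketch. The numbers are copied from the six printed rows.

```r
# Values copied from the head(outliers(out)) output above
observed  <- c(-3.744405, 10.164017, 4.721186, -1.188913, 8.054524, 6.885277)
predicted <- c(3.298493, 5.141093, 2.113464, 1.305339, 2.861445, 1.875646)
rmse      <- c(0.7810172, 0.6750468, 0.3712539, 0.3712539, 0.7810172, 0.7767877)
printed   <- c(-9.017596, 7.440852, 7.024092, -6.718452, 6.649122, 6.449163)

# ASSUMPTION: score = (observed - predicted) / rmse, inferred from the output
score <- (observed - predicted) / rmse

# Largest deviation from the printed scores (only rounding error is expected)
max(abs(score - printed))

# Every listed row exceeds the threshold of 3 in absolute value
all(abs(score) > 3)
```

The sign of the score shows the direction of the deviation: negative scores mark values far below their prediction, positive scores values far above it.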