https://github.com/mhahsler/dbscan
Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
- Host: GitHub
- URL: https://github.com/mhahsler/dbscan
- Owner: mhahsler
- License: gpl-3.0
- Created: 2015-09-30T17:14:37.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2024-08-27T22:14:33.000Z (3 months ago)
- Last Synced: 2024-09-28T07:36:07.139Z (about 1 month ago)
- Topics: clustering, cran, dbscan, density-based-clustering, hdbscan, lof, optics, r
- Language: C++
- Size: 8.7 MB
- Stars: 306
- Watchers: 14
- Forks: 64
- Open Issues: 5
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
README
---
output: github_document
bibliography: vignettes/dbscan.bib
link-citations: yes
---

```{r echo=FALSE, results = 'asis'}
pkg <- 'dbscan'

source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg, anaconda = "r-dbscan", stackoverflow = "dbscan+r")
```

## Introduction

This R package [@hahsler2019dbscan] provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data.
The package includes:

__Clustering__

- __DBSCAN:__ Density-based spatial clustering of applications with noise [@ester1996density].
- __Jarvis-Patrick Clustering__: Clustering using a similarity measure based
on shared near neighbors [@jarvis1973].
- __SNN Clustering__: Shared nearest neighbor clustering [@erdoz2003].
- __HDBSCAN:__ Hierarchical DBSCAN with simplified hierarchy extraction [@campello2015hierarchical].
- __FOSC:__ Framework for optimal selection of clusters for unsupervised and semisupervised clustering of hierarchical cluster trees [@campello2013density].
- __OPTICS/OPTICSXi:__ Ordering points to identify the clustering structure and cluster extraction methods [@ankerst1999optics].

__Outlier Detection__

- __LOF:__ Local outlier factor algorithm [@breunig2000lof].
- __GLOSH:__ Global-Local Outlier Score from Hierarchies algorithm [@campello2015hierarchical].

__Fast Nearest-Neighbor Search (using kd-trees)__

- __kNN search__
- __Fixed-radius NN search__

The implementations use the kd-tree data structure (from the ANN library) for faster k-nearest neighbor search. For Euclidean distance, they are typically faster than the native R implementations (e.g., dbscan in package `fpc`) and the implementations in [WEKA](https://ml.cms.waikato.ac.nz/weka/), [ELKI](https://elki-project.github.io/), and [Python's scikit-learn](https://scikit-learn.org/).

```{r echo=FALSE, results = 'asis'}
pkg_usage(pkg)
pkg_citation(pkg, 2)
pkg_install(pkg)
```

## Usage

Load the package and use the numeric variables in the iris dataset.

```{r}
library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])
```

DBSCAN
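
A common heuristic for choosing `eps` is to look for a knee in the sorted k-nearest neighbor distances. The sketch below assumes the package's `kNNdistplot()` function and uses `k = 4` (that is, `minPts - 1` for `minPts = 5`). The dashed line at 0.42 only relates the plot to the `eps` value used in this README; it is not a derived optimum.

```r
library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])

# Plot the sorted distance of every point to its 4th nearest neighbor.
kNNdistplot(x, k = 4)

# Illustrative reference line at the eps value used in this README.
abline(h = 0.42, col = "red", lty = 2)
```

Points to the right of the knee have unusually distant neighbors and are candidates for noise.
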
```{r}
db <- dbscan(x, eps = .42, minPts = 5)
db
```

Visualize the resulting clustering (noise points are shown in black).

```{r dbscan}
pairs(x, col = db$cluster + 1L)
```

OPTICS

```{r}
opt <- optics(x, eps = 1, minPts = 4)
opt
```

Extract a DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at `eps_cl = .4` are colored).

```{r OPTICS_extractDBSCAN, fig.height=3}
opt <- extractDBSCAN(opt, eps_cl = .4)
plot(opt)
```

HDBSCAN

```{r}
hdb <- hdbscan(x, minPts = 4)
hdb
```

Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.

```{r hdbscan, fig.height=4}
plot(hdb, show_flat = TRUE)
```

## Using dbscan with tidyverse

`dbscan` provides `tidy()`, `augment()`, and `glance()` methods for all clustering algorithms, so they can be used easily with tidyverse, ggplot2, and [tidymodels](https://www.tidymodels.org/learn/statistics/k-means/).

```{r tidyverse, message=FALSE, warning=FALSE}
library(tidyverse)
db <- x %>% dbscan(eps = .42, minPts = 5)
```

Get cluster statistics as a tibble.

```{r tidyverse2}
tidy(db)
```

Visualize the clustering with ggplot2 (use an x for noise points).

```{r tidyverse3}
augment(db, x) %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(color = .cluster, shape = noise)) +
  scale_shape_manual(values = c(19, 4))
```

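
The augmented data can also be summarized with regular dplyr verbs. The following is a minimal sketch that assumes, as in the ggplot2 example, that `augment()` adds the columns `.cluster` and `noise`. It counts the points per cluster and noise flag.

```r
library("dbscan")
library("tidyverse")

data("iris")
x <- as.matrix(iris[, 1:4])
db <- x %>% dbscan(eps = .42, minPts = 5)

# Count the points per cluster; noise points are flagged in the `noise` column.
cluster_sizes <- augment(db, x) %>%
  count(.cluster, noise)

cluster_sizes
```

The counts sum to the number of rows in `x`, since every point is either assigned to a cluster or flagged as noise.
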
## Using dbscan from Python
R, the R package `dbscan`, and the Python package `rpy2` need to be installed.

```{python, eval = FALSE}
import pandas as pd
import numpy as np

### prepare data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   header = None,
                   names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])
iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

# get R dbscan package
from rpy2.robjects import packages
dbscan = packages.importr('dbscan')

# enable automatic conversion of pandas dataframes to R dataframes
from rpy2.robjects import pandas2ri
pandas2ri.activate()

# minPts is the argument name used by the R function dbscan()
db = dbscan.dbscan(iris_numeric, eps = 0.5, minPts = 5)
print(db)
```

```
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.5, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 17 noise points.
##
## 0 1 2
## 17 49 84
##
## Available fields: cluster, eps, minPts, dist, borderPoints
```

```{python, eval = FALSE}
# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels
```

```
## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
## 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
## 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
## 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
## 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
## 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
## dtype=int32)
```

## License

The dbscan package is licensed under the [GNU General Public License (GPL) Version 3](https://www.gnu.org/licenses/gpl-3.0.en.html). The __OPTICSXi__ R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with permission from the original author, Erich Schubert.

## Changes

* List of changes from [NEWS.md](https://github.com/mhahsler/dbscan/blob/master/NEWS.md)

## References