https://github.com/mhahsler/dbscan

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
https://github.com/mhahsler/dbscan

clustering cran dbscan density-based-clustering hdbscan lof optics r

Last synced: 5 months ago
JSON representation

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package

Host: GitHub
URL: https://github.com/mhahsler/dbscan
Owner: mhahsler
License: gpl-3.0
Created: 2015-09-30T17:14:37.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2025-04-03T21:23:54.000Z (7 months ago)
Last Synced: 2025-04-13T15:59:53.271Z (6 months ago)
Topics: clustering, cran, dbscan, density-based-clustering, hdbscan, lof, optics, r
Language: C++
Size: 9.39 MB
Stars: 325
Watchers: 13
Forks: 64
Open Issues: 4
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE

Awesome Lists containing this project

README

          ---

output: github_document

bibliography: vignettes/dbscan.bib

link-citations: yes

---

```{r echo=FALSE, results = 'asis'}

pkg <- 'dbscan'

source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")

pkg_title(pkg, anaconda = "r-dbscan", stackoverflow = "dbscan%2br")

```

## Introduction

This R package [@hahsler2019dbscan] provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data.

The package includes: 

 

__Clustering__

- __DBSCAN:__ Density-based spatial clustering of applications with noise [@ester1996density].

- __Jarvis-Patrick Clustering__: Clustering using a similarity measure based

on shared near neighbors [@jarvis1973].

- __SNN Clustering__: Shared nearest neighbor clustering [@erdoz2003].

- __HDBSCAN:__  Hierarchical DBSCAN with simplified hierarchy extraction [@campello2015hierarchical].

- __FOSC:__ Framework for optimal selection of clusters for unsupervised and semisupervised clustering of hierarchical cluster tree [@campello2013density].

- __OPTICS/OPTICSXi:__ Ordering points to identify the clustering structure and cluster extraction methods

  [@ankerst1999optics].

__Outlier Detection__

- __LOF:__ Local outlier factor algorithm [@breunig2000lof]. 

- __GLOSH:__ Global-Local Outlier Score from Hierarchies algorithm [@campello2015hierarchical]. 

__Cluster Evaluation__

- __DBCV:__ Density-based clustering validation [@moulavi2014].

__Fast Nearest-Neighbor Search (using kd-trees)__

- __kNN search__

- __Fixed-radius NN search__

The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search, and are

for Euclidean distance typically faster than the native R implementations (e.g., dbscan in package `fpc`), or the 

implementations in [WEKA](https://ml.cms.waikato.ac.nz/weka/), [ELKI](https://elki-project.github.io/) and [Python's scikit-learn](https://scikit-learn.org/).

```{r echo=FALSE, results = 'asis'}

pkg_usage(pkg)

pkg_citation(pkg, 2)

pkg_install(pkg)

```

## Usage

Load the package and use the numeric variables in the iris dataset

```{r}

library("dbscan")

data("iris")

x <- as.matrix(iris[, 1:4])

```

DBSCAN

```{r}

db <- dbscan(x, eps = .42, minPts = 5)

db

```

Visualize the resulting clustering (noise points are shown in black).

```{r dbscan}

pairs(x, col = db$cluster + 1L)

```

OPTICS

```{r}

opt <- optics(x, eps = 1, minPts = 4)

opt

```

Extract DBSCAN-like clustering from OPTICS 

and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)

```{r OPTICS_extractDBSCAN, fig.height=3}

opt <- extractDBSCAN(opt, eps_cl = .4)

plot(opt)

```

HDBSCAN

```{r}

hdb <- hdbscan(x, minPts = 4)

hdb

```

Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.

```{r hdbscan, fig.height=4}

plot(hdb, show_flat = TRUE)

```

## Using dbscan with tidyverse

`dbscan` provides for all clustering algorithms `tidy()`, `augment()`, and `glance()` so they can

be easily used with tidyverse, ggplot2 and [tidymodels](https://www.tidymodels.org/learn/statistics/k-means/).

```{r tidyverse, message=FALSE, warning=FALSE}

library(tidyverse)

db <- x %>% dbscan(eps = .42, minPts = 5)

```

Get cluster statistics as a tibble

```{r tidyverse2}

tidy(db)

```

Visualize the clustering with ggplot2 (use an x for noise points)

```{r tidyverse3}

augment(db, x) %>% 

  ggplot(aes(x = Petal.Length, y = Petal.Width)) +

    geom_point(aes(color = .cluster, shape = noise)) +

    scale_shape_manual(values=c(19, 4))

```

## Using dbscan from Python

R, the R package `dbscan`, and the Python package `rpy2` need to be installed.

```{python, eval = FALSE}

import pandas as pd

import numpy as np

### prepare data

iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 

                   header = None, 

                   names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])

iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

# get R dbscan package

from rpy2.robjects import packages

dbscan = packages.importr('dbscan')

# enable automatic conversion of pandas dataframes to R dataframes

from rpy2.robjects import pandas2ri

pandas2ri.activate()

db = dbscan.dbscan(iris_numeric, eps = 0.5, MinPts = 5)

print(db)

```

```

## DBSCAN clustering for 150 objects.

## Parameters: eps = 0.5, minPts = 5

## Using euclidean distances and borderpoints = TRUE

## The clustering contains 2 cluster(s) and 17 noise points.

## 

##  0  1  2 

## 17 49 84 

## 

## Available fields: cluster, eps, minPts, dist, borderPoints

```

```{python, eval = FALSE}

# get the cluster assignment vector

labels = np.array(db.rx('cluster'))

labels

```

```

## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

##         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,

##         1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,

##         2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,

##         2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,

##         2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,

##         2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],

##       dtype=int32)

```

## License 

The dbscan package is licensed under the [GNU General Public License (GPL) Version 3](https://www.gnu.org/licenses/gpl-3.0.en.html). The __OPTICSXi__ R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with permission by the original author, Erich Schubert.  

## Changes

* List of changes from [NEWS.md](https://github.com/mhahsler/dbscan/blob/master/NEWS.md)

## References

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mhahsler/dbscan

Awesome Lists containing this project

README