https://github.com/jchiquet/aricode

R package for computation of (adjusted) rand-index and other such scores
https://github.com/jchiquet/aricode

bucket-sort clustering clustering-comparison-measures

Last synced: 7 months ago
JSON representation

R package for computation of (adjusted) rand-index and other such scores

Host: GitHub
URL: https://github.com/jchiquet/aricode
Owner: jchiquet
Created: 2016-09-06T11:46:09.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2024-02-27T18:31:44.000Z (over 1 year ago)
Last Synced: 2025-03-23T21:35:52.683Z (7 months ago)
Topics: bucket-sort, clustering, clustering-comparison-measures
Language: R
Homepage: https://jchiquet.github.io/aricode
Size: 878 KB
Stars: 25
Watchers: 4
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Authors: AUTHORS

Awesome Lists containing this project

README

          ---

output: github_document

---

```{r setup, include=FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  fig.path = "man/figures/"

)

```

# aricode

 

[![R-CMD-check](https://github.com/jchiquet/aricode/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/jchiquet/aricode/actions/workflows/R-CMD-check.yaml)

[![CRAN Status](https://www.r-pkg.org/badges/version/aricode)](https://CRAN.R-project.org/package=aricode)

[![Coverage status](https://codecov.io/gh/jchiquet/aricode/branch/master/graph/badge.svg)](https://codecov.io/gh/jchiquet/aricode)

[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-blue.svg)](https://www.tidyverse.org/lifecycle/#stable)

[![](https://img.shields.io/github/last-commit/jchiquet/aricode.svg)](https://github.com/jchiquet/aricode/commits/master)

  

A package for efficient computations of standard clustering comparison measures

## Installation

Stable version on the [CRAN](https://cran.rstudio.com/web/packages/aricode/).

```{r install_cran, eval = FALSE}

install.packages("aricode")

```

The development version is available via:

```{r install_github, eval = FALSE}

devtools::install_github("jchiquet/aricode")

```

## Description

Computation of measures for clustering comparison (ARI, AMI, NID and even the $\chi^2$ distance) are usually based on the contingency table. Traditional implementations (e.g., function `adjustedRandIndex` of package **mclust**) are in $\Omega(n + u v)$ where 

- $n$ is the size of the vectors the classifications of which are to be compared,

- $u$ and $v$ are the respective number of classes in each vectors. 

In **aricode** we propose an implementation, based on radix sort, that is in $\Theta(n)$ in time and space.  

Importantly, the complexity does not depends on $u$ and $v$.

Our implementation of the ARI for instance is one or two order of magnitude faster than some standard implementation in `R`.

## Available measures and functions

The functions included in aricode are:

- `ARI`: computes the adjusted rand index

- `Chi2`: computes the Chi-square statistics

- `MARI/MARIraw`: computes the modified adjusted rand index (Sundqvist et al, in preparation)

- `NVI`: computes the the normalized variation information

- `NID`: computes the normalized information distance

- `NMI`: computes the normalized mutual information

- `AMI`: computes the adjusted mutual information

- `expected_MI`: computes the expected mutual information

- `entropy`: computes the conditional and joint entropies

- `clustComp`: computes all clustering comparison measures at once

## Timings

Here are some timings to compare the cost of computing the adjusted Rand Index with **aricode** or with the commonly used function `adjustedRandIndex` of the *mclust* package: the cost of the latter can be prohibitive for large vectors: 

```{r timings_function, echo=FALSE, message=FALSE, warning=FALSE}

library(aricode)

library(mclust)

library(ggplot2)

time.aricode <- function(times, c1, c2){

  replicate(times, system.time(ARI(c1, c2))[3])

}

time.mclust <- function(times, c1, c2){

  replicate(times, system.time(mclust::adjustedRandIndex(c1, c2))[3])

}

time.method <- function(times, c1, c2, n){

  rbind(

    data.frame(time = time.aricode(times, c1, c2), expr = "aricode", n = n),

    data.frame(time = time.mclust(times, c1, c2), expr = "mclust", n = n)

  )

}

# with similar classif, number of classes grows with n

sim.timings <- function(n, times = 10) {

    c1 <- sample(1:(n/200), n, replace=TRUE);c2 <- c1;

    i_change <- sample(1:n, n/50, replace=FALSE)

    c2[i_change] <- c2[rev(i_change)]

    out <- time.method(times, c1, c2, n)

    data.frame(time=out$time, method=out$expr, n = n)

}

```

```{r timings_run, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE}

# with similar classif, number of classes grows with n

ns <- sort(c(200 * 2^(3:14), 150 * 2^(3:15)))

timings <- do.call("rbind", lapply(ns, sim.timings))

```

```{r timings_plot, echo=FALSE, message=FALSE, warning=FALSE}

p.timings <- ggplot(timings, aes(x=n, y=time, colour=method)) +

  geom_smooth(data = dplyr::filter(timings, n > 1e4), method = "lm") + geom_point(size=0.25, alpha=0.9) + labs(y="time (sec.)") +

    scale_x_log10(

   breaks = scales::trans_breaks("log10", function(x) 10^x),

   labels = scales::trans_format("log10", scales::math_format(10^.x))

 ) +

 scale_y_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),

   labels = scales::trans_format("log10", scales::math_format(10^.x))) +

   annotation_logticks()                 

p.timings + ggtitle("number of classes grows with n") + theme_bw()

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jchiquet/aricode

Awesome Lists containing this project

README