Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/tidymodels/tidyclust

A tidy unified interface to clustering models
https://github.com/tidymodels/tidyclust

Last synced: 3 days ago
JSON representation

A tidy unified interface to clustering models

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# tidyclust

[![Codecov test coverage](https://codecov.io/gh/tidymodels/tidyclust/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/tidyclust?branch=main)
[![R-CMD-check](https://github.com/tidymodels/tidyclust/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/tidyclust/actions/workflows/R-CMD-check.yaml)

The goal of tidyclust is to provide a tidy, unified interface to clustering models. The packages is closely modeled after the [parsnip](https://parsnip.tidymodels.org/) package.

## Installation

You can install the released version of tidyclust from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("tidyclust")
```

and the development version of tidyclust from [GitHub](https://github.com/) with:

``` r
# install.packages("pak")
pak::pak("tidymodels/tidyclust")
```

## Example

The first thing you do is to create a `cluster specification`. For this example we are creating a K-means model, using the `stats` engine.

```{r}
library(tidyclust)
set.seed(1234)

kmeans_spec <- k_means(num_clusters = 3) %>%
set_engine("stats")

kmeans_spec
```

This specification can then be fit using data.

```{r}
kmeans_spec_fit <- kmeans_spec %>%
fit(~., data = mtcars)
kmeans_spec_fit
```

Once you have a fitted tidyclust object, you can do a number of things. `predict()` returns the cluster a new observation belongs to

```{r}
predict(kmeans_spec_fit, mtcars[1:4, ])
```

`extract_cluster_assignment()` returns the cluster assignments of the training observations

```{r}
extract_cluster_assignment(kmeans_spec_fit)
```

and `extract_centroids()` returns the locations of the clusters

```{r}
extract_centroids(kmeans_spec_fit)
```

## Visual comparison of clustering methods

Below is a visualization of the available models and how they compare using 2 dimensional toy data sets.

```{r comparison, echo=FALSE, message=FALSE, fig.asp=1/2, dpi=105, dev="svglite"}
#| fig-alt: "Mock comparison for different clustering methods for different data sets. Each row correspods to a clustering method, each column corresponds to a data set type."
library(tidymodels)
library(tidyclust)
set.seed(1234)

make_circles <- function(n) {
x <- seq(0, pi * 2, length.out = n)
x <- cos(x) * c(0.5, 1)
x <- x + rnorm(n, sd = 0.05)

y <- seq(0, pi * 2, length.out = n)
y <- sin(y) * c(0.5, 1)
y <- y + rnorm(n, sd = 0.05)

out <- data.frame(x, y)
attr(out, "name") <- "circles"
out
}

make_halves <- function(n) {
x <- seq(0, pi * 2, length.out = n)
x <- cos(x) + rep(c(0, 1), each = n / 2)
x <- x + rnorm(n, sd = 0.05)

y <- seq(0, pi * 2, length.out = n)
y <- sin(y) + rep(c(0, 0.5), each = n / 2)
y <- y + rnorm(n, sd = 0.05)
y <- y - 0.25
y <- y * (1 / 0.75)

out <- data.frame(x, y)
attr(out, "name") <- "halves"
out
}

make_uniform <- function(n) {
x <- runif(n, min = -1, max = 1)
y <- runif(n, min = -1, max = 1)

out <- data.frame(x, y)
attr(out, "name") <- "uniform"
out
}

make_blobs <- function(n) {
x <- rep(c(1, 2, 3), length.out = n)
x <- x + rnorm(n, sd = 0.5)
x <- (x - min(x)) / (max(x) - min(x)) * 2 - 1

y <- rep(c(1, 4, 2), length.out = n)
y <- y + rnorm(n, sd = 0.5)
y <- (y - min(y)) / (max(y) - min(y)) * 2 - 1

out <- data.frame(x, y)
attr(out, "name") <- "blobs"
out
}

augment_model <- function(model, data) {
model |>
fit(~., data = data) |>
augment(new_data = data) |>
mutate(
data_name = attr(data, "name"),
model_name = class(model)[1]
)
}

circle_data <- make_circles(500)
halves_data <- make_halves(500)
uniform_data <- make_uniform(500)
blobs_data <- make_blobs(500)

colors <- c("#E49E68", "#6899E4", "#E068E4")

expand_grid(
models = list(
k_means(num_clusters = 3),
hier_clust(num_clusters = 3)
),
datasets = list(
circle_data,
halves_data,
blobs_data,
uniform_data
)
) |>
pmap_dfr(~ augment_model(.x, .y)) |>
ggplot(aes(x, y, color = .pred_cluster)) +
geom_point(alpha = 0.5) +
facet_grid(model_name ~ data_name, scales = "free", switch = "y") +
scale_color_manual(values = colors) +
theme_void() +
theme(
strip.text.x = element_blank(),
strip.text.y = element_text(size = 15)
) +
guides(color = "none")
```

## Contributing

This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on RStudio Community](https://forum.posit.co/new-topic?category_id=15&tags=tidymodels,question).

- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/tidyclust/issues).

- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.

- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).
Footer