https://github.com/tidymodels/tidyclust
A tidy unified interface to clustering models
- Host: GitHub
- URL: https://github.com/tidymodels/tidyclust
- Owner: tidymodels
- License: other
- Created: 2021-11-19T19:52:49.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-07-02T22:39:15.000Z (4 months ago)
- Last Synced: 2024-10-31T22:03:17.416Z (12 days ago)
- Language: R
- Homepage: https://tidyclust.tidymodels.org/
- Size: 8.1 MB
- Stars: 108
- Watchers: 6
- Forks: 17
- Open Issues: 35
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# tidyclust
[![Codecov test coverage](https://codecov.io/gh/tidymodels/tidyclust/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/tidyclust?branch=main)
[![R-CMD-check](https://github.com/tidymodels/tidyclust/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/tidyclust/actions/workflows/R-CMD-check.yaml)

The goal of tidyclust is to provide a tidy, unified interface to clustering models. The package is closely modeled after the [parsnip](https://parsnip.tidymodels.org/) package.
## Installation
You can install the released version of tidyclust from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("tidyclust")
```

and the development version of tidyclust from [GitHub](https://github.com/) with:
``` r
# install.packages("pak")
pak::pak("tidymodels/tidyclust")
```

## Example
The first step is to create a cluster specification. This example creates a K-means model using the `stats` engine.
```{r}
library(tidyclust)
set.seed(1234)

kmeans_spec <- k_means(num_clusters = 3) %>%
  set_engine("stats")

kmeans_spec
```

This specification can then be fit using data.
```{r}
kmeans_spec_fit <- kmeans_spec %>%
  fit(~., data = mtcars)
kmeans_spec_fit
```

Once you have a fitted tidyclust object, you can do a number of things. `predict()` returns the cluster a new observation belongs to
```{r}
predict(kmeans_spec_fit, mtcars[1:4, ])
```

`extract_cluster_assignment()` returns the cluster assignments of the training observations
```{r}
extract_cluster_assignment(kmeans_spec_fit)
```

and `extract_centroids()` returns the locations of the clusters
```{r}
extract_centroids(kmeans_spec_fit)
```

## Visual comparison of clustering methods
Below is a visualization of the available models and how they compare on two-dimensional toy data sets.
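
Each panel of the figure comes from fitting one specification to one data set and then calling `augment()` to attach the predicted cluster to every row. The chunk below is a minimal sketch of that step for a single model; `toy_data` is a hypothetical data set invented for illustration, not one of the data sets used in the figure.

```r
library(tidymodels)
library(tidyclust)
set.seed(1234)

# Hypothetical toy data: two well-separated groups in two dimensions
toy_data <- data.frame(
  x = c(rnorm(50, mean = -1, sd = 0.2), rnorm(50, mean = 1, sd = 0.2)),
  y = c(rnorm(50, mean = -1, sd = 0.2), rnorm(50, mean = 1, sd = 0.2))
)

# Fit a hierarchical clustering specification on all columns
hc_fit <- hier_clust(num_clusters = 2) |>
  fit(~., data = toy_data)

# augment() returns the data with an added `.pred_cluster` column
toy_clustered <- augment(hc_fit, new_data = toy_data)

# Count how many rows were assigned to each cluster
table(toy_clustered$.pred_cluster)
```

The `.pred_cluster` column is what the figure maps to point color.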
```{r comparison, echo=FALSE, message=FALSE, fig.asp=1/2, dpi=105, dev="svglite"}
#| fig-alt: "Mock comparison of different clustering methods on different data sets. Each row corresponds to a clustering method, each column corresponds to a data set type."
library(tidymodels)
library(tidyclust)
set.seed(1234)

make_circles <- function(n) {
  x <- seq(0, pi * 2, length.out = n)
  x <- cos(x) * c(0.5, 1)
  x <- x + rnorm(n, sd = 0.05)

  y <- seq(0, pi * 2, length.out = n)
  y <- sin(y) * c(0.5, 1)
  y <- y + rnorm(n, sd = 0.05)

  out <- data.frame(x, y)
  attr(out, "name") <- "circles"
  out
}

make_halves <- function(n) {
  x <- seq(0, pi * 2, length.out = n)
  x <- cos(x) + rep(c(0, 1), each = n / 2)
  x <- x + rnorm(n, sd = 0.05)

  y <- seq(0, pi * 2, length.out = n)
  y <- sin(y) + rep(c(0, 0.5), each = n / 2)
  y <- y + rnorm(n, sd = 0.05)
  y <- y - 0.25
  y <- y * (1 / 0.75)

  out <- data.frame(x, y)
  attr(out, "name") <- "halves"
  out
}

make_uniform <- function(n) {
  x <- runif(n, min = -1, max = 1)
  y <- runif(n, min = -1, max = 1)

  out <- data.frame(x, y)
  attr(out, "name") <- "uniform"
  out
}

make_blobs <- function(n) {
  x <- rep(c(1, 2, 3), length.out = n)
  x <- x + rnorm(n, sd = 0.5)
  x <- (x - min(x)) / (max(x) - min(x)) * 2 - 1

  y <- rep(c(1, 4, 2), length.out = n)
  y <- y + rnorm(n, sd = 0.5)
  y <- (y - min(y)) / (max(y) - min(y)) * 2 - 1

  out <- data.frame(x, y)
  attr(out, "name") <- "blobs"
  out
}

augment_model <- function(model, data) {
  model |>
    fit(~., data = data) |>
    augment(new_data = data) |>
    mutate(
      data_name = attr(data, "name"),
      model_name = class(model)[1]
    )
}

circle_data <- make_circles(500)
halves_data <- make_halves(500)
uniform_data <- make_uniform(500)
blobs_data <- make_blobs(500)

colors <- c("#E49E68", "#6899E4", "#E068E4")

expand_grid(
  models = list(
    k_means(num_clusters = 3),
    hier_clust(num_clusters = 3)
  ),
  datasets = list(
    circle_data,
    halves_data,
    blobs_data,
    uniform_data
  )
) |>
  pmap_dfr(~ augment_model(.x, .y)) |>
  ggplot(aes(x, y, color = .pred_cluster)) +
  geom_point(alpha = 0.5) +
  facet_grid(model_name ~ data_name, scales = "free", switch = "y") +
  scale_color_manual(values = colors) +
  theme_void() +
  theme(
    strip.text.x = element_blank(),
    strip.text.y = element_text(size = 15)
  ) +
  guides(color = "none")
```

## Contributing
This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on Posit Community](https://forum.posit.co/new-topic?category_id=15&tags=tidymodels,question).
- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/tidyclust/issues).
- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.
- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).
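
As a rough illustration of the reprex step mentioned above, the sketch below wraps a small tidyclust snippet in `reprex::reprex()`. The snippet being wrapped and the name `issue_text` are only examples; in practice you would substitute the smallest piece of code that reproduces your problem.

```r
# install.packages("reprex")
library(reprex)

# Wrap the smallest snippet that reproduces the behaviour in reprex().
# venue = "gh" renders GitHub-flavoured Markdown suitable for an issue.
issue_text <- reprex({
  library(tidyclust)
  set.seed(1234)

  kmeans_spec <- k_means(num_clusters = 3) %>%
    set_engine("stats")

  fit(kmeans_spec, ~., data = mtcars)
}, venue = "gh")

# The rendered reprex is returned as a character vector
cat(issue_text, sep = "\n")
```

Because the snippet runs in a fresh R session, it has to load every package and create every object it uses.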