Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/jbryer/clav

Cluster Analysis Validation
https://github.com/jbryer/clav

Last synced: about 2 months ago
JSON representation

Cluster Analysis Validation

Awesome Lists containing this project

README

        

---
output: github_document
editor_options:
chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# Cluster Analysis Validation (`clav`)

`r badger::badge_cran_release("clav", "orange")`
`r badger::badge_devel("jbryer/clav", "blue")`
`r badger::badge_github_actions("jbryer/clav", action = "R-CMD-check")`

The `clav` package provides utilities for conducting cluster (profile) analysis with an emphasis on the validating the stability of the profiles both within a given data set as well as across data sets. Unlike supervised models where the known class is measured, validation of unsupervised models where there is no known class can be challenging. The approach implemented here attempts to compare the cluster results across many random samples.

## Installation

You can install the development version of clav like so:

```{r install, eval=FALSE}
remotes::install_github('jbryer/clav')
```

## Development

The following commands are useful for working with the package source locally.

```{r development, eval=FALSE}
# Prep the PISA data set. This will take a while to run the first time.
source('data-raw/data-prep-pisa-2015.R')
# Generate the package documentation
usethis::use_tidy_description()
devtools::document()
# Install the package
devtools::install()
# Run CRAN check for any issues
devtools::check(cran = TRUE)
```

## Example

```{r setup, eval=TRUE}
library(clav)
data(pisa2015, package = 'clav')

cluster_vars <- c('interest', 'enjoyment', 'motivation', 'efficacy', "belonging")
outcome_vars <- c('science_score', 'principals')

pisa_usa <- pisa2015 |> dplyr::filter(country == 'UNITED STATES')
```

Finding the optimal number of clusters.

```{r optimal-clusters, cache=TRUE, message=FALSE, warning=FALSE}
optimal <- optimal_clusters(pisa_usa[,cluster_vars], max_k = 5)
optimal
plot(optimal, ncol = 2)
```

Validating cluster profiles using random samples of 50%. Out-of-bag uses the remaining 50% to predict cluster membership.

```{r random-validation, fig.width=10, fig.height=5}
pisa_cv_random <- pisa_usa |>
dplyr::select(dplyr::all_of(cluster_vars)) |>
clav::cluster_validation(
n_clusters = 3,
sample_size = 0.5 * nrow(pisa_usa),
replace = FALSE,
n_samples = 100,
seed = 42
)
plot(pisa_cv_random)
```

Re-estimate the clusters using the OOB sample instead of predicting using the in sample model.

```{r oob-reestimate-validation, fig.width=10, fig.height=5}
pisa_cv_random2 <- pisa_usa |>
dplyr::select(dplyr::all_of(cluster_vars)) |>
clav::cluster_validation(
n_clusters = 3,
oob_predict_fun = function(fit, newdata) {
newfit <- stats::kmeans(newdata, 3)
newfit$cluster
},
sample_size = 0.5 * nrow(pisa_usa),
replace = FALSE,
n_samples = 100,
seed = 42
)
plot(pisa_cv_random2)
```

Bootstrap approach to validation.

```{r bootstrap-validation, fig.width=10, fig.height=5}
pisa_cv_bootstrap <- pisa_usa |>
dplyr::select(dplyr::all_of(cluster_vars)) |>
clav::cluster_validation(
n_clusters = 3,
sample_size = nrow(pisa_usa),
replace = TRUE,
n_samples = 100,
seed = 42
)
plot(pisa_cv_bootstrap)
```

Using latent profile analysis for estimating clusters.

```{r, eval=FALSE, fig.width=10, fig.height=5}
lpa <- estimate_profiles(pisaUSA15, 3)
lpa_predict <- predict(lpa, pisaUSA15[sample(nrow(pisaUSA15), 100),])
# lpa_estimates <- get_estimates(lpa)
lpa_data <- get_data(lpa)

plot_profiles(lpa)
clav::profile_plot(pisaUSA15,
clusters = lpa_data$Class)

lpa_cv_random <- cluster_validation(
pisaUSA15,
n_clusters = 3,
cluster_fun = estimate_profiles,
oob_predict_fun = function(fit, newdata) {
estimate_profiles(newdata, n_clusters)
},
sample_size = 0.5 * nrow(pisaUSA15),
replace = FALSE,
n_samples = 50,
seed = 42
)
plot(lpa_cv_random)
```

## Profile Plot

```{r profile-plot, fig.width=10, fig.height=5}
fit <- pisa_usa |>
dplyr::select(dplyr::all_of(cluster_vars)) |>
stats::kmeans(centers = 3)
clav::profile_plot(pisa_usa[,cluster_vars],
clusters = fit$cluster,
df_dep = pisa_usa[,outcome_vars],
center_band = 0.33,
cluster_order = cluster_vars)
```

```{r palmerpenguins, include=FALSE, eval=FALSE, fig.width=10, fig.height=5}
library(palmerpenguins)
library(ggplot2)

data(penguins)
n_clusters <- 3
cluster_cols <- c('bill_depth_mm', 'bill_length_mm', 'body_mass_g', 'flipper_length_mm')
penguins <- penguins |>
tidyr::drop_na(dplyr::all_of(cluster_cols))
kmeans_fit <- kmeans(penguins[,cluster_cols], n_clusters)
clusters <- factor(kmeans_fit$cluster, labels = LETTERS[1:n_clusters])

profile_plot(df = penguins[,cluster_cols],
clusters = clusters,
df_dep = penguins[,'body_mass_g',drop=FALSE])

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = clusters)) +
geom_point() +
theme_minimal()

table(clusters, penguins$species)

```