Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Cluster Analysis Validation
https://github.com/jbryer/clav
- Host: GitHub
- URL: https://github.com/jbryer/clav
- Owner: jbryer
- License: gpl-3.0
- Created: 2024-11-13T03:38:16.000Z (about 2 months ago)
- Default Branch: master
- Last Pushed: 2024-11-13T04:23:00.000Z (about 2 months ago)
- Last Synced: 2024-11-13T05:19:28.219Z (about 2 months ago)
- Language: R
- Size: 1.89 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
README
---
output: github_document
editor_options:
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# Cluster Analysis Validation (`clav`)
`r badger::badge_cran_release("clav", "orange")`
`r badger::badge_devel("jbryer/clav", "blue")`
`r badger::badge_github_actions("jbryer/clav", action = "R-CMD-check")`

The `clav` package provides utilities for conducting cluster (profile) analysis with an emphasis on validating the stability of the profiles, both within a given data set and across data sets. Unlike supervised models, where predictions can be checked against known class labels, unsupervised models have no known class, which makes validation challenging. The approach implemented here compares cluster results across many random samples.
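As a rough illustration of the idea (not the `clav` API, which is demonstrated below), one can repeatedly draw random samples, cluster each sample, and examine how much the cluster centers vary across samples; stable profiles should yield similar centers in every sample. The sketch below uses only base R with `stats::kmeans` and the built-in `iris` data.

```{r concept-sketch, eval=FALSE}
# Conceptual sketch only; see the clav examples below for the package API.
set.seed(42)
df <- iris[, 1:4]
k <- 3
centers <- lapply(1:100, function(i) {
  idx <- sample(nrow(df), size = floor(0.5 * nrow(df)))
  fit <- stats::kmeans(df[idx, ], centers = k, nstart = 10)
  # Order the centers so cluster labels are comparable across samples.
  fit$centers[order(fit$centers[, 1]), ]
})
# Stable profiles imply low variability of each center across the 100 samples.
center_sd <- apply(simplify2array(centers), c(1, 2), sd)
round(center_sd, 2)
```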
## Installation
You can install the development version of clav like so:
```{r install, eval=FALSE}
remotes::install_github('jbryer/clav')
```

## Development
The following commands are useful for working with the package source locally.
```{r development, eval=FALSE}
# Prep the PISA data set. This will take a while to run the first time.
source('data-raw/data-prep-pisa-2015.R')
# Generate the package documentation
usethis::use_tidy_description()
devtools::document()
# Install the package
devtools::install()
# Run CRAN check for any issues
devtools::check(cran = TRUE)
```

## Example
```{r setup, eval=TRUE}
library(clav)
data(pisa2015, package = 'clav')

cluster_vars <- c('interest', 'enjoyment', 'motivation', 'efficacy', 'belonging')
outcome_vars <- c('science_score', 'principals')

pisa_usa <- pisa2015 |> dplyr::filter(country == 'UNITED STATES')
```

Finding the optimal number of clusters.
```{r optimal-clusters, cache=TRUE, message=FALSE, warning=FALSE}
optimal <- optimal_clusters(pisa_usa[,cluster_vars], max_k = 5)
optimal
plot(optimal, ncol = 2)
```

Validating cluster profiles using random samples of 50% of the data. The out-of-bag (OOB) step uses the remaining 50% to predict cluster membership.
```{r random-validation, fig.width=10, fig.height=5}
pisa_cv_random <- pisa_usa |>
dplyr::select(dplyr::all_of(cluster_vars)) |>
clav::cluster_validation(
n_clusters = 3,
sample_size = 0.5 * nrow(pisa_usa),
replace = FALSE,
n_samples = 100,
seed = 42
)
plot(pisa_cv_random)
```

Re-estimate the clusters using the OOB sample instead of predicting from the in-sample model.
```{r oob-reestimate-validation, fig.width=10, fig.height=5}
pisa_cv_random2 <- pisa_usa |>
dplyr::select(dplyr::all_of(cluster_vars)) |>
clav::cluster_validation(
n_clusters = 3,
oob_predict_fun = function(fit, newdata) {
newfit <- stats::kmeans(newdata, 3)
newfit$cluster
},
sample_size = 0.5 * nrow(pisa_usa),
replace = FALSE,
n_samples = 100,
seed = 42
)
plot(pisa_cv_random2)
```

Bootstrap approach to validation (sampling with replacement, with samples the same size as the original data set).
```{r bootstrap-validation, fig.width=10, fig.height=5}
pisa_cv_bootstrap <- pisa_usa |>
dplyr::select(dplyr::all_of(cluster_vars)) |>
clav::cluster_validation(
n_clusters = 3,
sample_size = nrow(pisa_usa),
replace = TRUE,
n_samples = 100,
seed = 42
)
plot(pisa_cv_bootstrap)
```

Using latent profile analysis (via the `tidyLPA` package) for estimating clusters.
```{r, eval=FALSE, fig.width=10, fig.height=5}
library(tidyLPA) # estimate_profiles(), get_data(), plot_profiles(), and the pisaUSA15 data

lpa <- estimate_profiles(pisaUSA15, 3)
lpa_predict <- predict(lpa, pisaUSA15[sample(nrow(pisaUSA15), 100),])
# lpa_estimates <- get_estimates(lpa)
lpa_data <- get_data(lpa)

plot_profiles(lpa)

clav::profile_plot(pisaUSA15,
                   clusters = lpa_data$Class)

lpa_cv_random <- cluster_validation(
pisaUSA15,
n_clusters = 3,
cluster_fun = estimate_profiles,
    oob_predict_fun = function(fit, newdata) {
        newfit <- estimate_profiles(newdata, 3)
        get_data(newfit)$Class
    },
sample_size = 0.5 * nrow(pisaUSA15),
replace = FALSE,
n_samples = 50,
seed = 42
)
plot(lpa_cv_random)
```

## Profile Plot
```{r profile-plot, fig.width=10, fig.height=5}
fit <- pisa_usa |>
dplyr::select(dplyr::all_of(cluster_vars)) |>
stats::kmeans(centers = 3)
clav::profile_plot(pisa_usa[,cluster_vars],
clusters = fit$cluster,
df_dep = pisa_usa[,outcome_vars],
center_band = 0.33,
cluster_order = cluster_vars)
```

```{r palmerpenguins, include=FALSE, eval=FALSE, fig.width=10, fig.height=5}
library(palmerpenguins)
library(ggplot2)

data(penguins)
n_clusters <- 3
cluster_cols <- c('bill_depth_mm', 'bill_length_mm', 'body_mass_g', 'flipper_length_mm')
penguins <- penguins |>
tidyr::drop_na(dplyr::all_of(cluster_cols))
kmeans_fit <- kmeans(penguins[,cluster_cols], n_clusters)
clusters <- factor(kmeans_fit$cluster, labels = LETTERS[1:n_clusters])

profile_plot(df = penguins[,cluster_cols],
             clusters = clusters,
             df_dep = penguins[,'body_mass_g',drop=FALSE])

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = clusters)) +
    geom_point() +
    theme_minimal()

table(clusters, penguins$species)
```