https://github.com/jbryer/clav

Cluster Analysis Validation
https://github.com/jbryer/clav

Last synced: 4 months ago
JSON representation

Cluster Analysis Validation

Host: GitHub
URL: https://github.com/jbryer/clav
Owner: jbryer
License: gpl-3.0
Created: 2024-11-13T03:38:16.000Z (8 months ago)
Default Branch: master
Last Pushed: 2024-11-13T04:23:00.000Z (8 months ago)
Last Synced: 2025-01-17T00:11:54.258Z (6 months ago)
Language: R
Size: 1.89 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md

Awesome Lists containing this project

README

        ---

output: github_document

editor_options: 

  chunk_output_type: console

---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

```

# Cluster Analysis Validation (`clav`)

`r badger::badge_cran_release("clav", "orange")`

`r badger::badge_devel("jbryer/clav", "blue")`

`r badger::badge_github_actions("jbryer/clav", action = "R-CMD-check")`

The `clav` package provides utilities for conducting cluster (profile) analysis with an emphasis on the validating the stability of the profiles both within a given data set as well as across data sets. Unlike supervised models where the known class is measured, validation of unsupervised models where there is no known class can be challenging. The approach implemented here attempts to compare the cluster results across many random samples.

## Installation

You can install the development version of clav like so:

```{r install, eval=FALSE}

remotes::install_github('jbryer/clav')

```

## Development

The following commands are useful for working with the package source locally.

```{r development, eval=FALSE}

# Prep the PISA data set. This will take a while to run the first time.

source('data-raw/data-prep-pisa-2015.R')

# Generate the package documentation

usethis::use_tidy_description()

devtools::document()

# Install the package

devtools::install()

# Run CRAN check for any issues

devtools::check(cran = TRUE)

```

## Example

```{r setup, eval=TRUE}

library(clav)

data(pisa2015, package = 'clav')

cluster_vars <- c('interest', 'enjoyment', 'motivation', 'efficacy', "belonging")

outcome_vars <- c('science_score', 'principals')

pisa_usa <- pisa2015 |> dplyr::filter(country == 'UNITED STATES')

```

Finding the optimal number of clusters.

```{r optimal-clusters, cache=TRUE, message=FALSE, warning=FALSE}

optimal <- optimal_clusters(pisa_usa[,cluster_vars], max_k = 5)

optimal

plot(optimal, ncol = 2)

```

Validating cluster profiles using random samples of 50%. Out-of-bag uses the remaining 50% to predict cluster membership.

```{r random-validation, fig.width=10, fig.height=5}

pisa_cv_random <- pisa_usa |> 

	dplyr::select(dplyr::all_of(cluster_vars)) |>

	clav::cluster_validation(

		n_clusters = 3,

		sample_size = 0.5 * nrow(pisa_usa),

		replace = FALSE,

		n_samples = 100,

		seed = 42

)

plot(pisa_cv_random)

```

Re-estimate the clusters using the OOB sample instead of predicting using the in sample model.

```{r oob-reestimate-validation, fig.width=10, fig.height=5}

pisa_cv_random2 <- pisa_usa |> 

	dplyr::select(dplyr::all_of(cluster_vars)) |>

	clav::cluster_validation(

		n_clusters = 3,

		oob_predict_fun = function(fit, newdata) {

			newfit <- stats::kmeans(newdata, 3)

			newfit$cluster

		},

		sample_size = 0.5 * nrow(pisa_usa),

		replace = FALSE,

		n_samples = 100,

		seed = 42

)

plot(pisa_cv_random2)

```

Bootstrap approach to validation.

```{r bootstrap-validation, fig.width=10, fig.height=5}

pisa_cv_bootstrap <- pisa_usa |> 

	dplyr::select(dplyr::all_of(cluster_vars)) |>

	clav::cluster_validation(

		n_clusters = 3,

		sample_size = nrow(pisa_usa),

		replace = TRUE,

		n_samples = 100,

		seed = 42

)

plot(pisa_cv_bootstrap)

```

Using latent profile analysis for estimating clusters.

```{r, eval=FALSE, fig.width=10, fig.height=5}

lpa <- estimate_profiles(pisaUSA15, 3)

lpa_predict <- predict(lpa, pisaUSA15[sample(nrow(pisaUSA15), 100),])

# lpa_estimates <- get_estimates(lpa)

lpa_data <- get_data(lpa)

plot_profiles(lpa)

clav::profile_plot(pisaUSA15,

			 clusters = lpa_data$Class)

lpa_cv_random <- cluster_validation(

	pisaUSA15,

	n_clusters = 3,

	cluster_fun = estimate_profiles,

	oob_predict_fun = function(fit, newdata) {

		estimate_profiles(newdata, n_clusters)

	},

	sample_size = 0.5 * nrow(pisaUSA15),

	replace = FALSE,

	n_samples = 50,

	seed = 42

)

plot(lpa_cv_random)

```

## Profile Plot

```{r profile-plot, fig.width=10, fig.height=5}

fit <- pisa_usa |> 

	dplyr::select(dplyr::all_of(cluster_vars)) |>

	stats::kmeans(centers = 3)

clav::profile_plot(pisa_usa[,cluster_vars],

				   clusters = fit$cluster,

				   df_dep = pisa_usa[,outcome_vars],

				   center_band = 0.33,

				   cluster_order = cluster_vars)

```

```{r palmerpenguins, include=FALSE, eval=FALSE, fig.width=10, fig.height=5}

library(palmerpenguins)

library(ggplot2)

data(penguins)

n_clusters <- 3

cluster_cols <- c('bill_depth_mm', 'bill_length_mm', 'body_mass_g', 'flipper_length_mm')

penguins <- penguins |>

	tidyr::drop_na(dplyr::all_of(cluster_cols))

kmeans_fit <- kmeans(penguins[,cluster_cols], n_clusters)

clusters <- factor(kmeans_fit$cluster, labels = LETTERS[1:n_clusters])

profile_plot(df = penguins[,cluster_cols],

			 clusters = clusters,

			 df_dep = penguins[,'body_mass_g',drop=FALSE])

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = clusters)) +

	geom_point() + 

	theme_minimal()

table(clusters, penguins$species)

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jbryer/clav

Awesome Lists containing this project

README