An open API service indexing awesome lists of open source software.

https://github.com/const-ae/mixdir

Cluster high dimensional categorical datasets
https://github.com/const-ae/mixdir

categorical-data clustering questionnaires r-package variational-inference

Last synced: 6 months ago
JSON representation

Cluster high dimensional categorical datasets

Awesome Lists containing this project

README

          

---
output:
md_document:
variant: markdown_github
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README_plots/"
)
```

# mixdir

The goal of mixdir is to cluster high dimensional categorical datasets.

It can

* handle missing data
* infer a reasonable number of latent class (try `mixdir(select_latent=TRUE)`)
* cluster datasets with more than 70,000 observations and 60 features
* propagate uncertainty and produce a soft clustering

A detailed description of the algorithm and the features of the package can
be found in the the accompanying [paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8631438&isnumber=8631391).
If you find the package useful please cite

>C. Ahlmann-Eltze and C. Yau, "MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data",
2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

## Installation

```{r installation, eval=FALSE, include=TRUE}
install.packages("mixdir")

# Or to get the latest version from github
devtools::install_github("const-ae/mixdir")
```

## Example

Clustering the [mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) data set.

![](man/figures/README_plots/clustering_overview.png)

```{r example_load}
# Loading the library and the data
library(mixdir)
set.seed(1)

data("mushroom")
# High dimensional dataset: 8124 mushroom and 23 different features
mushroom[1:10, 1:5]
```

Calling the clustering function `mixdir` on a subset of the data:

```{r}
# Clustering into 3 latent classes
result <- mixdir(mushroom[1:1000, 1:5], n_latent=3)
```

Analyzing the result

```{r example}
# Latent class of of first 10 mushrooms
head(result$pred_class, n=10)

# Soft Clustering for first 10 mushrooms
head(result$class_prob, n=10)
pheatmap::pheatmap(result$class_prob, cluster_cols=FALSE,
labels_col = paste("Class", 1:3))

# Structure of latent class 1
# (bruises, cap color either yellow or white, edible etc.)
purrr::map(result$category_prob, 1)

# The most predicitive features for each class
find_predictive_features(result, top_n=3)
# For example: if all I know about a mushroom is that it has a
# yellow cap, then I am 99% certain that it will be in class 1
predict(result, c(`cap-color`="yellow"))

# Note the most predictive features are different from the most typical ones
find_typical_features(result, top_n=3)
```

Dimensionality Reduction

```{r fig.width=8, fig.asp=0.31}
# Defining Features
def_feat <- find_defining_features(result, mushroom[1:1000, 1:5], n_features = 3)
print(def_feat)

# Plotting the most important features gives an immediate impression
# how the cluster differ
plot_features(def_feat$features, result$category_prob)
```

# Underlying Model

The package implements a variational inference algorithm to solve a Bayesian latent class model (LCM).

![](man/figures/README_plots/model_plate_notation.png)