https://github.com/const-ae/mixdir

Cluster high dimensional categorical datasets

categorical-data clustering questionnaires r-package variational-inference

Host: GitHub
URL: https://github.com/const-ae/mixdir
Owner: const-ae
Created: 2018-01-20T19:17:24.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2023-09-11T18:19:38.000Z (about 2 years ago)
Last Synced: 2025-01-23T09:08:14.823Z (9 months ago)
Topics: categorical-data, clustering, questionnaires, r-package, variational-inference
Language: R
Size: 336 KB
Stars: 14
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd

README

---
output:
  md_document:
    variant: markdown_github
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README_plots/"
)
```

# mixdir

The goal of mixdir is to cluster high dimensional categorical datasets.

It can

* handle missing data
* infer a reasonable number of latent classes (try `mixdir(select_latent=TRUE)`)
* cluster datasets with more than 70,000 observations and 60 features
* propagate uncertainty and produce a soft clustering
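
The first two points can be combined in a single call. The following is a minimal sketch, assuming that `select_latent` and `na_handle` behave as described in the package documentation. The data frame `toy` and its column names are made up for illustration, and the clustering step is skipped if mixdir is not installed.

```{r missing_and_select}
# Hypothetical toy questionnaire: 200 respondents, 3 categorical questions
set.seed(42)
n <- 200
toy <- data.frame(
  q1 = sample(c("yes", "no"), n, replace = TRUE),
  q2 = sample(c("low", "medium", "high"), n, replace = TRUE),
  q3 = sample(c("a", "b", "c", "d"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Blank out 20 answers per question to mimic missing responses
for (col in names(toy)) {
  toy[[col]][sample(n, size = 20)] <- NA
}

# Only run the clustering if mixdir is available
if (requireNamespace("mixdir", quietly = TRUE)) {
  # Assumption: with select_latent = TRUE, n_latent is treated as an upper
  # bound and unneeded classes are left (nearly) empty.
  # na_handle = "ignore" skips missing entries instead of treating NA
  # as its own category.
  fit <- mixdir::mixdir(toy, n_latent = 5, select_latent = TRUE,
                        na_handle = "ignore")
  table(fit$pred_class)
}
```

Because `toy` is pure noise, there is no real cluster structure to recover here; the point is only to show the shape of the call.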

A detailed description of the algorithm and the features of the package can be found in the accompanying [paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8631438&isnumber=8631391).

If you find the package useful please cite

> C. Ahlmann-Eltze and C. Yau, "MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data",
> 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

## Installation

```{r installation, eval=FALSE, include=TRUE}
install.packages("mixdir")
# Or to get the latest version from GitHub
devtools::install_github("const-ae/mixdir")
```

## Example

Clustering the [mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) data set.

![](man/figures/README_plots/clustering_overview.png)

```{r example_load}
# Loading the library and the data
library(mixdir)
set.seed(1)
data("mushroom")
# High dimensional dataset: 8124 mushrooms and 23 different features
mushroom[1:10, 1:5]
```

Calling the clustering function `mixdir` on a subset of the data:

```{r}
# Clustering into 3 latent classes
result <- mixdir(mushroom[1:1000, 1:5], n_latent=3)
```

Analyzing the result:

```{r example}
# Latent class of the first 10 mushrooms
head(result$pred_class, n=10)

# Soft clustering for the first 10 mushrooms
head(result$class_prob, n=10)
pheatmap::pheatmap(result$class_prob, cluster_cols=FALSE,
                  labels_col = paste("Class", 1:3))

# Structure of latent class 1
# (bruises, cap color either yellow or white, edible etc.)
purrr::map(result$category_prob, 1)

# The most predictive features for each class
find_predictive_features(result, top_n=3)

# For example: if all I know about a mushroom is that it has a
# yellow cap, then I am 99% certain that it will be in class 1
predict(result, c(`cap-color`="yellow"))

# Note that the most predictive features are different from the most typical ones
find_typical_features(result, top_n=3)
```
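
The soft clustering in `class_prob` carries more information than the single label in `pred_class`. The sketch below uses a made-up probability matrix (not package output) to show how one might derive a hard assignment and flag ambiguous observations. Taking the hard label as the row-wise most probable class and using a 0.5 cutoff are both illustrative assumptions, not something the package prescribes.

```{r soft_to_hard}
# Hypothetical class membership probabilities: 4 observations, 3 classes
class_prob <- matrix(
  c(0.90, 0.05, 0.05,
    0.40, 0.35, 0.25,
    0.10, 0.80, 0.10,
    0.34, 0.33, 0.33),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, paste("Class", 1:3))
)

# Hard assignment: index of the most probable class in each row
hard_class <- max.col(class_prob, ties.method = "first")

# Probability of the chosen class, and a flag for rows where no class
# reaches the (arbitrary) 0.5 threshold
max_prob <- apply(class_prob, 1, max)
uncertain <- max_prob < 0.5

data.frame(hard_class, max_prob, uncertain)
```

Rows 2 and 4 are flagged as uncertain: their hard label is class 1, but the other classes are nearly as likely.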

Dimensionality reduction:

```{r fig.width=8, fig.asp=0.31}
# Defining features
def_feat <- find_defining_features(result, mushroom[1:1000, 1:5], n_features = 3)
print(def_feat)

# Plotting the most important features gives an immediate impression
# of how the clusters differ
plot_features(def_feat$features, result$category_prob)
```

# Underlying Model

The package implements a variational inference algorithm to fit a Bayesian latent class model (LCM).



![](man/figures/README_plots/model_plate_notation.png)
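
As a rough illustration of what a latent class model assumes, the sketch below simulates data from one in base R: each observation first draws a latent class, then draws every feature independently from that class's categorical distribution. The mixing weights, category probabilities, feature names, and all variable names are made up for illustration, and the priors that a Bayesian treatment places on these parameters are left out.

```{r lcm_simulation}
set.seed(1)
n_obs <- 500

# Hypothetical mixing weights, one per latent class
lambda <- c(0.5, 0.3, 0.2)

# Hypothetical category probabilities:
# one matrix per feature, one row per class, one column per category
feature_probs <- list(
  color = rbind(c(0.8, 0.1, 0.1),
                c(0.1, 0.8, 0.1),
                c(0.1, 0.1, 0.8)),
  odor  = rbind(c(0.9, 0.1),
                c(0.5, 0.5),
                c(0.1, 0.9))
)
feature_levels <- list(
  color = c("red", "white", "yellow"),
  odor  = c("none", "foul")
)

# Step 1: draw the latent class of every observation
z <- sample(seq_along(lambda), n_obs, replace = TRUE, prob = lambda)

# Step 2: draw each feature conditional on the latent class
simulate_feature <- function(probs, lev, classes) {
  vapply(classes, function(k) sample(lev, 1, prob = probs[k, ]), character(1))
}
X <- as.data.frame(
  Map(simulate_feature, feature_probs, feature_levels, list(z)),
  stringsAsFactors = FALSE
)

# Empirical category frequencies per class should be close to feature_probs$color
round(prop.table(table(z, X$color), margin = 1), 2)
```

Clustering inverts this process: given only `X`, it estimates the class memberships `z` together with the weights and category probabilities.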
