---
output:
  md_document:
    variant: markdown_github
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README_plots/"
)
```

# mixdir

The goal of mixdir is to cluster high dimensional categorical datasets.

It can

* handle missing data
* infer a reasonable number of latent classes (try `mixdir(select_latent=TRUE)`)
* cluster datasets with more than 70,000 observations and 60 features
* propagate uncertainty and produce a soft clustering

A detailed description of the algorithm and the features of the package can
be found in the accompanying [paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8631438&isnumber=8631391).

If you find the package useful please cite

> C. Ahlmann-Eltze and C. Yau, "MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data",
> 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

## Installation

```{r installation, eval=FALSE, include=TRUE}
install.packages("mixdir")

# Or to get the latest version from GitHub
devtools::install_github("const-ae/mixdir")
```

## Example

Clustering the [mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) data set.

```{r example_load}
# Loading the library and the data
library(mixdir)
set.seed(1)

data("mushroom")
# High dimensional dataset: 8124 mushrooms and 23 different features
mushroom[1:10, 1:5]
```

Calling the clustering function `mixdir` on a subset of the data:

```{r}
# Clustering into 3 latent classes
result <- mixdir(mushroom[1:1000, 1:5], n_latent=3)
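
# A hedged variation on the call above: rather than fixing the number of
# classes, let mixdir infer it via `select_latent = TRUE`.
# Assumption: `n_latent` then serves as an upper bound on the number of
# classes, so it is set generously here; `result_auto` is an illustrative name.
result_auto <- mixdir(mushroom[1:1000, 1:5], n_latent = 10, select_latent = TRUE)
# Count how many observations fall into each class that is actually used
table(result_auto$pred_class)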
```

Analyzing the result

```{r example}
# Latent class of the first 10 mushrooms
head(result$pred_class, n=10)

# Soft clustering for the first 10 mushrooms
head(result$class_prob, n=10)
pheatmap::pheatmap(result$class_prob, cluster_cols=FALSE,
                  labels_col = paste("Class", 1:3))

# Structure of latent class 1
# (bruises, cap color either yellow or white, edible etc.)
purrr::map(result$category_prob, 1)

# The most predictive features for each class
find_predictive_features(result, top_n=3)
# For example: if all I know about a mushroom is that it has a
# yellow cap, then I am 99% certain that it will be in class 1
predict(result, c(`cap-color`="yellow"))

# Note the most predictive features are different from the most typical ones
find_typical_features(result, top_n=3)
```

Dimensionality Reduction

```{r fig.width=8, fig.asp=0.31}
# Defining Features
def_feat <- find_defining_features(result, mushroom[1:1000, 1:5], n_features = 3)
print(def_feat)

# Plotting the most important features gives an immediate impression
# of how the clusters differ
plot_features(def_feat$features, result$category_prob)
```

# Underlying Model

The package implements a variational inference algorithm to solve a Bayesian latent class model (LCM).
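
For orientation, a generic Bayesian latent class model with Dirichlet priors can be written as below. This is a textbook-style sketch, not necessarily the package's exact parameterization: the symbols ($\lambda$, $U$, $z$, $\alpha$, $\beta$) and the symmetric Dirichlet priors are assumptions made for illustration. The paper cited above describes the model that mixdir actually implements.

$$
\begin{aligned}
\lambda &\sim \text{Dirichlet}(\alpha, \dots, \alpha) \\
U_{j,k} &\sim \text{Dirichlet}(\beta, \dots, \beta) \\
z_i \mid \lambda &\sim \text{Categorical}(\lambda) \\
x_{i,j} \mid z_i = k, U &\sim \text{Categorical}(U_{j,k})
\end{aligned}
$$

Here $i$ indexes observations, $j$ features and $k$ latent classes. Variational inference approximates the posterior $p(z, \lambda, U \mid x)$ with a simpler distribution $q$, chosen to maximise the evidence lower bound:

$$
\text{ELBO}(q) = \mathbb{E}_q[\log p(x, z, \lambda, U)] - \mathbb{E}_q[\log q(z, \lambda, U)] \le \log p(x)
$$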

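To make the idea of a latent class model more concrete, the following base R sketch simulates a small categorical dataset from a simple LCM: each observation belongs to one hidden class, and every feature is drawn from a categorical distribution that depends on that class. It does not use mixdir itself. The helper `rdirichlet_one`, the variable names and all sizes and concentration values are hypothetical choices for illustration only.

```r
# Hypothetical helper: one draw from a Dirichlet distribution,
# obtained by normalising independent Gamma draws.
rdirichlet_one <- function(conc) {
  g <- rgamma(length(conc), shape = conc, rate = 1)
  g / sum(g)
}

set.seed(1)
n_obs <- 200                      # number of observations
n_feat <- 4                       # number of categorical features
n_latent <- 3                     # number of latent classes
categories <- c("a", "b", "c")    # possible values of every feature

# Mixing proportions over the latent classes
lambda <- rdirichlet_one(rep(1, n_latent))

# One categorical distribution per (feature, class) pair
category_prob <- lapply(seq_len(n_feat), function(j) {
  lapply(seq_len(n_latent), function(k) {
    rdirichlet_one(rep(0.5, length(categories)))
  })
})

# Latent class of each observation
z <- sample(n_latent, n_obs, replace = TRUE, prob = lambda)

# Observed data: feature j of observation i is drawn from the
# categorical distribution that belongs to class z[i]
X <- lapply(seq_len(n_feat), function(j) {
  vapply(z, function(k) {
    sample(categories, 1, prob = category_prob[[j]][[k]])
  }, character(1))
})
names(X) <- paste0("feature_", seq_len(n_feat))
X <- as.data.frame(X, stringsAsFactors = TRUE)

head(X)
```

A data frame like `X` has the same layout as the input used in the example above: one row per observation and one categorical column per feature.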