https://github.com/xiangli2pro/hbcm
An R package to perform clustering for continuous network data in matrix format.
- Host: GitHub
- URL: https://github.com/xiangli2pro/hbcm
- Owner: xiangli2pro
- License: other
- Created: 2021-09-11T20:04:59.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-09-13T21:11:40.000Z (7 months ago)
- Last Synced: 2024-09-15T12:26:52.310Z (7 months ago)
- Language: R
- Homepage:
- Size: 13.9 MB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - xiangli2pro/hbcm - A R package to perform clustering for the continuous network data in matrix format. (R)
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE,
  warning = FALSE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

library(badger)
```

# hbcm package
`r badge_repostatus("Active")` `r badge_github_actions("rossellhayes/ipa")`

Community detection is a clustering method based on objects' pairwise relationships: objects in the same group are more densely connected than objects from different groups, and correlations within the same cluster are homogeneous. Most model-based community detection methods, such as the stochastic block model and its variants, are designed for networks with binary (yes/no) connectivity values, which ignores practical scenarios where pairwise relationships are continuous and reflect different degrees of connectivity. The heterogeneous block covariance model (HBCM) proposes a novel clustering structure applicable to signed and continuous connections, such as the covariances or correlations between objects. It also takes into account the heterogeneous property of each object within a community. A novel variational EM algorithm is employed to estimate the optimal group membership. HBCM has provably consistent estimation of clustering memberships, and its practical performance is demonstrated by numerical simulations. HBCM is illustrated on yeast gene expression data, where it detects gene groups with correlated expression levels responding to the same transcription factors.

## Installation
You can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("xiangli2pro/hbcm")

# load package
library('hbcm')
```

## Examples

### 1. Simulation data
Create a data matrix `x` of dimension `NxP=500x500` whose columns belong to five (`K=5`) non-overlapping groups (labeled 1 to 5). The values of `x` are determined by three things: a parameter matrix `alpha` of size `NxK` (its rows follow a multivariate normal distribution) and two heterogeneous parameter vectors, `hlambda` and `hsigma`, each of size `Px1`.

```{r}
# check function arguments and return values documentation
?hbcm::data_gen
```

```{r}
set.seed(2022)

# x dimension
n <- 500
p <- 500

# cluster number
centers <- 5

# cluster labels follow a multinoulli distribution with probability ppi
ppi <- rep(1/centers, centers)
# simulate a vector of labels from the multinoulli distribution
labels <- sample(c(1:centers), size = p, replace = TRUE, prob = ppi)

# specify the (mu, omega) of the MVN distribution of alpha
mu <- rep(0, centers)

off_diag <- 0.5
omega <- diag(rep(1, centers))
for (i in 1:centers) {
  for (j in 1:centers) {
    if (i != j) {
      omega[i, j] <- off_diag
    }
  }
}

# set up the generating functions of hlambda and hsigma
hparam_func <- list(
  lambda_func = function(p) stats::rnorm(p, 0, 1),
  sigma_func = function(p) stats::rchisq(p, 2) + 1
)

# set up the number of simulated data sets
size <- 1

# generate data
data_list <- hbcm::data_gen(n, p, centers, mu, omega, labels, size, hparam_func)
x <- data_list$x_list[[1]]
```

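To make the roles of `alpha`, `hlambda`, and `hsigma` concrete, the sketch below simulates a small matrix by hand in base R. It assumes the generative form `x[i, j] = hlambda[j] * alpha[i, labels[j]] + hsigma[j] * eps[i, j]` with standard normal noise `eps`. That form is an assumption made for illustration and may differ from what `hbcm::data_gen` does internally (for example, `hsigma` might act as a variance rather than a standard deviation). All `*_demo` names are hypothetical.

```r
# Illustrative sketch only: the generative form is an assumption,
# not taken from the hbcm source code.
set.seed(1)
n_demo <- 200
p_demo <- 12
k_demo <- 3

# column labels, constructed so that every group is represented
labels_demo <- rep(seq_len(k_demo), length.out = p_demo)

# alpha: n x K latent matrix whose rows are MVN(0, omega)
omega_demo <- matrix(0.5, k_demo, k_demo)
diag(omega_demo) <- 1
z <- matrix(rnorm(n_demo * k_demo), n_demo, k_demo)
alpha_demo <- z %*% chol(omega_demo)  # chol() gives R with t(R) %*% R = omega

# heterogeneous column-level parameters, one value per column
hlambda_demo <- rnorm(p_demo, 0, 1)
hsigma_demo <- rchisq(p_demo, 2) + 1

# each column scales its group's latent factor and adds its own noise
eps <- matrix(rnorm(n_demo * p_demo), n_demo, p_demo)
x_demo <- sweep(alpha_demo[, labels_demo], 2, hlambda_demo, "*") +
  sweep(eps, 2, hsigma_demo, "*")

dim(x_demo)
```

Under this assumed form, two columns in the same group share a latent factor, so their covariance depends on the product of their `hlambda` values, which is what makes the block structure heterogeneous.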
### 2. Cluster the columns into `K` groups
Use the heterogeneous block covariance model (HBCM) to cluster the columns of data `x`. You need to provide a starting label guess and the number of clusters.

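If `kernlab` is not available, a starting guess can also be produced with base R. The sketch below runs k-means on the rows of the absolute correlation matrix; it is a swapped-in alternative to spectral clustering, and it assumes that the `labels` argument of `heterogbcm` accepts any integer vector of length `P` with values in `1..K` (check the help page to confirm). The `*_demo` names and the random input are hypothetical.

```r
# Illustrative sketch: k-means starting labels instead of spectral clustering.
set.seed(1)
x_demo <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
k_demo <- 3

# treat each column's absolute correlation profile as its feature vector
aff <- abs(cor(x_demo))
km <- stats::kmeans(aff, centers = k_demo, nstart = 10)

# integer labels, one per column of x_demo
start_labels_demo <- as.integer(km$cluster)
table(start_labels_demo)
```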
```{r}
# check function arguments and return values documentation
?hbcm::heterogbcm
```

```{r}
# use spectral clustering to make a label guess
start_labels <- kernlab::specc(abs(cor(x)), centers = centers)@.Data

# use hbcm to perform clustering
hbcm_res <- hbcm::heterogbcm(x, centers = centers,
                             tol = 1e-3, iter = 100, iter_init = 3,
                             labels = start_labels,
                             verbose = FALSE)
```

### 3. Evaluate the clustering performance

Use the [Rand index](https://en.wikipedia.org/wiki/Rand_index) and the adjusted Rand index to compare the estimated label assignment with the true label assignment. For both metrics, a higher value means better agreement.

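For readers who want to see what the two metrics measure, here is a minimal base-R sketch that computes them from the contingency table of two labelings, using the standard pair-counting formulas. The helper `rand_indices` is hypothetical and is not part of `hbcm`; the names and format of the values returned by `hbcm::matchLabel` may differ.

```r
# Rand index and adjusted Rand index from the contingency table of two labelings.
rand_indices <- function(truth, est) {
  stopifnot(length(truth) == length(est))
  tab <- table(truth, est)
  choose2 <- function(v) v * (v - 1) / 2

  sum_cells <- sum(choose2(tab))           # pairs grouped together in both
  sum_rows <- sum(choose2(rowSums(tab)))   # pairs grouped together in truth
  sum_cols <- sum(choose2(colSums(tab)))   # pairs grouped together in est
  total <- choose2(length(truth))          # all pairs

  # agreements: together in both, or separated in both
  rand <- (total + 2 * sum_cells - sum_rows - sum_cols) / total

  # adjusted version: subtract the value expected under random labeling
  expected <- sum_rows * sum_cols / total
  max_index <- (sum_rows + sum_cols) / 2
  adj <- (sum_cells - expected) / (max_index - expected)

  c(rand = rand, adj_rand = adj)
}

truth_demo <- c(1, 1, 1, 2, 2, 2)
same_partition <- rand_indices(truth_demo, c(2, 2, 2, 1, 1, 1))  # relabeled only
other_partition <- rand_indices(truth_demo, c(1, 1, 2, 2, 3, 3))
same_partition
other_partition
```

Both metrics are invariant to how the groups are numbered, which is why the relabeled partition scores 1 on each. The adjusted index is undefined in this sketch when both labelings put everything in a single group, because the denominator is zero.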
```{r}
# check function arguments and return values documentation
?hbcm::matchLabel
```

```{r message=FALSE, warning=FALSE}
# evaluate clustering performance
library('dplyr')

specc_eval <- hbcm::matchLabel(labels, start_labels) %>%
  unlist() %>% round(3)
hbcm_eval <- hbcm::matchLabel(labels, hbcm_res$cluster) %>%
  unlist() %>% round(3)

# the result shows that the hbcm model is better than the
# spectral-clustering model in terms of Rand index
print(specc_eval)
print(hbcm_eval)
```

## Miscellaneous

### 1. Use cross-validation to select `K` when it's unknown

In practice the number of clusters is often unknown. In that case we recommend using cross-validation, with the adjusted Rand index as the criterion, to select `K`. The optimal `K` is the one with the highest adjusted Rand index.

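The selection rule itself is an argmax over the candidate values. The sketch below shows it on made-up scores (the numbers and the `*_demo` names are hypothetical, not results from `hbcm`). It also guards against failed fits: when `foreach` is run with `.errorhandling = 'pass'`, a task that errors returns its error object in place of a score.

```r
# Hypothetical cross-validation output: one entry per candidate K.
k_vec_demo <- 2:6
cv_demo <- list(0.41, 0.63, simpleError("fit failed"), 0.88, 0.79)

# keep numeric scores; mark failures as NA so positions still match k_vec_demo
scores_demo <- vapply(
  cv_demo,
  function(r) if (is.numeric(r) && length(r) == 1) r else NA_real_,
  numeric(1)
)

# which.max() discards NA values, so failed fits are never selected
best_k_demo <- k_vec_demo[which.max(scores_demo)]
best_k_demo
```

Keeping failures as `NA` rather than dropping them avoids a length mismatch between the candidate values and their scores when building a summary table.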
```{r}
# load the packages needed for parallel cross-validation and plotting
lapply(c("parallel", "foreach", "doParallel", "tidyverse"),
       require, character.only = TRUE)

registerDoParallel(detectCores())

kVec <- c(2:8)
cv_res <- foreach(K = kVec, .errorhandling = 'pass',
                  .packages = c("MASS", "Matrix", "matrixcalc", "kernlab", "RSpectra")) %dopar%
  hbcm::crossValid_func_adjR(x, centers = K, pt = 10)

# summary & plot
data.frame(kVec, unlist(cv_res)) %>%
  `colnames<-`(c("K", "adjR")) %>%
  ggplot() +
  geom_line(aes(x = K, y = adjR)) +
  xlab("K") +
  ylab("Average Adjusted Rand Index") +
  theme_bw() +
  theme(panel.grid.minor = element_blank()) +
  scale_x_continuous(breaks = kVec)
```

### 2. Use heatmap to display the group structure

```{r}
hbcm::colMat_heatMap(
affMatrix = cor(x), centers, labels = hbcm_res$cluster,
margin = 0.5, midpoint = 0, limit = c(-1,1), size = 0.2,
legendName = "Correlation", title = "HeatMap of correlation by groups")
```
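
The idea behind a grouped heatmap can be sketched in base R: sort the columns by cluster label so that each group forms a contiguous block of the correlation matrix. This is only an illustration of the general technique on made-up data; it does not reproduce the internals or styling of `hbcm::colMat_heatMap`, and the `*_demo` names are hypothetical.

```r
# Illustrative sketch: reorder a correlation matrix by group labels.
set.seed(1)
n_demo <- 100
labels_demo <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)

# three latent factors; each column is its group's factor plus noise
factors_demo <- matrix(rnorm(n_demo * 3), n_demo, 3)
x_demo <- factors_demo[, labels_demo] +
  matrix(rnorm(n_demo * 9, sd = 0.5), n_demo, 9)

# sort columns by label so groups become diagonal blocks
ord <- order(labels_demo)
cor_sorted <- cor(x_demo)[ord, ord]

# draw the heatmap only in an interactive session
if (interactive()) {
  image(seq_along(ord), seq_along(ord), cor_sorted, zlim = c(-1, 1),
        xlab = "column (sorted by group)", ylab = "column (sorted by group)")
}
```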