https://github.com/mhahsler/remm
clustering data-stream sequence-analysis
- Host: GitHub
- URL: https://github.com/mhahsler/remm
- Owner: mhahsler
- Created: 2021-10-26T16:49:29.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-08-27T22:24:16.000Z (over 1 year ago)
- Last Synced: 2025-04-09T22:51:07.708Z (10 months ago)
- Topics: clustering, data-stream, sequence-analysis
- Language: R
- Homepage:
- Size: 421 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
README
---
output: github_document
---
```{r echo=FALSE, results = 'asis'}
pkg <- 'rEMM'
source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg)
```
Implements TRACDS (Temporal Relationships
between Clusters for Data Streams), a generalization of
the Extensible Markov Model (EMM),
to model transition probabilities in sequence data. TRACDS adds a temporal or order model
to data stream clustering by superimposing a dynamically adapting
Markov chain. The package also provides an implementation of EMM (TRACDS on top of tNN
data stream clustering).
Interface classes `DSC_tNN` and `DSC_EMM` for the [stream package](https://github.com/mhahsler/stream) are provided.
```{r echo=FALSE, results = 'asis'}
pkg_citation(pkg, 2L)
pkg_install(pkg)
```
## Usage
We use an artificial dataset generated from a mixture of four cluster components. Points are generated by following a fixed sequence
<1,2,1,3,4> through the four clusters. The gray lines in the plot below connect consecutive points and so indicate the sequence.
```{r example_data}
library(rEMM)
data("EMMsim")
plot(EMMsim_train, pch = NA)
lines(EMMsim_train, col = "gray")
points(EMMsim_train, pch = EMMsim_sequence_train)
```
EMM recovers the components and the sequence information. We first build an EMM and then recluster the resulting structure, assuming
that we know that there are four components. The graph below represents the Markov model of the recovered sequence.
```{r example_model}
emm <- EMM(threshold = 0.1, measure = "euclidean")
build(emm, EMMsim_train)
emmc <- recluster_hclust(emm, k = 4, method = "average")
plot(emmc)
```
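To make the idea of a Markov model over clusters more concrete, here is a minimal base R sketch that estimates transition probabilities from a sequence of cluster labels. It does not use rEMM and is not how the package is implemented. The names `labels`, `counts`, and `P` are illustrative, and the input is an idealized repetition of the pattern <1,2,1,3,4>, not the `EMMsim` data.

```r
# Idealized label sequence: the pattern 1,2,1,3,4 repeated 20 times
# (an assumption for illustration, not the EMMsim data).
labels <- rep(c(1, 2, 1, 3, 4), times = 20)

# Pair each state with its successor.
from <- factor(head(labels, -1), levels = 1:4)
to   <- factor(tail(labels, -1), levels = 1:4)

# Count transitions: rows are the current state, columns the next state.
counts <- table(from, to)

# Row-normalize the counts to obtain estimated transition probabilities.
P <- prop.table(counts, margin = 1)
round(P, 2)
```

In this idealized sequence, state 1 is followed equally often by states 2 and 3, while states 2, 3, and 4 each have a single successor.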
We can now score new sequences by calculating the product of the transition probabilities in the model. Here we use a test sequence created in the same way as the training data. The high score indicates that the test sequence follows the same structure as the training data.
```{r}
score(emmc, EMMsim_test)
```
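The idea behind a product-based score can be sketched in a few lines of base R. This is a simplified illustration, not the code behind `score()`: it assumes the states of the new sequence are already known, whereas `score()` works on data points and may also normalize the result. The matrix `P` and the helper `score_product()` are hypothetical.

```r
# Hypothetical transition matrix for four states
# (rows: current state, columns: next state).
P <- matrix(0, nrow = 4, ncol = 4)
P[1, 2] <- 0.5
P[1, 3] <- 0.5
P[2, 1] <- 1
P[3, 4] <- 1
P[4, 1] <- 1

# Multiply the transition probabilities along a sequence of states.
score_product <- function(P, states) {
  from <- head(states, -1)
  to <- tail(states, -1)
  # Matrix indexing with a two-column matrix picks P[from[i], to[i]].
  prod(P[cbind(from, to)])
}

score_product(P, c(1, 2, 1, 3, 4, 1))  # follows the modeled pattern: 0.25
score_product(P, c(1, 4, 2, 3))        # uses unseen transitions: 0
```

A sequence that only uses transitions present in the model gets a positive score, while a single transition with probability zero drives the raw product to zero.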
# References
* Michael Hahsler and Margaret H. Dunham.
[rEMM: Extensible Markov model for data stream clustering in R.](http://dx.doi.org/10.18637/jss.v035.i05)
_Journal of Statistical Software,_ 35(5):1-31, 2010.
* Michael Hahsler and Margaret H. Dunham.
[Temporal structure learning for clustering massive data
streams in real-time](https://doi.org/10.1137/1.9781611972818.57).
In _SIAM Conference on Data Mining (SDM11),_ pages 664-675. SIAM, April 2011.
# Acknowledgements
Development of this package was supported in part by NSF IIS-0948893 and R21HG005912 from
the National Human Genome Research Institute.