https://github.com/till-tietz/gsdmm
GSDMM Short Text Clustering via Dirichlet Mixture Models
cpp r rcpp text-analytics text-clustering
- Host: GitHub
- URL: https://github.com/till-tietz/gsdmm
- Owner: till-tietz
- License: other
- Created: 2023-07-25T15:02:01.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-07-26T20:53:29.000Z (over 1 year ago)
- Last Synced: 2023-07-26T21:55:58.627Z (over 1 year ago)
- Topics: cpp, r, rcpp, text-analytics, text-clustering
- Language: C++
- Homepage:
- Size: 1.26 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# gsdmm
`gsdmm` implements short text clustering via the Dirichlet Mixture Model approach proposed by [Yin and Wang (2014)](https://www.semanticscholar.org/paper/A-dirichlet-multinomial-mixture-model-based-for-Yin-Wang/d03ca28403da15e75bc3e90c21eab44031257e80?p2df). It provides a fast C++ implementation of the Gibbs sampler described in the paper, together with an R interface. Specifically, `gsdmm` implements the likelihood function that allows for multiple occurrences of the same word in a given text (Equation 4 in the paper).
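
For reference, the conditional distribution that the sampler draws from can be written as below. This is my transcription of Equation 4 using the paper's notation, so please check it against the original before relying on it. Here $m_{z,\neg d}$ is the number of texts in cluster $z$, $n_{z,\neg d}^{w}$ the number of occurrences of word $w$ in cluster $z$, and $n_{z,\neg d}$ the total number of words in cluster $z$, all counted without text $d$. $N_d$ is the length of text $d$, $N_d^{w}$ the number of occurrences of $w$ in $d$, $D$ the number of texts, $K$ the upper bound on the number of clusters, and $V$ the vocabulary size.

$$
p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto
\frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha}
\cdot
\frac{\prod_{w \in d} \prod_{j=1}^{N_d^{w}} \left( n_{z,\neg d}^{w} + \beta + j - 1 \right)}
     {\prod_{i=1}^{N_d} \left( n_{z,\neg d} + V\beta + i - 1 \right)}
$$

The inner product over $j$ is what accounts for a word appearing more than once in the same text.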
**Benefits:**

- very space and time efficient
- unlike LDA it requires only an upper bound on the number of clusters

**Development:**

- I am planning to add a tuning function for the alpha and beta parameters of the Gibbs sampler
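
To show where alpha and beta enter the Gibbs sampler, here is a minimal standalone C++ sketch of the unnormalised cluster weight for a single text, following my reading of Equation 4 in Yin and Wang (2014). It is not the code used in this package: `ClusterStats`, `log_cluster_weight` and all parameter names are hypothetical and chosen only for illustration. The sketch works in log space to avoid underflow on longer texts.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-cluster summary; these names are illustrative and are
// not taken from the gsdmm source.
struct ClusterStats {
    int n_docs = 0;                                  // texts currently in the cluster
    int n_words = 0;                                 // total word tokens in the cluster
    std::unordered_map<std::string, int> word_freq;  // per-word token counts
};

// Log of the unnormalised probability of assigning `doc` to `cluster`.
// The cluster statistics must already exclude `doc` itself.
double log_cluster_weight(const std::vector<std::string>& doc,
                          const ClusterStats& cluster,
                          int n_total_docs,  // D
                          int n_clusters,    // K (upper bound)
                          int vocab_size,    // V
                          double alpha, double beta) {
    // Cluster size term: (m_z + alpha) / (D - 1 + K * alpha).
    double log_w = std::log(cluster.n_docs + alpha) -
                   std::log(n_total_docs - 1 + n_clusters * alpha);

    // Numerator: the j-th occurrence of a word within the text contributes
    // (n_z^w + beta + j - 1), which handles repeated words.
    std::unordered_map<std::string, int> seen;
    for (const std::string& w : doc) {
        auto it = cluster.word_freq.find(w);
        const int n_zw = (it == cluster.word_freq.end()) ? 0 : it->second;
        const int j = ++seen[w];
        log_w += std::log(n_zw + beta + j - 1);
    }

    // Denominator: the i-th token of the text contributes
    // (n_z + V * beta + i - 1).
    for (std::size_t i = 1; i <= doc.size(); ++i) {
        log_w -= std::log(cluster.n_words + vocab_size * beta +
                          static_cast<double>(i) - 1.0);
    }
    return log_w;
}
```

A sampler would evaluate this for every cluster, subtract the maximum, exponentiate, normalise, and draw the new cluster label from the resulting probabilities. A larger alpha makes empty clusters more attractive, while a smaller beta rewards texts that share words with a cluster.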
## Installation
You can install the development version of gsdmm from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("till-tietz/gsdmm")
```

## Usage
Here is a minimal working example.
```{r, message = FALSE, warning = FALSE}
# Lowercase, strip punctuation, lemmatize, tokenize, and drop stop words,
# producing a list of character vectors (one per text).
text <- c(
  "Rockets are amazing.",
  "Witnessing a rocket in flight is a marvel of engineering.",
  "We should take a rocket to Mars.",
  "Rocket",
  "Have you ever seen a cat?",
  "Cats are fun.",
  "Your cat seems sweet.",
  "Cat"
) |>
  tolower() |>
  gsub(pattern = '[[:punct:] ]+', replacement = ' ') |>
  textstem::lemmatize_strings() |>
  text2vec::word_tokenizer() |>
  lapply(function(i) i[!i %in% stopwords::stopwords()])

# Cluster the texts; n_clust is an upper bound on the number of clusters.
gsdmm::gsdmm(texts = text, n_iter = 100, n_clust = 20, alpha = 0.1, beta = 0.2)
```