https://github.com/till-tietz/gsdmm

GSDMM Short Text Clustering via Dirichlet Mixture Models
cpp r rcpp text-analytics text-clustering

Host: GitHub
URL: https://github.com/till-tietz/gsdmm
Owner: till-tietz
License: other
Created: 2023-07-25T15:02:01.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2025-03-15T15:56:58.000Z (over 1 year ago)
Last Synced: 2025-03-15T16:27:19.258Z (over 1 year ago)
Topics: cpp, r, rcpp, text-analytics, text-clustering
Language: C++
Homepage:
Size: 1.28 MB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
pkgload::load_all()
```

# gsdmm

[![gsdmm status badge](https://till-tietz.r-universe.dev/badges/gsdmm)](https://till-tietz.r-universe.dev/gsdmm)

---

`gsdmm` implements the short text clustering method based on Dirichlet Mixture Models proposed by [Yin and Wang 2014](https://www.semanticscholar.org/paper/A-dirichlet-multinomial-mixture-model-based-for-Yin-Wang/d03ca28403da15e75bc3e90c21eab44031257e80?p2df). It provides a fast C++ implementation of the Gibbs sampler described in the paper, together with an R interface. Specifically, `gsdmm` implements the likelihood function that allows for multiple occurrences of the same word in a given text (Equation 4 in the paper).
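
To make the role of Equation 4 concrete, here is a minimal pure-R sketch of the conditional cluster probability for a single text, as I read it from the paper. It is an illustration only and not the package's C++ code: the function `cluster_log_prob()` and all of its argument names are hypothetical, and the toy counts are made up for the example.

```r
# Hypothetical helper, NOT part of the gsdmm API: unnormalised log-probability
# of assigning one text to each cluster, following Equation 4 of Yin and Wang (2014).
#   doc   character vector of tokens for the text being (re)assigned
#   m     number of texts currently in each cluster (this text excluded)
#   n     number of tokens currently in each cluster (this text excluded)
#   nw    K x V matrix of per-cluster word counts, columns named by word
#   D     total number of texts, including this one
cluster_log_prob <- function(doc, m, n, nw, alpha, beta, D) {
  K <- length(m)
  V <- ncol(nw)
  counts <- table(doc)   # occurrences of each distinct word in the text
  N_d <- length(doc)     # text length in tokens
  vapply(seq_len(K), function(z) {
    # cluster popularity term
    lp <- log(m[z] + alpha) - log(D - 1 + K * alpha)
    # word term: a word occurring several times contributes one factor per occurrence
    for (w in names(counts)) {
      j <- seq_len(counts[[w]])
      lp <- lp + sum(log(nw[z, w] + beta + j - 1))
    }
    # normalising term over the text length
    i <- seq_len(N_d)
    lp - sum(log(n[z] + V * beta + i - 1))
  }, numeric(1))
}

# Toy state: cluster 1 is about rockets, cluster 2 about cats (made-up counts).
nw <- matrix(c(3, 0, 1,
               0, 4, 0),
             nrow = 2, byrow = TRUE,
             dimnames = list(NULL, c("rocket", "cat", "mars")))
m <- c(2, 2)
n <- rowSums(nw)

lp <- cluster_log_prob(c("rocket", "rocket", "mars"), m, n, nw,
                       alpha = 0.1, beta = 0.01, D = 5)
p <- exp(lp - max(lp))
p <- p / sum(p)          # normalised probabilities used to sample the new cluster
```

Working on the log scale and subtracting `max(lp)` before exponentiating avoids numerical underflow for longer texts. A Gibbs sweep would repeat this for every text, removing the text's counts from its current cluster first and adding them to the sampled one afterwards.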

**Benefits:** \

- efficient in both memory use and run time
- unlike LDA, it requires only an upper bound on the number of clusters, not the exact number

**Development:** \

- I am planning to add a tuning function for the alpha and beta parameters of the Gibbs sampler

## Installation

You can install the development version of gsdmm from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("till-tietz/gsdmm")
```

## Usage

Here is a minimal working example.

```{r, message = FALSE, warning = FALSE}
# lemmatize and tokenize, creating a list of character vectors (one per text)
text <- c(
  "Rockets are amazing.",
  "Witnessing a rocket in flight is a marvel of engineering.",
  "We should take a rocket to Mars.",
  "Rocket",
  "Have you ever seen a cat?",
  "Cats are fun.",
  "Your cat seems sweet.",
  "Cat"
) |>
  tolower() |>
  gsub(pattern = "[[:punct:] ]+", replacement = " ") |>
  textstem::lemmatize_strings() |>
  text2vec::word_tokenizer() |>
  lapply(function(i) i[!i %in% stopwords::stopwords()])

# fix the seed so the Gibbs sampler gives reproducible cluster assignments
set.seed(42)

gsdmm::gsdmm(texts = text, n_iter = 100, n_clust = 5, alpha = 0.1, beta = 0.01, progress = FALSE)
```
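
The `texts` argument above is a list with one character vector of tokens per text. If you would rather not install the lemmatizer and tokenizer packages, a rough base-R sketch of the same input shape could look like the following. Note that it skips lemmatization entirely, and that `stop_words` is a tiny made-up list for illustration, not the list returned by `stopwords::stopwords()`.

```r
# Base-R preprocessing sketch (assumption: no lemmatization, hand-picked stop words)
raw <- c(
  "Rockets are amazing.",
  "Have you ever seen a cat?"
)

# hypothetical, deliberately tiny stop word list
stop_words <- c("a", "are", "have", "you", "ever")

tokens <- raw |>
  tolower() |>
  gsub(pattern = "[[:punct:] ]+", replacement = " ") |>  # punctuation and spaces -> one space
  trimws() |>                                            # drop the trailing space left by "." or "?"
  strsplit(split = " ", fixed = TRUE) |>                 # one character vector per text
  lapply(function(i) i[!i %in% stop_words])              # remove stop words

str(tokens)
```

The resulting `tokens` list has the same structure as `text` in the example above, so it should be usable as the `texts` argument. Without lemmatization, "rockets" and "rocket" remain distinct words, which will generally make clusters less clean on very short texts.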
