https://github.com/bnosac/BTM

Biterm Topic Modelling for Short Text with R
https://github.com/bnosac/BTM

biterm-topic-modelling natural-language-processing r topic-modeling

Last synced: 2 months ago
JSON representation

Biterm Topic Modelling for Short Text with R

Host: GitHub
URL: https://github.com/bnosac/BTM
Owner: bnosac
License: apache-2.0
Created: 2018-12-06T13:47:19.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2023-02-11T14:26:49.000Z (over 2 years ago)
Last Synced: 2024-10-28T17:24:07.604Z (9 months ago)
Topics: biterm-topic-modelling, natural-language-processing, r, topic-modeling
Language: C++
Homepage:
Size: 178 KB
Stars: 95
Watchers: 8
Forks: 15
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

Awesome Lists containing this project

awesome-topic-models - R-BTM - R package wrapping the C++ code from BTM (Models / Topic Models for short documents)

README

        # BTM - Biterm Topic Modelling for Short Text with R

This is an R package wrapping the C++ code available at https://github.com/xiaohuiyan/BTM for constructing a **Biterm Topic Model (BTM)**. This model models word-word co-occurrences patterns (e.g., biterms). 

> Topic modelling using biterms is particularly good for finding topics in short texts (as occurs in short survey answers or twitter data).

### Installation

This R package is on CRAN, just install it with `install.packages('BTM')`

### What

The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrences patterns (e.g., biterms)

- A biterm consists of two words co-occurring in the same context, for example, in the same short text window. 

- BTM models the biterm occurrences in a corpus (unlike LDA models which model the word occurrences in a document). 

- It's a generative model. In the generation procedure, a biterm is generated by drawing two words independently from a same topic `z`. In other words, the distribution of a biterm `b=(wi,wj)` is defined as: `P(b) = sum_k{P(wi|z)*P(wj|z)*P(z)}` where k is the number of topics you want to extract.

- Estimation of the topic model is done with the Gibbs sampling algorithm. Where estimates are provided for `P(w|k)=phi` and `P(z)=theta`.

More detail can be referred to the following paper:

> Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013.

> https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf

![](tools/biterm-topic-model-example.png)

### Example

```

library(udpipe)

library(BTM)

data("brussels_reviews_anno", package = "udpipe")

## Taking only nouns of Dutch data

x <- subset(brussels_reviews_anno, language == "nl")

x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))

x <- x[, c("doc_id", "lemma")]

## Building the model

set.seed(321)

model  <- BTM(x, k = 3, beta = 0.01, iter = 1000, trace = 100)

## Inspect the model - topic frequency + conditional term probabilities

model$theta

[1] 0.3406998 0.2413721 0.4179281

topicterms <- terms(model, top_n = 10)

topicterms

[[1]]

         token probability

1  appartement  0.06168297

2      brussel  0.04057012

3        kamer  0.02372442

4      centrum  0.01550855

5      locatie  0.01547671

6         stad  0.01229227

7        buurt  0.01181460

8     verblijf  0.01155985

9         huis  0.01111402

10         dag  0.01041345

[[2]]

         token probability

1  appartement  0.05687312

2      brussel  0.01888307

3        buurt  0.01883812

4        kamer  0.01465696

5     verblijf  0.01339812

6     badkamer  0.01285862

7   slaapkamer  0.01276870

8          dag  0.01213928

9          bed  0.01195945

10        raam  0.01164474

[[3]]

         token probability

1  appartement 0.061804812

2      brussel 0.035873377

3      centrum 0.022193831

4         huis 0.020091282

5        buurt 0.019935537

6     verblijf 0.018611710

7     aanrader 0.014614272

8        kamer 0.011447470

9      locatie 0.010902365

10      keuken 0.009448751

scores <- predict(model, newdata = x)

```

**Make a specific topic called the background**

```

# If you set background to TRUE

# The first topic is set to a background topic that equals to the empirical word distribution. 

# This can be used to filter out common words.

set.seed(321)

model      <- BTM(x, k = 5, beta = 0.01, background = TRUE, iter = 1000, trace = 100)

topicterms <- terms(model, top_n = 5)

topicterms

```

### Visualisation of your model

- Can be done using the textplot package (https://github.com/bnosac/textplot), which can be found at CRAN as well (https://cran.r-project.org/package=textplot) 

- An example visualisation built on a model of all R packages from the Natural Language Processing and Machine Learning task views is shown above (see also https://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts)

```

library(textplot)

library(ggraph)

library(concaveman)

plot(model)

```

### Provide your own set of biterms

An interesting use case of this package is to 

- cluster based on parts of speech tags like nouns and adjectives which can be found in the text in the neighbourhood of one another

- cluster dependency relationships provided by NLP tools like udpipe (https://CRAN.R-project.org/package=udpipe)

This can be done by providing your own set of biterms to cluster upon. 

**Example clustering cooccurrences of nouns/adjectives**

```

library(data.table)

library(udpipe)

## Annotate text with parts of speech tags

data("brussels_reviews", package = "udpipe")

anno <- subset(brussels_reviews, language %in% "nl")

anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)

anno <- udpipe(anno, "dutch", trace = 10)

## Get cooccurrences of nouns / adjectives and proper nouns

biterms <- as.data.table(anno)

biterms <- biterms[, cooccurrence(x = lemma, 

                                  relevant = upos %in% c("NOUN", "PROPN", "ADJ"),

                                  skipgram = 2), 

                   by = list(doc_id)]

                   

## Build the model

set.seed(123456)

x     <- subset(anno, upos %in% c("NOUN", "PROPN", "ADJ"))

x     <- x[, c("doc_id", "lemma")]

model <- BTM(x, k = 5, beta = 0.01, iter = 2000, background = TRUE, 

             biterms = biterms, trace = 100)

topicterms <- terms(model, top_n = 5)

topicterms

```

**Example clustering dependency relationships**

```

library(udpipe)

library(tm)

library(data.table)

data("brussels_reviews", package = "udpipe")

exclude <- stopwords("nl")

## Do annotation on Dutch text

anno <- subset(brussels_reviews, language %in% "nl")

anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)

anno <- udpipe(anno, "dutch", trace = 10)

anno <- setDT(anno)

anno <- merge(anno, anno, 

              by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"), 

              by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"), 

              all.x = TRUE, all.y = FALSE, suffixes = c("", "_parent"), sort = FALSE)

## Specify a set of relationships you are interested in (e.g. objects of a verb)

anno$relevant <- anno$dep_rel %in% c("obj") & !is.na(anno$lemma_parent)

biterms <- subset(anno, relevant == TRUE)

biterms <- data.frame(doc_id = biterms$doc_id, 

                      term1 = biterms$lemma, 

                      term2 = biterms$lemma_parent,

                      cooc = 1, 

                      stringsAsFactors = FALSE)

biterms <- subset(biterms, !term1 %in% exclude & !term2 %in% exclude)

## Put in x only terms whch were used in the biterms object such that frequency stats of terms can be computed in BTM

anno <- anno[, keep := relevant | (token_id %in% head_token_id[relevant == TRUE]), by = list(doc_id, paragraph_id, sentence_id)]

x    <- subset(anno, keep == TRUE, select = c("doc_id", "lemma"))

x    <- subset(x, !lemma %in% exclude)

## Build the topic model

model <- BTM(data = x, 

             biterms = biterms, 

             k = 6, iter = 2000, background = FALSE, trace = 100)

topicterms <- terms(model, top_n = 5)

topicterms

```

## Support in text mining

Need support in text mining?

Contact BNOSAC: http://www.bnosac.be

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/bnosac/BTM

Awesome Lists containing this project

README