https://github.com/MIND-Lab/Constrained-RTM

Constrained Relational Topic Models that use potential functions to incorporate label knowledge in the form of document constraints

# Constrained Relational Topic Models


Implementation of Constrained Relational Topic Models (C-RTM), proposed in the paper ["Constrained Relational Topic Models"](https://doi.org/10.1016/j.ins.2019.09.039), accepted in Information Sciences, 2020.
C-RTM is a family of topic models that extends the well-known [Relational Topic Models (Chang, 2009)](#rtm). It models the structure of a document network and incorporates other types of relational information obtained from prior domain knowledge. This implementation extends the code from [Weiwei Yang](http://cs.umd.edu/~wwyang/)'s package.

## Execution of the program in Command Line


```
java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model lda --constrained true --vocab <vocab-file> --corpus <corpus-file> --trained-model <model-file>
```
- Required arguments
  - `--constrained true`: Must be set to `true` to enable the incorporation of prior-knowledge constraints.
  - `--vocab`: Vocabulary file. Each line contains a unique word.
  - `--corpus`: Corpus file in which documents are represented by word indexes and frequencies. Each line contains one document in the following format:

    ```
    <doc-len> <word-1>:<freq-1> <word-2>:<freq-2> ... <word-n>:<freq-n>
    ```

    `<doc-len>` is the total number of *tokens* in the document. `<word-i>` is the index of a word in the vocabulary file, starting from 0, and `<freq-i>` is its frequency in the document. Words with zero frequency can be omitted.
  - `--trained-model`: Trained model file in JSON format. Read and written by the program.
  - `--train-c-file`: File containing the document constraints. Each line contains one constraint made of three fields: the row-id of document-1, the row-id of document-2, and the constraint type, which must be `M` (must-constraint) or `C` (cannot-constraint).
- Optional arguments
  - `--model`: The topic model to use (default: [LDA](#lda_cmd)). Tested values (case-insensitive) are:
    - LDA: Constrained LDA.
    - RTM: Constrained Relational Topic Model.
    - Other models implemented by Weiwei Yang as extensions of LDA are already provided in the code and can also be used.
  - `--newfun`: Type of potential function used by the constrained model (default: `false`). If `true`, the normalized potential function is used; otherwise the potential function described in [SC-LDA](#sclda) is used.
  - `--lambda`: Strength parameter for the potential function described in [SC-LDA](#sclda). Only used when `--newfun false`.
  - `--no-verbose`: Stop printing the log to the console.
  - `--alpha`: Parameter of the Dirichlet prior of the document distribution over topics (default: 1.0). Must be a positive real number.
  - `--beta`: Parameter of the Dirichlet prior of the topic distribution over words (default: 0.1). Must be a positive real number.
  - `--topics`: Number of topics (default: 10). Must be a positive integer.
  - `--iters`: Number of iterations (default: 100). Must be a positive integer.
  - `--update`: Update alpha while sampling (default: false). It does not work well.
  - `--update-int`: Interval of updating alpha (default: 10). Must be a positive integer.
  - `--theta`: File for the document distribution over topics. Each line contains one document's topic distribution. Topic weights are separated by spaces.
  - `--output-topic`: File for showing topics.
  - `--topic-count`: File for document-topic counts.
  - `--top-word`: Number of words to give when showing topics (default: 10). Must be a positive integer.
  - `--burn-in`: Number of burn-in iterations (default: 0).
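
The vocabulary and corpus formats described above can be produced from tokenized text with a few lines of Python. The following is only a minimal sketch and is not part of this package. It assumes that each corpus line starts with the token count, followed by space-separated `index:frequency` pairs, and that word indexes are the 0-based line positions in the vocabulary file. The helper names `build_vocab` and `encode_document` are hypothetical.

```python
from collections import Counter


def build_vocab(docs):
    """Return the sorted unique words; a word's index is its 0-based position."""
    return sorted({word for doc in docs for word in doc})


def encode_document(tokens, word_to_index):
    """Encode one tokenized document as '<token count> <index>:<frequency> ...'.

    Assumption: fields are separated by single spaces.
    """
    counts = Counter(word_to_index[word] for word in tokens)
    pairs = " ".join(f"{index}:{freq}" for index, freq in sorted(counts.items()))
    # Words that do not occur in the document are simply left out.
    return f"{len(tokens)} {pairs}".rstrip()


# Toy example with two already-tokenized documents.
docs = [["topic", "model", "topic"], ["network", "model"]]
vocab = build_vocab(docs)
word_to_index = {word: i for i, word in enumerate(vocab)}

for doc in docs:
    print(encode_document(doc, word_to_index))
```

Writing `vocab` one word per line and the encoded documents one per line gives the two files expected by `--vocab` and `--corpus`.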

## Datasets


Three benchmark relational [datasets](http://www.cs.umd.edu/~sen/lbc-proj/LBC.html) are included in their respective folders. They are already preprocessed and ready to be used as input for the model.
Note that the file `labels.txt` can be used to create the must- and cannot-constraints. Two random documents can be drawn: if their labels are the same, a must-constraint may be added to the constraint file; otherwise a cannot-constraint may be added.
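
The sampling procedure described above could be sketched as follows. This is an illustrative sketch, not code from this repository. It assumes that `labels.txt` holds one label per line in the same order as the corpus, that row-ids are 0-based line positions, and that the three constraint fields are written in the order document-1, document-2, type, separated by a single space. Check these assumptions against the dataset folders before use. The function names are hypothetical.

```python
import random


def sample_constraints(labels, num_pairs, seed=0):
    """Draw distinct random document pairs and tag each one 'M' or 'C'.

    'M' (must-constraint) when both documents share a label,
    'C' (cannot-constraint) otherwise.
    """
    n = len(labels)
    if num_pairs > n * (n - 1) // 2:
        raise ValueError("more pairs requested than distinct document pairs exist")
    rng = random.Random(seed)
    seen = set()
    constraints = []
    while len(constraints) < num_pairs:
        i, j = sorted(rng.sample(range(n), 2))
        if (i, j) in seen:
            continue  # skip pairs that were already drawn
        seen.add((i, j))
        kind = "M" if labels[i] == labels[j] else "C"
        constraints.append((i, j, kind))
    return constraints


def format_constraint(constraint, sep=" "):
    """Render one constraint line; the separator is an assumption."""
    i, j, kind = constraint
    return f"{i}{sep}{j}{sep}{kind}"


# Toy labels standing in for the contents of labels.txt.
labels = ["a", "a", "b", "b"]
for constraint in sample_constraints(labels, num_pairs=3, seed=42):
    print(format_constraint(constraint))
```

The resulting lines can be written to a file and passed to the program with `--train-c-file`.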

## [References](#references)


### [SC-LDA](#sclda): Sparse Constrained LDA

Yang, Y., Downey, D., Boyd-Graber, J.: Efficient Methods for Incorporating Knowledge into Topic Models. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 308-317 (2015)

### [RTM](#rtm): Relational Topic Models

Chang, J., Blei, D.M.: Relational Topic Models for Document Networks. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS). pp. 81-88 (2009)

[Back to Top](#top)