https://github.com/MIND-Lab/Constrained-RTM
Constrained Relational Topic Models that use potential functions to incorporate label knowledge in the form of document constraints
- Host: GitHub
- URL: https://github.com/MIND-Lab/Constrained-RTM
- Owner: MIND-Lab
- License: mit
- Created: 2019-04-19T15:05:33.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-07-01T20:39:19.000Z (over 2 years ago)
- Last Synced: 2024-08-03T18:21:08.166Z (4 months ago)
- Language: Java
- Homepage:
- Size: 291 MB
- Stars: 8
- Watchers: 4
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
- awesome-topic-models - Constrained-RTM - Java implementation of Constrained RTM [:page_facing_up:](https://doi.org/10.1016/j.ins.2019.09.039) (Models / Relational Topic Model (RTM))
README
# Constrained Relational Topic Models
Implementation of Constrained Relational Topic Models (C-RTM), proposed in the paper ["Constrained Relational Topic Models"](https://doi.org/10.1016/j.ins.2019.09.039), accepted in Information Sciences, 2020.
C-RTM is a family of topic models that extends the well-known [Relational Topic Models (Chang, 2009)](#rtm). It models the structure of a document network and incorporates other types of relational information obtained from prior domain knowledge. This implementation extends the code from [Weiwei Yang](http://cs.umd.edu/~wwyang/)'s package.
## Execution of the program in Command Line
```
java -cp YWWTools.jar:deps.jar yang.weiwei.Tools --tool lda --model lda --constrained true --vocab --corpus --trained-model
```
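When the command has to be run repeatedly (for example with different numbers of topics), it can help to assemble the argument list from a script. The following is a minimal sketch, not part of the repository: `build_command` and all file names are hypothetical, and it only uses flags that appear in this README, each followed by its value. The classpath uses `:` as separator, which applies to Unix-like systems; on Windows, Java expects `;`.

```python
def build_command(vocab, corpus, model_file, constraints,
                  model="lda", topics=10, iters=100,
                  classpath="YWWTools.jar:deps.jar"):
    """Return the argument list for one constrained training run.

    Hypothetical helper: it only strings together flags listed in this
    README and does not validate that the files exist.
    """
    return [
        "java", "-cp", classpath, "yang.weiwei.Tools",
        "--tool", "lda",
        "--model", model,
        "--constrained", "true",        # required to enable constraints
        "--vocab", vocab,
        "--corpus", corpus,
        "--trained-model", model_file,
        "--train-c-file", constraints,
        "--topics", str(topics),
        "--iters", str(iters),
    ]


# File names below are placeholders for illustration only.
cmd = build_command("vocab.txt", "corpus.txt", "model.json",
                    "constraints.txt", topics=20)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the two jar files.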
- Required arguments
- `--constrained true`: Must be set to `true` to enable the incorporation of prior-knowledge constraints.
- `--vocab`: Vocabulary file. Each line contains a unique word.
- `--corpus`: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format
```
: : ... :
```
Each document line carries the total number of *tokens* in the document together with `index:frequency` pairs, where the index of a word is its position in the vocabulary file, starting from 0. Words with zero frequency can be omitted.
- `--trained-model`: Trained model file in JSON format. Read and written by the program.
- `--train-c-file `: File containing the document constraints. Each line contains a constraint in the following format
```
```
Each constraint line names the row id of document 1, the row id of document 2, and the constraint type, which must be `M` (must-constraint) or `C` (cannot-constraint).
- Optional arguments
- `--model `: The topic model you want to use (default: [LDA](#lda_cmd)). Tested values (case-insensitive) are
- LDA: Constrained LDA.
- RTM: Constrained Relational Topic Model.
- Other models implemented by Weiwei Yang as extensions of LDA are already provided in the code and can be used as well.
- `--newfun `: Type of potential function used by the constrained model (default: `false`). If `true`, the potential function is normalized; otherwise, the potential function described in [SC-LDA](#sclda) is used.
- `--lambda `: Strength parameter for the potential function described in [SC-LDA](#sclda). It only takes effect with `--newfun false`.
- `--no-verbose`: Stop printing log to console.
- `--alpha `: Parameter of Dirichlet prior of document distribution over topics (default: 1.0). Must be a positive real number.
- `--beta `: Parameter of Dirichlet prior of topic distribution over words (default: 0.1). Must be a positive real number.
- `--topics `: Number of topics (default: 10). Must be a positive integer.
- `--iters `: Number of iterations (default: 100). Must be a positive integer.
- `--update`: Update alpha while sampling (default: false). It does not work well.
- `--update-int `: Interval of updating alpha (default: 10). Must be a positive integer.
- `--theta `: File for document distribution over topics. Each line contains a document's topic distribution. Topic weights are separated by space.
- `--output-topic `: File for showing topics.
- `--topic-count `: File for document-topic counts.
- `--top-word `: Number of words to give when showing topics (default: 10). Must be a positive integer.
- `--burn-in `: Number of burn-in iterations (default: 0).
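The vocabulary and corpus formats described in the required arguments can be produced from tokenized text with a short script. The sketch below is an illustration under stated assumptions, not code from the repository: the helper names are hypothetical, and it assumes that each corpus line starts with the token count, followed by space-separated `index:frequency` pairs. Check a line of one of the bundled datasets before relying on this layout.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct word to an index, in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for word in tokens:
            vocab.setdefault(word, len(vocab))
    return vocab


def vocab_lines(vocab):
    """Words ordered by index: line i of the vocabulary file is word i (0-based)."""
    return [word for word, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


def corpus_line(tokens, vocab):
    """Encode one document as '<token count> idx:freq idx:freq ...'.

    Assumed layout (count first, pairs separated by spaces). Words with
    zero frequency never appear in the Counter, so they are omitted.
    """
    counts = Counter(vocab[w] for w in tokens if w in vocab)
    total = sum(counts.values())
    pairs = " ".join(f"{idx}:{freq}" for idx, freq in sorted(counts.items()))
    return f"{total} {pairs}".rstrip()


docs = [["topic", "model", "topic"], ["network"]]
vocab = build_vocab(docs)
print("\n".join(vocab_lines(vocab)))
print("\n".join(corpus_line(d, vocab) for d in docs))
```

Writing the two printed blocks to separate files gives inputs suitable for `--vocab` and `--corpus`, provided the assumed layout matches the one the tool expects.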
## Datasets
Three benchmark relational [datasets](http://www.cs.umd.edu/~sen/lbc-proj/LBC.html) are included in their respective folders. They are already preprocessed and ready to be used as input for the model.
Note that the file `labels.txt` can be used to create the must- and cannot-constraints: draw two random documents, and if their labels are the same, add a must-constraint to the constraints file (`--train-c-file`); otherwise, add a cannot-constraint.
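The sampling procedure just described can be sketched as follows. This is an illustrative script, not part of the repository. It assumes that `labels.txt` holds one label per line, that a document's row id is its 0-based line position, and that the three fields of a constraint line are separated by a single space; all three are assumptions to verify against the implementation. The function names are hypothetical.

```python
import random


def sample_constraints(labels, n_pairs, seed=0):
    """Sample distinct document pairs; same label gives 'M', otherwise 'C'."""
    if len(labels) < 2:
        raise ValueError("need at least two documents")
    rng = random.Random(seed)
    # Never ask for more pairs than exist, so the loop below terminates.
    max_pairs = len(labels) * (len(labels) - 1) // 2
    n_pairs = min(n_pairs, max_pairs)
    seen = set()
    constraints = []
    while len(constraints) < n_pairs:
        i, j = rng.sample(range(len(labels)), 2)  # two different row ids
        pair = (min(i, j), max(i, j))
        if pair in seen:
            continue  # skip pairs that were already drawn
        seen.add(pair)
        kind = "M" if labels[pair[0]] == labels[pair[1]] else "C"
        constraints.append((pair[0], pair[1], kind))
    return constraints


def format_constraints(constraints, sep=" "):
    """One constraint per line; the separator is an assumption."""
    return "\n".join(f"{a}{sep}{b}{sep}{kind}" for a, b, kind in constraints)


labels = ["ml", "ml", "db", "db", "ir"]  # stand-in for the lines of labels.txt
print(format_constraints(sample_constraints(labels, 4, seed=42)))
```

Fixing the seed keeps the sampled constraints reproducible across runs, which makes comparisons between different `--lambda` or `--newfun` settings easier to interpret.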
## [References](#references)
### [SC-LDA](#sclda): Sparse Constrained LDA
Yang, Y., Downey, D., Boyd-Graber, J.: Efficient Methods for Incorporating Knowledge into Topic Models. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 308-317 (2015)
### [RTM](#rtm): Relational Topic Models
Chang, J., Blei, D. M.: Relational Topic Models for Document Networks. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS). pp. 81-88 (2009)
[Back to Top](#top)