https://github.com/ywwbill/YWWTools-v2

Host: GitHub
URL: https://github.com/ywwbill/YWWTools-v2
Owner: ywwbill
License: mit
Created: 2020-01-07T15:59:41.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-09-05T08:08:11.000Z (almost 6 years ago)
Last Synced: 2024-08-03T18:21:11.072Z (almost 2 years ago)
Language: Java
Size: 18.1 MB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# YWW Tools


A package of my ([Weiwei Yang](http://cs.umd.edu/~wwyang/)'s) various tools (mostly for NLP). Feel free to email me with any questions.

* [Check Out](#check_out)

* [Dependencies](#dependencies)

* [Use YWW Tools in Command Line](#command)

* [LDA (Latent Dirichlet Allocation) in Command Line](#lda_cmd)

	* [RTM: Relational Topic Model](#rtm_cmd)

		* [Lex-WSB-RTM: RTM with Lexical Weights and Weighted Stochastic Block Priors](#lex_wsb_rtm_cmd)

		* [Lex-WSB-Med-RTM: Lex-WSB-RTM with Hinge Loss](#lex_wsb_med_rtm_cmd)

	* [SLDA: Supervised LDA](#slda_cmd)

		* [BS-LDA: Binary SLDA](#bs_lda_cmd)

		* [Lex-WSB-BS-LDA: BS-LDA with Lexical Weights and Weighted Stochastic Block Priors](#lex_wsb_bs_lda_cmd)

		* [Lex-WSB-Med-LDA: Lex-WSB-BS-LDA with Hinge Loss](#lex_wsb_med_lda_cmd)

	* [BP-LDA: LDA with Block Priors](#bp_lda_cmd)

	* [ST-LDA: Single Topic LDA](#st_lda_cmd)

	* [WSB-TM: Weighted Stochastic Block Topic Model](#wsb_tm_cmd)

* [tLDA in Command Line](#tlda_cmd)

* [MTM in Command Line](#mtm_cmd)

* [Other Tools in Command Line](#other_cmd)

	* [WSBM: Weighted Stochastic Block Model](#wsbm_cmd)

	* [SCC: Strongly Connected Components](#scc_cmd)

	* [Stoplist](#stoplist_cmd)

	* [Lemmatizer](#lemmatizer_cmd)

	* [POS Tagger](#pos_tagger_cmd)

	* [Stemmer](#stemmer_cmd)

	* [Tokenizer](#tokenizer_cmd)

	* [Corpus Converter](#corpus_converter_cmd)

	* [Tree Builder](#tree_builder_cmd)

* [Use YWW Tools Source Code](#code_examples)

* [LDA Code Examples](#lda_code)

	* [RTM](#rtm_code)

		* [Lex-WSB-RTM](#lex_wsb_rtm_code)

		* [Lex-WSB-Med-RTM](#lex_wsb_med_rtm_code)

	* [SLDA](#slda_code)

		* [BS-LDA](#bs_lda_code)

		* [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_code)

		* [Lex-WSB-Med-LDA](#lex_wsb_med_lda_code)

	* [BP-LDA](#bp_lda_code)

	* [ST-LDA](#st_lda_code)

	* [WSB-TM](#wsb_tm_code)

* [tLDA Code Examples](#tlda_code)

* [MTM Code Examples](#mtm_code)

* [Other Code Examples](#other_code)

	* [WSBM](#wsbm_code)

	* [SCC](#scc_code)

	* [Tree Builder](#tree_builder_code)

	* [English Corpus Preprocessing](#preprocess)

* [Citation](#citation)

* [References](#ref)

## Check Out


```

git clone git@github.com:ywwbill/YWWTools-v2.git

```

## Dependencies


- Java 8.

- Files in `lib/`.

- Files in `dict/`.

## Use YWW Tools in Command Line


```

java -cp YWWTools-v2.jar:lib/* yang.weiwei.Tools <config-file>

```

- **Windows users**

	- Please replace `YWWTools-v2.jar:lib/*` with `YWWTools-v2.jar;lib/*`.

	- If you encounter any encoding problems in the command line (especially when processing Chinese), please add `-Dfile.encoding=utf8` to your command.

- In the configuration file passed to the command, specify the tool you want to use:

	```

	tool=

	```

- Supported tool names (case-insensitive) include

	- [LDA](#lda_cmd): Latent Dirichlet allocation. Includes a variety of extensions.

	- [TLDA](#tlda_cmd): Tree LDA.

	- [MTM](#mtm_cmd): Multilingual Topic Model.

	- [WSBM](#wsbm_cmd): Weighted stochastic block model. Find blocks in a network.

	- [SCC](#scc_cmd): Strongly connected components.

	- [Stoplist](#stoplist_cmd): Remove stop words. Supports English only, but can support other languages given a dictionary.

	- [Lemmatizer](#lemmatizer_cmd): Lemmatize a POS-tagged corpus. Supports English only, but can support other languages given a dictionary.

	- [POS-Tagger](#pos_tagger_cmd): Tag words' POS. Supports English only, but can support other languages given trained models.

	- [Stemmer](#stemmer_cmd): Stem words. Supports English only.

	- [Tokenizer](#tokenizer_cmd): Tokenize a corpus. Supports English only, but can support other languages given trained models.

	- [Corpus-Converter](#corpus_converter_cmd): Convert a word corpus into an indexed corpus (for [LDA](#lda_cmd)) and vice versa.

	- [Tree-Builder](#tree_builder_cmd): Build tree priors from word associations.

- You can always set `help` to true to see help information for

	- supported tool names if you don't specify a tool name:

		```

		help=true

		```

	- a specific tool if you specify it (take [LDA](#lda_cmd) as an example): 

		```

		help=true

		tool=lda

		```
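
The platform difference in the classpath separator described above can be handled programmatically. The following is a minimal sketch, not part of YWW Tools: the class `LaunchCommand`, its `build` method, and the file name `lda.cfg` are hypothetical, and it assumes the tool settings live in a configuration file passed as the single program argument.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

class LaunchCommand {
    // Builds the argument list for launching yang.weiwei.Tools.
    // pathSeparator is ":" on Linux/macOS and ";" on Windows.
    static List<String> build(String configFile, String pathSeparator) {
        String classpath = "YWWTools-v2.jar" + pathSeparator + "lib/*";
        return Arrays.asList(
                "java",
                "-Dfile.encoding=utf8", // suggested above for encoding problems
                "-cp", classpath,
                "yang.weiwei.Tools",
                configFile);            // hypothetical config file path
    }

    public static void main(String[] args) {
        // File.pathSeparator is the separator for the current platform.
        List<String> command = build("lda.cfg", File.pathSeparator);
        System.out.println(String.join(" ", command));
    }
}
```

The returned list can be handed to `java.lang.ProcessBuilder` if you want to start the tool from another Java program.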

## LDA (Latent Dirichlet Allocation) in Command Line


```

tool=lda

model=lda

vocab=

corpus=

trained_model=

```

- Implementation of [Blei et al. (2003)](#lda_ref).

- Required arguments

	- `vocab`: Vocabulary file. Each line contains a unique word.

	- `corpus`: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format:

		```

		<doc-len> <index-1>:<freq-1> <index-2>:<freq-2> ... <index-n>:<freq-n>

		```

		The first field is the total number of *tokens* in the document. Each `<index-i>:<freq-i>` pair gives the index of a word in the vocabulary file (starting from 0) and that word's frequency in the document. Words with zero frequency can be omitted.

	- `trained_model`: Trained model file in JSON format. Read and written by the program.

- Optional arguments

	- `model=`: The topic model you want to use (default: [LDA](#lda_cmd)). Supported model names (case-insensitive) are

		- [LDA](#lda_cmd): Vanilla LDA.

		- [RTM](#rtm_cmd): Relational topic model.

			- [Lex-WSB-RTM](#lex_wsb_rtm_cmd): [RTM](#rtm_cmd) with WSB-computed block priors and lexical weights.

			- [Lex-WSB-Med-RTM](#lex_wsb_med_rtm_cmd): [Lex-WSB-RTM](#lex_wsb_rtm_cmd) with hinge loss.

		- [SLDA](#slda_cmd): Supervised [LDA](#lda_cmd). Supports multi-class classification.

			- [BS-LDA](#bs_lda_cmd): Binary [SLDA](#slda_cmd).

			- [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_cmd): [BS-LDA](#bs_lda_cmd) with WSB-computed block priors and lexical weights.

			- [Lex-WSB-Med-LDA](#lex_wsb_med_lda_cmd): [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_cmd) with hinge loss.

		- [BP-LDA](#bp_lda_cmd): [LDA](#lda_cmd) with block priors. Blocks are pre-computed.

		- [ST-LDA](#st_lda_cmd): Single topic [LDA](#lda_cmd). Each document can only be assigned to one topic.

		- [WSB-TM](#wsb_tm_cmd): [LDA](#lda_cmd) with block priors. Blocks are computed by [WSBM](#wsbm_cmd).

	- `test=true`: Use the model for testing (default: false).

	- `verbose=true`: Print log to console (default: true).

	- `alpha=`: Parameter of Dirichlet prior of document distribution over topics (default: 1.0). Must be a positive real number.

	- `beta=`: Parameter of Dirichlet prior of topic distribution over words (default: 0.1). Must be a positive real number.

	- `topics=`: Number of topics (default: 10). Must be a positive integer.

	- `iters=`: Number of iterations (default: 100). Must be a positive integer.

	- `update=false`: Update alpha while sampling (default: false).

	- `update_interval=`: Interval of updating alpha (default: 10). Must be a positive integer.

	- `theta=`: File for document distribution over topics. Each line contains a document's topic distribution. Topic weights are separated by space.

	- `output_topic=`: File for showing topics.

	- `topic_count=`: File for document-topic counts.

	- `top_word=`: Number of words to give when showing topics (default: 10). Must be a positive integer.
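
The indexed corpus format described above (a document length followed by `index:frequency` pairs) can be read with a few lines of Java. This is a minimal sketch under stated assumptions, not the parser used inside YWW Tools: the class `IndexedDocument` is hypothetical, and the check that frequencies sum to the document length is an assumption based on the description of the first field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IndexedDocument {
    final int length;                   // total number of tokens (first field)
    final Map<Integer, Integer> counts; // word index -> frequency

    IndexedDocument(int length, Map<Integer, Integer> counts) {
        this.length = length;
        this.counts = counts;
    }

    // Parses one corpus line such as "5 0:2 3:3".
    static IndexedDocument parse(String line) {
        String[] fields = line.trim().split("\\s+");
        int length = Integer.parseInt(fields[0]);
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (int i = 1; i < fields.length; i++) {
            String[] pair = fields[i].split(":");
            int index = Integer.parseInt(pair[0]); // zero-based vocabulary index
            int freq = Integer.parseInt(pair[1]);
            counts.merge(index, freq, Integer::sum);
            total += freq;
        }
        // Assumed consistency check: frequencies add up to the token count.
        if (total != length) {
            throw new IllegalArgumentException(
                    "Frequencies sum to " + total + " but document length is " + length);
        }
        return new IndexedDocument(length, counts);
    }

    public static void main(String[] args) {
        IndexedDocument doc = parse("5 0:2 3:3");
        System.out.println(doc.length + " tokens, counts " + doc.counts);
    }
}
```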

### RTM: Relational Topic Model


```

tool=lda

model=rtm

vocab=

corpus=

trained_model=

rtm_train_graph=

```

- Implementation of [Chang and Blei (2010)](#rtm_ref).

- Jointly models topics and document links.

- Extends [LDA](#lda_cmd).

- Semi-optional arguments

	- `rtm_train_graph=` [optional in test]: Link file for RTM to train on. Each line contains an edge in the format `node-1 \t node-2 \t weight`. Node numbers start from 0. `weight` is optional and must be either 0 or 1 (default: 1).

	- `rtm_test_graph=` [optional in training]: Link file for RTM to evaluate on. Can be the same as the RTM training graph. The format is the same as for `rtm_train_graph`.

- Optional arguments

	- `nu=`: Variance of normal priors for weight vectors/matrices in RTM and its extensions (default: 1.0). Must be a positive real number.

	- `plr_interval=`: Interval of computing predictive link rank (default: 20). Must be a positive integer.

	- `neg=true`: Sample negative links (default: false).

	- `neg_ratio=`: The ratio of the number of negative links to the number of positive links (default: 1.0). Must be a positive real number.

	- `pred=`: Predicted document link probability matrix file.

	- `reg=`: Doc-doc regression value file.

	- `directed=true`: Set all edges directed (default: false).
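
The edge format used by the link files (`node-1 \t node-2 \t weight`, with `weight` optional and defaulting to 1) can be parsed as in the sketch below. The class `GraphEdge` is hypothetical and is not the reader used inside YWW Tools; it only illustrates the file format described above.

```java
class GraphEdge {
    final int from;
    final int to;
    final int weight;

    GraphEdge(int from, int to, int weight) {
        this.from = from;
        this.to = to;
        this.weight = weight;
    }

    // Parses a tab-separated edge line; the third field is optional.
    static GraphEdge parse(String line) {
        String[] fields = line.split("\t");
        int from = Integer.parseInt(fields[0].trim());
        int to = Integer.parseInt(fields[1].trim());
        // The weight defaults to 1 when the line has only two fields.
        int weight = fields.length > 2 ? Integer.parseInt(fields[2].trim()) : 1;
        if (from < 0 || to < 0 || weight < 0) {
            throw new IllegalArgumentException("Negative value in edge line: " + line);
        }
        return new GraphEdge(from, to, weight);
    }

    public static void main(String[] args) {
        GraphEdge edge = parse("0\t3");
        System.out.println(edge.from + " -> " + edge.to + " (weight " + edge.weight + ")");
    }
}
```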

#### Lex-WSB-RTM: [RTM](#rtm_cmd) with Lexical Weights and Weighted Stochastic Block Priors


```

tool=lda

model=lex-wsb-rtm

vocab=

corpus=

trained_model=

rtm_train_graph=

```

- Extends [RTM](#rtm_cmd).

- Optional arguments

	- `wsbm_graph=`: Link file for [WSBM](#wsbm_cmd) to find blocks. See [WSBM](#wsbm_cmd) for details.

	- `alpha_prime=`: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.

	- `a=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.

	- `b=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.

	- `gamma=`: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.

	- `blocks=`: Number of blocks (default: 10). Must be a positive integer.

	- `output_wsbm=`: File for [WSBM](#wsbm_cmd)-identified blocks. See [WSBM](#wsbm_cmd) for details.

	- `block_feature=true`: Include block features in link prediction (default: false).

#### Lex-WSB-Med-RTM: [Lex-WSB-RTM](#lex_wsb_rtm_cmd) with Hinge Loss


```

tool=lda

model=lex-wsb-med-rtm

vocab=

corpus=

trained_model=

rtm_train_graph=

```

- Implementation of [Yang et al. (2016)](#lex_wsb_med_rtm_ref).

- See [Zhu et al. (2012) and Zhu et al. (2014)](#med_lda_ref) for hinge loss.

- Extends [Lex-WSB-RTM](#lex_wsb_rtm_cmd).

- Link weight is either 1 or -1.

- Optional arguments

	- `c=`: Regularization parameter in hinge loss (default: 1.0). Must be a positive real number.

### SLDA: Supervised [LDA](#lda_cmd)


```

tool=lda

model=slda

vocab=

corpus=

trained_model=

label=

```

- Implementation of [McAuliffe and Blei (2008)](#slda_ref).

- Jointly models topics and document labels. Supports multi-class classification.

- Extends [LDA](#lda_cmd).

- Semi-optional arguments

	- `label=` [optional in test]: Label file. Each line contains the corresponding document's numeric label. If a document's label is not available, leave the corresponding line empty.

- Optional arguments

	- `sigma=`: Variance for the Gaussian generation of response variable in SLDA (default: 1.0). Must be a positive real number.

	- `nu=`: Variance of normal priors for weight vectors in SLDA and its extensions (default: 1.0). Must be a positive real number.

	- `pred=`: Predicted label file.

	- `reg=`: Regression value file.

#### BS-LDA: Binary [SLDA](#slda_cmd)


```

tool=lda

model=bs-lda

vocab=

corpus=

trained_model=

label=

```

- For binary classification only.

- Extends [SLDA](#slda_cmd).

- Label is either 1 or 0.

#### Lex-WSB-BS-LDA: [BS-LDA](#bs_lda_cmd) with Lexical Weights and Weighted Stochastic Block Priors


```

tool=lda

model=lex-wsb-bs-lda

vocab=

corpus=

trained_model=

label=

```

- Extends [BS-LDA](#bs_lda_cmd).

- Optional arguments

	- `wsbm_graph=`: Link file for [WSBM](#wsbm_cmd) to find blocks. See [WSBM](#wsbm_cmd) for details.

	- `alpha_prime=`: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.

	- `a=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.

	- `b=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.

	- `gamma=`: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.

	- `blocks=`: Number of blocks (default: 10). Must be a positive integer.

	- `directed=true`: Set all edges directed (default: false).

	- `output_wsbm=`: File for [WSBM](#wsbm_cmd)-identified blocks. See [WSBM](#wsbm_cmd) for details.

#### Lex-WSB-Med-LDA: [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_cmd) with Hinge Loss


```

tool=lda

model=lex-wsb-med-lda

vocab=

corpus=

trained_model=

label=

```

- See [Zhu et al. (2012) and Zhu et al. (2014)](#med_lda_ref) for hinge loss.

- Extends [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_cmd).

- Label is either 1 or -1.

- Optional arguments

	- `c=`: Regularization parameter in hinge loss (default: 1.0). Must be a positive real number.

### BP-LDA: [LDA](#lda_cmd) with Block Priors


```

tool=lda

model=bp-lda

vocab=

corpus=

trained_model=

block_graph=

```

- Use priors from pre-computed blocks.

- Extends [LDA](#lda_cmd).

- Semi-optional arguments

	- `block_graph=` [optional in test]: Pre-computed block file. Each line contains a block and consists of one or more documents denoted by document numbers. Document numbers are separated by space.

- Optional arguments

	- `alpha_prime=`: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.

### ST-LDA: Single Topic [LDA](#lda_cmd)


```

tool=lda

model=st-lda

vocab=

corpus=

trained_model=

short_corpus=

```

- Implementation of [Hong et al. (2016)](#st_lda_ref).

- Each document can only be assigned to one topic.

- Extends [LDA](#lda_cmd).

- Semi-optional arguments

	- `short_corpus=` [at least one of `short_corpus` and `corpus` should be specified]: Short corpus file.

- Optional arguments

	- `short_theta=`: Short documents' background topic distribution file.

	- `short_topic_assign=`: Short documents' topic assignment file.

### WSB-TM: Weighted Stochastic Block Topic Model


```

tool=lda

model=wsb-tm

vocab=

corpus=

trained_model=

wsbm_graph=

```

- Use priors from [WSBM](#wsbm_cmd)-computed blocks.

- Extends [LDA](#lda_cmd).

- Semi-optional arguments

	- `wsbm_graph=` [optional in test]: Link file for [WSBM](#wsbm_cmd) to find blocks. See [WSBM](#wsbm_cmd) for details.

- Optional arguments

	- `alpha_prime=`: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.

	- `a=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.

	- `b=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.

	- `gamma=`: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.

	- `blocks=`: Number of blocks (default: 10). Must be a positive integer.

	- `directed=true`: Set all edges directed (default: false).

	- `output_wsbm=`: File for [WSBM](#wsbm_cmd)-identified blocks. See [WSBM](#wsbm_cmd) for details.

## tLDA in Command Line


```

tool=tlda

vocab=

tree=

corpus=

trained_model=

```

- Implementation of tree LDA [(Boyd-Graber et al., 2007)](#tlda_ref).

- Required arguments

	- `vocab`: Vocabulary file. Each line contains a unique word.

	- `tree`: Tree prior file. Generated by [Tree Builder](#tree_builder_cmd).

	- `corpus`: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format:

		```

		<doc-len> <index-1>:<freq-1> <index-2>:<freq-2> ... <index-n>:<freq-n>

		```

		The first field is the total number of *tokens* in the document. Each `<index-i>:<freq-i>` pair gives the index of a word in the vocabulary file (starting from 0) and that word's frequency in the document. Words with zero frequency can be omitted.

	- `trained_model`: Trained model file. Read and written by the program.

- Optional arguments

	- `test=true`: Use the model for test (default: false).

	- `verbose=true`: Print log to console (default: true).

	- `alpha=`: Parameter of Dirichlet prior of document distribution over topics (default: 0.01). Must be a positive real number.

	- `beta=`: Parameter of Dirichlet prior of topic distribution over words (default: 0.01). Must be a positive real number.

	- `topics=`: Number of topics (default: 10). Must be a positive integer.

	- `iters=`: Number of iterations (default: 100). Must be a positive integer.

	- `update=false`: Update alpha while sampling (default: false).

	- `update_interval=`: Interval of updating alpha (default: 10). Must be a positive integer.

	- `theta=`: File for document distribution over topics. Each line contains a document's topic distribution. Topic weights are separated by space.

	- `output_topic=`: File for showing topics.

	- `topic_count=`: File for document-topic counts.

	- `top_word=`: Number of words to give when showing topics (default: 10). Must be a positive integer.

## MTM in Command Line


```

tool=mtm

num_langs=

dict=

vocab=

corpus=

trained_model=

```

- Implementation of Multilingual Topic Model [(Yang et al., 2019)](#mtm_ref).

- Required arguments

	- `num_langs`: Number of languages. Must be a positive integer greater than 1.

	- `dict`: Dictionary file. Each line contains a word translation pair, represented by four elements separated by tab (\t): language ID of the first word, first word, language ID of the second word, second word.

	- `vocab`: Vocabulary files. One file for each language. File names are separated by comma (,). Each line contains a unique word.

	- `corpus`: Corpus files in which documents are represented by word indexes and frequencies. One file for each language. File names are separated by comma (,). Each line contains a document in the following format:

		```

		<doc-len> <index-1>:<freq-1> <index-2>:<freq-2> ... <index-n>:<freq-n>

		```

		The first field is the total number of *tokens* in the document. Each `<index-i>:<freq-i>` pair gives the index of a word in the corresponding vocabulary file (starting from 0) and that word's frequency in the document. Words with zero frequency can be omitted.

	- `trained_model`: Trained model file. Read and written by the program.

- Optional arguments

	- `test=true`: Use the model for test (default: false).

	- `verbose=true`: Print log to console (default: true).

	- `alpha=`: Parameter of Dirichlet prior of document distribution over topics (default: 0.01). One value for each language. Values separated by comma (,). Must be a positive real number.

	- `beta=`: Parameter of Dirichlet prior of topic distribution over words (default: 0.01). One value for each language. Values separated by comma (,). Must be a positive real number.

	- `topics=`: Number of topics (default: 10). One value for each language. Values separated by comma (,). Must be a positive integer.

	- `iters=`: Number of iterations (default: 100). Must be a positive integer.

	- `update=false`: Update alpha while sampling (default: false).

	- `update_interval=`: Interval of updating alpha (default: 10). Must be a positive integer.

	- `theta=`: Files for document distribution over topics. One file for each language. File names are separated by comma (,). Each line contains a document's topic distribution. Topic weights are separated by space.

	- `rho=`: File for topic transformation matrices. Assuming there are $N$ languages, the file contains $N(N-1)$ matrices. Each matrix starts with a line containing the string `Rho[i][j]`, where `i` and `j` indicate two languages. The following $K_i$ rows contain the topic transformation matrix from language `i` to language `j`, and each row has $K_j$ values separated by spaces, where $K_i$ and $K_j$ are the numbers of topics in languages `i` and `j` respectively.

	- `output_topic=`: File for showing topics.

	- `topic_count=`: Files for document-topic counts. One file for each language. File names are separated by comma (,).

	- `top_word=`: Number of words to give when showing topics (default: 10). Must be a positive integer.

	- `reg=`: Regularization option (default: 0). 0 for no regularization, 1 for L1 norm, 2 for L2 norm, 3 for entropy, 4 for identity matrix.

	- `lambda=`: The regularization coefficient (default: 0.0). Only effective when `reg` is not 0.

	- `tfidf=true`: Use TF-IDF weights as word translation pairs' weights (default: false).

	- `word_tf_threshold=`: Ignore a word translation pair if either word's term frequency is less than or equal to the given threshold (default: 0). One value for each language. Values are separated by comma (,). Must be non-negative integers.

	

## Other Tools in Command Line

### WSBM: Weighted Stochastic Block Model


```

tool=wsbm

nodes=

blocks=

graph=

output=

```

- Implementation of [Aicher et al. (2014)](#wsbm_ref).

- Find latent blocks in a network, such that nodes in the same block are densely connected and nodes in different blocks are sparsely connected.

- Required arguments

	- `nodes`: Number of nodes in the graph. Must be a positive integer.

	- `blocks`: Number of blocks. Must be a positive integer.

	- `graph`: Graph file. Each line contains an edge in the format `node-1 \t node-2 \t weight`. Node numbers start from 0. `weight` is optional and must be a non-negative integer (default: 1).

	- `output`: Result file. The i-th line contains the block assignment of the i-th node.

- Optional arguments

	- `directed=true`: Set the edges as directed (default: false).

	- `a=`: Parameter for edge rates' Gamma prior (default: 1.0). Must be a positive real number.

	- `b=`: Parameter for edge rates' Gamma prior (default: 1.0). Must be a positive real number.

	- `gamma=`: Parameter for block distribution's Dirichlet prior (default: 1.0). Must be a positive real number.

	- `iters=`: Number of iterations (default: 100). Must be a positive integer.

	- `verbose=true`: Print log to console (default: true).

### SCC: Strongly Connected Components


```

tool=scc

nodes=

graph=

output=

```

- New implementation.

- Find [strongly connected components](https://en.wikipedia.org/wiki/Strongly_connected_component) in an undirected graph. In each component, every node is reachable from every other node in the same component.

- Arguments

	- `nodes`: Number of nodes in the graph. Must be a positive integer.

	- `graph`: Graph file. Each line contains an edge in the format `node-1 \t node-2`. Node numbers start from 0.

	- `output`: Result file. Each line contains a strongly connected component and consists of one or more nodes denoted by node numbers. Node numbers are separated by space.
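
In an undirected graph, strongly connected components coincide with ordinary connected components, so the output format above can be illustrated with a small union-find routine. This is a sketch only, not the algorithm implemented in YWW Tools: the class `Components` and its methods are hypothetical, and the ordering of components and nodes in the real output file may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class Components {
    // Returns the representative of x's set, shortening the path on the way.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Returns one line per component: node numbers separated by spaces.
    static List<String> components(int nodes, int[][] edges) {
        int[] parent = new int[nodes];
        for (int i = 0; i < nodes; i++) {
            parent[i] = i;
        }
        for (int[] edge : edges) {
            int a = find(parent, edge[0]);
            int b = find(parent, edge[1]);
            if (a != b) {
                // Keep the smaller root so each root is its component's minimum.
                parent[Math.max(a, b)] = Math.min(a, b);
            }
        }
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (int i = 0; i < nodes; i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>())
                  .add(String.valueOf(i));
        }
        List<String> lines = new ArrayList<>();
        for (List<String> group : groups.values()) {
            lines.add(String.join(" ", group));
        }
        return lines;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        components(5, edges).forEach(System.out::println);
    }
}
```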

### Stoplist


```

tool=stoplist

corpus=

output=

```

- New implementation.

- Only supports English, but can support other languages if a dictionary is provided.

- Required arguments

	- `corpus`: Corpus file with stop words. Each line contains a document. Words are separated by space.

	- `output`: Corpus file without stop words. Each line contains a document. Words are separated by space.

- Optional arguments

	- `dict=`: Dictionary file name. Each line contains a stop word.
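
The input and output formats above (one document per line, words separated by space, one stop word per dictionary line) suggest a filter like the following. It is a minimal sketch, not the Stoplist implementation in YWW Tools: the class `StopwordFilter` is hypothetical, and the case-insensitive comparison is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class StopwordFilter {
    // Removes stop words from one space-separated document.
    static String removeStopWords(String document, Set<String> stopWords) {
        return Arrays.stream(document.trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                // Assumption: stop words are stored in lower case.
                .filter(word -> !stopWords.contains(word.toLowerCase()))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "on"));
        System.out.println(removeStopWords("The cat sat on the mat", stopWords));
    }
}
```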

### Lemmatizer


```

tool=lemmatizer

corpus=

output=

```

- A re-packaging of `opennlp.tools.lemmatizer.SimpleLemmatizer`.

- Only supports English, but can support other languages if a dictionary is provided.

- Required arguments

	- `corpus`: Unlemmatized corpus file. Each line contains an unlemmatized, *tokenized*, and *POS-tagged* document.

	- `output`: Lemmatized corpus file. Each line contains a lemmatized document. Words are separated by space.

- Optional arguments

	- `dict=`: Dictionary file name. Each line contains a rule in the format `unlemmatized-word \t POS \t lemmatized-word`.

### POS Tagger


```

tool=pos-tagger

corpus=

output=

```

- A re-packaging of `opennlp.tools.postag.POSTaggerME`.

- Only supports English, but can support other languages if a model is provided.

- Required arguments

	- `corpus`: Untagged corpus file. Each line contains a *tokenized* untagged document.

	- `output`: Tagged corpus file. Each line contains a tagged document. Each word is annotated as `word_POS`.

- Optional arguments

	- `model=`: [Model](https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.postagger.training) file name.

### Stemmer


```

tool=stemmer

corpus=

output=

```

- A re-packaging of `PorterStemmer`.

- Only supports English.

- Arguments

	- `corpus`: Unstemmed corpus file. Each line contains an unstemmed document. Words are separated by space.

	- `output`: Stemmed corpus file. Each line contains a stemmed document. Words are separated by space.

### Tokenizer


```

tool=tokenizer

corpus=

output=

```

- A re-packaging of `opennlp.tools.tokenize.TokenizerME`.

- Only supports English, but can support other languages if a model is provided.

- Required arguments

	- `corpus`: Untokenized corpus file. Each line contains an untokenized document.

	- `output`: Tokenized corpus file. Each line contains a tokenized document.

- Optional arguments

	- `model=`: Model file name.

### Corpus Converter


```

tool=corpus-converter

get_vocab|to_index|to_word=true

word_corpus=

index_corpus=

vocab=

```

- New implementation.

- Arguments

	- `get_vocab`, `to_index`, `to_word`: Only one of them should be true.

		- `get_vocab`: Collect the vocabulary from `word_corpus` and write it to `vocab`.

		- `to_index`: Convert the word corpus file `word_corpus` into the indexed corpus file `index_corpus` and write the vocabulary to `vocab`.

		- `to_word`: Convert the indexed corpus file `index_corpus` into the word corpus file `word_corpus` given the vocabulary file `vocab`.

	- `word_corpus`: Corpus file in which documents are represented by words. Each line contains a document. Words are separated by space.

	- `index_corpus`: Corpus file in which documents are represented by word indexes and frequencies. Not required when using `get_vocab`. Each line contains a document in the following format:

		```

		<doc-len> <index-1>:<freq-1> <index-2>:<freq-2> ... <index-n>:<freq-n>

		```

		The first field is the total number of *tokens* in the document. Each `<index-i>:<freq-i>` pair gives the index of a word in the vocabulary file (starting from 0) and that word's frequency in the document. Words with zero frequency can be omitted.

	- `vocab`: Vocabulary file. Each line contains a unique word.
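
The `to_index` conversion described above can be pictured as follows. This is a minimal sketch, not the Corpus Converter's actual code: the class `WordToIndex` is hypothetical, and assigning vocabulary indexes in order of first appearance and emitting pairs in ascending index order are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordToIndex {
    // Converts one space-separated document into the indexed format,
    // appending unseen words to vocab in order of first appearance.
    static String toIndexed(String document, List<String> vocab) {
        Map<Integer, Integer> counts = new TreeMap<>(); // word index -> frequency
        int length = 0;
        for (String word : document.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            int index = vocab.indexOf(word); // linear scan; fine for a sketch
            if (index < 0) {
                vocab.add(word);
                index = vocab.size() - 1;
            }
            counts.merge(index, 1, Integer::sum);
            length++;
        }
        StringBuilder line = new StringBuilder().append(length);
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            line.append(' ').append(entry.getKey()).append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<String> vocab = new ArrayList<>();
        System.out.println(toIndexed("a b a c", vocab));
        System.out.println(vocab);
    }
}
```

For a large vocabulary, a `HashMap` from word to index would replace the linear `indexOf` lookup.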

### Tree Builder


```

tool=tree-builder

vocab=

score=

tree=

```

- Implementation of [Yang et al. (2017)](#tree_builder_ref).

- Arguments

	- `vocab`: Vocabulary file. Each line contains a unique word.

	- `score`: Word association file. Assuming there are V words in the vocabulary file, the association file has V lines. Each line corresponds to a word in the vocabulary and contains V float numbers which denote the word's association scores with all other words.

	- `tree`: The tree prior file.

- Optional arguments

	- `type=`: Tree prior type (default: 1). 1 for two-level tree; 2 for hierarchical agglomerative clustering (HAC) tree; 3 for HAC tree with leaf duplication.

	- `child=`: Number of child nodes per internal node for a two-level tree (default: 10).

	- `thresh=`: The confidence threshold for HAC (default: 0.0).
