https://github.com/ywwbill/YWWTools-v2
- Host: GitHub
- URL: https://github.com/ywwbill/YWWTools-v2
- Owner: ywwbill
- License: mit
- Created: 2020-01-07T15:59:41.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-09-05T08:08:11.000Z (over 5 years ago)
- Last Synced: 2024-08-03T18:21:11.072Z (almost 2 years ago)
- Language: Java
- Size: 18.1 MB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
- awesome-topic-models - YWWTools - Java-based package for various topic models by Weiwei Yang (Models / Miscellaneous topic models)
# YWW Tools
A package of my ([Weiwei Yang](http://cs.umd.edu/~wwyang/)'s) various tools (mostly for NLP). Feel free to email me with any questions.
* [Check Out](#check_out)
* [Dependencies](#dependencies)
* [Use YWW Tools in Command Line](#command)
* [LDA (Latent Dirichlet Allocation) in Command Line](#lda_cmd)
* [RTM: Relational Topic Model](#rtm_cmd)
* [Lex-WSB-RTM: RTM with Lexical Weights and Weighted Stochastic Block Priors](#lex_wsb_rtm_cmd)
* [Lex-WSB-Med-RTM: Lex-WSB-RTM with Hinge Loss](#lex_wsb_med_rtm_cmd)
* [SLDA: Supervised LDA](#slda_cmd)
* [BS-LDA: Binary SLDA](#bs_lda_cmd)
* [Lex-WSB-BS-LDA: BS-LDA with Lexical Weights and Weighted Stochastic Block Priors](#lex_wsb_bs_lda_cmd)
* [Lex-WSB-Med-LDA: Lex-WSB-BS-LDA with Hinge Loss](#lex_wsb_med_lda_cmd)
* [BP-LDA: LDA with Block Priors](#bp_lda_cmd)
* [ST-LDA: Single Topic LDA](#st_lda_cmd)
* [WSB-TM: Weighted Stochastic Block Topic Model](#wsb_tm_cmd)
* [tLDA in Command Line](#tlda_cmd)
* [MTM in Command Line](#mtm_cmd)
* [Other Tools in Command Line](#other_cmd)
* [WSBM: Weighted Stochastic Block Model](#wsbm_cmd)
* [SCC: Strongly Connected Components](#scc_cmd)
* [Stoplist](#stoplist_cmd)
* [Lemmatizer](#lemmatizer_cmd)
* [POS Tagger](#pos_tagger_cmd)
* [Stemmer](#stemmer_cmd)
* [Tokenizer](#tokenizer_cmd)
* [Corpus Converter](#corpus_converter_cmd)
* [Tree Builder](#tree_builder_cmd)
* [Use YWW Tools Source Code](#code_examples)
* [LDA Code Examples](#lda_code)
* [RTM](#rtm_code)
* [Lex-WSB-RTM](#lex_wsb_rtm_code)
* [Lex-WSB-Med-RTM](#lex_wsb_med_rtm_code)
* [SLDA](#slda_code)
* [BS-LDA](#bs_lda_code)
* [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_code)
* [Lex-WSB-Med-LDA](#lex_wsb_med_lda_code)
* [BP-LDA](#bp_lda_code)
* [ST-LDA](#st_lda_code)
* [WSB-TM](#wsb_tm_code)
* [tLDA Code Examples](#tlda_code)
* [MTM Code Examples](#mtm_code)
* [Other Code Examples](#other_code)
* [WSBM](#wsbm_code)
* [SCC](#scc_code)
* [Tree Builder](#tree_builder_code)
* [English Corpus Preprocessing](#preprocess)
* [Citation](#citation)
* [References](#ref)
## Check Out
```
git clone git@github.com:ywwbill/YWWTools-v2.git
```
## Dependencies
- Java 8.
- Files in `lib/`.
- Files in `dict/`.
## Use YWW Tools in Command Line
```
java -cp YWWTools-v2.jar:lib/* yang.weiwei.Tools <config-file>
```
- **Windows users**
- Please replace `YWWTools-v2.jar:lib/*` with `YWWTools-v2.jar;lib/*`.
- If you encounter any encoding problems in command line (especially when processing Chinese), please add `-Dfile.encoding=utf8` in your command.
- In the configuration file passed to `yang.weiwei.Tools`, specify the tool you want to use:
```
tool=
```
- Supported values of `tool` (case insensitive) include
- [LDA](#lda_cmd): Latent Dirichlet allocation. Includes a variety of extensions.
- [TLDA](#tlda_cmd): Tree LDA.
- [MTM](#mtm_cmd): Multilingual Topic Model.
- [WSBM](#wsbm_cmd): Weighted stochastic block model. Finds blocks in a network.
- [SCC](#scc_cmd): Strongly connected components.
- [Stoplist](#stoplist_cmd): Remove stop words. Supports English only, but can support other languages given a dictionary.
- [Lemmatizer](#lemmatizer_cmd): Lemmatize a POS-tagged corpus. Supports English only, but can support other languages given a dictionary.
- [POS-Tagger](#pos_tagger_cmd): Tag words' POS. Supports English only, but can support other languages given trained models.
- [Stemmer](#stemmer_cmd): Stem words. Supports English only.
- [Tokenizer](#tokenizer_cmd): Tokenize a corpus. Supports English only, but can support other languages given trained models.
- [Corpus-Converter](#corpus_converter_cmd): Convert a word corpus into an indexed corpus (for [LDA](#lda_cmd)) and vice versa.
- [Tree-Builder](#tree_builder_cmd): Build tree priors from word associations.
- You can always set `help` to true to see help information of
- supported tool names if you don't specify a tool name:
```
help=true
```
- a specific tool if you specify it (take [LDA](#lda_cmd) as an example):
```
help=true
tool=lda
```
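The settings above are plain `key=value` lines, so a configuration file can also be generated programmatically. The following is a minimal sketch, not part of the package: the class name `ConfigWriter` and any output file name are hypothetical, and it assumes the tool reads one setting per line exactly as shown in the examples.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not part of YWW Tools): renders settings as key=value lines.
class ConfigWriter {
    // One "key=value" pair per line, in the iteration order of the map.
    static String render(Map<String, String> settings) {
        return settings.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("\n", "", "\n"));
    }

    // Writes the rendered settings to a file in UTF-8.
    static void write(Path file, Map<String, String> settings) throws IOException {
        Files.write(file, render(settings).getBytes(StandardCharsets.UTF_8));
    }
}
```

Using a `LinkedHashMap` keeps the settings in the order they were added, which makes the generated file easier to compare with the examples in this README.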
## LDA (Latent Dirichlet Allocation) in Command Line
```
tool=lda
model=lda
vocab=
corpus=
trained_model=
```
- Implementation of [Blei et al. (2003)](#lda_ref).
- Required arguments
- `vocab`: Vocabulary file. Each line contains a unique word.
- `corpus`: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format
```
<doc-len> <word-index-1>:<frequency-1> <word-index-2>:<frequency-2> ... <word-index-n>:<frequency-n>
```
`<doc-len>` is the total number of *tokens* in this document. `<word-index-i>` is the index of a word in the vocabulary file, starting from 0. Words with zero frequency can be omitted.
- `trained_model`: Trained model file in JSON format. Read and written by program.
- Optional arguments
- `model=`: The topic model you want to use (default: [LDA](#lda_cmd)). Supported values (case insensitive) are
- [LDA](#lda_cmd): Vanilla LDA.
- [RTM](#rtm_cmd): Relational topic model.
- [Lex-WSB-RTM](#lex_wsb_rtm_cmd): [RTM](#rtm_cmd) with WSB-computed block priors and lexical weights.
- [Lex-WSB-Med-RTM](#lex_wsb_med_rtm_cmd): [Lex-WSB-RTM](#lex_wsb_rtm_cmd) with hinge loss.
- [SLDA](#slda_cmd): Supervised [LDA](#lda_cmd). Supports multi-class classification.
- [BS-LDA](#bs_lda_cmd): Binary [SLDA](#slda_cmd).
- [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_cmd): [BS-LDA](#bs_lda_cmd) with WSB-computed block priors and lexical weights.
- [Lex-WSB-Med-LDA](#lex_wsb_med_lda_cmd): [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_cmd) with hinge loss.
- [BP-LDA](#bp_lda_cmd): [LDA](#lda_cmd) with block priors. Blocks are pre-computed.
- [ST-LDA](#st_lda_cmd): Single topic [LDA](#lda_cmd). Each document can only be assigned to one topic.
- [WSB-TM](#wsb_tm_cmd): [LDA](#lda_cmd) with block priors. Blocks are computed by [WSBM](#wsbm_cmd).
- `test=true`: Use the model for test (default: false).
- `verbose=true`: Print log to console (default: true).
- `alpha=`: Parameter of Dirichlet prior of document distribution over topics (default: 1.0). Must be a positive real number.
- `beta=`: Parameter of Dirichlet prior of topic distribution over words (default: 0.1). Must be a positive real number.
- `topics=`: Number of topics (default: 10). Must be a positive integer.
- `iters=`: Number of iterations (default: 100). Must be a positive integer.
- `update=true`: Update alpha while sampling (default: false).
- `update_interval=`: Interval of updating alpha (default: 10). Must be a positive integer.
- `theta=`: File for document distribution over topics. Each line contains a document's topic distribution. Topic weights are separated by space.
- `output_topic=`: File for showing topics.
- `topic_count=`: File for document-topic counts.
- `top_word=`: Number of words to give when showing topics (default: 10). Must be a positive integer.
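To make the indexed corpus format concrete, here is a small sketch of a reader for one corpus line. It is not the package's own parser: the class name `IndexedDoc` is hypothetical, and it assumes each line starts with the token count followed by space-separated `index:frequency` pairs, with frequencies that sum to that count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for one line of an indexed corpus (assumed layout:
// "<doc-len> <index>:<freq> <index>:<freq> ...").
class IndexedDoc {
    // Returns word index -> frequency and checks the counts against the leading doc length.
    static Map<Integer, Integer> parse(String line) {
        String[] fields = line.trim().split("\\s+");
        int docLen = Integer.parseInt(fields[0]);
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (int i = 1; i < fields.length; i++) {
            String[] pair = fields[i].split(":");
            int index = Integer.parseInt(pair[0]);
            int freq = Integer.parseInt(pair[1]);
            counts.merge(index, freq, Integer::sum);
            total += freq;
        }
        if (total != docLen) {
            throw new IllegalArgumentException(
                    "Frequencies sum to " + total + " but doc length is " + docLen);
        }
        return counts;
    }
}
```

For example, `IndexedDoc.parse("5 0:2 3:3")` describes a five-token document in which word 0 occurs twice and word 3 occurs three times.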
### RTM: Relational Topic Model
```
tool=lda
model=rtm
vocab=
corpus=
trained_model=
rtm_train_graph=
```
- Implementation of [Chang and Blei (2010)](#rtm_ref).
- Jointly models topics and document links.
- Extends [LDA](#lda_cmd).
- Semi-optional arguments
- `rtm_train_graph=` [optional in test]: Link file for RTM to train. Each line contains an edge in the format `node-1 \t node-2 \t weight`. Node numbers start from 0. `weight` is optional, must be either 0 or 1, and defaults to 1 if not specified.
- `rtm_test_graph=` [optional in training]: Link file for RTM to evaluate. Can be the same as the RTM training graph. Format is the same as `rtm_train_graph`.
- Optional arguments
- `nu=`: Variance of normal priors for weight vectors/matrices in RTM and its extensions (default: 1.0). Must be a positive real number.
- `plr_interval=`: Interval of computing predictive link rank (default: 20). Must be a positive integer.
- `neg=true`: Sample negative links (default: false).
- `neg_ratio=`: The ratio of number of negative links to number of positive links (default: 1.0). Must be a positive real number.
- `pred=`: Predicted document link probability matrix file.
- `reg=`: Doc-doc regression value file.
- `directed=true`: Set all edges directed (default: false).
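The link file format can be illustrated with a short reader for a single edge line. This is a sketch under assumptions rather than the package's code: `EdgeLine` is a hypothetical class, fields are split on tabs and trimmed, and a missing weight falls back to 1 as described above.

```java
// Hypothetical holder for one line of a link file: "node-1 \t node-2 \t weight".
class EdgeLine {
    final int source;
    final int target;
    final int weight;

    EdgeLine(int source, int target, int weight) {
        this.source = source;
        this.target = target;
        this.weight = weight;
    }

    // Parses a tab-separated edge; the weight column is optional and defaults to 1.
    static EdgeLine parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected at least two node numbers: " + line);
        }
        int source = Integer.parseInt(fields[0].trim());
        int target = Integer.parseInt(fields[1].trim());
        int weight = fields.length > 2 ? Integer.parseInt(fields[2].trim()) : 1;
        if (source < 0 || target < 0 || weight < 0) {
            throw new IllegalArgumentException("Node numbers and weight must be non-negative: " + line);
        }
        return new EdgeLine(source, target, weight);
    }
}
```

The sketch only checks that the weight is non-negative; a stricter check for RTM would also reject weights other than 0 and 1.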
#### Lex-WSB-RTM: [RTM](#rtm_ref) with Lexical Weights and Weighted Stochastic Block Priors
```
tool=lda
model=lex-wsb-rtm
vocab=
corpus=
trained_model=
rtm_train_graph=
```
- Extends [RTM](#rtm_cmd).
- Optional arguments
- `wsbm_graph=`: Link file for [WSBM](#wsbm_cmd) to find blocks. See [WSBM](#wsbm_cmd) for details.
- `alpha_prime=`: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
- `a=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
- `b=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
- `gamma=`: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.
- `blocks=`: Number of blocks (default: 10). Must be a positive integer.
- `output_wsbm=`: File for [WSBM](#wsbm_cmd)-identified blocks. See [WSBM](#wsbm_cmd) for details.
- `block_feature=true`: Include block features in link prediction (default: false).
#### Lex-WSB-Med-RTM: [Lex-WSB-RTM](#lex_wsb_rtm_cmd) with Hinge Loss
```
tool=lda
model=lex-wsb-med-rtm
vocab=
corpus=
trained_model=
rtm_train_graph=
```
- Implementation of [Yang et al. (2016)](#lex_wsb_med_rtm_ref).
- See [Zhu et al. (2012) and Zhu et al. (2014)](#med_lda_ref) for hinge loss.
- Extends [Lex-WSB-RTM](#lex_wsb_rtm_cmd).
- Link weight is either 1 or -1.
- Optional arguments
- `c=`: Regularization parameter in hinge loss (default: 1.0). Must be a positive real number.
### SLDA: Supervised [LDA](#lda_cmd)
```
tool=lda
model=slda
vocab=
corpus=
trained_model=
label=
```
- Implementation of [McAuliffe and Blei (2008)](#slda_ref).
- Jointly models topics and document labels. Supports multi-class classification.
- Extends [LDA](#lda_cmd).
- Semi-optional arguments
- `label=` [optional in test]: Label file. Each line contains corresponding document's numeric label. If a document label is not available, leave the corresponding line empty.
- Optional arguments
- `sigma=`: Variance for the Gaussian generation of response variable in SLDA (default: 1.0). Must be a positive real number.
- `nu=`: Variance of normal priors for weight vectors in SLDA and its extensions (default: 1.0). Must be a positive real number.
- `pred=`: Predicted label file.
- `reg=`: Regression value file.
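Because unlabeled documents are marked by empty lines, a reader of the label file has to keep line positions aligned with documents. The sketch below shows one way to do that; `LabelFile` is a hypothetical class, and representing a missing label as `null` is an illustrative choice, not the package's behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for a label file: one numeric label per line, empty line = no label.
class LabelFile {
    // Returns one entry per document; null marks a document whose label line is empty.
    static List<Double> parse(List<String> lines) {
        List<Double> labels = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            labels.add(trimmed.isEmpty() ? null : Double.valueOf(trimmed));
        }
        return labels;
    }
}
```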
#### BS-LDA: Binary [SLDA](#slda_ref)
```
tool=lda
model=bs-lda
vocab=
corpus=
trained_model=
label=
```
- For binary classification only.
- Extends [SLDA](#slda_cmd).
- Label is either 1 or 0.
#### Lex-WSB-BS-LDA: [BS-LDA](#bs_lda_cmd) with Lexical Weights and Weighted Stochastic Block Priors
```
tool=lda
model=lex-wsb-bs-lda
vocab=
corpus=
trained_model=
label=
```
- Extends [BS-LDA](#bs_lda_cmd).
- Optional arguments
- `wsbm_graph=`: Link file for [WSBM](#wsbm_cmd) to find blocks. See [WSBM](#wsbm_cmd) for details.
- `alpha_prime=`: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
- `a=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
- `b=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
- `gamma=`: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.
- `blocks=`: Number of blocks (default: 10). Must be a positive integer.
- `directed=true`: Set all edges directed (default: false).
- `output_wsbm=`: File for [WSBM](#wsbm_cmd)-identified blocks. See [WSBM](#wsbm_cmd) for details.
#### Lex-WSB-Med-LDA: [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_cmd) with Hinge Loss
```
tool=lda
model=lex-wsb-med-lda
vocab=
corpus=
trained_model=
label=
```
- See [Zhu et al. (2012) and Zhu et al. (2014)](#med_lda_ref) for hinge loss.
- Extends [Lex-WSB-BS-LDA](#lex_wsb_bs_lda_cmd).
- Label is either 1 or -1.
- Optional arguments
- `c=`: Regularization parameter in hinge loss (default: 1.0). Must be a positive real number.
### BP-LDA: [LDA](#lda_cmd) with Block Priors
```
tool=lda
model=bp-lda
vocab=
corpus=
trained_model=
block_graph=
```
- Use priors from pre-computed blocks.
- Extends [LDA](#lda_cmd).
- Semi-optional arguments
- `block_graph=` [optional in test]: Pre-computed block file. Each line contains a block and consists of one or more documents denoted by document numbers. Document numbers are separated by space.
- Optional arguments
- `alpha_prime=`: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
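A pre-computed block file lists the members of each block, whereas code often needs the reverse lookup from document to block. The following sketch builds that lookup. `BlockFile` is a hypothetical class, and it assumes document numbers start from 0 and that the block ID is simply the line position; neither is stated in this README.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for a block file: each line lists the documents of one block.
class BlockFile {
    // Returns, for every document, the 0-based line number of its block, or -1 if unassigned.
    static int[] docToBlock(List<String> lines, int numDocs) {
        int[] block = new int[numDocs];
        Arrays.fill(block, -1);
        for (int b = 0; b < lines.size(); b++) {
            String line = lines.get(b).trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            for (String token : line.split("\\s+")) {
                block[Integer.parseInt(token)] = b;
            }
        }
        return block;
    }
}
```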
### ST-LDA: Single Topic [LDA](#lda_cmd)
```
tool=lda
model=st-lda
vocab=
corpus=
trained_model=
short_corpus=
```
- Implementation of [Hong et al. (2016)](#st_lda_ref).
- Each document can only be assigned to one topic.
- Extends [LDA](#lda_cmd).
- Semi-optional arguments
- `short_corpus=` [at least one of `short_corpus` and `corpus` should be specified]: Short corpus file.
- Optional arguments
- `short_theta=`: Short documents' background topic distribution file.
- `short_topic_assign=`: Short documents' topic assignment file.
### WSB-TM: Weighted Stochastic Block Topic Model
```
tool=lda
model=wsb-tm
vocab=
corpus=
trained_model=
wsbm_graph=
```
- Use priors from [WSBM](#wsbm_cmd)-computed blocks.
- Extends [LDA](#lda_cmd).
- Semi-optional arguments
- `wsbm_graph=` [optional in test]: Link file for [WSBM](#wsbm_cmd) to find blocks. See [WSBM](#wsbm_cmd) for details.
- Optional arguments
- `alpha_prime=`: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
- `a=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
- `b=`: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
- `gamma=`: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.
- `blocks=`: Number of blocks (default: 10). Must be a positive integer.
- `directed=true`: Set all edges directed (default: false).
- `output_wsbm=`: File for [WSBM](#wsbm_cmd)-identified blocks. See [WSBM](#wsbm_cmd) for details.
## tLDA in Command Line
```
tool=tlda
vocab=
tree=
corpus=
trained_model=
```
- Implementation of tree LDA [(Boyd-Graber et al., 2007)](#tlda_ref).
- Required arguments
- `vocab`: Vocabulary file. Each line contains a unique word.
- `tree`: Tree prior file. Generated by [Tree Builder](#tree_builder_cmd).
- `corpus`: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format
```
<doc-len> <word-index-1>:<frequency-1> <word-index-2>:<frequency-2> ... <word-index-n>:<frequency-n>
```
`<doc-len>` is the total number of *tokens* in this document. `<word-index-i>` is the index of a word in the vocabulary file, starting from 0. Words with zero frequency can be omitted.
- `trained_model`: Trained model file. Read and written by program.
- Optional arguments
- `test=true`: Use the model for test (default: false).
- `verbose=true`: Print log to console (default: true).
- `alpha=`: Parameter of Dirichlet prior of document distribution over topics (default: 0.01). Must be a positive real number.
- `beta=`: Parameter of Dirichlet prior of topic distribution over words (default: 0.01). Must be a positive real number.
- `topics=`: Number of topics (default: 10). Must be a positive integer.
- `iters=`: Number of iterations (default: 100). Must be a positive integer.
- `update=true`: Update alpha while sampling (default: false).
- `update_interval=`: Interval of updating alpha (default: 10). Must be a positive integer.
- `theta=`: File for document distribution over topics. Each line contains a document's topic distribution. Topic weights are separated by space.
- `output_topic=`: File for showing topics.
- `topic_count=`: File for document-topic counts.
- `top_word=`: Number of words to give when showing topics (default: 10). Must be a positive integer.
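The `theta` output holds one space-separated topic distribution per document, so a common follow-up step is to pick each document's heaviest topic. The sketch below does this for one line. `ThetaReader` is a hypothetical class, and numbering topics from 0 in column order is an assumption.

```java
// Hypothetical helper for one line of a theta file (space-separated topic weights).
class ThetaReader {
    // Returns the 0-based position of the largest weight; ties go to the first occurrence.
    static int dominantTopic(String line) {
        String[] fields = line.trim().split("\\s+");
        int best = 0;
        double bestWeight = Double.parseDouble(fields[0]);
        for (int k = 1; k < fields.length; k++) {
            double weight = Double.parseDouble(fields[k]);
            if (weight > bestWeight) {
                bestWeight = weight;
                best = k;
            }
        }
        return best;
    }
}
```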
## MTM in Command Line
```
tool=mtm
num_langs=
dict=
vocab=
corpus=
trained_model=
```
- Implementation of Multilingual Topic Model [(Yang et al., 2019)](#mtm_ref).
- Required arguments
- `num_langs`: Number of languages. Must be a positive integer greater than 1.
- `dict`: Dictionary file. Each line contains a word translation pair, represented by four elements separated by tab (\t): language ID of the first word, first word, language ID of the second word, second word.
- `vocab`: Vocabulary files. One file for each language. File names are separated by comma (,). Each line contains a unique word.
- `corpus`: Corpus files in which documents are represented by word indexes and frequencies. File names are separated by comma (,). One file for each language. Each line contains a document in the following format
```
<doc-len> <word-index-1>:<frequency-1> <word-index-2>:<frequency-2> ... <word-index-n>:<frequency-n>
```
`<doc-len>` is the total number of *tokens* in this document. `<word-index-i>` is the index of a word in the corresponding vocabulary file, starting from 0. Words with zero frequency can be omitted.
- `trained_model`: Trained model file. Read and written by program.
- Optional arguments
- `test=true`: Use the model for test (default: false).
- `verbose=true`: Print log to console (default: true).
- `alpha=`: Parameter of Dirichlet prior of document distribution over topics (default: 0.01). One value for each language. Values separated by comma (,). Must be a positive real number.
- `beta=`: Parameter of Dirichlet prior of topic distribution over words (default: 0.01). One value for each language. Values separated by comma (,). Must be a positive real number.
- `topics=`: Number of topics (default: 10). One value for each language. Values separated by comma (,). Must be a positive integer.
- `iters=`: Number of iterations (default: 100). Must be a positive integer.
- `update=true`: Update alpha while sampling (default: false).
- `update_interval=`: Interval of updating alpha (default: 10). Must be a positive integer.
- `theta=`: Files for document distribution over topics. One file for each language. File names are separated by comma (,). Each line contains a document's topic distribution. Topic weights are separated by space.
- `rho=`: File for topic transformation matrices. Assuming there are $N$ languages, the file contains $N(N-1)$ matrices. Each matrix starts with a line containing the string `Rho[i][j]`, where `i` and `j` indicate two languages. The following $K_i$ rows contain the topic transformation matrix from language `i` to language `j`, and each row has $K_j$ values separated by spaces, where $K_i$ and $K_j$ are the numbers of topics in languages `i` and `j` respectively.
- `output_topic=`: File for showing topics.
- `topic_count=`: Files for document-topic counts. One file for each language. File names are separated by comma (,).
- `top_word=`: Number of words to give when showing topics (default: 10). Must be a positive integer.
- `reg=`: Regularization option (default: 0). 0 for no regularization, 1 for L1 norm, 2 for L2 norm, 3 for entropy, 4 for identity matrix.
- `lambda=`: The regularization coefficient (default: 0.0). Only effective when `reg` is not 0.
- `tfidf=true`: Use TF-IDF weights as word translation pairs' weights (default: false).
- `word_tf_threshold=`: Ignore a word translation pair if either word's term frequency is equal to or lower than the given threshold (default: 0). One value for each language. Values are separated by comma (,). Must be non-negative integers.
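The layout of the `rho` file, a `Rho[i][j]` header line followed by the rows of that matrix, can be read back with a short loop. This sketch is illustrative only: `RhoFile` is a hypothetical class, and it assumes that every line not starting with `Rho[` is a matrix row and that blank lines can be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for a rho file: "Rho[i][j]" header lines followed by matrix rows.
class RhoFile {
    // Returns header string -> matrix, preserving the order of the file.
    static Map<String, double[][]> parse(List<String> lines) {
        Map<String, List<double[]>> rows = new LinkedHashMap<>();
        List<double[]> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("Rho[")) {
                current = new ArrayList<>();
                rows.put(line, current);
            } else {
                if (current == null) {
                    throw new IllegalArgumentException("Matrix row before any Rho[i][j] header");
                }
                current.add(Arrays.stream(line.split("\\s+"))
                        .mapToDouble(Double::parseDouble)
                        .toArray());
            }
        }
        Map<String, double[][]> matrices = new LinkedHashMap<>();
        for (Map.Entry<String, List<double[]>> e : rows.entrySet()) {
            matrices.put(e.getKey(), e.getValue().toArray(new double[0][]));
        }
        return matrices;
    }
}
```

Keying the result by the raw header string avoids guessing how language IDs are numbered; the matrix for languages `i` and `j` then has $K_i$ rows and $K_j$ columns.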
## Other Tools in Command Line
### WSBM: Weighted Stochastic Block Model
```
tool=wsbm
nodes=
blocks=
graph=
output=
```
- Implementation of [Aicher et al. (2014)](#wsbm_ref).
- Find latent blocks in a network, such that nodes in the same block are densely connected and nodes in different blocks are sparsely connected.
- Required arguments
- `nodes`: Number of nodes in the graph. Must be a positive integer.
- `blocks`: Number of blocks. Must be a positive integer.
- `graph`: Graph file. Each line contains an edge in the format `node-1 \t node-2 \t weight`. Node numbers start from 0. `weight` must be a non-negative integer. `weight` is optional. Its default value is 1 if not specified.
- `output`: Result file. The i-th line contains the block assignment of the i-th node.
- Optional arguments
- `directed=true`: Set the edges as directed (default: false).
- `a=`: Parameter for edge rates' Gamma prior (default: 1.0). Must be a positive real number.
- `b=`: Parameter for edge rates' Gamma prior (default: 1.0). Must be a positive real number.
- `gamma=`: Parameter for block distribution's Dirichlet prior (default: 1.0). Must be a positive real number.
- `iters=`: Number of iterations (default: 100). Must be a positive integer.
- `verbose=true`: Print log to console (default: true).
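Since the result file stores one block assignment per node, grouping nodes by block takes only a few lines. The sketch below is not part of the package: `WsbmOutput` is a hypothetical class, and it assumes each line holds a single integer block ID and that the first line belongs to node 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical reader for a WSBM result file: line i holds the block of node i.
class WsbmOutput {
    // Returns block ID -> member nodes, with blocks in ascending order.
    static Map<Integer, List<Integer>> group(List<String> lines) {
        Map<Integer, List<Integer>> blocks = new TreeMap<>();
        for (int node = 0; node < lines.size(); node++) {
            int block = Integer.parseInt(lines.get(node).trim());
            blocks.computeIfAbsent(block, k -> new ArrayList<>()).add(node);
        }
        return blocks;
    }
}
```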
### SCC: Strongly Connected Components
```
tool=scc
nodes=
graph=
output=
```
- New implementation.
- Find [strongly connected components](https://en.wikipedia.org/wiki/Strongly_connected_component) in a directed graph. In each component, every node is reachable from every other node in the same component.
- Arguments
- `nodes`: Number of nodes in the graph. Must be a positive integer.
- `graph`: Graph file. Each line contains an edge in the format `node-1 \t node-2`. Node numbers start from 0.
- `output`: Result file. Each line contains a strongly connected component and consists of one or more nodes denoted by node numbers. Node numbers are separated by space.
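For readers who want to see what the tool computes, here is a compact sketch of Tarjan's algorithm for strongly connected components. It is a textbook version, not the package's implementation: `SccSketch` is a hypothetical class, each `{a, b}` pair is treated as a directed edge from `a` to `b` (an assumption about how the graph file is read), and the recursion may be too deep for very large graphs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Textbook Tarjan SCC sketch; not the implementation shipped with YWW Tools.
class SccSketch {
    private final List<List<Integer>> adj = new ArrayList<>();
    private final int[] index;        // discovery order of each node, -1 if unvisited
    private final int[] low;          // smallest discovery index reachable from the node
    private final boolean[] onStack;
    private final Deque<Integer> stack = new ArrayDeque<>();
    private final List<List<Integer>> components = new ArrayList<>();
    private int counter = 0;

    private SccSketch(int nodes, int[][] edges) {
        for (int i = 0; i < nodes; i++) {
            adj.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]); // directed edge e[0] -> e[1]
        }
        index = new int[nodes];
        low = new int[nodes];
        onStack = new boolean[nodes];
        Arrays.fill(index, -1);
    }

    // Returns the components; nodes inside each component are sorted ascending.
    static List<List<Integer>> find(int nodes, int[][] edges) {
        SccSketch s = new SccSketch(nodes, edges);
        for (int v = 0; v < nodes; v++) {
            if (s.index[v] < 0) {
                s.visit(v);
            }
        }
        return s.components;
    }

    private void visit(int v) {
        index[v] = low[v] = counter++;
        stack.push(v);
        onStack[v] = true;
        for (int w : adj.get(v)) {
            if (index[w] < 0) {
                visit(w);
                low[v] = Math.min(low[v], low[w]);
            } else if (onStack[w]) {
                low[v] = Math.min(low[v], index[w]);
            }
        }
        if (low[v] == index[v]) { // v is the root of a component
            List<Integer> component = new ArrayList<>();
            int w;
            do {
                w = stack.pop();
                onStack[w] = false;
                component.add(w);
            } while (w != v);
            Collections.sort(component);
            components.add(component);
        }
    }
}
```

With four nodes and edges 0→1, 1→2, 2→0, and 2→3, the sketch reports two components: nodes 0, 1, 2 together and node 3 alone.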
### Stoplist
```
tool=stoplist
corpus=
output=
```
- New implementation.
- Only supports English, but can support other languages if dictionary is provided.
- Required arguments
- `corpus`: Corpus file with stop words. Each line contains a document. Words are separated by space.
- `output`: Corpus file without stop words. Each line contains a document. Words are separated by space.
- Optional arguments
- `dict=`: Dictionary file name. Each line contains a stop word.
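The effect of the stoplist on one document can be sketched as a simple filter. This is an illustration rather than the package's code: `StoplistSketch` is a hypothetical class, and lowercasing words before the lookup is an assumption about matching that the README does not specify.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stop word filter for one space-separated document.
class StoplistSketch {
    // Drops every word whose lowercase form is in the stop word set; keeps the original order.
    static String removeStopWords(String document, Set<String> stopWords) {
        return Arrays.stream(document.trim().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .filter(w -> !stopWords.contains(w.toLowerCase(Locale.ROOT)))
                .collect(Collectors.joining(" "));
    }
}
```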
### Lemmatizer
```
tool=lemmatizer
corpus=
output=
```
- A re-packaging of `opennlp.tools.lemmatizer.SimpleLemmatizer`.
- Only supports English, but can support other languages if dictionary is provided.
- Required arguments
- `corpus`: Unlemmatized corpus file. Each line contains an unlemmatized, *tokenized*, and *POS-tagged* document.
- `output`: Lemmatized corpus file. Each line contains a lemmatized document. Words are separated by space.
- Optional arguments
- `dict=`: Dictionary file name. Each line contains a rule in the format `unlemmatized-word \t POS \t lemmatized-word`.
### POS Tagger
```
tool=pos-tagger
corpus=
output=
```
- A re-packaging of `opennlp.tools.postag.POSTaggerME`.
- Only supports English, but can support other languages if model is provided.
- Required arguments
- `corpus`: Untagged corpus file. Each line contains a *tokenized* untagged document.
- `output`: Tagged corpus file. Each line contains a tagged document. Each word is annotated as `word_POS`.
- Optional arguments
- `model=`: [Model](https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.postagger.training) file name.
### Stemmer
```
tool=stemmer
corpus=
output=
```
- A re-packaging of `PorterStemmer`.
- Only supports English.
- Arguments
- `corpus`: Unstemmed corpus file. Each line contains an unstemmed document. Words are separated by space.
- `output`: Stemmed corpus file. Each line contains a stemmed document. Words are separated by space.
### Tokenizer
```
tool=tokenizer
corpus=
output=
```
- A re-packaging of `opennlp.tools.tokenize.TokenizerME`.
- Only supports English, but can support other languages if model is provided.
- Required arguments
- `corpus`: Untokenized corpus file. Each line contains an untokenized document.
- `output`: Tokenized corpus file. Each line contains a tokenized document.
- Optional arguments
- `model=`: Model file name.
### Corpus Converter
```
tool=corpus-converter
get_vocab|to_index|to_word=true
word_corpus=
index_corpus=
vocab=
```
- New implementation.
- Arguments
- `get_vocab`, `to_index`, `to_word`: Only one of them should be true.
- `get_vocab`: Collect the vocabulary from `word_corpus` and write it to `vocab`.
- `to_index`: Convert a word corpus file `word_corpus` into an indexed corpus file `index_corpus` and write the vocabulary to `vocab`.
- `to_word`: Convert an indexed corpus file `index_corpus` into a word corpus file `word_corpus` given the vocabulary file `vocab`.
- `word_corpus`: Corpus file in which documents are represented by words. Each line contains a document. Words are separated by space.
- `index_corpus`: Corpus file in which documents are represented by word indexes and frequencies. Not required when using `get_vocab`. Each line contains a document in the following format
```
<doc-len> <word-index-1>:<frequency-1> <word-index-2>:<frequency-2> ... <word-index-n>:<frequency-n>
```
`<doc-len>` is the total number of *tokens* in this document. `<word-index-i>` is the index of a word in the vocabulary file, starting from 0. Words with zero frequency can be omitted.
- `vocab`: Vocabulary file. Each line contains a unique word.
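The `to_index` conversion can be sketched as follows. This is an illustration under assumptions, not the package's converter: `ToIndexSketch` is a hypothetical class, word indexes are assigned in order of first appearance, and each output line is assumed to start with the token count followed by `index:frequency` pairs in ascending index order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical word-corpus-to-indexed-corpus conversion.
class ToIndexSketch {
    // Converts space-separated documents to indexed lines and appends the vocabulary to vocabOut.
    static List<String> convert(List<String> documents, List<String> vocabOut) {
        Map<String, Integer> ids = new LinkedHashMap<>(); // word -> index, in first-seen order
        List<String> indexed = new ArrayList<>();
        for (String document : documents) {
            Map<Integer, Integer> counts = new TreeMap<>(); // index -> frequency, sorted by index
            int length = 0;
            for (String word : document.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                Integer id = ids.get(word);
                if (id == null) {
                    id = ids.size();
                    ids.put(word, id);
                }
                counts.merge(id, 1, Integer::sum);
                length++;
            }
            StringBuilder line = new StringBuilder().append(length);
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                line.append(' ').append(e.getKey()).append(':').append(e.getValue());
            }
            indexed.add(line.toString());
        }
        vocabOut.addAll(ids.keySet());
        return indexed;
    }
}
```

For the two documents `a b a` and `b c`, the sketch yields the lines `3 0:2 1:1` and `2 1:1 2:1` with the vocabulary `a`, `b`, `c`.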
### Tree Builder
```
tool=tree-builder
vocab=
score=
tree=
```
- Implementation of [Yang et al. (2017)](#tree_builder_ref).
- Arguments
- `vocab`: Vocabulary file. Each line contains a unique word.
- `score`: Word association file. Assuming there are V words in the vocabulary file, the score file has V lines. Each line corresponds to a word in the vocabulary and contains V floating-point numbers, which denote the word's association scores with the words in the vocabulary.
- `tree`: The tree prior file.
- Optional arguments
- `type=`: Tree prior type. 1 for two-level tree; 2 for hierarchical agglomerative clustering (HAC) tree; 3 for HAC tree with leaf duplication (default: 1).
- `child=`: Number of child nodes per internal node for a two-level tree (default: 10).
- `thresh=`: The confidence threshold for HAC (default: 0.0).
##