https://github.com/xiaohan2012/chowmein
Automatic labeling for topic model
- Host: GitHub
- URL: https://github.com/xiaohan2012/chowmein
- Owner: xiaohan2012
- License: mit
- Created: 2015-08-04T10:34:25.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2015-08-09T20:24:14.000Z (almost 10 years ago)
- Last Synced: 2024-04-14T18:06:58.034Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 5.68 MB
- Stars: 57
- Watchers: 4
- Forks: 9
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[Build Status](https://travis-ci.org/xiaohan2012/chowmein)
[Coverage Status](https://coveralls.io/github/xiaohan2012/chowmein?branch=master)

# chowmein
Automatic labeling of topic models.
The algorithm is described in [Automatic Labeling of Multinomial Topic Models](http://sifaka.cs.uiuc.edu/czhai/pub/kdd07-label.pdf).
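The ranking idea can be sketched in a few lines. This is a rough illustration under simplifying assumptions, not the code in this repository: it scores each candidate label by the topic-weighted sum of PMI between the label and each topical word, which is our reading of the relevance score in the paper. PMI is estimated from document-level co-occurrence, and pairs that never co-occur are given PMI 0 instead of being smoothed. The function names (`contains_label`, `pmi`, `rank_labels`) and the toy corpus are made up for the example.

```python
import math


def contains_label(tokens, label):
    """True if the label (a tuple of tokens) occurs contiguously in tokens."""
    k = len(label)
    return any(tuple(tokens[i:i + k]) == label
               for i in range(len(tokens) - k + 1))


def pmi(word, label, docs):
    """Document-level PMI(word, label); 0.0 when the pair never co-occurs."""
    n = len(docs)
    has_w = [word in d for d in docs]
    has_l = [contains_label(d, label) for d in docs]
    n_w, n_l = sum(has_w), sum(has_l)
    n_wl = sum(1 for a, b in zip(has_w, has_l) if a and b)
    if n_wl == 0:
        return 0.0  # simplification: no smoothing for unseen pairs
    return math.log(n_wl * n / (n_w * n_l))


def rank_labels(topic_word_probs, labels, docs):
    """Sort labels by sum_w p(w | topic) * PMI(w, label), highest first."""
    scores = {
        label: sum(p * pmi(w, label, docs)
                   for w, p in topic_word_probs.items())
        for label in labels
    }
    return sorted(labels, key=lambda label: scores[label], reverse=True)


# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["bandit", "problem", "regret", "online"],
    ["bandit", "problem", "regret"],
    ["matrix", "recovery", "sparse", "rank"],
    ["matrix", "recovery", "convex"],
]
topic = {"regret": 0.6, "online": 0.4}  # p(word | topic), made up
labels = [("matrix", "recovery"), ("bandit", "problem")]

ranked = rank_labels(topic, labels, docs)
print(ranked[0])  # ('bandit', 'problem')
```

The label that co-occurs with the topic's high-probability words ends up first; a label from unrelated documents scores zero here.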
# Example
We model the abstracts of `NIPS 2014` (NIPS abstracts from 2008 to 2014 are available under `datasets/`).
We also constrain the labels to the tag patterns `NN,NN` or `JJ,NN` and use the 200 most informative candidate labels.

```
>>> python label_topic.py --line_corpus_path datasets/nips-2014.dat --preprocessing wordlen tag --label_tags NN,NN JJ,NN --n_cand_labels 200
...
Topical words:
--------------------
Topic 0: model data framework clustering information distributions two number world propose noise real work small
Topic 1: learning algorithm time problem online regret information decision conditional new stochastic algorithms selection problems
Topic 2: algorithm algorithms problem results learning optimal show function class functions graph bounds based general
Topic 3: learning training networks data tasks features neural kernel performance classification model datasets feature deep
Topic 4: matrix method sparse convex problems methods dimensional problem rank analysis propose regression norm gradient
Topic 5: model models inference approach data linear based gaussian method methods process sampling structure time

Topical labels:
--------------------
Topic labels:
Topic 0: neural population, inference algorithm, likelihood estimator, stochastic optimization, matrix recovery, paper develop, empirical study, covariance matrix
Topic 1: bandit problem, near-optimal regret, function approximation, paper consider, general class, multi-armed bandit, value function, statistical learning
Topic 2: logarithmic factor, statistical learning, convergence rate, communication cost, other hand, main result, solution quality, function approximation
Topic 3: pascal voc, major challenge, natural language, paper introduce, object recognition, policy search, empirical study, classification accuracy
Topic 4: low-rank tensor, low-rank matrix, matrix recovery, coordinate descent, problem finding, direction method, statistical learning, risk minimization
Topic 5: inference algorithm, introduce novel, exponential family, probabilistic inference, neural population, value function, policy search, other hand
```

# Usage
## Command line
For example:

    >>> python label_topic.py --line_corpus_path datasets/nips-2014.dat --preprocessing wordlen tag --label_tags NN,NN

For more details:

    >>> python label_topic.py --help
## Programmatically
Please refer to `label_topic.py`.
# How it works
The current version goes through the following steps:

1. Preprocessing using [nltk](http://www.nltk.org/)'s `word_tokenize`, `stem` and `pos_tag`.
2. Candidate phrase detection using *pointwise mutual information*. A POS tag constraint can be applied. For now, only **bigrams** are considered.
3. Topic modeling using [LDA](https://pypi.python.org/pypi/lda).
4. Candidate label ranking using the algorithm [here](http://sifaka.cs.uiuc.edu/czhai/pub/kdd07-label.pdf).

# TODO
- Better phrase detection through better POS tagging
- Better ways to compute language models for labels to support the `intra topical coverage` heuristic (which is now **disabled**)
- Support for user-defined candidate labels
- Faster PMI computation (using Cython, for example)
- More flexibility/options in preprocessing
- Leveraging a knowledge base to refine the labels
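
For reference on the PMI item above, candidate bigram detection by PMI can be sketched in pure Python as follows. This is an illustrative sketch, not the code in this repository: `top_bigrams_by_pmi` and `min_count` are hypothetical names, the POS tag constraint is left out, and the corpus is a toy example.

```python
import math
from collections import Counter


def top_bigrams_by_pmi(docs, n_best, min_count=2):
    """Return the n_best adjacent word pairs with the highest PMI."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unreliably high PMI
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return sorted(scores, key=scores.get, reverse=True)[:n_best]


# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["we", "study", "the", "bandit", "problem"],
    ["the", "bandit", "problem", "is", "hard"],
    ["the", "problem", "is", "that", "the", "regret", "is", "large"],
]
best = top_bigrams_by_pmi(docs, n_best=3)
print(best[0])  # ('bandit', 'problem')
```

The two nested counts are the hot spot for a large corpus, which is where a compiled implementation would help.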