Automatic labeling for topic model

Host: GitHub
URL: https://github.com/xiaohan2012/chowmein
Owner: xiaohan2012
License: mit
Created: 2015-08-04T10:34:25.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2015-08-09T20:24:14.000Z (almost 10 years ago)
Last Synced: 2024-04-14T18:06:58.034Z (about 1 year ago)
Language: Python
Homepage:
Size: 5.68 MB
Stars: 57
Watchers: 4
Forks: 9
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

[![Build Status](https://travis-ci.org/xiaohan2012/chowmein.svg?branch=master)](https://travis-ci.org/xiaohan2012/chowmein)

[![Coverage Status](https://coveralls.io/repos/xiaohan2012/chowmein/badge.svg?branch=master&service=github)](https://coveralls.io/github/xiaohan2012/chowmein?branch=master)

# chowmein

Automatic labeling of topic models.

The algorithm is described in [Automatic Labeling of Multinomial Topic Models](http://sifaka.cs.uiuc.edu/czhai/pub/kdd07-label.pdf).

# Example 

We model the abstracts of `NIPS 2014` (NIPS abstracts from 2008 to 2014 are available under `datasets/`).

We constrain the labels to bigrams tagged `NN,NN` or `JJ,NN`, and use the 200 most informative candidate labels.

```
>>> python label_topic.py --line_corpus_path datasets/nips-2014.dat --preprocessing wordlen tag --label_tags NN,NN JJ,NN --n_cand_labels 200
...
Topical words:
--------------------
Topic 0: model data framework clustering information distributions two number world propose noise real work small
Topic 1: learning algorithm time problem online regret information decision conditional new stochastic algorithms selection problems
Topic 2: algorithm algorithms problem results learning optimal show function class functions graph bounds based general
Topic 3: learning training networks data tasks features neural kernel performance classification model datasets feature deep
Topic 4: matrix method sparse convex problems methods dimensional problem rank analysis propose regression norm gradient
Topic 5: model models inference approach data linear based gaussian method methods process sampling structure time

Topical labels:
--------------------
Topic labels:
Topic 0: neural population, inference algorithm, likelihood estimator, stochastic optimization, matrix recovery, paper develop, empirical study, covariance matrix
Topic 1: bandit problem, near-optimal regret, function approximation, paper consider, general class, multi-armed bandit, value function, statistical learning
Topic 2: logarithmic factor, statistical learning, convergence rate, communication cost, other hand, main result, solution quality, function approximation
Topic 3: pascal voc, major challenge, natural language, paper introduce, object recognition, policy search, empirical study, classification accuracy
Topic 4: low-rank tensor, low-rank matrix, matrix recovery, coordinate descent, problem finding, direction method, statistical learning, risk minimization
Topic 5: inference algorithm, introduce novel, exponential family, probabilistic inference, neural population, value function, policy search, other hand
```
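
To make the `--label_tags` and `--n_cand_labels` options above more concrete, here is a minimal sketch of what such a filter could look like: adjacent word pairs are scored by pointwise mutual information, only pairs whose POS tags match an allowed pattern are kept, and the top N are returned. The function name `candidate_bigrams`, its parameters, and the toy data are assumptions made for illustration; this is not chowmein's actual code, and it uses only the standard library rather than nltk.

```python
import math
from collections import Counter


def candidate_bigrams(tagged_docs, allowed_tags, n_cand_labels):
    """Return up to n_cand_labels adjacent word pairs, highest PMI first.

    tagged_docs:  list of documents, each a list of (word, pos_tag) pairs.
    allowed_tags: set of (tag, tag) tuples, e.g. {("NN", "NN"), ("JJ", "NN")}.
    (Hypothetical helper for illustration only.)
    """
    unigrams, bigrams = Counter(), Counter()
    allowed = set()
    for doc in tagged_docs:
        unigrams.update(word for word, _ in doc)
        for (w1, t1), (w2, t2) in zip(doc, doc[1:]):
            bigrams[(w1, w2)] += 1          # count every adjacent pair
            if (t1, t2) in allowed_tags:    # remember pairs passing the tag filter
                allowed.add((w1, w2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(pair):
        # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
        p_xy = bigrams[pair] / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        return math.log(p_xy / (p_x * p_y))

    # Sort by descending PMI; break ties alphabetically for determinism.
    return sorted(allowed, key=lambda pair: (-pmi(pair), pair))[:n_cand_labels]


# Toy, hand-tagged input (made up for this example).
docs = [
    [("neural", "JJ"), ("network", "NN"), ("learns", "VBZ"), ("fast", "RB")],
    [("neural", "JJ"), ("network", "NN"), ("training", "NN")],
    [("the", "DT"), ("network", "NN"), ("training", "NN")],
]
# Turn command-line style strings such as "NN,NN" into tag tuples.
tag_patterns = {tuple(t.split(",")) for t in ["NN,NN", "JJ,NN"]}
print(candidate_bigrams(docs, tag_patterns, n_cand_labels=200))
```

In this sketch the tag filter decides which pairs are eligible, while the PMI is still computed over all adjacent pairs so that the probabilities are not distorted by the filter.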

# Usage

## Command line

For example:

    >>> python label_topic.py --line_corpus_path datasets/nips-2014.dat --preprocessing wordlen tag --label_tags NN,NN

For more details:

    >>> python label_topic.py --help

## Programmatically

Please refer to `label_topic.py`.

# How it works

The current version goes through the following steps:

1. Preprocessing using [nltk](http://www.nltk.org/)'s `word_tokenize`, `stem` and `pos_tag`.

2. Candidate phrase detection using *pointwise mutual information*: a POS tag constraint can be applied. For now, only **bigrams** are considered.

3. Topic modeling using [LDA](https://pypi.python.org/pypi/lda).

4. Candidate label ranking using the algorithm [here](http://sifaka.cs.uiuc.edu/czhai/pub/kdd07-label.pdf).
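
The final ranking step can be sketched as follows. The linked paper's first-order relevance ranks a label `l` for a topic by the expected PMI between the topic's words and the label, i.e. the sum over words `w` of `p(w | topic) * PMI(w, l)`. The code below is a rough illustration under several assumptions: PMI is estimated from document-level co-occurrence, word-label pairs that never co-occur contribute 0 instead of negative infinity, and the names `word_label_pmi` and `rank_labels` are hypothetical. It is not taken from chowmein, whose implementation may differ.

```python
import math
from collections import Counter


def word_label_pmi(docs, labels):
    """Estimate PMI(word, label) from document-level co-occurrence.

    docs:   list of token lists.
    labels: list of bigram tuples such as ("bandit", "problem").
    (Hypothetical helper for illustration only.)
    """
    n_docs = len(docs)
    word_df, label_df, joint_df = Counter(), Counter(), Counter()
    for doc in docs:
        words = set(doc)
        doc_bigrams = set(zip(doc, doc[1:]))
        present = [label for label in labels if label in doc_bigrams]
        word_df.update(words)
        label_df.update(present)
        for label in present:
            for word in words:
                joint_df[(word, label)] += 1
    # log( p(w, l) / (p(w) p(l)) ) with probabilities as document frequencies / n_docs
    return {
        (word, label): math.log(n_wl * n_docs / (word_df[word] * label_df[label]))
        for (word, label), n_wl in joint_df.items()
    }


def rank_labels(topic_word_probs, pmi, labels, n_labels):
    """Rank labels by sum_w p(w | topic) * PMI(w, label)."""
    def score(label):
        # Assumption: pairs that never co-occur contribute 0.
        return sum(p * pmi.get((w, label), 0.0) for w, p in topic_word_probs.items())

    return sorted(labels, key=lambda label: (-score(label), label))[:n_labels]


# Toy corpus and topic (made up for this example).
docs = [
    ["bandit", "problem", "regret"],
    ["bandit", "problem", "regret", "bound"],
    ["matrix", "recovery", "rank"],
    ["matrix", "recovery", "sparse"],
]
labels = [("bandit", "problem"), ("matrix", "recovery")]
topic = {"regret": 0.6, "bound": 0.3, "rank": 0.1}  # p(w | topic)

pmi = word_label_pmi(docs, labels)
print(rank_labels(topic, pmi, labels, n_labels=2))
```

In practice the `p(w | topic)` values would come from the fitted LDA model rather than a hand-written dictionary.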

# TODO

- Better phrase detection through better POS tagging

- Better ways to compute language models for labels to support the `intra topical coverage` heuristic (which is now **disabled**)

- Support for user-defined candidate labels

- Faster PMI computation (using Cython, for example)

- More flexibility/options for preprocessing

- Leveraging a knowledge base to refine the labels
