https://github.com/chyikwei/bnp

Bayesian nonparametric models for python

bayesian data-analysis probabilistic-graphical-models python topic-modeling


Host: GitHub
URL: https://github.com/chyikwei/bnp
Owner: chyikwei
Created: 2017-03-04T04:08:08.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2018-09-11T14:39:53.000Z (over 6 years ago)
Last Synced: 2024-08-03T18:21:02.244Z (8 months ago)
Topics: bayesian, data-analysis, probabilistic-graphical-models, python, topic-modeling
Language: Python
Homepage:
Size: 6.44 MB
Stars: 17
Watchers: 2
Forks: 4
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-topic-models - bnp - Cython reimplementation based on *online-hdp* following scikit-learn's API. (Models / Hierarchical Dirichlet Process (HDP) [:page_facing_up:](https://papers.nips.cc/paper/2004/file/fb4ab556bc42d6f0ee0f9e24ec4d1af0-Paper.pdf))

README

[![Build Status](https://travis-ci.org/chyikwei/bnp.svg?branch=master)](https://travis-ci.org/chyikwei/bnp)

[![Build Status](https://circleci.com/gh/chyikwei/bnp.png?&style=shield)](https://circleci.com/gh/chyikwei/bnp)

[![Coverage Status](https://coveralls.io/repos/github/chyikwei/bnp/badge.svg?branch=master)](https://coveralls.io/github/chyikwei/bnp?branch=master)

# Bayesian Nonparametric

Bayesian Nonparametric models with Python.

The models follow scikit-learn's API and can be used as extensions to it.

Current model:
--------------

- **Hierarchical Dirichlet Process**

   HDP is similar to LDA (Latent Dirichlet Allocation) but assumes an "infinite" number of topics. This implementation is based on Chong Wang's online-hdp and is optimized with Cython.
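To make the "infinite" number of topics more concrete, here is a minimal sketch of a truncated stick-breaking construction, which is the usual way an HDP's unbounded topic weights are approximated with a finite cap. It is not this library's code: `truncated_stick_breaking` and `concentration` are hypothetical names chosen for illustration, and only the Python standard library is used.

```python
import random


def truncated_stick_breaking(n_topics, concentration, seed=0):
    """Draw topic weights from a stick-breaking process truncated at n_topics.

    Hypothetical helper for illustration; not part of bnp.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any topic
    for k in range(n_topics):
        if k == n_topics - 1:
            # Truncation: the last topic takes whatever is left of the stick.
            fraction = 1.0
        else:
            fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights


weights = truncated_stick_breaking(n_topics=10, concentration=1.0)
print(["%.2f" % w for w in weights])
print("sum = %.2f" % sum(weights))
```

The weights always sum to 1, and most of the mass typically falls on a few topics. A model capped at a fixed number of topics can therefore still end up using fewer of them.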

  

Reference:
----------

- "Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley, 2013

- "Online Variational Inference for the Hierarchical Dirichlet Process", Chong Wang, John Paisley, David M. Blei, 2011

- Chong Wang's [online-hdp code](https://github.com/blei-lab/online-hdp).

Install:
--------

```
# clone repository
git clone git@github.com:chyikwei/bnp.git
cd bnp

# install dependencies (cython, numpy, scipy, scikit-learn)
pip install -r requirements.txt
pip install .
```

Getting started:
----------------

`bnp.utils` provides a function that generates a synthetic document-word matrix with hidden topics. We will fit our HDP model to it.

First, generate a document-word matrix with 5 hidden topics. Each topic has 10 unique words and 100 documents, and each document contains 20 words.

```python
>>> from __future__ import print_function
>>> from bnp.online_hdp import HierarchicalDirichletProcess
>>> from bnp.utils import make_doc_word_matrix

>>> tf = make_doc_word_matrix(n_topics=5,
...                           words_per_topic=10,
...                           docs_per_topic=100,
...                           words_per_doc=20,
...                           shuffle=True,
...                           random_state=0)
>>> tf.shape
(500, 50)
```

In this matrix, each row (document) contains words from a single topic only: words 0 to 9 belong to topic 1, words 10 to 19 to topic 2, and so on.

```python
>>> tf[0].toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 1, 4, 1, 2, 3, 3, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]])
>>> tf[1].toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 1, 3, 2, 1, 2, 0, 3, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]])
```
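The block structure of the rows can be checked mechanically. The sketch below is a hypothetical helper, not part of `bnp`. It takes a dense row of word counts and returns the zero-based index of the hidden topic whose word block holds all of the nonzero counts. The two rows are copied by hand from the output above so that the example runs without the library.

```python
def hidden_topic_of(row, words_per_topic=10):
    """Return the zero-based hidden topic that owns every nonzero count in row.

    Hypothetical helper for illustration; not part of bnp.
    """
    topics = {idx // words_per_topic
              for idx, count in enumerate(row) if count > 0}
    if len(topics) != 1:
        raise ValueError(
            "expected words from exactly one topic, found %d" % len(topics))
    return topics.pop()


# Dense copies of the two rows printed above (50 word counts each).
row0 = [0] * 11 + [2, 2, 2, 1, 4, 1, 2, 3, 3] + [0] * 30
row1 = [0] * 30 + [3, 2, 3, 1, 3, 2, 1, 2, 0, 3] + [0] * 10

print(hidden_topic_of(row0))  # 1, i.e. "topic 2" in the one-based naming above
print(hidden_topic_of(row1))  # 3, i.e. "topic 4"
```

With the real matrix, something like `hidden_topic_of(tf[i].toarray().ravel())` should work, assuming `tf` is a SciPy sparse matrix as the `toarray()` calls suggest.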

Next, we fit an HDP model to this matrix:

```python
>>> hdp = HierarchicalDirichletProcess(n_topic_truncate=10,
...                                    n_doc_truncate=3,
...                                    max_iter=5,
...                                    random_state=0)
>>> hdp.fit(tf)
```

Then we can print the topic proportions and the top words of each topic in the HDP model.

```python
# print topic function
>>> def print_top_words(model, n_words):
...     topic_distr = model.topic_distribution()
...     for topic_idx in range(model.lambda_.shape[0]):
...         topic = model.lambda_[topic_idx, :]
...         message = "Topic %d (proportion: %.2f): " % (topic_idx, topic_distr[topic_idx])
...         message += " ".join([str(i) for i in topic.argsort()[:-n_words - 1:-1]])
...         print(message)

>>> print_top_words(hdp, 10)
Topic 0 (proportion: 0.20): 3 1 7 5 8 4 0 2 9 6
Topic 1 (proportion: 0.00): 49 12 22 21 20 19 18 17 16 15
Topic 2 (proportion: 0.04): 43 49 44 45 47 40 46 48 41 42
Topic 3 (proportion: 0.13): 14 18 10 15 16 12 17 19 11 13
Topic 4 (proportion: 0.07): 19 16 10 15 11 17 12 13 18 14
Topic 5 (proportion: 0.01): 23 29 28 20 21 25 26 24 27 22
Topic 6 (proportion: 0.01): 31 38 35 39 30 33 34 37 32 36
Topic 7 (proportion: 0.19): 35 31 39 30 33 38 32 34 36 37
Topic 8 (proportion: 0.16): 48 42 46 49 45 47 41 44 40 43
Topic 9 (proportion: 0.19): 21 29 28 23 20 24 26 27 25 22
```

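Here HDP finds 7 large topics (proportion > 1%), and each maps to one of the 5 hidden topics we generated earlier. Some hidden topics are split across two fitted topics: topics 3 and 4 both cover words 10 to 19, and topics 2 and 8 both cover words 40 to 49.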
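The correspondence between fitted and hidden topics can be verified with a few lines of plain Python. This is a minimal sketch using the proportions and top word ids copied from the printed output. `match_hidden_topics` and `min_proportion` are hypothetical names, and assigning each fitted topic to the word block that holds most of its top words is an assumption made for illustration, not something the library does.

```python
from collections import Counter

# (proportion, top word ids) copied from the printed output, by fitted topic.
fitted = {
    0: (0.20, [3, 1, 7, 5, 8, 4, 0, 2, 9, 6]),
    1: (0.00, [49, 12, 22, 21, 20, 19, 18, 17, 16, 15]),
    2: (0.04, [43, 49, 44, 45, 47, 40, 46, 48, 41, 42]),
    3: (0.13, [14, 18, 10, 15, 16, 12, 17, 19, 11, 13]),
    4: (0.07, [19, 16, 10, 15, 11, 17, 12, 13, 18, 14]),
    5: (0.01, [23, 29, 28, 20, 21, 25, 26, 24, 27, 22]),
    6: (0.01, [31, 38, 35, 39, 30, 33, 34, 37, 32, 36]),
    7: (0.19, [35, 31, 39, 30, 33, 38, 32, 34, 36, 37]),
    8: (0.16, [48, 42, 46, 49, 45, 47, 41, 44, 40, 43]),
    9: (0.19, [21, 29, 28, 23, 20, 24, 26, 27, 25, 22]),
}


def match_hidden_topics(fitted, words_per_topic=10, min_proportion=0.01):
    """Map each large fitted topic to the zero-based hidden topic (word block)
    that contains most of its top words."""
    matches = {}
    for topic_idx, (proportion, top_words) in sorted(fitted.items()):
        if proportion <= min_proportion:
            continue  # skip topics at or below the 1% threshold
        blocks = Counter(word // words_per_topic for word in top_words)
        matches[topic_idx] = blocks.most_common(1)[0][0]
    return matches


matches = match_hidden_topics(fitted)
print(matches)
# {0: 0, 2: 4, 3: 1, 4: 1, 7: 3, 8: 4, 9: 2}
```

Seven fitted topics pass the threshold, and together they cover all five hidden word blocks (0 to 4, zero-based).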

Examples
--------

See the `bnp/examples` folder. (An IPython notebook will be added soon.)

Running Test:
-------------

```
python setup.py test
```

Uninstall:
----------

```
pip uninstall bnp
```
