Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/vi3k6i5/GuidedLDA

semi supervised guided topic model with custom guidedLDA
https://github.com/vi3k6i5/GuidedLDA

data-science guided-topic-modeling guidedlda machine-learning seededlda topic-modeling

Last synced: 4 months ago
JSON representation

semi supervised guided topic model with custom guidedLDA

Host: GitHub
URL: https://github.com/vi3k6i5/GuidedLDA
Owner: vi3k6i5
License: mpl-2.0
Created: 2017-10-04T19:00:09.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2020-10-24T09:38:07.000Z (over 3 years ago)
Last Synced: 2024-01-09T06:54:05.732Z (6 months ago)
Topics: data-science, guided-topic-modeling, guidedlda, machine-learning, seededlda, topic-modeling
Language: Python
Homepage:
Size: 2.18 MB
Stars: 487
Watchers: 13
Forks: 103
Open Issues: 54
Metadata Files:
- Readme: README.rst
- Changelog: ChangeLog
- License: LICENSE

Lists

awesome-stars - vi3k6i5/GuidedLDA - semi supervised guided topic model with custom guidedLDA (Python)
awesome-topic-models - GuidedLDA - Python implementation that can be guided by setting some seed words per topic (using Gibbs sampling) [:page_facing_up:](https://www.aclweb.org/anthology/E12-1021.pdf) (Models / Miscellaneous topic models)
awesome-stars - vi3k6i5/GuidedLDA - semi supervised guided topic model with custom guidedLDA (Python)

README

        GuidedLDA: Guided Topic modeling with latent Dirichlet allocation

====================================================

.. image:: https://readthedocs.org/projects/guidedlda/badge/?version=latest

    :target: http://guidedlda.readthedocs.io/en/latest/?badge=latest

    :alt: Documentation Status

.. image:: https://badge.fury.io/py/guidedlda.svg

    :target: https://badge.fury.io/py/guidedlda

    :alt: Package version

``GuidedLDA`` OR ``SeededLDA`` implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. ``GuidedLDA`` can be guided by setting some seed words per topic. Which will make the topics converge in that direction.

You can read more about guidedlda in `the documentation `_.

I published an article about it on `freecodecamp Medium blog `_.

Installation

------------

::

    pip install guidedlda

If pip install does not work, then try the next step:

::

    https://github.com/vi3k6i5/GuidedLDA

    cd GuidedLDA

    sh build_dist.sh

    python setup.py sdist

    pip install -e .

If the above step also does not work, please raise an `issue `_ with details of your workstation's OS version, Python version, architecture etc. and I will try my best to fix it ASAP.

Getting started

---------------

``guidedlda.GuidedLDA`` implements latent Dirichlet allocation (LDA). The interface follows

conventions found in scikit-learn_.

`Example Code `_.

The following demonstrates how to inspect a model of a subset of the NYT

news dataset. The input below, ``X``, is a document-term matrix (sparse matrices

are accepted).

.. code-block:: python

    >>> import numpy as np

    >>> import guidedlda

    

    >>> X = guidedlda.datasets.load_data(guidedlda.datasets.NYT)

    >>> vocab = guidedlda.datasets.load_vocab(guidedlda.datasets.NYT)

    >>> word2id = dict((v, idx) for idx, v in enumerate(vocab))

    

    >>> X.shape

    (8447, 3012)

    

    >>> X.sum()

    1221626

    >>> # Normal LDA without seeding

    >>> model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)

    >>> model.fit(X)

    INFO:guidedlda:n_documents: 8447

    INFO:guidedlda:vocab_size: 3012

    INFO:guidedlda:n_words: 1221626

    INFO:guidedlda:n_topics: 5

    INFO:guidedlda:n_iter: 100

    WARNING:guidedlda:all zero column in document-term matrix found

    INFO:guidedlda:<0> log likelihood: -11489265

    INFO:guidedlda:<20> log likelihood: -9844667

    INFO:guidedlda:<40> log likelihood: -9694223

    INFO:guidedlda:<60> log likelihood: -9642506

    INFO:guidedlda:<80> log likelihood: -9617962

    INFO:guidedlda:<99> log likelihood: -9604031

    

    >>> topic_word = model.topic_word_

    >>> n_top_words = 8

    >>> for i, topic_dist in enumerate(topic_word):

    >>>     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]

    >>>     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

    Topic 0: company percent market business plan pay price increase

    Topic 1: game play team win player season second start

    Topic 2: life child write man school woman father family

    Topic 3: place open small house music turn large play

    Topic 4: official state government political states issue leader case

    

    >>> # Guided LDA with seed topics.

    >>> seed_topic_list = [['game', 'team', 'win', 'player', 'season', 'second', 'victory'],

    >>>                    ['percent', 'company', 'market', 'price', 'sell', 'business', 'stock', 'share'],

    >>>                    ['music', 'write', 'art', 'book', 'world', 'film'],

    >>>                    ['political', 'government', 'leader', 'official', 'state', 'country', 'american','case', 'law', 'police', 'charge', 'officer', 'kill', 'arrest', 'lawyer']]

    

    >>> model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)

    

    >>> seed_topics = {}

    >>> for t_id, st in enumerate(seed_topic_list):

    >>>     for word in st:

    >>>         seed_topics[word2id[word]] = t_id

    

    >>> model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

    INFO:guidedlda:n_documents: 8447

    INFO:guidedlda:vocab_size: 3012

    INFO:guidedlda:n_words: 1221626

    INFO:guidedlda:n_topics: 5

    INFO:guidedlda:n_iter: 100

    WARNING:guidedlda:all zero column in document-term matrix found

    INFO:guidedlda:<0> log likelihood: -11486362

    INFO:guidedlda:<20> log likelihood: -9767277

    INFO:guidedlda:<40> log likelihood: -9663718

    INFO:guidedlda:<60> log likelihood: -9624150

    INFO:guidedlda:<80> log likelihood: -9601684

    INFO:guidedlda:<99> log likelihood: -9587803

    

    

    >>> n_top_words = 10

    >>> topic_word = model.topic_word_

    >>> for i, topic_dist in enumerate(topic_word):

    >>>     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]

    >>>     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

    Topic 0: game play team win season player second point start victory

    Topic 1: company percent market price business sell executive pay plan sale

    Topic 2: play life man music place write turn woman old book

    Topic 3: official government state political leader states issue case member country

    Topic 4: school child city program problem student state study family group

The document-topic distributions should be retrived as: ``doc_topic = model.transform(X)``.

.. code-block:: python

    >>> doc_topic = model.transform(X)

    >>> for i in range(9):

    >>>     print("top topic: {} Document: {}".format(doc_topic[i].argmax(), 

                                                      ', '.join(np.array(vocab)[list(reversed(X[i,:].argsort()))[0:5]])))

    top topic: 4 Document: plant, increase, food, increasingly, animal

    top topic: 3 Document: explain, life, country, citizen, nation

    top topic: 2 Document: thing, solve, problem, machine, carry

    top topic: 2 Document: company, authority, opera, artistic, director

    top topic: 3 Document: partner, lawyer, attorney, client, indict

    top topic: 2 Document: roll, place, soon, treat, rating

    top topic: 3 Document: city, drug, program, commission, report

    top topic: 1 Document: company, comic, series, case, executive

    top topic: 3 Document: son, scene, charge, episode, attack

Optionally, reduce the model by purging additional matrices:

.. code-block:: python

    >>> # Next step will lighten the model object

    >>> # This step will delete some matrices inside the model.

    >>> # you will be able to use model.transform(X) the same way as earlier.

    >>> # you wont be able to use model.fit_transform(X_new)

    >>> model.purge_extra_matrices()

Save the model for production or for running later:

.. code-block:: python

    >>> from six.moves import cPickle as pickle

    >>> with open('guidedlda_model.pickle', 'wb') as file_handle:

    >>>     pickle.dump(model, file_handle)

    >>> # load the model for prediction

    >>> with open('guidedlda_model.pickle', 'rb') as file_handle:

    >>>     model = pickle.load(file_handle)

    >>> doc_topic = model.transform(X)

Requirements

------------

Python 2.7 or Python 3.3+ is required. The following packages are required

- numpy_

- pbr_

Caveat

------

``guidedlda`` aims for Guiding LDA. More often then not the topics we get from a LDA model are not to our satisfaction. GuidedLDA can give the topics a nudge in the direction we want it to converge. We have production trained it for half a million documents (We have a big machine). We have run predictions on millions and manually checked topics for thousands (we are satisfied with the results).

If you are working with a very large corpus you may wish to use more sophisticated topic models such as those implemented in hca_ and MALLET_.  hca_ is written entirely in C and MALLET_ is written in Java. Unlike ``guidedlda``, hca_ can use more than one processor at a time. Both MALLET_ and hca_ implement topic models known to be more robust than standard latent Dirichlet allocation.

Notes

-----

Latent Dirichlet allocation is described in `Blei et al. (2003)`_ and `Pritchard

et al. (2000)`_. Inference using collapsed Gibbs sampling is described in

`Griffiths and Steyvers (2004)`_. And Guided LDA is described in `Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa (2012)`_

Important links

---------------

- Documentation: http://guidedlda.readthedocs.org

- Source code: https://github.com/vi3k6i5/guidedlda/

- Issue tracker: https://github.com/vi3k6i5/guidedlda/issues

Other implementations

---------------------

- scikit-learn_'s `LatentDirichletAllocation `_ (uses online variational inference)

- `gensim `_ (uses online variational inference)

Credits

-------

I would like to thank the creators of `LDA project `_. I used the code from that LDA project as base to implement GuidedLDA on top of it.

Thanks to : `Allen Riddell `_ and `Tim Hopper `_. :)

License

-------

``guidedlda`` is licensed under Version 2.0 of the Mozilla Public License.

.. _Python: http://www.python.org/

.. _scikit-learn: http://scikit-learn.org

.. _hca: http://www.mloss.org/software/view/527/

.. _MALLET: http://mallet.cs.umass.edu/

.. _numpy: http://www.numpy.org/

.. _pbr: https://pypi.python.org/pypi/pbr

.. _Cython: http://cython.org

.. _Blei et al. (2003): http://jmlr.org/papers/v3/blei03a.html

.. _Pritchard et al. (2000): http://www.genetics.org/content/155/2/945.full

.. _Griffiths and Steyvers (2004): http://www.pnas.org/content/101/suppl_1/5228.abstract

.. _Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa (2012): http://www.aclweb.org/anthology/E12-1021