{"id":23715058,"url":"https://github.com/vi3k6i5/guidedlda","last_synced_at":"2025-04-12T23:43:31.007Z","repository":{"id":53396756,"uuid":"105806605","full_name":"vi3k6i5/GuidedLDA","owner":"vi3k6i5","description":"semi supervised guided topic model with custom guidedLDA","archived":false,"fork":false,"pushed_at":"2020-10-24T09:38:07.000Z","size":2287,"stargazers_count":505,"open_issues_count":55,"forks_count":110,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-04-12T23:43:26.314Z","etag":null,"topics":["data-science","guided-topic-modeling","guidedlda","machine-learning","seededlda","topic-modeling"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vi3k6i5.png","metadata":{"files":{"readme":"README.rst","changelog":"ChangeLog","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-10-04T19:00:09.000Z","updated_at":"2025-03-29T08:13:03.000Z","dependencies_parsed_at":"2022-08-23T16:31:29.719Z","dependency_job_id":null,"html_url":"https://github.com/vi3k6i5/GuidedLDA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vi3k6i5%2FGuidedLDA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vi3k6i5%2FGuidedLDA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vi3k6i5%2FGuidedLDA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vi3k6i5%2FGuidedLDA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vi3k6i5","download_url":"https://codeload.github.com/vi3k6i5/GuidedLDA/tar.gz
/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248647256,"owners_count":21139081,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","guided-topic-modeling","guidedlda","machine-learning","seededlda","topic-modeling"],"created_at":"2024-12-30T20:52:40.447Z","updated_at":"2025-04-12T23:43:30.989Z","avatar_url":"https://github.com/vi3k6i5.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"GuidedLDA: Guided Topic modeling with latent Dirichlet allocation\n=================================================================\n\n.. image:: https://readthedocs.org/projects/guidedlda/badge/?version=latest\n    :target: http://guidedlda.readthedocs.io/en/latest/?badge=latest\n    :alt: Documentation Status\n\n.. image:: https://badge.fury.io/py/guidedlda.svg\n    :target: https://badge.fury.io/py/guidedlda\n    :alt: Package version\n\n\n``GuidedLDA`` (or ``SeededLDA``) implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. ``GuidedLDA`` can be guided by setting some seed words per topic. 
The seed words make the topics converge in that direction.\n\nYou can read more about guidedlda in `the documentation \u003chttps://guidedlda.readthedocs.io\u003e`_.\n\nI published an article about it on the `freeCodeCamp Medium blog \u003chttps://medium.freecodecamp.org/how-we-changed-unsupervised-lda-to-semi-supervised-guidedlda-e36a95f3a164\u003e`_.\n\nInstallation\n------------\n\n::\n\n    pip install guidedlda\n\nIf ``pip install`` does not work, build the package from source:\n\n::\n\n    git clone https://github.com/vi3k6i5/GuidedLDA\n    cd GuidedLDA\n    sh build_dist.sh\n    python setup.py sdist\n    pip install -e .\n\nIf the above step also does not work, please raise an `issue \u003chttps://github.com/vi3k6i5/guidedlda/issues\u003e`_ with details of your workstation's OS version, Python version, architecture, etc., and I will try my best to fix it ASAP.\n\nGetting started\n---------------\n\n``guidedlda.GuidedLDA`` implements latent Dirichlet allocation (LDA). The interface follows\nconventions found in scikit-learn_.\n\n`Example Code \u003chttps://github.com/vi3k6i5/GuidedLDA/blob/master/examples/example_seeded_lda.py\u003e`_.\n\n\nThe following demonstrates how to fit and inspect a model on a subset of the NYT\nnews dataset. The input below, ``X``, is a document-term matrix (sparse matrices\nare accepted).\n\n.. 
code-block:: python\n\n    \u003e\u003e\u003e import numpy as np\n    \u003e\u003e\u003e import guidedlda\n    \n    \u003e\u003e\u003e X = guidedlda.datasets.load_data(guidedlda.datasets.NYT)\n    \u003e\u003e\u003e vocab = guidedlda.datasets.load_vocab(guidedlda.datasets.NYT)\n    \u003e\u003e\u003e word2id = dict((v, idx) for idx, v in enumerate(vocab))\n    \n    \u003e\u003e\u003e X.shape\n    (8447, 3012)\n    \n    \u003e\u003e\u003e X.sum()\n    1221626\n    \u003e\u003e\u003e # Normal LDA without seeding\n    \u003e\u003e\u003e model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)\n    \u003e\u003e\u003e model.fit(X)\n    INFO:guidedlda:n_documents: 8447\n    INFO:guidedlda:vocab_size: 3012\n    INFO:guidedlda:n_words: 1221626\n    INFO:guidedlda:n_topics: 5\n    INFO:guidedlda:n_iter: 100\n    WARNING:guidedlda:all zero column in document-term matrix found\n    INFO:guidedlda:\u003c0\u003e log likelihood: -11489265\n    INFO:guidedlda:\u003c20\u003e log likelihood: -9844667\n    INFO:guidedlda:\u003c40\u003e log likelihood: -9694223\n    INFO:guidedlda:\u003c60\u003e log likelihood: -9642506\n    INFO:guidedlda:\u003c80\u003e log likelihood: -9617962\n    INFO:guidedlda:\u003c99\u003e log likelihood: -9604031\n    \n    \u003e\u003e\u003e topic_word = model.topic_word_\n    \u003e\u003e\u003e n_top_words = 8\n    \u003e\u003e\u003e for i, topic_dist in enumerate(topic_word):\n    \u003e\u003e\u003e     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]\n    \u003e\u003e\u003e     print('Topic {}: {}'.format(i, ' '.join(topic_words)))\n    Topic 0: company percent market business plan pay price increase\n    Topic 1: game play team win player season second start\n    Topic 2: life child write man school woman father family\n    Topic 3: place open small house music turn large play\n    Topic 4: official state government political states issue leader case\n    \n    \u003e\u003e\u003e # Guided LDA with seed 
topics.\n    \u003e\u003e\u003e seed_topic_list = [['game', 'team', 'win', 'player', 'season', 'second', 'victory'],\n    \u003e\u003e\u003e                    ['percent', 'company', 'market', 'price', 'sell', 'business', 'stock', 'share'],\n    \u003e\u003e\u003e                    ['music', 'write', 'art', 'book', 'world', 'film'],\n    \u003e\u003e\u003e                    ['political', 'government', 'leader', 'official', 'state', 'country', 'american','case', 'law', 'police', 'charge', 'officer', 'kill', 'arrest', 'lawyer']]\n    \n    \u003e\u003e\u003e model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)\n    \n    \u003e\u003e\u003e seed_topics = {}\n    \u003e\u003e\u003e for t_id, st in enumerate(seed_topic_list):\n    \u003e\u003e\u003e     for word in st:\n    \u003e\u003e\u003e         seed_topics[word2id[word]] = t_id\n    \n    \u003e\u003e\u003e model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)\n    INFO:guidedlda:n_documents: 8447\n    INFO:guidedlda:vocab_size: 3012\n    INFO:guidedlda:n_words: 1221626\n    INFO:guidedlda:n_topics: 5\n    INFO:guidedlda:n_iter: 100\n    WARNING:guidedlda:all zero column in document-term matrix found\n    INFO:guidedlda:\u003c0\u003e log likelihood: -11486362\n    INFO:guidedlda:\u003c20\u003e log likelihood: -9767277\n    INFO:guidedlda:\u003c40\u003e log likelihood: -9663718\n    INFO:guidedlda:\u003c60\u003e log likelihood: -9624150\n    INFO:guidedlda:\u003c80\u003e log likelihood: -9601684\n    INFO:guidedlda:\u003c99\u003e log likelihood: -9587803\n    \n    \n    \u003e\u003e\u003e n_top_words = 10\n    \u003e\u003e\u003e topic_word = model.topic_word_\n    \u003e\u003e\u003e for i, topic_dist in enumerate(topic_word):\n    \u003e\u003e\u003e     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]\n    \u003e\u003e\u003e     print('Topic {}: {}'.format(i, ' '.join(topic_words)))\n    Topic 0: game play team win season player second point start 
victory\n    Topic 1: company percent market price business sell executive pay plan sale\n    Topic 2: play life man music place write turn woman old book\n    Topic 3: official government state political leader states issue case member country\n    Topic 4: school child city program problem student state study family group\n\nRetrieve the document-topic distributions with ``doc_topic = model.transform(X)``.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e doc_topic = model.transform(X)\n    \u003e\u003e\u003e for i in range(9):\n    \u003e\u003e\u003e     print(\"top topic: {} Document: {}\".format(doc_topic[i].argmax(), \n                                                      ', '.join(np.array(vocab)[list(reversed(X[i,:].argsort()))[0:5]])))\n    top topic: 4 Document: plant, increase, food, increasingly, animal\n    top topic: 3 Document: explain, life, country, citizen, nation\n    top topic: 2 Document: thing, solve, problem, machine, carry\n    top topic: 2 Document: company, authority, opera, artistic, director\n    top topic: 3 Document: partner, lawyer, attorney, client, indict\n    top topic: 2 Document: roll, place, soon, treat, rating\n    top topic: 3 Document: city, drug, program, commission, report\n    top topic: 1 Document: company, comic, series, case, executive\n    top topic: 3 Document: son, scene, charge, episode, attack\n\nOptionally, reduce the size of the model by purging additional matrices:\n\n.. code-block:: python\n\n    \u003e\u003e\u003e # The next step lightens the model object\n    \u003e\u003e\u003e # by deleting some matrices inside the model.\n    \u003e\u003e\u003e # You will still be able to use model.transform(X) as before,\n    \u003e\u003e\u003e # but you won't be able to use model.fit_transform(X_new).\n    \u003e\u003e\u003e model.purge_extra_matrices()\n\nSave the model for production or for running later:\n\n.. 
code-block:: python\n\n    \u003e\u003e\u003e from six.moves import cPickle as pickle\n    \u003e\u003e\u003e with open('guidedlda_model.pickle', 'wb') as file_handle:\n    \u003e\u003e\u003e     pickle.dump(model, file_handle)\n    \u003e\u003e\u003e # load the model for prediction\n    \u003e\u003e\u003e with open('guidedlda_model.pickle', 'rb') as file_handle:\n    \u003e\u003e\u003e     model = pickle.load(file_handle)\n    \u003e\u003e\u003e doc_topic = model.transform(X)\n\n\nRequirements\n------------\n\nPython 2.7 or Python 3.3+ is required. The following packages are required:\n\n- numpy_\n- pbr_\n\nCaveat\n------\n\n``guidedlda`` aims to guide LDA. More often than not, the topics we get from an LDA model are not to our satisfaction. GuidedLDA can give the topics a nudge in the direction we want them to converge. We have trained it in production on half a million documents (we have a big machine). We have run predictions on millions of documents and manually checked topics for thousands (we are satisfied with the results).\n\nIf you are working with a very large corpus you may wish to use more sophisticated topic models such as those implemented in hca_ and MALLET_. hca_ is written entirely in C and MALLET_ is written in Java. Unlike ``guidedlda``, hca_ can use more than one processor at a time. Both MALLET_ and hca_ implement topic models known to be more robust than standard latent Dirichlet allocation.\n\nNotes\n-----\n\nLatent Dirichlet allocation is described in `Blei et al. (2003)`_ and `Pritchard\net al. (2000)`_. Inference using collapsed Gibbs sampling is described in\n`Griffiths and Steyvers (2004)`_. 
Guided LDA is described in `Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa (2012)`_.\n\n\nImportant links\n---------------\n\n- Documentation: http://guidedlda.readthedocs.org\n- Source code: https://github.com/vi3k6i5/guidedlda/\n- Issue tracker: https://github.com/vi3k6i5/guidedlda/issues\n\nOther implementations\n---------------------\n\n- scikit-learn_'s `LatentDirichletAllocation \u003chttp://scikit-learn.org/dev/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html\u003e`_ (uses online variational inference)\n- `gensim \u003chttps://pypi.python.org/pypi/gensim\u003e`_ (uses online variational inference)\n\nCredits\n-------\n\nI would like to thank the creators of the `LDA project \u003chttps://github.com/lda-project/lda\u003e`_. I used the code from that project as the base on which to implement GuidedLDA.\n\nThanks to `Allen Riddell \u003chttps://twitter.com/ariddell\u003e`_ and `Tim Hopper \u003chttps://twitter.com/tdhopper\u003e`_. :)\n\nLicense\n-------\n\n``guidedlda`` is licensed under Version 2.0 of the Mozilla Public License.\n\n.. _Python: http://www.python.org/\n.. _scikit-learn: http://scikit-learn.org\n.. _hca: http://www.mloss.org/software/view/527/\n.. _MALLET: http://mallet.cs.umass.edu/\n.. _numpy: http://www.numpy.org/\n.. _pbr: https://pypi.python.org/pypi/pbr\n.. _Cython: http://cython.org\n.. _Blei et al. (2003): http://jmlr.org/papers/v3/blei03a.html\n.. _Pritchard et al. (2000): http://www.genetics.org/content/155/2/945.full\n.. _Griffiths and Steyvers (2004): http://www.pnas.org/content/101/suppl_1/5228.abstract\n.. _Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa (2012): http://www.aclweb.org/anthology/E12-1021\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvi3k6i5%2Fguidedlda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvi3k6i5%2Fguidedlda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvi3k6i5%2Fguidedlda/lists"}