{"id":13696492,"url":"https://github.com/joewandy/hlda","last_synced_at":"2026-03-27T04:29:09.817Z","repository":{"id":20265301,"uuid":"69615317","full_name":"joewandy/hlda","owner":"joewandy","description":"Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model","archived":false,"fork":false,"pushed_at":"2025-07-01T18:42:08.000Z","size":3677,"stargazers_count":153,"open_issues_count":13,"forks_count":39,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-09-17T23:15:21.962Z","etag":null,"topics":["gibbs-sampler","hierarchical-topic-models","lda","topic-hierarchies","topic-modeling"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joewandy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-09-29T23:13:42.000Z","updated_at":"2025-08-07T12:21:33.000Z","dependencies_parsed_at":"2023-01-14T08:00:38.347Z","dependency_job_id":null,"html_url":"https://github.com/joewandy/hlda","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/joewandy/hlda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joewandy%2Fhlda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joewandy%2Fhlda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joewandy%2Fhlda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joewandy%2Fhlda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joewandy","download_url":"https://codeload.github.com/joewandy/hlda/tar.gz
/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joewandy%2Fhlda/sbom","scorecard":{"id":527413,"data":{"date":"2025-08-11","repo":{"name":"github.com/joewandy/hlda","commit":"886c590c5e6f94eb59ba6b0b96fcbb56f45e18df"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.3,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 0/1 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"1 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: 
.github/workflows/python-package.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-package.yml:9: update your workflow using https://app.stepsecurity.io/secureworkflow/joewandy/hlda/python-package.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-package.yml:11: update your workflow using https://app.stepsecurity.io/secureworkflow/joewandy/hlda/python-package.yml/main?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/python-package.yml:16","Info:   0 out of   2 GitHub-owned GitHubAction 
dependencies pinned","Info:   0 out of   1 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE.txt:0","Info: FSF or OSI recognized license: MIT License: LICENSE.txt:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":3,"reason":"branch protection is not maximal on development and all release branches","details":["Info: 'allow deletion' disabled on branch 'main'","Info: 'force pushes' disabled on branch 'main'","Info: 'branch protection 
settings apply to administrators' is required to merge on branch 'main'","Warn: could not determine whether codeowners review is allowed","Warn: no status checks found to merge onto branch 'main'","Warn: PRs are not required to make changes on branch 'main'; or we don't have data to detect it.If you think it might be the latter, make sure to run Scorecard with a PAT or use Repo Rules (that are always public) instead of Branch Protection settings"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":6,"reason":"4 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2025-61 / GHSA-xg8h-j46f-w952","Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-48p4-8xcf-vxj5","Warn: Project is vulnerable to: GHSA-pq67-6m6q-mj2v"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-20T04:46:45.088Z","repository_id":20265301,"created_at":"2025-08-20T04:46:45.088Z","updated_at":"2025-08-20T04:46:45.088Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31019070,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-27T03:51:26.850Z","status":"ssl_error","status_checked_at":"2026-03-27T03:51:09.693Z","response_time":164,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gibbs-sampler","hierarchical-topic-models","lda","topic-hierarchies","topic-modeling"],"created_at":"2024-08-02T18:00:41.212Z","updated_at":"2026-03-27T04:29:09.809Z","avatar_url":"https://github.com/joewandy.png","language":"Python","funding_links":[],"categories":["Models"],"sub_categories":["Hierarchical LDA (hLDA) [:page_facing_up:](https://dl.acm.org/doi/10.5555/2981345.2981348)"],"readme":"Hierarchical Latent Dirichlet Allocation\n----------------------------------------\n\n**Note: this repository should only be used for educational purposes. For production use, I'd recommend using https://github.com/bab2min/tomotopy which is more production-ready**\n\n---\n\nHierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic\nhierarchies from data. The model relies on a non‑parametric prior called the nested\nChinese restaurant process, which allows for arbitrarily large branching factors and\neasily accommodates growing data collections. 
The hLDA model combines this prior with a\nlikelihood based on a hierarchical variant of Latent Dirichlet Allocation.\n\nThe original papers describing the algorithm are:\n\n- [Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf)\n- [The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies](http://cocosci.berkeley.edu/tom/papers/ncrp.pdf)\n\n## Overview\n\nThis repository contains a pure Python implementation of the Gibbs sampler for hLDA.\nIt is intended for experimentation and as a reference implementation. The code follows\nthe approach used in the original [Mallet](http://mallet.cs.umass.edu/topics.php)\nimplementation but with a simplified interface and a fixed depth for the tree.\n\nKey features include:\n\n- **Python 3.11+** support with minimal third‑party dependencies.\n- A small set of example scripts demonstrating how to run the sampler.\n- Utilities for visualising the resulting topic hierarchy.\n- Test suite for verifying the sampler on synthetic data and a small BBC corpus.\n\n## Installation\n\nThe package can be installed directly from PyPI:\n\n```bash\npip install hlda\n```\n\nAlternatively, to develop locally, clone this repository and install it in editable mode:\n\n```bash\ngit clone https://github.com/joewandy/hlda.git\ncd hlda\npip install -e .\npre-commit install\n```\n\n## Usage\n\nThe easiest way to get started is by using the sample BBC dataset provided in the\n`data/` directory. 
You can run the full demonstration from the command line:\n\n```bash\npython examples/bbc_demo.py --data-dir data/bbc/tech --iterations 20\n```\n\nIf you installed the package from PyPI you can run the same demo via the\n`hlda-run` command:\n\n```bash\nhlda-run --data-dir data/bbc/tech --iterations 20\n```\n\nTo write the learned hierarchy to disk in JSON format, pass\n`--export-tree \u003cfile\u003e` when running the script:\n\n```bash\npython scripts/run_hlda.py --data-dir data/bbc/tech --export-tree tree.json\n```\n\nIf you make use of the BBC dataset, please cite the publication by Greene and\nCunningham (2006) as detailed in [`CITATION.cff`](CITATION.cff).\n\nExample scripts for the BBC dataset and synthetic data are available in the\n[`examples/`](examples) directory.\n\nWithin Python you can also construct the sampler directly:\n\n```python\nfrom hlda.sampler import HierarchicalLDA\n\ncorpus = [[\"word\", \"word\", ...], ...]  # list of tokenised documents\nvocab = sorted({w for doc in corpus for w in doc})\n\nhlda = HierarchicalLDA(corpus, vocab, alpha=1.0, gamma=1.0, eta=0.1,\n                       num_levels=3, seed=0)\nhlda.estimate(iterations=50, display_topics=10)\n```\n\n### Integration with scikit-learn\n\nThe package provides a `HierarchicalLDAEstimator` that follows the scikit-learn API. 
This allows using the sampler inside a standard `Pipeline`.\n\n```python\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.preprocessing import FunctionTransformer\nfrom sklearn.pipeline import Pipeline\nfrom hlda.sklearn_wrapper import HierarchicalLDAEstimator\n\nvectorizer = CountVectorizer()\nprep = FunctionTransformer(\n    lambda X: (\n        [[i for i, c in enumerate(row) for _ in range(int(c))] for row in X.toarray()],\n        list(vectorizer.get_feature_names_out()),\n    ),\n    validate=False,\n)\n\npipeline = Pipeline([\n    (\"vect\", vectorizer),\n    (\"prep\", prep),\n    (\"hlda\", HierarchicalLDAEstimator(num_levels=3, iterations=10, seed=0)),\n])\n\npipeline.fit(documents)\nassignments = pipeline.transform(documents)\n```\n\n\n## Running the tests\n\nThe repository includes a small test suite that checks the sampler on both the BBC\ncorpus and synthetic data. After installing the development dependencies you can run:\n\n```bash\npytest -q\n```\n\nAll tests should pass in a few seconds.\n\n## License\n\nThis project is licensed under the terms of the MIT license. See\n[`LICENSE.txt`](LICENSE.txt) for details.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoewandy%2Fhlda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoewandy%2Fhlda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoewandy%2Fhlda/lists"}