https://github.com/joewandy/hlda
Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model
gibbs-sampler hierarchical-topic-models lda topic-hierarchies topic-modeling
- Host: GitHub
- URL: https://github.com/joewandy/hlda
- Owner: joewandy
- License: mit
- Created: 2016-09-29T23:13:42.000Z (over 9 years ago)
- Default Branch: main
- Last Pushed: 2025-07-01T18:42:08.000Z (12 months ago)
- Last Synced: 2025-09-17T23:15:21.962Z (9 months ago)
- Topics: gibbs-sampler, hierarchical-topic-models, lda, topic-hierarchies, topic-modeling
- Language: Python
- Size: 3.51 MB
- Stars: 153
- Watchers: 4
- Forks: 39
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-topic-models - hlda - Python package based on *Mallet's* Gibbs sampler having a fixed depth on the nCRP tree (Models / Hierarchical LDA (hLDA) [:page_facing_up:](https://dl.acm.org/doi/10.5555/2981345.2981348))
README
Hierarchical Latent Dirichlet Allocation
----------------------------------------
**Note: this repository should only be used for educational purposes. For production use, I'd recommend https://github.com/bab2min/tomotopy, which is more production-ready.**
---
Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic
hierarchies from data. The model relies on a non‑parametric prior called the nested
Chinese restaurant process, which allows for arbitrarily large branching factors and
easily accommodates growing data collections. The hLDA model combines this prior with a
likelihood based on a hierarchical variant of Latent Dirichlet Allocation.
The original papers describing the algorithm are:
- [Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf)
- [The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies](http://cocosci.berkeley.edu/tom/papers/ncrp.pdf)
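To make the nested Chinese restaurant process prior described above more concrete, here is a minimal sketch of how it assigns each document a root-to-leaf path in a tree of fixed depth. It is an illustration only, not the code used in this repository: the function name `sample_ncrp_paths` and its arguments are hypothetical, and the sketch draws paths from the prior alone, ignoring the word likelihood that the full Gibbs sampler also takes into account.

```python
import random
from collections import defaultdict


def sample_ncrp_paths(num_docs, num_levels=3, gamma=1.0, seed=0):
    """Draw one root-to-leaf path per document from a depth-limited nCRP prior.

    Hypothetical helper for illustration; not part of the hlda package.
    """
    rng = random.Random(seed)
    # children[node] maps a child node id to the number of documents that chose it.
    children = defaultdict(dict)
    next_id = 1  # node 0 is the root shared by every document
    paths = []
    for _ in range(num_docs):
        node, path = 0, [0]
        for _ in range(num_levels - 1):
            counts = children[node]
            total = sum(counts.values())
            # An existing child with n_k documents is chosen with probability
            # n_k / (n + gamma); a new child with probability gamma / (n + gamma).
            r = rng.uniform(0, total + gamma)
            chosen = None
            for child, n in counts.items():
                r -= n
                if r < 0:
                    chosen = child
                    break
            if chosen is None:
                # Open a new branch below the current node.
                chosen = next_id
                next_id += 1
            counts[chosen] = counts.get(chosen, 0) + 1
            node = chosen
            path.append(node)
        paths.append(path)
    return paths


paths = sample_ncrp_paths(num_docs=20, num_levels=3, gamma=1.0, seed=0)
```

The first document always opens a fresh path, `[0, 1, 2]`. Later documents tend to follow branches that are already popular, while larger values of `gamma` make new branches more likely.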
## Overview
This repository contains a pure Python implementation of the Gibbs sampler for hLDA.
It is intended for experimentation and as a reference implementation. The code follows
the approach used in the original [Mallet](http://mallet.cs.umass.edu/topics.php)
implementation but with a simplified interface and a fixed depth for the tree.
Key features include:
- **Python 3.11+** support with minimal third‑party dependencies.
- A small set of example scripts demonstrating how to run the sampler.
- Utilities for visualising the resulting topic hierarchy.
- Test suite for verifying the sampler on synthetic data and a small BBC corpus.
## Installation
The package can be installed directly from PyPI:
```bash
pip install hlda
```
Alternatively, to develop locally, clone this repository and install it in editable mode:
```bash
git clone https://github.com/joewandy/hlda.git
cd hlda
pip install -e .
pre-commit install
```
## Usage
The easiest way to get started is by using the sample BBC dataset provided in the
`data/` directory. You can run the full demonstration from the command line:
```bash
python examples/bbc_demo.py --data-dir data/bbc/tech --iterations 20
```
If you installed the package from PyPI, you can run the same demo via the
`hlda-run` command:
```bash
hlda-run --data-dir data/bbc/tech --iterations 20
```
To write the learned hierarchy to disk in JSON format, pass
`--export-tree` followed by an output path when running the script:
```bash
python scripts/run_hlda.py --data-dir data/bbc/tech --export-tree tree.json
```
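The layout of the exported JSON file is not described in this README, so the following sketch makes no assumptions about its keys. It simply loads a JSON document and summarises its nested structure, which can help when exploring an exported tree for the first time. The helper names `describe` and `describe_file` are hypothetical, and the `sample` value is made-up data rather than real sampler output.

```python
import json


def describe(node, depth=0, max_depth=3):
    """Return indented lines summarising a nested JSON value."""
    pad = "  " * depth
    if depth >= max_depth:
        return [f"{pad}..."]  # stop descending into very deep trees
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(describe(value, depth + 1, max_depth))
            else:
                lines.append(f"{pad}{key}: {value!r}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            lines.append(f"{pad}[{index}]")
            lines.extend(describe(item, depth + 1, max_depth))
    else:
        lines.append(f"{pad}{node!r}")
    return lines


def describe_file(path, max_depth=3):
    """Load a JSON file and summarise its structure."""
    with open(path, encoding="utf-8") as fh:
        return describe(json.load(fh), max_depth=max_depth)


# Made-up JSON, used only to show the shape of the summary.
sample = json.loads('{"a": 1, "b": [{"c": 2}]}')
summary = describe(sample)
```

For a file written with `--export-tree tree.json`, calling `print("\n".join(describe_file("tree.json")))` would print a comparable outline of whatever structure the export contains.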
If you make use of the BBC dataset, please cite the publication by Greene and
Cunningham (2006) as detailed in [`CITATION.cff`](CITATION.cff).
Example scripts for the BBC dataset and synthetic data are available in the
[`examples/`](examples) directory.
Within Python you can also construct the sampler directly:
```python
from hlda.sampler import HierarchicalLDA

corpus = [["word", "word", ...], ...]  # list of tokenised documents
vocab = sorted({w for doc in corpus for w in doc})

hlda = HierarchicalLDA(corpus, vocab, alpha=1.0, gamma=1.0, eta=0.1,
                       num_levels=3, seed=0)
hlda.estimate(iterations=50, display_topics=10)
```
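The scikit-learn example later in this README hands documents to the estimator as lists of vocabulary indices rather than strings. If the sampler is fed integer ids in the same way (an assumption; check the docstring of `HierarchicalLDA` for the expected input), a tokenised corpus can be encoded as sketched here. The helper `encode_corpus` is hypothetical and not part of the package.

```python
def encode_corpus(tokenised_docs):
    """Map every token to its index in a sorted vocabulary.

    Hypothetical helper for illustration; not part of the hlda package.
    """
    vocab = sorted({w for doc in tokenised_docs for w in doc})
    word_to_id = {w: i for i, w in enumerate(vocab)}
    corpus = [[word_to_id[w] for w in doc] for doc in tokenised_docs]
    return corpus, vocab


docs = [["topic", "model", "tree"], ["tree", "topic", "topic"]]
corpus, vocab = encode_corpus(docs)
# vocab  -> ['model', 'topic', 'tree']
# corpus -> [[1, 0, 2], [2, 1, 1]]
```

Sorting the vocabulary keeps the ids stable across runs, which matters when comparing results obtained with the same `seed`.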
### Integration with scikit-learn
The package provides a `HierarchicalLDAEstimator` that follows the scikit-learn API. This allows using the sampler inside a standard `Pipeline`.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from hlda.sklearn_wrapper import HierarchicalLDAEstimator

vectorizer = CountVectorizer()

# Turn the sparse count matrix into (index lists, vocabulary) for the estimator.
prep = FunctionTransformer(
    lambda X: (
        [[i for i, c in enumerate(row) for _ in range(int(c))] for row in X.toarray()],
        list(vectorizer.get_feature_names_out()),
    ),
    validate=False,
)

pipeline = Pipeline([
    ("vect", vectorizer),
    ("prep", prep),
    ("hlda", HierarchicalLDAEstimator(num_levels=3, iterations=10, seed=0)),
])

# `documents` is an iterable of raw text strings, one per document.
pipeline.fit(documents)
assignments = pipeline.transform(documents)
```
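The `prep` step above expands each row of the count matrix into a list in which every vocabulary index is repeated as many times as the word occurs. The same expansion, written as a standalone function on plain Python lists, looks like this. The name `counts_to_index_lists` is introduced here only for illustration.

```python
def counts_to_index_lists(count_rows):
    """Expand per-document word counts into repeated vocabulary indices."""
    return [
        [i for i, c in enumerate(row) for _ in range(int(c))]
        for row in count_rows
    ]


# Two documents over a three-word vocabulary.
rows = [[2, 0, 1], [0, 1, 0]]
index_lists = counts_to_index_lists(rows)
# index_lists -> [[0, 0, 2], [1]]
```

Word order within a document is lost in this representation, which is acceptable for a bag-of-words model such as hLDA.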
## Running the tests
The repository includes a small test suite that checks the sampler on both the BBC
corpus and synthetic data. After installing the development dependencies, you can run:
```bash
pytest -q
```
All tests should pass in a few seconds.
## License
This project is licensed under the terms of the MIT license. See
[`LICENSE.txt`](LICENSE.txt) for details.