https://github.com/ruanchaves/hdp
HDP + T-SNE + k-NN applied to topic modeling
clustering colorization gensim hdp knn lda python tsne-algorithm tsne-plot
- Host: GitHub
- URL: https://github.com/ruanchaves/hdp
- Owner: ruanchaves
- Created: 2018-07-26T15:32:54.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-11-15T15:18:15.000Z (over 7 years ago)
- Last Synced: 2025-06-05T22:05:33.969Z (about 1 year ago)
- Topics: clustering, colorization, gensim, hdp, knn, lda, python, tsne-algorithm, tsne-plot
- Language: Jupyter Notebook
- Homepage:
- Size: 1.79 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## HDP clusters
Here is the output of t-distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction applied to 90-dimensional topic vectors produced by [gensim's Hierarchical Dirichlet Process](https://radimrehurek.com/gensim/models/hdpmodel.html). t-SNE is applied twice: once to reduce the 90 dimensions to 2D and once to reduce them to 3D. The 2D results are interpreted as x,y-coordinates and the 3D results as colors. Although a human can readily see the clusters, a computer only sees colored x,y-points, so it cannot return the clusters on request.
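The double t-SNE step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn's `TSNE`, uses random Dirichlet samples as a stand-in for real HDP topic vectors, and the helper name `embed_topics` and the min-max rescaling of the 3D output to RGB are assumptions made here.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_topics(topic_vectors, seed=0):
    """Run t-SNE twice: 2D output for positions, 3D output rescaled to RGB."""
    # Keep perplexity below the sample count so small demos also run.
    perplexity = min(30.0, (len(topic_vectors) - 1) / 3.0)
    xy = TSNE(n_components=2, perplexity=perplexity, init="random",
              random_state=seed).fit_transform(topic_vectors)
    xyz = TSNE(n_components=3, perplexity=perplexity, init="random",
               random_state=seed).fit_transform(topic_vectors)
    # Assumption: map each 3D axis to [0, 1] so a row can be read as (r, g, b).
    lo, hi = xyz.min(axis=0), xyz.max(axis=0)
    rgb = (xyz - lo) / np.where(hi > lo, hi - lo, 1.0)
    return xy, rgb


# Stand-in data: 60 sparse 90-dimensional vectors instead of real HDP output.
rng = np.random.default_rng(0)
topic_vectors = rng.dirichlet(np.full(90, 0.1), size=60)
xy, rgb = embed_topics(topic_vectors)
print(xy.shape, rgb.shape)
```

With real data, `topic_vectors` would be the dense document-topic matrix obtained from the HDP model; each row of `xy` can then be plotted with the matching row of `rgb` as its color.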

I solved this problem by building a k-nearest neighbors (kNN) graph on the t-SNE result and then deleting every edge between vertices whose colors differed by more than a certain tolerance. I then found the connected components of the remaining graph and repainted each point according to the connected component it belonged to.
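One possible way to express the procedure above is sketched below. It is an illustration under assumptions, not the code from the notebook: the function name `color_components`, the use of SciPy's `cKDTree` for the neighbor search, and Euclidean distance in RGB space as the color comparison are all choices made here.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree


def color_components(xy, rgb, k=5, tolerance=0.2):
    """Label points by connected component of a color-pruned kNN graph."""
    n = len(xy)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    _, idx = cKDTree(xy).query(xy, k=k + 1)
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()
    # Assumption: two colors "differ" when their RGB distance exceeds tolerance.
    color_gap = np.linalg.norm(rgb[rows] - rgb[cols], axis=1)
    keep = color_gap <= tolerance
    graph = coo_matrix((np.ones(keep.sum()), (rows[keep], cols[keep])),
                       shape=(n, n))
    # Treat edges as undirected when collecting components.
    return connected_components(graph, directed=False)


# Toy data: 20 evenly spaced points on a line, first half red, second half blue.
xy = np.column_stack([np.arange(20.0), np.zeros(20)])
rgb = np.array([[1.0, 0.0, 0.0]] * 10 + [[0.0, 0.0, 1.0]] * 10)

n_loose, _ = color_components(xy, rgb, k=2, tolerance=2.0)
n_tight, labels = color_components(xy, rgb, k=2, tolerance=0.2)
print(n_loose, n_tight)
```

In this toy case the loose tolerance keeps the red-blue edge, so everything stays in one component, while the tight tolerance cuts it and yields one component per color. Lowering the tolerance on real data splits the map into more, smaller components in the same way.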

Here are the connected components when the tolerance is a little bit lower.

Here are the connected components when the tolerance is even lower.

This means our users won't have to deal with this map directly. When they request a document, they get a list of similar documents, that is, a list of points sorted by their distance from the chosen point. They can then select cluster colors (topic categories) to filter the results.
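The lookup described above could look something like the sketch below. It is hypothetical: the function name `similar_documents` and its parameters are invented for illustration, and it assumes distances are measured between the 2D map coordinates and that the cluster labels come from the connected-component step.

```python
import numpy as np


def similar_documents(xy, labels, query, allowed_clusters=None):
    """Return document indices sorted by distance from `query`, optionally
    restricted to the selected clusters."""
    distances = np.linalg.norm(xy - xy[query], axis=1)
    order = np.argsort(distances, kind="stable")
    order = order[order != query]  # do not return the query document itself
    if allowed_clusters is not None:
        order = order[np.isin(labels[order], list(allowed_clusters))]
    return order.tolist()


# Toy map: four documents on a line, two clusters.
xy = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 0.0]])
labels = np.array([0, 0, 1, 1])

everything = similar_documents(xy, labels, query=0)
only_cluster_1 = similar_documents(xy, labels, query=0, allowed_clusters={1})
print(everything, only_cluster_1)  # [1, 2, 3] [2, 3]
```

Filtering after sorting keeps the nearest-first order intact, so the user sees the closest documents within the clusters they selected.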
## Related sources
- [Topic Modeling and t-SNE Visualization](https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html)
- [Plot Latent Dirichlet Allocation output using t-SNE?](https://stats.stackexchange.com/questions/305356/plot-latent-dirichlet-allocation-output-using-t-sne)