https://github.com/connor-mccarthy/nlp-visualization-of-statistical-learning-book

📙 End-to-end NLP and data visualization pipeline of the text from a machine learning textbook.
clustering glove glove-embeddings hdbscan linear-algebra nlp pca principal-component-analysis statistical-learning statistics word-vectors

# 📙 NLP and Data Viz Pipeline with GloVe, HDBSCAN, and t-SNE




Python 3.7.10


Code style: black


This project uses NLP and unsupervised learning to visualize the text of the canonical machine learning book [_The Elements of Statistical Learning_](https://web.stanford.edu/~hastie/Papers/ESLII.pdf).


Click here to explore the data yourself.

The pipeline represents the text of the book with GloVe embeddings, clusters the embeddings with HDBSCAN, and projects them to three dimensions with PCA followed by t-SNE for plotting.

## Pipeline steps:
1) Make an HTTP request to download the PDF
2) Convert the single PDF file to an array of PNG files
3) Use OCR to convert each image to text
4) Apply a rule-based pipeline to extract n-grams; an n-gram can have any length n, as long as every token in it satisfies the rules
5) Map tokens to GloVe embeddings (averaging the token vectors when an n-gram has n > 1)
6) Normalize the embedding vectors
7) Cluster with HDBSCAN
8) Reduce dimensionality with PCA from (300,) to (50,) to make the subsequent t-SNE step computationally cheaper
9) Reduce dimensionality further with t-SNE from (50,) to (3,)
10) Plot the vectors
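
Steps 5 and 6 can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual code: the names `TOY_EMBEDDINGS`, `embed_ngram`, and `l2_normalize` are hypothetical, the 3-dimensional vectors are made-up stand-ins for 300-dimensional GloVe vectors, and it assumes that "normalize" means scaling each vector to unit L2 norm.

```python
import math
from typing import Dict, List, Sequence

# Toy 3-dimensional stand-ins for GloVe vectors (assumption: the real
# pipeline would look tokens up in 300-dimensional GloVe embeddings).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "gradient": [1.0, 0.0, 2.0],
    "descent": [3.0, 4.0, 0.0],
    "lasso": [0.0, 3.0, 4.0],
}


def embed_ngram(tokens: Sequence[str], embeddings: Dict[str, List[float]]) -> List[float]:
    """Step 5: average the token vectors of an n-gram, component by component."""
    vectors = [embeddings[token] for token in tokens]  # raises KeyError if a token is missing
    return [sum(components) / len(vectors) for components in zip(*vectors)]


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Step 6 (assumed to be L2 normalization): scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]


# A bigram is represented by the mean of its two token vectors.
bigram = embed_ngram(["gradient", "descent"], TOY_EMBEDDINGS)
unit = l2_normalize(bigram)

# A unigram is the n = 1 case: averaging one vector returns it unchanged.
unigram = l2_normalize(embed_ngram(["lasso"], TOY_EMBEDDINGS))
```

After unit normalization, the Euclidean distance between two vectors depends only on the angle between them, which is one common reason to normalize embeddings before distance-based clustering. Whether that is the motivation in this project is an assumption.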