https://github.com/connor-mccarthy/nlp-visualization-of-statistical-learning-book
📙 End-to-end NLP and data visualization pipeline of the text from a machine learning textbook.
- Host: GitHub
- URL: https://github.com/connor-mccarthy/nlp-visualization-of-statistical-learning-book
- Owner: connor-mccarthy
- Created: 2021-02-08T02:10:05.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-04-19T15:36:45.000Z (over 4 years ago)
- Last Synced: 2025-03-17T05:34:28.613Z (7 months ago)
- Topics: clustering, glove, glove-embeddings, hdbscan, linear-algebra, nlp, pca, principal-component-analysis, statistical-learning, statistics, word-vectors
- Language: HTML
- Homepage:
- Size: 1.19 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
This project uses NLP and unsupervised learning to visualize the text of the canonical machine learning book [_The Elements of Statistical Learning_](https://web.stanford.edu/~hastie/Papers/ESLII.pdf).
Click here to explore the data yourself.
The pipeline represents the text of the book with GloVe embeddings, clusters it with HDBSCAN, and visualizes it with t-SNE.
## Pipeline steps:
1) Make HTTP request to obtain PDF
2) Convert single PDF file to array of PNG files
3) Use OCR to convert image to text
4) Apply a rule-based pipeline to extract n-grams; n has no fixed upper limit, provided every token in the n-gram meets the rules
5) Map tokens to GloVe embeddings (averaging where n-gram has n > 1)
6) Normalize vector embeddings
7) Cluster using HDBSCAN
8) Reduce dimensionality with PCA from dimensions (300,) --> (50,) for computational efficiency in subsequent t-SNE step
9) Reduce dimensionality further with t-SNE from dimensions (50,) --> (3,)
10) Plot vectors
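
Steps 4 to 6 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's actual code: the stop-word rule in `keep`, the helper names (`extract_ngrams`, `embed`, `l2_normalize`), and the toy 3-dimensional vectors are all hypothetical, and L2 normalization is assumed because the README does not say which normalization is used. The real pipeline works with 300-dimensional GloVe vectors.

```python
import math

# Hypothetical stop-word list and toy 3-dimensional vectors, for illustration only.
# The real pipeline uses 300-dimensional GloVe vectors.
STOP_WORDS = {"the", "of", "a", "is", "and", "in", "to"}
TOY_VECTORS = {
    "statistical": [1.0, 0.0, 2.0],
    "learning": [3.0, 4.0, 0.0],
    "regression": [0.0, 3.0, 4.0],
}


def keep(token):
    """Assumed rule: a token qualifies if it is alphabetic and not a stop word."""
    return token.isalpha() and token.lower() not in STOP_WORDS


def extract_ngrams(tokens):
    """Group maximal runs of consecutive qualifying tokens into n-grams of any length."""
    ngrams, run = [], []
    for token in tokens:
        if keep(token):
            run.append(token.lower())
        elif run:
            # A non-qualifying token closes the current run.
            ngrams.append(tuple(run))
            run = []
    if run:
        ngrams.append(tuple(run))
    return ngrams


def embed(ngram, vectors):
    """Average the vectors of the known tokens in an n-gram; None if none are known."""
    known = [vectors[t] for t in ngram if t in vectors]
    if not known:
        return None
    return [sum(component) / len(known) for component in zip(*known)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


tokens = "The elements of statistical learning and regression".split()
ngrams = extract_ngrams(tokens)

# Embed each n-gram, drop those with no known tokens, then normalize.
embedded = {}
for ngram in ngrams:
    vector = embed(ngram, TOY_VECTORS)
    if vector is not None:
        embedded[ngram] = l2_normalize(vector)
```

Here `("statistical", "learning")` is kept as a single bigram because both tokens pass the rule and are adjacent, and its embedding is the average of the two token vectors before normalization. The normalized vectors would then feed the clustering and dimensionality-reduction steps.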