https://github.com/anurima-saha/topic_modelling_lda_hdbscan
Using unsupervised learning to group reddit text and identify major conspiracy theories using NLP, LDA, spacy, SVD, SBert embedding and HDBSCAN.
hdbscan latent-dirichlet-allocation natural-language-processing sbert spacy topic-modeling unsupervised-learning
Last synced: 7 months ago
- Host: GitHub
- URL: https://github.com/anurima-saha/topic_modelling_lda_hdbscan
- Owner: anurima-saha
- Created: 2024-08-06T19:41:37.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-06T20:30:02.000Z (about 1 year ago)
- Last Synced: 2025-02-26T23:13:01.455Z (7 months ago)
- Topics: hdbscan, latent-dirichlet-allocation, natural-language-processing, sbert, spacy, topic-modeling, unsupervised-learning
- Language: HTML
- Homepage:
- Size: 4.21 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Topic Modelling
## Identifying major conspiracy theories from Reddit text
This project uses unsupervised learning to group Reddit text and identify major conspiracy theories using NLP, LDA, spaCy, SVD, SBERT embeddings and HDBSCAN.

## Data:
* Shape - (17155, 2)
* Columns - date (dtype datetime64[ns]), text (dtype object)
* Date Range - 01-01-2015 to 17-02-2023

## Process Flow:
#### Data Cleaning:
* Removing leading spaces
* Removing emojis and any other component that is not a word or a number
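
The two cleaning steps above can be sketched with the standard `re` module. This is a minimal illustration, not the project's actual code: the function name `clean_text` and the choice to replace removed characters with a space are assumptions.

```python
import re


def clean_text(text: str) -> str:
    """Strip leading spaces and drop everything that is not a word or a number."""
    text = text.lstrip()
    # Replace emojis, punctuation and other symbols with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # The regex class \w also accepts "_", so handle it separately.
    text = text.replace("_", " ")
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("   Wake up 🌍!!! It's 2023..."))
```

Replacing symbols with a space rather than deleting them keeps neighbouring words from being glued together.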
#### Term Extraction:
The **spaCy** model "en_core_web_sm" has been used for term extraction, together with the Matcher feature it provides.
We are trying to detect a pattern that begins with an adjective or a noun, followed by one or more singular or plural common nouns or proper nouns, optionally joined by a hyphen.
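
One possible reading of this pattern, written for spaCy's `Matcher`, is sketched below. The exact pattern used in the project is not shown in this README, so the token specification, the `greedy="LONGEST"` setting and the helper name `extract_terms` are assumptions. The example sentence is tagged by hand so that the sketch runs without downloading a model; in the project the tags would come from "en_core_web_sm".

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
matcher = Matcher(vocab)

# Assumed pattern: an adjective or noun, an optional hyphen,
# then one or more common or proper nouns.
pattern = [
    {"POS": {"IN": ["ADJ", "NOUN"]}},
    {"ORTH": "-", "OP": "?"},
    {"POS": {"IN": ["NOUN", "PROPN"]}, "OP": "+"},
]
# greedy="LONGEST" keeps only the longest of any overlapping matches.
matcher.add("TERM", [pattern], greedy="LONGEST")


def extract_terms(doc):
    """Return matched candidate terms in document order."""
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    return [span.text for span in sorted(spans, key=lambda s: s.start)]


# Hand-tagged tokens stand in for the output of a trained pipeline.
words = ["The", "deep", "state", "controls", "mainstream", "media", "outlets"]
pos = ["DET", "ADJ", "NOUN", "VERB", "ADJ", "NOUN", "NOUN"]
doc = Doc(vocab, words=words, pos=pos)
print(extract_terms(doc))
```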
From the extracted terms it is noticed that the text contains words that do not contribute to our analysis.
So, we invoke a second round of data cleaning that removes such words.
This is followed by term extraction using C-values with theta = 100. The ten most common terms and the 20 least common terms are listed below.
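
For readers unfamiliar with C-value scoring, here is a simplified pure-Python sketch. It scores a candidate term by its length and frequency, discounted by the average frequency of longer candidates that contain it. The function names are hypothetical, and treating theta as a minimum score cutoff is an assumption, since the README does not say how theta is applied.

```python
import math


def is_nested(short, long):
    """True if the word tuple `short` occurs contiguously inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_values(term_counts):
    """Simplified C-value for candidate terms given as {word_tuple: frequency}."""
    scores = {}
    for term, freq in term_counts.items():
        # Frequencies of longer candidates that contain this term.
        container_freqs = [
            f for other, f in term_counts.items()
            if len(other) > len(term) and is_nested(term, other)
        ]
        if container_freqs:
            freq = freq - sum(container_freqs) / len(container_freqs)
        scores[term] = math.log2(len(term)) * freq
    return scores


def filter_terms(term_counts, theta):
    """Keep terms whose score reaches theta (assumed meaning of theta)."""
    return {t: s for t, s in c_values(term_counts).items() if s >= theta}


counts = {("deep", "state"): 10, ("deep", "state", "actors"): 4, ("flat", "earth"): 6}
print(c_values(counts))
```

Note that the logarithm makes every single-word term score zero, which suits a setting like this one where the matcher only proposes multi-word candidates.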

#### Tokenization:
We create tokens from the text data using a spaCy pipeline that incorporates the terms extracted above.
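
One way to make a tokenizer respect multi-word terms is to merge each matched term into a single token. The sketch below uses a blank English pipeline with `PhraseMatcher` and the retokenizer. The function name `tokenize_with_terms`, the lowercasing, and the underscore joining are assumptions made for illustration, and the project's own pipeline may differ.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Tokenizer-only pipeline so the sketch runs without a downloaded model.
nlp = spacy.blank("en")


def tokenize_with_terms(text, terms):
    """Tokenize `text`, keeping each known multi-word term as one token."""
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("TERM", [nlp.make_doc(term) for term in terms])
    doc = nlp.make_doc(text)
    # filter_spans drops overlapping matches, preferring the longest.
    spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return [
        tok.text.lower().replace(" ", "_")
        for tok in doc
        if not tok.is_punct and not tok.is_space
    ]


print(tokenize_with_terms("The deep state hides the flat earth.", ["deep state", "flat earth"]))
```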
## Topic Modelling:
We have used the LDA model from tomotopy with the following features

The twenty topics are plotted on a 2D space as below:

Looking at the top 50 words from each topic, we label them as shown in the table below.
## Clustering:
* The “text” data has been cleaned to remove emojis, unnecessary text and punctuation except “.”, which is required for sentence tokenization.
* We have removed all posts that have fewer than 5 words.
* We have removed posts that begin with "your post" or “please contact” to drop Reddit submission messages.

#### Sentence Splitter:
We have used the sentence splitter from spaCy to take a batch of 5 sentences at a time from each post.
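
The filtering rules and the five-sentence batching can be sketched as below. This is a minimal illustration under stated assumptions: it uses spaCy's rule-based `sentencizer` (the project may have used a model-based splitter), it reads "a batch of 5 sentences" as consecutive, non-overlapping groups, and the names `keep_post` and `sentence_batches` are hypothetical.

```python
import spacy

# Blank pipeline with the rule-based sentence splitter (an assumption).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

BOILERPLATE_PREFIXES = ("your post", "please contact")


def keep_post(text, min_words=5):
    """Keep posts with at least `min_words` words that are not submission messages."""
    stripped = text.strip().lower()
    if len(stripped.split()) < min_words:
        return False
    return not stripped.startswith(BOILERPLATE_PREFIXES)


def sentence_batches(text, batch_size=5):
    """Split a post into sentences and join them in groups of `batch_size`."""
    sentences = [s.text.strip() for s in nlp(text).sents if s.text.strip()]
    return [
        " ".join(sentences[i:i + batch_size])
        for i in range(0, len(sentences), batch_size)
    ]
```

Batching keeps each embedded unit short enough to stay on one theme while still giving the encoder more context than a single sentence.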
#### Embedding:
###### SVD:
We have used TruncatedSVD to transform the data into a *20-dimensional vector*.
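
A common way to apply TruncatedSVD to text is latent semantic analysis over a TF-IDF matrix, sketched below with scikit-learn. The README does not state which matrix was fed to TruncatedSVD, so the TF-IDF step and the function name `lsa_vectors` are assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def lsa_vectors(texts, n_components=20, seed=0):
    """Turn raw texts into dense vectors via TF-IDF followed by TruncatedSVD."""
    tfidf = TfidfVectorizer().fit_transform(texts)  # sparse (n_texts, n_terms)
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return svd.fit_transform(tfidf)  # dense (n_texts, n_components)
```

TruncatedSVD works directly on the sparse matrix, which is why it is often preferred over PCA for this kind of input. The corpus needs more than `n_components` distinct terms and documents for the output to have the full 20 columns.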
###### Spacy:
We have used the "en_core_web_lg" model from spaCy to transform the cleaned text into a *300-dimensional vector*. Further dimension reduction has been done using TruncatedSVD to bring it down to 20 dimensions.
###### SBERT:
Dynamic embedding was done using the SBERT sentence transformer "all-mpnet-base-v2" to obtain a *768-dimensional vector*. Further dimension reduction has been done using TruncatedSVD to bring it down to 20 dimensions.
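
The embed-then-reduce step can be sketched as follows. The helper names `load_sbert_encoder` and `embed_and_reduce` are hypothetical, and the encoder is passed in as a plain callable so the reduction can be tried without downloading the model. The same reduction would apply to the 300-dimensional spaCy vectors.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD


def load_sbert_encoder(model_name="all-mpnet-base-v2"):
    """Return a function mapping a list of texts to an (n_texts, 768) array."""
    # Imported lazily: the model weights are downloaded on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts)


def embed_and_reduce(texts, encoder, n_components=20, seed=0):
    """Embed texts with `encoder`, then reduce the vectors with TruncatedSVD."""
    embeddings = np.asarray(encoder(texts))
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return svd.fit_transform(embeddings)
```

A typical call would be `embed_and_reduce(batches, load_sbert_encoder())`, where `batches` is the list of five-sentence text chunks.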
###### Perplexity:
As seen above, with Perplexity=50, the SBERT model clearly produces better results than LSA and spaCy. Thus, we plotted the vector projections with perplexity 100, 150 and 200 to compare results and select the best model for clustering.
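
Perplexity is a t-SNE parameter, so the projections were presumably produced with t-SNE. That is an assumption, as the README does not name the projection method. Under that assumption, a comparison across perplexity values could be set up with scikit-learn as sketched here, with `project_2d` being a hypothetical helper name.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_2d(vectors, perplexity, seed=0):
    """Project vectors to 2D with t-SNE at the given perplexity."""
    vectors = np.asarray(vectors)
    # scikit-learn requires perplexity to be below the number of samples.
    if perplexity >= len(vectors):
        raise ValueError("perplexity must be smaller than the number of samples")
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(vectors)
```

Calling `project_2d(vectors, p)` for each `p` in `(50, 100, 150, 200)` and plotting the results side by side gives the kind of visual comparison described above.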

A perplexity of 200 seems to provide optimal results for the analysis.

#### HDBSCAN
We have considered a cluster size of 20 and fitted the SBERT vectors.

Highlighting the terms in each cluster after incorporating C-values gives us the following topics as focus.
## Results:
#### Dynamic Bokeh Plot
For the interactive features and further details, please refer to the notebook and the [project report](https://github.com/anurima-saha/topic-modelling/blob/main/Project%20Report.pdf). A few examples are highlighted below.
1. Presidential Election Fraud

2. Evolution Theory vs Religious Beliefs
3. Zionists – Israel and Palestine Crisis
4. Flat Earth

5. Ukraine War
6. Anti-Maskers (COVID-19)
