https://github.com/andrewmsilva/insightoverflow
A bachelor's thesis presenting an exploratory analysis of Stack Overflow posts, with general and user-centric analyses of the topics discussed.
- Host: GitHub
- URL: https://github.com/andrewmsilva/insightoverflow
- Owner: andrewmsilva
- License: mit
- Created: 2020-04-16T16:03:55.000Z (about 6 years ago)
- Default Branch: develop
- Last Pushed: 2021-06-15T02:35:46.000Z (almost 5 years ago)
- Last Synced: 2025-04-24T06:29:31.376Z (12 months ago)
- Topics: author-topic-model, extraction, latent-dirichlet-allocation, machine-learning, natural-language-processing, nlp, stack-overflow-posts, topic-modeling
- Language: Python
- Homepage:
- Size: 199 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Insight Overflow
An exploratory analysis employing topic modeling: Tracking evolution and loyalty from Stack Overflow users' interests
Running this experiment requires downloading the Stack Overflow posts from the [data dump](https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z) and extracting the `.7z` file into `src/data/`. Because the extraction step uses a Redis database, Redis must be installed, configured, and running beforehand (see the [Redis quickstart](https://redis.io/topics/quickstart)).
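As a rough illustration of the extraction input, the sketch below streams `<row>` elements from the dump's `Posts.xml` using only the standard library, so the whole file never has to sit in memory. It is not the repository's extraction code: `iter_posts` is a hypothetical helper, the sample rows are made up, and the step that would write each post to Redis is left out.

```python
import io
import xml.etree.ElementTree as ET


def iter_posts(source):
    """Yield the attributes of each <row> element as a dict.

    `source` is a path or a binary file object holding Posts.xml.
    Hypothetical helper, not taken from the repository.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # drop processed rows to keep memory flat


# Tiny in-memory sample in the dump's row-per-post layout (made-up content).
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Body="How do I parse XML?" />'
    b'<row Id="2" PostTypeId="2" Body="Use iterparse." />'
    b'</posts>'
)
posts = list(iter_posts(sample))
print(len(posts), posts[0]["Id"])
```

With the real dump, `iter_posts("src/data/Posts.xml")` would be called instead of passing the in-memory sample, assuming the extracted file keeps that name.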
## Extraction
```
Extraction started
Extracted: 49598818
Ignored: 739023
Total: 50337841
Execution time: 04:11:27.56
```
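The counters in this log can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers printed above; the variable names are illustrative.

```python
extracted = 49_598_818
ignored = 739_023
total = 50_337_841

# Extracted and ignored posts should account for every post read.
assert extracted + ignored == total

ignored_share = 100 * ignored / total

# Execution time 04:11:27.56 converted to seconds.
elapsed = 4 * 3600 + 11 * 60 + 27.56
throughput = total / elapsed

print(f"ignored: {ignored_share:.2f}% of posts")
print(f"throughput: {throughput:.0f} posts/s")
```

For this run, roughly 1.47% of the posts were ignored.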
## Pre-processing
```
Pre-processing started
Execution time: 102:39:36.14
```
## Topic modeling
```
Topic modeling started
Corpus built: 00:00:01.65
Experiment done: k=20 i=10 | p=4133.9019, cv=0.4946
Experiment done: k=20 i=100 | p=1433.5471, cv=0.6330
Experiment done: k=20 i=200 | p=1388.5460, cv=0.6343
Experiment done: k=20 i=500 | p=1365.3670, cv=0.6341
Experiment done: k=40 i=10 | p=5503.5514, cv=0.5449
Experiment done: k=40 i=100 | p=1448.7289, cv=0.6046
Experiment done: k=40 i=200 | p=1379.5958, cv=0.6051
Experiment done: k=40 i=500 | p=1330.4556, cv=0.6072
Experiment done: k=60 i=10 | p=6675.3963, cv=0.5221
Experiment done: k=60 i=100 | p=1448.0626, cv=0.5874
Experiment done: k=60 i=200 | p=1349.6507, cv=0.5940
Experiment done: k=60 i=500 | p=1290.6926, cv=0.5880
Experiment done: k=80 i=10 | p=7576.2664, cv=0.5115
Experiment done: k=80 i=100 | p=1457.7716, cv=0.5800
Experiment done: k=80 i=200 | p=1351.4062, cv=0.5866
Experiment done: k=80 i=500 | p=1288.1277, cv=0.5892
Experiment done: k=100 i=10 | p=8093.3122, cv=0.5114
Experiment done: k=100 i=100 | p=1448.3062, cv=0.5762
Experiment done: k=100 i=200 | p=1341.3547, cv=0.5787
Experiment done: k=100 i=500 | p=1272.4512, cv=0.5794
Execution time: 00:54:22.32
```
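To compare the runs in this log, the lines can be parsed and ranked. The sketch below assumes that `k` is the number of topics, `i` the number of training iterations, `p` the perplexity, and `cv` the coherence score; the log itself does not spell these out, and `parse_experiments` is a hypothetical helper. Only a subset of the lines is embedded to keep the example short.

```python
import re

# Matches the "k=.. i=.. | p=.., cv=.." part of an "Experiment done" line.
LINE = re.compile(
    r"k=(?P<k>\d+) i=(?P<i>\d+) \| p=(?P<p>[\d.]+), cv=(?P<cv>[\d.]+)"
)


def parse_experiments(log_text):
    """Return one dict per experiment line found in the log text."""
    runs = []
    for match in LINE.finditer(log_text):
        runs.append({
            "k": int(match["k"]),
            "i": int(match["i"]),
            "p": float(match["p"]),
            "cv": float(match["cv"]),
        })
    return runs


log = """\
Experiment done: k=20 i=200 | p=1388.5460, cv=0.6343
Experiment done: k=20 i=500 | p=1365.3670, cv=0.6341
Experiment done: k=40 i=500 | p=1330.4556, cv=0.6072
Experiment done: k=100 i=500 | p=1272.4512, cv=0.5794
"""
runs = parse_experiments(log)
best_cv = max(runs, key=lambda r: r["cv"])  # higher coherence is better
best_p = min(runs, key=lambda r: r["p"])    # lower perplexity is better
print("best coherence:", best_cv)
print("lowest perplexity:", best_p)
```

Across the full log, the highest coherence is `cv=0.6343` at `k=20 i=200`, while the lowest perplexity is `p=1272.4512` at `k=100 i=500`, so under the assumed reading the two metrics favor different configurations.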
## Post-processing
```
Post-processing started
Extracting topics
Creating coherence chart
Creating perplexity chart
Computing general popularity
Posts covered: 49573604
Number of posts with empty topics: 36085
Computed metrics: 4410
Creating general popularity charts
Computing user popularity
Posts covered: 49573604
Number of users: 4943206
Number of posts with empty topics: 36085
Computed metrics: 534554010
Creating user popularity charts
Execution time: 12:57:57.99
```
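The execution times reported by the four stages can be added up to see where the pipeline spends its time. The sketch below parses the `HH:MM:SS.ss` values copied from the logs in this README; `to_seconds` is a hypothetical helper.

```python
def to_seconds(stamp):
    """Convert an 'HH:MM:SS.ss' duration (hours may exceed 24) to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Execution times copied from the stage logs in this README.
stages = {
    "extraction": "04:11:27.56",
    "pre-processing": "102:39:36.14",
    "topic modeling": "00:54:22.32",
    "post-processing": "12:57:57.99",
}

total = sum(to_seconds(t) for t in stages.values())
for name, stamp in stages.items():
    share = 100 * to_seconds(stamp) / total
    print(f"{name:>15}: {share:5.1f}% of total")
print(f"total: {total / 3600:.1f} hours")
```

Pre-processing accounts for about 85% of the roughly 120.7 hours in total.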