https://github.com/andrewmsilva/insightoverflow

A bachelor's thesis presenting an exploratory analysis of Stack Overflow posts, with general and user-centric analyses of the topics discussed.

author-topic-model extraction latent-dirichlet-allocation machine-learning natural-language-processing nlp stack-overflow-posts topic-modeling

Last synced: 10 months ago

Host: GitHub
URL: https://github.com/andrewmsilva/insightoverflow
Owner: andrewmsilva
License: mit
Created: 2020-04-16T16:03:55.000Z (about 6 years ago)
Default Branch: develop
Last Pushed: 2021-06-15T02:35:46.000Z (almost 5 years ago)
Last Synced: 2025-04-24T06:29:31.376Z (about 1 year ago)
Topics: author-topic-model, extraction, latent-dirichlet-allocation, machine-learning, natural-language-processing, nlp, stack-overflow-posts, topic-modeling
Language: Python
Homepage:
Size: 199 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Insight Overflow

An exploratory analysis employing topic modeling: Tracking evolution and loyalty from Stack Overflow users' interests

Running this experiment requires downloading the Stack Overflow posts from the [data dump](https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z) and extracting the `.7z` file into `src/data/`. Because the extraction step uses a Redis database, Redis must be installed, configured, and running first (see the [Redis quick start tutorial](https://redis.io/topics/quickstart)).
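Before starting a multi-hour run, it can help to confirm both prerequisites. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, the extracted file is assumed to be named `Posts.xml`, and Redis is assumed to listen on its default port 6379 on `localhost`.

```python
import socket
from pathlib import Path


def dump_is_extracted(data_dir="src/data", filename="Posts.xml"):
    """Return True if the extracted dump file exists in data_dir.

    The file name Posts.xml is an assumption about the archive contents.
    """
    return (Path(data_dir) / filename).is_file()


def redis_is_listening(host="localhost", port=6379, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    This only checks that the port is open; it does not speak the Redis protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Dump extracted:", dump_is_extracted())
    print("Redis reachable:", redis_is_listening())
```

If Redis is configured with a different host or port, pass them to `redis_is_listening` explicitly.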

## Extraction

```sh
Extraction started
  Extracted: 49598818
  Ignored: 739023
  Total: 50337841
Execution time: 04:11:27.56
```
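The counts above suggest a single pass over the dump in which each post is either kept or ignored. As a rough illustration of how such a pass could be structured (this is not the repository's code), the sketch below streams `<row>` elements with the standard library's `iterparse` and tallies them. The function name `count_posts`, the `keep` predicate, and the rule of ignoring rows with an empty `Body` are all assumptions, and the Redis storage step is left out.

```python
import io
import xml.etree.ElementTree as ET


def count_posts(source, keep=lambda row: bool(row.get("Body"))):
    """Stream <row> elements from a Posts.xml-like source and tally them.

    `keep` is a hypothetical filter; the real criterion for ignoring
    a post is not documented in this README.
    """
    extracted = ignored = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        if keep(elem.attrib):
            extracted += 1
        else:
            ignored += 1
        elem.clear()  # drop the row's data once counted to limit memory use
    return {"extracted": extracted, "ignored": ignored, "total": extracted + ignored}


# Tiny in-memory stand-in for the real dump file
SAMPLE = b"""<posts>
  <row Id="1" PostTypeId="1" Body="How do I parse XML?" />
  <row Id="2" PostTypeId="2" Body="" />
  <row Id="3" PostTypeId="2" Body="Use iterparse." />
</posts>"""

counts = count_posts(io.BytesIO(SAMPLE))
print(counts)
```

Streaming matters here because the full dump holds about 50 million rows, far too many to load as one tree.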

## Pre-processing

```
Pre-processing started
Execution time: 102:39:36.14
```

## Topic modeling

```
Topic modeling started
  Corpus built: 00:00:01.65
  Experiment done: k=20 i=10 | p=4133.9019, cv=0.4946
  Experiment done: k=20 i=100 | p=1433.5471, cv=0.6330
  Experiment done: k=20 i=200 | p=1388.5460, cv=0.6343
  Experiment done: k=20 i=500 | p=1365.3670, cv=0.6341
  Experiment done: k=40 i=10 | p=5503.5514, cv=0.5449
  Experiment done: k=40 i=100 | p=1448.7289, cv=0.6046
  Experiment done: k=40 i=200 | p=1379.5958, cv=0.6051
  Experiment done: k=40 i=500 | p=1330.4556, cv=0.6072
  Experiment done: k=60 i=10 | p=6675.3963, cv=0.5221
  Experiment done: k=60 i=100 | p=1448.0626, cv=0.5874
  Experiment done: k=60 i=200 | p=1349.6507, cv=0.5940
  Experiment done: k=60 i=500 | p=1290.6926, cv=0.5880
  Experiment done: k=80 i=10 | p=7576.2664, cv=0.5115
  Experiment done: k=80 i=100 | p=1457.7716, cv=0.5800
  Experiment done: k=80 i=200 | p=1351.4062, cv=0.5866
  Experiment done: k=80 i=500 | p=1288.1277, cv=0.5892
  Experiment done: k=100 i=10 | p=8093.3122, cv=0.5114
  Experiment done: k=100 i=100 | p=1448.3062, cv=0.5762
  Experiment done: k=100 i=200 | p=1341.3547, cv=0.5787
  Experiment done: k=100 i=500 | p=1272.4512, cv=0.5794
Execution time: 00:54:22.32
```
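To compare the runs without reading the log by eye, the lines can be parsed and ranked. The sketch below is illustrative only: it assumes `k` is the number of topics and `i` the number of training iterations, and it reads `p` as perplexity and `cv` as coherence, which matches the charts named in the post-processing step. The `Experiment` type and `parse_experiments` helper are hypothetical names.

```python
import re
from typing import NamedTuple

# Experiment lines copied from the log above
LOG = """\
  Experiment done: k=20 i=10 | p=4133.9019, cv=0.4946
  Experiment done: k=20 i=100 | p=1433.5471, cv=0.6330
  Experiment done: k=20 i=200 | p=1388.5460, cv=0.6343
  Experiment done: k=20 i=500 | p=1365.3670, cv=0.6341
  Experiment done: k=40 i=10 | p=5503.5514, cv=0.5449
  Experiment done: k=40 i=100 | p=1448.7289, cv=0.6046
  Experiment done: k=40 i=200 | p=1379.5958, cv=0.6051
  Experiment done: k=40 i=500 | p=1330.4556, cv=0.6072
  Experiment done: k=60 i=10 | p=6675.3963, cv=0.5221
  Experiment done: k=60 i=100 | p=1448.0626, cv=0.5874
  Experiment done: k=60 i=200 | p=1349.6507, cv=0.5940
  Experiment done: k=60 i=500 | p=1290.6926, cv=0.5880
  Experiment done: k=80 i=10 | p=7576.2664, cv=0.5115
  Experiment done: k=80 i=100 | p=1457.7716, cv=0.5800
  Experiment done: k=80 i=200 | p=1351.4062, cv=0.5866
  Experiment done: k=80 i=500 | p=1288.1277, cv=0.5892
  Experiment done: k=100 i=10 | p=8093.3122, cv=0.5114
  Experiment done: k=100 i=100 | p=1448.3062, cv=0.5762
  Experiment done: k=100 i=200 | p=1341.3547, cv=0.5787
  Experiment done: k=100 i=500 | p=1272.4512, cv=0.5794
"""


class Experiment(NamedTuple):
    k: int  # assumed: number of topics
    i: int  # assumed: number of training iterations
    perplexity: float
    coherence: float


PATTERN = re.compile(r"k=(\d+) i=(\d+) \| p=([\d.]+), cv=([\d.]+)")


def parse_experiments(log):
    """Turn 'Experiment done' log lines into Experiment records."""
    runs = []
    for match in PATTERN.finditer(log):
        k, i, p, cv = match.groups()
        runs.append(Experiment(int(k), int(i), float(p), float(cv)))
    return runs


runs = parse_experiments(LOG)
best_coherence = max(runs, key=lambda r: r.coherence)
lowest_perplexity = min(runs, key=lambda r: r.perplexity)
print("Highest coherence:", best_coherence)
print("Lowest perplexity:", lowest_perplexity)
```

On this log the two criteria disagree: coherence peaks at `k=20 i=200` (0.6343), while perplexity is lowest at `k=100 i=500` (1272.4512), so the choice of model depends on which metric is prioritized.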

## Post-processing

```
Post-processing started
  Extracting topics
  Creating coherence chart
  Creating perplexity chart
  Computing general popularity
    Posts covered: 49573604
    Number of posts with empty topics: 36085
    Computed metrics: 4410
  Creating general popularity charts
  Computing user popularity
    Posts covered: 49573604
    Number of users: 4943206
    Number of posts with empty topics: 36085
    Computed metrics: 534554010
  Creating user popularity charts
Execution time: 12:57:57.99
```
