https://github.com/eeddaann/data-science-topic-modeling

Using data science to explain what data science is.

Host: GitHub
URL: https://github.com/eeddaann/data-science-topic-modeling
Owner: eeddaann
Created: 2018-03-08T20:11:31.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2018-03-09T11:58:04.000Z (over 8 years ago)
Last Synced: 2025-03-23T22:18:58.127Z (over 1 year ago)
Topics: clustering, data-science, gensim, lda, nlp, pyldavis, stack-exchange, topic-modeling
Language: Jupyter Notebook
Size: 464 KB
Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md


# data-science topic modeling - Using data science to explain what data science is

**Note:** to view and explore the results, [click here](http://nbviewer.jupyter.org/github/eeddaann/data-science-knowledge-representation/blob/86722160a5bf2bf4e278ab45a875f028000c187b/Untitled.ipynb)

### Data collection

This project is based on two sites: "[Data Science Stack Exchange](https://datascience.stackexchange.com/)", a question-and-answer website dedicated to data science, and "[Cross Validated](https://stats.stackexchange.com/)", which is more focused on statistics.

To extract the tags from all the posts on each site, I ran the following query in Stack Exchange's Data Explorer:

``` sql
SELECT Tags
FROM Posts
WHERE Tags IS NOT NULL
```

The query result has one row per post. Each row holds that post's tags as a single string of angle-bracketed names, which is why the parsing code splits on `><`.

### Extract, transform, load

Convert the data into a list of lists, with one list of tags per post.

(The two CSV files come from our two data sources: "[Data Science Stack Exchange](https://datascience.stackexchange.com/)" and "[Cross Validated](https://stats.stackexchange.com/)".)

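``` python
import csv

lst = []  # one list of tag names per post

# one exported CSV per site
for path in ('QueryResults.csv', 'QueryResults2.csv'):
    with open(path, newline='') as f:
        for row in csv.reader(f):
            # each data row holds one field such as '<tag-a><tag-b>';
            # rows that do not start with '<' (e.g. the header) are skipped
            if row and row[0].startswith('<'):
                lst.append(row[0][1:-1].split('><'))
```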
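As a small worked example of the parsing step, the sketch below applies the same splitting logic to a few rows. The helper name `parse_tags` and the sample rows are hypothetical, made up for illustration; they are not taken from the actual query results.

``` python
def parse_tags(field):
    """Split a tag field such as '<a><b>' into ['a', 'b']."""
    field = field.strip()
    if not (field.startswith('<') and field.endswith('>')):
        return []  # header row or malformed value
    return field[1:-1].split('><')

# hypothetical rows, including a header line
rows = ['Tags', '<machine-learning><python>', '<r>']
parsed = [parse_tags(r) for r in rows if parse_tags(r)]
print(parsed)  # [['machine-learning', 'python'], ['r']]
```

Stripping the outer `<` and `>` before splitting on `><` avoids relying on the string form of the whole CSV row.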

After converting the data into a list of lists, we used ```gensim``` to build a dictionary and a bag-of-words corpus, and then trained an LDA model:

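``` python
import gensim

# map every distinct tag to an integer id
dictionary = gensim.corpora.Dictionary(lst)
# represent each post as (tag id, count) pairs
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in lst]
# fit an LDA model with 8 topics
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=8)
```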

**The most important parameter here is ```num_topics```, which determines how many topics the model divides the data into.** Too many topics produce very narrow topics, while too few can produce ambiguous ones.
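To make the dictionary and `doc2bow` steps in the gensim code more concrete, here is a rough pure-Python sketch of the same idea: every distinct tag gets an integer id, and each post becomes a list of (id, count) pairs. This is only an illustration, not gensim's implementation; the helper names `build_vocab` and `to_bow` and the sample tag lists are assumptions, and the ids gensim assigns may differ.

``` python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens known to vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# made-up tag lists, one per post
docs = [['python', 'pandas'], ['python', 'lda', 'python']]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
print(bows)  # [[(0, 1), (1, 1)], [(0, 2), (2, 1)]]
```

The LDA model only ever sees these (id, count) pairs, so the order of tags within a post does not affect the result.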

### Visualization

For visualization we used ```pyLDAvis```:

![](Capture.png)
